page alignment for large malloc requests?

public inbox for libc-alpha@sourceware.org

* page alignment for large malloc requests?
       [not found] <CB025C55-0E07-48DE-89A1-02F1F78E355A@dilger.ca>
@ 2020-10-03 20:31 ` Andreas Dilger
  2020-10-04 19:38   ` Carlos O'Donell
  2020-10-05  7:29   ` Florian Weimer
  0 siblings, 2 replies; 10+ messages in thread
From: Andreas Dilger @ 2020-10-03 20:31 UTC (permalink / raw)
  To: libc-alpha

How easy/hard would it be for glibc malloc to automatically align larger allocations
(e.g. say 4KB+ that are also multiples of 4KB) to a page address boundary, so that
they are always properly aligned for O_DIRECT IO?

I _thought_ that was already being done by default, but much to my surprise that was
not the case.  For improved IO efficiency, I was looking at whether it would be possible
to transparently avoid doing a user->kernel data copy during large write() calls and
just submitting the IO directly to underlying flash storage, but since the input buffers
are not aligned properly, this isn't possible.

I'm of course aware of posix_memalign(), but I was wondering about "normal" applications
that are written by users that don't know anything about this, and just allocate memory
and use it to submit IO.

I'd think that keeping this kind of "friendly" 4KB-multiple allocations in its own heap
would be very efficient for malloc, but I am not really familiar with the details.

Cheers, Andreas

* Re: page alignment for large malloc requests?
  2020-10-03 20:31 ` page alignment for large malloc requests? Andreas Dilger
@ 2020-10-04 19:38   ` Carlos O'Donell
  2020-10-05  7:30     ` Florian Weimer
  2020-10-05 21:19     ` Andreas Dilger
  2020-10-05  7:29   ` Florian Weimer
  1 sibling, 2 replies; 10+ messages in thread
From: Carlos O'Donell @ 2020-10-04 19:38 UTC (permalink / raw)
  To: Andreas Dilger, libc-alpha

On 10/3/20 4:31 PM, Andreas Dilger wrote:
> How easy/hard would it be for glibc malloc to automatically align larger allocations
> (e.g. say 4KB+ that are also multiples of 4KB) to a page address boundary, so that
> they are always properly aligned for O_DIRECT IO?

Could you define what you mean by "easy" or "hard?"

It is *technically* easy to make such a change.

The deeper question is: Is it worth the increased RSS cost and memory fragmentation
to support an "everything is ready for O_DIRECT IO" use case?

You would be increasing fragmentation by not allowing the allocator to pick a chunk
that did not meet the alignment requirement.

My opinion is that if the application needs to use O_DIRECT it should use
aligned_alloc() or any of the suitable aligned allocation functions.
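
As a rough illustration of the explicit approach suggested here, the sketch below
allocates a page-aligned buffer with posix_memalign(). The helper names
alloc_page_aligned() and is_aligned() are hypothetical, and using the page size as
the alignment is an assumption; the alignment O_DIRECT actually requires depends on
the filesystem and device.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: nonzero if p is a multiple of align (a power of two). */
static int is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Hypothetical helper: allocate len bytes aligned to the system page size.
   Returns NULL on failure; the result is released with free(). */
static void *alloc_page_aligned(size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    void *buf = NULL;

    if (page <= 0)
        return NULL;
    /* posix_memalign() returns 0 on success and an error number otherwise. */
    if (posix_memalign(&buf, (size_t)page, len) != 0)
        return NULL;
    return buf;
}
```

A caller would use alloc_page_aligned(1 << 20) in place of malloc(1 << 20) and pass
the result to write() on a descriptor opened with O_DIRECT.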

> I _thought_ that was already being done by default, but much to my surprise that was
> not the case.  For improved IO efficiency, I was looking at whether it would be possible
> to transparently avoid doing a user->kernel data copy during large write() calls and
> just submitting the IO directly to underlying flash storage, but since the input buffers
> are not aligned properly, this isn't possible.

If you lower the malloc mmap threshold then you'll go straight to mmap for allocations
and those will be page aligned, but you'll pay the performance cost for those allocations.

It might be possible to create a tunable to raise the minimum chunk alignment to any
arbitrary size, but you'd have a unique system with that setting set that high.

> I'm of course aware of posix_memalign(), but I was wondering about "normal" applications
> that are written by users that don't know anything about this, and just allocate memory
> and use it to submit IO.

Are you injecting O_DIRECT into the application open*() calls?

Can you please explain a bit more about your use case?

> I'd think that keeping this kind of "friendly" 4KB-multiple allocations in its own heap
> would be very efficient for malloc, but I am not really familiar with the details.

It is workload dependent. In general this will lead to huge fragmentation at the page
size, and for 64KiB page architectures there is going to be a lot of fragmentation.

-- 
Cheers,
Carlos.

* Re: page alignment for large malloc requests?
  2020-10-03 20:31 ` page alignment for large malloc requests? Andreas Dilger
  2020-10-04 19:38   ` Carlos O'Donell
@ 2020-10-05  7:29   ` Florian Weimer
  1 sibling, 0 replies; 10+ messages in thread
From: Florian Weimer @ 2020-10-05  7:29 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: libc-alpha

* Andreas Dilger:

> How easy/hard would it be for glibc malloc to automatically align
> larger allocations (e.g. say 4KB+ that are also multiples of 4KB) to a
> page address boundary, so that they are always properly aligned for
> O_DIRECT IO?

The current data structures are not set up for this at all, so it's
difficult to do this and retain good performance.

> I'm of course aware of posix_memalign(), but I was wondering about
> "normal" applications that are written by users that don't know
> anything about this, and just allocate memory and use it to submit IO.

posix_memalign is currently rather broken: If you free an allocation, in
many cases, glibc malloc cannot return the same allocation for a
subsequent posix_memalign request with the same parameters.  I guess
this is a form of fragmentation.  It can lead to poor memory utilization
with allocation patterns that happen in practice.

We have an idea to fix this (by making posix_memalign slower
unfortunately), but the patch hasn't been upstreamed & reviewed in
months.  This should give you an idea how difficult it is to make malloc
changes.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill

* Re: page alignment for large malloc requests?
  2020-10-04 19:38   ` Carlos O'Donell
@ 2020-10-05  7:30     ` Florian Weimer
  2020-10-06  1:52       ` Carlos O'Donell
  2020-10-05 21:19     ` Andreas Dilger
  1 sibling, 1 reply; 10+ messages in thread
From: Florian Weimer @ 2020-10-05  7:30 UTC (permalink / raw)
  To: Carlos O'Donell via Libc-alpha; +Cc: Andreas Dilger, Carlos O'Donell

* Carlos O'Donell via Libc-alpha:

> If you lower the malloc mmap threshold then you'll go straight to mmap
> for allocations and those will be page aligned, but you'll pay the
> performance cost for those allocations.

This is not correct.  Due to the glibc malloc header, the pointers
returned to applications will NOT be page-aligned.
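
One way to check this on a given system is to measure where a large malloc()
result falls within its page. The sketch below is only a probe, with hypothetical
helper names; the offset it reports is an implementation detail and not a
guaranteed value, though a nonzero result is what the in-band header described
above would produce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Byte offset of p within its page; 0 means page-aligned.
   page must be a power of two. */
static size_t offset_in_page(const void *p, size_t page)
{
    assert(page != 0 && (page & (page - 1)) == 0);
    return (size_t)((uintptr_t)p & (page - 1));
}

/* Hypothetical probe: malloc len bytes, report the pointer's offset within
   its page, then free it. Returns (size_t)-1 if the probe could not run. */
static size_t malloc_page_offset(size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p;
    size_t off;

    if (page <= 0)
        return (size_t)-1;
    p = malloc(len);
    if (p == NULL)
        return (size_t)-1;
    off = offset_in_page(p, (size_t)page);
    free(p);
    return off;
}
```

Calling malloc_page_offset() with a size above the mmap threshold and printing the
result shows whether the returned pointer is page-aligned on that system.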

Thanks,
Florian

* Re: page alignment for large malloc requests?
  2020-10-04 19:38   ` Carlos O'Donell
  2020-10-05  7:30     ` Florian Weimer
@ 2020-10-05 21:19     ` Andreas Dilger
  2020-10-06  0:43       ` DJ Delorie
  2020-10-06  7:41       ` Florian Weimer
  1 sibling, 2 replies; 10+ messages in thread
From: Andreas Dilger @ 2020-10-05 21:19 UTC (permalink / raw)
  To: Carlos O'Donell, Florian Weimer; +Cc: libc-alpha

On Oct 4, 2020, at 1:38 PM, Carlos O'Donell <carlos@redhat.com> wrote:
> 
> On 10/3/20 4:31 PM, Andreas Dilger wrote:
>> How easy/hard would it be for glibc malloc to automatically align larger allocations
>> (e.g. say 4KB+ that are also multiples of 4KB) to a page address boundary, so that
>> they are always properly aligned for O_DIRECT IO?
> 
> Could you define what you mean by "easy" or "hard?"

I'm a kernel developer, and I'm not familiar with the glibc code at all,
so I was hoping to get some idea from people knowledgeable in this code
on whether it is possible for a glibc newbie, and whether it is practical
and desirable to make such a change.  Maybe even "someone wrote an old
patch for doing this, but ...".

> It is *technically* easy to make such a change.
> 
> The deeper question is: Is it worth the increased RSS cost and memory fragmentation
> to support an "everything is ready for O_DIRECT IO" use case?
> 
> You would be increasing fragmentation by not allowing the allocator to pick a chunk
> that did not meet the alignment requirement.

I was hoping that adding a new allocation policy to glibc might be
similar to allocating a new slab in the kernel, but with a different
allocation policy.  I've read about "arenas" in the mallinfo(3) and
malloc_info(3), but I don't really have much idea of the details of
"arenas", and reading StackOverflow and other postings such as
https://heap-exploitation.dhavalkapil.com/diving_into_glibc_heap/bins_chunks
are only really second-hand speculation.

If allocations are PAGE_SIZE multiples and/or powers-of-two, then
I'd _think_ having a separate "arena" (if I'm using it correctly)
that keeps these allocations aligned on the PAGE_SIZE boundaries
would not result in much overhead or fragmentation. These would not
IMHO be mixed in with small or random-sized allocations to keep
the allocations CPU/RAM efficient.

>> I _thought_ that was already being done by default, but much to my surprise that was
>> not the case.  For improved IO efficiency, I was looking at whether it would be possible
>> to transparently avoid doing a user->kernel data copy during large write() calls and
>> just submitting the IO directly to underlying flash storage, but since the input buffers
>> are not aligned properly, this isn't possible.
> 
> If you lower the malloc mmap threshold then you'll go straight to mmap for allocations
> and those will be page aligned, but you'll pay the performance cost for those allocations.
> 
> It might be possible to create a tunable to raise the minimum chunk alignment to any
> arbitrary size, but you'd have a unique system with that setting set that high.
> 
>> I'm of course aware of posix_memalign(), but I was wondering about "normal" applications
>> that are written by users that don't know anything about this, and just allocate memory
>> and use it to submit IO.
> 
> Are you injecting O_DIRECT into the application open*() calls?

No, because we can't know in advance whether all IO on the fd will
be properly aligned and sized for O_DIRECT.  This would be decided
on a per-write*() syscall basis, if it is large enough and aligned
properly.  We saw good improvements for some IO benchmarks with
automatic O_DIRECTifying large write calls, but then realized that
even large malloc() calls do not align on more than a 16-byte
boundary (256 bytes on macOS because of alignment for vector ops).
I'd assumed large allocations would be page-aligned in glibc, since kernel
allocations of PAGE_SIZE multiples are aligned on a PAGE_SIZE boundary.
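
The per-call decision described above can be sketched as a simple predicate. This
is only an illustration: struct direct_policy, its fields, and
write_is_direct_eligible() are hypothetical names, and the real alignment and size
rules depend on the filesystem and device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical policy knobs, not taken from any existing implementation. */
struct direct_policy {
    size_t align;    /* required alignment in bytes; must be a power of two */
    size_t min_len;  /* smallest write considered worth sending direct */
};

/* True if this one write could bypass the page cache under the policy:
   large enough, and buffer address, length, and file offset all aligned. */
static bool write_is_direct_eligible(const struct direct_policy *pol,
                                     const void *buf, size_t len, off_t offset)
{
    size_t mask;

    assert(pol->align != 0 && (pol->align & (pol->align - 1)) == 0);
    mask = pol->align - 1;

    return len >= pol->min_len
        && ((uintptr_t)buf & mask) == 0
        && (len & mask) == 0
        && offset >= 0
        && ((uint64_t)offset & mask) == 0;
}
```

With a 16-byte-aligned malloc() buffer the address test fails for almost every
call, which is the obstacle described in this message.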

> Can you please explain a bit more about your use case?
> 
> My opinion is that if the application needs to use O_DIRECT it should use
> aligned_alloc() or any of the suitable aligned allocation functions.

My goal is to try and optimize application IO performance (with
improved speed and/or reduced CPU usage) without having to rewrite
every application and library.  While it is possible to fix common
tools like "dd" and "copy" and "tar", it is not practical to do this
on a wider scale, and new code is continually being written by users
that are not at all familiar with good IO, so that is a losing battle.

>> I'd think that keeping this kind of "friendly" 4KB-multiple allocations in its own heap
>> would be very efficient for malloc, but I am not really familiar with the details.
> 
> It is workload dependent. In general this will lead to huge fragmentation at the page
> size, and for 64KiB page architectures there is going to be a lot of fragmentation.

Any mention of "4KB" is really meant to be PAGE_SIZE, since it would
be the internal alignment requirements to avoid data copies anyway.

On Oct 5, 2020, at 1:29 AM, Florian Weimer <fweimer@redhat.com> wrote:
> 
> Andreas Dilger wrote:
>> I'm of course aware of posix_memalign(), but I was wondering about
>> "normal" applications that are written by users that don't know
>> anything about this, and just allocate memory and use it to submit IO.
> 
> posix_memalign is currently rather broken: If you free an allocation, in
> many cases, glibc malloc cannot return the same allocation for a
> subsequent posix_memalign request with the same parameters.  I guess
> this is a form of fragmentation.  It can lead to poor memory utilization
> with allocation patterns that happen in practice.
> 
> We have an idea to fix this (by making posix_memalign slower
> unfortunately), but the patch hasn't been upstreamed & reviewed in
> months.  This should give you an idea how difficult it is to make malloc
> changes.

That is unfortunate.  I'd _think_ that segregating large/aligned malloc
could be done efficiently, especially if there was a higher demand for
them if they were enabled by default.  That wouldn't affect the most
common case of small malloc.  Essentially, there would be a dedicated
area for the "auto-posix_memalign()" behavior of malloc() with large
allocations.

Cheers, Andreas

* Re: page alignment for large malloc requests?
  2020-10-05 21:19     ` Andreas Dilger
@ 2020-10-06  0:43       ` DJ Delorie
  2020-10-06  7:41       ` Florian Weimer
  1 sibling, 0 replies; 10+ messages in thread
From: DJ Delorie @ 2020-10-06  0:43 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: carlos, fweimer, libc-alpha

Andreas Dilger <adilger@dilger.ca> writes:
> I was hoping that adding a new allocation policy to glibc might be
> similar to allocating a new slab in the kernel, but with a different
> allocation policy.

The fundamental problem with trying this is that the ancient APIs we use
for memory allocation give almost no clues to us about what policies
make sense.  Most malloc implementations use heuristics and benchmarks
to decide what the "right" thing to do is.  Additionally, most
standards-compliant programs have little need of alignment other than
for base types (int, long, etc).

> I've read about "arenas" in the mallinfo(3) and malloc_info(3),

See https://sourceware.org/glibc/wiki/MallocInternals

Arenas are just memory pools we can create, resize, and discard as
needed.  There's no correlation between arenas and policies.  I've
thought about doing something along those lines though, with each arena
having a set of internal hooks for alloc/realloc/free and their own
policies and optimizations.  For example, a "page arena" would need to
have metadata stored separately so that the allocations could be doled
out one per page, without waste.  Our current algorithm stores metadata
between allocations, which would waste a whole page per page-aligned
allocation.
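
The waste can be made concrete with a little arithmetic. The sketch below is a
simplified model, not glibc's actual accounting: it assumes a header of a given
size placed directly before a payload that must start on a page boundary, and
compares that with metadata kept elsewhere. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of align (a power of two). */
static size_t round_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Simplified model: an in-band header sits before the payload, and the
   payload must begin on a page boundary, so the header consumes whole pages. */
static size_t footprint_inband(size_t payload, size_t header, size_t page)
{
    return round_up(header, page) + round_up(payload, page);
}

/* Simplified model: metadata lives elsewhere, so only the payload's pages
   are counted here. */
static size_t footprint_separate(size_t payload, size_t page)
{
    return round_up(payload, page);
}
```

Under this model a 4096-byte request with a 16-byte header occupies 8192 bytes
in-band versus 4096 bytes with separate metadata, and with 64KiB pages the same
16-byte header costs a full 65536 bytes.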

> That is unfortunate.  I'd _think_ that segregating large/aligned malloc
> could be done efficiently, especially if there was a higher demand for
> them if they were enabled by default.  That wouldn't affect the most
> common case of small malloc.  Essentially, there would be a dedicated
> area for the "auto-posix_memalign()" behavior of malloc() with large
> allocations.

We segregate large allocations into mmap-per-alloc, but applications
which need such large aligned allocations should call mmap directly
anyway.
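
For reference, a minimal sketch of calling mmap directly for an anonymous buffer
follows. The wrapper names map_buffer(), unmap_buffer(), and is_page_aligned() are
hypothetical; mmap() itself returns page-aligned addresses, and the caller must
remember the length for munmap().

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical wrapper: map len bytes of zero-filled anonymous memory.
   Returns NULL on failure instead of MAP_FAILED. */
static void *map_buffer(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical wrapper: release a mapping created by map_buffer().
   Returns 0 on success, -1 on error, as munmap() does. */
static int unmap_buffer(void *p, size_t len)
{
    return munmap(p, len);
}

/* Nonzero if p lies on a page boundary. */
static int is_page_aligned(const void *p)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    return ((uintptr_t)p % (uintptr_t)page) == 0;
}
```

Unlike malloc(), this costs a system call per allocation and per release, which is
the performance trade-off mentioned earlier in the thread.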

* Re: page alignment for large malloc requests?
  2020-10-05  7:30     ` Florian Weimer
@ 2020-10-06  1:52       ` Carlos O'Donell
  0 siblings, 0 replies; 10+ messages in thread
From: Carlos O'Donell @ 2020-10-06  1:52 UTC (permalink / raw)
  To: Florian Weimer, Carlos O'Donell via Libc-alpha; +Cc: Andreas Dilger

On 10/5/20 3:30 AM, Florian Weimer wrote:
> * Carlos O'Donell via Libc-alpha:
> 
>> If you lower the malloc mmap threshold then you'll go straight to mmap
>> for allocations and those will be page aligned, but you'll pay the
>> performance cost for those allocations.
> 
> This is not correct.  Due to the glibc malloc header, the pointers
> returned to applications will NOT be page-aligned.

You are absolutely right, I wasn't thinking straight here, you'd need two
pages for this to work, and that's really bad fragmentation again.

We would need to have a completely distinct allocator to make this work,
and at that point we're back to the discussion we had earlier about
arenas with "types" of memory [1].

-- 
Cheers,
Carlos.

[1] https://sourceware.org/pipermail/libc-alpha/2020-September/117864.html

* Re: page alignment for large malloc requests?
  2020-10-05 21:19     ` Andreas Dilger
  2020-10-06  0:43       ` DJ Delorie
@ 2020-10-06  7:41       ` Florian Weimer
  2020-10-06 20:24         ` Andreas Dilger
  1 sibling, 1 reply; 10+ messages in thread
From: Florian Weimer @ 2020-10-06  7:41 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Carlos O'Donell, libc-alpha

* Andreas Dilger:

> No, because we can't know in advance whether all IO on the fd will
> be properly aligned and sized for O_DIRECT.  This would be decided
> on a per-write*() syscall basis, if it is large enough and aligned
> properly.  We saw good improvements for some IO benchmarks with
> automatic O_DIRECTifying large write calls, […]

I do not think this would result in an overall benefit.  Have you
tweaked the kernel to evict the modified pages from the page cache after
large writes, to fake the impact of the change?  And run some typical
tasks with that, like a kernel build?

I expect that performance will not be great because most written data is
read again after a fairly short time.

Thanks,
Florian

* Re: page alignment for large malloc requests?
  2020-10-06  7:41       ` Florian Weimer
@ 2020-10-06 20:24         ` Andreas Dilger
  2020-10-07  7:06           ` Florian Weimer
  0 siblings, 1 reply; 10+ messages in thread
From: Andreas Dilger @ 2020-10-06 20:24 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Carlos O'Donell, libc-alpha

On Oct 6, 2020, at 1:41 AM, Florian Weimer <fweimer@redhat.com> wrote:
> 
> * Andreas Dilger:
> 
>> No, because we can't know in advance whether all IO on the fd will
>> be properly aligned and sized for O_DIRECT.  This would be decided
>> on a per-write*() syscall basis, if it is large enough and aligned
>> properly.  We saw good improvements for some IO benchmarks with
>> automatic O_DIRECTifying large write calls, […]
> 
> I do not think this would result in an overall benefit.  Have you
> tweaked the kernel to evict the modified pages from the page cache after
> large writes, to fake the impact of the change?  And run some typical
> tasks with that, like a kernel build?
> 
> I expect that performance will not be great because most written data is
> read again after a fairly short time.

That really depends heavily on the IO workload.  While a kernel build
is a common workload for kernel or GCC developers, there are lots of other
workloads (e.g. machine learning, weather forecasting, fluid dynamics,
CGI, video streaming, etc.) that read/write large chunks from/to disk.

I thought after some of the comments here that maybe larger allocations
will be naturally aligned due to mmap, but checked up to 16MiB allocations,
and they still did not show any aligned allocations.

Cheers, Andreas

* Re: page alignment for large malloc requests?
  2020-10-06 20:24         ` Andreas Dilger
@ 2020-10-07  7:06           ` Florian Weimer
  0 siblings, 0 replies; 10+ messages in thread
From: Florian Weimer @ 2020-10-07  7:06 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: libc-alpha

* Andreas Dilger:

> On Oct 6, 2020, at 1:41 AM, Florian Weimer <fweimer@redhat.com> wrote:
>> 
>> * Andreas Dilger:
>> 
>>> No, because we can't know in advance whether all IO on the fd will
>>> be properly aligned and sized for O_DIRECT.  This would be decided
>>> on a per-write*() syscall basis, if it is large enough and aligned
>>> properly.  We saw good improvements for some IO benchmarks with
>>> automatic O_DIRECTifying large write calls, […]
>> 
>> I do not think this would result in an overall benefit.  Have you
>> tweaked the kernel to evict the modified pages from the page cache after
>> large writes, to fake the impact of the change?  And run some typical
>> tasks with that, like a kernel build?
>> 
>> I expect that performance will not be great because most written data is
>> read again after a fairly short time.
>
> That really depends heavily on the IO workload.  While a kernel build
> is a common workload for kernel or GCC developers, there are lots of other
> workloads (e.g. machine learning, weather forecasting, fluid dynamics,
> CGI, video streaming, etc.) that read/write large chunks from/to disk.

Still it's unclear which of them actually benefit from forcing a
write-read cycle to go through persistent storage for the read.  The
page cache experiment I suggested would allow you to determine that
without having to change userspace at first.

Thanks,
Florian
