How easy or hard would it be for glibc malloc to automatically align larger allocations
(say, 4KB or more and also a multiple of 4KB) to a page boundary, so that
they are always properly aligned for O_DIRECT IO?

I _thought_ this was already being done by default, but much to my surprise it is
not.  For improved IO efficiency, I was looking at whether it would be possible
to transparently avoid the user->kernel data copy during large write() calls and
instead submit the IO directly to the underlying flash storage, but since the input
buffers are not page-aligned, this isn't possible.
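
To make the alignment problem concrete, here is a minimal sketch of how one might
check it. The helper names (is_page_aligned, malloc_page_offset) are made up for
illustration and are not part of glibc. With glibc on 64-bit Linux, I would expect
large mmap-backed allocations to typically land a small header-sized offset past the
page boundary, but that depends on the allocator and version, so the code only
reports the offset rather than assuming a value.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if ptr sits on a page boundary, else 0. */
int is_page_aligned(const void *ptr)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    return ((uintptr_t)ptr % (uintptr_t)page) == 0;
}

/*
 * Hypothetical helper: allocates `size` bytes with plain malloc() and
 * returns the buffer's offset into its page (0 means page-aligned).
 * Returns (size_t)-1 if the allocation fails.
 */
size_t malloc_page_offset(size_t size)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    void *p = malloc(size);
    if (p == NULL)
        return (size_t)-1;

    size_t offset = (size_t)((uintptr_t)p % (uintptr_t)page);
    free(p);
    return offset;
}
```

Calling malloc_page_offset(1024 * 1024) and printing the result shows whether a
1MB buffer from malloc() would be usable for O_DIRECT as-is.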

I'm of course aware of posix_memalign(), but I was wondering about "normal" applications
that are written by users that don't know anything about this, and just allocate memory
and use it to submit IO.

I'd think that keeping these "friendly" 4KB-multiple allocations in their own heap
would be very efficient for malloc, but I am not really familiar with the details.
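
The policy I have in mind can be sketched at the application level as a wrapper.
This is only an illustration under my own assumptions: io_friendly_malloc is a
made-up name, not a glibc interface, and it says nothing about how malloc would
implement this internally (e.g. with a separate heap). It just uses posix_memalign()
for sizes that are at least one page and a multiple of the page size, and falls back
to malloc() otherwise.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical wrapper (not a glibc API): page-align allocations whose
 * size is at least one page and an exact multiple of the page size;
 * everything else goes through plain malloc().
 * The returned pointer can be released with free() in both cases.
 */
void *io_friendly_malloc(size_t size)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t page_size = (size_t)page;

    if (size >= page_size && size % page_size == 0) {
        void *p = NULL;
        /* posix_memalign() returns 0 on success, an error number otherwise. */
        if (posix_memalign(&p, page_size, size) != 0)
            return NULL;
        return p;
    }

    return malloc(size);
}
```

A buffer obtained from io_friendly_malloc(16 * 4096) would then satisfy the memory
alignment part of the O_DIRECT requirements; file offset and length alignment are
still up to the caller.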

Cheers, Andreas