* [PATCH 00/16] memory tagging improvements
@ 2021-03-05 21:36 Wilco Dijkstra
2021-03-08 9:34 ` Szabolcs Nagy
0 siblings, 1 reply; 3+ messages in thread
From: Wilco Dijkstra @ 2021-03-05 21:36 UTC (permalink / raw)
To: Szabolcs Nagy, dj; +Cc: 'GNU C Library'
Hi,
> - request2size and checked_request2size are not consistent which looks
> wrong (likely works in practice other than some sizes don't end up in
> fastbin or tcache that one might expect to, but i think this can
> break easily). Fixing it is not obvious because of assumptions about
> the chunk layout that are no longer true with mtag enabled.
It's better to change request2size to round up differently if tagging is
required. That avoids dynamically changing how chunks are rounded up.
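A minimal sketch of that approach (constants and names here are illustrative, not glibc's actual definitions): pad the request to the tag granule before the usual chunk rounding, and select between the two variants once at configure time rather than per call.

```c
#include <stddef.h>

/* Illustrative LP64 constants; not glibc's exact macros.  */
#define SIZE_SZ       sizeof (size_t)   /* 8 */
#define MALLOC_ALIGN  (2 * SIZE_SZ)     /* 16 */
#define MINSIZE       (4 * SIZE_SZ)     /* simplified minimum chunk */
#define TAG_GRANULE   16                /* MTE tag granule */

/* Usual rounding: add the size-word overhead, align, clamp to MINSIZE.  */
static size_t
request2size_plain (size_t req)
{
  size_t sz = (req + SIZE_SZ + MALLOC_ALIGN - 1) & ~(MALLOC_ALIGN - 1);
  return sz < MINSIZE ? MINSIZE : sz;
}

/* With tagging: pad the request to the granule first, so user data
   always starts and ends on a granule boundary.  The rounding policy
   is fixed when glibc is built, not dispatched dynamically.  */
static size_t
request2size_tagged (size_t req)
{
  req = (req + TAG_GRANULE - 1) & ~(size_t) (TAG_GRANULE - 1);
  return request2size_plain (req);
}
```

With these toy constants a 17-byte request rounds to a 32-byte chunk untagged but a 48-byte chunk tagged, which is exactly the layout difference the runtime checks currently have to paper over.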
> Other possible optimizations:
> - ifunc malloc apis
> good: near zero overhead for runtime dispatch.
> bad: malloc has to be built twice or larger refactoring is needed.
Using ifuncs is a good idea for the future - it allows removing the old
hooks and other terrible hacks like DUMPED_MAIN_ARENA_CHUNK.
It also allows adding more modern malloc implementations while
keeping backwards compatibility where needed (e.g. for Emacs).
> - unconditional top byte zero
> internal data is tagged 0, only user allocation is tagged non-zero
That's a great idea! Many chunk2mem calls can be changed into chunk2rawmem,
approximately halving the overhead. And mem2chunk can unconditionally mask
out the tag (note all of this could be made target-dependent if needed).
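A sketch of that split, using assumed names and a simplified 16-byte header (glibc's real macros differ): because internal data always carries tag 0, the internal conversion is plain pointer arithmetic, and mem2chunk can clear the top byte unconditionally, which is a no-op for untagged pointers.

```c
#include <stdint.h>

/* AArch64 MTE keeps the allocation tag in bits 56-59 of the pointer;
   clearing the whole top byte also covers TBI.  64-bit-only sketch.  */
#define PTR_TAG_MASK  ((uintptr_t) 0xff << 56)
#define CHUNK_HDR     16   /* simplified: 2*SIZE_SZ on LP64 */

/* Internal variant: no tag materialisation, arithmetic only.  */
static uintptr_t
chunk2rawmem (uintptr_t chunk)
{
  return chunk + CHUNK_HDR;
}

/* Unconditional top-byte clear: correct whether or not the caller's
   pointer carries a non-zero user tag, so no runtime branch.  */
static uintptr_t
mem2chunk (uintptr_t mem)
{
  return (mem & ~PTR_TAG_MASK) - CHUNK_HDR;
}
```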
> - unconditional layout
> padding up to tag granule can be done unconditionally
> good: no runtime check, same layout with and without heap tagging.
> bad: wasting space with small allocations, design change.
This is unconditionally good for performance and avoids the request2size oddities.
If we're worried about memory usage, we need to talk about modern malloc
implementations that don't add a 16-byte header to every block.
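To make the space trade-off concrete, here is a toy accounting model (purely illustrative; real glibc lets user data overlap the next chunk's prev_size field, so its true per-chunk cost is lower): with a 16-byte header plus unconditional granule padding, tiny allocations pay a large relative overhead while large ones barely notice.

```c
#include <stddef.h>

#define HEADER   16   /* assumed per-chunk header */
#define GRANULE  16   /* tag granule */

/* Bytes consumed by a request under unconditional granule padding.  */
static size_t
consumed (size_t req)
{
  size_t user = (req + GRANULE - 1) & ~(size_t) (GRANULE - 1);
  return HEADER + user;
}

/* Relative overhead in percent, rounded down.  */
static size_t
overhead_pct (size_t req)
{
  return (consumed (req) - req) * 100 / req;
}
```

Under this model an 8-byte request consumes 32 bytes (300% overhead) while a 1 KiB request consumes 1040 bytes (under 2%), which is why the header size dominates the argument for small allocations.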
> Benchmarked using bench-malloc-thread glibc ubenchmark on Neoverse N1.
> For size=16 the runtime check overhead is
> ~ 32% before the patch set
> ~ 16% after the patch set.
> On a practical workload or for large sizes the overhead should be tiny.
This is a huge improvement, but still far from practical: malloc is already slow,
so making it this much slower is a non-starter. For this to be usable by distros,
we really need to get the overhead below 5% (and that looks feasible - with
the optimizations mentioned I get ~3.2% overhead).
Cheers,
Wilco
* Re: [PATCH 00/16] memory tagging improvements
2021-03-05 21:36 [PATCH 00/16] memory tagging improvements Wilco Dijkstra
@ 2021-03-08 9:34 ` Szabolcs Nagy
From: Szabolcs Nagy @ 2021-03-08 9:34 UTC (permalink / raw)
To: Wilco Dijkstra; +Cc: dj, 'GNU C Library'
The 03/05/2021 21:36, Wilco Dijkstra wrote:
> > - request2size and checked_request2size are not consistent which looks
> > wrong (likely works in practice other than some sizes don't end up in
> > fastbin or tcache that one might expect to, but i think this can
> > break easily). Fixing it is not obvious because of assumptions about
> > the chunk layout that are no longer true with mtag enabled.
>
> It's better to change request2size to round up differently if tagging is
> required. That avoids dynamically changing how chunks are rounded up.
...
> > - unconditional top byte zero
> > internal data is tagged 0, only user allocation is tagged non-zero
>
> That's a great idea! Many chunk2mem can be changed into chunk2rawmem
> approximately halving the overhead. And mem2chunk can unconditionally mask
> out the tag (note all this could be made target-dependent if needed).
i will look into this, i expected it to need bigger changes.
> > - unconditional layout
> > padding up to tag granule can be done unconditionally
> > good: no runtime check, same layout with and without heap tagging.
> > bad: wasting space with small allocations, design change.
>
> This is unconditionally good for performance and avoids the request2size oddities.
> If we're worried about memory usage, we need to talk about modern malloc
> implementations that don't add a 16-byte header to every block.
i think it is not enough to change request2size, one would
have to audit the code for hidden assumptions about the
layout, so even if we are happy to waste another word per
small allocation, it is not an obvious change.
(right now the logic is a bit ugly, but the ugliness only
affects behaviour in the tagging case)
* [PATCH 00/16] memory tagging improvements
@ 2021-03-04 16:30 Szabolcs Nagy
From: Szabolcs Nagy @ 2021-03-04 16:30 UTC (permalink / raw)
To: libc-alpha, Richard.Earnshaw, DJ Delorie
This set tries to reduce the overhead of memory tagging support: when
glibc is built with --enable-memory-tagging there are runtime checks
that affect performance even if heap tagging is not enabled at runtime.
I refactored the code to reduce internal abstractions and ifdefs, and
a number of patches optimize heap tagging on aarch64 with MTE.
Also available in the nsz/mtag branch.
Issues found and fixed:
- realloc failed to untag the old memory in some cases.
- incorrect but useless macro definition in public sys/prctl.h.
- incorrect but harmless internal MTAG_MMAP_FLAGS declaration.
Issues found but not fixed:
- large granule is not supported: both the chunk header and the user
  allocation must be tag granule aligned, but the code assumes the
  chunk header is 2*SIZE_SZ, so a larger granule is not supported.
- request2size and checked_request2size are not consistent which looks
  wrong (likely works in practice other than some sizes don't end up in
  fastbin or tcache that one might expect to, but i think this can
  break easily). Fixing it is not obvious because of assumptions about
  the chunk layout that are no longer true with mtag enabled.
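To illustrate the kind of inconsistency meant here (a mock-up, not glibc's actual macros): if one size mapping pads the request to the tag granule and the other does not, the two can classify the same request into different size classes.

```c
#include <stddef.h>

#define SIZE_SZ  sizeof (size_t)
#define ALIGN    (2 * SIZE_SZ)
#define MINSIZE  (4 * SIZE_SZ)
#define GRANULE  16

static size_t
round_chunk (size_t n)
{
  size_t sz = (n + SIZE_SZ + ALIGN - 1) & ~(ALIGN - 1);
  return sz < MINSIZE ? MINSIZE : sz;
}

/* One mapping ignores the granule...  */
static size_t
req2size_a (size_t req)
{
  return round_chunk (req);
}

/* ...the other pads to it first, so the two can disagree, and a size
   that one path treats as fastbin/tcache-eligible the other may not.  */
static size_t
req2size_b (size_t req)
{
  return round_chunk ((req + GRANULE - 1) & ~(size_t) (GRANULE - 1));
}
```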
Other possible optimizations:
- ifunc malloc apis
good: near zero overhead for runtime dispatch.
bad: malloc has to be built twice or larger refactoring is needed.
- unconditional top byte zero
internal data is tagged 0, only user allocation is tagged non-zero
    (we rely on this e.g. where MADV_DONTNEED is used: the kernel then
    zeros the tags, so with non-zero tags internal pointers would not
    be able to access recycled memory), so tag_at(p) for internal data
    could be an unconditional top byte clear.
good: faster than the branch in tag_at, may improve security (?).
bad: design change.
- unconditional layout
padding up to tag granule can be done unconditionally
good: no runtime check, same layout with and without heap tagging.
bad: wasting space with small allocations, design change.
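The "unconditional top byte zero" item above can be sketched as follows (names like tag_at and mtag_enabled follow the discussion, but the bodies are assumptions; the real AArch64 version would use ldg to rematerialise the tag rather than this no-op stand-in):

```c
#include <stdint.h>

static int mtag_enabled;   /* runtime flag set during startup */

/* Stand-in for reloading a pointer's current allocation tag
   (AArch64 `ldg`); a no-op here purely for illustration.  */
static void *
get_tag (void *p)
{
  return p;
}

/* Current scheme: branch on the runtime flag for every conversion.  */
static void *
tag_at_branch (void *p)
{
  return mtag_enabled ? get_tag (p) : p;
}

/* Proposed scheme for internal data: since internal chunks always
   carry tag 0, clearing the top byte unconditionally is correct and
   branch-free (a no-op when tagging is disabled).  64-bit only.  */
static void *
tag_at_clear (void *p)
{
  return (void *) ((uintptr_t) p & ~((uintptr_t) 0xff << 56));
}
```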
This patchset tried to avoid intrusive changes that affect behaviour.
Benchmarked using bench-malloc-thread glibc ubenchmark on Neoverse N1.
For size=16 the runtime check overhead is
~ 32% before the patch set
~ 16% after the patch set.
On a practical workload or for large sizes the overhead should be tiny.
Szabolcs Nagy (16):
malloc: Fix a realloc crash with heap tagging [BZ 27468]
Remove PR_TAGGED_ADDR_ENABLE from sys/prctl.h
malloc: Move MTAG_MMAP_FLAGS definition
malloc: Simplify __mtag_tag_new_usable
malloc: Avoid taggig mmaped memory on free
malloc: Ensure the generic mtag hooks are not used
malloc: Refactor TAG_ macros to avoid indirection
malloc: Use global flag instead of function pointer dispatch for mtag
malloc: Only support zeroing and not arbitrary memset with mtag
malloc: Change calloc when tagging is disabled
malloc: Use branches instead of mtag_granule_mask
malloc: Use mtag_enabled instead of USE_MTAG
aarch64: inline __libc_mtag_address_get_tag
aarch64: inline __libc_mtag_new_tag
aarch64: Optimize __libc_mtag_tag_region
aarch64: Optimize __libc_mtag_tag_zero_region
include/malloc.h | 7 -
malloc/arena.c | 45 +-----
malloc/hooks.c | 20 ++-
malloc/malloc.c | 177 ++++++++++++-----------
sysdeps/aarch64/Makefile | 4 +-
sysdeps/aarch64/__mtag_address_get_tag.S | 32 ----
sysdeps/aarch64/__mtag_memset_tag.S | 53 -------
sysdeps/aarch64/__mtag_new_tag.S | 37 -----
sysdeps/aarch64/__mtag_tag_region.S | 98 ++++++++++---
sysdeps/aarch64/__mtag_tag_zero_region.S | 113 +++++++++++++++
sysdeps/aarch64/libc-mtag.h | 32 ++--
sysdeps/generic/libc-mtag.h | 43 ++++--
sysdeps/unix/sysv/linux/sys/prctl.h | 4 -
13 files changed, 353 insertions(+), 312 deletions(-)
delete mode 100644 sysdeps/aarch64/__mtag_address_get_tag.S
delete mode 100644 sysdeps/aarch64/__mtag_memset_tag.S
delete mode 100644 sysdeps/aarch64/__mtag_new_tag.S
create mode 100644 sysdeps/aarch64/__mtag_tag_zero_region.S
--
2.17.1