* [PATCH 00/16] memory tagging improvements
@ 2021-03-05 21:36 Wilco Dijkstra
From: Wilco Dijkstra @ 2021-03-05 21:36 UTC (permalink / raw)
  To: Szabolcs Nagy, dj; +Cc: 'GNU C Library'

Hi,

> - request2size and checked_request2size are not consistent, which looks
>   wrong (it likely works in practice, except that some sizes don't end
>   up in the fastbin or tcache one might expect, but I think this can
>   break easily). Fixing it is not obvious because of assumptions about
>   the chunk layout that are no longer true with mtag enabled.

It's better to change request2size to round up differently if tagging is
required. That avoids dynamically changing how chunks are rounded up.
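
Something like this would do (only a sketch: request2size is the existing
malloc.c macro, mtag_enabled is the flag from this series, and the 16-byte
MTE granule is hard-coded):

  /* Fold the granule rounding into request2size itself so that
     checked_request2size cannot disagree with it.  */
  static inline size_t
  mtag_request2size (size_t req)
  {
    if (mtag_enabled)
      /* Both the header and the user area must cover whole granules,
         so round the request up to a multiple of 16 first.  */
      req = (req + 15) & ~(size_t) 15;
    return request2size (req);
  }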

> Other possible optimizations:
> - ifunc malloc apis
>   good: near zero overhead for runtime dispatch.
>   bad: malloc has to be built twice or larger refactoring is needed.

Using ifuncs is a good idea for the future - it would allow removing the old
hooks and other terrible hacks like DUMPED_MAIN_ARENA_CHUNK. It would also
allow more modern malloc implementations to be added while keeping backwards
compatibility where needed (e.g. for emacs).
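
To illustrate, the dispatch could look roughly like this (just a sketch:
the two variants and the startup check are made-up names, and in practice
the resolver would have to run very early, before much of libc is
initialized):

  /* Two builds of the allocator, e.g. from compiling malloc.c twice,
     once with tagging compiled in and once without (made-up names).  */
  extern void *__libc_malloc_mtag (size_t);
  extern void *__libc_malloc_plain (size_t);

  /* Hypothetical: did the glibc.mem.tagging tunable request tagging?
     This has to be answerable by the time relocations are processed.  */
  extern int mtag_requested_at_startup (void);

  /* The resolver runs once, when the symbol is relocated, so there is
     no per-call flag check.  */
  static void *(*malloc_resolver (void)) (size_t)
  {
    return mtag_requested_at_startup () ? __libc_malloc_mtag
                                        : __libc_malloc_plain;
  }

  void *malloc (size_t) __attribute__ ((ifunc ("malloc_resolver")));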

> - unconditional top byte zero
>   internal data is tagged 0, only user allocation is tagged non-zero

That's a great idea! Many chunk2mem uses can be changed into chunk2rawmem,
approximately halving the overhead. And mem2chunk can unconditionally mask
out the tag (note that all of this could be made target-dependent if needed).
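
For the AArch64 case that would be roughly (sketch only; MTE keeps the tag
in the top byte of the address, and internal chunk pointers always carry
tag zero):

  /* mem2chunk-style conversion: with internal data always tagged zero
     the top byte can be cleared unconditionally instead of branching
     on whether tagging is enabled.  */
  static inline void *
  untag_pointer (void *p)
  {
    return (void *) ((uintptr_t) p & 0x00ffffffffffffffUL);
  }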

> - unconditional layout
>   padding up to tag granule can be done unconditionally
>   good: no runtime check, same layout with and without heap tagging.
>   bad: wasting space with small allocations, design change.

This is unconditionally good for performance and avoids the request2size oddities.
If we're worried about memory usage, we need to talk about modern malloc
implementations that don't add a 16-byte header to every block.
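
With that, the runtime branch in the rounding goes away entirely,
e.g. (again only a sketch with a hard-coded 16-byte granule):

  /* Pad every request up to the tag granule, whether or not heap
     tagging is active at runtime - the chunk layout no longer
     depends on any runtime flag.  */
  #define pad_to_granule(req)  (((req) + 15) & ~(size_t) 15)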

> Benchmarked using bench-malloc-thread glibc ubenchmark on Neoverse N1.
> For size=16 the runtime check overhead is
>   ~ 32% before the patch set
>   ~ 16% after the patch set.
> On a practical workload or for large sizes the overhead should be tiny.

This is a huge improvement, but still far from practical: malloc is already slow,
so making it this much slower is a non-starter. For this to be usable by distros,
we really need to get the overhead below 5% (and that looks feasible - with the
optimizations mentioned above I get ~3.2% overhead).

Cheers,
Wilco

* [PATCH 00/16] memory tagging improvements
@ 2021-03-04 16:30 Szabolcs Nagy
From: Szabolcs Nagy @ 2021-03-04 16:30 UTC (permalink / raw)
  To: libc-alpha, Richard.Earnshaw, DJ Delorie

This set tries to reduce the overhead of memory tagging support: when
glibc is built with --enable-memory-tagging there are runtime checks
that affect performance even if heap tagging is not enabled at runtime.
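
The kind of check involved is e.g. (a sketch, not the exact code; the
names follow this series):

  /* Executed on every allocation even when heap tagging is off.  */
  if (__glibc_unlikely (mtag_enabled))
    p = tag_new_usable (p);   /* colour the user area of the chunk */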

I refactored the code to reduce internal abstractions and ifdefs, and a
number of patches optimize heap tagging on AArch64 with MTE.

Also available in the nsz/mtag branch.

Issues found and fixed:
- realloc failed to untag the old memory in some cases.
- incorrect but useless macro definition in public sys/prctl.h.
- incorrect but harmless internal MTAG_MMAP_FLAGS declaration.

Issues found but not fixed:
- large granules are not supported: both the chunk header and the user
  allocation must be tag granule aligned, but the code assumes that the
  chunk header is 2*SIZE_SZ, so a larger granule cannot work (with MTE
  this is not an issue: the 16-byte granule equals 2*SIZE_SZ on AArch64).
- request2size and checked_request2size are not consistent, which looks
  wrong (it likely works in practice, except that some sizes don't end
  up in the fastbin or tcache one might expect, but I think this can
  break easily). Fixing it is not obvious because of assumptions about
  the chunk layout that are no longer true with mtag enabled.

Other possible optimizations:
- ifunc malloc apis
  good: near zero overhead for runtime dispatch.
  bad: malloc has to be built twice or larger refactoring is needed.
- unconditional top byte zero
  internal data is tagged 0, only user allocation is tagged non-zero
  (we rely on this e.g. where MADV_DONTNEED is used: the kernel zeroes
  the tags, so internal pointers with non-zero tags would not be able
  to access the recycled memory), so tag_at(p) for internal data could
  be an unconditional top byte clear.
  good: faster than the branch in tag_at, may improve security (?).
  bad: design change.
- unconditional layout
  padding up to tag granule can be done unconditionally
  good: no runtime check, same layout with and without heap tagging.
  bad: wasting space with small allocations, design change.

This patchset tried to avoid intrusive changes that affect behaviour.

Benchmarked using bench-malloc-thread glibc ubenchmark on Neoverse N1.
For size=16 the runtime check overhead is
  ~ 32% before the patch set
  ~ 16% after the patch set.
On a practical workload or for large sizes the overhead should be tiny.

Szabolcs Nagy (16):
  malloc: Fix a realloc crash with heap tagging [BZ 27468]
  Remove PR_TAGGED_ADDR_ENABLE from sys/prctl.h
  malloc: Move MTAG_MMAP_FLAGS definition
  malloc: Simplify __mtag_tag_new_usable
  malloc: Avoid tagging mmaped memory on free
  malloc: Ensure the generic mtag hooks are not used
  malloc: Refactor TAG_ macros to avoid indirection
  malloc: Use global flag instead of function pointer dispatch for mtag
  malloc: Only support zeroing and not arbitrary memset with mtag
  malloc: Change calloc when tagging is disabled
  malloc: Use branches instead of mtag_granule_mask
  malloc: Use mtag_enabled instead of USE_MTAG
  aarch64: inline __libc_mtag_address_get_tag
  aarch64: inline __libc_mtag_new_tag
  aarch64: Optimize __libc_mtag_tag_region
  aarch64: Optimize __libc_mtag_tag_zero_region

 include/malloc.h                         |   7 -
 malloc/arena.c                           |  45 +-----
 malloc/hooks.c                           |  20 ++-
 malloc/malloc.c                          | 177 ++++++++++++-----------
 sysdeps/aarch64/Makefile                 |   4 +-
 sysdeps/aarch64/__mtag_address_get_tag.S |  32 ----
 sysdeps/aarch64/__mtag_memset_tag.S      |  53 -------
 sysdeps/aarch64/__mtag_new_tag.S         |  37 -----
 sysdeps/aarch64/__mtag_tag_region.S      |  98 ++++++++++---
 sysdeps/aarch64/__mtag_tag_zero_region.S | 113 +++++++++++++++
 sysdeps/aarch64/libc-mtag.h              |  32 ++--
 sysdeps/generic/libc-mtag.h              |  43 ++++--
 sysdeps/unix/sysv/linux/sys/prctl.h      |   4 -
 13 files changed, 353 insertions(+), 312 deletions(-)
 delete mode 100644 sysdeps/aarch64/__mtag_address_get_tag.S
 delete mode 100644 sysdeps/aarch64/__mtag_memset_tag.S
 delete mode 100644 sysdeps/aarch64/__mtag_new_tag.S
 create mode 100644 sysdeps/aarch64/__mtag_tag_zero_region.S

-- 
2.17.1

