From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <sourceware-bugzilla@sourceware.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 8D2C53858C83; Wed, 15 Mar 2023 05:34:59 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 8D2C53858C83
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org;
	s=default; t=1678858499;
	bh=oFqpchD816nWHTWs+f8oRT2vF7APtw3yqGuE7c2su7U=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=uSrP042BiY9QIFd/C4m0+eH2nrZS/Jq5SoeAzV8Ns0JiMoX/iZILQ5M0M7+rFjhvk
	 5EoGcwAkpr90LvIQVXc13pdL7/q8Jup7GFeu2M/DRFsdxnuGEf2jmgbh+iBs+ubOPv
	 emBDYmzeMaFbLeo0cR50RE9p/yvfOfXL5syXjuQY=
From: "janderson at rice dot edu" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug dynamic-link/30007] rfe: dlopen to specified address
Date: Wed, 15 Mar 2023 05:34:59 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: glibc
X-Bugzilla-Component: dynamic-link
X-Bugzilla-Version: unspecified
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: enhancement
X-Bugzilla-Who: janderson at rice dot edu
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at sourceware dot org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: cc
Message-ID: <bug-30007-131-8dnmk12eY5@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-30007-131@http.sourceware.org/bugzilla/>
References: <bug-30007-131@http.sourceware.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <glibc-bugs.sourceware.org>

https://sourceware.org/bugzilla/show_bug.cgi?id=30007

Jonathon Anderson <janderson at rice dot edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |janderson at rice dot edu
--- Comment #8 from Jonathon Anderson <janderson at rice dot edu> ---
Hopping over here from a long and winding discussion in
https://sourceware.org/bugzilla/show_bug.cgi?id=30127.

(In reply to Stas Sergeev from comment
https://sourceware.org/bugzilla/show_bug.cgi?id=30127#c46)
> So let me summarize that memfd_create()
> (shm_open() actually) is not a replacement,
> but rather is an essential part of the
> scheme. Using it together with la_premap_dlmem()
> and la_premap() you can get the desired
> picture. Desired picture is 2 identical
> mappings of the same lib, one at reloc_addr,
> one at mmap_addr=reloc_addr+VM_window_start.
>
> There is basically nothing else!
> That scheme is very simple to describe,
> but not that simple to grok from that
> description, as no one have tried that
> layout yet.
I think the above, plus the following, is a succinct description of Stas's
intended use case: having double mappings for solibs allows sharing data
between the host and a VM with address translation only at the VM boundary,
instead of address translation on every memory access inside the VM.
Solutions already exist for heap memory and stack memory, which leaves
primarily the .data/.bss memory allocated as part of a solib. (Correct me if
I'm mistaken, of course.)

The proposed la_premap and la_premap_dlmem (part of the dlmem() patch)
collectively "solve" this problem by granting LD_AUDIT some limited control
over the object (segment) mapping process. My first impression from reading
the test cases is that they seem a bit too specific to this use case. IMHO
they are also out of scope for LD_AUDIT: LD_AUDIT works at the level of
symbols and objects, both of which are generic across OSs and even binary
formats (ELF + DLL), whereas la_premap* expose an implementation detail of
the dynamic linker. Most importantly, we do not yet deeply understand the
implications that exposing these callbacks can have, security or otherwise.

An alternative solution I brought up in the prior discussion is "wrapping"
the mmap syscall. In general, any Linux syscall can be wrapped using seccomp
(e.g. via libseccomp [1]) or, more recently, with syscall user dispatch [2].
With the wrapper in place, every mmap would be replicated in the VM memory
window, and the wrapper would update a table used for address translation.
Some behavior changes would be needed to appropriately implement
MAP_ANONYMOUS and MAP_FIXED, but neither seems particularly problematic.
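
To make the "table used for address translation" a bit more concrete, here is
a minimal sketch in C. It only shows the bookkeeping, not the syscall
interception itself. All names (struct region, struct xlat_table, xlat_record,
xlat_host_to_window, MAX_REGIONS) are hypothetical and made up for
illustration; nothing here comes from the dlmem() patch or from Glibc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REGIONS 64 /* arbitrary limit for this sketch */

/* One mirrored mapping: where it lives in the host process and where its
 * replica starts inside the VM memory window. */
struct region {
    uintptr_t host;   /* start address in the host process */
    uintptr_t window; /* start offset inside the VM memory window */
    size_t    len;    /* length of the mapping in bytes */
};

struct xlat_table {
    struct region regions[MAX_REGIONS];
    size_t        count;
};

/* Would be called by the (not shown) mmap wrapper after it has replicated
 * a mapping into the window. Returns 0 on success, -1 if the table is full
 * or the length is zero. */
static int xlat_record(struct xlat_table *t, uintptr_t host,
                       uintptr_t window, size_t len)
{
    assert(t != NULL);
    if (t->count == MAX_REGIONS || len == 0)
        return -1;
    t->regions[t->count++] = (struct region){ host, window, len };
    return 0;
}

/* Translate a host address at the VM boundary. Returns 0 and stores the
 * window offset on success, -1 if the address is not in a mirrored region. */
static int xlat_host_to_window(const struct xlat_table *t, uintptr_t host,
                               uintptr_t *window)
{
    assert(t != NULL && window != NULL);
    for (size_t i = 0; i < t->count; i++) {
        const struct region *r = &t->regions[i];
        if (host >= r->host && host - r->host < r->len) {
            *window = r->window + (host - r->host);
            return 0;
        }
    }
    return -1;
}
```

A linear scan is enough for a sketch; a real wrapper would also have to handle
munmap, mremap and overlapping MAP_FIXED requests, which are left out here.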

AFAIK, this "wrapping mmap" approach is vastly more powerful and effective
than the proposed la_premap{,_dlmem}. It operates at the Linux kernel level
and requires neither changes to Glibc nor a bleeding-edge kernel. It is
powerful enough to transparently handle heap memory (provided the targeted
allocation arena is brand new, i.e. in a newly opened dlmopen namespace).
Wrapping and reimplementing syscalls are well-understood and widely used
techniques in VM-adjacent tools, e.g. Wine (Windows syscall emulator) [3] and
Docker/Podman (container runtimes) [4].

If this well-understood approach solves the problem, IMHO there isn't much
point in arguing this RFE further.

[1]: https://libseccomp.readthedocs.io/en/latest/
[2]: https://docs.kernel.org/admin-guide/syscall-user-dispatch.html
[3]: https://lwn.net/Articles/826313/
[4]: https://docs.docker.com/engine/security/seccomp/

In response to a few other bits of prior discussion about mapping objects:

(In reply to Stas Sergeev from comment
https://sourceware.org/bugzilla/show_bug.cgi?id=30127#c45)
> > > by doing 2 mappings of the same lib.
> > ...If all you wanted was to mmap the solib to another address, you can
> > already do that using mmap and /proc/self/map_files/. Maybe dl_iterate_phdr.
>
> That can only work for loadable sections,
> I believe. .bss cannot be shared that way,
> and likely much more.
You're right, I neglected .bss when suggesting this idea. This would not be
an issue when using an mmap wrapper, however, as the region is simply mapped
with MAP_ANONYMOUS.
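
As a rough illustration of how a wrapper could give a MAP_ANONYMOUS region
(such as .bss) two views, the sketch below backs the region with a
memfd_create() descriptor and maps it twice with MAP_SHARED. This is my
assumption of one workable approach, not something taken from the patch; the
names struct mirror and mirror_anonymous are hypothetical. Note that it
silently turns a private anonymous mapping into a shared one, which is one of
the behavior changes a real wrapper would have to think about (e.g. across
fork()).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical record of one region that is visible at two addresses. */
struct mirror {
    void  *primary; /* address the caller of mmap() would see */
    void  *alias;   /* second view, e.g. inside the VM memory window */
    size_t len;
};

/* Stand-in for what a wrapper might do with a MAP_ANONYMOUS request:
 * back the region with a memfd so that it can be mapped twice.
 * Returns 0 on success, -1 on failure. */
static int mirror_anonymous(size_t len, struct mirror *out)
{
    assert(out != NULL && len > 0);

    int fd = memfd_create("mirror-sketch", MFD_CLOEXEC);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    /* Both mappings are placed by the kernel here; a real wrapper would
     * place the alias at a chosen address inside the window. */
    void *a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    void *b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mappings keep the memory alive */

    if (a == MAP_FAILED || b == MAP_FAILED) {
        if (a != MAP_FAILED)
            munmap(a, len);
        if (b != MAP_FAILED)
            munmap(b, len);
        return -1;
    }

    out->primary = a;
    out->alias = b;
    out->len = len;
    return 0;
}
```

A write through one view is then visible through the other without any copy,
and the memory starts out zero-filled, as .bss requires.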

(In reply to Stas Sergeev from comment
https://sourceware.org/bugzilla/show_bug.cgi?id=30127#c48)
> (In reply to Jonathon Anderson from comment https://sourceware.org/bugzilla/show_bug.cgi?id=30127#c33)
> > The result of the first call to mmap() for an solib decides the base address
>
> While a bit outdated topic, I don't
> think "the first call to mmap()" is a
> good or reliable work-around. It may
> change with an impl, or because of the
> threads.
To clarify, the "first" call to mmap() is the one without MAP_FIXED, and it
is used to allocate the pages that will later be overwritten by MAP_FIXED.
Threads should not become a problem here: just check the flags.

Also, this is the pattern heavily recommended in man mmap(2) (NOTES, "Using
MAP_FIXED safely"). IMHO it's unlikely that part of the implementation will
change drastically, and I'm confident an mmap syscall wrapper could still
handle it even if it did. :D
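
For reference, the reserve-then-MAP_FIXED pattern looks roughly like the
sketch below. It follows the general pattern from the man page using an
anonymous PROT_NONE reservation; I am not claiming this is exactly what the
Glibc loader does for its first mmap(). The helper names (reserve_region,
place_segment, is_base_choosing_call) are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* "First" call: no MAP_FIXED, so the kernel chooses the base address.
 * PROT_NONE keeps the reservation inaccessible until it is filled in. */
static void *reserve_region(size_t len)
{
    void *base = mmap(NULL, len, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return base == MAP_FAILED ? NULL : base;
}

/* Later calls: MAP_FIXED inside the reservation, which is safe because the
 * range is already owned by us and cannot clobber an unrelated mapping. */
static void *place_segment(void *base, size_t offset, size_t len, int prot)
{
    assert(base != NULL);
    void *want = (char *)base + offset;
    void *got = mmap(want, len, prot,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return got == MAP_FAILED ? NULL : got;
}

/* What "just check the flags" could mean inside an mmap wrapper: a call
 * without MAP_FIXED is the one whose result decides the base address. */
static int is_base_choosing_call(int flags)
{
    return (flags & MAP_FIXED) == 0;
}
```

Because the distinction is carried by the flags of each individual call, it
does not depend on call ordering across threads.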

> > AFAICT these discussions are all solved by memfd_create. Almost all of the
> > complaints revolve around the memory vs. disk performance difference,
>
> I am getting a bit nervous already when people
> mention memfd_create(). :) In what way is it
> any better than shm_open(), that I used in my
> la_premap_dlmem() example?
> Yes, I could also use memfd_create() with
> la_premap_dlmem(), but I prefer shm_open().
> Why people think that memfd_create() is the
> thing, is unclear to me. :) But it fits my
> design very well, as does shm_open().
My understanding is that the "file" created by memfd_create() cannot be
shared outside the process and its spawned children without explicitly
passing the descriptor, whereas the "file" created by shm_open() can be
accessed by any other process with the same path argument. memfd_create()
seems to be the more appropriate function when a *private* memory-backed file
descriptor is needed; shm_open() is better suited for shared memory across
processes (hence the name).

-- 
You are receiving this mail because:
You are on the CC list for the bug.