From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <sourceware-bugzilla@sourceware.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 821B33858C54; Tue, 28 Mar 2023 06:17:57 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 821B33858C54
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org;
	s=default; t=1679984277;
	bh=/e0WKuEIYoYJ1/5vK9ijBiy/n7r4cbwmoF4+afTO1TU=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=Bds6CHEeptNl9VejFEE7xxVHxoAllSFSUj70t4XJcvoEP4U1ipTDRTI/PUdTXabJI
	 03kxuU8QjWmUV2DFs1GAY+0JOkZNIzzid3zpBRo67QOGdh/aXX7AgdiPhjjB1jGDF3
	 9Mk1V65Ejujz9AHv//HmMx0jptV9ZIubnLc1hzkc=
From: "janderson at rice dot edu" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug dynamic-link/30007] rfe: dlopen to specified address
Date: Tue, 28 Mar 2023 06:17:57 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: glibc
X-Bugzilla-Component: dynamic-link
X-Bugzilla-Version: unspecified
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: enhancement
X-Bugzilla-Who: janderson at rice dot edu
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at sourceware dot org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-30007-131-11y90EAoYI@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-30007-131@http.sourceware.org/bugzilla/>
References: <bug-30007-131@http.sourceware.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <glibc-bugs.sourceware.org>

https://sourceware.org/bugzilla/show_bug.cgi?id=3D30007
--- Comment #27 from Jonathon Anderson <janderson at rice dot edu> ---
(In reply to Stas Sergeev from comment #24)
> Could you please explain the concern
> itself? I mean, what problem is there
> to have an API to dlmem() from memory?
> Is it a security concern, or what kind of?
Briefly summarizing the main points from the original email in the mailing =
list
[1]:
> dlmem() works at a lower level of abstraction than the rest of the dl* AP=
Is, i.e. memory instead of solibs/objects. That has widespread impacts acro=
ss many users of Glibc, including but not limited to security, LD_AUDIT, an=
d developer tools (GDB). Some reasons follow:
>   - dlmem() does not ensure that the passed memory is a correctly mmap()'=
d object. It would be strongly preferable that the API ensures we CAN'T end=
 up in an inconsistent state, instead of making it UB if the user slips up.
>   - dlmem() removes the "file descriptor" abstraction out of the link_map=
. A lot of tooling has to change to fit this new reality, both inside and o=
utside Glibc: LD_AUDIT, developer tools (e.g. GDB), etc.
>   - dlmem() skips many syscalls, removing the kernel-side auditable event=
s required for security tooling. In contrast, "dlopenfd" requires both memf=
d_create() (or similar) and mmap() of that fd, allowing e.g. FFI/JIT to be =
locked down by a security seccomp filter.

Adding my own concern as well:
  - dlmem() seems to to expect the user to parse the program headers and mm=
ap()
the binary as required. That requires the application to re-implement a cor=
e,
delicate piece of ld.so... and do so correctly. From an API design perspect=
ive,
that seems like a very poor choice of abstraction.

AFAICS none of these issues have been resolved in the latest patches. Some =
of
these issues are intrinsic to the dlmem() semantics. So if another, better =
API
will work for your case, that certainly would be preferred.

[1]: https://sourceware.org/pipermail/libc-alpha/2023-February/145735.html

> > The following API is close to your use case but doesn't raise the same
> > concerns as dlmem(). Does this solve your problem, if not what's missin=
g?
> >     void *dlopen4(const char *filename, int flags, const struct dlopen4=
_args
> > *ext, size_t ext_size /* =3D sizeof(struct dlopen3_args) */);
> >     void *dlmopen5(Lmid_t lmid, const char *filename, int flags, const
> > struct dlopen4_args *ext, size_t ext_size /* =3D sizeof(struct dlopen3_=
args)
> > */);
> >     struct dlopen4_args {
> >       /* If not NULL, function called before mmap when loading the obje=
ct
> > [and its dependencies?].
> >          Returns the base of a mmapped range of given length and alignm=
ent.
> > This mapping will be
> >          overwritten by the loaded object.  */
> >       void *(*dla_premap)(void *preferred_addr, size_t length, size_t a=
lign,
> > void *userdata);
> >       /* User data passed to dla_premap.  */
> >       void *dla_premap_userdata;
> >     };
>=20
> The primary problem is that this API
> doesn't allow to preserve the user's
> mapping. It is only using that mapping
> to specify the reloc address, while
> dlmem() can optionally preserve it (I
> use the separate flag for that).
This is precisely one of the concerns with dlmem(). Why must the user's map=
ping
be preserved? So that the mirroring can be set up before the object is load=
ed?

Would replacing the dla_premap hook with some kind of custom-mmap()
(dla_mmap()) hook fit your use case better? That could allow you to set up
mirroring *as* the object is loaded, instead of before.

FWIW, do you need page-mirroring at all if you can just choose the reloc
address to be within the VM space?

> The secondary problem is "filename",
> but yes, I know you'll suggest to get
> it from /proc/self/fd.
I would prefer /proc/self/fd over dlopenfd4(). But dlopenfd() seems to be of
wider interest, so whatever works.

> I can remove all refactors and replace
> them with copy/pasts. Much bigger code
> but no change to existing code.
> Will that be any better?
> OTOH all refactors I did, just take some
> code chunk and move it to a separate func
> with the different indentation level.
> These diffs should be looked into with
> some tool that ignores indentation.
> Only then it would be clear how small they
> are.
I wouldn't waste any more time on the dlmem() patch until the concerns above
can be addressed.

> > As I mentioned before, syscall interception is a technique used in many
> > VM-adjacent and widely used technologies, to name a few: containers
> > Windows emulation (Wine), browser sandboxes
> > (Firefox/Chromium),
>=20
> I wonder if the above ones are actually
> do the syscall interception, or just use
> the bpf filters to avoid malicious code
> from using syscalls?
None of the examples above do exactly what you're looking for. If I knew of=
 any
OSS that did, I would just point you there. AFAIK your use case is very uni=
que.

Of the examples I've named:
  - Wine outright implements Windows syscalls on Linux, by intercepting all
syscalls in the running process and performing the translation in userspace
(SIGSYS handler).
  - strace and GDB intercept syscalls "remotely" via ptrace(). IMHO the pro=
cess
of poking the registers and memory via ptrace() is not all that different t=
han
doing so from inside a SIGSYS signal handler.
  - Podman/Docker use libseccomp to filter syscalls with BPF seccomp() filt=
ers.
BPF isn't powerful enough for the proposed approach, but it is similar in t=
hat
it can alter the arguments and return values (to a limited extent).
  - Firefox and Chrom(ium) also use seccomp() filters, but they also regist=
er
special handlers for SIGSYS. IIRC it's mainly for error reporting and not f=
or
interception, but you get the idea.

In short, intercepting syscalls is done in multiple OSS projects to varying
extents, for security and for profit. Wine is the only one that is as extre=
me
as your use case, but the rest do have some degree of similarity.

> > Given all this, I consider it much easier to write a syscall intercepti=
on
> > code than to write a shim library to translate between 32- and 64-bit c=
all
> > ABIs. FWIW. :D
>=20
> Its a bit strange to intercept the syscalls
> of your own code. I am quite sure none of
> the projects you mentioned, actually do this.
> They intercept the syscalls of some 3rd-party
> code running along, but never their own syscalls.
Presumably you won't intercept (all of) your own syscalls, primarily you're
aiming for the syscalls while the 3rd-party "ancient code" is loading. So i=
sn't
it pretty much the same?

> Most of dl_audit framework can be implemented
> with syscall interception, but why don't you
> want to do that?
Because (1) LD_AUDIT hearkens back to the days of Solaris and so is already=
 on
literally every GNU/Linux box in active use, and because (2) symbol binding
(la_symbind) is done completely in userspace and can't be intercepted by
syscalls.

Very different situation.

> > > dlopenfd()+memfd doesn't give even the
> > > possibility of specifying the reloc address,
> > > and that's a very minimal, insufficient requirement.
> > Because you need the pages to be mirrored? Or is there another requirem=
ent
> > here?
>=20
> Mirrored and also reloc address specified.
> AFAICT fdlopen()+memfd gives neither.
And based on prior comments, I assume you also want to preserve user mappin=
gs
here.

> > If flags contains MAP_PRIVATE, extra steps are once again needed. If th=
is is
> > a read-only mapping (~PROT_WRITE) and assuming mprotect() is not used l=
ater
> > to add write access (IIRC I have not observed Glibc's ld.so do so with
> > strace),
>=20
> But this is exactly what happens if you
> have unaligned SHT_NOBITS section. It
> goes to the same page that used MAP_PRIVATE
> to load an elf segment. glibc then re-protects
> and memsets that part. Even if you haven't
> seen that with strace, I was pointing to the
> exact code that does this.
Missed that comment, sorry. Link to the code so we're all on the same page:=
 [2]

Note that the mprotect() calls are only if(__glibc_unlikely((c->prot &
PROT_WRITE) =3D=3D 0)). It seems that newer ld places .data and small .bss =
in a RW
LOAD segment, which would explain why I've never observed it happen myself =
with
strace and modern software.

This makes me curious how old/common binaries are that trip this case. This
code (complete with the "Dag nab it" comment) have been present in Glibc si=
nce
1995: [3]. So maybe... *really* ancient binaries? :D

If it bothers you, this case can be ignored and the following case (that co=
pies
the data to a writable anonymous file) used instead.

[2]:
https://sourceware.org/git/?p=3Dglibc.git;a=3Dblob;f=3Delf/dl-map-segments.=
h;hb=3D07dd75589ecbedec5162a5645d57f8bd093a45db#l165
[3]:
https://sourceware.org/git/?p=3Dglibc.git;a=3Dblob;f=3Delf/dl-load.c;hb=3Dd=
66e34cd423425c348bcc83df127dd19711b0b9a#l339

> > then simply replace MAP_PRIVATE with MAP_SHARED in step (0) and the
> > rest will work.
>=20
> If the page is never re-protected, then
> MAP_SHARED is not even needed. You can
> just have 2 private mappings from same file.
True!

> > Otherwise, if flags contains MAP_PRIVATE and prot contains PROT_WRITE, =
the
>=20
> AFAIK there is no such case.
> PROT_WRITE is applied later with mprotect()
> if you have an unaligned SHT_NOBITS section,
> but is AFAICS never applied initially.
PROT_WRITE is applied initially if the LOAD segment is marked as RW. A quick
readelf -l on a few of my system's binaries seems to indicate this is pretty
common for .data and .bss in modern software.

> Contrary to what you say, no one is
> intercepting his own syscalls.
I beg to disagree. Many projects filter or intercept their own syscalls. Th=
is
*specific* approach hasn't been done before (I would point you to it if it
was), but intercepting (or at least filtering) syscalls in-process is nothi=
ng
new.

> And the SHT_NOBITS section problem is not
> yet addressed, although of course you will
> propose to intercept also mprotect() to get
> it in.
The most I would do in an mprotect() interception is ensure PROT_WRITE does=
n't
get added to any pages (i.e. abort() the application if it does). That does=
n't
really solve this problem, but it could catch some issues with the mirrored
pages. Maybe. :P

> Pages are not mirrored in case of a
> MAP_PRIVATE mapping that was later
> re-protected to r/w. Of course you
> can always use MAP_SHARED beforehand,
> and do a writable file copy,
Indeed! Which is exactly what I suggested. :D

> which will
> basically mean to just copy the initially
> memory-based solib into a file on hdd rather
> than to even properly use memfd.
Why is the HDD required here, can't you just copy to a memfd file? That's w=
hat
I suggested above.

> > IIRC ld.so only uses mprotect() to mark the RELRO segments as read-only=
, so
> > they don't need to be mirrored in a simple PoC implementation. At least=
 for
> > simple cases, YMMV.=20
>=20
> Not sure if the unaligned SHT_NOBITS
> (that causes re-protect to R/W) is a
> "simple case" or not.
Well, it *seems* very uncommon in modern software. Not sure whether it's ra=
re
in your "ancient code" case. Either way, solution discussed above.

> If indentation is ignored, then my patches
> touch a dozen of lines. There are just the
> moves of a large chunks of code to a separate
> funcs.
I more meant that comparing the hundreds of lines that have moved around is
time-consuming. There are tools to help, it just takes a lot of time that c=
ould
be better spent other places. Speaking from experience with my main project=
. :P

But it seems like your latest patches are shorter than I had remembered, I
stand corrected. IIRC at one point there was a 1300-addition patch, which is
where my comment came from, but that seems to have been cleaned up now. Gre=
at!
:D

(In reply to Stas Sergeev from comment #26)
> (In reply to Jonathon Anderson from comment #23)
> > These are niceties, but I think we can agree a direct implementation of
> > dlopen_with_offset() would be better for the use cases that need it. It
> > would also require far less refactors than dlmem().
>=20
> Getting a bit more abstract here,
> why refactors are that bad? glibc
> is full of multi-thousands-line funcs
> intersected by gotos. Is this because
> the refactors are prohibited?
> I mean, I was hoping for a "thank you"
> for a couple of small refactors.
> Is the current glibc code style (huge
> spaghetti funcs) is intentional and
> enforced?
I don't run the show here... but AFAIK the code here is carefully, heavily,
manually optimized to generate the best performance with a wide range of C
compilers. Carelessly refactoring it and especially adding additional funct=
ion
calls will destroy a lot of that work. (Although I dislike the spaghetti as
much as you do. :P)

I've seen other refactors merge from the mailing list, IIRC performance alm=
ost
always comes up in the leading discussion.

But again, the main problem with your patches is the concerns with the dlme=
m()
semantics, not the size nor quality of your patches themselves. So let's fix
that first.

--=20
You are receiving this mail because:
You are on the CC list for the bug.=