From mboxrd@z Thu Jan 1 00:00:00 1970
From: "janderson at rice dot edu"
To: glibc-bugs@sourceware.org
Subject: [Bug dynamic-link/30007] rfe: dlopen to specified address
Date: Mon, 27 Mar 2023 17:16:38 +0000
X-Bugzilla-Product: glibc
X-Bugzilla-Component: dynamic-link
X-Bugzilla-Severity: enhancement
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at sourceware dot org

https://sourceware.org/bugzilla/show_bug.cgi?id=30007

--- Comment #23 from Jonathon Anderson ---
Sorry for the long delay in response, it's still a very busy time on my end.
:P I'll make up for it with a very long (and probably repetitive) response
instead. >.<

(In reply to Stas Sergeev from comment #22)
> Hi guys, so what is the status of
> all this? If my patches would never
> be looked into, no matter what, then
> perhaps you should tell me that right
> here, so that I stop wasting my and
> other's time.
AFAIK your patches will be looked at once a use case that requires them is
solidified, one that can't be solved with current tech or with any better
proposed API. So far, it has been unclear why the primary function of
dlmem() is needed for your use case. Why do you need to load solibs straight
from memory at all?

> In any other case I have the following
> questions:
> 1. Have we passed the stage where my
> use-case is explained and clarified?

Yes.

> 2. Have we passed the stage where I
> kept presented with an "alternative
> solutions" like "intercept all mmap
> (and perhaps also mprotect) syscalls
> and do some weird thing on them"? My
> last conclusion was that such "solution"
> doesn't work for unaligned SHT_NOBITS
> sections.

No. I'm certain it works for unaligned SHT_NOBITS sections: any changes made
to one side of the "mirror" are reflected in the other. (Although there is
another flaw I missed before; an updated version of the technique is towards
the bottom of this message. :P)

> 3. If we passed 1 and 2, then I think
> the next step is to discuss an API,
> so here's the API:
> ...
> Does anyone know if its a good or bad
> API, and how should it be improved?

There is not yet a solid use case for the primary function of this API, the
fact that it "loads an solib from memory." This primary functionality is the
main source of concern originally raised by Carlos O'Donell, and AFAICT it
hasn't been resolved.

The following API is close to your use case but doesn't raise the same
concerns as dlmem(). Does this solve your problem? If not, what's missing?

    void *dlopen4(const char *filename, int flags,
                  const struct dlopen4_args *ext,
                  size_t ext_size /* = sizeof(struct dlopen4_args) */);
    void *dlmopen5(Lmid_t lmid, const char *filename, int flags,
                   const struct dlopen4_args *ext,
                   size_t ext_size /* = sizeof(struct dlopen4_args) */);

    struct dlopen4_args {
      /* If not NULL, function called before mmap when loading the object
         [and its dependencies?].
         Returns the base of a mmapped range of given length and alignment.
         This mapping will be overwritten by the loaded object. */
      void *(*dla_premap)(void *preferred_addr, size_t length, size_t align,
                          void *userdata);
      /* User data passed to dla_premap. */
      void *dla_premap_userdata;
    };

> It
> allows to implement dlopen_with_offset()
> in a couple of lines, it preserves the
> file-based mappings so that /proc/self/maps
> or /proc/self/map_files are valid, and
> it allows to specify the solib name, so
> it handles the file-based mmaps, like
> dlopen_with_offset(), rather perfectly.

These are niceties, but I think we can agree a direct implementation of
dlopen_with_offset() would be better for the use cases that need it. It
would also require far less refactoring than dlmem().

> I wish I could have a separate libdl, but
> so far that looks very difficult. If you
> have any suggestions how can I have the
> separate libdl, then that would indeed be
> a perfect alternative solution that will
> eliminate any need to patch glibc sources.
> Or maybe some simple hooks can be added to
> aid a standalone libdl? Let me know and I
> will work in that direction then.

I don't have any suggestions here; ld.so, libdl and Glibc are all deeply
tied together. The best I can recommend is to patch Glibc and base a
container around it, if that works for your client(s). :P

> But "no reply" is a bit inconclusive.

You don't need to tell me that I'm slow to respond. :P

FWIW, Glibc, like many other large OSS projects, moves slowly. Speaking from
experience, expect many months before getting a change landed in a Fedora
release, and multiple years before it spreads to other Linux distributions
like Debian/Ubuntu or OpenSUSE.
(In reply to Stas Sergeev from comment #16)
> (In reply to Jonathon Anderson from comment #15)
> > As Adhemerval has already mentioned from the very start of this RFE
> > (comment #1):
> > > Any GNU extension requires a specific usercase that can't be easily
> > > accomplished with current API.
>
> "easily" is quite important here.
> Even if your syscall interception approach
> could work (which I think is not the case),
> it doesn't fall into an "easy" category.

As I mentioned before, syscall interception is a technique used in many
VM-adjacent and widely used technologies, to name a few: containers
(Podman/Docker), Windows emulation (Wine), browser sandboxes
(Firefox/Chromium), and debuggers (GDB/strace). Many great examples exist in
the open-source community suitable for study; IMHO strace and Crun (part of
Podman) are good choices to start.

Given all this, I consider it much easier to write syscall interception code
than to write a shim library to translate between 32- and 64-bit call ABIs.
FWIW. :D

> To me, having a good API is also important.
> Why dlmem() is not the one?
> ...
>
> > Thus, the first priority for this RFE should be to establish this use
> > case and express the failings of the current technology. A proposed
> > patch series is difficult to review
>
> Even when I split them into 13 nearly trivial
> patches? Then what else can I do to have it
> easy for a review?

I don't have many comments about the patch itself. If I find time to write
them up I'll direct them to the dlmem() RFE.

> dlopenfd()+memfd doesn't give even the
> possibility of specifying the reloc address,
> and that's a very minimal, insufficient requirement.

Because you need the pages to be mirrored? Or is there another requirement
here?

> > It would be very constructive if you could investigate my proposed
> > solution as detailed below, and precisely express what the
> > insurmountable problems with it are. :D
>
> I always do.
> :)

So far, there seems to be a lot of confusion about the technique but no
objective flaws about the overall approach. I did notice a flaw in the
interim that complicates the technique, but again not insurmountable.

* * *

I'll describe the approach and updated technique verbatim below, in the
hopes it will smooth the discussion here, with the goal of understanding the
flaws with the overall approach for your use case.

The goal of the overall approach is to "mirror" ALL pages mmapped (after the
syscall interception is installed) to pages inside the VM. That includes the
pages forming a newly loaded solib. This is a very powerful approach that is
not limited to the dynamic linker; it can be extended to mirror ANY memory
allocated by the userspace code, including malloc()'d memory.

"Mirroring" pages here (e.g. page A is mirrored to page A') has three strong
criteria that need to be met:
  a. Any change to the memory in page A is reflected in page A', and vice
     versa.
  b. The location of page A' relative to some other mirrored page B'
     reflects the location of page A relative to page B, if the userspace
     code requires such (MAP_FIXED).
  c. A "page translation table" exists that records the mirror relationship
     from A to A'.

The only way to implement criterion (a) on Linux is to propagate memory
changes back to the backing fd (MAP_SHARED), so /proc/self/maps will
definitely see file-backed mappings even for anonymous pages. On the other
hand, (a) also means if a .bss region is cleared with memset(), those
changes will be reflected in the mirror pages and so we don't have to
intercept those.

Criterion (b) only matters for MAP_FIXED calls; in the ~MAP_FIXED case the
kernel (syscall interception) is allowed to choose any reasonable address to
place the mmap()'d pages. The recommendation from man mmap is to
(paraphrased): "mmap() without MAP_FIXED first, then overwrite the allocated
mapping with MAP_FIXED." This avoids races in multithreaded code.
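To make that man mmap recommendation concrete, here is a minimal sketch in C.
The helper name reserve_then_fix() is made up for this message (it is not
from Glibc or the man page), and I'm assuming Linux with MAP_ANONYMOUS
available:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper, not part of any library.  Reserves npages pages
   wherever the kernel likes, then over-maps the second page with
   MAP_FIXED.  Returns the base of the reservation, or NULL on failure. */
char *reserve_then_fix(size_t npages)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    assert(npages >= 2);

    /* First call: no MAP_FIXED, so the kernel picks a free range. */
    char *base = mmap(NULL, npages * page, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Second call: MAP_FIXED, but only inside the range reserved above,
       so it cannot clobber a mapping owned by another thread. */
    char *fixed = mmap(base + page, page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (fixed == MAP_FAILED) {
        munmap(base, npages * page);
        return NULL;
    }

    fixed[0] = 42; /* the over-mapped page is now readable and writable */
    return base;
}
```

The first call is the one an interceptor can place freely; the second is the
one that has to land on an already-known range.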
The technique described later presumes this recommendation is followed in
all userspace code and will abort() if not. This recommendation is followed
by Glibc's dynamic linker; this is the rationale behind the first mmap()
call you noticed gets completely over-mapped.

Every mmap() syscall is intercepted with this technique (I thought I said
that explicitly but maybe it got lost in editing :P). There are other
syscalls that alter the page table that could be intercepted for a more
complete solution: munmap(), mremap(), mprotect(), brk(). For simplicity I'm
only going to discuss the interception for mmap(); other syscalls are left
as an exercise to the reader (and should not be necessary for a preliminary
implementation, I think).

Now for the actual technique. The intercepted wrapper for mmap(addr, length,
prot, flags, fd, offset) performs the following operations:
  0. Adjust the arguments if flags contains MAP_ANONYMOUS or MAP_PRIVATE
     (described below),
  1. mmap() the original pages (that live "outside" the VM), call them A,
  2. allocate the mirror pages (that live "inside" the VM), call them A',
  3. mmap() A' as a mirror of A,
  4. update the "page translation table" (criterion (c)) with the A -> A'
     relation, and
  5. return the address of A from step (1).

There are a number of cases that need to be handled. The "base case" is
(MAP_SHARED & ~MAP_ANONYMOUS & ~MAP_FIXED); here step (1) calls mmap(addr,
length, prot, flags, fd, offset), and step (3) calls mmap(A'.addr, length,
prot, flags | MAP_FIXED, fd, offset). Step (2) allocates any free pages in
the VM. This creates a natural mirrored mapping between A and A'.

If flags contains MAP_ANONYMOUS, an extra step (0) is added before step (1).
In step (0), fd is replaced by a file descriptor allocated with
memfd_create(), and offset by the offset of some freshly allocated pages in
that file. flags has the MAP_ANONYMOUS bit removed, since now it is no
longer an anonymous mapping. All cases below then apply.
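As a rough sketch of steps (0) to (3) for the MAP_ANONYMOUS case, in C: the
names struct mirror and mirror_anonymous() are hypothetical, and to keep it
short I use one memfd per mapping (so the offset is always 0) and let the
kernel pick the address of A' instead of placing it inside a VM with
MAP_FIXED.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical record of one A -> A' relation (criterion (c)). */
struct mirror {
    void *a;        /* pages "outside" the VM */
    void *a_prime;  /* pages "inside" the VM  */
    size_t length;
    int fd;         /* memfd backing both mappings */
};

/* Hypothetical helper: mirrored replacement for an anonymous mapping.
   Returns 0 on success and fills *out, -1 on failure. */
int mirror_anonymous(size_t length, int prot, struct mirror *out)
{
    assert(out != NULL);

    /* Step (0): back the "anonymous" memory with a memfd. */
    int fd = memfd_create("mirror", MFD_CLOEXEC);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)length) != 0) {
        close(fd);
        return -1;
    }

    /* Step (1): map A, shared so that writes reach the backing file. */
    void *a = mmap(NULL, length, prot, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* Steps (2) and (3): map A' from the same file and offset.  A real
       interceptor would pass an address inside the VM plus MAP_FIXED. */
    void *a_prime = mmap(NULL, length, prot, MAP_SHARED, fd, 0);
    if (a_prime == MAP_FAILED) {
        munmap(a, length);
        close(fd);
        return -1;
    }

    out->a = a;
    out->a_prime = a_prime;
    out->length = length;
    out->fd = fd;
    return 0;
}
```

Because both mappings are MAP_SHARED views of the same file pages, a store
through either pointer is visible through the other, which is criterion (a).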
If flags contains MAP_FIXED, step (2) needs to change. Assuming the man mmap
recommendation is followed, there must already be an A -> A' mapping in the
"page translation table" in this case. Step (2) reuses this prior mapping
and uses this previously-allocated A'; if one doesn't exist it abort()s the
entire application. (Note that this reflects the over-mapping done by the
dynamic linker in the VM space, so no issues with that.)

If flags contains MAP_PRIVATE, extra steps are once again needed. If this is
a read-only mapping (~PROT_WRITE) and assuming mprotect() is not used later
to add write access (IIRC I have not observed Glibc's ld.so do so with
strace), then simply replace MAP_PRIVATE with MAP_SHARED in step (0) and the
rest will work.

Otherwise, if flags contains MAP_PRIVATE and prot contains PROT_WRITE, the
mapped portion of the file needs to be copied out to an editable file. I can
think of two implementations off the top of my head; others likely exist.

First idea:
  0.1. Allocate some pages in an anonymous file as if flags contained
       MAP_ANONYMOUS, results in fd_a and offset_a.
  0.2. orig = mmap(NULL, length, PROT_READ, flags, fd, offset);
  0.3. fd = fd_a, offset = offset_a;
  1.   mmap(..., prot, ..., fd, offset) the original pages (that live
       "outside" the VM), call them A,
  1.1. memcpy(A.addr, orig, length);
  1.2. munmap(orig, length);

Second idea:
  0.1. Allocate some pages in an anonymous file as if flags contained
       MAP_ANONYMOUS, results in fd_a and offset_a.
  0.2. orig_off = lseek(fd, 0, SEEK_CUR);
  0.3. lseek(fd, offset, SEEK_SET);
  1.   mmap(..., prot, ..., fd_a, offset_a) the original pages (that live
       "outside" the VM), call them A,
  1.1. read(fd, A.addr, length);
  1.2. lseek(fd, orig_off, SEEK_SET);
  1.3. fd = fd_a, offset = offset_a;

That's it, that's the entire technique. It's a powerful approach reminiscent
of container tech, which I find fitting for a use case messing with a VM.
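For the MAP_PRIVATE + PROT_WRITE case, the "first idea" could look roughly
like the C sketch below. The name copy_out_private() is hypothetical, and as
a simplification I again use one memfd per mapping, so offset_a is just 0.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of any library.  Replaces a writable
   MAP_PRIVATE mapping of (fd, offset, length) with a MAP_SHARED mapping of
   a private copy.  On success returns A and stores the backing memfd in
   *fd_a_out; on failure returns NULL.  offset must be page-aligned, and
   pages wholly past the end of the file would raise SIGBUS in memcpy(). */
void *copy_out_private(int fd, off_t offset, size_t length, int *fd_a_out)
{
    assert(fd_a_out != NULL);

    /* 0.1: allocate pages in an anonymous file. */
    int fd_a = memfd_create("copy-out", MFD_CLOEXEC);
    if (fd_a < 0)
        return NULL;
    if (ftruncate(fd_a, (off_t)length) != 0) {
        close(fd_a);
        return NULL;
    }

    /* 0.2: temporary read-only view of the original file contents. */
    void *orig = mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, offset);
    if (orig == MAP_FAILED) {
        close(fd_a);
        return NULL;
    }

    /* 0.3 and 1: map the anonymous file instead, shared so that it can be
       mirrored later. */
    void *a = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fd_a, 0);
    if (a == MAP_FAILED) {
        munmap(orig, length);
        close(fd_a);
        return NULL;
    }

    /* 1.1 and 1.2: copy the file contents in, then drop the temporary view. */
    memcpy(a, orig, length);
    munmap(orig, length);

    *fd_a_out = fd_a;
    return a;
}
```

Writes through A never reach the original file, which is the MAP_PRIVATE
behavior the caller asked for, and A' for step (3) is then just another
MAP_SHARED mapping of fd_a.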
It's a straightforward technique with good similar examples in the
open-source community, for example strace's --inject= options. It's a small
technique; I would budget around 100-300 lines for a PoC implementation.
It's not a performant approach, but presumably your apps aren't
dlopen()/dlclose()'ing solibs like there's no tomorrow.

What's wrong with it?

* * *

> Of course now I have some very bad feeling
> that your next proposal will be "trap
> all mmaps, not just the first one"...
> Well, before you do that, consider the
> following:
> 1. Some mappings are converted from
> file-based to anonymous via mprotect+memset.

The fact that the pages are mirrored handles this: changes in one are
reflected in the other. Note that this trait is required to make shared
memory work at all.

IIRC ld.so only uses mprotect() to mark the RELRO segments as read-only, so
those mprotect() calls don't need to be mirrored in a simple PoC
implementation. At least for simple cases, YMMV.

> 2. _dl_map_segment() handles the "large
> alignment" case with 2 mmaps. The first
> large one is done only for alignment, and
> should I share with VM also that?

Yes. It's simpler and more robust if you don't try to be smart about these
cases, at least for a PoC.

> 3. Do you really think that trapping all
> mmaps and trying to hack around the
> aforementioned problems, is a good idea?
> ...
> Plus I'd say your algo is not a solution.
> Intercepting all mmap calls from dynamic
> loader and provide some weird tricks to them,
> is not any better than to write another loader,
> for example. :)

Yes, I really think syscall interception is a great idea. It's an order of
magnitude smaller than your refactoring patches, and works on every
GNU/Linux box (possibly every Linux box) updated in the last 5 years. It can
be extended to be more powerful than any alteration to the dynamic linker.
If it works for you, IMHO it is a VASTLY better solution than patching
Glibc, both for you and for your client(s).
:D

> I am very surprised you make the claims like
> "your patch is very difficult to review"
> w/o even looking into the very small patches
> that mostly split the huge multi-thousands-line
> funcs into a reusable parts...

Your patch is difficult to review for reasons that have to do with the API
and use case, not the implementation. It's also a refactor touching over a
thousand lines; that's enough reason to make it hard to review. :P