From: Sergey Bugaev
To: libc-alpha@sourceware.org, bug-hurd@gnu.org
Cc: Samuel Thibault, Sergey Bugaev, Luca
Subject: [RFC PATCH 00/34] The rest of the x86_64-gnu port
Date: Sun, 19 Mar 2023 18:09:43 +0300
Message-Id: <20230319151017.531737-1-bugaevc@gmail.com>

Hello!

(Naturally, the subject line is a reference to the "How to draw an owl"
meme.)

It's been more than a month since I've tried to run ./configure
--host=x86_64-gnu and see what would come out of it, and here we are
now: with these patches, glibc fully builds, and even somewhat "works"!

On testing
==========

By "works", I mean: I was unable to actually get it running on GNU Mach.
It either never gets started, or crashes soon enough. The latter is
actually to be expected, since the kernel does not actually support
i386_fsgs_base_state yet.
I was unable to investigate what exactly happens, because in addition to
the troubles with actually running GNU Mach on qemu-system-x86_64
(-kernel doesn't work..., you really have to build an image with GRUB)
and attaching a debugger to it (either GDB or QEMU gets utterly confused
by the switch to long mode...), I had troubles with actually spawning
the task while breaking on its first instruction (a la starti). In
particular, prompt-task-resume didn't seem to work for me, nor did
breaking somewhere before the task should have been resumed.

So I would appreciate some help with both testing this patchset (i.e. if
you do have a working x86_64 Mach + userspace setup, build glibc and try
to run it), and some general tips about how I would go about debugging
the bootstrap task from the first instruction onwards with x86_64 GNU
Mach, QEMU, and GDB. Anyone? Luca (cc'ed), perhaps you could help me
with testing & give me some tips?

Instead of testing on GNU Mach, I settled for the next best thing and
tested it on GNU/Linux, under GDB. I had to skip over the syscalls and
emulate their effects, either in my head (e.g. the return value of
mach_reply_port ()), by writing $fs_base in GDB for thread_set_state
(i386_fsgs_base_state), or by making a Linux syscall to mmap some
anonymous memory (for vm_map). Obviously this is not the same thing as
running it on Mach for real, but -- it went fine, and reached main ()!
This means there likely aren't any catastrophic issues with early
startup (think init-first.c), TLS setup / accesses, etc.

On assembly and registers
=========================

This patchset involves code that has to deal with inline assembly and/or
registers, such as intr-msg.h, longjmp, and sigreturn. I have written
*something* that looks like it might work, but without actual testing,
it's hard to know if it does. We can't really test any signal-related
code until there's enough of the Hurd running to have a proc server,
etc.

As for sigreturn specifically: I'm concerned about the possibility that
putting the register dump onto the user's stack (or at %rsp - 128, on
x86_64) may clobber the data trampoline.c puts there (unless an altstack
is used), including the very sigcontext. This applies to both i386 and
x86_64. Empirically, we know this works out fine for i386 -- maybe
sigcontext doesn't actually get overwritten, or gets overwritten in just
the right way (think memmove, although i386/sigreturn.c actually uses
memcpy...).

I also haven't given much thought to FP state manipulation, since I know
very little about it. It might be that it's broken entirely.

In any case, it wouldn't hurt if you review my attempts at asm &
register manipulation extra carefully.

On TLS and microoptimization
============================

As you can see, I've done a bunch of changes to how TLS-related things
work, on both x86_64 and i386. The reasons for this are:

1. I wanted to minimize and its usages, so every line dropped from is a
   small win. In the end, only __hurd_sigthread_stack_{base,end} remain.
   I think these could be moved to , and we should be able to rid
   ourselves of for good.

2. I have discovered that the way __hurd_local_reply_port is declared is
   prone to GCC miscompiling accesses to it (reported here: [0]). Even
   when not miscompiled, this resulted in pretty inefficient code
   generation. Since two of the three places where
   __hurd_local_reply_port was used were in signal code where we know
   for sure that TLS is already working (since we must be running the
   signal thread), they could access tcb->reply_port directly (using the
   appropriate THREAD_*MEM accessor macros), and the rest of the
   __hurd_local_reply_port / __LIBC_NO_TLS logic could be moved into
   mig-reply.c (and improved/specialized there), so that's what I've
   done.

   [0]: https://sourceware.org/pipermail/libc-alpha/2023-March/146304.html

3. Disabling / compiling out support for the no-TLS case in libc.so
   (and libpthread.so, etc.)
   -- but not in static builds, and not in ld.so. This turned out to be
   kind of required for x86_64 (more on that below), but it is a nice
   optimization in and of itself.

To illustrate the overall effect of these optimizations, here's a
comparison of the code generated for mig_get_reply_port () in libc.so
for i386:

Before the changes (as shipped in Debian GNU/Hurd):

    Dump of assembler code for function __GI___mig_get_reply_port:
       0x0001c0a0 <+0>:     push   %ebp
       0x0001c0a1 <+1>:     mov    %ds,%dx
       0x0001c0a4 <+4>:     mov    %gs,%ax
       0x0001c0a7 <+7>:     mov    %esp,%ebp
       0x0001c0a9 <+9>:     push   %esi
       0x0001c0aa <+10>:    call   0x1e1bc5 <__x86.get_pc_thunk.si>
       0x0001c0af <+15>:    add    $0x24af45,%esi
       0x0001c0b5 <+21>:    push   %ebx
       0x0001c0b6 <+22>:    cmp    %ax,%dx
       0x0001c0b9 <+25>:    je     0x1c130 <__GI___mig_get_reply_port+144>
       0x0001c0bb <+27>:    mov    %gs:0x0,%eax
       0x0001c0c1 <+33>:    mov    0x38(%eax),%eax
       0x0001c0c4 <+36>:    test   %eax,%eax
       0x0001c0c6 <+38>:    je     0x1c110 <__GI___mig_get_reply_port+112>
       0x0001c0c8 <+40>:    mov    %ds,%dx
       0x0001c0cb <+43>:    mov    %gs,%ax
       0x0001c0ce <+46>:    cmp    %ax,%dx
       0x0001c0d1 <+49>:    je     0x1c0f8 <__GI___mig_get_reply_port+88>
       0x0001c0d3 <+51>:    lea    0x1798(%esi),%edx
       0x0001c0d9 <+57>:    mov    %gs:0x0,%eax
       0x0001c0df <+63>:    lea    0x38(%eax),%ecx
       0x0001c0e2 <+66>:    cmp    %edx,%ecx
       0x0001c0e4 <+68>:    je     0x1c0f8 <__GI___mig_get_reply_port+88>
       0x0001c0e6 <+70>:    mov    %ds,%bx
       0x0001c0e9 <+73>:    mov    %gs,%cx
       0x0001c0ec <+76>:    cmp    %cx,%bx
       0x0001c0ef <+79>:    je     0x1c110 <__GI___mig_get_reply_port+112>
       0x0001c0f1 <+81>:    mov    0x38(%eax),%eax
       0x0001c0f4 <+84>:    cmp    %eax,(%edx)
       0x0001c0f6 <+86>:    je     0x1c110 <__GI___mig_get_reply_port+112>
       0x0001c0f8 <+88>:    mov    %ds,%dx
       0x0001c0fb <+91>:    mov    %gs,%ax
       0x0001c0fe <+94>:    cmp    %ax,%dx
       0x0001c101 <+97>:    je     0x1c140 <__GI___mig_get_reply_port+160>
       0x0001c103 <+99>:    pop    %ebx
       0x0001c104 <+100>:   pop    %esi
       0x0001c105 <+101>:   mov    %gs:0x0,%eax
       0x0001c10b <+107>:   pop    %ebp
       0x0001c10c <+108>:   mov    0x38(%eax),%eax
       0x0001c10f <+111>:   ret
       0x0001c110 <+112>:   mov    %ds,%dx
       0x0001c113 <+115>:   mov    %gs,%ax
       0x0001c116 <+118>:   cmp    %ax,%dx
       0x0001c119 <+121>:   je     0x1c150 <__GI___mig_get_reply_port+176>
       0x0001c11b <+123>:   mov    %gs:0x0,%ebx
       0x0001c122 <+130>:   add    $0x38,%ebx
       0x0001c125 <+133>:   call   0x1b7b0 <__GI___mach_reply_port>
       0x0001c12a <+138>:   mov    %eax,(%ebx)
       0x0001c12c <+140>:   jmp    0x1c0f8 <__GI___mig_get_reply_port+88>
       0x0001c12e <+142>:   xchg   %ax,%ax
       0x0001c130 <+144>:   lea    0x1798(%esi),%eax
       0x0001c136 <+150>:   mov    (%eax),%eax
       0x0001c138 <+152>:   jmp    0x1c0c4 <__GI___mig_get_reply_port+36>
       0x0001c13a <+154>:   lea    0x0(%esi),%esi
       0x0001c140 <+160>:   lea    0x1798(%esi),%eax
       0x0001c146 <+166>:   pop    %ebx
       0x0001c147 <+167>:   pop    %esi
       0x0001c148 <+168>:   pop    %ebp
       0x0001c149 <+169>:   mov    (%eax),%eax
       0x0001c14b <+171>:   ret
       0x0001c14c <+172>:   lea    0x0(%esi,%eiz,1),%esi
       0x0001c150 <+176>:   lea    0x1798(%esi),%ebx
       0x0001c156 <+182>:   jmp    0x1c125 <__GI___mig_get_reply_port+133>
    End of assembler dump.

After the changes:

    Dump of assembler code for function __GI___mig_get_reply_port:
       0x00020060 <+0>:     mov    %gs:0x38,%eax
       0x00020066 <+6>:     test   %eax,%eax
       0x00020068 <+8>:     je     0x20070 <__GI___mig_get_reply_port+16>
       0x0002006a <+10>:    ret
       0x0002006b <+11>:    lea    0x0(%esi,%eiz,1),%esi
       0x0002006f <+15>:    nop
       0x00020070 <+16>:    sub    $0xc,%esp
       0x00020073 <+19>:    call   0x1f790 <__GI___mach_reply_port>
       0x00020078 <+24>:    mov    %eax,%gs:0x38
       0x0002007e <+30>:    add    $0xc,%esp
       0x00020081 <+33>:    ret
    End of assembler dump.

I think that this is pretty nice :) Note that I didn't focus on
optimizing mig_get_reply_port () specifically, and also that the
versions in ld.so and in libc.a are more complex (but still nowhere near
as complex as the original).

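For those who prefer C to disassembly, the shape of the "after" version
can be sketched roughly as below. This is only an illustration of the
fast path (load the per-thread slot, return it if set, otherwise
allocate and store), not the actual mig-reply.c code: port_t,
PORT_NULL, fake_reply_port_alloc (), cached_reply_port and
get_reply_port () are all hypothetical names made up for the sketch,
and a plain C11 _Thread_local variable stands in for the TCB field that
the real code reaches through the THREAD_*MEM accessors.

```c
#include <stdint.h>

/* Hypothetical stand-ins: the real code uses mach_port_t and
   __mach_reply_port (); these names exist only in this sketch.  */
typedef uint32_t port_t;
#define PORT_NULL ((port_t) 0)

/* Counts how often the slow path ran, so the caching is observable.  */
unsigned int alloc_calls;

/* Pretend to ask the kernel for a fresh reply port.  */
port_t
fake_reply_port_alloc (void)
{
  ++alloc_calls;
  return (port_t) (100 + alloc_calls);
}

/* One cached reply port per thread; stands in for tcb->reply_port.  */
static _Thread_local port_t cached_reply_port;

port_t
get_reply_port (void)
{
  port_t port = cached_reply_port;

  /* Fast path: a single thread-local load and a test.  */
  if (port != PORT_NULL)
    return port;

  /* Slow path: allocate once and remember the result for this thread.  */
  port = fake_reply_port_alloc ();
  cached_reply_port = port;
  return port;
}
```

Calling get_reply_port () twice on the same thread should hit the slow
path only the first time; there is no "is TLS usable yet?" branch left
on either path.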
Now, the horror story about __LIBC_NO_TLS () and __libc_tls_initialized:

Last time, when I realized that ld.so is lazily pulling object files out
of libc that something references, I understood that just putting
__libc_tls_initialized into init-first.c would not work, for two
reasons. For one, that would cause ld.so to have its own local copy of
__libc_tls_initialized -- we wouldn't want that, we want libc.so and
rtld to have a consistent idea of whether or not the TLS is initialized.
Secondly, this would pull init-first.o into rtld, which is very wrong,
and in fact init-first.c even contains code to intentionally cause a
linking error if this ever happens.

The latter could be solved by just declaring __libc_tls_initialized
outside of init-first.c, but the former, I thought, actually required
defining it in ldsodefs.h, to be renamed dl_tls_initialized and accessed
using the GL() macro. When I asked [1], nobody discouraged me from going
this way.

[1]: https://sourceware.org/pipermail/libc-alpha/2023-March/146254.html

But in trying to implement that, I ran into trouble with including in .
Namely, it turned out that there's an inverse dependency between these
two headers already. was explicitly (and needlessly) including -- that
was easy to get rid of -- but also implicitly depending on it in several
ways. First, it includes (for struct link_map), and that needs to define
FORCED_DYNAMIC_TLS_OFFSET or something like that. Second, it needs to
define some locks, so it includes , and that immediately needs for
__libc_lock_owner_self. Moreover, includes , and that includes , and
then again includes for the tcbhead_t definition.

So as you can see, there are quite a few ways that wants to include !
And naturally, including in then fails inside , where it discovers that
__rtld_lock_define_recursive is not defined and so on.

So... I came up with three (!) different ways to work around that,
before coming up with the fourth one, as included in this patch set.

The first way I implemented this was with a pair of out-of-line
functions, __libc_no_tls () and __libc_set_tls_initialized (), whose
implementation in a separate file could freely include .
__libc_no_tls () had to be exported out of libc.so, @@GLIBC_PRIVATE, but
other than that, it seemed to work.

But then I happened to take a look at the generated code and naturally
discovered that it didn't get LTO-inlined (and why would it, if I'm not
building with LTO -- nor would LTO work cross-DSO in any case). Doing an
extra function call (through PLT if we're talking about
libpthread.so...) for a function that is literally a load of a single
byte sounded bad. Really, how could I settle for bad generated code --
all because I couldn't figure out some stupid header dependencies?

So for the second attempt, I forced the headers to work the way I needed
them. This involved some unpretty kludges; for instance here's how the
kludge in looked:

    /* If we're not being included from inside (or after) these few
       headers, include ldsodefs.h for the GL macro.  Otherwise, those
       headers will include (or have already included) ldsodefs.h
       themselves.  This is done in this weird way because of issues
       with circular dependencies between these headers.  */
    #if !defined (_LIBC_LOCK_H) && !defined (_MACH_LOWLEVELLOCK_H) \
        && !defined (_LOCK_INTERN_H) && !defined (_LINK_H) \
        && !defined (_X86_ATOMIC_MACHINE_H)
    # include
    #endif

Add to this more instances of #include (guarded by similar include
guards) scattered across various random headers, and -- it builds. The
generated code now was what I wanted it to be (a direct access to
GL(dl_tls_initialized)), but this was obviously not pretty or nice.

A much cleaner solution, I thought, would be to split the various
headers involved into more granular parts. For instance, only really
needs to declare the locks, but not to actually lock and unlock them. If
we moved the lock declaration macros to a new, smaller header, say, ,
then that would not need .
A similar split would be needed for . itself could be split: NPTL
already has a separate where THREAD_{G,S}ETMEM macros are; but I was
imagining making it more granular still: for instance, we could have
which only defines the tcbhead_t layout (and that's what would include),
which defines the accessors, which defines the various functions to set
up TLS for a thread, for the GSCOPE decls, and on the Hurd, for
__LIBC_NO_TLS (). Each of the headers would be small and only bring in
what it really needs, and not everything and the kitchen sink.

This would be a clean and nice solution -- but it would be quite
invasive, and require changes in all the ports. And I have neither the
hardware nor the capacity to test that this breaks nothing on
architectures I know nothing about. (For instance, what's or1k or nios2?
I've no idea.) And it's unlikely that such a large, but poorly executed
and tested reorganization would be accepted simply because it would be
convenient to the Hurd port.

Still, this seemed like the best way to pursue, so I half-implemented a
limited form of this (only splitting the locking headers). This was
enough to get x86_64-gnu to build without the include guard kludges. But
then again, I would need to do the same changes to NPTL, and without a
way to really test them, I didn't feel confident enough.

So that's when I had a small epiphany: TLS for the initial thread is
always set up inside rtld, before it passes control over to libc!
There's no need to share the flag with libc.so, since inside libc.so,
it's always initialized! We can just define __LIBC_NO_TLS () to 0
outside of rtld (in shared builds). This instantly solves the issue of
circular includes (no longer need to use ldsodefs!), and also makes the
generated code even smaller / more efficient at runtime, since we can
statically compile out the no-TLS branches.

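As a rough illustration of the idea (and not the code from the
patches), the check collapses to a compile-time constant in the shared,
non-rtld case, so the compiler can drop the fallback branch entirely.
Everything named SKETCH_* or sketch_* below is hypothetical and exists
only in this sketch; in particular, the two configuration macros merely
stand in for however the build tells "shared build" and "inside rtld"
apart.

```c
#include <stdbool.h>

/* Hypothetical configuration knobs for this sketch only.  */
#ifndef SKETCH_SHARED
# define SKETCH_SHARED 1    /* pretend we are building a shared libc */
#endif
#ifndef SKETCH_IN_RTLD
# define SKETCH_IN_RTLD 0   /* ...and that this is not the dynamic linker */
#endif

/* Runtime flag; only consulted where TLS may really be missing.  */
bool sketch_tls_initialized;

#if SKETCH_SHARED && !SKETCH_IN_RTLD
/* rtld has set up TLS before handing control over: constant answer.  */
# define SKETCH_NO_TLS() 0
#else
/* Static builds and rtld itself still have to ask at runtime.  */
# define SKETCH_NO_TLS() (!sketch_tls_initialized)
#endif

static int sketch_global_slot = 1;               /* pre-TLS fallback */
static _Thread_local int sketch_tls_slot = 2;    /* normal per-thread slot */

/* Pick the storage location the way a no-TLS-aware accessor would.  */
int *
sketch_slot_location (void)
{
  if (SKETCH_NO_TLS ())
    return &sketch_global_slot;   /* dead code when the macro is 0 */
  return &sketch_tls_slot;
}
```

With the default knobs above, sketch_slot_location () always returns
the thread-local slot, and no load of the flag is needed at all.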
This logic *might* be broken for ifunc resolvers (I don't know -- is
it?), but then apparently they're not able to use most of the normal
libc functionality anyway, so hopefully this is not a big deal.

So much about the TLS, let's finally jump to the

Conclusion
==========

So, yeah, this is "the rest of the x86_64-gnu port". Please do review,
try to build it, and try to run it if you can. And teach *me* to run it,
if you know how to.

I have tested that i686-gnu still builds and works, but more testing is
needed.

Some things are still missing, for instance I haven't looked at
implementing {get,set,make,swap}context. It seems they aren't required
for basic operation. And naturally, once we start running/using this for
real, we'll discover what else is missing or broken.

I hope I didn't screw up the rebasing anywhere, but this is a pretty
large patchset, so I might have. If you see a commit that doesn't make
sense, or some "AMENDME" or "fixup" in the commit message or some such,
please let me know :)

I have also started a port of the Hurd proper to x86_64, but I am not
sending out the patches for that yet.

Sergey