From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=5EiC=7L=gmail.com=bugaevc@sourceware.org>
Received: from mail-lf1-x131.google.com (mail-lf1-x131.google.com [IPv6:2a00:1450:4864:20::131])
	by sourceware.org (Postfix) with ESMTPS id B18D43858D1E
	for <libc-alpha@sourceware.org>; Sun, 19 Mar 2023 15:10:58 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org B18D43858D1E
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
Received: by mail-lf1-x131.google.com with SMTP id g17so12039093lfv.4
        for <libc-alpha@sourceware.org>; Sun, 19 Mar 2023 08:10:58 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20210112; t=1679238656;
        h=content-transfer-encoding:mime-version:message-id:date:subject:cc
         :to:from:from:to:cc:subject:date:message-id:reply-to;
        bh=8lVAMuuFOeclHnpMO4sF4MUSA6s83di205JNV+43Xko=;
        b=W1NsQ55rzWlE99jEOB2JWKpWUeJwfsS6ullF55xQZ74prm0f4U5CnE87xGmvr0UB0Y
         K1FMRB1wWl/noXl0ToWz6PQmVGXWdElckBZtnXpV+9nixJHHIa9jwBmpNvdfcghVZerq
         tHGrg+Bs1+qLmd1764t66co0PYE05umyavWFOTMKrdEJ9erD73QK8BrfuR2Am1dcZk+3
         udeh85mRqi47+/u9BaplCz0TnDr58hkjnaGAiNZi7OsDN10GMFHyC+tZhBstzcJ/Ee8P
         tojFnGXgOaQRzVWAX5qeKvB0l3/7JAc6isKsvN6oK4ArKkGseS9+1goVTztMqbrkT5jJ
         RVaA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112; t=1679238656;
        h=content-transfer-encoding:mime-version:message-id:date:subject:cc
         :to:from:x-gm-message-state:from:to:cc:subject:date:message-id
         :reply-to;
        bh=8lVAMuuFOeclHnpMO4sF4MUSA6s83di205JNV+43Xko=;
        b=MVCiTeGFvgPiofe+coILtUdl0btlMx1tUmYDQIQFbWqLLbKuhrim1yAU1YDWZ0mTzd
         m/qRhWT5pVDXwLNQ0MqxOJbwdcZHY2nsQLy3Zvb2KeAZIk6GRPP2dhkNz0Qqk58FbcwF
         7JdxI1AnUr+1C3ybH6i4hVPqexPkWbXQl/Qw+ZxQVqG+EvqWDhv0tNyMYBjr026IIle0
         Sf9jtePeYG6u1Z5BKSPniiYjyoBCUTzngrcBXZ15PbRrmghpxeOleSSiofuuUgiXOO0X
         /75EBYTxesuwQACt1wPMfBvRRjVrWzOwdR34892SNiW2qzp90b8NH+KU8KnqeHEX2gWn
         Xmmw==
X-Gm-Message-State: AO0yUKXUozQHf7zQRkV8nOONd6ZXtil8CAWYPPFGRtUzKtCZ0t9PsY6a
	7c2zyzvfquOdTY8bBRSPMhORvM6B4vHqzA==
X-Google-Smtp-Source: AK7set/Tw2V78byGr6r5NTrU6z03jFeV6e6MaRw194J5e0G54J5ZwkYVfzOmsWOQsr64R0kGziDKvg==
X-Received: by 2002:ac2:5e8c:0:b0:4e8:3ede:7e3c with SMTP id b12-20020ac25e8c000000b004e83ede7e3cmr5433813lfq.65.1679238655909;
        Sun, 19 Mar 2023 08:10:55 -0700 (PDT)
Received: from surface-pro-6.. ([2a00:1370:818c:4a57:577a:76f4:df43:5e66])
        by smtp.gmail.com with ESMTPSA id m19-20020ac24253000000b004e90dee5469sm1274089lfl.157.2023.03.19.08.10.54
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Sun, 19 Mar 2023 08:10:55 -0700 (PDT)
From: Sergey Bugaev <bugaevc@gmail.com>
To: libc-alpha@sourceware.org,
	bug-hurd@gnu.org
Cc: Samuel Thibault <samuel.thibault@gnu.org>,
	Sergey Bugaev <bugaevc@gmail.com>,
	Luca <luca@orpolo.org>
Subject: [RFC PATCH 00/34] The rest of the x86_64-gnu port
Date: Sun, 19 Mar 2023 18:09:43 +0300
Message-Id: <20230319151017.531737-1-bugaevc@gmail.com>
X-Mailer: git-send-email 2.39.2
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Spam-Status: No, score=-3.7 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,URIBL_BLACK autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <libc-alpha.sourceware.org>

Hello!

(Naturally, the subject line is a reference to the "How to draw an owl" meme.)

It's been more than a month since I first tried running ./configure
--host=x86_64-gnu to see what would come out of it, and here we are now:
with these patches, glibc fully builds, and even somewhat "works"!

On testing
==========

By "works", I mean:

I was unable to actually get it running on GNU Mach. It either never gets
started, or crashes soon enough. The latter is actually to be expected, since
the kernel does not actually support i386_fsgs_base_state yet. I was unable
to investigate what exactly happens, because in addition to the troubles with
actually running GNU Mach on qemu-system-x86_64 (-kernel doesn't work..., you
really have to build an image with GRUB) and attaching a debugger to it
(either GDB or QEMU gets utterly confused by the switch to long mode...),
I had troubles with actually spawning the task while breaking on its first
instruction (a la starti). In particular prompt-task-resume didn't seem to
work for me, nor did breaking somewhere before the task should have been
resumed.

So I would appreciate some help with both testing this patchset (i.e. if you
do have a working x86_64 Mach + userspace setup, build glibc and try to run
it), and some general tips about how I would go about debugging the bootstrap
task from the first instruction onwards with x86_64 GNU Mach, QEMU, and GDB.

Anyone? Luca (cc'ed), perhaps you could help me with testing & give me some
tips?

Instead of testing on GNU Mach, I settled for the next best thing and tested
it on GNU/Linux, under GDB. I had to skip over the syscalls and emulate their
effects, either in my head (e.g. the return value of mach_reply_port ()), by
writing $fs_base in GDB for thread_set_state (i386_fsgs_base_state), or by
making a Linux syscall to mmap some anonymous memory (for vm_map). Obviously
this is not the same thing as running it on Mach for real, but -- it went
fine, and reached main ()! This means there likely aren't any catastrophic
issues with early startup (think init-first.c), TLS setup / accesses, etc.

On assembly and registers
=========================

This patchset involves code that has to deal with inline assembly and/or registers,
such as intr-msg.h, longjmp, and sigreturn. I have written *something* that
looks like it might work, but without actual testing, it's hard to know if it
does. We can't really test any signal-related code until there's enough of the
Hurd running to have a proc server, etc.

As for sigreturn specifically: I'm concerned about the possibility that
putting the register dump onto the user's stack (or at %rsp - 128, on x86_64)
may clobber the data trampoline.c puts there (unless an altstack is used),
including the very sigcontext. This applies to both i386 and x86_64.
Empirically, we know this works out fine for i386 -- maybe sigcontext doesn't
actually get overwritten, or gets overwritten in just the right way (think
memmove, although i386/sigreturn.c actually uses memcpy...).
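
To make the memmove/memcpy remark concrete, here is a minimal sketch of a
copy between overlapping regions. It is not taken from sigreturn.c; the
buffer and the demo_* names are made up for illustration. ISO C specifies
that memmove handles overlap, while memcpy on overlapping regions is
undefined behavior, so it may happen to work on one implementation and
break on another.

```c
#include <string.h>

/* Hypothetical illustration only; not code from sigreturn.c.
   Moves LEN bytes inside BUF from offset SRC_OFF to offset DST_OFF.
   The two ranges may overlap, which is why memmove is used here.  */
void
demo_shift_overlapping (char *buf, size_t dst_off, size_t src_off,
                        size_t len)
{
  memmove (buf + dst_off, buf + src_off, len);
}

/* Returns 1 if an overlapping copy preserved the source bytes.  */
int
demo_overlap_ok (void)
{
  char buf[16] = "0123456789";

  /* Ranges [0, 8) and [2, 10) overlap by six bytes.  */
  demo_shift_overlapping (buf, 2, 0, 8);
  return memcmp (buf, "0101234567", 10) == 0;
}
```

Whether the register dump and the sigcontext actually overlap in this way
is exactly the open question above; the sketch only shows why the choice
of copy routine would matter if they do.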

I also haven't given much thought to FP state manipulation, since I know very
little about it. It might be that it's broken entirely.

In any case, it wouldn't hurt if you review my attempts at asm & register
manipulation extra carefully.

On TLS and microoptimization
============================

As you can see, I've done a bunch of changes to how TLS-related things work,
on both x86_64 and i386. The reasons for this are:

1. I wanted to minimize <hurd/threadvar.h> and its usages, so every line
   dropped from <hurd/threadvar.h> is a small win. In the end, only
   __hurd_sigthread_stack_{base,end} remain. I think these could be moved
   to <hurd/signal.h>, and we should be able to rid ourselves of
   <hurd/threadvar.h> for good.

2. I have discovered that the way __hurd_local_reply_port is declared is prone
   to GCC miscompiling accesses to it (reported here: [0]). Even when not
   miscompiled, this resulted in pretty inefficient code generation. Since two
   of the three places where __hurd_local_reply_port was used were in signal
   code where we know for sure that TLS is already working (since we must be
   running the signal thread), they could access tcb->reply_port directly
   (using the appropriate THREAD_*MEM accessor macros), and the rest of
   __hurd_local_reply_port / __LIBC_NO_TLS logic could be moved into
   mig-reply.c (and improved/specialized there), so that's what I've done.

   [0]: https://sourceware.org/pipermail/libc-alpha/2023-March/146304.html

3. Disabling / compiling out support for the no-TLS case in libc.so (and
   libpthread.so, etc.) -- but not in static builds, and not in ld.so. This
   turned out to be kind of required for x86_64 (more on that below), but it
   is a nice optimization in and of itself.

To illustrate the overall effect of these optimizations, here's a comparison
of the code generated for mig_get_reply_port () in libc.so for i386:

Before the changes (as shipped in Debian GNU/Hurd):

Dump of assembler code for function __GI___mig_get_reply_port:
   0x0001c0a0 <+0>:       push   %ebp
   0x0001c0a1 <+1>:       mov    %ds,%dx
   0x0001c0a4 <+4>:       mov    %gs,%ax
   0x0001c0a7 <+7>:       mov    %esp,%ebp
   0x0001c0a9 <+9>:       push   %esi
   0x0001c0aa <+10>:      call   0x1e1bc5 <__x86.get_pc_thunk.si>
   0x0001c0af <+15>:      add    $0x24af45,%esi
   0x0001c0b5 <+21>:      push   %ebx
   0x0001c0b6 <+22>:      cmp    %ax,%dx
   0x0001c0b9 <+25>:      je     0x1c130 <__GI___mig_get_reply_port+144>
   0x0001c0bb <+27>:      mov    %gs:0x0,%eax
   0x0001c0c1 <+33>:      mov    0x38(%eax),%eax
   0x0001c0c4 <+36>:      test   %eax,%eax
   0x0001c0c6 <+38>:      je     0x1c110 <__GI___mig_get_reply_port+112>
   0x0001c0c8 <+40>:      mov    %ds,%dx
   0x0001c0cb <+43>:      mov    %gs,%ax
   0x0001c0ce <+46>:      cmp    %ax,%dx
   0x0001c0d1 <+49>:      je     0x1c0f8 <__GI___mig_get_reply_port+88>
   0x0001c0d3 <+51>:      lea    0x1798(%esi),%edx
   0x0001c0d9 <+57>:      mov    %gs:0x0,%eax
   0x0001c0df <+63>:      lea    0x38(%eax),%ecx
   0x0001c0e2 <+66>:      cmp    %edx,%ecx
   0x0001c0e4 <+68>:      je     0x1c0f8 <__GI___mig_get_reply_port+88>
   0x0001c0e6 <+70>:      mov    %ds,%bx
   0x0001c0e9 <+73>:      mov    %gs,%cx
   0x0001c0ec <+76>:      cmp    %cx,%bx
   0x0001c0ef <+79>:      je     0x1c110 <__GI___mig_get_reply_port+112>
   0x0001c0f1 <+81>:      mov    0x38(%eax),%eax
   0x0001c0f4 <+84>:      cmp    %eax,(%edx)
   0x0001c0f6 <+86>:      je     0x1c110 <__GI___mig_get_reply_port+112>
   0x0001c0f8 <+88>:      mov    %ds,%dx
   0x0001c0fb <+91>:      mov    %gs,%ax
   0x0001c0fe <+94>:      cmp    %ax,%dx
   0x0001c101 <+97>:      je     0x1c140 <__GI___mig_get_reply_port+160>
   0x0001c103 <+99>:      pop    %ebx
   0x0001c104 <+100>:     pop    %esi
   0x0001c105 <+101>:     mov    %gs:0x0,%eax
   0x0001c10b <+107>:     pop    %ebp
   0x0001c10c <+108>:     mov    0x38(%eax),%eax
   0x0001c10f <+111>:     ret
   0x0001c110 <+112>:     mov    %ds,%dx
   0x0001c113 <+115>:     mov    %gs,%ax
   0x0001c116 <+118>:     cmp    %ax,%dx
   0x0001c119 <+121>:     je     0x1c150 <__GI___mig_get_reply_port+176>
   0x0001c11b <+123>:     mov    %gs:0x0,%ebx
   0x0001c122 <+130>:     add    $0x38,%ebx
   0x0001c125 <+133>:     call   0x1b7b0 <__GI___mach_reply_port>
   0x0001c12a <+138>:     mov    %eax,(%ebx)
   0x0001c12c <+140>:     jmp    0x1c0f8 <__GI___mig_get_reply_port+88>
   0x0001c12e <+142>:     xchg   %ax,%ax
   0x0001c130 <+144>:     lea    0x1798(%esi),%eax
   0x0001c136 <+150>:     mov    (%eax),%eax
   0x0001c138 <+152>:     jmp    0x1c0c4 <__GI___mig_get_reply_port+36>
   0x0001c13a <+154>:     lea    0x0(%esi),%esi
   0x0001c140 <+160>:     lea    0x1798(%esi),%eax
   0x0001c146 <+166>:     pop    %ebx
   0x0001c147 <+167>:     pop    %esi
   0x0001c148 <+168>:     pop    %ebp
   0x0001c149 <+169>:     mov    (%eax),%eax
   0x0001c14b <+171>:     ret
   0x0001c14c <+172>:     lea    0x0(%esi,%eiz,1),%esi
   0x0001c150 <+176>:     lea    0x1798(%esi),%ebx
   0x0001c156 <+182>:     jmp    0x1c125 <__GI___mig_get_reply_port+133>
End of assembler dump.

After the changes:

Dump of assembler code for function __GI___mig_get_reply_port:
   0x00020060 <+0>:	mov    %gs:0x38,%eax
   0x00020066 <+6>:	test   %eax,%eax
   0x00020068 <+8>:	je     0x20070 <__GI___mig_get_reply_port+16>
   0x0002006a <+10>:	ret
   0x0002006b <+11>:	lea    0x0(%esi,%eiz,1),%esi
   0x0002006f <+15>:	nop
   0x00020070 <+16>:	sub    $0xc,%esp
   0x00020073 <+19>:	call   0x1f790 <__GI___mach_reply_port>
   0x00020078 <+24>:	mov    %eax,%gs:0x38
   0x0002007e <+30>:	add    $0xc,%esp
   0x00020081 <+33>:	ret
End of assembler dump.

I think that this is pretty nice :) Note that I didn't focus on optimizing
mig_get_reply_port () specifically, and also that the versions in ld.so and
in libc.a are more complex (but still nowhere near as complex as the original).
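
For readers who prefer C to disassembly, the shape of the "after" code is
roughly a cached per-thread slot with lazy allocation. The following is
only a sketch under assumptions: struct demo_tcb, demo_mach_reply_port ()
and the other demo_* names are hypothetical stand-ins, not glibc
identifiers, and a C11 _Thread_local variable plays the role of the
%gs-relative TCB field.

```c
/* Hypothetical stand-ins; none of these names come from glibc.  */
typedef unsigned int demo_port_t;

struct demo_tcb
{
  demo_port_t reply_port;   /* Plays the role of tcb->reply_port.  */
};

static _Thread_local struct demo_tcb demo_self;
static unsigned int demo_alloc_count;

/* Stand-in for mach_reply_port (): returns a fresh non-zero name.  */
demo_port_t
demo_mach_reply_port (void)
{
  return 100 + ++demo_alloc_count;
}

/* Fast path: one TLS load and a test.  Slow path: allocate and cache
   the port in the per-thread slot.  */
demo_port_t
demo_get_reply_port (void)
{
  demo_port_t port = demo_self.reply_port;

  if (port == 0)
    {
      port = demo_mach_reply_port ();
      demo_self.reply_port = port;
    }
  return port;
}
```

Calling demo_get_reply_port () twice on the same thread allocates only
once; the second call takes the short path that corresponds to the first
four instructions of the dump.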

Now, the horror story about __LIBC_NO_TLS () and __libc_tls_initialized:

Last time, when I realized that ld.so is lazily pulling object files out of
libc that something references, I understood that just putting the
__libc_tls_initialized into init-first.c would not work, for two reasons: for
one, that would cause ld.so to have its own local copy of
__libc_tls_initialized -- we wouldn't want that, we want libc.so and rtld to
have a consistent idea of whether or not the TLS is initialized. Secondly,
this would pull in init-first.o into rtld, which is very wrong, and in fact
init-first.c even contains code to intentionally cause a linking error if this
ever happens.

The latter could be solved by just declaring __libc_tls_initialized outside of
init-first.c, but the former, I thought, actually required defining it in
ldsodefs.h, to be renamed dl_tls_initialized and accessed using the GL()
macro. When I asked [1], nobody discouraged me from going this way.

[1]: https://sourceware.org/pipermail/libc-alpha/2023-March/146254.html

But in trying to implement that, I ran into trouble with including
<ldsodefs.h> in <tls.h>. Namely, it turned out that there's an inverse
dependency between these two headers already. <ldsodefs.h> was explicitly (and
needlessly) including <tls.h> -- that was easy to get rid of -- but also
implicitly depending on it in several ways. First, it includes <link.h> (for
struct link_map), and that needs <tls.h> to define FORCED_DYNAMIC_TLS_OFFSET
or something like that. Second, it needs to define some locks, so it includes
<libc-lock.h>, and that immediately needs <tls.h> for __libc_lock_owner_self.
Moreover, <libc-lock.h> includes <lowlevellock.h>, and that includes
<atomic.h>, and then <x86/atomic-machine.h> again includes <tls.h> for the
tcbhead_t definition.

So as you can see, there are quite a few ways that <ldsodefs.h> wants to
include <tls.h>! And naturally, including <ldsodefs.h> in <tls.h> then fails
inside <ldsodefs.h>, where it discovers that __rtld_lock_define_recursive is
not defined and so on.

So... I came up with three (!) different ways to work around that, before
coming up with the fourth one, as included in this patch set.

The first way I implemented this was with a pair of out-of-line functions,
__libc_no_tls () and __libc_set_tls_initialized (), whose implementation in a
separate file could freely include <ldsodefs.h>. __libc_no_tls () had to be
exported out of libc.so, @@GLIBC_PRIVATE, but other than that, it seemed to
work. But then I happened to take a look at the generated code and naturally
discovered that it didn't get LTO-inlined (and why would it, if I'm not
building with LTO -- nor would LTO work cross-DSO in any case). Doing an extra
function call (through PLT if we're talking about libpthread.so...) for a
function that is literally a load of a single byte sounded bad. Really, how
could I settle for bad code generated -- all because I couldn't figure out
some stupid header dependencies?

So for the second attempt, I forced the headers to work the way I needed them.
This involved some unpretty kludges; for instance here's how the kludge in
<tls.h> looked:

/* If we're not being included from inside (or after) these few headers,
   include ldsodefs.h for the GL macro.  Otherwise, those headers will
   include (or have already included) ldsodefs.h themselves. This is done
   in this weird way because of issues with circular dependencies between
   these headers.  */
#if !defined (_LIBC_LOCK_H) && !defined (_MACH_LOWLEVELLOCK_H) \
    && !defined (_LOCK_INTERN_H) && !defined (_LINK_H) \
    && !defined (_X86_ATOMIC_MACHINE_H)
# include <ldsodefs.h>
#endif

Add to this more instances of #include <ldsodefs.h> (guarded by similar
include guards) scattered across various random headers, and -- it builds.
The generated code now was what I wanted it to be (a direct access to
GL(dl_tls_initialized)), but this was obviously not pretty or nice.

A much cleaner solution, I thought, would be to split the various headers
involved into more granular parts. For instance, <ldsodefs.h> only really
needs to declare the locks, but not to actually lock and unlock them. If we
moved the lock declaration macros to a new, smaller header, say,
<libc-lock-def.h>, then that would not need <tls.h>. A similar split would be
needed for <lowlevellock.h>. <tls.h> itself could be split: NPTL already has
a separate <tcb-access.h> where THREAD_{G,S}ETMEM macros are; but I was
imagining making it more granular still: for instance, we could have
<tls/tcb.h> which only defines the tcbhead_t layout (and that's what
<x86/atomic-machine.h> would include), <tls/access.h> which defines the
accessors, <tls/setup.h> which defines the various functions to set up TLS for
a thread, <tls/gscope.h> for the GSCOPE decls, and on the Hurd, <tls/no-tls.h>
for __LIBC_NO_TLS (). Each of the headers will be small and only bring in what
it really needs, and not everything and the kitchen sink.

This would be a clean and nice solution -- but it would be quite invasive,
and require changes in all the ports. And I have neither the hardware nor the
capacity to test that this breaks nothing on architectures I know nothing
about. (For instance, what's or1k or nios2? I've no idea.) And it's unlikely
that such a large, but poorly executed and tested reorganization would be
accepted simply because it would be convenient to the Hurd port.

Still, this seemed like the best way to proceed, so I half-implemented a
limited form of this (only splitting the locking headers). This was enough to
get x86_64-gnu to build without the include guard kludges. But then again, I
would need to do the same changes to NPTL, and without a way to really test
them, I didn't feel confident enough.

So that's when I had a small epiphany: TLS for the initial thread is always
set up inside rtld, before it passes control over to libc! There's no need to
share the flag with libc.so, since inside libc.so, it's always initialized!
We can just define __LIBC_NO_TLS () to 0 outside of rtld (in shared builds).
This instantly solves the issue of circular includes (no longer need to use
ldsodefs!), and also makes the generated code even smaller / more efficient
at runtime, since we can statically compile out the no-TLS branches.
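
In macro form, the idea could look something like the sketch below. This
is an illustration under assumptions, not the code from the patches:
DEMO_SHARED and DEMO_IN_RTLD stand in for whatever the build system
really defines, and demo_tls_initialized and demo_reply_port_slot () are
hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical flag; only consulted in rtld and in static builds.  */
bool demo_tls_initialized;

#if defined (DEMO_SHARED) && !defined (DEMO_IN_RTLD)
/* Shared libc.so: rtld has already set up TLS, so this is constant.  */
# define DEMO_NO_TLS() 0
#else
/* rtld itself and static builds: consult the flag at run time.  */
# define DEMO_NO_TLS() (!demo_tls_initialized)
#endif

/* Picks where the reply port lives.  When DEMO_NO_TLS () expands to 0,
   the compiler can drop the first branch entirely.  */
int *
demo_reply_port_slot (int *global_slot, int *tls_slot)
{
  if (DEMO_NO_TLS ())
    return global_slot;
  return tls_slot;
}
```

Built without either macro defined (the rtld / static case), the function
follows the flag; built with only DEMO_SHARED defined, it always returns
the TLS slot and no flag needs to be shared across objects.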

This logic *might* be broken for ifunc resolvers (I don't know -- is it?),
but then apparently they're not able to use most of the normal libc
functionality anyway, so hopefully this is not a big deal.

So much about the TLS, let's finally jump to the

Conclusion
==========

So, yeah, this is "the rest of the x86_64-gnu port". Please do review, try to
build it, and try to run it if you can. And teach *me* to run it, if you know
how to. I have tested that i686-gnu still builds and works, but more testing
is needed.

Some things are still missing, for instance I haven't looked at implementing
{get,set,make,swap}context. It seems they aren't required for basic operation.
And naturally, once we start running/using this for real, we'll discover what
else is missing or broken.

I hope I didn't screw up the rebasing anywhere, but this is a pretty large
patchset, so I might have. If you see a commit that doesn't make sense, or
some "AMENDME" or "fixup" in the commit message or some such, please let me
know :)

I have also started a port of the Hurd proper to x86_64, but I am not sending
out the patches for that yet.

Sergey