From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: by sourceware.org (Postfix, from userid 48)
 id A8A5F3858431; Thu, 16 Mar 2023 23:09:35 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org A8A5F3858431
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
 s=default; t=1679008175;
 bh=lGidJwwjMKo28yviNx1+qMWhV8G6cKr7LKTO477rdLA=;
 h=From:To:Subject:Date:From;
 b=alHP7+JxfZ9V6ykdo6JeAV4ZZJBjNrxwxdLOkF8wg8I0SfzYQXAPvRotdHseVsujf
  Q3RY8Zf2DO6Zic34E6M/iAw0LW07mArqf/XwJvFr0XRzkdBrvMme6z4XAb2zkVLVXo
  ViT/F2D/DjSZHrbnSSiQ1Ut1fMe6PQSAYu2xlHSo=
From: "loganh at synopsys dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug c++/109164] New: aarch64 thread_local initialization error with -ftree-pre and -foptimize-sibling-calls
Date: Thu, 16 Mar 2023 23:09:34 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: c++
X-Bugzilla-Version: 12.1.0
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: loganh at synopsys dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status bug_severity priority component assigned_to reporter target_milestone attachments.created
Message-ID:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109164

            Bug ID: 109164
           Summary: aarch64 thread_local initialization error with -ftree-pre
                    and -foptimize-sibling-calls
           Product: gcc
           Version: 12.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: loganh at synopsys dot com
  Target Milestone: ---

Created attachment 54687
  -->
https://gcc.gnu.org/bugzilla/attachment.cgi?id=54687&action=edit
Bash script that reproduces the issue

With -ftree-pre, -foptimize-sibling-calls, and -O1 enabled, on
aarch64-linux-gnu, GCC 12.1.0 can generate code to access parts of
thread_local variables before the corresponding TLS init function is called
if the variable is accessed from a different TU than the one it is defined
in.

This reordering could likely cause a number of different issues, but the one
that I've run into is that:
- When the compiler generates code to call a virtual function on a reference
  to a global thread_local instance of an object defined in a different
  translation unit, and
- The function calls itself in at least one branch,
the address of the object is fetched from TLS before it's initialized, and
when the vtable lookup is attempted on that object to call the virtual
function, the program segfaults.

Here's an example of the kind of code that will trip it up:

struct Struct {
    virtual void virtual_func();
};

extern thread_local Struct& thread_local_ref;
bool other_func(void);

bool test_func(void)
{
    thread_local_ref.virtual_func();
    return other_func() && test_func();
}

When this is compiled (on aarch64-linux-gnu, with -O1 and -ftree-pre and
-foptimize-sibling-calls) to an object file and then dumped with
objdump -C -d, this is the code produced:

0000000000000000 :
   0: a9be7bfd  stp   x29, x30, [sp, #-32]!
   4: 910003fd  mov   x29, sp
   8: a90153f3  stp   x19, x20, [sp, #16]
   c: 90000000  adrp  x0, 0
  10: f9400000  ldr   x0, [x0]
  14: d53bd041  mrs   x1, tpidr_el0
  18: f8606834  ldr   x20, [x1, x0]
  1c: 90000013  adrp  x19, 0
  20: f9400273  ldr   x19, [x19]
  24: b4000053  cbz   x19, 2c
  28: 94000000  bl    0
  2c: f9400280  ldr   x0, [x20]
  30: f9400001  ldr   x1, [x0]
  34: aa1403e0  mov   x0, x20
  38: d63f0020  blr   x1
  3c: 94000000  bl    0
  40: 12001c00  and   w0, w0, #0xff
  44: 35ffff00  cbnz  w0, 24
  48: a94153f3  ldp   x19, x20, [sp, #16]
  4c: a8c27bfd  ldp   x29, x30, [sp], #32
  50: d65f03c0  ret

Looking at addresses 0x14 through 0x18, you can see that the address of
'thread_local_ref' is read from the TLS block for the thread; the first time
this function is called, this will result in register x20 containing zero,
since the TLS block isn't initialized until the function call at 0x28.
Directly after that, at location 0x2c, a read is attempted from the address
in register x20 (zero), causing a segfault. Without -ftree-pre and
-foptimize-sibling-calls, and without letting `test_func` call itself on at
least one path, the code to get the address of `thread_local_ref` is
generated after the TLS init call, so the problem does not occur.

I've attached a script that will reproduce what I've shown here, as well as
demonstrate the issue in action with a full executable that will produce the
segfault I've described.
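
To make the expected ordering concrete, here is a hand-written, single-file model of it in plain C++. It is only a sketch: `tls_slot`, `tls_init`, and `remaining` are hypothetical stand-ins invented for illustration (they are not GCC symbols), the body given to `other_func` is made up so the example terminates, and the self tail call in `test_func` is written as the loop that sibling-call optimization turns it into. The point it shows is that the slot must be loaded after the init call on every iteration; hoisting the load above the loop, as the disassembly does with x20, would read a null pointer on first use.

```cpp
#include <cassert>

struct Struct {
    int calls = 0;
    virtual void virtual_func() { ++calls; }
    virtual ~Struct() = default;
};

// Hypothetical model of the raw TLS slot behind `thread_local_ref`:
// it stays null until the init function has run on this thread.
thread_local Struct* tls_slot = nullptr;

// Hypothetical model of the "TLS init function": fills the slot on first use.
void tls_init() {
    if (tls_slot == nullptr) {
        thread_local Struct instance;  // per-thread object the reference binds to
        tls_slot = &instance;
    }
}

// Assumption for illustration only: other_func() returns true `remaining` times.
int remaining = 0;
bool other_func() { return remaining-- > 0; }

// Expected ordering: init call first, then the slot load, then the virtual call.
bool test_func() {
    for (;;) {                   // models `return other_func() && test_func();`
        tls_init();              // corresponds to the call at 0x28
        Struct* obj = tls_slot;  // must be read AFTER tls_init(); reading it
                                 // before the loop would yield nullptr here
        obj->virtual_func();     // vtable lookup through the loaded address
        if (!other_func()) {
            return false;        // the && short-circuits and recursion stops
        }
    }
}
```

Setting `remaining = 2` and calling `test_func()` runs the virtual call three times and returns false, with `tls_slot` non-null afterwards.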