public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug target/113686] New: [RISC-V] TLS (Local Exec) relaxation on structures (LE)
@ 2024-01-31 18:47 hpa at zytor dot com
  2024-01-31 19:05 ` [Bug target/113686] " palmer at gcc dot gnu.org
  2024-02-01 20:24 ` hpa at zytor dot com
  0 siblings, 2 replies; 3+ messages in thread
From: hpa at zytor dot com @ 2024-01-31 18:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113686

            Bug ID: 113686
           Summary: [RISC-V] TLS (Local Exec) relaxation on structures
                    (LE)
           Product: gcc
           Version: 13.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hpa at zytor dot com
  Target Milestone: ---

When the Local Exec TLS model is in use, gcc generates inefficient code for
accessing the member of a structure:

struct foobar {
   int alpha;
   int beta;
};

_Thread_local struct foobar foo;

void func(int bar)
{
    foo.beta = bar;
}

    # Version 1
    lui    a1,%tprel_hi(foo)
    add    a1,a1,tp,%tprel_add(foo)
    addi   a1,a1,%tprel_lo(foo)
    sw     a0,4(a1)

However, in this case it could be generated as:

    # Version 2
    lui    a1,%tprel_hi(sym+4)
    addi   a1,a1,tp,%tprel_add(sym+4)
    sw     a0,%tprel_lo(sym+4)(a1)

... which, if %tprel_hi(sym+4) == 0, as it often is for small embedded
software, the linker can relax to a simple (tp) reference:

    # Version 2a (post-relaxation with small .tbss)
    sw a0,%tprel_lo(sym+4)(tp)

The linker will *not* relax version 1 all the way; leaving an unnecessary mv:

    # Version 1a (post-relaxation with small .tbss)
    mv a1,tp
    sw a0,%tprel_lo(sym+4)(tp)

It is of course trickier for the case of multiple subsequent references to the
structure if the structure is not aligned, as gcc can't know a priori where the
4K breaks are[*]. The version 1 code is more efficient in that case (3
instructions + 1 instruction/field as opposed to 3 instructions/field.)

However, if the structure *is* aligned, gcc will still not optimize 1 into 2.

There are at least a few options I see:

1. gcc option: gcc can generate version 2 code for a single field reference, or
if the alignment is such that all fields are guaranteed to fall inside the same
4K window.

2. gcc and optional ABI option: introduce a "TLS TE-tiny" model for deep
embedded use, where the combined size of the TSS area is limited to 4K
equivalent to the way direct gp references [or zero, if the global pointer is
0] work. Thus, direct (tp) references can be used.

NOTE: With the current binutils, this will error unless .option norelax is in
effect. It might be desirable to instead have a new relocation type, which
would require binutils support. Alternatively, ld should recognize that the TLS
offset is within +/- 2K and suppress the warning in that case (since at that
point the address is available the the linker.)

The linker could be further optimized by allowing the TLS to offset; presumably
equivalently to the __global_pointer$ symbol.

3. binutils option: teach ld to relax these kinds of chained pointer
references.



[*] Rant: in my opinion, the lui/auipc instructions are fundamentally
misdesigned by not having an overlap bit to guarantee a sizable window.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* [Bug target/113686] [RISC-V] TLS (Local Exec) relaxation on structures (LE)
  2024-01-31 18:47 [Bug target/113686] New: [RISC-V] TLS (Local Exec) relaxation on structures (LE) hpa at zytor dot com
@ 2024-01-31 19:05 ` palmer at gcc dot gnu.org
  2024-02-01 20:24 ` hpa at zytor dot com
  1 sibling, 0 replies; 3+ messages in thread
From: palmer at gcc dot gnu.org @ 2024-01-31 19:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113686

palmer at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nelsonc1225 at sourceware dot org,
                   |                            |palmer at gcc dot gnu.org
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2024-01-31

--- Comment #1 from palmer at gcc dot gnu.org ---
(In reply to H. Peter Anvin from comment #0)
> When the Local Exec TLS model is in use, gcc generates inefficient code for
> accessing the member of a structure:
> 
> struct foobar {
>    int alpha;
>    int beta;
> };
> 
> _Thread_local struct foobar foo;
> 
> void func(int bar)
> {
>     foo.beta = bar;
> }
> 
>     # Version 1
>     lui    a1,%tprel_hi(foo)
>     add    a1,a1,tp,%tprel_add(foo)
>     addi   a1,a1,%tprel_lo(foo)
>     sw     a0,4(a1)
> 
> However, in this case it could be generated as:
> 
>     # Version 2
>     lui    a1,%tprel_hi(sym+4)
>     addi   a1,a1,tp,%tprel_add(sym+4)
>     sw     a0,%tprel_lo(sym+4)(a1)
> 
> ... which, if %tprel_hi(sym+4) == 0, as it often is for small embedded
> software, the linker can relax to a simple (tp) reference:
> 
>     # Version 2a (post-relaxation with small .tbss)
>     sw a0,%tprel_lo(sym+4)(tp)
> 
> The linker will *not* relax version 1 all the way; leaving an unnecessary mv:
> 
>     # Version 1a (post-relaxation with small .tbss)
>     mv a1,tp
>     sw a0,%tprel_lo(sym+4)(tp)
> 
> It is of course trickier for the case of multiple subsequent references to
> the structure if the structure is not aligned, as gcc can't know a priori
> where the 4K breaks are[*]. The version 1 code is more efficient in that
> case (3 instructions + 1 instruction/field as opposed to 3
> instructions/field.)
> 
> However, if the structure *is* aligned, gcc will still not optimize 1 into 2.
> 
> There are at least a few options I see:
> 
> 1. gcc option: gcc can generate version 2 code for a single field reference,
> or if the alignment is such that all fields are guaranteed to fall inside
> the same 4K window.

IIUC we could do this without adding anything to the linker or psABI, it's just
better code from GCC (we already have TPREL_LO12_S for the stores).  That's
just better code so it seems uncontroversial to me.

> 2. gcc and optional ABI option: introduce a "TLS TE-tiny" model for deep
> embedded use, where the combined size of the TSS area is limited to 4K
> equivalent to the way direct gp references [or zero, if the global pointer
> is 0] work. Thus, direct (tp) references can be used.

Unless I'm missing something, we never emit direct GP references from GCC right
now.  We rely on the linker to relax them.

> NOTE: With the current binutils, this will error unless .option norelax is
> in effect. It might be desirable to instead have a new relocation type,
> which would require binutils support. Alternatively, ld should recognize
> that the TLS offset is within +/- 2K and suppress the warning in that case
> (since at that point the address is available the the linker.)
> 
> The linker could be further optimized by allowing the TLS to offset;
> presumably equivalently to the __global_pointer$ symbol.
> 
> 3. binutils option: teach ld to relax these kinds of chained pointer
> references.

I'd favor adding support better for relaxing TP-relative sequences to the
linker where we can, it avoids the need for a new code model and we've already
got most of the linker complexity as it's required for GP.  So I think we can
essentially just call these LD missed optimizations.  Nelson might be out for a
bit, but I added him to the CC list.

> [*] Rant: in my opinion, the lui/auipc instructions are fundamentally
> misdesigned by not having an overlap bit to guarantee a sizable window.

I agree we've got auipc issues, it bites us all over the place (we essentially
can't share a hi* between multiple lo*s, as we don't know when overflow is
going to happen).  There'd been some vague proposals to add a third relocation
in the chain to align things, but I think they fizzled out because it'd require
talking to the psABI folks.

I think we're broadly safe for lui, though, so not sure if I'm missing
something there?  The low bits are always 0 so the intermediate alignment is
known.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* [Bug target/113686] [RISC-V] TLS (Local Exec) relaxation on structures (LE)
  2024-01-31 18:47 [Bug target/113686] New: [RISC-V] TLS (Local Exec) relaxation on structures (LE) hpa at zytor dot com
  2024-01-31 19:05 ` [Bug target/113686] " palmer at gcc dot gnu.org
@ 2024-02-01 20:24 ` hpa at zytor dot com
  1 sibling, 0 replies; 3+ messages in thread
From: hpa at zytor dot com @ 2024-02-01 20:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113686

--- Comment #2 from H. Peter Anvin <hpa at zytor dot com> ---
The intermediate alignment for lui is known, so if an object is known to fit
*entirely* within its natural alignment then it can be safely CSE'd, but this
is typically not the case with structures or arrays.

It would be nice to fix this in the architecture (as a new standard extension).

Unfortunately, due to the decision to allocate 3/4 of the instruction encoding
space to 16-bit instructions only 3 completely reserved 5-bit opcodes remain -
and that is only what is currently in the ISA document. Adding "auipc1" and
"lui1" would consume two of those (they would most naturally fall at op =
1X10011)

There are some less desirable options, like reclaiming part of the encoding
space for longer instructions (requiring another 1 bit in the prefix for long
instructions would provide the additional two encodings); on RV32 *only* the
RV64 encoding space op = 0X11011 could be used, but I doubt it would be much
appreciated to have this capability on RV32 and not RV64.

(Not to mention hacks like only having part of the register space accessible,
which wouldn't necessarily be horrendous, though, as these instructions would
belong in some pretty fixed patterns.)

One could also say that auipc1 would be "good enough" (combined with
PC-relative relocations for TLS, at least for the TE model) and would still be
valuable enough to occupy a full opcode.

(Obviously, the idea would be that these instructions really are
pseudo-instructions carrying one more immediate bit, and that the linker would
apply those bits using "HI21" relocations.)

I think this is something where the ISA architects could really use to hear
from you compiler developers; since I'm not really familiar with where the pain
points in the compiler lie, it is hard for me to speak authoritatively.

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2024-02-01 20:24 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-31 18:47 [Bug target/113686] New: [RISC-V] TLS (Local Exec) relaxation on structures (LE) hpa at zytor dot com
2024-01-31 19:05 ` [Bug target/113686] " palmer at gcc dot gnu.org
2024-02-01 20:24 ` hpa at zytor dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).