public inbox for gnu-gabi@sourceware.org
 help / color / mirror / Atom feed
* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00 Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section Sriraman Tallam
  2017-01-01  0:00 ` Cary Coutant
@ 2017-01-01  0:00 ` Rafael Espíndola
  2017-01-01  0:00   ` Sriraman Tallam
  2017-01-01  0:00 ` H.J. Lu
  2 siblings, 1 reply; 29+ messages in thread
From: Rafael Espíndola @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Sriraman Tallam
  Cc: gnu-gabi, binutils, Xinliang David Li, Cary Coutant,
	Sterling Augustine, Paul Pluzhnikov, Ian Lance Taylor, H.J. Lu,
	Rahul Chaudhry, Luis Lozano, Peter Collingbourne, Rui Ueyama

On 25 April 2017 at 13:12, Sriraman Tallam <tmsriram@google.com> wrote:
> We identified a problem with PIE executables, more than 5% code size
> bloat compared to non-PIE and we have a few proposals to reduce the
> bloat.  Please take a look and let us know what you think.

Just a bit of terminology, it is not code, it is a read only data.
Why is the table size a problem? I can imagine a few reasons, but it
would be nice to know which one you are trying to solve:

* The file size itself is a problem for shipping the file.
* The startup time is a problem and a compact table makes the dynamic
linker faster.
* The memory usage is a problem.

For the last one one thing that could be done is have something like
PT_GNU_RELRO but that tells the dynamic linker to unmmap the region
completely once it is done with the relocations.

In addition what is done for COFF, the other existing solution I know
is https://wiki.mozilla.org/Elfhack. In the case of mozilla the issue
was just the size of the binary being shipped I think.

Cheers,
Rafael

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
@ 2017-01-01  0:00 Sriraman Tallam
  2017-01-01  0:00 ` Cary Coutant
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Sriraman Tallam @ 2017-01-01  0:00 UTC (permalink / raw)
  To: gnu-gabi, binutils
  Cc: Xinliang David Li, Cary Coutant, Sterling Augustine,
	Paul Pluzhnikov, Ian Lance Taylor, H.J. Lu, Rahul Chaudhry,
	Luis Lozano, Rafael Espíndola, Peter Collingbourne,
	Rui Ueyama

We identified a problem with PIE executables, more than 5% code size
bloat compared to non-PIE and we have a few proposals to reduce the
bloat.  Please take a look and let us know what you think.

* What is the problem?

PIE is a security hardening feature that enables ASLR (Address Space
Layout Randomization) and enables the executable to be loaded at a
random virtual address upon every execution instance. On an average, a
binary when built as PIE is larger by 5% to 9%, as measured on a suite
of benchmarks used at Google where the average text size is ~100MB,
when compared to the one built without PIE.  This is also independent
of the target architecture and we found this to be true for x86_64,
arm64 and power.  We noticed that the primary reason for this code
size bloat is due to the extra dynamic relocations that are generated
in order to make the binary position independent.  This proposal
introduces new ways to represent these dynamic relocations that can
reduce the code size bloat to just a few percent.

As an example,  to show the bloat in code size, here is the data from
one of our larger  binaries,

Without PIE, the binary’s code size in bytes is this as displayed by
the ‘size’ command:

 text             data            bss           dec
504663285 16242884 9130248 530036417

With PIE, the binary’s code size in bytes is this as displayed by the
‘size’ command:

 text            data           bss           dec
539781977 16242900 9130248 565155125

The text size of the binary grew by 7% and the total size by 6.6%.
Our experiments have shown that the binary sizes grow anywhere from 5%
to 9%  with PIE on almost all benchmarks we looked at.  Notice that
almost all the code bloat comes from the “text” segment of the binary,
which contains the executable code of the application and any
read-only data.  We looked into this segment to see why this is
happening and found that the size of the section that contains the
dynamic relocations for a binary explodes with PIE.  For instance,
without PIE, for the above binary the dynamic relocation section
contains 46 entries whereas with PIE, the same section contains
1463325 entries.  It takes 24 bytes to store one entry, that is 3
integer values each of size 8 bytes.  So, the dynamic relocations
alone need an extra space of (1463325 - 46) * 8 bytes which is 35
million bytes which is almost all the bloat incurred!.

* What are these extra dynamic relocations that are created for PIE executables?

We noticed that these extra relocations for PIE binaries have a common
pattern and are needed for the reason that it is not known until
run-time where the binary will be loaded.  All of these extra dynamic
relocations are of the ELF type R_X86_64_RELATIVE.   Let us show using
an example what these relocations do.
Let us take an example of a program that stores the address of a global:

#include <stdio.h>

const int a = 10;

const int *b = &a;

int main() {

 printf (“b = %p\n”, b);

}

First, let us look at the binary built without PIE.  Let’s look at the
data section where ‘b’ and ‘a’ are allocated.

00000000004007d0 <a>:
 4007d0:       0a 00


0000000000401b10 <b>:
 401b10:       d0 07
 401b12:       40 00 00

Variable ‘a’ is allocated at address 0x4007d0 which matches the output
when running the binary.  ‘b’ is allocated at address 0x401b10 and its
contents in little-endian byte order is the address of ‘a’.

Now, lets us examine the contents of the PIE binary:

00000000000008d8 <a>:
8d8:   0a 00

0000000000001c50 <b>:
   1c50:       d8 08
                    1c50: R_X86_64_RELATIVE *ABS*+0x8d8
   1c52:       00 00
   1c54:       00 00


Notice there is a dynamic relocation here which tells the dynamic
linker that this value needs to be fixed at run-time.  This is needed
because ASLR can load this binary anywhere in the address space and
this relocation fixes the address after it is loaded.


* More details about R_X86_64_RELATIVE relocations

This relocation is worth 24 bytes  and has three fields

Offset

Type - here it is R_X86_64_RELATIVE

Addend (what extra value needs to be added)

The offset field of this relocation is the address offset from the
start where this relocation applies.  The type field indicates the
type of the dynamic relocation but we are interested in particularly
one type of dynamic relocation, R_X86_64_RELATIVE.   This is important
because in the motivating example that we presented above, all the
extra dynamic relocations were of this type!


* We have these proposals to reduce the size of the dynamic relocations section:


Idea A: Shrinking the size of the dynamic relocations

1. Use the offset of the relocation to store the addend

2. Create a separate section for storing R_X86_64_RELATIVE type
relocations, “.pie.dyn”.

3. Create a new relocation type, R_X86_64_RELATIVE_OFFSET that has
exactly one field, the offset.

4. The type field is not necessary as these is only one type of
relocation in this new section.

5. Store the addend at the offset, which has enough space to store 8
byte addends.

6. The dynamic linker can be thought to read the value from this
offset for the addend rather than reading this from the entry.

7. This new section is just a sequence of offsets which are
R_X86_64_RELATIVE relocations.  The dynamic linker needs to be taught
to apply these relocations correctly.

8. Figure out a way to either prevent this from running or warning at
startup when trying to run this program with old dynamic linkers.

9. Eliminating two-thirds of the relocation reduces the size of these
sections by 66%.  Additionally, this section can also be compressed
and a compression factor of 10 is possible which makes the total size
of these sections 3% to 4% of the original.

10. The decompression by the dynamic linker can be done in chunks to
keep the memory overhead fixed and small.  Notice that the
decompressed chunk’s contents can be discarded after the relocation is
applied.

Idea B: Compressing the dynamic relocation section into .zrela.dyn

1. Simpler to implement.

2. The dynamic linker needs to decompress and apply these relocations
without  memory overhead, which otherwise defeats the purpose of doing
this, by decompressing in chunks.

3. The decompressed chunks can be discarded as and when the
relocations are applied so a fixed size buffer for decompression
should work well.

4. Figure out a way to either prevent this from running or warning at
startup when trying to run this program with unsupported dynamic
linkers.

5. Compression factors that can be achieved with gzip are a factor of
10 which makes the total size 10% of the original.


Thanks to Sterling Augustine, Ian Lance Taylor, Paul Pluzhnikov, and
David Li for the numerous  inputs!  Please let us know what you think
and what would be suitable for implementation. Also, attaching the
notes from having this discussion on binutils before I sent it here:

Notes from Peter Collingbourne:


* You can make the observation that relative relocations are normally
aligned to the machine's pointer size, and they normally pertain to
consecutive entries in a table of addresses, such as a vtable. That
hints at a more efficient scheme where instead of storing offsets you
store the difference between the offset and the previous one, divided
by the machine's pointer size. That, combined with a simple run-length
encoding scheme (see below) should be enough. Any unaligned
relocations can be represented using regular R_*_RELATIVE relocations.

* What I had in mind for backwards compatibility was that you could
have the linker arrange for the binary to contain code that applies
the relocations when the image is loaded, but only if the dynamic
loader hasn't applied them itself.

i.e. you could have something like

static void *foo = &foo;

void apply_relocs() {
  if (foo != &foo) {
    //apply relocations
  }
}

such that apply_relocs would be the first thing called by the dynamic
loader before the user's initialization code.

Of course, if you didn't want to support old dynamic loaders at all,
you could exit() inside the body of the if statement instead of
applying relocations.

* I don't think you need any sort of standard compression for this.
The run-length encoding scheme I had in mind would represent the first
group of consecutive relocations in the file as a pair of values
(offset of the relocation / machine pointer size, number of
consecutive relocations), and subsequent groups as the pair (relative
offset from previous group of consecutive relocations / machine
pointer size, number of consecutive relocations) all using the uleb128
encoding. So for example, the relocation list:

0xc8
0xd0
0xd8
0xe0
0x100
0x108
0x110

would be encoded as: uleb128(0xc8 / 8) uleb128(4) uleb128((0x100-0xe0)
/ 8) uleb128(3)

i.e. 4 bytes.

In one experiment, I extracted the list of relative relocations from a
Chromium binary and compressed them using this scheme. There were
90302 relocations in total, and this scheme compressed them down to
65488 bytes, which would be 3% of the original for a rela section
(this experiment was actually conducted using an ARM binary, but I
would expect similar results for x86_64).

Notes from Rui Ueyama:

* Windows DLLs have the same problem and already solves it in their
own way, so let me share my knowledge about how DLLs work. It might be
useful for you.

* On Windows, each DLL is linked against a fixed base address and
usually loaded to that address. However, if there's already another
DLL that overlaps in the address space, the Windows loader has to
relocate it. To do that, Windows DLLs contain ".reloc" sections which
contain offsets that need to be fixed up at runtime. If the Windows
loader find that a DLL cannot be loaded to its desired base address,
it loads it to somewhere else, and add <actual base address> -
<desired base address> to each offset that is specified by the .reloc
section.

* In ELF terms, .reloc sections contains arrays of relocation offsets.
All these relocations in the section are implicitly R_X86_64_RELATIVE,
and addends are read from section contents (so it is REL as opposed to
RELA).

* This already reduce the size of relocations to 1/3, but Windows does
more to reduce .reloc size (probably because it was invented for late
'80s or '90s PCs.) Offsets in .reloc are grouped by page where page
size is wordsize minus 16 bits, and offsets sharing the same page
address are stored consecutively to represent them with less space.
This is a very similar technique as the page table which is grouped by
(multiple stages of) pages.

* For example, let's say we have 0x00030, 0x00500, 0x01000, 0x01100,
0x20004, and 0x20008 in a .reloc section. In the section, they are
represented like this:

  0x00000  -- takes 6 bytes
    0x0030 -- each takes 2 bytes
    0x0500
    0x1000
    0x1100
 0x20000 -- takes 6 bytes
    0x0004 -- each takes 2 bytes
    0x0008

* In total, the .reloc section is 24 bytes. If you store each address
as 8 byte integer, it takes 48 bytes, so this technique indeed reduces
the size of the .reloc section.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00 ` Cary Coutant
  2017-01-01  0:00   ` Sriraman Tallam
  2017-01-01  0:00   ` H.J. Lu
@ 2017-01-01  0:00   ` Florian Weimer
  2017-01-01  0:00     ` Alan Modra
  2017-01-01  0:00   ` Markus Trippelsdorf
  3 siblings, 1 reply; 29+ messages in thread
From: Florian Weimer @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Cary Coutant, Sriraman Tallam
  Cc: gnu-gabi, binutils, Xinliang David Li, Sterling Augustine,
	Paul Pluzhnikov, Ian Lance Taylor, H.J. Lu, Rahul Chaudhry,
	Luis Lozano, Rafael Espíndola, Peter Collingbourne,
	Rui Ueyama

On 04/26/2017 06:06 AM, Cary Coutant wrote:
> 4. Evaluate some alternate representations to improve upon the simple
> list of addresses. An idea I've been considering is emitting a bitmap
> -- an alternating series of pairs (starting address, bits), where
> "bits" is just a fixed-size bit vector describing each word starting
> at the given starting address. The "bits" field could be, say, 128 or
> 256 bits in length, or could begin with a byte that defines its
> length. I think this would work well, as the relocations are often
> clustered, but haven't yet done any testing of that hypothesis. A
> simple run-length encoding, as previously suggested, would also work,
> of course. I'm not in favor of using a general-purpose compression
> algorithm on the relocations.

It might also be possible to sort the relocated objects so that they are 
more contiguous.

It could be beneficial to arrange things so that vector instructions can 
be used for the relocation.  I'm not sure if it is worthwhile to design 
a compression scheme based on bit fiddling for this because those who 
want compression can probably use a compressed file system anyway.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00 Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section Sriraman Tallam
  2017-01-01  0:00 ` Cary Coutant
  2017-01-01  0:00 ` Rafael Espíndola
@ 2017-01-01  0:00 ` H.J. Lu
  2017-01-01  0:00   ` Sriraman Tallam
  2 siblings, 1 reply; 29+ messages in thread
From: H.J. Lu @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Sriraman Tallam
  Cc: gnu-gabi, binutils, Xinliang David Li, Cary Coutant,
	Sterling Augustine, Paul Pluzhnikov, Ian Lance Taylor,
	Rahul Chaudhry, Luis Lozano, Rafael Espíndola,
	Peter Collingbourne, Rui Ueyama

On Tue, Apr 25, 2017 at 10:12 AM, Sriraman Tallam <tmsriram@google.com> wrote:
> We identified a problem with PIE executables, more than 5% code size
> bloat compared to non-PIE and we have a few proposals to reduce the
> bloat.  Please take a look and let us know what you think.
>
> * What is the problem?
>
> PIE is a security hardening feature that enables ASLR (Address Space
> Layout Randomization) and enables the executable to be loaded at a
> random virtual address upon every execution instance. On an average, a
> binary when built as PIE is larger by 5% to 9%, as measured on a suite
> of benchmarks used at Google where the average text size is ~100MB,
> when compared to the one built without PIE.  This is also independent
> of the target architecture and we found this to be true for x86_64,
> arm64 and power.  We noticed that the primary reason for this code
> size bloat is due to the extra dynamic relocations that are generated
> in order to make the binary position independent.  This proposal
> introduces new ways to represent these dynamic relocations that can
> reduce the code size bloat to just a few percent.
>
> As an example,  to show the bloat in code size, here is the data from
> one of our larger  binaries,
>
> Without PIE, the binary’s code size in bytes is this as displayed by
> the ‘size’ command:
>
>  text             data            bss           dec
> 504663285 16242884 9130248 530036417
>
> With PIE, the binary’s code size in bytes is this as displayed by the
> ‘size’ command:
>
>  text            data           bss           dec
> 539781977 16242900 9130248 565155125
>
> The text size of the binary grew by 7% and the total size by 6.6%.
> Our experiments have shown that the binary sizes grow anywhere from 5%
> to 9%  with PIE on almost all benchmarks we looked at.  Notice that
> almost all the code bloat comes from the “text” segment of the binary,
> which contains the executable code of the application and any
> read-only data.  We looked into this segment to see why this is
> happening and found that the size of the section that contains the
> dynamic relocations for a binary explodes with PIE.  For instance,
> without PIE, for the above binary the dynamic relocation section
> contains 46 entries whereas with PIE, the same section contains
> 1463325 entries.  It takes 24 bytes to store one entry, that is 3
> integer values each of size 8 bytes.  So, the dynamic relocations
> alone need an extra space of (1463325 - 46) * 8 bytes which is 35
> million bytes which is almost all the bloat incurred!.
>
> * What are these extra dynamic relocations that are created for PIE executables?
>
> We noticed that these extra relocations for PIE binaries have a common
> pattern and are needed for the reason that it is not known until
> run-time where the binary will be loaded.  All of these extra dynamic
> relocations are of the ELF type R_X86_64_RELATIVE.   Let us show using
> an example what these relocations do.
> Let us take an example of a program that stores the address of a global:
>
> #include <stdio.h>
>
> const int a = 10;
>
> const int *b = &a;
>
> int main() {
>
>  printf (“b = %p\n”, b);
>
> }
>
> First, let us look at the binary built without PIE.  Let’s look at the
> data section where ‘b’ and ‘a’ are allocated.
>
> 00000000004007d0 <a>:
>  4007d0:       0a 00
>
>
> 0000000000401b10 <b>:
>  401b10:       d0 07
>  401b12:       40 00 00
>
> Variable ‘a’ is allocated at address 0x4007d0 which matches the output
> when running the binary.  ‘b’ is allocated at address 0x401b10 and its
> contents in little-endian byte order is the address of ‘a’.
>
> Now, lets us examine the contents of the PIE binary:
>
> 00000000000008d8 <a>:
> 8d8:   0a 00
>
> 0000000000001c50 <b>:
>    1c50:       d8 08
>                     1c50: R_X86_64_RELATIVE *ABS*+0x8d8
>    1c52:       00 00
>    1c54:       00 00
>
>
> Notice there is a dynamic relocation here which tells the dynamic
> linker that this value needs to be fixed at run-time.  This is needed
> because ASLR can load this binary anywhere in the address space and
> this relocation fixes the address after it is loaded.
>
>
> * More details about R_X86_64_RELATIVE relocations
>
> This relocation is worth 24 bytes  and has three fields
>
> Offset
>
> Type - here it is R_X86_64_RELATIVE
>
> Addend (what extra value needs to be added)
>
> The offset field of this relocation is the address offset from the
> start where this relocation applies.  The type field indicates the
> type of the dynamic relocation but we are interested in particularly
> one type of dynamic relocation, R_X86_64_RELATIVE.   This is important
> because in the motivating example that we presented above, all the
> extra dynamic relocations were of this type!
>
>
> * We have these proposals to reduce the size of the dynamic relocations section:
>

There are 3 pieces of run-time relocation information:

1. Type and symbol. 4 or 8 bytes
2. Offset. 4 or 8 bytes
3. Addend.  4 or 8 bytes

If we use REL instead of RELA, addend can be implicit and stored in-place.
If we limit the type to relative relocation, we only need offset.
This is for PIC,
not just for PIE. An we can use special encoding scheme for offset table,
which can be placed in DT_GNU_RELATIVE_REL with
DT_GNU_RELATIVE_RELSZ.

-- 
H.J.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00 Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section Sriraman Tallam
@ 2017-01-01  0:00 ` Cary Coutant
  2017-01-01  0:00   ` Sriraman Tallam
                     ` (3 more replies)
  2017-01-01  0:00 ` Rafael Espíndola
  2017-01-01  0:00 ` H.J. Lu
  2 siblings, 4 replies; 29+ messages in thread
From: Cary Coutant @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Sriraman Tallam
  Cc: gnu-gabi, binutils, Xinliang David Li, Sterling Augustine,
	Paul Pluzhnikov, Ian Lance Taylor, H.J. Lu, Rahul Chaudhry,
	Luis Lozano, Rafael Espíndola, Peter Collingbourne,
	Rui Ueyama

> Idea A: Shrinking the size of the dynamic relocations

In a RELATIVE relocation, the r_offset field is really the only field
of interest. The symbol and addend are not used -- the dynamic loader
simply adjusts the address at the given offset by the relocation
factor for the load module. Thus, it truly is possible to reduce these
relocations to just the one word.

Simon Baldwin did a lot of work on this a few years ago for Chromium,
writing a post-link utility to extract all the RELATIVE relocations
and rewrite them to a new section.

When Simon was working on his utility, I decided it wouldn't be too
difficult to implement it directly in gold, and I toyed around with
it, but never devoted the time to take all the steps it would take to
bring this about. (Frankly, I was underwhelmed by the quoted savings,
but it sounds like I may have underestimated the value of that
savings. Sorry!)

Anyway, I cleaned up what I had done a bit, and pushed it to a
personal binutils branch, users/ccoutant/experimental-relr, for your
consideration. Here's what I did:

1. Added a new section .relr.dyn, with a new type SHT_RELR, and
sh_entsize = address size. The contents of this section are (for now)
a simple list of addresses that require relative relocations.

2. Added new dynamic table entries, DT_RELR, DT_RELRSZ, and
DT_RELRENT, which point to the new section.

3. Added a linker option, --experimental-use-relr, to enable the feature.

4. Modified the x86-64 target to emit new-style RELR relocations. I
punted for 64-bit relocations on x32, and for IFUNC relocations --
those will still be emitted as old-style relative relocations in
.rela.dyn.

For my experimentation, I picked the next-available values in the
gABI-reserved range, rather than vendor-specific values, for no good
reason. They're easy to renumber if we need to experiment with this in
the wild before standardizing.

Still to do:

1. Modify the other targets in gold to support RELR relocations.

2. Add readelf/objdump support. (You can still use readelf as is --
it'll just show the section type and dynamic table entries as numbers.
And you can dump the new section with "readelf -x.relr.dyn".)

3. Add dynamic loader support.

4. Evaluate some alternate representations to improve upon the simple
list of addresses. An idea I've been considering is emitting a bitmap
-- an alternating series of pairs (starting address, bits), where
"bits" is just a fixed-size bit vector describing each word starting
at the given starting address. The "bits" field could be, say, 128 or
256 bits in length, or could begin with a byte that defines its
length. I think this would work well, as the relocations are often
clustered, but haven't yet done any testing of that hypothesis. A
simple run-length encoding, as previously suggested, would also work,
of course. I'm not in favor of using a general-purpose compression
algorithm on the relocations.

5. Make a proposal to the gABI to add SHT_RELR and DT_RELR*.

-cary

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00 ` H.J. Lu
@ 2017-01-01  0:00   ` Sriraman Tallam
  0 siblings, 0 replies; 29+ messages in thread
From: Sriraman Tallam @ 2017-01-01  0:00 UTC (permalink / raw)
  To: H.J. Lu
  Cc: gnu-gabi, binutils, Xinliang David Li, Cary Coutant,
	Sterling Augustine, Paul Pluzhnikov, Ian Lance Taylor,
	Rahul Chaudhry, Luis Lozano, Rafael Espíndola,
	Peter Collingbourne, Rui Ueyama

On Tue, Apr 25, 2017 at 11:02 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Tue, Apr 25, 2017 at 10:12 AM, Sriraman Tallam <tmsriram@google.com> wrote:
>> We identified a problem with PIE executables, more than 5% code size
>> bloat compared to non-PIE and we have a few proposals to reduce the
>> bloat.  Please take a look and let us know what you think.
>>
>> * What is the problem?
>>
>> PIE is a security hardening feature that enables ASLR (Address Space
>> Layout Randomization) and enables the executable to be loaded at a
>> random virtual address upon every execution instance. On an average, a
>> binary when built as PIE is larger by 5% to 9%, as measured on a suite
>> of benchmarks used at Google where the average text size is ~100MB,
>> when compared to the one built without PIE.  This is also independent
>> of the target architecture and we found this to be true for x86_64,
>> arm64 and power.  We noticed that the primary reason for this code
>> size bloat is due to the extra dynamic relocations that are generated
>> in order to make the binary position independent.  This proposal
>> introduces new ways to represent these dynamic relocations that can
>> reduce the code size bloat to just a few percent.
>>
>> As an example,  to show the bloat in code size, here is the data from
>> one of our larger  binaries,
>>
>> Without PIE, the binary’s code size in bytes is this as displayed by
>> the ‘size’ command:
>>
>>  text             data            bss           dec
>> 504663285 16242884 9130248 530036417
>>
>> With PIE, the binary’s code size in bytes is this as displayed by the
>> ‘size’ command:
>>
>>  text            data           bss           dec
>> 539781977 16242900 9130248 565155125
>>
>> The text size of the binary grew by 7% and the total size by 6.6%.
>> Our experiments have shown that the binary sizes grow anywhere from 5%
>> to 9%  with PIE on almost all benchmarks we looked at.  Notice that
>> almost all the code bloat comes from the “text” segment of the binary,
>> which contains the executable code of the application and any
>> read-only data.  We looked into this segment to see why this is
>> happening and found that the size of the section that contains the
>> dynamic relocations for a binary explodes with PIE.  For instance,
>> without PIE, for the above binary the dynamic relocation section
>> contains 46 entries whereas with PIE, the same section contains
>> 1463325 entries.  It takes 24 bytes to store one entry, that is 3
>> integer values each of size 8 bytes.  So, the dynamic relocations
>> alone need an extra space of (1463325 - 46) * 8 bytes which is 35
>> million bytes which is almost all the bloat incurred!.
>>
>> * What are these extra dynamic relocations that are created for PIE executables?
>>
>> We noticed that these extra relocations for PIE binaries have a common
>> pattern and are needed for the reason that it is not known until
>> run-time where the binary will be loaded.  All of these extra dynamic
>> relocations are of the ELF type R_X86_64_RELATIVE.   Let us show using
>> an example what these relocations do.
>> Let us take an example of a program that stores the address of a global:
>>
>> #include <stdio.h>
>>
>> const int a = 10;
>>
>> const int *b = &a;
>>
>> int main() {
>>
>>  printf (“b = %p\n”, b);
>>
>> }
>>
>> First, let us look at the binary built without PIE.  Let’s look at the
>> data section where ‘b’ and ‘a’ are allocated.
>>
>> 00000000004007d0 <a>:
>>  4007d0:       0a 00
>>
>>
>> 0000000000401b10 <b>:
>>  401b10:       d0 07
>>  401b12:       40 00 00
>>
>> Variable ‘a’ is allocated at address 0x4007d0 which matches the output
>> when running the binary.  ‘b’ is allocated at address 0x401b10 and its
>> contents in little-endian byte order is the address of ‘a’.
>>
>> Now, lets us examine the contents of the PIE binary:
>>
>> 00000000000008d8 <a>:
>> 8d8:   0a 00
>>
>> 0000000000001c50 <b>:
>>    1c50:       d8 08
>>                     1c50: R_X86_64_RELATIVE *ABS*+0x8d8
>>    1c52:       00 00
>>    1c54:       00 00
>>
>>
>> Notice there is a dynamic relocation here which tells the dynamic
>> linker that this value needs to be fixed at run-time.  This is needed
>> because ASLR can load this binary anywhere in the address space and
>> this relocation fixes the address after it is loaded.
>>
>>
>> * More details about R_X86_64_RELATIVE relocations
>>
>> This relocation is worth 24 bytes  and has three fields
>>
>> Offset
>>
>> Type - here it is R_X86_64_RELATIVE
>>
>> Addend (what extra value needs to be added)
>>
>> The offset field of this relocation is the address offset from the
>> start where this relocation applies.  The type field indicates the
>> type of the dynamic relocation but we are interested in particularly
>> one type of dynamic relocation, R_X86_64_RELATIVE.   This is important
>> because in the motivating example that we presented above, all the
>> extra dynamic relocations were of this type!
>>
>>
>> * We have these proposals to reduce the size of the dynamic relocations section:
>>
>
> There are 3 pieces of run-time relocation information:
>
> 1. Type and symbol. 4 or 8 bytes
> 2. Offset. 4 or 8 bytes
> 3. Addend.  4 or 8 bytes
>
> If we use REL instead of RELA, addend can be implicit and stored in-place.
> If we limit the type to relative relocation, we only need offset.
> This is for PIC,
> not just for PIE. An we can use special encoding scheme for offset table,
> which can be placed in DT_GNU_RELATIVE_REL with
> DT_GNU_RELATIVE_RELSZ.

I have not done an intrusive change like this before, so I am
wondering what are the various tools/pieces that  need to be modified.
Pointers to how to go about this would be really helpful. I can think
of these:

* Linker  - gold, lld, gnuld
* Dynamic Linker
* readelf
* objdump
* ABI changes - what is involved here?

Thanks
Sri




>
> --
> H.J.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00 ` Cary Coutant
                     ` (2 preceding siblings ...)
  2017-01-01  0:00   ` Florian Weimer
@ 2017-01-01  0:00   ` Markus Trippelsdorf
  3 siblings, 0 replies; 29+ messages in thread
From: Markus Trippelsdorf @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Sriraman Tallam
  Cc: gnu-gabi, binutils, Xinliang David Li, Cary Coutant,
	Sterling Augustine, Paul Pluzhnikov, Ian Lance Taylor, H.J. Lu,
	Rahul Chaudhry, Luis Lozano, Rafael Espíndola,
	Peter Collingbourne, Rui Ueyama

On 2017.04.25 at 21:06 -0700, Cary Coutant wrote:
> > Idea A: Shrinking the size of the dynamic relocations
>
> In a RELATIVE relocation, the r_offset field is really the only field
> of interest. The symbol and addend are not used -- the dynamic loader
> simply adjusts the address at the given offset by the relocation
> factor for the load module. Thus, it truly is possible to reduce these
> relocations to just the one word.
>
> Simon Baldwin did a lot of work on this a few years ago for Chromium,
> writing a post-link utility to extract all the RELATIVE relocations
> and rewrite them to a new section.

FYI Mike Hommey did something similar for Firefox and wrote about here:
https://glandium.org/blog/?p=1177

--
Markus

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00 ` Cary Coutant
  2017-01-01  0:00   ` Sriraman Tallam
@ 2017-01-01  0:00   ` H.J. Lu
  2017-01-01  0:00   ` Florian Weimer
  2017-01-01  0:00   ` Markus Trippelsdorf
  3 siblings, 0 replies; 29+ messages in thread
From: H.J. Lu @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Cary Coutant
  Cc: Sriraman Tallam, gnu-gabi, binutils, Xinliang David Li,
	Sterling Augustine, Paul Pluzhnikov, Ian Lance Taylor,
	Rahul Chaudhry, Luis Lozano, Rafael Espíndola,
	Peter Collingbourne, Rui Ueyama

On Tue, Apr 25, 2017 at 9:06 PM, Cary Coutant <ccoutant@gmail.com> wrote:
>> Idea A: Shrinking the size of the dynamic relocations
>

> 4. Evaluate some alternate representations to improve upon the simple
> list of addresses. An idea I've been considering is emitting a bitmap
> -- an alternating series of pairs (starting address, bits), where
> "bits" is just a fixed-size bit vector describing each word starting
> at the given starting address. The "bits" field could be, say, 128 or
> 256 bits in length, or could begin with a byte that defines its
> length. I think this would work well, as the relocations are often
> clustered, but haven't yet done any testing of that hypothesis. A
> simple run-length encoding, as previously suggested, would also work,
> of course. I'm not in favor of using a general-purpose compression
> algorithm on the relocations.
>

On x86, the last 2/3 bits are always zero.  We can use it to encode the
next value.  Offsets are represented by

1. A base offset and followed by consecutive deltas.
2. Encode the size of next delta in the unused bits.

static size_t
encode_offsets (uintptr_t *original, char *encoded, size_t n)
{
  size_t i;
  int current_bytes, next_bytes;
  size_t encoded_size;
  uintptr_t offset, next, current;

  offset = original[0];
  if ((offset & (sizeof (uintptr_t) - 1)) != 0)
    abort ();
  next = original[1] - offset;
  if (next == 0)
    abort ();
  next_bytes = sizeof (uintptr_t) -  __builtin_clzl (next) / 8;
  if (next_bytes > sizeof (uintptr_t))
    abort ();
  /* Encode number of bytes of the next delta in the unused bits.  */
  offset |= next_bytes - 1;
  memcpy (encoded, &offset, sizeof (uintptr_t));
  encoded_size = sizeof (uintptr_t);
  encoded += encoded_size;

  for (i = 1; i < n; i++)
    {
      current = next;
      current_bytes = next_bytes;
      if ((i + 1) < n)
{
 offset = original[i];
 if ((offset & (sizeof (uintptr_t) - 1)) != 0)
   abort ();
 next = original[i + 1] - offset;
 if (next == 0)
   abort ();
 next_bytes = sizeof (uintptr_t) -  __builtin_clzl (next) / 8;
 if (next_bytes > sizeof (uintptr_t))
   abort ();
 /* Encode number of bytes of the next delta in the unused bits.  */
 current |= next_bytes - 1;
}
      /* FIXME: Only works for little endian.  */
      memcpy (encoded, &current, current_bytes);
      encoded_size += current_bytes;
      encoded += current_bytes;
    }

  return encoded_size;
}

On x86-64, I got

Total : 1471 offsets
Before: 11768 bytes
After : 1479 bytes


-- 
H.J.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00 ` Rafael Espíndola
@ 2017-01-01  0:00   ` Sriraman Tallam
  0 siblings, 0 replies; 29+ messages in thread
From: Sriraman Tallam @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Rafael Espíndola
  Cc: gnu-gabi, binutils, Xinliang David Li, Cary Coutant,
	Sterling Augustine, Paul Pluzhnikov, Ian Lance Taylor, H.J. Lu,
	Rahul Chaudhry, Luis Lozano, Peter Collingbourne, Rui Ueyama

On Thu, Apr 27, 2017 at 1:04 PM, Rafael Espíndola
<rafael.espindola@gmail.com> wrote:
> On 25 April 2017 at 13:12, Sriraman Tallam <tmsriram@google.com> wrote:
>> We identified a problem with PIE executables, more than 5% code size
>> bloat compared to non-PIE and we have a few proposals to reduce the
>> bloat.  Please take a look and let us know what you think.
>
> Just a bit of terminology, it is not code, it is a read only data.
> Why is the table size a problem? I can imagine a few reasons, but it
> would be nice to know which one you are trying to solve:
>
> * The file size itself is a problem for shipping the file.

Yes, this is a problem.  Certain binaries go in size sensitive
partitions and this is a blocker.

> * The startup time is a problem and a compact table makes the dynamic
> linker faster.

I think this is useful for Chrome if it improves it.  Looking at the
glandium blog seems to suggest it does.

> * The memory usage is a problem.

Yes, this is a problem too.

>
> For the last one one thing that could be done is have something like
> PT_GNU_RELRO but that tells the dynamic linker to unmmap the region
> completely once it is done with the relocations.
>
> In addition what is done for COFF, the other existing solution I know
> is https://wiki.mozilla.org/Elfhack. In the case of mozilla the issue
> was just the size of the binary being shipped I think.
>
> Cheers,
> Rafael

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00 ` Cary Coutant
@ 2017-01-01  0:00   ` Sriraman Tallam
  2017-01-01  0:00   ` H.J. Lu
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 29+ messages in thread
From: Sriraman Tallam @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Cary Coutant
  Cc: gnu-gabi, binutils, Xinliang David Li, Sterling Augustine,
	Paul Pluzhnikov, Ian Lance Taylor, H.J. Lu, Rahul Chaudhry,
	Luis Lozano, Rafael Espíndola, Peter Collingbourne,
	Rui Ueyama

On Tue, Apr 25, 2017 at 9:06 PM, Cary Coutant <ccoutant@gmail.com> wrote:
>> Idea A: Shrinking the size of the dynamic relocations
>
> In a RELATIVE relocation, the r_offset field is really the only field
> of interest. The symbol and addend are not used -- the dynamic loader
> simply adjusts the address at the given offset by the relocation
> factor for the load module. Thus, it truly is possible to reduce these
> relocations to just the one word.
>
> Simon Baldwin did a lot of work on this a few years ago for Chromium,
> writing a post-link utility to extract all the RELATIVE relocations
> and rewrite them to a new section.
>
> When Simon was working on his utility, I decided it wouldn't be too
> difficult to implement it directly in gold, and I toyed around with
> it, but never devoted the time to take all the steps it would take to
> bring this about. (Frankly, I was underwhelmed by the quoted savings,
> but it sounds like I may have underestimated the value of that
> savings. Sorry!)

Ok, here is how I understand the impact of these savings :
* In terms of text size, the size savings on most binaries is equal to
what ICF can offer and is on top of ICF.
* Some of our binaries are forced to disable PIE because they are
stored in space constrained partitions and this will make it possible
to enable PIE again for them.
* Chrome/ChromeOS binaries enable full ICF (not safe) and are willing
to take the debugging nuisance that comes along with it for the
savings, so I am assuming this would interest them.

>
> Anyway, I cleaned up what I had done a bit, and pushed it to a
> personal binutils branch, users/ccoutant/experimental-relr, for your
> consideration. Here's what I did:
>
> 1. Added a new section .relr.dyn, with a new type SHT_RELR, and
> sh_entsize = address size. The contents of this section are (for now)
> a simple list of addresses that require relative relocations.
>
> 2. Added new dynamic table entries, DT_RELR, DT_RELRSZ, and
> DT_RELRENT, which point to the new section.
>
> 3. Added a linker option, --experimental-use-relr, to enable the feature.
>
> 4. Modified the x86-64 target to emit new-style RELR relocations. I
> punted for 64-bit relocations on x32, and for IFUNC relocations --
> those will still be emitted as old-style relative relocations in
> .rela.dyn.
>
> For my experimentation, I picked the next-available values in the
> gABI-reserved range, rather than vendor-specific values, for no good
> reason. They're easy to renumber if we need to experiment with this in
> the wild before standardizing.

Thanks!, you have shown me a clear path forward.

>
> Still to do:
>
> 1. Modify the other targets in gold to support RELR relocations.
>
> 2. Add readelf/objdump support. (You can still use readelf as is --
> it'll just show the section type and dynamic table entries as numbers.
> And you can dump the new section with "readelf -x.relr.dyn".)
>
> 3. Add dynamic loader support.
>
> 4. Evaluate some alternate representations to improve upon the simple
> list of addresses. An idea I've been considering is emitting a bitmap
> -- an alternating series of pairs (starting address, bits), where
> "bits" is just a fixed-size bit vector describing each word starting
> at the given starting address. The "bits" field could be, say, 128 or
> 256 bits in length, or could begin with a byte that defines its
> length. I think this would work well, as the relocations are often
> clustered, but haven't yet done any testing of that hypothesis. A
> simple run-length encoding, as previously suggested, would also work,
> of course. I'm not in favor of using a general-purpose compression
> algorithm on the relocations.
>
> 5. Make a proposal to the gABI to add SHT_RELR and DT_RELR*.

Ok, I guess Step 4 can be done in a later iteration.


Thanks all for the inputs.
Sri

>
> -cary

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00   ` Florian Weimer
@ 2017-01-01  0:00     ` Alan Modra
  0 siblings, 0 replies; 29+ messages in thread
From: Alan Modra @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Cary Coutant, Sriraman Tallam, gnu-gabi, binutils,
	Xinliang David Li, Sterling Augustine, Paul Pluzhnikov,
	Ian Lance Taylor, H.J. Lu, Rahul Chaudhry, Luis Lozano,
	Rafael Espíndola, Peter Collingbourne, Rui Ueyama

On Fri, Apr 28, 2017 at 08:25:04AM +0200, Florian Weimer wrote:
> On 04/26/2017 06:06 AM, Cary Coutant wrote:
> >4. Evaluate some alternate representations to improve upon the simple
> >list of addresses. An idea I've been considering is emitting a bitmap
> >-- an alternating series of pairs (starting address, bits), where
> >"bits" is just a fixed-size bit vector describing each word starting
> >at the given starting address. The "bits" field could be, say, 128 or
> >256 bits in length, or could begin with a byte that defines its
> >length. I think this would work well, as the relocations are often
> >clustered, but haven't yet done any testing of that hypothesis. A
> >simple run-length encoding, as previously suggested, would also work,
> >of course. I'm not in favor of using a general-purpose compression
> >algorithm on the relocations.
> 
> It might also be possible to sort the relocated objects so that they are
> more contiguous.

That's true for GOT entry relocations, but difficult for anything not
linker generated.  However, the GOT doesn't tend to have too many
entries with relative relocs.  What's more, it is better to teach the
linker how to edit code using such entries from GOT indirect to GOT
pointer relative and thus eliminate the GOT entry entirely..

I'm in favour of a very simple compression scheme, perhaps just
using a new relocation to adjust an array of words by the base
address.  From looking at a few random binaries it seems that many
relative relocations apply to a run of consecutive words.  Of course,
using such a simple scheme spoils the fun of designing something
entirely new.  ;-)

> It could be beneficial to arrange things so that vector instructions can be
> used for the relocation.  I'm not sure if it is worthwhile to design a
> compression scheme based on bit fiddling for this because those who want
> compression can probably use a compressed file system anyway.
> 
> Thanks,
> Florian

-- 
Alan Modra
Australia Development Lab, IBM

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
       [not found]                             ` <CAORpzuPYsSBJtypm3NDcfcgRzos3WO4JjkvgiqpyBYBhoqLVFA@mail.gmail.com>
@ 2018-01-01  0:00                               ` Florian Weimer
  0 siblings, 0 replies; 29+ messages in thread
From: Florian Weimer @ 2018-01-01  0:00 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Rahul Chaudhry, Cary Coutant, Sriraman Tallam,
	Rahul Chaudhry via gnu-gabi, Suprateeka R Hegde, David Edelsohn,
	Rafael Avila de Espindola, Binutils Development, Alan Modra,
	Xinliang David Li, Sterling Augustine, Paul Pluzhnikov,
	Ian Lance Taylor, H.J. Lu, Luis Lozano, Peter Collingbourne,
	Rui Ueyama, llvm-dev

* Roland McGrath:

> Given this corpus of "reloc traces" you can code up many competing encoding
> formats and do serious measurements of their space savings across the
> entire corpus from simple simulations without having to implement each
> encoding in an actual toolchain and dynamic linker to do the analysis.

On the other hand, the linker currently assumes that the order in
which relative relocations are emitted does not matter.  With a
different encoding scheme, future linkers might reorder the
relocations or data objects themselves, either to reduce relocation
encoding size, or to make relocations more efficient (perhaps to
support vectorization).  Unfortunately, the numbers which can be
derived from ET_DYN files will not reflect to what extent such
reordering is possible.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00       ` Florian Weimer
@ 2017-01-01  0:00         ` Suprateeka R Hegde
  2017-01-01  0:00           ` Sriraman Tallam via gnu-gabi
  0 siblings, 1 reply; 29+ messages in thread
From: Suprateeka R Hegde @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Florian Weimer, David Edelsohn, Rafael Avila de Espindola
  Cc: Binutils Development, Alan Modra, Cary Coutant, Sriraman Tallam,
	gnu-gabi, Xinliang David Li, Sterling Augustine, Paul Pluzhnikov,
	Ian Lance Taylor, H.J. Lu, Rahul Chaudhry, Luis Lozano,
	Peter Collingbourne, Rui Ueyama

On 02-May-2017 12:05 AM, Florian Weimer wrote:
> On 05/01/2017 08:28 PM, Suprateeka R Hegde wrote:
>> So the ratio shows ~96% is RELATIVE reloc. And only ~4% others. This is
>> not the case on HP-UX/Itanium. But as I said, this comparison does not
>> make sense as the runtime architecture and ISA are totally different.
> 
> It could be that HP-UX was written in a way to reduce relative
> relocations,

Rather, the Itanium runtime architecture itself provides a way to reduce
them.

> or that the final executables aren't actually PIC anymore.

I was referring to shlibs (PIC) on HP-UX and it was implicit in my mind.
Sorry for that.

I just built a large C++ shlib both on HP-UX/Itanium with our aCC
compiler and Linux x86-64 using GCC-6.2. The sources are almost same
with only a couple of lines differing between platforms.

(HP-UX/Linux)
Total:    12224/38311
RELATIVE: 18/6397

I will try to check the reason for such a huge difference in RELATIVE
reloc count. It might be useful for this discussion (just a guess)

--
Supra

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00         ` Suprateeka R Hegde
@ 2017-01-01  0:00           ` Sriraman Tallam via gnu-gabi
  2017-01-01  0:00             ` Rahul Chaudhry via gnu-gabi
  0 siblings, 1 reply; 29+ messages in thread
From: Sriraman Tallam via gnu-gabi @ 2017-01-01  0:00 UTC (permalink / raw)
  To: hegdesmailbox
  Cc: Florian Weimer, David Edelsohn, Rafael Avila de Espindola,
	Binutils Development, Alan Modra, Cary Coutant, gnu-gabi,
	Xinliang David Li, Sterling Augustine, Paul Pluzhnikov,
	Ian Lance Taylor, H.J. Lu, Rahul Chaudhry, Luis Lozano,
	Peter Collingbourne, Rui Ueyama, llvm-dev

+llvm-dev

Discussion here: https://sourceware.org/ml/gnu-gabi/2017-q2/msg00000.html

On Tue, May 2, 2017 at 10:17 AM, Suprateeka R Hegde
<hegdesmailbox@gmail.com> wrote:
> On 02-May-2017 12:05 AM, Florian Weimer wrote:
>> On 05/01/2017 08:28 PM, Suprateeka R Hegde wrote:
>>> So the ratio shows ~96% is RELATIVE reloc. And only ~4% others. This is
>>> not the case on HP-UX/Itanium. But as I said, this comparison does not
>>> make sense as the runtime architecture and ISA are totally different.
>>
>> It could be that HP-UX was written in a way to reduce relative
>> relocations,
>
> Rather, the Itanium runtime architecture itself provides a way to reduce
> them.
>
>> or that the final executables aren't actually PIC anymore.
>
> I was referring to shlibs (PIC) on HP-UX and it was implicit in my mind.
> Sorry for that.
>
> I just built a large C++ shlib both on HP-UX/Itanium with our aCC
> compiler and Linux x86-64 using GCC-6.2. The sources are almost same
> with only a couple of lines differing between platforms.
>
> (HP-UX/Linux)
> Total:    12224/38311
> RELATIVE: 18/6397
>
> I will try to check the reason for such a huge difference in RELATIVE
> reloc count. It might be useful for this discussion (just a guess)
>
> --
> Supra

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
       [not found] <CAGWvnynFwXFGLj3tAVgDatn0zmuHcWHyRNuDvR+wRZCXLnar_A@mail.gmail.com>
@ 2017-01-01  0:00 ` Rafael Avila de Espindola
  2017-01-01  0:00   ` David Edelsohn
  0 siblings, 1 reply; 29+ messages in thread
From: Rafael Avila de Espindola @ 2017-01-01  0:00 UTC (permalink / raw)
  To: David Edelsohn, Binutils Development
  Cc: Alan Modra, Florian Weimer, Cary Coutant, Sriraman Tallam,
	gnu-gabi, Xinliang David Li, Sterling Augustine, Paul Pluzhnikov,
	Ian Lance Taylor, H.J. Lu, Rahul Chaudhry, Luis Lozano,
	Peter Collingbourne, Rui Ueyama

Where is it documented?

Thanks,
Rafael

David Edelsohn <dje.gcc@gmail.com> writes:

> AIX uses relative relocations and generates position independent
> executables by default.  The design of the AIX linker might provide
> some additional inspiration.
>
> Thanks, David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00     ` Suprateeka R Hegde
@ 2017-01-01  0:00       ` Florian Weimer
  2017-01-01  0:00         ` Suprateeka R Hegde
  0 siblings, 1 reply; 29+ messages in thread
From: Florian Weimer @ 2017-01-01  0:00 UTC (permalink / raw)
  To: hegdesmailbox, David Edelsohn, Rafael Avila de Espindola
  Cc: Binutils Development, Alan Modra, Cary Coutant, Sriraman Tallam,
	gnu-gabi, Xinliang David Li, Sterling Augustine, Paul Pluzhnikov,
	Ian Lance Taylor, H.J. Lu, Rahul Chaudhry, Luis Lozano,
	Peter Collingbourne, Rui Ueyama

On 05/01/2017 08:28 PM, Suprateeka R Hegde wrote:
> So the ratio shows ~96% is RELATIVE reloc. And only ~4% others. This is
> not the case on HP-UX/Itanium. But as I said, this comparison does not
> make sense as the runtime architecture and ISA are totally different.

It could be that HP-UX was written in a way to reduce relative 
relocations, or that the final executables aren't actually PIC anymore. 
The amount of C++ code with large vtables could differ as well.

Here's a GCC extension proposed which is relevant to producing fully 
relocatable code with fewer relative relocations:

   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38295

Florian

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00   ` David Edelsohn
@ 2017-01-01  0:00     ` Suprateeka R Hegde
  2017-01-01  0:00       ` Florian Weimer
  0 siblings, 1 reply; 29+ messages in thread
From: Suprateeka R Hegde @ 2017-01-01  0:00 UTC (permalink / raw)
  To: David Edelsohn, Rafael Avila de Espindola
  Cc: Binutils Development, Alan Modra, Florian Weimer, Cary Coutant,
	Sriraman Tallam, gnu-gabi, Xinliang David Li, Sterling Augustine,
	Paul Pluzhnikov, Ian Lance Taylor, H.J. Lu, Rahul Chaudhry,
	Luis Lozano, Peter Collingbourne, Rui Ueyama

Is the AIX available on x86-64? Otherwise it may not be much useful to
refer the linker design. And it is just not linker when it comes to
relocs, it indeed involves entire runtime architecture and ISA of the
platform.

HP-UX on Itanium also uses PIC by default and we have never faced any
such exponential difference between relative and non-relative relocs,
and its been almost 2 decades now.

I just did some numbers. On a Ubuntu 16.10, I just took the ratio of
RELATIVE vs non-RELATIVE on x86-64. I apologize if this number is
already discussed.

readelf -Wr /usr/bin/* 2>&1 | grep RELATIVE | wc -l
1238652

readelf -Wr /usr/bin/* 2>&1 | grep -v RELATIVE | wc -l
51551

So the ratio shows ~96% is RELATIVE reloc. And only ~4% others. This is
not the case on HP-UX/Itanium. But as I said, this comparison does not
make sense as the runtime architecture and ISA are totally different.

At this point, I would say the combination of original ideas proposed
and some work already done in this direction is the way to go. On HP-UX,
our tool chain greatly reduces dyn relocs with IPO kind of options too.

--
Supra


On 01-May-2017 07:43 PM, David Edelsohn wrote:
> GNU ld supports earlier versions of AIX XCOFF, so one can look at that code.
> 
> The current documentation for XCOFF is
> 
> https://www.ibm.com/support/knowledgecenter/ssw_aix_72/com.ibm.aix.files/XCOFF.htm
> 
> - David
> 
> 
> On Mon, May 1, 2017 at 9:31 AM, Rafael Avila de Espindola
> <rafael.espindola@gmail.com> wrote:
>> Where is it documented?
>>
>> Thanks,
>> Rafael
>>
>> David Edelsohn <dje.gcc@gmail.com> writes:
>>
>>> AIX uses relative relocations and generates position independent
>>> executables by default.  The design of the AIX linker might provide
>>> some additional inspiration.
>>>
>>> Thanks, David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00 ` Rafael Avila de Espindola
@ 2017-01-01  0:00   ` David Edelsohn
  2017-01-01  0:00     ` Suprateeka R Hegde
  0 siblings, 1 reply; 29+ messages in thread
From: David Edelsohn @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Rafael Avila de Espindola
  Cc: Binutils Development, Alan Modra, Florian Weimer, Cary Coutant,
	Sriraman Tallam, gnu-gabi, Xinliang David Li, Sterling Augustine,
	Paul Pluzhnikov, Ian Lance Taylor, H.J. Lu, Rahul Chaudhry,
	Luis Lozano, Peter Collingbourne, Rui Ueyama

GNU ld supports earlier versions of AIX XCOFF, so one can look at that code.

The current documentation for XCOFF is

https://www.ibm.com/support/knowledgecenter/ssw_aix_72/com.ibm.aix.files/XCOFF.htm

- David


On Mon, May 1, 2017 at 9:31 AM, Rafael Avila de Espindola
<rafael.espindola@gmail.com> wrote:
> Where is it documented?
>
> Thanks,
> Rafael
>
> David Edelsohn <dje.gcc@gmail.com> writes:
>
>> AIX uses relative relocations and generates position independent
>> executables by default.  The design of the AIX linker might provide
>> some additional inspiration.
>>
>> Thanks, David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00             ` Rahul Chaudhry via gnu-gabi
  2017-01-01  0:00               ` Cary Coutant
  2017-01-01  0:00               ` Ian Lance Taylor via gnu-gabi
@ 2017-01-01  0:00               ` Florian Weimer
  2017-01-01  0:00                 ` Sriraman Tallam via gnu-gabi
  2 siblings, 1 reply; 29+ messages in thread
From: Florian Weimer @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Rahul Chaudhry via gnu-gabi
  Cc: Sriraman Tallam, Rahul Chaudhry, hegdesmailbox, Florian Weimer,
	David Edelsohn, Rafael Avila de Espindola, Binutils Development,
	Alan Modra, Cary Coutant, Xinliang David Li, Sterling Augustine,
	Paul Pluzhnikov, Ian Lance Taylor, H.J. Lu, Luis Lozano,
	Peter Collingbourne, Rui Ueyama, llvm-dev

* Rahul Chaudhry via gnu-gabi:

> The encoding used is a simple combination of delta-encoding and a
> bitmap of offsets. The section consists of 64-bit entries: higher
> 8-bits contain delta since last offset, and lower 56-bits contain a
> bitmap for which words to apply the relocation to. This is best
> described by showing the code for decoding the section:
>
> typedef struct
> {
>   Elf64_Xword  r_data;  /* jump and bitmap for relative relocations */
> } Elf64_Relrz;
>
> #define ELF64_R_JUMP(val)    ((val) >> 56)
> #define ELF64_R_BITS(val)    ((val) & 0xffffffffffffff)
>
> #ifdef DO_RELRZ
>   {
>     ElfW(Addr) offset = 0;
>     for (; relative < end; ++relative)
>       {
>         ElfW(Addr) jump = ELFW(R_JUMP) (relative->r_data);
>         ElfW(Addr) bits = ELFW(R_BITS) (relative->r_data);
>         offset += jump * sizeof(ElfW(Addr));
>         if (jump == 0)
>           {
>             ++relative;
>             offset = relative->r_data;
>           }
>         ElfW(Addr) r_offset = offset;
>         for (; bits != 0; bits >>= 1)
>           {
>             if ((bits&1) != 0)
>               elf_machine_relrz_relative (l_addr, (void *) (l_addr + r_offset));
>             r_offset += sizeof(ElfW(Addr));
>           }
>       }
>   }
> #endif

That data-dependent “if ((bits&1) != 0)” branch looks a bit nasty.

Have you investigated whether some sort of RLE-style encoding would be
beneficial? If there are blocks of relative relocations, it might even
be possible to use vector instructions to process them (although more
than four relocations at a time are probably not achievable in a
power-efficient manner on current x86-64).

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00           ` Sriraman Tallam via gnu-gabi
@ 2017-01-01  0:00             ` Rahul Chaudhry via gnu-gabi
  2017-01-01  0:00               ` Cary Coutant
                                 ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Rahul Chaudhry via gnu-gabi @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Sriraman Tallam
  Cc: hegdesmailbox, Florian Weimer, David Edelsohn,
	Rafael Avila de Espindola, Binutils Development, Alan Modra,
	Cary Coutant, gnu-gabi, Xinliang David Li, Sterling Augustine,
	Paul Pluzhnikov, Ian Lance Taylor, H.J. Lu, Luis Lozano,
	Peter Collingbourne, Rui Ueyama, llvm-dev

Sri and I have been working on this over the past few months, and we've made
some good progress that we'd like to share and get feedback on.

Our work is based on the 'experimental-relr' prototype from Cary that is
available at 'users/ccoutant/experimental-relr' branch in the binutils
repository [1], and was described earlier in this thread:
https://sourceware.org/ml/gnu-gabi/2017-q2/msg00003.html

We've taken the '.relr.dyn' section from Cary's prototype, and implemented a
custom encoding to compactly represent the list of offsets. We're calling the
new compressed section '.relrz.dyn' (for relocations-relative-compressed).

The encoding used is a simple combination of delta-encoding and a bitmap of
offsets. The section consists of 64-bit entries: higher 8-bits contain delta
since last offset, and lower 56-bits contain a bitmap for which words to apply
the relocation to. This is best described by showing the code for decoding the
section:

typedef struct
{
  Elf64_Xword  r_data;  /* jump and bitmap for relative relocations */
} Elf64_Relrz;

#define ELF64_R_JUMP(val)    ((val) >> 56)
#define ELF64_R_BITS(val)    ((val) & 0xffffffffffffff)

#ifdef DO_RELRZ
  {
    ElfW(Addr) offset = 0;
    for (; relative < end; ++relative)
      {
        ElfW(Addr) jump = ELFW(R_JUMP) (relative->r_data);
        ElfW(Addr) bits = ELFW(R_BITS) (relative->r_data);
        offset += jump * sizeof(ElfW(Addr));
        if (jump == 0)
          {
            ++relative;
            offset = relative->r_data;
          }
        ElfW(Addr) r_offset = offset;
        for (; bits != 0; bits >>= 1)
          {
            if ((bits&1) != 0)
              elf_machine_relrz_relative (l_addr, (void *) (l_addr + r_offset));
            r_offset += sizeof(ElfW(Addr));
          }
      }
  }
#endif

Note that the 8-bit 'jump' encodes the number of _words_ since last offset. The
case where jump would not fit in 8-bits is handled by setting jump to 0, and
emitting the full offset for the next relocation in the subsequent entry.

The above code is the entirety of the implementation for decoding and
processing '.relrz.dyn' sections in glibc dynamic loader.

This encoding can represent up to 56 relocation offsets in a single 64-bit
word. For many of the binaries we tested, this encoding provides >40x
compression for storing offsets over the original `.relr.dyn` section.

For 32-bit targets, we use 32-bit entries: 8-bits for 'jump' and 24-bits for
the bitmap.


Here are three real world examples that demonstrate the savings:

1) Chrome browser (x86_64, built as PIE):
   File size (stripped): 152265064 bytes (145.21MB)
   605159 relocation entries (24 bytes each) in '.rela.dyn'
   594542 are R_X86_64_RELATIVE relocations (98.25%)
       14269008 bytes (13.61MB) in use in '.rela.dyn' section
         109256 bytes  (0.10MB) if moved to '.relrz.dyn' section

   Savings: 14159752 bytes, or 9.29% of original file size.


2) Go net/http test binary (x86_64, 'go test -buildmode=pie -c net/http')
   File size (stripped): 10238168 bytes (9.76MB)
   83810 relocation entries (24 bytes each) in '.rela.dyn'
   83804 are R_X86_64_RELATIVE relocations (99.99%)
       2011296 bytes (1.92MB) in use in .rela.dyn section
         43744 bytes (0.04MB) if moved to .relrz.dyn section

   Savings: 1967552 bytes, or 19.21% of original file size.


3) Vim binary in /usr/bin on my workstation (Ubuntu, x86_64)
   File size (stripped): 3030032 bytes (2.89MB)
   6680 relocation entries (24 bytes each) in '.rela.dyn'
   6272 are R_X86_64_RELATIVE relocations (93.89%)
       150528 bytes (0.14MB) in use in .rela.dyn section
         1992 bytes (0.00MB) if moved to .relrz.dyn section

   Savings: 148536 bytes, or 4.90% of original file size.

Recent releases of Debian, Ubuntu, and several other distributions build
executables as PIE by default. Suprateeka posted some statistics earlier in
this thread on the prevalence of relative relocations in executables residing
in /usr/bin: https://sourceware.org/ml/gnu-gabi/2017-q2/msg00013.html

The third example above shows that using '.relrz.dyn' sections to encode
relative relocations can bring decent savings to executable sizes in /usr/bin
across many distributions.

We have working ld.gold and ld.so implementations for arm, aarch64, and x86_64,
and would be happy to send patches to the binutils and glibc communities for
review.

However, before that can happen, we need agreement on the ABI side for the new
section type and the encoding. We haven't worked on a change of this magnitude
before that touches so many different pieces from the linker, elf tools, and
the dynamic loader. Specifically, we need agreement and/or guidance on where
and how should the new section type and its encoding be documented. We're
proposing adding new defines for SHT_RELRZ, DT_RELRZ, DT_RELRZSZ, DT_RELRZENT,
and DT_RELRZCOUNT that all the different parts of the toolchains can agree on.

Thanks,
Rahul

[1]: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=shortlog;h=refs/heads/users/ccoutant/experimental-relr



On Mon, May 8, 2017 at 1:55 PM, Sriraman Tallam <tmsriram@google.com> wrote:
> +llvm-dev
>
> Discussion here: https://sourceware.org/ml/gnu-gabi/2017-q2/msg00000.html
>
> On Tue, May 2, 2017 at 10:17 AM, Suprateeka R Hegde
> <hegdesmailbox@gmail.com> wrote:
>> On 02-May-2017 12:05 AM, Florian Weimer wrote:
>>> On 05/01/2017 08:28 PM, Suprateeka R Hegde wrote:
>>>> So the ratio shows ~96% is RELATIVE reloc. And only ~4% others. This is
>>>> not the case on HP-UX/Itanium. But as I said, this comparison does not
>>>> make sense as the runtime architecture and ISA are totally different.
>>>
>>> It could be that HP-UX was written in a way to reduce relative
>>> relocations,
>>
>> Rather, the Itanium runtime architecture itself provides a way to reduce
>> them.
>>
>>> or that the final executables aren't actually PIC anymore.
>>
>> I was referring to shlibs (PIC) on HP-UX and it was implicit in my mind.
>> Sorry for that.
>>
>> I just built a large C++ shlib both on HP-UX/Itanium with our aCC
>> compiler and Linux x86-64 using GCC-6.2. The sources are almost same
>> with only a couple of lines differing between platforms.
>>
>> (HP-UX/Linux)
>> Total:    12224/38311
>> RELATIVE: 18/6397
>>
>> I will try to check the reason for such a huge difference in RELATIVE
>> reloc count. It might be useful for this discussion (just a guess)
>>
>> --
>> Supra

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00               ` Florian Weimer
@ 2017-01-01  0:00                 ` Sriraman Tallam via gnu-gabi
  2017-01-01  0:00                   ` Rahul Chaudhry via gnu-gabi
  0 siblings, 1 reply; 29+ messages in thread
From: Sriraman Tallam via gnu-gabi @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Rahul Chaudhry via gnu-gabi, Rahul Chaudhry, Suprateeka R Hegde,
	Florian Weimer, David Edelsohn, Rafael Avila de Espindola,
	Binutils Development, Alan Modra, Cary Coutant,
	Xinliang David Li, Sterling Augustine, Paul Pluzhnikov,
	Ian Lance Taylor, H.J. Lu, Luis Lozano, Peter Collingbourne,
	Rui Ueyama, llvm-dev

On Sat, Dec 9, 2017 at 3:06 PM, Florian Weimer <fw@deneb.enyo.de> wrote:
> * Rahul Chaudhry via gnu-gabi:
>
>> The encoding used is a simple combination of delta-encoding and a
>> bitmap of offsets. The section consists of 64-bit entries: higher
>> 8-bits contain delta since last offset, and lower 56-bits contain a
>> bitmap for which words to apply the relocation to. This is best
>> described by showing the code for decoding the section:
>>
>> typedef struct
>> {
>>   Elf64_Xword  r_data;  /* jump and bitmap for relative relocations */
>> } Elf64_Relrz;
>>
>> #define ELF64_R_JUMP(val)    ((val) >> 56)
>> #define ELF64_R_BITS(val)    ((val) & 0xffffffffffffff)
>>
>> #ifdef DO_RELRZ
>>   {
>>     ElfW(Addr) offset = 0;
>>     for (; relative < end; ++relative)
>>       {
>>         ElfW(Addr) jump = ELFW(R_JUMP) (relative->r_data);
>>         ElfW(Addr) bits = ELFW(R_BITS) (relative->r_data);
>>         offset += jump * sizeof(ElfW(Addr));
>>         if (jump == 0)
>>           {
>>             ++relative;
>>             offset = relative->r_data;
>>           }
>>         ElfW(Addr) r_offset = offset;
>>         for (; bits != 0; bits >>= 1)
>>           {
>>             if ((bits&1) != 0)
>>               elf_machine_relrz_relative (l_addr, (void *) (l_addr + r_offset));
>>             r_offset += sizeof(ElfW(Addr));
>>           }
>>       }
>>   }
>> #endif
>
> That data-dependent “if ((bits&1) != 0)” branch looks a bit nasty.
>
> Have you investigated whether some sort of RLE-style encoding would be
> beneficial? If there are blocks of relative relocations, it might even
> be possible to use vector instructions to process them (although more
> than four relocations at a time are probably not achievable in a
> power-efficient manner on current x86-64).

Yes, we originally investigated RLE style encoding but I guess the key
insight which led us towards the proposed encoding is the following.
The offset addresses which contain the relocations are close but not
necessarily contiguous.  We experimented with an encoding strategy
where we would store the initial offset and the number of words after
that which contained dynamic relocations.  This gave us good
compression numbers but the proposed scheme was way better.  I will
let Rahul say more as he did quite a bit of experiments with different
strategies.

Thanks
Sri

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00             ` Rahul Chaudhry via gnu-gabi
  2017-01-01  0:00               ` Cary Coutant
@ 2017-01-01  0:00               ` Ian Lance Taylor via gnu-gabi
  2017-01-01  0:00               ` Florian Weimer
  2 siblings, 0 replies; 29+ messages in thread
From: Ian Lance Taylor via gnu-gabi @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Rahul Chaudhry
  Cc: Sriraman Tallam, hegdesmailbox, Florian Weimer, David Edelsohn,
	Rafael Avila de Espindola, Binutils Development, Alan Modra,
	Cary Coutant, gnu-gabi, Xinliang David Li, Sterling Augustine,
	Paul Pluzhnikov, H.J. Lu, Luis Lozano, Peter Collingbourne,
	Rui Ueyama, llvm-dev

On Thu, Dec 7, 2017 at 2:51 PM, Rahul Chaudhry <rahulchaudhry@google.com> wrote:
>
> However, before that can happen, we need agreement on the ABI side for the new
> section type and the encoding. We haven't worked on a change of this magnitude
> before that touches so many different pieces from the linker, elf tools, and
> the dynamic loader. Specifically, we need agreement and/or guidance on where
> and how should the new section type and its encoding be documented. We're
> proposing adding new defines for SHT_RELRZ, DT_RELRZ, DT_RELRZSZ, DT_RELRZENT,
> and DT_RELRZCOUNT that all the different parts of the toolchains can agree on.

Sounds like good work.

The place to hold a discussion on ELF ABI issues is
generic-abi@googlegroups.com and gnu-gabi@sourceware.org.  Inasmuch as
there is an official ELF ABI any more now that SCO has gone under, it
is maintained on the generic-abi list.

Ian

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00                         ` Cary Coutant
@ 2017-01-01  0:00                           ` Rahul Chaudhry via gnu-gabi
       [not found]                             ` <CAORpzuPYsSBJtypm3NDcfcgRzos3WO4JjkvgiqpyBYBhoqLVFA@mail.gmail.com>
  0 siblings, 1 reply; 29+ messages in thread
From: Rahul Chaudhry via gnu-gabi @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Cary Coutant
  Cc: Roland McGrath, Sriraman Tallam, Florian Weimer,
	Rahul Chaudhry via gnu-gabi, Suprateeka R Hegde, Florian Weimer,
	David Edelsohn, Rafael Avila de Espindola, Binutils Development,
	Alan Modra, Xinliang David Li, Sterling Augustine,
	Paul Pluzhnikov, Ian Lance Taylor, H.J. Lu, Luis Lozano,
	Peter Collingbourne, Rui Ueyama, llvm-dev

On Thu, Dec 14, 2017 at 12:11 AM, Cary Coutant <ccoutant@gmail.com> wrote:
>> While adding a 'stride' field is definitely an improvement over simple
>> delta+count encoding, it doesn't compare well against the bitmap based
>> encoding.
>>
>> I took a look inside the encoding for the Vim binary. There are some instances
>> in the bitmap based encoding like
>>   [0x3855555555555555 0x3855555555555555 0x3855555555555555 ...]
>> that encode sequences of relocations applying to alternate words. The stride
>> based encoding works very well on these and turns it into much more compact
>>   [0x0ff010ff 0x0ff010ff 0x0ff010ff ...]
>> using stride==0x10 and count==0xff.
>
> Have you looked much at where the RELATIVE relocations are coming from?
>
> I've looked at a PIE build of gold, and they're almost all for
> vtables, which mostly have consecutive entries with 8-byte strides.
> There are a few for the GOT, a few for static constructors (in
> .init_array), and a few for other initialized data, but vtables seem
> to account for the vast majority. (Gold has almost 19,000 RELATIVE
> dynamic relocs, and only about 500 non-RELATIVE dynamic relocs.)
>
> Where do the 16-byte strides come from? Vim is plain C, right? I'm
> guessing its RELATIVE relocation count is fairly low compared to big
> C++ apps. I'm also guessing that the pattern comes from some large
> structure or structures in the source code where initialized pointers
> alternate with non-pointer values. I'm also curious about Roland's
> app.

I took a look inside vim for the source of the ..5555.. pattern (relative
relocations applying to alternate words). One of the sources is the
"builtin_termcaps" symbol, which is an array of "struct builtin_term":

  struct builtin_term
  {
    int   bt_entry;
    char  *bt_string;
  };

So the pattern makes sense. An encoding using strides will work really well
here with stride == 0x10.

There is another repeating pattern I noticed in vim ..9999... One of the
sources behind this pattern is the "cmdnames" symbol, which is an array of
"struct cmdname":

  struct cmdname
  {
    char_u      *cmd_name;      /* name of the command */
    ex_func_T   cmd_func;       /* function for this command */
    long_u      cmd_argt;       /* flags declared above */
    int         cmd_addr_type;  /* flag for address type */
  };

In this struct, the first two fields are pointers, and the next two are
scalars. This explains the ..9999.. pattern for relative relocations. This is
an example where a stride based encoding does not work well, simply because
there is no single stride. The deltas are 8,24,8,24,8,24,...

I think these two examples demonstrate the main weakness of using a simple
stride based encoding: it is too sensitive to how the data structures are laid
out in the program source.

Rahul

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
       [not found]                     ` <CAORpzuMftCGpXUObOyoFY0=jorMBDWEDbQJ23DifTNW3v-WA6Q@mail.gmail.com>
@ 2017-01-01  0:00                       ` Rahul Chaudhry via gnu-gabi
  2017-01-01  0:00                         ` Cary Coutant
  0 siblings, 1 reply; 29+ messages in thread
From: Rahul Chaudhry via gnu-gabi @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Sriraman Tallam, Florian Weimer, Rahul Chaudhry via gnu-gabi,
	Suprateeka R Hegde, Florian Weimer, David Edelsohn,
	Rafael Avila de Espindola, Binutils Development, Alan Modra,
	Cary Coutant, Xinliang David Li, Sterling Augustine,
	Paul Pluzhnikov, Ian Lance Taylor, H.J. Lu, Luis Lozano,
	Peter Collingbourne, Rui Ueyama, llvm-dev

On Mon, Dec 11, 2017 at 6:14 PM, Roland McGrath <roland@hack.frob.com> wrote:
>
> On Mon, Dec 11, 2017 at 3:50 PM Rahul Chaudhry via gnu-gabi <gnu-gabi@sourceware.org> wrote:
>>
>> A simple combination of delta-encoding and run_length-encoding is one of the
>> first schemes we experimented with (32-bit entries with 24-bit 'delta' and an
>> 8-bit 'count'). This gave really good results, but as Sri mentions, we observed
>> several cases where the relative relocations were not on consecutive offsets.
>> There were common cases where the relocations applied to alternate words, and
>> that totally wrecked the scheme (a bunch of entries with delta==16 and
>> count==1).
>
>
> For the same issue in a different context, I recently implemented a scheme using run-length-encoding but using a variable stride.  So for a run of alternate words, you still get a single entry, but with stride 16 instead of 8.  In my application, most cases of strides > 8 are a run of only 2 or 3 but there are a few cases of dozens or hundreds with a stride of 16.  My case is a solution tailored to exactly one application (a kernel), so there is a closed sample set that's all that matters and the trade-off between simplicity of the analysis and compactness of the results is different than the general case you're addressing (my "analysis" consists of a few lines of AWK).  But I wonder if it might be worthwhile to study the effect a variable-stride RLE scheme or adding the variable-stride ability into your hybrid scheme has on your sample applications.
>
> Since we're talking about specifying a new ABI that will be serving us for many years to come and will be hard to change once deployed, it seems worth spending quite a bit of effort up front to come to the most compact scheme that's feasible.

I agree. Can you share more details of the encoding scheme that you found
useful (size of each entry, number of bits used for stride/count etc.)?

I just ran some experiments with an encoding with 32-bit entries: 16-bits for
delta, 8-bits for stride, and 8-bits for count. Here are the numbers, inlined
with those from the previous schemes for comparison:


1. Chrome browser (x86_64, built as PIE):
   605159 relocation entries (24 bytes each) in '.rela.dyn'
   594542 are R_X86_64_RELATIVE relocations (98.25%)
       14269008 bytes (13.61MB) in use in '.rela.dyn' section
         385420 bytes (0.37MB) using delta+count encoding
         232540 bytes (0.22MB) using delta+stride+count encoding
         109256 bytes  (0.10MB) using jump+bitmap encoding


2. Go net/http test binary (x86_64, 'go test -buildmode=pie -c net/http')
   83810 relocation entries (24 bytes each) in '.rela.dyn'
   83804 are R_X86_64_RELATIVE relocations (99.99%)
       2011296 bytes (1.92MB) in use in .rela.dyn section
        204476 bytes (0.20MB) using delta+count encoding
        132568 bytes (0.13MB) using delta+stride+count encoding
         43744 bytes (0.04MB) using jump+bitmap encoding


3. Vim binary in /usr/bin on my workstation (Ubuntu, x86_64)
   6680 relocation entries (24 bytes each) in '.rela.dyn'
   6272 are R_X86_64_RELATIVE relocations (93.89%)
       150528 bytes (0.14MB) in use in .rela.dyn section
        14388 bytes (0.01MB) using delta+count encoding
         7000 bytes (0.01MB) using delta+stride+count encoding
         1992 bytes (0.00MB) using jump+bitmap encoding


delta+count encoding is using 32-bit entries:
  24-bit delta: number of bytes since last offset.
   8-bit count: number of relocations to apply (consecutive words).

delta+stride+count encoding is using 32-bit entries:
  16-bit delta: number of bytes since last offset.
   8-bit stride: stride (in bytes) for applying 'count' relocations.
   8-bit count: number of relocations to apply (using 'stride').

jump+bitmap encoding is using 64-bit entries:
   8-bit jump: number of words since last offset.
  56-bit bitmap: bitmap for which words to apply relocations to.


While adding a 'stride' field is definitely an improvement over simple
delta+count encoding, it doesn't compare well against the bitmap based
encoding.

I took a look inside the encoding for the Vim binary. There are some instances
in the bitmap based encoding like
  [0x3855555555555555 0x3855555555555555 0x3855555555555555 ...]
that encode sequences of relocations applying to alternate words. The stride
based encoding works very well on these and turns it into much more compact
  [0x0ff010ff 0x0ff010ff 0x0ff010ff ...]
using stride==0x10 and count==0xff.

However, for the vast majority of cases, the stride based encoding ends up with
count <= 2, and that kills it in the end.

I could try something more complex with 16-bit entries, but that can only give
2x improvement at best, so it still won't be better than the bitmap approach.

Thanks,
Rahul


> --
>
>
> Thanks,
> Roland

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00                       ` Rahul Chaudhry via gnu-gabi
@ 2017-01-01  0:00                         ` Cary Coutant
  2017-01-01  0:00                           ` Rahul Chaudhry via gnu-gabi
  0 siblings, 1 reply; 29+ messages in thread
From: Cary Coutant @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Rahul Chaudhry
  Cc: Roland McGrath, Sriraman Tallam, Florian Weimer,
	Rahul Chaudhry via gnu-gabi, Suprateeka R Hegde, Florian Weimer,
	David Edelsohn, Rafael Avila de Espindola, Binutils Development,
	Alan Modra, Xinliang David Li, Sterling Augustine,
	Paul Pluzhnikov, Ian Lance Taylor, H.J. Lu, Luis Lozano,
	Peter Collingbourne, Rui Ueyama, llvm-dev

> While adding a 'stride' field is definitely an improvement over simple
> delta+count encoding, it doesn't compare well against the bitmap based
> encoding.
>
> I took a look inside the encoding for the Vim binary. There are some instances
> in the bitmap based encoding like
>   [0x3855555555555555 0x3855555555555555 0x3855555555555555 ...]
> that encode sequences of relocations applying to alternate words. The stride
> based encoding works very well on these and turns it into much more compact
>   [0x0ff010ff 0x0ff010ff 0x0ff010ff ...]
> using stride==0x10 and count==0xff.

Have you looked much at where the RELATIVE relocations are coming from?

I've looked at a PIE build of gold, and they're almost all for
vtables, which mostly have consecutive entries with 8-byte strides.
There are a few for the GOT, a few for static constructors (in
.init_array), and a few for other initialized data, but vtables seem
to account for the vast majority. (Gold has almost 19,000 RELATIVE
dynamic relocs, and only about 500 non-RELATIVE dynamic relocs.)

Where do the 16-byte strides come from? Vim is plain C, right? I'm
guessing its RELATIVE relocation count is fairly low compared to big
C++ apps. I'm also guessing that the pattern comes from some large
structure or structures in the source code where initialized pointers
alternate with non-pointer values. I'm also curious about Roland's
app.

In my opinion, the data and my intuition both support your choice of a
jump + bit vector representation.

-cary

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00               ` Cary Coutant
@ 2017-01-01  0:00                 ` Rahul Chaudhry via gnu-gabi
  2017-01-01  0:00                   ` Rahul Chaudhry via gnu-gabi
  0 siblings, 1 reply; 29+ messages in thread
From: Rahul Chaudhry via gnu-gabi @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Cary Coutant
  Cc: Sriraman Tallam, Suprateeka R Hegde, Florian Weimer,
	David Edelsohn, Rafael Avila de Espindola, Binutils Development,
	Alan Modra, Rahul Chaudhry via gnu-gabi, Xinliang David Li,
	Sterling Augustine, Paul Pluzhnikov, Ian Lance Taylor, H.J. Lu,
	Luis Lozano, Peter Collingbourne, Rui Ueyama, llvm-dev

Thanks for your encouraging words, Ian and Cary.

We're drafting a more detailed proposal and would post it on the generic-abi
list this week. I'll also post a link here for cross-reference.

Based on Cary's suggestion here, we're renaming '.relrz.dyn' to
'.relr.dyn' in the
proposal.

Rahul


On Fri, Dec 8, 2017 at 10:36 PM, Cary Coutant <ccoutant@gmail.com> wrote:
>> We've taken the '.relr.dyn' section from Cary's prototype, and implemented a
>> custom encoding to compactly represent the list of offsets. We're calling the
>> new compressed section '.relrz.dyn' (for relocations-relative-compressed).
>
> I'd suggest just using .relr.dyn -- your encoding is straightforward
> enough that I'd just make that the standard representation for this
> section type.
>
>> The encoding used is a simple combination of delta-encoding and a bitmap of
>> offsets. The section consists of 64-bit entries: higher 8-bits contain delta
>> since last offset, and lower 56-bits contain a bitmap for which words to apply
>> the relocation to. This is best described by showing the code for decoding the
>> section:
>>
>> ...
>>
>> The above code is the entirety of the implementation for decoding and
>> processing '.relrz.dyn' sections in glibc dynamic loader.
>>
>> This encoding can represent up to 56 relocation offsets in a single 64-bit
>> word. For many of the binaries we tested, this encoding provides >40x
>> compression for storing offsets over the original `.relr.dyn` section.
>>
>> For 32-bit targets, we use 32-bit entries: 8-bits for 'jump' and 24-bits for
>> the bitmap.
>
> Very nice! Simple and effective.
>
>> Here are three real world examples that demonstrate the savings:
>
> Impressive numbers. I've gotta admit, the savings are better than I expected.
>
>> However, before that can happen, we need agreement on the ABI side for the new
>> section type and the encoding. We haven't worked on a change of this magnitude
>> before that touches so many different pieces from the linker, elf tools, and
>> the dynamic loader. Specifically, we need agreement and/or guidance on where
>> and how should the new section type and its encoding be documented. We're
>> proposing adding new defines for SHT_RELRZ, DT_RELRZ, DT_RELRZSZ, DT_RELRZENT,
>> and DT_RELRZCOUNT that all the different parts of the toolchains can agree on.
>
> Yes, as Ian mentioned, the generic ABI discussion is at
> generic-abi@googlegroups.com. Most people who would be interested are
> already on the gnu-gabi@sourceware.org list, but there are a few who
> are not, and who may not yet have seen this discussion. I'll support
> the proposal.
>
> Thanks for taking this idea the extra mile!
>
> -cary

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00                 ` Sriraman Tallam via gnu-gabi
@ 2017-01-01  0:00                   ` Rahul Chaudhry via gnu-gabi
       [not found]                     ` <CAORpzuMftCGpXUObOyoFY0=jorMBDWEDbQJ23DifTNW3v-WA6Q@mail.gmail.com>
  0 siblings, 1 reply; 29+ messages in thread
From: Rahul Chaudhry via gnu-gabi @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Sriraman Tallam
  Cc: Florian Weimer, Rahul Chaudhry via gnu-gabi, Suprateeka R Hegde,
	Florian Weimer, David Edelsohn, Rafael Avila de Espindola,
	Binutils Development, Alan Modra, Cary Coutant,
	Xinliang David Li, Sterling Augustine, Paul Pluzhnikov,
	Ian Lance Taylor, H.J. Lu, Luis Lozano, Peter Collingbourne,
	Rui Ueyama, llvm-dev

A simple combination of delta-encoding and run_length-encoding is one of the
first schemes we experimented with (32-bit entries with 24-bit 'delta' and an
8-bit 'count'). This gave really good results, but as Sri mentions, we observed
several cases where the relative relocations were not on consecutive offsets.
There were common cases where the relocations applied to alternate words, and
that totally wrecked the scheme (a bunch of entries with delta==16 and
count==1).

I dug up some numbers on how that scheme compared with the current proposal on
the three examples I posted before:

delta+run_length encoding is using 32-bit entries (24-bit delta, 8-bit count).
delta+bitmap encoding is using 64-bit entries (8-bit delta, 56-bit bitmap).

1. Chrome browser (x86_64, built as PIE):
   605159 relocation entries (24 bytes each) in '.rela.dyn'
   594542 are R_X86_64_RELATIVE relocations (98.25%)
       14269008 bytes (13.61MB) in use in '.rela.dyn' section
         385420 bytes (0.37MB) using delta+run_length encoding
         109256 bytes  (0.10MB) using delta+bitmap encoding


2. Go net/http test binary (x86_64, 'go test -buildmode=pie -c net/http')
   83810 relocation entries (24 bytes each) in '.rela.dyn'
   83804 are R_X86_64_RELATIVE relocations (99.99%)
       2011296 bytes (1.92MB) in use in .rela.dyn section
        204476 bytes (0.20MB) using delta+run_length encoding
         43744 bytes (0.04MB) using delta+bitmap encoding


3. Vim binary in /usr/bin on my workstation (Ubuntu, x86_64)
   6680 relocation entries (24 bytes each) in '.rela.dyn'
   6272 are R_X86_64_RELATIVE relocations (93.89%)
       150528 bytes (0.14MB) in use in .rela.dyn section
        14388 bytes (0.01MB) using delta+run_length encoding
         1992 bytes (0.00MB) using delta+bitmap encoding

Rahul


On Mon, Dec 11, 2017 at 10:41 AM, Sriraman Tallam <tmsriram@google.com> wrote:
> On Sat, Dec 9, 2017 at 3:06 PM, Florian Weimer <fw@deneb.enyo.de> wrote:
>> * Rahul Chaudhry via gnu-gabi:
>>
>>> The encoding used is a simple combination of delta-encoding and a
>>> bitmap of offsets. The section consists of 64-bit entries: higher
>>> 8-bits contain delta since last offset, and lower 56-bits contain a
>>> bitmap for which words to apply the relocation to. This is best
>>> described by showing the code for decoding the section:
>>>
>>> typedef struct
>>> {
>>>   Elf64_Xword  r_data;  /* jump and bitmap for relative relocations */
>>> } Elf64_Relrz;
>>>
>>> #define ELF64_R_JUMP(val)    ((val) >> 56)
>>> #define ELF64_R_BITS(val)    ((val) & 0xffffffffffffff)
>>>
>>> #ifdef DO_RELRZ
>>>   {
>>>     ElfW(Addr) offset = 0;
>>>     for (; relative < end; ++relative)
>>>       {
>>>         ElfW(Addr) jump = ELFW(R_JUMP) (relative->r_data);
>>>         ElfW(Addr) bits = ELFW(R_BITS) (relative->r_data);
>>>         offset += jump * sizeof(ElfW(Addr));
>>>         if (jump == 0)
>>>           {
>>>             ++relative;
>>>             offset = relative->r_data;
>>>           }
>>>         ElfW(Addr) r_offset = offset;
>>>         for (; bits != 0; bits >>= 1)
>>>           {
>>>             if ((bits&1) != 0)
>>>               elf_machine_relrz_relative (l_addr, (void *) (l_addr + r_offset));
>>>             r_offset += sizeof(ElfW(Addr));
>>>           }
>>>       }
>>>   }
>>> #endif
>>
>> That data-dependent “if ((bits&1) != 0)” branch looks a bit nasty.
>>
>> Have you investigated whether some sort of RLE-style encoding would be
>> beneficial? If there are blocks of relative relocations, it might even
>> be possible to use vector instructions to process them (although more
>> than four relocations at a time are probably not achievable in a
>> power-efficient manner on current x86-64).
>
> Yes, we originally investigated RLE style encoding but I guess the key
> insight which led us towards the proposed encoding is the following.
> The offset addresses which contain the relocations are close but not
> necessarily contiguous.  We experimented with an encoding strategy
> where we would store the initial offset and the number of words after
> that which contained dynamic relocations.  This gave us good
> compression numbers but the proposed scheme was way better.  I will
> let Rahul say more as he did quite a bit of experiments with different
> strategies.
>
> Thanks
> Sri

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00             ` Rahul Chaudhry via gnu-gabi
@ 2017-01-01  0:00               ` Cary Coutant
  2017-01-01  0:00                 ` Rahul Chaudhry via gnu-gabi
  2017-01-01  0:00               ` Ian Lance Taylor via gnu-gabi
  2017-01-01  0:00               ` Florian Weimer
  2 siblings, 1 reply; 29+ messages in thread
From: Cary Coutant @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Rahul Chaudhry
  Cc: Sriraman Tallam, Suprateeka R Hegde, Florian Weimer,
	David Edelsohn, Rafael Avila de Espindola, Binutils Development,
	Alan Modra, gnu-gabi, Xinliang David Li, Sterling Augustine,
	Paul Pluzhnikov, Ian Lance Taylor, H.J. Lu, Luis Lozano,
	Peter Collingbourne, Rui Ueyama, llvm-dev

> We've taken the '.relr.dyn' section from Cary's prototype, and implemented a
> custom encoding to compactly represent the list of offsets. We're calling the
> new compressed section '.relrz.dyn' (for relocations-relative-compressed).

I'd suggest just using .relr.dyn -- your encoding is straightforward
enough that I'd just make that the standard representation for this
section type.

> The encoding used is a simple combination of delta-encoding and a bitmap of
> offsets. The section consists of 64-bit entries: higher 8-bits contain delta
> since last offset, and lower 56-bits contain a bitmap for which words to apply
> the relocation to. This is best described by showing the code for decoding the
> section:
>
> ...
>
> The above code is the entirety of the implementation for decoding and
> processing '.relrz.dyn' sections in glibc dynamic loader.
>
> This encoding can represent up to 56 relocation offsets in a single 64-bit
> word. For many of the binaries we tested, this encoding provides >40x
> compression for storing offsets over the original `.relr.dyn` section.
>
> For 32-bit targets, we use 32-bit entries: 8-bits for 'jump' and 24-bits for
> the bitmap.

Very nice! Simple and effective.

> Here are three real world examples that demonstrate the savings:

Impressive numbers. I've gotta admit, the savings are better than I expected.

> However, before that can happen, we need agreement on the ABI side for the new
> section type and the encoding. We haven't worked on a change of this magnitude
> before that touches so many different pieces from the linker, elf tools, and
> the dynamic loader. Specifically, we need agreement and/or guidance on where
> and how should the new section type and its encoding be documented. We're
> proposing adding new defines for SHT_RELRZ, DT_RELRZ, DT_RELRZSZ, DT_RELRZENT,
> and DT_RELRZCOUNT that all the different parts of the toolchains can agree on.

Yes, as Ian mentioned, the generic ABI discussion is at
generic-abi@googlegroups.com. Most people who would be interested are
already on the gnu-gabi@sourceware.org list, but there are a few who
are not, and who may not yet have seen this discussion. I'll support
the proposal.

Thanks for taking this idea the extra mile!

-cary

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section
  2017-01-01  0:00                 ` Rahul Chaudhry via gnu-gabi
@ 2017-01-01  0:00                   ` Rahul Chaudhry via gnu-gabi
  0 siblings, 0 replies; 29+ messages in thread
From: Rahul Chaudhry via gnu-gabi @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Cary Coutant
  Cc: Sriraman Tallam, Suprateeka R Hegde, Florian Weimer,
	David Edelsohn, Rafael Avila de Espindola, Binutils Development,
	Alan Modra, Rahul Chaudhry via gnu-gabi, Xinliang David Li,
	Sterling Augustine, Paul Pluzhnikov, Ian Lance Taylor, H.J. Lu,
	Luis Lozano, Peter Collingbourne, Rui Ueyama, llvm-dev

We sent a proposal to generic-abi last week.
Here's the link for cross-reference, as promised:
https://groups.google.com/forum/#!topic/generic-abi/bX460iggiKg

Rahul


On Mon, Dec 11, 2017 at 4:05 PM, Rahul Chaudhry
<rahulchaudhry@google.com> wrote:
> Thanks for your encouraging words, Ian and Cary.
>
> We're drafting a more detailed proposal and would post it on the generic-abi
> list this week. I'll also post a link here for cross-reference.
>
> Based on Cary's suggestion here, we're renaming '.relrz.dyn' to
> '.relr.dyn' in the
> proposal.
>
> Rahul
>
>
> On Fri, Dec 8, 2017 at 10:36 PM, Cary Coutant <ccoutant@gmail.com> wrote:
>>> We've taken the '.relr.dyn' section from Cary's prototype, and implemented a
>>> custom encoding to compactly represent the list of offsets. We're calling the
>>> new compressed section '.relrz.dyn' (for relocations-relative-compressed).
>>
>> I'd suggest just using .relr.dyn -- your encoding is straightforward
>> enough that I'd just make that the standard representation for this
>> section type.
>>
>>> The encoding used is a simple combination of delta-encoding and a bitmap of
>>> offsets. The section consists of 64-bit entries: higher 8-bits contain delta
>>> since last offset, and lower 56-bits contain a bitmap for which words to apply
>>> the relocation to. This is best described by showing the code for decoding the
>>> section:
>>>
>>> ...
>>>
>>> The above code is the entirety of the implementation for decoding and
>>> processing '.relrz.dyn' sections in glibc dynamic loader.
>>>
>>> This encoding can represent up to 56 relocation offsets in a single 64-bit
>>> word. For many of the binaries we tested, this encoding provides >40x
>>> compression for storing offsets over the original `.relr.dyn` section.
>>>
>>> For 32-bit targets, we use 32-bit entries: 8-bits for 'jump' and 24-bits for
>>> the bitmap.
>>
>> Very nice! Simple and effective.
>>
>>> Here are three real world examples that demonstrate the savings:
>>
>> Impressive numbers. I've gotta admit, the savings are better than I expected.
>>
>>> However, before that can happen, we need agreement on the ABI side for the new
>>> section type and the encoding. We haven't worked on a change of this magnitude
>>> before that touches so many different pieces from the linker, elf tools, and
>>> the dynamic loader. Specifically, we need agreement and/or guidance on where
>>> and how should the new section type and its encoding be documented. We're
>>> proposing adding new defines for SHT_RELRZ, DT_RELRZ, DT_RELRZSZ, DT_RELRZENT,
>>> and DT_RELRZCOUNT that all the different parts of the toolchains can agree on.
>>
>> Yes, as Ian mentioned, the generic ABI discussion is at
>> generic-abi@googlegroups.com. Most people who would be interested are
>> already on the gnu-gabi@sourceware.org list, but there are a few who
>> are not, and who may not yet have seen this discussion. I'll support
>> the proposal.
>>
>> Thanks for taking this idea the extra mile!
>>
>> -cary

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2018-01-07 10:31 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-01  0:00 Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section Sriraman Tallam
2017-01-01  0:00 ` Cary Coutant
2017-01-01  0:00   ` Sriraman Tallam
2017-01-01  0:00   ` H.J. Lu
2017-01-01  0:00   ` Florian Weimer
2017-01-01  0:00     ` Alan Modra
2017-01-01  0:00   ` Markus Trippelsdorf
2017-01-01  0:00 ` Rafael Espíndola
2017-01-01  0:00   ` Sriraman Tallam
2017-01-01  0:00 ` H.J. Lu
2017-01-01  0:00   ` Sriraman Tallam
     [not found] <CAGWvnynFwXFGLj3tAVgDatn0zmuHcWHyRNuDvR+wRZCXLnar_A@mail.gmail.com>
2017-01-01  0:00 ` Rafael Avila de Espindola
2017-01-01  0:00   ` David Edelsohn
2017-01-01  0:00     ` Suprateeka R Hegde
2017-01-01  0:00       ` Florian Weimer
2017-01-01  0:00         ` Suprateeka R Hegde
2017-01-01  0:00           ` Sriraman Tallam via gnu-gabi
2017-01-01  0:00             ` Rahul Chaudhry via gnu-gabi
2017-01-01  0:00               ` Cary Coutant
2017-01-01  0:00                 ` Rahul Chaudhry via gnu-gabi
2017-01-01  0:00                   ` Rahul Chaudhry via gnu-gabi
2017-01-01  0:00               ` Ian Lance Taylor via gnu-gabi
2017-01-01  0:00               ` Florian Weimer
2017-01-01  0:00                 ` Sriraman Tallam via gnu-gabi
2017-01-01  0:00                   ` Rahul Chaudhry via gnu-gabi
     [not found]                     ` <CAORpzuMftCGpXUObOyoFY0=jorMBDWEDbQJ23DifTNW3v-WA6Q@mail.gmail.com>
2017-01-01  0:00                       ` Rahul Chaudhry via gnu-gabi
2017-01-01  0:00                         ` Cary Coutant
2017-01-01  0:00                           ` Rahul Chaudhry via gnu-gabi
     [not found]                             ` <CAORpzuPYsSBJtypm3NDcfcgRzos3WO4JjkvgiqpyBYBhoqLVFA@mail.gmail.com>
2018-01-01  0:00                               ` Florian Weimer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).