* CREL dynamic relocations @ 2024-04-09 15:32 Wilco Dijkstra 2024-04-11 2:41 ` Fangrui Song 0 siblings, 1 reply; 7+ messages in thread From: Wilco Dijkstra @ 2024-04-09 15:32 UTC (permalink / raw) To: maskray; +Cc: 'GNU C Library', Florian Weimer Hi, I like the general idea of more compact relocations, however what I don't get is what the overall goal is. If the goal is more compact object files, why don't we just add a (de)compress pass using a fast compression algorithm? CPU time is cheap today, and real compression easily gives 2-4x reduction of object file size, far more than you could achieve by just compressing relocations. Alternatively, if we wanted to access and process ELF files without any decompression, we could define compact relocations as fixed-size entries. Using 64 bits for a compact RELA relocation gives a straightforward 4x compression. Out of range values could use the next entry to extend the ranges. So my main issue with the proposal is that it tries too hard to compress relocations. For example using offset compression for relocations, symbol indices and even addends seems to have little value: the signed offset means you lose one bit, and if out of range values are rare or not grouped together, offset encodings are actually less efficient. I don't get the discussion about relocation numbers on AArch64 - 4 or 5 bits would handle all frequently used relocations, so we'd just remap them to fit in the short encoding. Hence I don't see a need at all for a signed offset encoding. So my suggestion is to either go for compression of the whole object file or use a simpler approach that isn't almost a compression algorithm. Cheers, Wilco ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: CREL dynamic relocations 2024-04-09 15:32 CREL dynamic relocations Wilco Dijkstra @ 2024-04-11 2:41 ` Fangrui Song 2024-04-12 16:18 ` enh 0 siblings, 1 reply; 7+ messages in thread From: Fangrui Song @ 2024-04-11 2:41 UTC (permalink / raw) To: Wilco Dijkstra; +Cc: GNU C Library, Florian Weimer Thank you for your interest in the CREL relocation format. On Tue, Apr 9, 2024 at 8:33 AM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote: > I like the general idea of more compact relocations, however what I don't get is > what the overall goal is. If the goal is more compact object files, why don't we just > add a (de)compress pass using a fast compression algorithm? CPU time is cheap > today, and real compression easily gives 2-4x reduction of object file size, far more > than you could achieve by just compressing relocations. My primary goal is to make relocatable files smaller (see https://sourceware.org/pipermail/binutils/2024-March/133229.html for a few use cases). Smaller files benefit applications in several ways, including smaller I/O amount and lower linker memory usage (for linkers like gold, lld, and mold that map input files into memory). Generic data compression formats (like zlib or zstd) applied at the filesystem level won't achieve this goal because they don't decrease memory usage. In addition, filesystem compression does not appear too popular. Interestingly, I measured a 5.9% size reduction in .o files even after zstd compression when comparing two Clang builds with and without CREL. % ruby -e 'require "zstd-ruby"; un=com=0; Dir.glob("/tmp/out/s2-custom0/**/*.o").each{|f| x = File.open(f,"rb"){|h|h.read}; un+=x.size; com+=Zstd.compress(x).size}; puts "uncompressed: #{un}\ncompressed: #{com}"' uncompressed: 136086784 compressed: 37173381 % ruby -e 'require "zstd-ruby"; un=com=0; Dir.glob("/tmp/out/s2-custom1/**/*.o").each{|f| x = File.open(f,"rb"){|h|h.read}; un+=x.size; com+=Zstd.compress(x).size}; puts "uncompressed: #{un}\ncompressed: #{com}"' uncompressed: 111655952 compressed: 34964421 1-111655952/136086784 ~= 18.0% (uncompressed) 1-34964421/37173381 ~= 5.9% (zstd) Another objective is to minimize the size of dynamic relocations. Android achieves this through ld.lld --pack-dyn-relocs=android+relr, which compacts RELA relocations in their packed format. While effective, CREL offers a simpler approach that delivers even greater size reductions. > Alternatively, if we wanted to access and process ELF files without any decompression, > we could define compact relocations as fixed-size entries. Using 64 bits for a compact > RELA relocation gives a straightforward 4x compression. Out of range values could > use the next entry to extend the ranges. 64 bits are quite large. CREL typically needs just one to three bytes for one relocation. How do you design a format that is generic enough to work with all relocation types and symbol indexes? > So my main issue with the proposal is that it tries too hard to compress relocations. > For example using offset compression for relocations, symbol indices and even addends > seems to have little value: the signed offset means you lose one bit, and if out of range > values are rare or not grouped together, offset encodings are actually less efficient. I actually use unsigned delta offset to save one bit but signed delta symidx/addend. I have analyzed how many bits are demanded by typical relocations. Quote https://maskray.me/blog/2024-03-09-a-compact-relocation-format-for-elf#crel-relocation-format : Absolute symbol indexes allow one-byte encoding for symbols in the range [0,128) and offer minor size advantage for static relocations when the symbol table is sorted by usage frequency. Delta encoding, on the other hand, might optimize for the scenario when the symbol table presents locality: neighbor symbols are frequently mutually called. Delta symbol index enables one-byte encoding for GOT/PLT dynamic relocations when .got/.got.plt entries are ordered by symbol index. For example, R_*_GLOB_DAT and R_*_JUMP_SLOT relocations can typically be encoded with repeated 0x05 0x01 (when addend_bit==0 && shift==3, offset++, symidx++). Delta encoding has a disvantage. It can partial claim the optimization by arranging symbols in a "cold0 hot cold1" pattern. In addition, delta symbol index enables one-byte encoding for GOT/PLT dynamic relocations when .got/.got.plt entries are ordered by symbol index. In my experiments, absolute encoding with ULEB128 results in slightly larger .o file sizes for both x86-64 and AArch64 builds. For a decoder that only supports in-reloc addends (recommended for relocatable files), the C++ implementation is as simple as: const auto hdr = decodeULEB128(p); const size_t count = hdr / 8, shift = hdr % 4; Elf_Addr offset = 0, addend = 0; uint32_t symidx = 0, type = 0; for (size_t i = 0; i != count; ++i) { const uint8_t b = *p++; offset += b >> 3; if (b >= 0x80) offset += (decodeULEB128(p) << 4) - 0x10; if (b & 1) symidx += decodeSLEB128(p); if (b & 2) type += decodeSLEB128(p); if (b & 4) addend += decodeSLEB128(p); rels[i] = {offset << shift, symidx, type, addend}; } += for all of symidx/type/addend is for consistency, but the choice turns out to be very good as well. > I don't get the discussion about relocation numbers on AArch64 - 4 or 5 bits would > handle all frequently used relocations, so we'd just remap them to fit in the short > encoding. Hence I don't see a need at all for a signed offset encoding. The common static relocation types are within [257,313] (before R_AARCH64_PLT32). Delta encoding allows ~all but the first relocation's type to be encoded in a single byte. How do you design a compression scheme without baked-in knowledge (dictionary)? We don't want the generic encoding scheme to hard code relocation type range for each architecture. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: CREL dynamic relocations 2024-04-11 2:41 ` Fangrui Song @ 2024-04-12 16:18 ` enh 2024-04-18 4:31 ` Fangrui Song 0 siblings, 1 reply; 7+ messages in thread From: enh @ 2024-04-12 16:18 UTC (permalink / raw) To: Fangrui Song; +Cc: Wilco Dijkstra, GNU C Library, Florian Weimer On Wed, Apr 10, 2024 at 7:41 PM Fangrui Song <maskray@gcc.gnu.org> wrote: > > Thank you for your interest in the CREL relocation format. > > On Tue, Apr 9, 2024 at 8:33 AM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote: > > I like the general idea of more compact relocations, however what I don't get is > > what the overall goal is. If the goal is more compact object files, why don't we just > > add a (de)compress pass using a fast compression algorithm? CPU time is cheap > > today, and real compression easily gives 2-4x reduction of object file size, far more > > than you could achieve by just compressing relocations. > > My primary goal is to make relocatable files smaller (see > https://sourceware.org/pipermail/binutils/2024-March/133229.html for a > few use cases). > Smaller files benefit applications in several ways, including smaller > I/O amount and lower linker memory usage (for linkers like gold, lld, > and mold that map input files into memory). > > Generic data compression formats (like zlib or zstd) applied at the > filesystem level won't achieve this goal because they don't decrease > memory usage. > In addition, filesystem compression does not appear too popular. > > Interestingly, I measured a 5.9% size reduction in .o files even after > zstd compression when comparing two Clang builds with and without > CREL. > > % ruby -e 'require "zstd-ruby"; un=com=0; > Dir.glob("/tmp/out/s2-custom0/**/*.o").each{|f| x = > File.open(f,"rb"){|h|h.read}; un+=x.size; com+=Zstd.compress(x).size}; > puts "uncompressed: #{un}\ncompressed: #{com}"' > uncompressed: 136086784 > compressed: 37173381 > > % ruby -e 'require "zstd-ruby"; un=com=0; > Dir.glob("/tmp/out/s2-custom1/**/*.o").each{|f| x = > File.open(f,"rb"){|h|h.read}; un+=x.size; com+=Zstd.compress(x).size}; > puts "uncompressed: #{un}\ncompressed: #{com}"' > uncompressed: 111655952 > compressed: 34964421 > > 1-111655952/136086784 ~= 18.0% (uncompressed) > 1-34964421/37173381 ~= 5.9% (zstd) > > Another objective is to minimize the size of dynamic relocations. > Android achieves this through ld.lld --pack-dyn-relocs=android+relr, > which compacts RELA relocations in their packed format. > While effective, CREL offers a simpler approach that delivers even > greater size reductions. yeah, though android+relr has the advantage of having already shipped, so it's usable on just about every device that app developers still support (api 23 -- https://android.googlesource.com/platform/bionic/+/master/android-changes-for-ndk-developers.md#relative-relocations-relr -- versus the effective minimum of api 21) :-) something new today wouldn't be as widely usable on Android for about a decade, so although we'd probably support it if others did, to be really interesting for Android -- the point where we'd implement it even if no-one else did -- it'd have to be 2x better (in space or time, and not sacrificing the other, because app developers are very conscious of both) rather than the little bit better that it actually is. given our long lead time [for app developers to be able to rely on ubiquity] and our "good enough" solution, i'm actually more interested in the "what if we re-thought _everything_?" end of the spectrum than small tweaks here. what does mach-o do here? (that is: "why doesn't Apple care?".) > > Alternatively, if we wanted to access and process ELF files without any decompression, > > we could define compact relocations as fixed-size entries. Using 64 bits for a compact > > RELA relocation gives a straightforward 4x compression. Out of range values could > > use the next entry to extend the ranges. > > 64 bits are quite large. CREL typically needs just one to three bytes > for one relocation. > How do you design a format that is generic enough to work with all > relocation types and symbol indexes? > > > So my main issue with the proposal is that it tries too hard to compress relocations. > > For example using offset compression for relocations, symbol indices and even addends > > seems to have little value: the signed offset means you lose one bit, and if out of range > > values are rare or not grouped together, offset encodings are actually less efficient. > > I actually use unsigned delta offset to save one bit but signed delta > symidx/addend. > I have analyzed how many bits are demanded by typical relocations. > Quote https://maskray.me/blog/2024-03-09-a-compact-relocation-format-for-elf#crel-relocation-format > : > > Absolute symbol indexes allow one-byte encoding for symbols in the > range [0,128) and offer minor size advantage for static relocations > when the symbol table is sorted by usage frequency. Delta encoding, on > the other hand, might optimize for the scenario when the symbol table > presents locality: neighbor symbols are frequently mutually called. > > Delta symbol index enables one-byte encoding for GOT/PLT dynamic > relocations when .got/.got.plt entries are ordered by symbol index. > For example, R_*_GLOB_DAT and R_*_JUMP_SLOT relocations can typically > be encoded with repeated 0x05 0x01 (when addend_bit==0 && shift==3, > offset++, symidx++). Delta encoding has a disvantage. It can partial > claim the optimization by arranging symbols in a "cold0 hot cold1" > pattern. In addition, delta symbol index enables one-byte encoding for > GOT/PLT dynamic relocations when .got/.got.plt entries are ordered by > symbol index. > > In my experiments, absolute encoding with ULEB128 results in > slightly larger .o file sizes for both x86-64 and AArch64 builds. > > For a decoder that only supports in-reloc addends (recommended for > relocatable files), the C++ implementation is as simple as: > > const auto hdr = decodeULEB128(p); > const size_t count = hdr / 8, shift = hdr % 4; > Elf_Addr offset = 0, addend = 0; > uint32_t symidx = 0, type = 0; > for (size_t i = 0; i != count; ++i) { > const uint8_t b = *p++; > offset += b >> 3; > if (b >= 0x80) offset += (decodeULEB128(p) << 4) - 0x10; > if (b & 1) symidx += decodeSLEB128(p); > if (b & 2) type += decodeSLEB128(p); > if (b & 4) addend += decodeSLEB128(p); > rels[i] = {offset << shift, symidx, type, addend}; > } > > += for all of symidx/type/addend is for consistency, but the choice > turns out to be very good as well. > > > I don't get the discussion about relocation numbers on AArch64 - 4 or 5 bits would > > handle all frequently used relocations, so we'd just remap them to fit in the short > > encoding. Hence I don't see a need at all for a signed offset encoding. > > The common static relocation types are within [257,313] (before > R_AARCH64_PLT32). > Delta encoding allows ~all but the first relocation's type to be > encoded in a single byte. > > How do you design a compression scheme without baked-in knowledge (dictionary)? > We don't want the generic encoding scheme to hard code relocation type > range for each architecture. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: CREL dynamic relocations 2024-04-12 16:18 ` enh @ 2024-04-18 4:31 ` Fangrui Song 0 siblings, 0 replies; 7+ messages in thread From: Fangrui Song @ 2024-04-18 4:31 UTC (permalink / raw) To: enh; +Cc: Fangrui Song, Wilco Dijkstra, GNU C Library, Florian Weimer On Fri, Apr 12, 2024 at 9:19 AM enh <enh@google.com> wrote: > > On Wed, Apr 10, 2024 at 7:41 PM Fangrui Song <maskray@gcc.gnu.org> wrote: > > > > Thank you for your interest in the CREL relocation format. > > > > On Tue, Apr 9, 2024 at 8:33 AM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote: > > > I like the general idea of more compact relocations, however what I don't get is > > > what the overall goal is. If the goal is more compact object files, why don't we just > > > add a (de)compress pass using a fast compression algorithm? CPU time is cheap > > > today, and real compression easily gives 2-4x reduction of object file size, far more > > > than you could achieve by just compressing relocations. > > > > My primary goal is to make relocatable files smaller (see > > https://sourceware.org/pipermail/binutils/2024-March/133229.html for a > > few use cases). > > Smaller files benefit applications in several ways, including smaller > > I/O amount and lower linker memory usage (for linkers like gold, lld, > > and mold that map input files into memory). > > > > Generic data compression formats (like zlib or zstd) applied at the > > filesystem level won't achieve this goal because they don't decrease > > memory usage. > > In addition, filesystem compression does not appear too popular. > > > > Interestingly, I measured a 5.9% size reduction in .o files even after > > zstd compression when comparing two Clang builds with and without > > CREL. > > > > % ruby -e 'require "zstd-ruby"; un=com=0; > > Dir.glob("/tmp/out/s2-custom0/**/*.o").each{|f| x = > > File.open(f,"rb"){|h|h.read}; un+=x.size; com+=Zstd.compress(x).size}; > > puts "uncompressed: #{un}\ncompressed: #{com}"' > > uncompressed: 136086784 > > compressed: 37173381 > > > > % ruby -e 'require "zstd-ruby"; un=com=0; > > Dir.glob("/tmp/out/s2-custom1/**/*.o").each{|f| x = > > File.open(f,"rb"){|h|h.read}; un+=x.size; com+=Zstd.compress(x).size}; > > puts "uncompressed: #{un}\ncompressed: #{com}"' > > uncompressed: 111655952 > > compressed: 34964421 > > > > 1-111655952/136086784 ~= 18.0% (uncompressed) > > 1-34964421/37173381 ~= 5.9% (zstd) > > > > Another objective is to minimize the size of dynamic relocations. > > Android achieves this through ld.lld --pack-dyn-relocs=android+relr, > > which compacts RELA relocations in their packed format. > > While effective, CREL offers a simpler approach that delivers even > > greater size reductions. > > yeah, though android+relr has the advantage of having already shipped, > so it's usable on just about every device that app developers still > support (api 23 -- > https://android.googlesource.com/platform/bionic/+/master/android-changes-for-ndk-developers.md#relative-relocations-relr > -- versus the effective minimum of api 21) :-) > > something new today wouldn't be as widely usable on Android for about > a decade, so although we'd probably support it if others did, to be > really interesting for Android -- the point where we'd implement it > even if no-one else did -- it'd have to be 2x better (in space or > time, and not sacrificing the other, because app developers are very > conscious of both) rather than the little bit better that it actually > is. given our long lead time [for app developers to be able to rely on > ubiquity] and our "good enough" solution, i'm actually more interested > in the "what if we re-thought _everything_?" end of the spectrum than > small tweaks here. > > what does mach-o do here? (that is: "why doesn't Apple care?".) > > > > Alternatively, if we wanted to access and process ELF files without any decompression, > > > we could define compact relocations as fixed-size entries. Using 64 bits for a compact > > > RELA relocation gives a straightforward 4x compression. Out of range values could > > > use the next entry to extend the ranges. > > > > 64 bits are quite large. CREL typically needs just one to three bytes > > for one relocation. > > How do you design a format that is generic enough to work with all > > relocation types and symbol indexes? > > > > > So my main issue with the proposal is that it tries too hard to compress relocations. > > > For example using offset compression for relocations, symbol indices and even addends > > > seems to have little value: the signed offset means you lose one bit, and if out of range > > > values are rare or not grouped together, offset encodings are actually less efficient. > > > > I actually use unsigned delta offset to save one bit but signed delta > > symidx/addend. > > I have analyzed how many bits are demanded by typical relocations. > > Quote https://maskray.me/blog/2024-03-09-a-compact-relocation-format-for-elf#crel-relocation-format > > : > > > > Absolute symbol indexes allow one-byte encoding for symbols in the > > range [0,128) and offer minor size advantage for static relocations > > when the symbol table is sorted by usage frequency. Delta encoding, on > > the other hand, might optimize for the scenario when the symbol table > > presents locality: neighbor symbols are frequently mutually called. > > > > Delta symbol index enables one-byte encoding for GOT/PLT dynamic > > relocations when .got/.got.plt entries are ordered by symbol index. > > For example, R_*_GLOB_DAT and R_*_JUMP_SLOT relocations can typically > > be encoded with repeated 0x05 0x01 (when addend_bit==0 && shift==3, > > offset++, symidx++). Delta encoding has a disvantage. It can partial > > claim the optimization by arranging symbols in a "cold0 hot cold1" > > pattern. In addition, delta symbol index enables one-byte encoding for > > GOT/PLT dynamic relocations when .got/.got.plt entries are ordered by > > symbol index. > > > > In my experiments, absolute encoding with ULEB128 results in > > slightly larger .o file sizes for both x86-64 and AArch64 builds. > > > > For a decoder that only supports in-reloc addends (recommended for > > relocatable files), the C++ implementation is as simple as: > > > > const auto hdr = decodeULEB128(p); > > const size_t count = hdr / 8, shift = hdr % 4; > > Elf_Addr offset = 0, addend = 0; > > uint32_t symidx = 0, type = 0; > > for (size_t i = 0; i != count; ++i) { > > const uint8_t b = *p++; > > offset += b >> 3; > > if (b >= 0x80) offset += (decodeULEB128(p) << 4) - 0x10; > > if (b & 1) symidx += decodeSLEB128(p); > > if (b & 2) type += decodeSLEB128(p); > > if (b & 4) addend += decodeSLEB128(p); > > rels[i] = {offset << shift, symidx, type, addend}; > > } > > > > += for all of symidx/type/addend is for consistency, but the choice > > turns out to be very good as well. > > > > > I don't get the discussion about relocation numbers on AArch64 - 4 or 5 bits would > > > handle all frequently used relocations, so we'd just remap them to fit in the short > > > encoding. Hence I don't see a need at all for a signed offset encoding. > > > > The common static relocation types are within [257,313] (before > > R_AARCH64_PLT32). > > Delta encoding allows ~all but the first relocation's type to be > > encoded in a single byte. > > > > How do you design a compression scheme without baked-in knowledge (dictionary)? > > We don't want the generic encoding scheme to hard code relocation type > > range for each architecture. PATH=/tmp/Rel/bin:$PATH ld.lld @response.txt -z now --pack-dyn-relocs=relr -o clang.relr ld.lld @response.txt -z now --pack-dyn-relocs=android+relr -o clang.relr+android ld.lld @response.txt -z now --pack-dyn-relocs=relr -z crel -o clang.relr+crel llvm-objcopy --dump-section .rela.dyn=reladyn clang.relr /dev/null llvm-objcopy --dump-section .crel.dyn=creldyn clang.relr+crel /dev/null llvm-objcopy --dump-section .rela.dyn=androiddyn clang.relr+android /dev/null xz -fk reladyn androiddyn creldyn zstd -fk reladyn androiddyn creldyn % llvm-readelf -r clang.relr+crel | grep crel.dyn Relocation section '.crel.dyn' at offset 0x5df318 contains 2907 entries: % stat -c '%s %n' reladyn* androiddyn* creldyn* 69768 reladyn 4236 reladyn.xz 4449 reladyn.zst 4604 androiddyn 1484 androiddyn.xz 1481 androiddyn.zst 3980 creldyn 1344 creldyn.xz 1367 creldyn.zst % xz --robot --list -vv androiddyn.xz | grep block block 1 1 1 12 0 1448 4604 0.315 CRC64 bb905bfac8a6ca11 12 -- 1427 8454200 --lzma2=dict=8MiB There are 2907 relocation entries that are not RELATIVE or JUMP_SLOT, which can be compacted with either Android's packed relocation format or CREL. Notably, zstd/LZMA2 filtering on RELA outperforms Android's packed relocation format, while CREL outperforms zstd/LZMA2:) As a byte-oriented format, CREL relocations can be further compressed. (If CREL were made more bit-oriented, it might become uncompressible.) However, achieving double the compression ratio of Android's format is unrealistic for byte-oriented schemes where each relocation requires at least one byte. A byte-oriented format needs at least a bit more than 2907 bytes, exceeding half of the original 4604 bytes. An interesting observation: ld.lld still uses RELA for JUMP_SLOT relocations even with --pack-dyn-relocs=android+relr. CREL handling JUMP_SLOT relocations potentially improves compression further. llvm-objcopy --dump-section .crel.plt=crelplt clang.relr+crel /dev/null llvm-objcopy --dump-section .rela.plt=relaplt clang.relr+android /dev/null xz -fk relaplt crelplt % stat -c '%s %n' *plt* 672 crelplt 116 crelplt.xz 7992 relaplt 628 relaplt.xz In clang.relr+android, .rela.* sections consume 12596 bytes. In clang.relr+crel, .crel.* sections consume 4652 bytes. Since there is no dynamic loader implementation, I cannot comment on the performance. Intuitively CREL is simpler and should be faster. > what does mach-o do here? (that is: "why doesn't Apple care?".) Before chained fixups (WWDC2022), they use a state machine (think of DWARF .debug_line) to set segment_index/symbol_name/dylib_ordinal/offset/addend/type/etc (https://github.com/apple-opensource/dyld/blob/e3f88907bebb8421f50f0943595f6874de70ebe0/src/ImageLoaderMachOCompressed.cpp#L1347). The bind opcodes look like: ... BIND_OPCODE_SET_SYMBOL_TRAILING_FLAGS_IMM flags symbol_name (BIND_OPCODE_SET_TYPE_IMM|type) BIND_OPCODE_SET_DYLIB_ORDINAL_IMM dylib_ordinal (BIND_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB | segment_idx) offset BIND_OPCODE_DO_BIND_ULEB_TIMES_SKIPPING_ULEB opcode consecutive_count data BIND_OPCODE_SET_ADDEND_SLEB addend BIND_OPCODE_DO_BIND_ADD_ADDR_ULEB delta_addend ... While this model is powerful, its complexity might not be ideal for ELF systems. They encode symbol names/flags in place, dylib_ordinal (two-level namespace), and segment index, complexity that ELF systems don't have. Having opcodes in the stream makes their encoding likely less effecient. --- Chained fixups are a new system. They encode just the first relocations along with the relocation type information in each page. Subsequent relocations are encoded using the locations to be modified. * It necessitates kernel understanding of dynamic linking data and communication with the dynamic loader. * The encoding repurposes the ELF "implicit addend" for page offsets, limiting its use for larger addends. Some addend bits are stolen to encode a singly linked list (dyld_chained_ptr_64_bind::next dyld_chained_ptr_64_rebase::next). + For the rebase operation (ELF relative relocation), 36 least significant bits and 8 most significant bits are encoded. This is insufficient if we need a larger addend. + The page offset is muliple of 4, further losing generality. This approach may not be suitable for 32-bit systems or those requiring support for numerous relocation types, like ELF. Additionally, the current encoding of initial relocations and the import table seems less space-efficient. A CREL-like design could potentially improve space utilization. -- 宋方睿 ^ permalink raw reply [flat|nested] 7+ messages in thread
* CREL dynamic relocations @ 2024-03-24 5:50 Fangrui Song 2024-03-25 11:52 ` Florian Weimer 0 siblings, 1 reply; 7+ messages in thread From: Fangrui Song @ 2024-03-24 5:50 UTC (permalink / raw) To: libc-alpha I have proposed a compact relocation format CREL at https://groups.google.com/g/generic-abi/c/yb0rjw56ORw/m/eiBcYxSfAQAJ (previously named RELLEB). CREL primarily targets static relocations, achieving significant .o file size reduction for lld builds: 18.0% for x86-64/aarch64 and 34.3% for riscv64. CREL holds promise for dynamic relocations as well, surpassing Android's packed relocation format. * R_*_GLOB_DAT and R_*_JUMP_SLOT relocations can typically be encoded with repeated 0x09 0x01 (when shift==3, offset++, symidx++). * Absolute relocation (__cxa_pure_virtual@CXXABI_1.3 + 0) relocations typically require just one byte (offset+=n) While CREL's benefit for non-relative relocations is smaller than RELR's optimizing for relative relocations, it still shines for programs where non-relative relocations make up more than 5% of the file size (quite a few GUI libraries, probably due to C++ virtual tables). We could even consider phasing out REL/RELA for new architectures in favor of CREL. I'd love to hear your feedback on CREL and hope someone might consider implementing rtld support (and binutils support if you like that as well:) ) https://sourceware.org/bugzilla/show_bug.cgi?id=31541 I've notified binutils and GCC folks at https://sourceware.org/pipermail/binutils/2024-March/133153.html ("CREL relocation format for ELF (was: RELLEB)"). ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: CREL dynamic relocations 2024-03-24 5:50 Fangrui Song @ 2024-03-25 11:52 ` Florian Weimer 2024-03-25 18:51 ` Fangrui Song 0 siblings, 1 reply; 7+ messages in thread From: Florian Weimer @ 2024-03-25 11:52 UTC (permalink / raw) To: Fangrui Song; +Cc: libc-alpha * Fangrui Song: > I have proposed a compact relocation format CREL at > https://groups.google.com/g/generic-abi/c/yb0rjw56ORw/m/eiBcYxSfAQAJ > (previously named RELLEB). > > CREL primarily targets static relocations, achieving significant .o > file size reduction for lld builds: 18.0% for x86-64/aarch64 and 34.3% > for riscv64. > CREL holds promise for dynamic relocations as well, surpassing > Android's packed relocation format. As I said elsewhere, I'm concerned about the use of the ULEB128 encoding. It's unnecessarily difficult to decode. Thanks, Florian ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: CREL dynamic relocations 2024-03-25 11:52 ` Florian Weimer @ 2024-03-25 18:51 ` Fangrui Song 0 siblings, 0 replies; 7+ messages in thread From: Fangrui Song @ 2024-03-25 18:51 UTC (permalink / raw) To: Florian Weimer; +Cc: Fangrui Song, libc-alpha On Mon, Mar 25, 2024 at 4:53 AM Florian Weimer <fweimer@redhat.com> wrote: > > * Fangrui Song: > > > I have proposed a compact relocation format CREL at > > https://groups.google.com/g/generic-abi/c/yb0rjw56ORw/m/eiBcYxSfAQAJ > > (previously named RELLEB). > > > > CREL primarily targets static relocations, achieving significant .o > > file size reduction for lld builds: 18.0% for x86-64/aarch64 and 34.3% > > for riscv64. > > CREL holds promise for dynamic relocations as well, surpassing > > Android's packed relocation format. > > As I said elsewhere, I'm concerned about the use of the ULEB128 > encoding. It's unnecessarily difficult to decode. > > Thanks, > Florian Thanks. I have seen your question at https://groups.google.com/g/generic-abi/c/yb0rjw56ORw/m/osMXhg5XAgAJ and replied there that since one-byte encodings dominant for our use cases, LEB128 is actually the best choice (in terms of both performance and simplicity). I've researched the dynamic relocation problem in the weekend and incorporated the following text to my blog post Traditionally, we have two dynamic relocation ranges for executables and shared objects (except static position-dependent executables): * `.rela.dyn` (`[DT_RELA, DT_RELA + DT_RELASZ)`) or `.rel.dyn` (`[DT_REL, DT_REL + DT_RELSZ)`) * `.rela.plt` (`[DT_JMPREL, DT_JMPREL + DT_PLTRELSZ)`): Stored JUMP_SLOT relocations. `DT_PLTREL` specifies `DT_REL` or `DT_RELA`. IRELATIVE relocations can be placed in either range, but preferrably in `.rel[a].dyn`. Some GNU ld ports (e.g. SPARC) treat `.rela.plt` as a subset of `.rela.dyn`, introducing complexity for dynamic loaders. **CREL adoption considerations** * New dynamic tag (`DT_CREL`): To identify CREL relocations, separate from existing `DT_REL`/`DT_RELA`. * No `DT_CRELSZ`: Relocation count can be derived from the CREL header. * Output section description `.rela.dyn : { *(.rela.dyn) *(.rela.plt) }` is incompatible with CREL. **Challenges with lazy binding** glibc's lazy binding scheme relies on [random access to relocation entries within the `DT_JMPREL` table](https://maskray.me/blog/2021-09-19-all-about-procedure-linkage-table#:~:text=_dl_fixup). CREL's sequential nature prevents this. However, eager binding doesn't require random access. Therefore, when `-z now` (eager binding) is enabled, we can: * Set `DT_PLTREL` to `DT_CREL`. * Replace `.rel[a].plt` with `.crel.plt`. **Challenges with statically linked position-dependent executables** glibc introduces additional complexity for IRELATIVE relocations in statically linked position-dependent executables. They should only contain IRELATIVE relocations and no other dynamic relocations. glibc's `csu/libc-start.c` processes IRELATIVE relocations in the range [`[__rela_iplt_start, __rela_iplt_end)`](https://maskray.me/blog/2021-01-18-gnu-indirect-function#non-preemptible-ifunc#rela_iplt_start-and-__rela_iplt_end) (or `[__rel_iplt_start, __rel_iplt_end)`, determined at build time through `ELF_MACHINE_IREL`). While CREL relocations cannot be decoded in the middle of the section, we can still place IRELATIVE relocations in `.crel.dyn` because there wouldn't be any other relocation types (position-dependent executables don't have RELATIVE relocations). When CREL is enabled, we can define `__crel_iplt_start` and `__crel_iplt_end` for statically linked position-dependent executables. If glibc only intends to support `addend_bit==0`, the code can simply be: ```c extern const uint8_t __crel_iplt_start[] __attribute__ ((weak)); extern const uint8_t __crel_iplt_end[] __attribute__ ((weak)); if (&__crel_iplt_start != &__crel_iplt_end) { const uint8_t *p = __crel_iplt_start; size_t offset = 0, count = read_uleb128 (&p), shift = count & 3; for (count >>= 3; count; count--) { uint8_t rel_head = *p++; offset += rel_head >> 2; if (rel_head & 128) offset += (read_uleb128 (&p) << 5) - 32; if (rel_head & 2) read_sleb128 (&p); elf_crel_irel ((ElfW (Addr) *) (offset << shift)); } } ``` **Considering implicit addends for CREL** Many dynamic relocations have zero addends: * COPY/GLOB_DAT/JUMP_SLOT relocations only use zero addends. * Absolute relocations could use non-zero addends with `STT_SECTION` symbol, but linkers convert them to relative relocations. Usually only RELATIVE/IRELATIVE and potentially TPREL/TPOFF might require non-zero addends. Switching from `DT_RELA` to `DT_REL` offers a minor size advantage. I considered defining two separate dynamic tags (`DT_CREL` and `DT_CRELA`) to distinguish between implicit and explicit addends. However, this would have introduced complexity: * Should `llvm-readelf -r` dump the zero addends for `DT_CRELA`? * Should dynamic loaders support both dynamic tags? I placed the delta addend bit next to offset bits so that it can be reused for offsets. Thanks to Stefan O'Rear's for making me believe that my original thought of reserving a single bit flag (`addend_bit`) within the CREL header is elegant. Dynamic loaders prioritizing simplicity can hardcode the desired `addend_bit` value. `ld.lld -z crel` defaults to implicit addends (`addend_bit==0`), but the option of using in-relocation addends is available with `-z crel -z rela`. **DT_AARCH64_AUTH_RELR vs CREL** The AArch64 PAuth ABI introduces `DT_AARCH64_AUTH_RELR` as a variant of RELR for signed relocations. However, its benefit seems limited. In a release build of Clang 16, using `-z crel -z rela` resulted in a `.crel.dyn` section size of only 1.0% of the file size. Notably, enabling implicit addends with `-z crel -z rel` further reduced the size to just 0.3%. While `DT_AARCH64_AUTH_RELR` will achieve a noticeable smaller relocation size if most relative relocations are encoded with it, the advantage seems less significant considering CREL's already compact size. Furthermore, `DT_AARCH64_AUTH_RLEL` introduces additional complexity to the linker due to its 32-bit addend limitation: the in-place 64 value encodes a 32-bit schema, giving just 32 bits to the implicit addend. If the addend does not fit into 32 bits, `DT_AARCH64_AUTH_RELR` cannot be used. CREL with addends would avoid this complexity. I have filed [Quantifying the benefits of DT_AARCH64_AUTH_RELR](https://github.com/ARM-software/abi-aa/issues/252). -- 宋方睿 ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2024-04-18 4:32 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2024-04-09 15:32 CREL dynamic relocations Wilco Dijkstra 2024-04-11 2:41 ` Fangrui Song 2024-04-12 16:18 ` enh 2024-04-18 4:31 ` Fangrui Song -- strict thread matches above, loose matches on Subject: below -- 2024-03-24 5:50 Fangrui Song 2024-03-25 11:52 ` Florian Weimer 2024-03-25 18:51 ` Fangrui Song
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).