From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ed1-x536.google.com (mail-ed1-x536.google.com [IPv6:2a00:1450:4864:20::536]) by sourceware.org (Postfix) with ESMTPS id 97A2D3858D37 for ; Thu, 18 Apr 2024 04:32:04 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 97A2D3858D37 Authentication-Results: sourceware.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=google.com ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 97A2D3858D37 Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=2a00:1450:4864:20::536 ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1713414727; cv=none; b=tIEAgSta2dNFlU/xNzH6dOIgtvpmoxScZDuagWQ6C38e8YMIWYIYVQ/5AWz0Mw3A1pC7tCMvoVxnBdZwlVBRyLZVgHlO4v53LL90HzKm8jTy8yjIn2ZCkAr36okeBKIcOP9LYOmbcW/nRuFavjiluUSuPfKCbzBHyvzzrbGGyHU= ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1713414727; c=relaxed/simple; bh=RvX5gG45fuXjejan0MozJJviXJJj9P23qO1CdYovrGw=; h=DKIM-Signature:MIME-Version:From:Date:Message-ID:Subject:To; b=LMa5YoeGVtm000vm9/2u0v6My8qyD2+wZri9vrOIm6yd5P+DHsNpgRJYf06J0TCXMUgTtWSgtFssfjp3oJHf9QqyI9b6dTQ9ewA2MeoUeo7d/6LRpSKyPgD3OJjm6hVJ3Wojex3u/nK6AOqQlWctvVJaf/Ko2ESzgOK/SqgXV5Q= ARC-Authentication-Results: i=1; server2.sourceware.org Received: by mail-ed1-x536.google.com with SMTP id 4fb4d7f45d1cf-56e2e851794so4761a12.0 for ; Wed, 17 Apr 2024 21:32:04 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1713414723; x=1714019523; darn=sourceware.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=KgiWk5+1GuMf2bvgiXMu3KR85sWy30D8QZjDcUE7uEc=; b=BPqB4vBR7+ySieOfQtpdBIUns2zmGoStgQYxX9dpaMtaZGRRRBpyvfzAzkXF9HAuLI ln+y+B7NDFgAsHt84lX+AlRpr4LxJTsJdzfSfO5dHNYdhXgZ2aTGVWkqghYGifU78I31 lwSrfohyDTBVn0/sPj275hmJLS6JYXWQLLqE51u6CU3lGoNdm8DVasE9YHgq+CiNkMPi ke/ApKiH7uDk+8RIFxiY2sg9syQUrRC645ziLLXJ9fssqlEsa3m7Zui4W2anTJ2TRvgv 7bX+M+IMPl9PYXxxjOGoZq0ORA7NNKqqH5exSnW9kYSardZFNhqF8Kfw9wc2MlH7HDVm BD4g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1713414723; x=1714019523; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=KgiWk5+1GuMf2bvgiXMu3KR85sWy30D8QZjDcUE7uEc=; b=NRpcmwV5wLzve3m6FDozpJkabOMklhunStxkcaUG1f94gC5Cs8pJ4bSqQjWwPPHYth K5hhgyGJRjbjCU1r1np88rnmEJQrbvhpEMkAj/wuziNFGHSBOL2Z6Xxp8asbgsCj5x22 gpdiiX8gSF9w421DuRvhkXLsw44cfcnz0HK0b86r6Qbz9OR7sJ27T4HtswzfrCT8F7FK hJTPXGWqv6GF/69WyYBArUzcCk0bOsj7B4oOdTKiKKUvlGWn3W9119V67VdPzfxUQGOR R0u9nkd0/QSJ9gakltmf5ksm7+3UujKMp1JwsOanVgh/hRRBMEYOrYwHFJnyIEIqR8U9 yoSg== X-Forwarded-Encrypted: i=1; AJvYcCUMTjEY6pS3+i6n5BGg+ZxUB/g6BiuabvzXlhJJJ+mIGZ+935AomU8FY8w/Iq94EfwaI6fUK5JCvapeDgr0aE4n4GfVRwbTWwzc X-Gm-Message-State: AOJu0YyhjjVK8gYD1zbijao/3nqHqsduubTmIKNrvIbo1KdfHwGqjCu8 bef/kyeVb8cAuep0Cm2naaOdjYiDhECeWeo3dbhstwGIuMMSD0KcblwOXl50UYdl2DbdAV9q2+j Eb189P3scpq0kTBGpONPV9SdVnxTPTOta2fWv9Rt/Wq8M/xA1G9LJ X-Google-Smtp-Source: AGHT+IFoZ2JtD0nzuPk1JDOyYu3XafLtHCC/ATImrDXxx5Kn5O4x+52ig5MfK6wpo/FtQ2DjRwqXrgAJ0qw6+FMXufg= X-Received: by 2002:a05:6402:50a:b0:571:b9c7:2804 with SMTP id m10-20020a056402050a00b00571b9c72804mr47188edv.5.1713414722731; Wed, 17 Apr 2024 21:32:02 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Fangrui Song Date: Wed, 17 Apr 2024 21:31:49 -0700 Message-ID: Subject: Re: CREL dynamic relocations To: enh Cc: Fangrui Song , Wilco Dijkstra , GNU C Library , Florian Weimer Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-14.7 required=5.0 tests=BAYES_00,BODY_8BITS,DKIMWL_WL_MED,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,ENV_AND_HDR_SPF_MATCH,KAM_INFOUSMEBIZ,KAM_STOCKGEN,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,USER_IN_DEF_DKIM_WL,USER_IN_DEF_SPF_WL autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Fri, Apr 12, 2024 at 9:19=E2=80=AFAM enh wrote: > > On Wed, Apr 10, 2024 at 7:41=E2=80=AFPM Fangrui Song wrote: > > > > Thank you for your interest in the CREL relocation format. > > > > On Tue, Apr 9, 2024 at 8:33=E2=80=AFAM Wilco Dijkstra wrote: > > > I like the general idea of more compact relocations, however what I d= on't get is > > > what the overall goal is. If the goal is more compact object files, w= hy don't we just > > > add a (de)compress pass using a fast compression algorithm? CPU time = is cheap > > > today, and real compression easily gives 2-4x reduction of object fil= e size, far more > > > than you could achieve by just compressing relocations. > > > > My primary goal is to make relocatable files smaller (see > > https://sourceware.org/pipermail/binutils/2024-March/133229.html for a > > few use cases). > > Smaller files benefit applications in several ways, including smaller > > I/O amount and lower linker memory usage (for linkers like gold, lld, > > and mold that map input files into memory). > > > > Generic data compression formats (like zlib or zstd) applied at the > > filesystem level won't achieve this goal because they don't decrease > > memory usage. > > In addition, filesystem compression does not appear too popular. > > > > Interestingly, I measured a 5.9% size reduction in .o files even after > > zstd compression when comparing two Clang builds with and without > > CREL. > > > > % ruby -e 'require "zstd-ruby"; un=3Dcom=3D0; > > Dir.glob("/tmp/out/s2-custom0/**/*.o").each{|f| x =3D > > File.open(f,"rb"){|h|h.read}; un+=3Dx.size; com+=3DZstd.compress(x).siz= e}; > > puts "uncompressed: #{un}\ncompressed: #{com}"' > > uncompressed: 136086784 > > compressed: 37173381 > > > > % ruby -e 'require "zstd-ruby"; un=3Dcom=3D0; > > Dir.glob("/tmp/out/s2-custom1/**/*.o").each{|f| x =3D > > File.open(f,"rb"){|h|h.read}; un+=3Dx.size; com+=3DZstd.compress(x).siz= e}; > > puts "uncompressed: #{un}\ncompressed: #{com}"' > > uncompressed: 111655952 > > compressed: 34964421 > > > > 1-111655952/136086784 ~=3D 18.0% (uncompressed) > > 1-34964421/37173381 ~=3D 5.9% (zstd) > > > > Another objective is to minimize the size of dynamic relocations. > > Android achieves this through ld.lld --pack-dyn-relocs=3Dandroid+relr, > > which compacts RELA relocations in their packed format. > > While effective, CREL offers a simpler approach that delivers even > > greater size reductions. > > yeah, though android+relr has the advantage of having already shipped, > so it's usable on just about every device that app developers still > support (api 23 -- > https://android.googlesource.com/platform/bionic/+/master/android-changes= -for-ndk-developers.md#relative-relocations-relr > -- versus the effective minimum of api 21) :-) > > something new today wouldn't be as widely usable on Android for about > a decade, so although we'd probably support it if others did, to be > really interesting for Android -- the point where we'd implement it > even if no-one else did -- it'd have to be 2x better (in space or > time, and not sacrificing the other, because app developers are very > conscious of both) rather than the little bit better that it actually > is. given our long lead time [for app developers to be able to rely on > ubiquity] and our "good enough" solution, i'm actually more interested > in the "what if we re-thought _everything_?" end of the spectrum than > small tweaks here. > > what does mach-o do here? (that is: "why doesn't Apple care?".) > > > > Alternatively, if we wanted to access and process ELF files without a= ny decompression, > > > we could define compact relocations as fixed-size entries. Using 64 b= its for a compact > > > RELA relocation gives a straightforward 4x compression. Out of range = values could > > > use the next entry to extend the ranges. > > > > 64 bits are quite large. CREL typically needs just one to three bytes > > for one relocation. > > How do you design a format that is generic enough to work with all > > relocation types and symbol indexes? > > > > > So my main issue with the proposal is that it tries too hard to compr= ess relocations. > > > For example using offset compression for relocations, symbol indices = and even addends > > > seems to have little value: the signed offset means you lose one bit,= and if out of range > > > values are rare or not grouped together, offset encodings are actuall= y less efficient. > > > > I actually use unsigned delta offset to save one bit but signed delta > > symidx/addend. > > I have analyzed how many bits are demanded by typical relocations. > > Quote https://maskray.me/blog/2024-03-09-a-compact-relocation-format-fo= r-elf#crel-relocation-format > > : > > > > Absolute symbol indexes allow one-byte encoding for symbols in the > > range [0,128) and offer minor size advantage for static relocations > > when the symbol table is sorted by usage frequency. Delta encoding, on > > the other hand, might optimize for the scenario when the symbol table > > presents locality: neighbor symbols are frequently mutually called. > > > > Delta symbol index enables one-byte encoding for GOT/PLT dynamic > > relocations when .got/.got.plt entries are ordered by symbol index. > > For example, R_*_GLOB_DAT and R_*_JUMP_SLOT relocations can typically > > be encoded with repeated 0x05 0x01 (when addend_bit=3D=3D0 && shift=3D= =3D3, > > offset++, symidx++). Delta encoding has a disvantage. It can partial > > claim the optimization by arranging symbols in a "cold0 hot cold1" > > pattern. In addition, delta symbol index enables one-byte encoding for > > GOT/PLT dynamic relocations when .got/.got.plt entries are ordered by > > symbol index. > > > > In my experiments, absolute encoding with ULEB128 results in > > slightly larger .o file sizes for both x86-64 and AArch64 builds. > > > > For a decoder that only supports in-reloc addends (recommended for > > relocatable files), the C++ implementation is as simple as: > > > > const auto hdr =3D decodeULEB128(p); > > const size_t count =3D hdr / 8, shift =3D hdr % 4; > > Elf_Addr offset =3D 0, addend =3D 0; > > uint32_t symidx =3D 0, type =3D 0; > > for (size_t i =3D 0; i !=3D count; ++i) { > > const uint8_t b =3D *p++; > > offset +=3D b >> 3; > > if (b >=3D 0x80) offset +=3D (decodeULEB128(p) << 4) - 0x10; > > if (b & 1) symidx +=3D decodeSLEB128(p); > > if (b & 2) type +=3D decodeSLEB128(p); > > if (b & 4) addend +=3D decodeSLEB128(p); > > rels[i] =3D {offset << shift, symidx, type, addend}; > > } > > > > +=3D for all of symidx/type/addend is for consistency, but the choice > > turns out to be very good as well. > > > > > I don't get the discussion about relocation numbers on AArch64 - 4 or= 5 bits would > > > handle all frequently used relocations, so we'd just remap them to fi= t in the short > > > encoding. Hence I don't see a need at all for a signed offset encodin= g. > > > > The common static relocation types are within [257,313] (before > > R_AARCH64_PLT32). > > Delta encoding allows ~all but the first relocation's type to be > > encoded in a single byte. > > > > How do you design a compression scheme without baked-in knowledge (dict= ionary)? > > We don't want the generic encoding scheme to hard code relocation type > > range for each architecture. PATH=3D/tmp/Rel/bin:$PATH ld.lld @response.txt -z now --pack-dyn-relocs=3Drelr -o clang.relr ld.lld @response.txt -z now --pack-dyn-relocs=3Dandroid+relr -o clang.relr+= android ld.lld @response.txt -z now --pack-dyn-relocs=3Drelr -z crel -o clang.relr+= crel llvm-objcopy --dump-section .rela.dyn=3Dreladyn clang.relr /dev/null llvm-objcopy --dump-section .crel.dyn=3Dcreldyn clang.relr+crel /dev/null llvm-objcopy --dump-section .rela.dyn=3Dandroiddyn clang.relr+android /dev/= null xz -fk reladyn androiddyn creldyn zstd -fk reladyn androiddyn creldyn % llvm-readelf -r clang.relr+crel | grep crel.dyn Relocation section '.crel.dyn' at offset 0x5df318 contains 2907 entries: % stat -c '%s %n' reladyn* androiddyn* creldyn* 69768 reladyn 4236 reladyn.xz 4449 reladyn.zst 4604 androiddyn 1484 androiddyn.xz 1481 androiddyn.zst 3980 creldyn 1344 creldyn.xz 1367 creldyn.zst % xz --robot --list -vv androiddyn.xz | grep block block 1 1 1 12 0 1448 4604 0.315 CRC64 bb905bfac8a6ca11 12 -- 1427 8454200 --lzma2=3Ddict=3D8MiB There are 2907 relocation entries that are not RELATIVE or JUMP_SLOT, which can be compacted with either Android's packed relocation format or CREL. Notably, zstd/LZMA2 filtering on RELA outperforms Android's packed relocation format, while CREL outperforms zstd/LZMA2:) As a byte-oriented format, CREL relocations can be further compressed. (If CREL were made more bit-oriented, it might become uncompressible.) However, achieving double the compression ratio of Android's format is unrealistic for byte-oriented schemes where each relocation requires at least one byte. A byte-oriented format needs at least a bit more than 2907 bytes, exceeding half of the original 4604 bytes. An interesting observation: ld.lld still uses RELA for JUMP_SLOT relocations even with --pack-dyn-relocs=3Dandroid+relr. CREL handling JUMP_SLOT relocations potentially improves compression furthe= r. llvm-objcopy --dump-section .crel.plt=3Dcrelplt clang.relr+crel /dev/null llvm-objcopy --dump-section .rela.plt=3Drelaplt clang.relr+android /dev/nul= l xz -fk relaplt crelplt % stat -c '%s %n' *plt* 672 crelplt 116 crelplt.xz 7992 relaplt 628 relaplt.xz In clang.relr+android, .rela.* sections consume 12596 bytes. In clang.relr+crel, .crel.* sections consume 4652 bytes. Since there is no dynamic loader implementation, I cannot comment on the performance. Intuitively CREL is simpler and should be faster. > what does mach-o do here? (that is: "why doesn't Apple care?".) Before chained fixups (WWDC2022), they use a state machine (think of DWARF .debug_line) to set segment_index/symbol_name/dylib_ordinal/offset/addend/type/etc (https://github.com/apple-opensource/dyld/blob/e3f88907bebb8421f50f0943595f= 6874de70ebe0/src/ImageLoaderMachOCompressed.cpp#L1347). The bind opcodes look like: ... BIND_OPCODE_SET_SYMBOL_TRAILING_FLAGS_IMM flags symbol_name (BIND_OPCODE_SET_TYPE_IMM|type) BIND_OPCODE_SET_DYLIB_ORDINAL_IMM dylib_ordinal (BIND_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB | segment_idx) offset BIND_OPCODE_DO_BIND_ULEB_TIMES_SKIPPING_ULEB opcode consecutive_count data BIND_OPCODE_SET_ADDEND_SLEB addend BIND_OPCODE_DO_BIND_ADD_ADDR_ULEB delta_addend ... While this model is powerful, its complexity might not be ideal for ELF sys= tems. They encode symbol names/flags in place, dylib_ordinal (two-level namespace), and segment index, complexity that ELF systems don't have. Having opcodes in the stream makes their encoding likely less effecient. --- Chained fixups are a new system. They encode just the first relocations along with the relocation type information in each page. Subsequent relocations are encoded using the locations to be modified. * It necessitates kernel understanding of dynamic linking data and communication with the dynamic loader. * The encoding repurposes the ELF "implicit addend" for page offsets, limiting its use for larger addends. Some addend bits are stolen to encode a singly linked list (dyld_chained_ptr_64_bind::next dyld_chained_ptr_64_rebase::next). + For the rebase operation (ELF relative relocation), 36 least significant bits and 8 most significant bits are encoded. This is insufficient if we need a larger addend. + The page offset is muliple of 4, further losing generality. This approach may not be suitable for 32-bit systems or those requiring support for numerous relocation types, like ELF. Additionally, the current encoding of initial relocations and the import table seems less space-efficient. A CREL-like design could potentially improve space utilization. --=20 =E5=AE=8B=E6=96=B9=E7=9D=BF