From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=Imq1=LX=google.com=maskray@sourceware.org>
Received: from mail-ed1-x536.google.com (mail-ed1-x536.google.com [IPv6:2a00:1450:4864:20::536])
	by sourceware.org (Postfix) with ESMTPS id 97A2D3858D37
	for <libc-alpha@sourceware.org>; Thu, 18 Apr 2024 04:32:04 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 97A2D3858D37
Authentication-Results: sourceware.org; dmarc=pass (p=reject dis=none) header.from=google.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=google.com
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 97A2D3858D37
Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=2a00:1450:4864:20::536
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1713414727; cv=none;
	b=tIEAgSta2dNFlU/xNzH6dOIgtvpmoxScZDuagWQ6C38e8YMIWYIYVQ/5AWz0Mw3A1pC7tCMvoVxnBdZwlVBRyLZVgHlO4v53LL90HzKm8jTy8yjIn2ZCkAr36okeBKIcOP9LYOmbcW/nRuFavjiluUSuPfKCbzBHyvzzrbGGyHU=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
	t=1713414727; c=relaxed/simple;
	bh=RvX5gG45fuXjejan0MozJJviXJJj9P23qO1CdYovrGw=;
	h=DKIM-Signature:MIME-Version:From:Date:Message-ID:Subject:To; b=LMa5YoeGVtm000vm9/2u0v6My8qyD2+wZri9vrOIm6yd5P+DHsNpgRJYf06J0TCXMUgTtWSgtFssfjp3oJHf9QqyI9b6dTQ9ewA2MeoUeo7d/6LRpSKyPgD3OJjm6hVJ3Wojex3u/nK6AOqQlWctvVJaf/Ko2ESzgOK/SqgXV5Q=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: by mail-ed1-x536.google.com with SMTP id 4fb4d7f45d1cf-56e2e851794so4761a12.0
        for <libc-alpha@sourceware.org>; Wed, 17 Apr 2024 21:32:04 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20230601; t=1713414723; x=1714019523; darn=sourceware.org;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:from:to:cc:subject:date
         :message-id:reply-to;
        bh=KgiWk5+1GuMf2bvgiXMu3KR85sWy30D8QZjDcUE7uEc=;
        b=BPqB4vBR7+ySieOfQtpdBIUns2zmGoStgQYxX9dpaMtaZGRRRBpyvfzAzkXF9HAuLI
         ln+y+B7NDFgAsHt84lX+AlRpr4LxJTsJdzfSfO5dHNYdhXgZ2aTGVWkqghYGifU78I31
         lwSrfohyDTBVn0/sPj275hmJLS6JYXWQLLqE51u6CU3lGoNdm8DVasE9YHgq+CiNkMPi
         ke/ApKiH7uDk+8RIFxiY2sg9syQUrRC645ziLLXJ9fssqlEsa3m7Zui4W2anTJ2TRvgv
         7bX+M+IMPl9PYXxxjOGoZq0ORA7NNKqqH5exSnW9kYSardZFNhqF8Kfw9wc2MlH7HDVm
         BD4g==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1713414723; x=1714019523;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=KgiWk5+1GuMf2bvgiXMu3KR85sWy30D8QZjDcUE7uEc=;
        b=NRpcmwV5wLzve3m6FDozpJkabOMklhunStxkcaUG1f94gC5Cs8pJ4bSqQjWwPPHYth
         K5hhgyGJRjbjCU1r1np88rnmEJQrbvhpEMkAj/wuziNFGHSBOL2Z6Xxp8asbgsCj5x22
         gpdiiX8gSF9w421DuRvhkXLsw44cfcnz0HK0b86r6Qbz9OR7sJ27T4HtswzfrCT8F7FK
         hJTPXGWqv6GF/69WyYBArUzcCk0bOsj7B4oOdTKiKKUvlGWn3W9119V67VdPzfxUQGOR
         R0u9nkd0/QSJ9gakltmf5ksm7+3UujKMp1JwsOanVgh/hRRBMEYOrYwHFJnyIEIqR8U9
         yoSg==
X-Forwarded-Encrypted: i=1; AJvYcCUMTjEY6pS3+i6n5BGg+ZxUB/g6BiuabvzXlhJJJ+mIGZ+935AomU8FY8w/Iq94EfwaI6fUK5JCvapeDgr0aE4n4GfVRwbTWwzc
X-Gm-Message-State: AOJu0YyhjjVK8gYD1zbijao/3nqHqsduubTmIKNrvIbo1KdfHwGqjCu8
	bef/kyeVb8cAuep0Cm2naaOdjYiDhECeWeo3dbhstwGIuMMSD0KcblwOXl50UYdl2DbdAV9q2+j
	Eb189P3scpq0kTBGpONPV9SdVnxTPTOta2fWv9Rt/Wq8M/xA1G9LJ
X-Google-Smtp-Source: AGHT+IFoZ2JtD0nzuPk1JDOyYu3XafLtHCC/ATImrDXxx5Kn5O4x+52ig5MfK6wpo/FtQ2DjRwqXrgAJ0qw6+FMXufg=
X-Received: by 2002:a05:6402:50a:b0:571:b9c7:2804 with SMTP id
 m10-20020a056402050a00b00571b9c72804mr47188edv.5.1713414722731; Wed, 17 Apr
 2024 21:32:02 -0700 (PDT)
MIME-Version: 1.0
References: <PAWPR08MB89828687CE183D4EC04DAAF483072@PAWPR08MB8982.eurprd08.prod.outlook.com>
 <CAN30aBGaKwqCYSEhpjL6LX0+T65dzgpKR4pkXTudWEs1qWQ68g@mail.gmail.com> <CAJgzZoox6GqTUwdjD0mqXs=A=O8YSm_qC_hzhtf10_egmAmrDw@mail.gmail.com>
In-Reply-To: <CAJgzZoox6GqTUwdjD0mqXs=A=O8YSm_qC_hzhtf10_egmAmrDw@mail.gmail.com>
From: Fangrui Song <maskray@google.com>
Date: Wed, 17 Apr 2024 21:31:49 -0700
Message-ID: <CAFP8O3LoUMJzf6ZfKKWKaBKsWaC8bkzDjFcbKS05tFQMT6iCaA@mail.gmail.com>
Subject: Re: CREL dynamic relocations
To: enh <enh@google.com>
Cc: Fangrui Song <maskray@gcc.gnu.org>, Wilco Dijkstra <Wilco.Dijkstra@arm.com>, 
	GNU C Library <libc-alpha@sourceware.org>, Florian Weimer <fweimer@redhat.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-14.7 required=5.0 tests=BAYES_00,BODY_8BITS,DKIMWL_WL_MED,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,ENV_AND_HDR_SPF_MATCH,KAM_INFOUSMEBIZ,KAM_STOCKGEN,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,USER_IN_DEF_DKIM_WL,USER_IN_DEF_SPF_WL autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <libc-alpha.sourceware.org>

On Fri, Apr 12, 2024 at 9:19=E2=80=AFAM enh <enh@google.com> wrote:
>
> On Wed, Apr 10, 2024 at 7:41=E2=80=AFPM Fangrui Song <maskray@gcc.gnu.org=
> wrote:
> >
> > Thank you for your interest in the CREL relocation format.
> >
> > On Tue, Apr 9, 2024 at 8:33=E2=80=AFAM Wilco Dijkstra <Wilco.Dijkstra@a=
rm.com> wrote:
> > > I like the general idea of more compact relocations, however what I d=
on't get is
> > > what the overall goal is. If the goal is more compact object files, w=
hy don't we just
> > > add a (de)compress pass using a fast compression algorithm? CPU time =
is cheap
> > > today, and real compression easily gives 2-4x reduction of object fil=
e size, far more
> > > than you could achieve by just compressing relocations.
> >
> > My primary goal is to make relocatable files smaller (see
> > https://sourceware.org/pipermail/binutils/2024-March/133229.html for a
> > few use cases).
> > Smaller files benefit applications in several ways, including smaller
> > I/O amount and lower linker memory usage (for linkers like gold, lld,
> > and mold that map input files into memory).
> >
> > Generic data compression formats (like zlib or zstd) applied at the
> > filesystem level won't achieve this goal because they don't decrease
> > memory usage.
> > In addition, filesystem compression does not appear too popular.
> >
> > Interestingly, I measured a 5.9% size reduction in .o files even after
> > zstd compression when comparing two Clang builds with and without
> > CREL.
> >
> >     % ruby -e 'require "zstd-ruby"; un=3Dcom=3D0;
> > Dir.glob("/tmp/out/s2-custom0/**/*.o").each{|f| x =3D
> > File.open(f,"rb"){|h|h.read}; un+=3Dx.size; com+=3DZstd.compress(x).siz=
e};
> > puts "uncompressed: #{un}\ncompressed: #{com}"'
> >     uncompressed: 136086784
> >     compressed: 37173381
> >
> >     % ruby -e 'require "zstd-ruby"; un=3Dcom=3D0;
> > Dir.glob("/tmp/out/s2-custom1/**/*.o").each{|f| x =3D
> > File.open(f,"rb"){|h|h.read}; un+=3Dx.size; com+=3DZstd.compress(x).siz=
e};
> > puts "uncompressed: #{un}\ncompressed: #{com}"'
> >     uncompressed: 111655952
> >     compressed: 34964421
> >
> >     1-111655952/136086784 ~=3D 18.0% (uncompressed)
> >     1-34964421/37173381 ~=3D 5.9%    (zstd)
> >
> > Another objective is to minimize the size of dynamic relocations.
> > Android achieves this through ld.lld --pack-dyn-relocs=3Dandroid+relr,
> > which compacts RELA relocations in their packed format.
> > While effective, CREL offers a simpler approach that delivers even
> > greater size reductions.
>
> yeah, though android+relr has the advantage of having already shipped,
> so it's usable on just about every device that app developers still
> support (api 23 --
> https://android.googlesource.com/platform/bionic/+/master/android-changes=
-for-ndk-developers.md#relative-relocations-relr
> -- versus the effective minimum of api 21) :-)
>
> something new today wouldn't be as widely usable on Android for about
> a decade, so although we'd probably support it if others did, to be
> really interesting for Android -- the point where we'd implement it
> even if no-one else did -- it'd have to be 2x better (in space or
> time, and not sacrificing the other, because app developers are very
> conscious of both) rather than the little bit better that it actually
> is. given our long lead time [for app developers to be able to rely on
> ubiquity] and our "good enough" solution, i'm actually more interested
> in the "what if we re-thought _everything_?" end of the spectrum than
> small tweaks here.
>
> what does mach-o do here? (that is: "why doesn't Apple care?".)
>
> > > Alternatively, if we wanted to access and process ELF files without a=
ny decompression,
> > > we could define compact relocations as fixed-size entries. Using 64 b=
its for a compact
> > > RELA relocation gives a straightforward 4x compression. Out of range =
values could
> > > use the next entry to extend the ranges.
> >
> > 64 bits are quite large. CREL typically needs just one to three bytes
> > for one relocation.
> > How do you design a format that is generic enough to work with all
> > relocation types and symbol indexes?
> >
> > > So my main issue with the proposal is that it tries too hard to compr=
ess relocations.
> > > For example using offset compression for relocations, symbol indices =
and even addends
> > > seems to have little value: the signed offset means you lose one bit,=
 and if out of range
> > > values are rare or not grouped together, offset encodings are actuall=
y less efficient.
> >
> > I actually use unsigned delta offset to save one bit but signed delta
> > symidx/addend.
> > I have analyzed how many bits are demanded by typical relocations.
> > Quote https://maskray.me/blog/2024-03-09-a-compact-relocation-format-fo=
r-elf#crel-relocation-format
> > :
> >
> >     Absolute symbol indexes allow one-byte encoding for symbols in the
> > range [0,128) and offer minor size advantage for static relocations
> > when the symbol table is sorted by usage frequency. Delta encoding, on
> > the other hand, might optimize for the scenario when the symbol table
> > presents locality: neighbor symbols are frequently mutually called.
> >
> >     Delta symbol index enables one-byte encoding for GOT/PLT dynamic
> > relocations when .got/.got.plt entries are ordered by symbol index.
> > For example, R_*_GLOB_DAT and R_*_JUMP_SLOT relocations can typically
> > be encoded with repeated 0x05 0x01 (when addend_bit=3D=3D0 && shift=3D=
=3D3,
> > offset++, symidx++). Delta encoding has a disvantage. It can partial
> > claim the optimization by arranging symbols in a "cold0 hot cold1"
> > pattern. In addition, delta symbol index enables one-byte encoding for
> > GOT/PLT dynamic relocations when .got/.got.plt entries are ordered by
> > symbol index.
> >
> >     In my experiments, absolute encoding with ULEB128 results in
> > slightly larger .o file sizes for both x86-64 and AArch64 builds.
> >
> > For a decoder that only supports in-reloc addends (recommended for
> > relocatable files), the C++ implementation is as simple as:
> >
> >   const auto hdr =3D decodeULEB128(p);
> >   const size_t count =3D hdr / 8, shift =3D hdr % 4;
> >   Elf_Addr offset =3D 0, addend =3D 0;
> >   uint32_t symidx =3D 0, type =3D 0;
> >   for (size_t i =3D 0; i !=3D count; ++i) {
> >     const uint8_t b =3D *p++;
> >     offset +=3D b >> 3;
> >     if (b >=3D 0x80) offset +=3D (decodeULEB128(p) << 4) - 0x10;
> >     if (b & 1) symidx +=3D decodeSLEB128(p);
> >     if (b & 2) type +=3D decodeSLEB128(p);
> >     if (b & 4) addend +=3D decodeSLEB128(p);
> >     rels[i] =3D {offset << shift, symidx, type, addend};
> >   }
> >
> > +=3D for all of symidx/type/addend is for consistency, but the choice
> > turns out to be very good as well.
> >
> > > I don't get the discussion about relocation numbers on AArch64 - 4 or=
 5 bits would
> > > handle all frequently used relocations, so we'd just remap them to fi=
t in the short
> > > encoding. Hence I don't see a need at all for a signed offset encodin=
g.
> >
> > The common static relocation types are within [257,313] (before
> > R_AARCH64_PLT32).
> > Delta encoding allows ~all but the first relocation's type to be
> > encoded in a single byte.
> >
> > How do you design a compression scheme without baked-in knowledge (dict=
ionary)?
> > We don't want the generic encoding scheme to hard code relocation type
> > range for each architecture.

PATH=3D/tmp/Rel/bin:$PATH
ld.lld @response.txt -z now --pack-dyn-relocs=3Drelr -o clang.relr
ld.lld @response.txt -z now --pack-dyn-relocs=3Dandroid+relr -o clang.relr+=
android
ld.lld @response.txt -z now --pack-dyn-relocs=3Drelr -z crel -o clang.relr+=
crel
llvm-objcopy --dump-section .rela.dyn=3Dreladyn clang.relr /dev/null
llvm-objcopy --dump-section .crel.dyn=3Dcreldyn clang.relr+crel /dev/null
llvm-objcopy --dump-section .rela.dyn=3Dandroiddyn clang.relr+android /dev/=
null
xz -fk reladyn androiddyn creldyn
zstd -fk reladyn androiddyn creldyn

% llvm-readelf -r clang.relr+crel | grep crel.dyn
Relocation section '.crel.dyn' at offset 0x5df318 contains 2907 entries:
% stat -c '%s %n' reladyn* androiddyn* creldyn*
69768 reladyn
4236 reladyn.xz
4449 reladyn.zst
4604 androiddyn
1484 androiddyn.xz
1481 androiddyn.zst
3980 creldyn
1344 creldyn.xz
1367 creldyn.zst
% xz --robot --list -vv androiddyn.xz | grep block
block   1       1       1       12      0       1448    4604    0.315
 CRC64   bb905bfac8a6ca11        12      --      1427    8454200
--lzma2=3Ddict=3D8MiB

There are 2907 relocation entries that are not RELATIVE or JUMP_SLOT,
which can be compacted with either Android's packed relocation format
or CREL.
Notably, zstd/LZMA2 filtering on RELA outperforms Android's packed
relocation format, while CREL outperforms zstd/LZMA2:)
As a byte-oriented format, CREL relocations can be further compressed.
(If CREL were made more bit-oriented, it might become uncompressible.)

However, achieving double the compression ratio of Android's format is
unrealistic for byte-oriented schemes where each relocation requires
at least one byte.
A byte-oriented format needs at least a bit more than 2907 bytes,
exceeding half of the original 4604 bytes.

An interesting observation: ld.lld still uses RELA for JUMP_SLOT
relocations even with --pack-dyn-relocs=3Dandroid+relr.
CREL handling JUMP_SLOT relocations potentially improves compression furthe=
r.

llvm-objcopy --dump-section .crel.plt=3Dcrelplt clang.relr+crel /dev/null
llvm-objcopy --dump-section .rela.plt=3Drelaplt clang.relr+android /dev/nul=
l
xz -fk relaplt crelplt
% stat -c '%s %n' *plt*
672 crelplt
116 crelplt.xz
7992 relaplt
628 relaplt.xz

In clang.relr+android, .rela.* sections consume 12596 bytes.
In clang.relr+crel, .crel.* sections consume 4652 bytes.

Since there is no dynamic loader implementation, I cannot comment on
the performance.
Intuitively CREL is simpler and should be faster.

> what does mach-o do here? (that is: "why doesn't Apple care?".)

Before chained fixups (WWDC2022), they use a state machine (think of
DWARF .debug_line) to set
segment_index/symbol_name/dylib_ordinal/offset/addend/type/etc
(https://github.com/apple-opensource/dyld/blob/e3f88907bebb8421f50f0943595f=
6874de70ebe0/src/ImageLoaderMachOCompressed.cpp#L1347).
The bind opcodes look like:

...
BIND_OPCODE_SET_SYMBOL_TRAILING_FLAGS_IMM
flags
symbol_name
(BIND_OPCODE_SET_TYPE_IMM|type)
BIND_OPCODE_SET_DYLIB_ORDINAL_IMM
dylib_ordinal
(BIND_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB | segment_idx)
offset
BIND_OPCODE_DO_BIND_ULEB_TIMES_SKIPPING_ULEB
opcode
consecutive_count
data
BIND_OPCODE_SET_ADDEND_SLEB
addend
BIND_OPCODE_DO_BIND_ADD_ADDR_ULEB
delta_addend
...

While this model is powerful, its complexity might not be ideal for ELF sys=
tems.
They encode symbol names/flags in place, dylib_ordinal (two-level
namespace), and segment index, complexity that ELF systems don't have.
Having opcodes in the stream makes their encoding likely less effecient.

---

Chained fixups are a new system.
They encode just the first relocations along with the relocation type
information in each page.
Subsequent relocations are encoded using the locations to be modified.

* It necessitates kernel understanding of dynamic linking data and
communication with the dynamic loader.
* The encoding repurposes the ELF "implicit addend" for page offsets,
limiting its use for larger addends. Some addend bits are stolen to
encode a singly linked list (dyld_chained_ptr_64_bind::next
dyld_chained_ptr_64_rebase::next).
  + For the rebase operation (ELF relative relocation), 36 least
significant bits and 8 most significant bits are encoded. This is
insufficient if we need a larger addend.
  + The page offset is muliple of 4, further losing generality.

This approach may not be suitable for 32-bit systems or those
requiring support for numerous relocation types, like ELF.
Additionally, the current encoding of initial relocations and the
import table seems less space-efficient. A CREL-like design could
potentially improve space utilization.


--=20
=E5=AE=8B=E6=96=B9=E7=9D=BF