public inbox for binutils@sourceware.org
 help / color / mirror / Atom feed
* More compact SFrames through deduplication
@ 2025-11-08 16:35 Tatsuyuki Ishi
  2025-11-09  1:21 ` Jose E. Marchesi
  2026-01-26 16:06 ` Tatsuyuki Ishi
  0 siblings, 2 replies; 8+ messages in thread
From: Tatsuyuki Ishi @ 2025-11-08 16:35 UTC (permalink / raw)
  To: Binutils
  Cc: Indu Bhagat, Fangrui Song, Florian Weimer, Michael Matz,
	Steven Rostedt, Matthias Klose, Sterling Augustine, Pavel Labath,
	Sam James, amonakov, Jose E. Marchesi

[-- Attachment #1: Type: text/plain, Size: 8411 bytes --]

[markdown version available at https://github.com/ishitatsuyuki/compact-frame/blob/master/pre-proposal.md] <https://github.com/ishitatsuyuki/compact-frame/blob/master/pre-proposal.md%5D>

As many have suggested, compiling with frame pointers [1] is the lowest overhead and least complicated way to profile on
Linux. But recently there have been lots of discussions for alternatives to this. My personal case is to unwind Windows
applications without recompiling them; notably, with perf's `-g dwarf` recording mode a userspace tool can later read PE
unwind info and unwind the stack accurately; and having a flamegraph has helped me fix countless performance issues.

Most recently there has been discussion on SFrame, which presents itself as a simpler alternative to .eh_frame, that
runs fast, does not require stepping on CFI opcodes, and simple enough to implement as an in-kernel unwinder. I am very
excited that we might finally be able to unwind without recompiling for frame pointers. (Note, there needs to be some
way to generate SFrame, but it seems conversion is very much possible from DWARF and even PE.)

Now, as MaskRay [2] and other have highlighted, the size overhead of SFrame is at least as much as that of .eh_frame,
and this is large enough that it's a valid reason for people to turn this off for embedded or mobile OSes. So reducing
size is an important goal if SFrame wants to be successful.

## The prior arts

There are a few prior arts that we can draw ideas from.

The first is Mach-O's compact unwind format [3], which is fairly similar to a FRE in SFrame (which are expressions to
calculate old_rsp and old_rbp directly), except that it throws away all the FREs for prologs and epilogs and treat the
entire function as function body. I personally see this as "SFrame but with less accuracy and smaller size".

More interesting is PE's x64 and arm64 unwind code format [4,5]. The x64 format is represented in terms of "opcodes",
i.e. bytecode that maps 1:1 to HW instructions that are executed in a prolog. This has a size benefit since instructions
like push, which generates multiple CFI updates, can be compactly represented. In arm64 this idea is taken further to
also encode the length of the instruction itself into the opcode. On the downside, I think this is a lot of
(arch-specific) semantics that needs to be implemented by the unwinder, and recovering the rule requires stepping
through the opcodes, which is slower than having the FRE directly. It also imposes restrictions on how compilers can
emit the prolog (e.g. LLVM has some Windows-specific paths [6]).

It's also worth noting that PE unwind codes can do in-kernel async unwinds as well as C++-compatible full unwinds, with
the same metadata. This saves space.

The most interesting idea from arm64 unwind code format is canonical prologs as a common-path optimization. In addition
to constraining the instructions, the compiler needs to push and pop registers in a very specific order. On the bright
side, the unwind info for the entire function can now be represented with just (# of GP regs saved) + (# of FP regs
saved) + (stack alloc size) + (function length). This all fits in 32 bits and is extremely compact.

## Chunking CFIs

With the diversity of toolchains we have on Linux (and absence of a pre-existing restriction on prologs), it's unlikely
we can define a canonical prolog format. But this made me wonder how diverse are things in practice. With some simple
dwarfdump experiments, it seems that each compiler does push register in a fairly consistent order. And there is way
fewer unique CFI patterns compared to the number of FDEs we have.

Here are some sample numbers. I took the AMD Vulkan driver from Mesa, because it's fast to compile and contains a
diverse set of libraries.

| Binary              | Compiler | Frame Pointer | `.text` (bytes) | `.eh_frame_hdr` (bytes) | `.eh_frame` (bytes) | FDE count | .text/FDE (bytes) | FRE count | Unique CFA states |
|---------------------|----------|---------------|-----------------|-------------------------|---------------------|-----------|-------------------|-----------|-------------------|
| libvulkan_radeon.so | GCC      | No            | 7,141,795       | 116,244                 | 646,512             | 14,529    | 491               | 103,178   | 912               |
| libvulkan_radeon.so | GCC      | Yes           | 7,312,627       | 116,244                 | 520,888             | 14,529    | 503               | 60,024    | 39                |
| libvulkan_radeon.so | Clang    | No            | 6,562,563       | 68,604                  | 484,104             | 8,574     | 765               | 91,608    | 619               |
| libvulkan_radeon.so | Clang    | Yes           | 6,663,139       | 68,604                  | 394,240             | 8,574     | 777               | 54,982    | 17                |

My main idea is to have a registry of CFI chunks that is deduplicated, and build up the main binary search table from
these chunks. Remember, the number of unique CFI pattern is very low, so we can spend lots of bytes in the CFI itself,
making it self-describing and not needing to impose complexity on the unwinder or restrictions on the compiler.

As an example, using x86 assembly:

```
[chunk A --- prolog]
    .cfi_def_cfa_offset 8
    push %<reg> ; callee-saved reg
    .cfi_def_cfa_offset (offset from retaddr)
    ...

    sub $STACK_SIZE, %rsp
[chunk B --- inside function]
    .cfi_def_cfa_offset (offset from retaddr)
    .cfi_offset %<reg>, (offset from rsp) ; for all registers
    
    ... ; function body

    add $STACK_SIZE, %rsp
[chunk C --- epilog]
    .cfi_def_cfa_offset (offset from retaddr)
    pop %reg ; callee-saved reg
    .cfi_def_cfa_offset (offset from retaddr)
    ...
```

Chunk A and C is independent of stack size and can be reused across functions easily. Chunk B depends on the (saved
register, stack size), so we keep the # of duplicated rows minimal this way.

In the end state it will be up to the compiler to decide on a chunking scheme that deduplicates well. But until then,
external tool seeking to chunk on their own can use a simple heuristic: chunk at where the stack is deepest.

## How can this be adopted in SFrame?

The hierarchy of SFrame currently looks like:

- Header
- FDEs (fixed size: initial addr, size, FRE offset, FRE size, other metadata)
- FREs (variable size)

We could repurpose Function Descriptor Entries (FDEs) into Chunk Descriptor Entries (CDEs), since both represent an
array of FREs. The main difference is that a chunk is not associated with a particular code address, nor it has a
fixed code size (e.g. in the above example, chunk B corresponds to function body and have variable size). Size can be
defined implicitly by treating the next initial address as the end of current chunk.

So the end result would be breaking up FDEs into a two-level indirection:

- Header
- Addr-chunk table (fixed size: initial addr, CDE index)
- CDEs (fixed size; FRE offset, FRE size, other metadata)
- FREs (variable size)

The addr-chunk is most size-sensitive. Let's assume for now we use 32 bits for addr and 16 bits for CDE index. Assuming
an average function is chunked into A+B+C+Gap, that would be four chunk entries in the table, or 24 bytes per function.

Let's compare this to current SFrame, consider a function that does not establish FP, does not clobber FP and save N
non-FP callee-saved registers. The size per function is `20+(2+1)*(2*N+3)` bytes. To get a very rough ballpark estimate,
at N=4 SFrame would be 53 bytes, and the compressed scheme would be less than half of that.

There are further size optimization ideas I want to pursue. But that would be another long writing, so let's first
discuss whether this is a direction we want to go for, and we can come up with more ideas later.

[1]: https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointers.html
[2]: https://maskray.me/blog/2025-09-28-remarks-on-sframe
[3]: https://faultlore.com/blah/compact-unwinding/
[4]: https://learn.microsoft.com/en-us/cpp/build/exception-handling-x64
[5]: https://learn.microsoft.com/en-us/cpp/build/arm64-exception-handling
[6]: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86FrameLowering.cpp

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: More compact SFrames through deduplication
  2025-11-08 16:35 More compact SFrames through deduplication Tatsuyuki Ishi
@ 2025-11-09  1:21 ` Jose E. Marchesi
  2025-11-10  8:28   ` Indu Bhagat
  2026-01-26 16:06 ` Tatsuyuki Ishi
  1 sibling, 1 reply; 8+ messages in thread
From: Jose E. Marchesi @ 2025-11-09  1:21 UTC (permalink / raw)
  To: Tatsuyuki Ishi
  Cc: Binutils, Indu Bhagat, Fangrui Song, Florian Weimer,
	Michael Matz, Steven Rostedt, Matthias Klose, Sterling Augustine,
	Pavel Labath, Sam James, amonakov, elena.zannoni


Adding Elena in CC.

> [markdown version available at
> https://github.com/ishitatsuyuki/compact-frame/blob/master/pre-proposal.md]
> <https://github.com/ishitatsuyuki/compact-frame/blob/master/pre-proposal.md%5D>
>
> As many have suggested, compiling with frame pointers [1] is the
> lowest overhead and least complicated way to profile on
> Linux. But recently there have been lots of discussions for
> alternatives to this. My personal case is to unwind Windows
> applications without recompiling them; notably, with perf's `-g dwarf`
> recording mode a userspace tool can later read PE
> unwind info and unwind the stack accurately; and having a flamegraph
> has helped me fix countless performance issues.
>
> Most recently there has been discussion on SFrame, which presents
> itself as a simpler alternative to .eh_frame, that
> runs fast, does not require stepping on CFI opcodes, and simple enough
> to implement as an in-kernel unwinder. I am very
> excited that we might finally be able to unwind without recompiling
> for frame pointers. (Note, there needs to be some
> way to generate SFrame, but it seems conversion is very much possible from DWARF and even PE.)
>
> Now, as MaskRay [2] and other have highlighted, the size overhead of
> SFrame is at least as much as that of .eh_frame,
> and this is large enough that it's a valid reason for people to turn
> this off for embedded or mobile OSes. So reducing
> size is an important goal if SFrame wants to be successful.
>
> ## The prior arts
>
> There are a few prior arts that we can draw ideas from.
>
> The first is Mach-O's compact unwind format [3], which is fairly
> similar to a FRE in SFrame (which are expressions to
> calculate old_rsp and old_rbp directly), except that it throws away
> all the FREs for prologs and epilogs and treat the
> entire function as function body. I personally see this as "SFrame but with less accuracy and smaller size".
>
> More interesting is PE's x64 and arm64 unwind code format [4,5]. The
> x64 format is represented in terms of "opcodes",
> i.e. bytecode that maps 1:1 to HW instructions that are executed in a
> prolog. This has a size benefit since instructions
> like push, which generates multiple CFI updates, can be compactly
> represented. In arm64 this idea is taken further to
> also encode the length of the instruction itself into the opcode. On the downside, I think this is a lot of
> (arch-specific) semantics that needs to be implemented by the unwinder, and recovering the rule requires stepping
> through the opcodes, which is slower than having the FRE directly. It
> also imposes restrictions on how compilers can
> emit the prolog (e.g. LLVM has some Windows-specific paths [6]).
>
> It's also worth noting that PE unwind codes can do in-kernel async
> unwinds as well as C++-compatible full unwinds, with
> the same metadata. This saves space.
>
> The most interesting idea from arm64 unwind code format is canonical
> prologs as a common-path optimization. In addition
> to constraining the instructions, the compiler needs to push and pop
> registers in a very specific order. On the bright
> side, the unwind info for the entire function can now be represented
> with just (# of GP regs saved) + (# of FP regs
> saved) + (stack alloc size) + (function length). This all fits in 32 bits and is extremely compact.
>
> ## Chunking CFIs
>
> With the diversity of toolchains we have on Linux (and absence of a
> pre-existing restriction on prologs), it's unlikely
> we can define a canonical prolog format. But this made me wonder how
> diverse are things in practice. With some simple
> dwarfdump experiments, it seems that each compiler does push register
> in a fairly consistent order. And there is way
> fewer unique CFI patterns compared to the number of FDEs we have.
>
> Here are some sample numbers. I took the AMD Vulkan driver from Mesa, because it's fast to compile and contains a
> diverse set of libraries.
>
> | Binary | Compiler | Frame Pointer | `.text` (bytes) |
> | `.eh_frame_hdr` (bytes) | `.eh_frame` (bytes) | FDE count |
> | .text/FDE (bytes) | FRE count | Unique CFA states |
> |---------------------|----------|---------------|-----------------|-------------------------|---------------------|-----------|-------------------|-----------|-------------------|
> | libvulkan_radeon.so | GCC | No | 7,141,795 | 116,244 | 646,512 |
> | 14,529 | 491 | 103,178 | 912 |
> | libvulkan_radeon.so | GCC | Yes | 7,312,627 | 116,244 | 520,888 |
> | 14,529 | 503 | 60,024 | 39 |
> | libvulkan_radeon.so | Clang | No | 6,562,563 | 68,604 | 484,104 |
> | 8,574 | 765 | 91,608 | 619 |
> | libvulkan_radeon.so | Clang | Yes | 6,663,139 | 68,604 | 394,240 |
> | 8,574 | 777 | 54,982 | 17 |
>
> My main idea is to have a registry of CFI chunks that is deduplicated,
> and build up the main binary search table from
> these chunks. Remember, the number of unique CFI pattern is very low,
> so we can spend lots of bytes in the CFI itself,
> making it self-describing and not needing to impose complexity on the unwinder or restrictions on the compiler.
>
> As an example, using x86 assembly:
>
> ```
> [chunk A --- prolog]
>     .cfi_def_cfa_offset 8
>     push %<reg> ; callee-saved reg
>     .cfi_def_cfa_offset (offset from retaddr)
>     ...
>
>     sub $STACK_SIZE, %rsp
> [chunk B --- inside function]
>     .cfi_def_cfa_offset (offset from retaddr)
>     .cfi_offset %<reg>, (offset from rsp) ; for all registers
>     
>     ... ; function body
>
>     add $STACK_SIZE, %rsp
> [chunk C --- epilog]
>     .cfi_def_cfa_offset (offset from retaddr)
>     pop %reg ; callee-saved reg
>     .cfi_def_cfa_offset (offset from retaddr)
>     ...
> ```
>
> Chunk A and C is independent of stack size and can be reused across
> functions easily. Chunk B depends on the (saved
> register, stack size), so we keep the # of duplicated rows minimal this way.
>
> In the end state it will be up to the compiler to decide on a chunking
> scheme that deduplicates well. But until then,
> external tool seeking to chunk on their own can use a simple heuristic: chunk at where the stack is deepest.
>
> ## How can this be adopted in SFrame?
>
> The hierarchy of SFrame currently looks like:
>
> - Header
> - FDEs (fixed size: initial addr, size, FRE offset, FRE size, other metadata)
> - FREs (variable size)
>
> We could repurpose Function Descriptor Entries (FDEs) into Chunk
> Descriptor Entries (CDEs), since both represent an
> array of FREs. The main difference is that a chunk is not associated with a particular code address, nor it has a
> fixed code size (e.g. in the above example, chunk B corresponds to
> function body and have variable size). Size can be
> defined implicitly by treating the next initial address as the end of current chunk.
>
> So the end result would be breaking up FDEs into a two-level indirection:
>
> - Header
> - Addr-chunk table (fixed size: initial addr, CDE index)
> - CDEs (fixed size; FRE offset, FRE size, other metadata)
> - FREs (variable size)
>
> The addr-chunk is most size-sensitive. Let's assume for now we use 32
> bits for addr and 16 bits for CDE index. Assuming
> an average function is chunked into A+B+C+Gap, that would be four
> chunk entries in the table, or 24 bytes per function.
>
> Let's compare this to current SFrame, consider a function that does
> not establish FP, does not clobber FP and save N
> non-FP callee-saved registers. The size per function is
> `20+(2+1)*(2*N+3)` bytes. To get a very rough ballpark estimate,
> at N=4 SFrame would be 53 bytes, and the compressed scheme would be less than half of that.
>
> There are further size optimization ideas I want to pursue. But that
> would be another long writing, so let's first
> discuss whether this is a direction we want to go for, and we can come up with more ideas later.
>
> [1]: https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointers.html
> [2]: https://maskray.me/blog/2025-09-28-remarks-on-sframe
> [3]: https://faultlore.com/blah/compact-unwinding/
> [4]: https://learn.microsoft.com/en-us/cpp/build/exception-handling-x64
> [5]: https://learn.microsoft.com/en-us/cpp/build/arm64-exception-handling
> [6]: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86FrameLowering.cpp

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: More compact SFrames through deduplication
  2025-11-09  1:21 ` Jose E. Marchesi
@ 2025-11-10  8:28   ` Indu Bhagat
  2025-11-10 16:26     ` Steven Rostedt
  0 siblings, 1 reply; 8+ messages in thread
From: Indu Bhagat @ 2025-11-10  8:28 UTC (permalink / raw)
  To: Jose E. Marchesi, Tatsuyuki Ishi
  Cc: Binutils, Fangrui Song, Florian Weimer, Michael Matz,
	Steven Rostedt, Matthias Klose, Sterling Augustine, Pavel Labath,
	Sam James, amonakov, elena.zannoni

Hi Tatsuyuki,

On 11/8/25 5:21 PM, Jose E. Marchesi wrote:
> 
> Adding Elena in CC.
> 
>> [markdown version available at
>> https://github.com/ishitatsuyuki/compact-frame/blob/master/pre-proposal.md]
>> <https://github.com/ishitatsuyuki/compact-frame/blob/master/pre-proposal.md%5D>
>>

Thanks for your interest in SFrame and for sharing your usecase.

>> As many have suggested, compiling with frame pointers [1] is the
>> lowest overhead and least complicated way to profile on
>> Linux. But recently there have been lots of discussions for
>> alternatives to this. My personal case is to unwind Windows
>> applications without recompiling them; notably, with perf's `-g dwarf`
>> recording mode a userspace tool can later read PE
>> unwind info and unwind the stack accurately; and having a flamegraph
>> has helped me fix countless performance issues.
>>
>> Most recently there has been discussion on SFrame, which presents
>> itself as a simpler alternative to .eh_frame, that
>> runs fast, does not require stepping on CFI opcodes, and simple enough
>> to implement as an in-kernel unwinder. I am very
>> excited that we might finally be able to unwind without recompiling
>> for frame pointers. (Note, there needs to be some
>> way to generate SFrame, but it seems conversion is very much possible from DWARF and even PE.)
>>
>> Now, as MaskRay [2] and other have highlighted, the size overhead of
>> SFrame is at least as much as that of .eh_frame,
>> and this is large enough that it's a valid reason for people to turn
>> this off for embedded or mobile OSes. So reducing
>> size is an important goal if SFrame wants to be successful.
>>
>> ## The prior arts
>>
>> There are a few prior arts that we can draw ideas from.
>>
>> The first is Mach-O's compact unwind format [3], which is fairly
>> similar to a FRE in SFrame (which are expressions to
>> calculate old_rsp and old_rbp directly), except that it throws away
>> all the FREs for prologs and epilogs and treat the
>> entire function as function body. I personally see this as "SFrame but with less accuracy and smaller size".
>>
>> More interesting is PE's x64 and arm64 unwind code format [4,5]. The
>> x64 format is represented in terms of "opcodes",
>> i.e. bytecode that maps 1:1 to HW instructions that are executed in a
>> prolog. This has a size benefit since instructions
>> like push, which generates multiple CFI updates, can be compactly
>> represented. In arm64 this idea is taken further to
>> also encode the length of the instruction itself into the opcode. On the downside, I think this is a lot of
>> (arch-specific) semantics that needs to be implemented by the unwinder, and recovering the rule requires stepping
>> through the opcodes, which is slower than having the FRE directly. It
>> also imposes restrictions on how compilers can
>> emit the prolog (e.g. LLVM has some Windows-specific paths [6]).
>>
>> It's also worth noting that PE unwind codes can do in-kernel async
>> unwinds as well as C++-compatible full unwinds, with
>> the same metadata. This saves space.
>>
>> The most interesting idea from arm64 unwind code format is canonical
>> prologs as a common-path optimization. In addition
>> to constraining the instructions, the compiler needs to push and pop
>> registers in a very specific order. On the bright
>> side, the unwind info for the entire function can now be represented
>> with just (# of GP regs saved) + (# of FP regs
>> saved) + (stack alloc size) + (function length). This all fits in 32 bits and is extremely compact.
>>
>> ## Chunking CFIs
>>
>> With the diversity of toolchains we have on Linux (and absence of a
>> pre-existing restriction on prologs), it's unlikely
>> we can define a canonical prolog format. But this made me wonder how
>> diverse are things in practice. With some simple
>> dwarfdump experiments, it seems that each compiler does push register
>> in a fairly consistent order. And there is way
>> fewer unique CFI patterns compared to the number of FDEs we have.
>>
>> Here are some sample numbers. I took the AMD Vulkan driver from Mesa, because it's fast to compile and contains a
>> diverse set of libraries.
>>
>> | Binary | Compiler | Frame Pointer | `.text` (bytes) |
>> | `.eh_frame_hdr` (bytes) | `.eh_frame` (bytes) | FDE count |
>> | .text/FDE (bytes) | FRE count | Unique CFA states |
>> |---------------------|----------|---------------|-----------------|-------------------------|---------------------|-----------|-------------------|-----------|-------------------|
>> | libvulkan_radeon.so | GCC | No | 7,141,795 | 116,244 | 646,512 |
>> | 14,529 | 491 | 103,178 | 912 |
>> | libvulkan_radeon.so | GCC | Yes | 7,312,627 | 116,244 | 520,888 |
>> | 14,529 | 503 | 60,024 | 39 |
>> | libvulkan_radeon.so | Clang | No | 6,562,563 | 68,604 | 484,104 |
>> | 8,574 | 765 | 91,608 | 619 |
>> | libvulkan_radeon.so | Clang | Yes | 6,663,139 | 68,604 | 394,240 |
>> | 8,574 | 777 | 54,982 | 17 |
>>
>> My main idea is to have a registry of CFI chunks that is deduplicated,
>> and build up the main binary search table from
>> these chunks. Remember, the number of unique CFI pattern is very low,
>> so we can spend lots of bytes in the CFI itself,
>> making it self-describing and not needing to impose complexity on the unwinder or restrictions on the compiler.
>>
>> As an example, using x86 assembly:
>>
>> ```
>> [chunk A --- prolog]
>>      .cfi_def_cfa_offset 8
>>      push %<reg> ; callee-saved reg
>>      .cfi_def_cfa_offset (offset from retaddr)
>>      ...
>>
>>      sub $STACK_SIZE, %rsp
>> [chunk B --- inside function]
>>      .cfi_def_cfa_offset (offset from retaddr)
>>      .cfi_offset %<reg>, (offset from rsp) ; for all registers
>>      
>>      ... ; function body
>>
>>      add $STACK_SIZE, %rsp
>> [chunk C --- epilog]
>>      .cfi_def_cfa_offset (offset from retaddr)
>>      pop %reg ; callee-saved reg
>>      .cfi_def_cfa_offset (offset from retaddr)
>>      ...
>> ```
>>
>> Chunk A and C is independent of stack size and can be reused across
>> functions easily. Chunk B depends on the (saved
>> register, stack size), so we keep the # of duplicated rows minimal this way.
>>

While it is true that we may see many functions using Chunk A and Chunk 
C, the number of chunks used per function, in practice (for at least 
x86_64) may be larger than 4, creating enough diversity, and hence a 
potentially large(r) "chunk-based index".

[I think you know this already, as you have done the measurements above :)]

>> In the end state it will be up to the compiler to decide on a chunking
>> scheme that deduplicates well. But until then,
>> external tool seeking to chunk on their own can use a simple heuristic: chunk at where the stack is deepest.
>>

IMHO making an optimizing compiler adapt optimization decisions/code 
generation for optimal chunking/compact stack tracing info should 
ideally be avoided.

That said, the good thing is that, ATM I dont see the need to influence 
the compiler for incorporating your suggestion of de-duplicating.

>> ## How can this be adopted in SFrame?
>>
>> The hierarchy of SFrame currently looks like:
>>
>> - Header
>> - FDEs (fixed size: initial addr, size, FRE offset, FRE size, other metadata)
>> - FREs (variable size)
>>
>> We could repurpose Function Descriptor Entries (FDEs) into Chunk
>> Descriptor Entries (CDEs), since both represent an
>> array of FREs. The main difference is that a chunk is not associated with a particular code address, nor it has a
>> fixed code size (e.g. in the above example, chunk B corresponds to
>> function body and have variable size). Size can be
>> defined implicitly by treating the next initial address as the end of current chunk.
>>
>> So the end result would be breaking up FDEs into a two-level indirection:
>>
>> - Header
>> - Addr-chunk table (fixed size: initial addr, CDE index)
>> - CDEs (fixed size; FRE offset, FRE size, other metadata)
>> - FREs (variable size)
>>
>> The addr-chunk is most size-sensitive. Let's assume for now we use 32
>> bits for addr and 16 bits for CDE index. Assuming
>> an average function is chunked into A+B+C+Gap, that would be four
>> chunk entries in the table, or 24 bytes per function.
>>

We too have observed that many of the FRE stack offsets are duplicates 
in a binary.  FRE data de-duplication was previously proposed by some 
folks too, but we did not evaluate this further.

https://sourceware.org/binutils/wiki/sframe/sframev3todo#Approach_2:_Deduplicated_stack_offsets

Given the interest in reducing the size of SFrame and your usecase, I 
think its worth revisiting.

IIUC, your proposal is on similar lines, just that the indexing of 
information is chunk-based rather than function-based.

>> Let's compare this to current SFrame, consider a function that does
>> not establish FP, does not clobber FP and save N
>> non-FP callee-saved registers. The size per function is
>> `20+(2+1)*(2*N+3)` bytes. To get a very rough ballpark estimate,
>> at N=4 SFrame would be 53 bytes, and the compressed scheme would be less than half of that.
>>

I dont understand the `20+(2+1)*(2*N+3)` bytes heuristic. Can you 
explain it ?

>> There are further size optimization ideas I want to pursue. But that
>> would be another long writing, so let's first
>> discuss whether this is a direction we want to go for, and we can come up with more ideas later.
>>

In principle, yes, improvements to SFrame as a stack tracing format are 
welcome and we are interested in evaluating them.

As for your current idea of chunk-based index with de-duplicated FRE 
data, I think it will be good to prototype this at some point and get 
some measurements.  I think we can prototype this such that the new 
layout is emitted at link-time in sframe_encoder_write () / 
sframe_encoder_write_sframe ().  SFrame data creation for PLT also 
invokes this path, but using --no-ld-generated-unwind-info should help 
for prototyping.

Before you prototype though, it will be good to discuss the exact 
representations that your proposal puts forth, to make sure the current 
flexibility of the format is not sacrificed.

Thanks,

>> [1]: https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointers.html
>> [2]: https://maskray.me/blog/2025-09-28-remarks-on-sframe
>> [3]: https://faultlore.com/blah/compact-unwinding/
>> [4]: https://learn.microsoft.com/en-us/cpp/build/exception-handling-x64
>> [5]: https://learn.microsoft.com/en-us/cpp/build/arm64-exception-handling
>> [6]: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86FrameLowering.cpp


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: More compact SFrames through deduplication
  2025-11-10  8:28   ` Indu Bhagat
@ 2025-11-10 16:26     ` Steven Rostedt
  0 siblings, 0 replies; 8+ messages in thread
From: Steven Rostedt @ 2025-11-10 16:26 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: Jose E. Marchesi, Tatsuyuki Ishi, Binutils, Fangrui Song,
	Florian Weimer, Michael Matz, Matthias Klose, Sterling Augustine,
	Pavel Labath, Sam James, amonakov, elena.zannoni

On Mon, 10 Nov 2025 00:28:31 -0800
Indu Bhagat <indu.bhagat@oracle.com> wrote:

> Before you prototype though, it will be good to discuss the exact 
> representations that your proposal puts forth, to make sure the current 
> flexibility of the format is not sacrificed.

I want to point out that Tatsuyuki will be at Linux Plumbers next month. I
know Indu will not be able to make it, but Jose and Elena will be there. We
could set up a hackroom through BBB, and if Indu registers for a virtual
pass, we could have a meeting there.

Anyone else that wants to join virtually, if they are not
going to be there physically, could also attend. The cost for a virtual
pass is $50.

-- Steve

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: More compact SFrames through deduplication
  2025-11-08 16:35 More compact SFrames through deduplication Tatsuyuki Ishi
  2025-11-09  1:21 ` Jose E. Marchesi
@ 2026-01-26 16:06 ` Tatsuyuki Ishi
  2026-01-26 16:18   ` Steven Rostedt
  2026-03-11  0:19   ` Indu Bhagat
  1 sibling, 2 replies; 8+ messages in thread
From: Tatsuyuki Ishi @ 2026-01-26 16:06 UTC (permalink / raw)
  To: Binutils
  Cc: Indu Bhagat, Fangrui Song, Florian Weimer, Michael Matz,
	Steven Rostedt, Matthias Klose, Sterling Augustine, Pavel Labath,
	Sam James, amonakov, Jose E. Marchesi

[-- Attachment #1: Type: text/plain, Size: 1652 bytes --]

I took the time to turn this into a full proposal:
https://github.com/ishitatsuyuki/binutils/blob/sframe2cframe/libsframe/packed-table.md

This is a PoC written with lots of assist from AI, but there is also a ./binutils/sframe2cframe tool
in the tree, that takes a binary with SFrame and outputs a CFrame formatted file.  When
--chunk-dwarf is specified it also does the heuristics chunking described in the proposal. The
heuristics requires parsing .eh_frame.

Using some locally created eh_frame to SFrame converter, I gathered the size numbers on
libvulkan_radeon.so (~7MB .text) for Clang and GCC generated binaries:

- Clang fp: SFrame 394702 bytes, CFrame 146670 bytes
- Clang no-fp: SFrame 601900 bytes, CFrame 153561 bytes
- GCC fp: SFrame 511711 bytes, CFrame 174045 bytes
- GCC no-fp: SFrame 824558 bytes, CFrame 398175 bytes

Deduplication performs worse on GCC, which is because it schedules prologs/epilogs into regular
instructions. That said, the packed scheme yields a net benefit.

The proposed design also puts the burden on the linker to pack up the table. As long as chunking is
done beforehand by the compiler, the linker does not need to parse the FRE. It only needs to
parse the FDE table to do the bit packing. Deduplication can be done on FRE strings blindly.

Let me know about your thoughts on the trade-offs.

Tatsuyuki Ishi

> On Nov 9, 2025, at 1:35, Tatsuyuki Ishi <ishitatsuyuki@gmail.com> wrote:
> 
> [markdown version available at https://github.com/ishitatsuyuki/compact-frame/blob/master/pre-proposal.md] <https://github.com/ishitatsuyuki/compact-frame/blob/master/pre-proposal.md%5D>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: More compact SFrames through deduplication
  2026-01-26 16:06 ` Tatsuyuki Ishi
@ 2026-01-26 16:18   ` Steven Rostedt
  2026-03-11  0:19   ` Indu Bhagat
  1 sibling, 0 replies; 8+ messages in thread
From: Steven Rostedt @ 2026-01-26 16:18 UTC (permalink / raw)
  To: Tatsuyuki Ishi
  Cc: Binutils, Indu Bhagat, Fangrui Song, Florian Weimer,
	Michael Matz, Matthias Klose, Sterling Augustine, Pavel Labath,
	Sam James, amonakov, Jose E. Marchesi

On Tue, 27 Jan 2026 01:06:16 +0900
Tatsuyuki Ishi <ishitatsuyuki@gmail.com> wrote:

> I took the time to turn this into a full proposal:
> https://github.com/ishitatsuyuki/binutils/blob/sframe2cframe/libsframe/packed-table.md
> 
> This is a PoC written with lots of assist from AI, but there is also a ./binutils/sframe2cframe tool

Hmm, I'd be interested in what AI prompts you used ;-)

> in the tree, that takes a binary with SFrame and outputs a CFrame formatted file.  When
> --chunk-dwarf is specified it also does the heuristics chunking described in the proposal. The
> heuristics requires parsing .eh_frame.
> 
> Using some locally created eh_frame to SFrame converter, I gathered the size numbers on
> libvulkan_radeon.so (~7MB .text) for Clang and GCC generated binaries:
> 
> - Clang fp: SFrame 394702 bytes, CFrame 146670 bytes
> - Clang no-fp: SFrame 601900 bytes, CFrame 153561 bytes
> - GCC fp: SFrame 511711 bytes, CFrame 174045 bytes
> - GCC no-fp: SFrame 824558 bytes, CFrame 398175 bytes

These are great results!

> 
> Deduplication performs worse on GCC, which is because it schedules prologs/epilogs into regular
> instructions. That said, the packed scheme yields a net benefit.
> 
> The proposed design also puts the burden on the linker to pack up the table. As long as chunking is
> done beforehand by the compiler, the linker does not need to parse the FRE. It only needs to
> parse the FDE table to do the bit packing. Deduplication can be done on FRE strings blindly.
> 
> Let me know about your thoughts on the trade-offs.

I'll let the compiler experts comment on it.

Thanks Tatsuyuki!

-- Steve

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: More compact SFrames through deduplication
  2026-01-26 16:06 ` Tatsuyuki Ishi
  2026-01-26 16:18   ` Steven Rostedt
@ 2026-03-11  0:19   ` Indu Bhagat
  2026-03-11 14:36     ` Steven Rostedt
  1 sibling, 1 reply; 8+ messages in thread
From: Indu Bhagat @ 2026-03-11  0:19 UTC (permalink / raw)
  To: Tatsuyuki Ishi, Binutils
  Cc: Fangrui Song, Florian Weimer, Michael Matz, Steven Rostedt,
	Matthias Klose, Sterling Augustine, Pavel Labath, Sam James,
	amonakov, Jose E. Marchesi

On 1/26/26 8:06 AM, Tatsuyuki Ishi wrote:
> I took the time to turn this into a full proposal:
> https://github.com/ishitatsuyuki/binutils/blob/sframe2cframe/libsframe/ 
> packed-table.md
> 
> This is a PoC written with lots of assist from AI, but there is also 
> a ./binutils/sframe2cframe tool
> in the tree, that takes a binary with SFrame and outputs a CFrame 
> formatted file.  When
> --chunk-dwarf is specified it also does the heuristics chunking 
> described in the proposal. The
> heuristics requires parsing .eh_frame.
> 
> Using some locally created eh_frame to SFrame converter, I gathered the 
> size numbers on
> libvulkan_radeon.so (~7MB .text) for Clang and GCC generated binaries:
> 
> - Clang fp: SFrame 394702 bytes, CFrame 146670 bytes
> - Clang no-fp: SFrame 601900 bytes, CFrame 153561 bytes
> - GCC fp: SFrame 511711 bytes, CFrame 174045 bytes
> - GCC no-fp: SFrame 824558 bytes, CFrame 398175 bytes
> 
> Deduplication performs worse on GCC, which is because it schedules 
> prologs/epilogs into regular
> instructions. That said, the packed scheme yields a net benefit.
> 
> The proposed design also puts the burden on the linker to pack up the 
> table. As long as chunking is
> done beforehand by the compiler, the linker does not need to parse the 
> FRE. It only needs to
> parse the FDE table to do the bit packing. Deduplication can be done on 
> FRE strings blindly.
> 
> Let me know about your thoughts on the trade-offs.
> 

Thanks for the well-documented POC and your proposal.

For benefit of readers and to ensure I got a grasp of your proposal, 
here are some notes (summary) from my understanding.  (BTW, Tests added 
in POC commit give a quick overview of expected CFrame for the curious 
reader).

* Design differences from SFrame V3
   Today’s SFrame V3 FDE model is conceptually simple:
     - each function has a fixed-size FDE entry,
     - the FDE gives you start PC, size, and where the FRE data lives,
     - then you parse the function’s FRE rows.  FRE data is the main 
carrier of the stack trace info (e.g., CFA offsets, stack offsets, 
register info etc).

The proposal replaces that with (CFrame):
   - bit-wise PC lookup tables (using outer_bits, inner_bits, bottom_trim),
   - an "outer table" that maps PC high bits to an inner-table buckets,
   - an "inner table" containing address boundaries and FDE indices (or 
GAP sentinels),
   - an "FDE index table" that maps logical FDE (Chunk) IDs to 
deduplicated FRE payload,
   - intensive bit-packing in the above structures. Each CFrame section 
defines the number of bits used for the specific tables above (based on 
the entropy of that section). E.g., sample output of a CFrame section:
     Bit parameters: bottom_trim=0, outer_bits=8, inner_bits=8
     Bit-packing: fde_idx=8, outer_start=10, outer_count=5, 
fre_offset=13, fre_count=5
   - Chunking a single function into multiple ranges to increase FRE 
payload reuse.  With the CHUNK_STRATEGY_CALLEE_SAVED strategy, the POC 
finds "interesting" sequences (put roughly: the stack increment/stack 
frame stable/stack decrement stubs) across functions.
   - De-duplication of these FDE chunks (stack increment/stack frame 
stable/stack decrement) across functions in a CFrame section.

PS: This is a substantial redesign to the specification, not an 
incremental refinement to SFrame V3.

* There is some clear value addition by the CFrame proposal:
   - Reduced size
   - Information lookup (using the two level index along with the FDE 
index) is not asymptotically worse for stacktracers.
   - Flexibility to add new FDE types for future needs is still 
maintained AFAICT.

* Workflow in POC: The POC parses eh_frame/debug_frame data to ascertain 
the split of a function into epilogue/body/prologue. Then deduplicates 
these "chunks" across functions.

* Miscellaneous critique points:
   - On first look, the explicit GAP sentinels make me a bit 
uncomfortable, but may be its just me. (As compared to SFrame's explicit 
data on size of function) I think it may become a problem when there is 
"sparse" layout of function data in .text* sections.
   - For a few test runs on sample libraries, I tried to compare:
     1. sframe2cframe -v --chunk-dwarf  --dump
     2. sframe2cframe -v --chunk-dwarf --chunk-strategy=none
     IIUC, #2 basically shows the benefit of bit-packing in the various 
lookup tables, and #1 shows the benefit of bit-packing + de-dup of FRE 
payloads.  Is that correct ?  If so, I #2 brings a major chunk of 
benefits as compared to #1.  Is that consistent with your findings too ?

Now, at this time, my two main concerns are:

* Binary incompatibility with SFrame V3
   As I mentioned above, this breaks binary compatibility with SFrame 
V3.  CFrame is a substantial redesign of the information layout. 
Readers of SFrame V3 will need a new decoder altogether, given the 
differences in information layout (in lookup tables, FDE index).

* Implementation Complexity in toolchains
   - Most of the unknowns/risks lie here ATM, as you point out.
   - While the compiler is likely the most natural place to identify 
semantically meaningful chunk boundaries (stack increment/stack frame 
stable/stack decrement), its the linker that is best suited to a) create 
that bit-packed, two level lookup table with FDE index and b) 
deduplicate chunks across functions.
   - Next, the linker must be able to see the mapping of chunks for each 
function/symbol to implement garbage collection, ODR needing to discard 
functions from output, etc.  Handling relocations also means its best to 
avoid the CFrame index creation until link time.
   - CFrame is not relocation friendly. The input to linker needs to be 
"relocation friendly". CFrame as is cannot be fed to the linker (the 
compact index of de-duped entries is one of the problem for linker 
input).  This means that either a distinct relocatable representation 
would be needed for ET_REL, and/or CFrame would be restricted to final 
linked outputs (ET_DYN/ET_EXEC). In either case, the toolchain contract 
does become more complicated than for current SFrame V3 IMO.
   - Although obvious, but as we can see, the above points suggest a 
newfound need for additional CFI directives as the new interface between 
compiler and assembler (to identify the chunks).

Thoughts on managing these aspects of additional work/complexity (and 
binary incompatibility with SFrame V3) ?

> Tatsuyuki Ishi
> 
>> On Nov 9, 2025, at 1:35, Tatsuyuki Ishi <ishitatsuyuki@gmail.com> wrote:
>>
>> [markdown version available at https://github.com/ishitatsuyuki/ 
>> compact-frame/blob/master/pre-proposal.md] <https://github.com/ 
>> ishitatsuyuki/compact-frame/blob/master/pre-proposal.md%5D>
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: More compact SFrames through deduplication
  2026-03-11  0:19   ` Indu Bhagat
@ 2026-03-11 14:36     ` Steven Rostedt
  0 siblings, 0 replies; 8+ messages in thread
From: Steven Rostedt @ 2026-03-11 14:36 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: Tatsuyuki Ishi, Binutils, Fangrui Song, Florian Weimer,
	Michael Matz, Matthias Klose, Sterling Augustine, Pavel Labath,
	Sam James, amonakov, Jose E. Marchesi

On Tue, 10 Mar 2026 17:19:28 -0700
Indu Bhagat <indu.bhagat@oracle.com> wrote:

Hi Indu,

> Thanks for the well-documented POC and your proposal.
> 
> For benefit of readers and to ensure I got a grasp of your proposal, 
> here are some notes (summary) from my understanding.  (BTW, Tests added 
> in POC commit give a quick overview of expected CFrame for the curious 
> reader).

Thanks for the detailed summary.

> 
> * Design differences from SFrame V3
>    Today’s SFrame V3 FDE model is conceptually simple:
>      - each function has a fixed-size FDE entry,
>      - the FDE gives you start PC, size, and where the FRE data lives,
>      - then you parse the function’s FRE rows.  FRE data is the main 
> carrier of the stack trace info (e.g., CFA offsets, stack offsets, 
> register info etc).
> 
> The proposal replaces that with (CFrame):
>    - bit-wise PC lookup tables (using outer_bits, inner_bits, bottom_trim),
>    - an "outer table" that maps PC high bits to an inner-table buckets,
>    - an "inner table" containing address boundaries and FDE indices (or 
> GAP sentinels),
>    - an "FDE index table" that maps logical FDE (Chunk) IDs to 
> deduplicated FRE payload,
>    - intensive bit-packing in the above structures. Each CFrame section 
> defines the number of bits used for the specific tables above (based on 
> the entropy of that section). E.g., sample output of a CFrame section:
>      Bit parameters: bottom_trim=0, outer_bits=8, inner_bits=8
>      Bit-packing: fde_idx=8, outer_start=10, outer_count=5, 
> fre_offset=13, fre_count=5
>    - Chunking a single function into multiple ranges to increase FRE 
> payload reuse.  With the CHUNK_STRATEGY_CALLEE_SAVED strategy, the POC 
> finds "interesting" sequences (put roughly: the stack increment/stack 
> frame stable/stack decrement stubs) across functions.
>    - De-duplication of these FDE chunks (stack increment/stack frame 
> stable/stack decrement) across functions in a CFrame section.
> 
> PS: This is a substantial redesign to the specification, not an 
> incremental refinement to SFrame V3.

Right. My opinion is to keep moving forward on Sframe V3. This proposal may
need to be done separately and perhaps under a different section ("cframe"?).

I'm not a tool chain developer so I'm not sure how hard it would be to have
support for multiple stack walkers, but it shouldn't be too much of a
burden in the kernel. The unwind logic is made to handle multiple stack walkers.

> 
> * There is some clear value addition by the CFrame proposal:
>    - Reduced size

>    - Information lookup (using the two level index along with the FDE 
> index) is not asymptotically worse for stacktracers.

Yeah, as long as the stack tracer doesn't have to interpret code (DWARF),
multi-layered indexing should not be an issue. Unless it is determined to
be reading too much of user space (requiring more page faults).

>    - Flexibility to add new FDE types for future needs is still 
> maintained AFAICT.
> 
> * Workflow in POC: The POC parses eh_frame/debug_frame data to ascertain 
> the split of a function into epilogue/body/prologue. Then deduplicates 
> these "chunks" across functions.
> 
> * Miscellaneous critique points:
>    - On first look, the explicit GAP sentinels make me a bit 
> uncomfortable, but may be its just me. (As compared to SFrame's explicit 
> data on size of function) I think it may become a problem when there is 
> "sparse" layout of function data in .text* sections.
>    - For a few test runs on sample libraries, I tried to compare:
>      1. sframe2cframe -v --chunk-dwarf  --dump
>      2. sframe2cframe -v --chunk-dwarf --chunk-strategy=none
>      IIUC, #2 basically shows the benefit of bit-packing in the various 
> lookup tables, and #1 shows the benefit of bit-packing + de-dup of FRE 
> payloads.  Is that correct ?  If so, I #2 brings a major chunk of 
> benefits as compared to #1.  Is that consistent with your findings too ?
> 
> Now, at this time, my two main concerns are:
> 
> * Binary incompatibility with SFrame V3
>    As I mentioned above, this breaks binary compatibility with SFrame 
> V3.  CFrame is a substantial redesign of the information layout. 
> Readers of SFrame V3 will need a new decoder altogether, given the 
> differences in information layout (in lookup tables, FDE index).
> 
> * Implementation Complexity in toolchains
>    - Most of the unknowns/risks lie here ATM, as you point out.
>    - While the compiler is likely the most natural place to identify 
> semantically meaningful chunk boundaries (stack increment/stack frame 
> stable/stack decrement), its the linker that is best suited to a) create 
> that bit-packed, two level lookup table with FDE index and b) 
> deduplicate chunks across functions.
>    - Next, the linker must be able to see the mapping of chunks for each 
> function/symbol to implement garbage collection, ODR needing to discard 
> functions from output, etc.  Handling relocations also means its best to 
> avoid the CFrame index creation until link time.
>    - CFrame is not relocation friendly. The input to linker needs to be 
> "relocation friendly". CFrame as is cannot be fed to the linker (the 
> compact index of de-duped entries is one of the problem for linker 
> input).  This means that either a distinct relocatable representation 
> would be needed for ET_REL, and/or CFrame would be restricted to final 
> linked outputs (ET_DYN/ET_EXEC). In either case, the toolchain contract 
> does become more complicated than for current SFrame V3 IMO.
>    - Although obvious, but as we can see, the above points suggest a 
> newfound need for additional CFI directives as the new interface between 
> compiler and assembler (to identify the chunks).
> 
> Thoughts on managing these aspects of additional work/complexity (and 
> binary incompatibility with SFrame V3) ?

This Cframe idea looks promising, but as Indu pointed out, it has a lot of
hurdles to overcome before it can be added to production. It may be years
before something like his could be used. But I do like the brainstorming
of ideas ;-)

-- Steve

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-03-11 14:36 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-08 16:35 More compact SFrames through deduplication Tatsuyuki Ishi
2025-11-09  1:21 ` Jose E. Marchesi
2025-11-10  8:28   ` Indu Bhagat
2025-11-10 16:26     ` Steven Rostedt
2026-01-26 16:06 ` Tatsuyuki Ishi
2026-01-26 16:18   ` Steven Rostedt
2026-03-11  0:19   ` Indu Bhagat
2026-03-11 14:36     ` Steven Rostedt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).