public inbox for libabigail@sourceware.org
 help / color / mirror / Atom feed
From: Giuliano Procida <gprocida@google.com>
To: Ben Woodard <woodard@redhat.com>
Cc: Giuliano Procida via Libabigail <libabigail@sourceware.org>,
	Matthias Maennich <maennich@google.com>,
	kernel-team@android.com
Subject: Re: [RFC PATCH] Add an experimental ABI pruning utility.
Date: Fri, 12 Mar 2021 16:51:37 +0000	[thread overview]
Message-ID: <CAGvU0H=ou0cEP5Z=-WNOdAfHvggKE50eKpmxX2tkDsURkSEPQQ@mail.gmail.com> (raw)
In-Reply-To: <BD4291B2-6521-41BC-A1A2-AD74C74634F9@redhat.com>

Hi.

On Thu, 11 Mar 2021 at 20:39, Ben Woodard <woodard@redhat.com> wrote:

> I think that you have pointed out a problem that I’ve been discussing with
> some users that I’m working with. The sheer volume of the abixml output is
> pretty extreme. I don’t think that they were really expecting it. However,
> they are not running into the particular situation that you are seeing
> since they don’t have the coding model of the problem library. They are
> more likely to fall in the 5-10% range that you reported above.
>
> We have been talking about it in terms of “ABI Surface” vs. “ABI Corpus”.
> The ELF reachability aspect that you point out seems to be part of the
> difference between those two concepts.
>
>
We want a distilled description of the ABI (the surface), not what may lie
on the other side. More precisely, we want a typed interface where "type"
encompasses extra things like architecture, calling convention, ISA, ELF
symbol properties etc.

One worry is that just chasing references from the exported ELF symbols may
somehow miss something important, say for libraries that ship with bits of
implementation in header files.


> > On Mar 11, 2021, at 3:53 AM, Giuliano Procida via Libabigail <
> libabigail@sourceware.org> wrote:
> >
> > In response to internal reports about very large ABIs for libraires
> > implemented in C++ but exposing only a C API, I wrote a script as a
> > means of investigating the issue.
> >
> > The script aims to remove parts of the XML that are irrelevant to
> > abidiff. It has been run on the current libabigail tests as well as
> > some locally-generated ABI files.
> >
> > On most XML files, this results in a modest saving (5-10%). However,
> > for the library that sparked the invesigaton, the resulting XML was
> > more than 2300 times smaller.
> >
> > The script may have some merit on its own but perhaps should be
> > implemented differently.
> >
> > Question: How hard would it be to implement this functionality (under
> > flag control) within libabigail itself?
>
> I can’t directly answer “how hard” that is one of the things which we
> would discover by actually doing it. As for should it be done. I think it
> should be carefully looked into. However, as for priority since it is an
> optimization rather than a functional problem, if it were me, I would file
> this as a bug and put examples of the test data and results you are seeing
> as well as the script below in that bz and then start working on the
> process of refining libabigail’s output so that it matches the results of
> your script.
>
>
It's an optimisation analogous to DCE in a compiler. I'll certainly follow
up with a bug. It's one library in particular that is ~50M before pruning
and ~22k after.

I've already made changes to the script so that it can be given a list of
symbols and produce a reduced ABI. At the moment, we achieve this by
passing a list to abidw (with -w). In fact, we extract full and reduced
ABIs concurrently in the Android kernel build as each run takes a while. It
should be straightforward to reimplement the script in C++, though it's not
too slow already.

In terms of implementation, it would need to work with libabigail IR and
ir_node_visitor might be the right place to start.

Giuliano.


> -ben
>
> >
> > A couple of tangential things worth mentioning (see the comments at
> > the very beginning and very end of the script):
> >
> > - abidiff rejects XML files with an XML declaration at the top
> > - abidiff rejects completely empty files
> >
> > Resolving the first would make the second moot.
> >
> >       * scripts/abiprune.pl: Add script to prune ABI XML.
> >
> > Signed-off-by: Giuliano Procida <gprocida@google.com>
> > ---
> > scripts/abiprune.pl | 323 ++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 323 insertions(+)
> > create mode 100755 scripts/abiprune.pl
> >
> > diff --git a/scripts/abiprune.pl b/scripts/abiprune.pl
> > new file mode 100755
> > index 00000000..8e04a273
> > --- /dev/null
> > +++ b/scripts/abiprune.pl
> > @@ -0,0 +1,323 @@
> > +#!/usr/bin/perl
> > +
> > +# This script is intended to consume libabigail ABI XML as generated
> > +# by abidw and produce a possibly smaller representation that captures
> > +# the same ABI. In particular, the output should be such that abidiff
> > +# --harmless reports no differences (or is empty).
> > +#
> > +# It does this by modelling symbols and types dependencies as a
> > +# directed graph, marking nodes reachable from the ELF symbols and
> > +# dropping unreachable nodes.
> > +
> > +use strict;
> > +use warnings;
> > +no warnings 'recursion';
> > +
> > +use autodie;
> > +
> > +use XML::LibXML;
> > +
> > +# Making a graph from ABI XML.
> > +#
> > +# The following are the types of "node" we care about. The "edges" are
> > +# obtained from a few XML attributes as well as via XML element
> > +# containment.
> > +#
> > +#  ELF (exported) symbols
> > +#
> > +#   elf-symbol (has a name; the code here currently knows nothing
> > +#     about aliases)
> > +#
> > +#  Declarations (that mention a symbol)
> > +#
> > +#   These live in a scope. In C++ scopes can be nested and include
> > +#   namespaces and class types.
> > +#
> > +#  var-decl (also used for member variables)
> > +#    elf-symbol linked to symbol via mangled-name
> > +#    type-id links to a type
> > +#  function-decl (also used for member functions)
> > +#    elf-symbol linked to symbol via mangled-name
> > +#    parameter and return type-ids link to types
> > +#      (alas, not just a link to a function type)
> > +#
> > +#  Types
> > +#
> > +#   These occupy pretty much all the other elements, besides those
> > +#   that act as simple containers.
> > +#
> > +# XML elements and their roles in more detail
> > +#
> > +# ELF
> > +#
> > +# elf-needed - container
> > +#  dependency - names a library
> > +# elf-function-symbols - contains a list of symbols
> > +# elf-variable-symbols - contains a list of symbols
> > +#  elf-symbol - describes an ELF variable or function
> > +#
> > +# Grouping and scoping
> > +#
> > +# abi-corpus-group
> > +#  abi-corpus
> > +#   abi-instr - compilation unit containers
> > +#    namespace-decl - pure container, possibly named
> > +#
> > +# Types (some introduce scopes, only in C++)
> > +#
> > +# type-decl - defines a primitive type
> > +# typedef-decl - defines a type, links to a type
> > +# qualified-type-def - defines a type, links to a type
> > +# pointer-type-def - defines a type, links to a type
> > +# reference-type-def - defines a type, links to a type
> > +# array-type-def - defines a (multidimensional array) type, refers to
> element type, contains subranges
> > +#  subrange - contains array length, refers to element type; defines
> types (never referred to; duplicated)
> > +# function-type - defines a type
> > +#  parameter - belongs to function-type and -decl, links to a type
> > +#  return - belongs to function-type and -decl, links to a type
> > +# enum-decl - defines a type, names it, contains a list of enumerators
> and an underlying-type
> > +#  underlying-type - belongs to enum-decl
> > +#  enumerator - belongs to enum-decl
> > +# union-decl - defines and names a type, contains member elements
> linked to other things
> > +# class-decl - defines and names a type, contains base type, member
> elements linking to other things
> > +#  base-class - belongs to class-decl
> > +#  data-member - container for a member; holds access level
> > +#  member-function - container for a member; holds access level
> > +#  member-type - container for a type declaration; holds access level
> > +#  member-template - container for a (function?) template declaration;
> holds access level
> > +#
> > +# Higher order Things
> > +#
> > +# class-template-decl - defines a "type", but without instantiation
> this isn't usable
> > +# function-template-decl - defines a "type", but without instantiation
> this isn't usable
> > +#  template-type-parameter - defines a type (formal variable), perhaps
> one which should be excluded from the real type graph
> > +#  template-non-type-parameter - names a template parameter, links to a
> type
> > +#  template-parameter-type-composition - container?
> > +#
> > +# Values
> > +#
> > +# function-decl - names a function, can link to a symbol
> > +#     has same children as function-type, rather than linking to a type
> > +# var-decl - names a variable, can link to a symbol, links to a type
> > +
> > +# Graph nodes.
> > +my %symbols;
> > +my %exported_symbols;
> > +my %types;
> > +# Graph edges.
> > +my %type_type_edges;
> > +my %symbol_type_edges;
> > +my %type_symbol_edges;
> > +
> > +# Keep track of the types being defined.
> > +my @id_stack;
> > +# Keep track of the current (value) declaration.
> > +my @symbol_stack;
> > +sub process($);
> > +sub process($) {
> > +  my ($node) = @_;
> > +
> > +  # The XML attributes we care about.
> > +  my $name;
> > +  my $id;
> > +  my $type_id;
> > +  my $symbol;
> > +  my $naming_typedef_id;
> > +
> > +  # Not every node we encounter is an XML element.
> > +  if ($node->nodeType == XML_ELEMENT_NODE) {
> > +    $name = $node->getAttribute('name');
> > +    $id = $node->getAttribute('id');
> > +    $type_id = $node->getAttribute('type-id');
> > +    $symbol = $node->getAttribute('mangled-name');
> > +    $naming_typedef_id = $node->getAttribute('naming-typedef-id');
> > +    die if defined $id && defined $symbol;
> > +  }
> > +
> > +  if (defined $name && $node->getName() eq 'elf-symbol') {
> > +    $exported_symbols{$name} = undef;
> > +    # Early return is safe, but not necessary.
> > +    return;
> > +  }
> > +
> > +  if (defined $id) {
> > +    # This element defines a type (but there may be more than one
> > +    # defining the same type - we cannot rely on uniqueness).
> > +    $types{$id} = undef;
> > +    if (defined $naming_typedef_id) {
> > +      # This is an odd one, there can be a backwards link from an
> > +      # anonymous type to the typedef that refers to it, so we need to
> > +      # pull in the typedef, even if nothing else refers to it.
> > +      $type_type_edges{$id}{$naming_typedef_id}{naming} = undef;
> > +    }
> > +    if (@id_stack) {
> > +      # Type parent<->child dependencies; record dependencies both
> > +      # ways to avoid making holes in the XML types.
> > +      $type_type_edges{$id_stack[-1]}{$id}{contains} = undef;
> > +      $type_type_edges{$id}{$id_stack[-1]}{in} = undef;
> > +    }
> > +    push @id_stack, $id;
> > +  }
> > +
> > +  if (defined $symbol) {
> > +    # This element is a declaration linked to a symbol (whether or not
> > +    # exported).
> > +    $symbols{$symbol} = undef;
> > +    if (@id_stack) {
> > +      # The enclosing scope may include a C++ class, in which case we
> > +      # need to record the dependency.
> > +      $type_symbol_edges{$id_stack[-1]}{$symbol} = undef;
> > +    }
> > +    # The symbol depends on the types mentioned in this element, so
> > +    # record it.
> > +    push @symbol_stack, $symbol;
> > +    die "nested symbols: @symbol_stack\n" if scalar @symbol_stack > 1;
> > +  }
> > +
> > +  if (defined $type_id) {
> > +    if (@id_stack) {
> > +      # The enclosing type refers to this type, record the dependency.
> > +      $type_type_edges{$id_stack[-1]}{$type_id}{depends} = undef;
> > +    }
> > +    if (@symbol_stack) {
> > +      # The enclosing declaration (linked to a symbol) refers to this
> > +      # type, record the dependency.
> > +      $symbol_type_edges{$symbol_stack[-1]}{$type_id} = undef;
> > +    }
> > +  }
> > +
> > +  for my $child ($node->nonBlankChildNodes()) {
> > +    process($child);
> > +  }
> > +
> > +  if (defined $symbol) {
> > +    pop @symbol_stack;
> > +  }
> > +  if (defined $id) {
> > +    pop @id_stack;
> > +  }
> > +}
> > +
> > +# Load the XML.
> > +my $dom = XML::LibXML->load_xml(IO => \*STDIN);
> > +# Build a graph.
> > +process($dom);
> > +die if @id_stack or @symbol_stack;
> > +
> > +# DFS visited state.
> > +my %seen_types;
> > +my %seen_symbols;
> > +sub dfs_type($);
> > +sub dfs_symbol($);
> > +
> > +sub dfs_type($) {
> > +  my ($type) = @_;
> > +  return if exists $seen_types{$type};
> > +  $seen_types{$type} = undef;
> > +
> > +  my $types = $type_type_edges{$type};
> > +  if (defined $types) {
> > +    for my $type (keys %$types) {
> > +      for my $link (keys %{$types->{$type}}) {
> > +        if (exists $types->{$type}{$link}) {
> > +          dfs_type($type);
> > +        }
> > +      }
> > +    }
> > +  }
> > +  my $symbols = $type_symbol_edges{$type};
> > +  if (defined $symbols) {
> > +    for my $symbol (keys %$symbols) {
> > +      dfs_symbol($symbol);
> > +    }
> > +  }
> > +}
> > +
> > +sub dfs_symbol($) {
> > +  my ($symbol) = @_;
> > +  return if exists $seen_symbols{$symbol};
> > +  $seen_symbols{$symbol} = undef;
> > +
> > +  my $types = $symbol_type_edges{$symbol};
> > +  if (defined $types) {
> > +    for my $type (keys %$types) {
> > +      dfs_type($type);
> > +    }
> > +  } else {
> > +    warn "no declaration found for exported symbol $symbol\n";
> > +  }
> > +}
> > +
> > +# Traverse the graph, using the exported symbols as the roots.
> > +for my $symbol (keys %exported_symbols) {
> > +  dfs_symbol($symbol);
> > +}
> > +
> > +# Useful counts.
> > +sub print_report() {
> > +  my $count_symbols = scalar keys %symbols;
> > +  my $count_exported_symbols = scalar keys %exported_symbols;
> > +  my $count_types = scalar keys %types;
> > +  my $count_reachable_types = scalar keys %seen_types;
> > +
> > +  print STDERR qq{symbols = $count_symbols
> > +exported symbols = $count_exported_symbols
> > +types = $count_types
> > +reachable types = $count_reachable_types
> > +};
> > +}
> > +
> > +my %drop_if_empty = map { $_ => undef } qw(
> > +  namespace-decl
> > +  abi-instr
> > +  abi-corpus
> > +  abi-corpus-group
> > +);
> > +
> > +# This is another DFS traversal of the XML.
> > +sub prune($);
> > +sub prune($) {
> > +  my ($node) = @_;
> > +
> > +  my $name = $node->getName();
> > +  my $id;
> > +  my $symbol;
> > +
> > +  if ($node->nodeType == XML_ELEMENT_NODE) {
> > +    $id = $node->getAttribute('id');
> > +    $symbol = $node->getAttribute('mangled-name');
> > +    die if defined $id && defined $symbol;
> > +  }
> > +
> > +  # Return if we know that this is a type or declaration to keep or
> > +  # drop in its entirety.
> > +  if (defined $id) {
> > +    return !exists $seen_types{$id};
> > +  }
> > +  if ($name eq 'var-decl' || $name eq 'function-decl') {
> > +    return !defined $symbol || !exists $exported_symbols{$symbol};
> > +  }
> > +
> > +  # Otherwise, this is not a type, declaration or part thereof, so
> > +  # process child elements.
> > +  my $empty = 1;
> > +  for my $child ($node->nonBlankChildNodes()) {
> > +    if (prune($child)) {
> > +      $node->removeChild($child);
> > +    } else {
> > +      $empty = 0;
> > +    }
> > +  }
> > +
> > +  # Empty container elements can be dropped.
> > +  return $empty && exists $drop_if_empty{$name};
> > +}
> > +
> > +# Drop unreachable nodes and any container nodes that end up empty.
> > +prune($dom);
> > +my $out = $dom->toString();
> > +# Remove blank line debris.
> > +$out =~ s;^ *\n;;mg;
> > +# Remove the XML declaration as abidiff doesn't like it.
> > +$out =~ s;^<\?xml .*\?>\n;;m;
> > +print $out;
> > --
> > 2.31.0.rc2.261.g7f71774620-goog
> >
>
>

  reply	other threads:[~2021-03-12 16:52 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-11 11:53 Giuliano Procida
2021-03-11 20:39 ` Ben Woodard
2021-03-12 16:51   ` Giuliano Procida [this message]
2021-03-12 18:41     ` Ben Woodard
2021-03-15 11:08       ` Giuliano Procida
2021-03-12 16:59 ` [RFC PATCH v2] " Giuliano Procida
2021-03-16 16:55   ` [RFC PATCH v3] " Giuliano Procida
2021-03-25 21:51     ` [RFC PATCH 0/9] Utility to manipulate ABI XML Giuliano Procida
2021-03-25 21:51       ` [RFC PATCH 1/9] Add ABI tidying utility Giuliano Procida
2021-03-25 21:51       ` [RFC PATCH 2/9] Add pass to drop empty XML elements Giuliano Procida
2021-03-25 21:51       ` [RFC PATCH 3/9] Add pass to prune unreachable parts of the ABI Giuliano Procida
2021-03-25 21:51       ` [RFC PATCH 4/9] Add pass to filter symbols Giuliano Procida
2021-03-25 21:51       ` [RFC PATCH 5/9] Add pass to normalise anonymous type names Giuliano Procida
2021-03-25 21:51       ` [RFC PATCH 6/9] Add pass to report duplicate type ids Giuliano Procida
2021-03-25 21:51       ` [RFC PATCH 7/9] Add pass to eliminate duplicate member-type fragments Giuliano Procida
2021-03-25 21:51       ` [RFC PATCH 8/9] Add pass to stabilise types and declarations Giuliano Procida
2021-03-25 21:51       ` [RFC PATCH 9/9] Add pass to resolve stray forward type declarations Giuliano Procida

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAGvU0H=ou0cEP5Z=-WNOdAfHvggKE50eKpmxX2tkDsURkSEPQQ@mail.gmail.com' \
    --to=gprocida@google.com \
    --cc=kernel-team@android.com \
    --cc=libabigail@sourceware.org \
    --cc=maennich@google.com \
    --cc=woodard@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).