public inbox for libabigail@sourceware.org
 help / color / mirror / Atom feed
* [RFC PATCH] Add an experimental ABI pruning utility.
@ 2021-03-11 11:53 Giuliano Procida
  2021-03-11 20:39 ` Ben Woodard
  2021-03-12 16:59 ` [RFC PATCH v2] " Giuliano Procida
  0 siblings, 2 replies; 17+ messages in thread
From: Giuliano Procida @ 2021-03-11 11:53 UTC (permalink / raw)
  To: libabigail; +Cc: dodji, kernel-team, gprocida, maennich

In response to internal reports about very large ABIs for libraires
implemented in C++ but exposing only a C API, I wrote a script as a
means of investigating the issue.

The script aims to remove parts of the XML that are irrelevant to
abidiff. It has been run on the current libabigail tests as well as
some locally-generated ABI files.

On most XML files, this results in a modest saving (5-10%). However,
for the library that sparked the invesigaton, the resulting XML was
more than 2300 times smaller.

The script may have some merit on its own but perhaps should be
implemented differently.

Question: How hard would it be to implement this functionality (under
flag control) within libabigail itself?

A couple of tangential things worth mentioning (see the comments at
the very beginning and very end of the script):

- abidiff rejects XML files with an XML declaration at the top
- abidiff rejects completely empty files

Resolving the first would make the second moot.

	* scripts/abiprune.pl: Add script to prune ABI XML.

Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 scripts/abiprune.pl | 323 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 323 insertions(+)
 create mode 100755 scripts/abiprune.pl

diff --git a/scripts/abiprune.pl b/scripts/abiprune.pl
new file mode 100755
index 00000000..8e04a273
--- /dev/null
+++ b/scripts/abiprune.pl
@@ -0,0 +1,323 @@
+#!/usr/bin/perl
+
+# This script is intended to consume libabigail ABI XML as generated
+# by abidw and produce a possibly smaller representation that captures
+# the same ABI. In particular, the output should be such that abidiff
+# --harmless reports no differences (or is empty).
+#
+# It does this by modelling symbols and types dependencies as a
+# directed graph, marking nodes reachable from the ELF symbols and
+# dropping unreachable nodes.
+
+use strict;
+use warnings;
+no warnings 'recursion';
+
+use autodie;
+
+use XML::LibXML;
+
+# Making a graph from ABI XML.
+#
+# The following are the types of "node" we care about. The "edges" are
+# obtained from a few XML attributes as well as via XML element
+# containment.
+#
+#  ELF (exported) symbols
+#
+#   elf-symbol (has a name; the code here currently knows nothing
+#     about aliases)
+#
+#  Declarations (that mention a symbol)
+#
+#   These live in a scope. In C++ scopes can be nested and include
+#   namespaces and class types.
+#
+#  var-decl (also used for member variables)
+#    elf-symbol linked to symbol via mangled-name
+#    type-id links to a type
+#  function-decl (also used for member functions)
+#    elf-symbol linked to symbol via mangled-name
+#    parameter and return type-ids link to types
+#      (alas, not just a link to a function type)
+#
+#  Types
+#
+#   These occupy pretty much all the other elements, besides those
+#   that act as simple containers.
+#
+# XML elements and their roles in more detail
+#
+# ELF
+#
+# elf-needed - container
+#  dependency - names a library
+# elf-function-symbols - contains a list of symbols
+# elf-variable-symbols - contains a list of symbols
+#  elf-symbol - describes an ELF variable or function
+#
+# Grouping and scoping
+#
+# abi-corpus-group
+#  abi-corpus
+#   abi-instr - compilation unit containers
+#    namespace-decl - pure container, possibly named
+#
+# Types (some introduce scopes, only in C++)
+#
+# type-decl - defines a primitive type
+# typedef-decl - defines a type, links to a type
+# qualified-type-def - defines a type, links to a type
+# pointer-type-def - defines a type, links to a type
+# reference-type-def - defines a type, links to a type
+# array-type-def - defines a (multidimensional array) type, refers to element type, contains subranges
+#  subrange - contains array length, refers to element type; defines types (never referred to; duplicated)
+# function-type - defines a type
+#  parameter - belongs to function-type and -decl, links to a type
+#  return - belongs to function-type and -decl, links to a type
+# enum-decl - defines a type, names it, contains a list of enumerators and an underlying-type
+#  underlying-type - belongs to enum-decl
+#  enumerator - belongs to enum-decl
+# union-decl - defines and names a type, contains member elements linked to other things
+# class-decl - defines and names a type, contains base type, member elements linking to other things
+#  base-class - belongs to class-decl
+#  data-member - container for a member; holds access level
+#  member-function - container for a member; holds access level
+#  member-type - container for a type declaration; holds access level
+#  member-template - container for a (function?) template declaration; holds access level
+#
+# Higher order Things
+#
+# class-template-decl - defines a "type", but without instantiation this isn't usable
+# function-template-decl - defines a "type", but without instantiation this isn't usable
+#  template-type-parameter - defines a type (formal variable), perhaps one which should be excluded from the real type graph
+#  template-non-type-parameter - names a template parameter, links to a type
+#  template-parameter-type-composition - container?
+#
+# Values
+#
+# function-decl - names a function, can link to a symbol
+#     has same children as function-type, rather than linking to a type
+# var-decl - names a variable, can link to a symbol, links to a type
+
+# Graph nodes.
+my %symbols;
+my %exported_symbols;
+my %types;
+# Graph edges.
+my %type_type_edges;
+my %symbol_type_edges;
+my %type_symbol_edges;
+
+# Keep track of the types being defined.
+my @id_stack;
+# Keep track of the current (value) declaration.
+my @symbol_stack;
+sub process($);
+sub process($) {
+  my ($node) = @_;
+
+  # The XML attributes we care about.
+  my $name;
+  my $id;
+  my $type_id;
+  my $symbol;
+  my $naming_typedef_id;
+
+  # Not every node we encounter is an XML element.
+  if ($node->nodeType == XML_ELEMENT_NODE) {
+    $name = $node->getAttribute('name');
+    $id = $node->getAttribute('id');
+    $type_id = $node->getAttribute('type-id');
+    $symbol = $node->getAttribute('mangled-name');
+    $naming_typedef_id = $node->getAttribute('naming-typedef-id');
+    die if defined $id && defined $symbol;
+  }
+
+  if (defined $name && $node->getName() eq 'elf-symbol') {
+    $exported_symbols{$name} = undef;
+    # Early return is safe, but not necessary.
+    return;
+  }
+
+  if (defined $id) {
+    # This element defines a type (but there may be more than one
+    # defining the same type - we cannot rely on uniqueness).
+    $types{$id} = undef;
+    if (defined $naming_typedef_id) {
+      # This is an odd one, there can be a backwards link from an
+      # anonymous type to the typedef that refers to it, so we need to
+      # pull in the typedef, even if nothing else refers to it.
+      $type_type_edges{$id}{$naming_typedef_id}{naming} = undef;
+    }
+    if (@id_stack) {
+      # Type parent<->child dependencies; record dependencies both
+      # ways to avoid making holes in the XML types.
+      $type_type_edges{$id_stack[-1]}{$id}{contains} = undef;
+      $type_type_edges{$id}{$id_stack[-1]}{in} = undef;
+    }
+    push @id_stack, $id;
+  }
+
+  if (defined $symbol) {
+    # This element is a declaration linked to a symbol (whether or not
+    # exported).
+    $symbols{$symbol} = undef;
+    if (@id_stack) {
+      # The enclosing scope may include a C++ class, in which case we
+      # need to record the dependency.
+      $type_symbol_edges{$id_stack[-1]}{$symbol} = undef;
+    }
+    # The symbol depends on the types mentioned in this element, so
+    # record it.
+    push @symbol_stack, $symbol;
+    die "nested symbols: @symbol_stack\n" if scalar @symbol_stack > 1;
+  }
+
+  if (defined $type_id) {
+    if (@id_stack) {
+      # The enclosing type refers to this type, record the dependency.
+      $type_type_edges{$id_stack[-1]}{$type_id}{depends} = undef;
+    }
+    if (@symbol_stack) {
+      # The enclosing declaration (linked to a symbol) refers to this
+      # type, record the dependency.
+      $symbol_type_edges{$symbol_stack[-1]}{$type_id} = undef;
+    }
+  }
+
+  for my $child ($node->nonBlankChildNodes()) {
+    process($child);
+  }
+
+  if (defined $symbol) {
+    pop @symbol_stack;
+  }
+  if (defined $id) {
+    pop @id_stack;
+  }
+}
+
+# Load the XML.
+my $dom = XML::LibXML->load_xml(IO => \*STDIN);
+# Build a graph.
+process($dom);
+die if @id_stack or @symbol_stack;
+
+# DFS visited state.
+my %seen_types;
+my %seen_symbols;
+sub dfs_type($);
+sub dfs_symbol($);
+
+sub dfs_type($) {
+  my ($type) = @_;
+  return if exists $seen_types{$type};
+  $seen_types{$type} = undef;
+
+  my $types = $type_type_edges{$type};
+  if (defined $types) {
+    for my $type (keys %$types) {
+      for my $link (keys %{$types->{$type}}) {
+        if (exists $types->{$type}{$link}) {
+          dfs_type($type);
+        }
+      }
+    }
+  }
+  my $symbols = $type_symbol_edges{$type};
+  if (defined $symbols) {
+    for my $symbol (keys %$symbols) {
+      dfs_symbol($symbol);
+    }
+  }
+}
+
+sub dfs_symbol($) {
+  my ($symbol) = @_;
+  return if exists $seen_symbols{$symbol};
+  $seen_symbols{$symbol} = undef;
+
+  my $types = $symbol_type_edges{$symbol};
+  if (defined $types) {
+    for my $type (keys %$types) {
+      dfs_type($type);
+    }
+  } else {
+    warn "no declaration found for exported symbol $symbol\n";
+  }
+}
+
+# Traverse the graph, using the exported symbols as the roots.
+for my $symbol (keys %exported_symbols) {
+  dfs_symbol($symbol);
+}
+
+# Useful counts.
+sub print_report() {
+  my $count_symbols = scalar keys %symbols;
+  my $count_exported_symbols = scalar keys %exported_symbols;
+  my $count_types = scalar keys %types;
+  my $count_reachable_types = scalar keys %seen_types;
+
+  print STDERR qq{symbols = $count_symbols
+exported symbols = $count_exported_symbols
+types = $count_types
+reachable types = $count_reachable_types
+};
+}
+
+my %drop_if_empty = map { $_ => undef } qw(
+  namespace-decl
+  abi-instr
+  abi-corpus
+  abi-corpus-group
+);
+
+# This is another DFS traversal of the XML.
+sub prune($);
+sub prune($) {
+  my ($node) = @_;
+
+  my $name = $node->getName();
+  my $id;
+  my $symbol;
+
+  if ($node->nodeType == XML_ELEMENT_NODE) {
+    $id = $node->getAttribute('id');
+    $symbol = $node->getAttribute('mangled-name');
+    die if defined $id && defined $symbol;
+  }
+
+  # Return if we know that this is a type or declaration to keep or
+  # drop in its entirety.
+  if (defined $id) {
+    return !exists $seen_types{$id};
+  }
+  if ($name eq 'var-decl' || $name eq 'function-decl') {
+    return !defined $symbol || !exists $exported_symbols{$symbol};
+  }
+
+  # Otherwise, this is not a type, declaration or part thereof, so
+  # process child elements.
+  my $empty = 1;
+  for my $child ($node->nonBlankChildNodes()) {
+    if (prune($child)) {
+      $node->removeChild($child);
+    } else {
+      $empty = 0;
+    }
+  }
+
+  # Empty container elements can be dropped.
+  return $empty && exists $drop_if_empty{$name};
+}
+
+# Drop unreachable nodes and any container nodes that end up empty.
+prune($dom);
+my $out = $dom->toString();
+# Remove blank line debris.
+$out =~ s;^ *\n;;mg;
+# Remove the XML declaration as abidiff doesn't like it.
+$out =~ s;^<\?xml .*\?>\n;;m;
+print $out;
-- 
2.31.0.rc2.261.g7f71774620-goog


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] Add an experimental ABI pruning utility.
  2021-03-11 11:53 [RFC PATCH] Add an experimental ABI pruning utility Giuliano Procida
@ 2021-03-11 20:39 ` Ben Woodard
  2021-03-12 16:51   ` Giuliano Procida
  2021-03-12 16:59 ` [RFC PATCH v2] " Giuliano Procida
  1 sibling, 1 reply; 17+ messages in thread
From: Ben Woodard @ 2021-03-11 20:39 UTC (permalink / raw)
  To: Giuliano Procida; +Cc: libabigail, Matthias Maennich, kernel-team

I think that you have pointed out a problem that I’ve been discussing with some users that I’m working with. The sheer volume of the abixml output is pretty extreme. I don’t think that they were really expecting it. However, they are not running into the particular situation that you are seeing since they don’t have the coding model of the problem library. They are more likely to fall in the 5-10% range that you reported above. 

We have been talking about it in terms of “ABI Surface” vs. “ABI Corpus”. The ELF reachability aspect that you point out seems to be part of the difference between those two concepts.

> On Mar 11, 2021, at 3:53 AM, Giuliano Procida via Libabigail <libabigail@sourceware.org> wrote:
> 
> In response to internal reports about very large ABIs for libraires
> implemented in C++ but exposing only a C API, I wrote a script as a
> means of investigating the issue.
> 
> The script aims to remove parts of the XML that are irrelevant to
> abidiff. It has been run on the current libabigail tests as well as
> some locally-generated ABI files.
> 
> On most XML files, this results in a modest saving (5-10%). However,
> for the library that sparked the invesigaton, the resulting XML was
> more than 2300 times smaller.
> 
> The script may have some merit on its own but perhaps should be
> implemented differently.
> 
> Question: How hard would it be to implement this functionality (under
> flag control) within libabigail itself?

I can’t directly answer “how hard” that is one of the things which we would discover by actually doing it. As for should it be done. I think it should be carefully looked into. However, as for priority since it is an optimization rather than a functional problem, if it were me, I would file this as a bug and put examples of the test data and results you are seeing as well as the script below in that bz and then start working on the process of refining libabigail’s output so that it matches the results of your script.

-ben

> 
> A couple of tangential things worth mentioning (see the comments at
> the very beginning and very end of the script):
> 
> - abidiff rejects XML files with an XML declaration at the top
> - abidiff rejects completely empty files
> 
> Resolving the first would make the second moot.
> 
> 	* scripts/abiprune.pl: Add script to prune ABI XML.
> 
> Signed-off-by: Giuliano Procida <gprocida@google.com>
> ---
> scripts/abiprune.pl | 323 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 323 insertions(+)
> create mode 100755 scripts/abiprune.pl
> 
> diff --git a/scripts/abiprune.pl b/scripts/abiprune.pl
> new file mode 100755
> index 00000000..8e04a273
> --- /dev/null
> +++ b/scripts/abiprune.pl
> @@ -0,0 +1,323 @@
> +#!/usr/bin/perl
> +
> +# This script is intended to consume libabigail ABI XML as generated
> +# by abidw and produce a possibly smaller representation that captures
> +# the same ABI. In particular, the output should be such that abidiff
> +# --harmless reports no differences (or is empty).
> +#
> +# It does this by modelling symbols and types dependencies as a
> +# directed graph, marking nodes reachable from the ELF symbols and
> +# dropping unreachable nodes.
> +
> +use strict;
> +use warnings;
> +no warnings 'recursion';
> +
> +use autodie;
> +
> +use XML::LibXML;
> +
> +# Making a graph from ABI XML.
> +#
> +# The following are the types of "node" we care about. The "edges" are
> +# obtained from a few XML attributes as well as via XML element
> +# containment.
> +#
> +#  ELF (exported) symbols
> +#
> +#   elf-symbol (has a name; the code here currently knows nothing
> +#     about aliases)
> +#
> +#  Declarations (that mention a symbol)
> +#
> +#   These live in a scope. In C++ scopes can be nested and include
> +#   namespaces and class types.
> +#
> +#  var-decl (also used for member variables)
> +#    elf-symbol linked to symbol via mangled-name
> +#    type-id links to a type
> +#  function-decl (also used for member functions)
> +#    elf-symbol linked to symbol via mangled-name
> +#    parameter and return type-ids link to types
> +#      (alas, not just a link to a function type)
> +#
> +#  Types
> +#
> +#   These occupy pretty much all the other elements, besides those
> +#   that act as simple containers.
> +#
> +# XML elements and their roles in more detail
> +#
> +# ELF
> +#
> +# elf-needed - container
> +#  dependency - names a library
> +# elf-function-symbols - contains a list of symbols
> +# elf-variable-symbols - contains a list of symbols
> +#  elf-symbol - describes an ELF variable or function
> +#
> +# Grouping and scoping
> +#
> +# abi-corpus-group
> +#  abi-corpus
> +#   abi-instr - compilation unit containers
> +#    namespace-decl - pure container, possibly named
> +#
> +# Types (some introduce scopes, only in C++)
> +#
> +# type-decl - defines a primitive type
> +# typedef-decl - defines a type, links to a type
> +# qualified-type-def - defines a type, links to a type
> +# pointer-type-def - defines a type, links to a type
> +# reference-type-def - defines a type, links to a type
> +# array-type-def - defines a (multidimensional array) type, refers to element type, contains subranges
> +#  subrange - contains array length, refers to element type; defines types (never referred to; duplicated)
> +# function-type - defines a type
> +#  parameter - belongs to function-type and -decl, links to a type
> +#  return - belongs to function-type and -decl, links to a type
> +# enum-decl - defines a type, names it, contains a list of enumerators and an underlying-type
> +#  underlying-type - belongs to enum-decl
> +#  enumerator - belongs to enum-decl
> +# union-decl - defines and names a type, contains member elements linked to other things
> +# class-decl - defines and names a type, contains base type, member elements linking to other things
> +#  base-class - belongs to class-decl
> +#  data-member - container for a member; holds access level
> +#  member-function - container for a member; holds access level
> +#  member-type - container for a type declaration; holds access level
> +#  member-template - container for a (function?) template declaration; holds access level
> +#
> +# Higher order Things
> +#
> +# class-template-decl - defines a "type", but without instantiation this isn't usable
> +# function-template-decl - defines a "type", but without instantiation this isn't usable
> +#  template-type-parameter - defines a type (formal variable), perhaps one which should be excluded from the real type graph
> +#  template-non-type-parameter - names a template parameter, links to a type
> +#  template-parameter-type-composition - container?
> +#
> +# Values
> +#
> +# function-decl - names a function, can link to a symbol
> +#     has same children as function-type, rather than linking to a type
> +# var-decl - names a variable, can link to a symbol, links to a type
> +
> +# Graph nodes.
> +my %symbols;
> +my %exported_symbols;
> +my %types;
> +# Graph edges.
> +my %type_type_edges;
> +my %symbol_type_edges;
> +my %type_symbol_edges;
> +
> +# Keep track of the types being defined.
> +my @id_stack;
> +# Keep track of the current (value) declaration.
> +my @symbol_stack;
> +sub process($);
> +sub process($) {
> +  my ($node) = @_;
> +
> +  # The XML attributes we care about.
> +  my $name;
> +  my $id;
> +  my $type_id;
> +  my $symbol;
> +  my $naming_typedef_id;
> +
> +  # Not every node we encounter is an XML element.
> +  if ($node->nodeType == XML_ELEMENT_NODE) {
> +    $name = $node->getAttribute('name');
> +    $id = $node->getAttribute('id');
> +    $type_id = $node->getAttribute('type-id');
> +    $symbol = $node->getAttribute('mangled-name');
> +    $naming_typedef_id = $node->getAttribute('naming-typedef-id');
> +    die if defined $id && defined $symbol;
> +  }
> +
> +  if (defined $name && $node->getName() eq 'elf-symbol') {
> +    $exported_symbols{$name} = undef;
> +    # Early return is safe, but not necessary.
> +    return;
> +  }
> +
> +  if (defined $id) {
> +    # This element defines a type (but there may be more than one
> +    # defining the same type - we cannot rely on uniqueness).
> +    $types{$id} = undef;
> +    if (defined $naming_typedef_id) {
> +      # This is an odd one, there can be a backwards link from an
> +      # anonymous type to the typedef that refers to it, so we need to
> +      # pull in the typedef, even if nothing else refers to it.
> +      $type_type_edges{$id}{$naming_typedef_id}{naming} = undef;
> +    }
> +    if (@id_stack) {
> +      # Type parent<->child dependencies; record dependencies both
> +      # ways to avoid making holes in the XML types.
> +      $type_type_edges{$id_stack[-1]}{$id}{contains} = undef;
> +      $type_type_edges{$id}{$id_stack[-1]}{in} = undef;
> +    }
> +    push @id_stack, $id;
> +  }
> +
> +  if (defined $symbol) {
> +    # This element is a declaration linked to a symbol (whether or not
> +    # exported).
> +    $symbols{$symbol} = undef;
> +    if (@id_stack) {
> +      # The enclosing scope may include a C++ class, in which case we
> +      # need to record the dependency.
> +      $type_symbol_edges{$id_stack[-1]}{$symbol} = undef;
> +    }
> +    # The symbol depends on the types mentioned in this element, so
> +    # record it.
> +    push @symbol_stack, $symbol;
> +    die "nested symbols: @symbol_stack\n" if scalar @symbol_stack > 1;
> +  }
> +
> +  if (defined $type_id) {
> +    if (@id_stack) {
> +      # The enclosing type refers to this type, record the dependency.
> +      $type_type_edges{$id_stack[-1]}{$type_id}{depends} = undef;
> +    }
> +    if (@symbol_stack) {
> +      # The enclosing declaration (linked to a symbol) refers to this
> +      # type, record the dependency.
> +      $symbol_type_edges{$symbol_stack[-1]}{$type_id} = undef;
> +    }
> +  }
> +
> +  for my $child ($node->nonBlankChildNodes()) {
> +    process($child);
> +  }
> +
> +  if (defined $symbol) {
> +    pop @symbol_stack;
> +  }
> +  if (defined $id) {
> +    pop @id_stack;
> +  }
> +}
> +
> +# Load the XML.
> +my $dom = XML::LibXML->load_xml(IO => \*STDIN);
> +# Build a graph.
> +process($dom);
> +die if @id_stack or @symbol_stack;
> +
> +# DFS visited state.
> +my %seen_types;
> +my %seen_symbols;
> +sub dfs_type($);
> +sub dfs_symbol($);
> +
> +sub dfs_type($) {
> +  my ($type) = @_;
> +  return if exists $seen_types{$type};
> +  $seen_types{$type} = undef;
> +
> +  my $types = $type_type_edges{$type};
> +  if (defined $types) {
> +    for my $type (keys %$types) {
> +      for my $link (keys %{$types->{$type}}) {
> +        if (exists $types->{$type}{$link}) {
> +          dfs_type($type);
> +        }
> +      }
> +    }
> +  }
> +  my $symbols = $type_symbol_edges{$type};
> +  if (defined $symbols) {
> +    for my $symbol (keys %$symbols) {
> +      dfs_symbol($symbol);
> +    }
> +  }
> +}
> +
> +sub dfs_symbol($) {
> +  my ($symbol) = @_;
> +  return if exists $seen_symbols{$symbol};
> +  $seen_symbols{$symbol} = undef;
> +
> +  my $types = $symbol_type_edges{$symbol};
> +  if (defined $types) {
> +    for my $type (keys %$types) {
> +      dfs_type($type);
> +    }
> +  } else {
> +    warn "no declaration found for exported symbol $symbol\n";
> +  }
> +}
> +
> +# Traverse the graph, using the exported symbols as the roots.
> +for my $symbol (keys %exported_symbols) {
> +  dfs_symbol($symbol);
> +}
> +
> +# Useful counts.
> +sub print_report() {
> +  my $count_symbols = scalar keys %symbols;
> +  my $count_exported_symbols = scalar keys %exported_symbols;
> +  my $count_types = scalar keys %types;
> +  my $count_reachable_types = scalar keys %seen_types;
> +
> +  print STDERR qq{symbols = $count_symbols
> +exported symbols = $count_exported_symbols
> +types = $count_types
> +reachable types = $count_reachable_types
> +};
> +}
> +
> +my %drop_if_empty = map { $_ => undef } qw(
> +  namespace-decl
> +  abi-instr
> +  abi-corpus
> +  abi-corpus-group
> +);
> +
> +# This is another DFS traversal of the XML.
> +sub prune($);
> +sub prune($) {
> +  my ($node) = @_;
> +
> +  my $name = $node->getName();
> +  my $id;
> +  my $symbol;
> +
> +  if ($node->nodeType == XML_ELEMENT_NODE) {
> +    $id = $node->getAttribute('id');
> +    $symbol = $node->getAttribute('mangled-name');
> +    die if defined $id && defined $symbol;
> +  }
> +
> +  # Return if we know that this is a type or declaration to keep or
> +  # drop in its entirety.
> +  if (defined $id) {
> +    return !exists $seen_types{$id};
> +  }
> +  if ($name eq 'var-decl' || $name eq 'function-decl') {
> +    return !defined $symbol || !exists $exported_symbols{$symbol};
> +  }
> +
> +  # Otherwise, this is not a type, declaration or part thereof, so
> +  # process child elements.
> +  my $empty = 1;
> +  for my $child ($node->nonBlankChildNodes()) {
> +    if (prune($child)) {
> +      $node->removeChild($child);
> +    } else {
> +      $empty = 0;
> +    }
> +  }
> +
> +  # Empty container elements can be dropped.
> +  return $empty && exists $drop_if_empty{$name};
> +}
> +
> +# Drop unreachable nodes and any container nodes that end up empty.
> +prune($dom);
> +my $out = $dom->toString();
> +# Remove blank line debris.
> +$out =~ s;^ *\n;;mg;
> +# Remove the XML declaration as abidiff doesn't like it.
> +$out =~ s;^<\?xml .*\?>\n;;m;
> +print $out;
> -- 
> 2.31.0.rc2.261.g7f71774620-goog
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] Add an experimental ABI pruning utility.
  2021-03-11 20:39 ` Ben Woodard
@ 2021-03-12 16:51   ` Giuliano Procida
  2021-03-12 18:41     ` Ben Woodard
  0 siblings, 1 reply; 17+ messages in thread
From: Giuliano Procida @ 2021-03-12 16:51 UTC (permalink / raw)
  To: Ben Woodard
  Cc: Giuliano Procida via Libabigail, Matthias Maennich, kernel-team

Hi.

On Thu, 11 Mar 2021 at 20:39, Ben Woodard <woodard@redhat.com> wrote:

> I think that you have pointed out a problem that I’ve been discussing with
> some users that I’m working with. The sheer volume of the abixml output is
> pretty extreme. I don’t think that they were really expecting it. However,
> they are not running into the particular situation that you are seeing
> since they don’t have the coding model of the problem library. They are
> more likely to fall in the 5-10% range that you reported above.
>
> We have been talking about it in terms of “ABI Surface” vs. “ABI Corpus”.
> The ELF reachability aspect that you point out seems to be part of the
> difference between those two concepts.
>
>
We want a distilled description of the ABI (the surface), not what may lie
on the other side. More precisely, we want a typed interface where "type"
encompasses extra things like architecture, calling convention, ISA, ELF
symbol properties etc.

One worry is that just chasing references from the exported ELF symbols may
somehow miss something important, say for libraries that ship with bits of
implementation in header files.


> > On Mar 11, 2021, at 3:53 AM, Giuliano Procida via Libabigail <
> libabigail@sourceware.org> wrote:
> >
> > In response to internal reports about very large ABIs for libraires
> > implemented in C++ but exposing only a C API, I wrote a script as a
> > means of investigating the issue.
> >
> > The script aims to remove parts of the XML that are irrelevant to
> > abidiff. It has been run on the current libabigail tests as well as
> > some locally-generated ABI files.
> >
> > On most XML files, this results in a modest saving (5-10%). However,
> > for the library that sparked the invesigaton, the resulting XML was
> > more than 2300 times smaller.
> >
> > The script may have some merit on its own but perhaps should be
> > implemented differently.
> >
> > Question: How hard would it be to implement this functionality (under
> > flag control) within libabigail itself?
>
> I can’t directly answer “how hard” that is one of the things which we
> would discover by actually doing it. As for should it be done. I think it
> should be carefully looked into. However, as for priority since it is an
> optimization rather than a functional problem, if it were me, I would file
> this as a bug and put examples of the test data and results you are seeing
> as well as the script below in that bz and then start working on the
> process of refining libabigail’s output so that it matches the results of
> your script.
>
>
It's an optimisation analogous to DCE in a compiler. I'll certainly follow
up with a bug. It's one library in particular that is ~50M before pruning
and ~22k after.

I've already made changes to the script so that it can be given a list of
symbols and produce a reduced ABI. At the moment, we achieve this by
passing a list to abidw (with -w). In fact, we extract full and reduced
ABIs concurrently in the Android kernel build as each run takes a while. It
should be straightforward to reimplement the script in C++, though it's not
too slow already.

In terms of implementation, it would need to work with libabigail IR and
ir_node_visitor might be the right place to start.

Giuliano.


> -ben
>
> >
> > A couple of tangential things worth mentioning (see the comments at
> > the very beginning and very end of the script):
> >
> > - abidiff rejects XML files with an XML declaration at the top
> > - abidiff rejects completely empty files
> >
> > Resolving the first would make the second moot.
> >
> >       * scripts/abiprune.pl: Add script to prune ABI XML.
> >
> > Signed-off-by: Giuliano Procida <gprocida@google.com>
> > ---
> > scripts/abiprune.pl | 323 ++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 323 insertions(+)
> > create mode 100755 scripts/abiprune.pl
> >
> > diff --git a/scripts/abiprune.pl b/scripts/abiprune.pl
> > new file mode 100755
> > index 00000000..8e04a273
> > --- /dev/null
> > +++ b/scripts/abiprune.pl
> > @@ -0,0 +1,323 @@
> > +#!/usr/bin/perl
> > +
> > +# This script is intended to consume libabigail ABI XML as generated
> > +# by abidw and produce a possibly smaller representation that captures
> > +# the same ABI. In particular, the output should be such that abidiff
> > +# --harmless reports no differences (or is empty).
> > +#
> > +# It does this by modelling symbols and types dependencies as a
> > +# directed graph, marking nodes reachable from the ELF symbols and
> > +# dropping unreachable nodes.
> > +
> > +use strict;
> > +use warnings;
> > +no warnings 'recursion';
> > +
> > +use autodie;
> > +
> > +use XML::LibXML;
> > +
> > +# Making a graph from ABI XML.
> > +#
> > +# The following are the types of "node" we care about. The "edges" are
> > +# obtained from a few XML attributes as well as via XML element
> > +# containment.
> > +#
> > +#  ELF (exported) symbols
> > +#
> > +#   elf-symbol (has a name; the code here currently knows nothing
> > +#     about aliases)
> > +#
> > +#  Declarations (that mention a symbol)
> > +#
> > +#   These live in a scope. In C++ scopes can be nested and include
> > +#   namespaces and class types.
> > +#
> > +#  var-decl (also used for member variables)
> > +#    elf-symbol linked to symbol via mangled-name
> > +#    type-id links to a type
> > +#  function-decl (also used for member functions)
> > +#    elf-symbol linked to symbol via mangled-name
> > +#    parameter and return type-ids link to types
> > +#      (alas, not just a link to a function type)
> > +#
> > +#  Types
> > +#
> > +#   These occupy pretty much all the other elements, besides those
> > +#   that act as simple containers.
> > +#
> > +# XML elements and their roles in more detail
> > +#
> > +# ELF
> > +#
> > +# elf-needed - container
> > +#  dependency - names a library
> > +# elf-function-symbols - contains a list of symbols
> > +# elf-variable-symbols - contains a list of symbols
> > +#  elf-symbol - describes an ELF variable or function
> > +#
> > +# Grouping and scoping
> > +#
> > +# abi-corpus-group
> > +#  abi-corpus
> > +#   abi-instr - compilation unit containers
> > +#    namespace-decl - pure container, possibly named
> > +#
> > +# Types (some introduce scopes, only in C++)
> > +#
> > +# type-decl - defines a primitive type
> > +# typedef-decl - defines a type, links to a type
> > +# qualified-type-def - defines a type, links to a type
> > +# pointer-type-def - defines a type, links to a type
> > +# reference-type-def - defines a type, links to a type
> > +# array-type-def - defines a (multidimensional array) type, refers to
> element type, contains subranges
> > +#  subrange - contains array length, refers to element type; defines
> types (never referred to; duplicated)
> > +# function-type - defines a type
> > +#  parameter - belongs to function-type and -decl, links to a type
> > +#  return - belongs to function-type and -decl, links to a type
> > +# enum-decl - defines a type, names it, contains a list of enumerators
> and an underlying-type
> > +#  underlying-type - belongs to enum-decl
> > +#  enumerator - belongs to enum-decl
> > +# union-decl - defines and names a type, contains member elements
> linked to other things
> > +# class-decl - defines and names a type, contains base type, member
> elements linking to other things
> > +#  base-class - belongs to class-decl
> > +#  data-member - container for a member; holds access level
> > +#  member-function - container for a member; holds access level
> > +#  member-type - container for a type declaration; holds access level
> > +#  member-template - container for a (function?) template declaration;
> holds access level
> > +#
> > +# Higher order Things
> > +#
> > +# class-template-decl - defines a "type", but without instantiation
> this isn't usable
> > +# function-template-decl - defines a "type", but without instantiation
> this isn't usable
> > +#  template-type-parameter - defines a type (formal variable), perhaps
> one which should be excluded from the real type graph
> > +#  template-non-type-parameter - names a template parameter, links to a
> type
> > +#  template-parameter-type-composition - container?
> > +#
> > +# Values
> > +#
> > +# function-decl - names a function, can link to a symbol
> > +#     has same children as function-type, rather than linking to a type
> > +# var-decl - names a variable, can link to a symbol, links to a type
> > +
> > +# Graph nodes.
> > +my %symbols;
> > +my %exported_symbols;
> > +my %types;
> > +# Graph edges.
> > +my %type_type_edges;
> > +my %symbol_type_edges;
> > +my %type_symbol_edges;
> > +
> > +# Keep track of the types being defined.
> > +my @id_stack;
> > +# Keep track of the current (value) declaration.
> > +my @symbol_stack;
> > +sub process($);
> > +sub process($) {
> > +  my ($node) = @_;
> > +
> > +  # The XML attributes we care about.
> > +  my $name;
> > +  my $id;
> > +  my $type_id;
> > +  my $symbol;
> > +  my $naming_typedef_id;
> > +
> > +  # Not every node we encounter is an XML element.
> > +  if ($node->nodeType == XML_ELEMENT_NODE) {
> > +    $name = $node->getAttribute('name');
> > +    $id = $node->getAttribute('id');
> > +    $type_id = $node->getAttribute('type-id');
> > +    $symbol = $node->getAttribute('mangled-name');
> > +    $naming_typedef_id = $node->getAttribute('naming-typedef-id');
> > +    die if defined $id && defined $symbol;
> > +  }
> > +
> > +  if (defined $name && $node->getName() eq 'elf-symbol') {
> > +    $exported_symbols{$name} = undef;
> > +    # Early return is safe, but not necessary.
> > +    return;
> > +  }
> > +
> > +  if (defined $id) {
> > +    # This element defines a type (but there may be more than one
> > +    # defining the same type - we cannot rely on uniqueness).
> > +    $types{$id} = undef;
> > +    if (defined $naming_typedef_id) {
> > +      # This is an odd one, there can be a backwards link from an
> > +      # anonymous type to the typedef that refers to it, so we need to
> > +      # pull in the typedef, even if nothing else refers to it.
> > +      $type_type_edges{$id}{$naming_typedef_id}{naming} = undef;
> > +    }
> > +    if (@id_stack) {
> > +      # Type parent<->child dependencies; record dependencies both
> > +      # ways to avoid making holes in the XML types.
> > +      $type_type_edges{$id_stack[-1]}{$id}{contains} = undef;
> > +      $type_type_edges{$id}{$id_stack[-1]}{in} = undef;
> > +    }
> > +    push @id_stack, $id;
> > +  }
> > +
> > +  if (defined $symbol) {
> > +    # This element is a declaration linked to a symbol (whether or not
> > +    # exported).
> > +    $symbols{$symbol} = undef;
> > +    if (@id_stack) {
> > +      # The enclosing scope may include a C++ class, in which case we
> > +      # need to record the dependency.
> > +      $type_symbol_edges{$id_stack[-1]}{$symbol} = undef;
> > +    }
> > +    # The symbol depends on the types mentioned in this element, so
> > +    # record it.
> > +    push @symbol_stack, $symbol;
> > +    die "nested symbols: @symbol_stack\n" if scalar @symbol_stack > 1;
> > +  }
> > +
> > +  if (defined $type_id) {
> > +    if (@id_stack) {
> > +      # The enclosing type refers to this type, record the dependency.
> > +      $type_type_edges{$id_stack[-1]}{$type_id}{depends} = undef;
> > +    }
> > +    if (@symbol_stack) {
> > +      # The enclosing declaration (linked to a symbol) refers to this
> > +      # type, record the dependency.
> > +      $symbol_type_edges{$symbol_stack[-1]}{$type_id} = undef;
> > +    }
> > +  }
> > +
> > +  for my $child ($node->nonBlankChildNodes()) {
> > +    process($child);
> > +  }
> > +
> > +  if (defined $symbol) {
> > +    pop @symbol_stack;
> > +  }
> > +  if (defined $id) {
> > +    pop @id_stack;
> > +  }
> > +}
> > +
> > +# Load the XML.
> > +my $dom = XML::LibXML->load_xml(IO => \*STDIN);
> > +# Build a graph.
> > +process($dom);
> > +die if @id_stack or @symbol_stack;
> > +
> > +# DFS visited state.
> > +my %seen_types;
> > +my %seen_symbols;
> > +sub dfs_type($);
> > +sub dfs_symbol($);
> > +
> > +sub dfs_type($) {
> > +  my ($type) = @_;
> > +  return if exists $seen_types{$type};
> > +  $seen_types{$type} = undef;
> > +
> > +  my $types = $type_type_edges{$type};
> > +  if (defined $types) {
> > +    for my $type (keys %$types) {
> > +      for my $link (keys %{$types->{$type}}) {
> > +        if (exists $types->{$type}{$link}) {
> > +          dfs_type($type);
> > +        }
> > +      }
> > +    }
> > +  }
> > +  my $symbols = $type_symbol_edges{$type};
> > +  if (defined $symbols) {
> > +    for my $symbol (keys %$symbols) {
> > +      dfs_symbol($symbol);
> > +    }
> > +  }
> > +}
> > +
> > +sub dfs_symbol($) {
> > +  my ($symbol) = @_;
> > +  return if exists $seen_symbols{$symbol};
> > +  $seen_symbols{$symbol} = undef;
> > +
> > +  my $types = $symbol_type_edges{$symbol};
> > +  if (defined $types) {
> > +    for my $type (keys %$types) {
> > +      dfs_type($type);
> > +    }
> > +  } else {
> > +    warn "no declaration found for exported symbol $symbol\n";
> > +  }
> > +}
> > +
> > +# Traverse the graph, using the exported symbols as the roots.
> > +for my $symbol (keys %exported_symbols) {
> > +  dfs_symbol($symbol);
> > +}
> > +
> > +# Useful counts.
> > +sub print_report() {
> > +  my $count_symbols = scalar keys %symbols;
> > +  my $count_exported_symbols = scalar keys %exported_symbols;
> > +  my $count_types = scalar keys %types;
> > +  my $count_reachable_types = scalar keys %seen_types;
> > +
> > +  print STDERR qq{symbols = $count_symbols
> > +exported symbols = $count_exported_symbols
> > +types = $count_types
> > +reachable types = $count_reachable_types
> > +};
> > +}
> > +
> > +my %drop_if_empty = map { $_ => undef } qw(
> > +  namespace-decl
> > +  abi-instr
> > +  abi-corpus
> > +  abi-corpus-group
> > +);
> > +
> > +# This is another DFS traversal of the XML.
> > +sub prune($);
> > +sub prune($) {
> > +  my ($node) = @_;
> > +
> > +  my $name = $node->getName();
> > +  my $id;
> > +  my $symbol;
> > +
> > +  if ($node->nodeType == XML_ELEMENT_NODE) {
> > +    $id = $node->getAttribute('id');
> > +    $symbol = $node->getAttribute('mangled-name');
> > +    die if defined $id && defined $symbol;
> > +  }
> > +
> > +  # Return if we know that this is a type or declaration to keep or
> > +  # drop in its entirety.
> > +  if (defined $id) {
> > +    return !exists $seen_types{$id};
> > +  }
> > +  if ($name eq 'var-decl' || $name eq 'function-decl') {
> > +    return !defined $symbol || !exists $exported_symbols{$symbol};
> > +  }
> > +
> > +  # Otherwise, this is not a type, declaration or part thereof, so
> > +  # process child elements.
> > +  my $empty = 1;
> > +  for my $child ($node->nonBlankChildNodes()) {
> > +    if (prune($child)) {
> > +      $node->removeChild($child);
> > +    } else {
> > +      $empty = 0;
> > +    }
> > +  }
> > +
> > +  # Empty container elements can be dropped.
> > +  return $empty && exists $drop_if_empty{$name};
> > +}
> > +
> > +# Drop unreachable nodes and any container nodes that end up empty.
> > +prune($dom);
> > +my $out = $dom->toString();
> > +# Remove blank line debris.
> > +$out =~ s;^ *\n;;mg;
> > +# Remove the XML declaration as abidiff doesn't like it.
> > +$out =~ s;^<\?xml .*\?>\n;;m;
> > +print $out;
> > --
> > 2.31.0.rc2.261.g7f71774620-goog
> >
>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH v2] Add an experimental ABI pruning utility.
  2021-03-11 11:53 [RFC PATCH] Add an experimental ABI pruning utility Giuliano Procida
  2021-03-11 20:39 ` Ben Woodard
@ 2021-03-12 16:59 ` Giuliano Procida
  2021-03-16 16:55   ` [RFC PATCH v3] " Giuliano Procida
  1 sibling, 1 reply; 17+ messages in thread
From: Giuliano Procida @ 2021-03-12 16:59 UTC (permalink / raw)
  To: libabigail; +Cc: dodji, kernel-team, gprocida, maennich, woodard

In response to internal reports about very large ABIs for libraires
implemented in C++ but exposing only a C API, I wrote a script as a
means of investigating the issue.

The script aims to remove parts of the XML that are irrelevant to
abidiff. It has been run on current libabigail test data ABI files as
well as some locally-generated ones.

The script by default considers all exported ELF symbols as the roots
of interest (but this list can be overridden).

On most XML files, this results in a modest saving (5-10%). However,
for the library that sparked the invesigaton, the resulting XML was
more than 2300 times smaller.

The script may have some merit on its own but perhaps should be
implemented differently (say in C++).

Question: How hard would it be to implement this functionality (under
flag control) within libabigail itself?

A couple of tangential things worth mentioning (see the comments at
the very beginning and very end of the script):

- abidiff rejects XML files with an XML declaration at the top
- abidiff rejects completely empty files

Resolving the first would make the second moot.

	* scripts/abiprune.pl: Add script to prune ABI XML or restrict
	it to a subset of symbols.

Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 scripts/abiprune.pl | 356 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 356 insertions(+)
 create mode 100755 scripts/abiprune.pl

diff --git a/scripts/abiprune.pl b/scripts/abiprune.pl
new file mode 100755
index 00000000..45592d29
--- /dev/null
+++ b/scripts/abiprune.pl
@@ -0,0 +1,356 @@
+#!/usr/bin/perl
+
+# This script is intended to consume libabigail ABI XML as generated
+# by abidw and produce a possibly smaller representation that captures
+# the same ABI. In particular, the output should be such that abidiff
+# --harmless reports no differences (or is empty).
+#
+# It does this by modelling symbols and types dependencies as a
+# directed graph, marking nodes reachable from the ELF symbols and
+# dropping unreachable nodes.
+
+use strict;
+use warnings;
+no warnings 'recursion';
+
+use autodie;
+
+use Getopt::Long;
+use IO::File;
+use XML::LibXML;
+
+# Making a graph from ABI XML.
+#
+# The following are the types of "node" we care about. The "edges" are
+# obtained from a few XML attributes as well as via XML element
+# containment.
+#
+#  ELF (exported) symbols
+#
+#   elf-symbol (has a name; the code here currently knows nothing
+#     about aliases)
+#
+#  Declarations (that mention a symbol)
+#
+#   These live in a scope. In C++ scopes can be nested and include
+#   namespaces and class types.
+#
+#  var-decl (also used for member variables)
+#    elf-symbol linked to symbol via mangled-name
+#    type-id links to a type
+#  function-decl (also used for member functions)
+#    elf-symbol linked to symbol via mangled-name
+#    parameter and return type-ids link to types
+#      (alas, not just a link to a function type)
+#
+#  Types
+#
+#   These occupy pretty much all the other elements, besides those
+#   that act as simple containers.
+#
+# XML elements and their roles in more detail
+#
+# ELF
+#
+# elf-needed - container
+#  dependency - names a library
+# elf-variable-symbols - contains a list of symbols
+# elf-function-symbols - contains a list of symbols
+#  elf-symbol - describes an ELF variable or function
+#
+# Grouping and scoping
+#
+# abi-corpus-group
+#  abi-corpus
+#   abi-instr - compilation unit containers
+#    namespace-decl - pure container, possibly named
+#
+# Types (some introduce scopes, only in C++)
+#
+# type-decl - defines a primitive type
+# typedef-decl - defines a type, links to a type
+# qualified-type-def - defines a type, links to a type
+# pointer-type-def - defines a type, links to a type
+# reference-type-def - defines a type, links to a type
+# array-type-def - defines a (multidimensional array) type, refers to element type, contains subranges
+#  subrange - contains array length, refers to element type; defines types (never referred to; duplicated)
+# function-type - defines a type
+#  parameter - belongs to function-type and -decl, links to a type
+#  return - belongs to function-type and -decl, links to a type
+# enum-decl - defines a type, names it, contains a list of enumerators and an underlying-type
+#  underlying-type - belongs to enum-decl
+#  enumerator - belongs to enum-decl
+# union-decl - defines and names a type, contains member elements linked to other things
+# class-decl - defines and names a type, contains base type, member elements linking to other things
+#  base-class - belongs to class-decl
+#  data-member - container for a member; holds access level
+#  member-function - container for a member; holds access level
+#  member-type - container for a type declaration; holds access level
+#  member-template - container for a (function?) template declaration; holds access level
+#
+# Higher order Things
+#
+# class-template-decl - defines a type (function), but without instantiation this isn't usable
+# function-template-decl - defines a type (function), but without instantiation this isn't usable
+#  template-type-parameter - defines a type (variable), perhaps one which should be excluded from the real type graph
+#  template-non-type-parameter - names a template parameter, links to a type
+#  template-parameter-type-composition - container?
+#
+# Values
+#
+# function-decl - names a function, can link to a symbol
+#     has same children as function-type, rather than linking to a type
+# var-decl - names a variable, can link to a symbol, links to a type
+
+my $symbols_file;
+GetOptions ("symbols=s" => \$symbols_file)
+  && scalar @ARGV == 0
+  or die "usage: $0 [-s symbols_file]\n";
+
+my %wanted_symbols;
+if ($symbols_file) {
+  my $fh = new IO::File $symbols_file, '<';
+  while (<$fh>) {
+    chomp;
+    $wanted_symbols{$_} = undef;
+  }
+  close $fh;
+}
+
+# Graph nodes.
+my %symbols;
+my %exported_symbols;
+my %types;
+# Graph edges.
+my %type_type_edges;
+my %symbol_type_edges;
+my %type_symbol_edges;
+
+# Keep track of the types being defined.
+my @id_stack;
+# Keep track of the current (value) declaration.
+my @symbol_stack;
+# Traverse the whole XML DOM.
+sub process($);
+sub process($) {
+  my ($node) = @_;
+
+  # The XML attributes we care about.
+  my $name;
+  my $id;
+  my $type_id;
+  my $symbol;
+  my $naming_typedef_id;
+
+  # Not every node we encounter is an XML element.
+  if ($node->nodeType == XML_ELEMENT_NODE) {
+    $name = $node->getAttribute('name');
+    $id = $node->getAttribute('id');
+    $type_id = $node->getAttribute('type-id');
+    $symbol = $node->getAttribute('mangled-name');
+    $naming_typedef_id = $node->getAttribute('naming-typedef-id');
+    die if defined $id && defined $symbol;
+  }
+
+  if (defined $name && $node->getName() eq 'elf-symbol') {
+    $exported_symbols{$name} = undef;
+    # Early return is safe, but not necessary.
+    return;
+  }
+
+  if (defined $id) {
+    # This element defines a type (but there may be more than one
+    # defining the same type - we cannot rely on uniqueness).
+    $types{$id} = undef;
+    if (defined $naming_typedef_id) {
+      # This is an odd one, there can be a backwards link from an
+      # anonymous type to the typedef that refers to it, so we need to
+      # pull in the typedef, even if nothing else refers to it.
+      $type_type_edges{$id}{$naming_typedef_id}{naming} = undef;
+    }
+    if (@id_stack) {
+      # Type parent<->child dependencies; record dependencies both
+      # ways to avoid making holes in the XML types.
+      $type_type_edges{$id_stack[-1]}{$id}{contains} = undef;
+      $type_type_edges{$id}{$id_stack[-1]}{in} = undef;
+    }
+    push @id_stack, $id;
+  }
+
+  if (defined $symbol) {
+    # This element is a declaration linked to a symbol (whether or not
+    # exported).
+    $symbols{$symbol} = undef;
+    if (@id_stack) {
+      # The enclosing scope may include a C++ class, in which case we
+      # need to record the dependency.
+      $type_symbol_edges{$id_stack[-1]}{$symbol} = undef;
+    }
+    # The symbol depends on the types mentioned in this element, so
+    # record it.
+    push @symbol_stack, $symbol;
+    die "nested symbols: @symbol_stack\n" if scalar @symbol_stack > 1;
+  }
+
+  if (defined $type_id) {
+    if (@id_stack) {
+      # The enclosing type refers to this type, record the dependency.
+      $type_type_edges{$id_stack[-1]}{$type_id}{depends} = undef;
+    }
+    if (@symbol_stack) {
+      # The enclosing declaration (linked to a symbol) refers to this
+      # type, record the dependency.
+      $symbol_type_edges{$symbol_stack[-1]}{$type_id} = undef;
+    }
+  }
+
+  for my $child ($node->nonBlankChildNodes()) {
+    process($child);
+  }
+
+  if (defined $symbol) {
+    pop @symbol_stack;
+  }
+  if (defined $id) {
+    pop @id_stack;
+  }
+}
+
+# Load the XML.
+my $dom = XML::LibXML->load_xml(IO => \*STDIN);
+# Build a graph.
+process($dom);
+die if @id_stack or @symbol_stack;
+
+%wanted_symbols = %exported_symbols unless defined $symbols_file;
+
+# DFS visited state.
+my %seen_types;
+my %seen_symbols;
+sub dfs_type($);
+sub dfs_symbol($);
+
+sub dfs_type($) {
+  my ($type) = @_;
+  if (not exists $types{$type}) {
+    warn "edge to unknown type $type\n";
+    return;
+  }
+  return if exists $seen_types{$type};
+  $seen_types{$type} = undef;
+
+  my $types = $type_type_edges{$type};
+  if (defined $types) {
+    for my $type (keys %$types) {
+      for my $link (keys %{$types->{$type}}) {
+        if (exists $types->{$type}{$link}) {
+          dfs_type($type);
+        }
+      }
+    }
+  }
+  my $symbols = $type_symbol_edges{$type};
+  if (defined $symbols) {
+    for my $symbol (keys %$symbols) {
+      dfs_symbol($symbol);
+    }
+  }
+}
+
+sub dfs_symbol($) {
+  my ($symbol) = @_;
+  return if exists $seen_symbols{$symbol};
+  $seen_symbols{$symbol} = undef;
+
+  my $types = $symbol_type_edges{$symbol};
+  if (defined $types) {
+    for my $type (keys %$types) {
+      dfs_type($type);
+    }
+  } else {
+    warn "no declaration found for exported symbol $symbol\n";
+  }
+}
+
+# Traverse the graph, starting from the wanted symbols.
+for my $symbol (keys %wanted_symbols) {
+  dfs_symbol($symbol);
+}
+
+# Useful counts.
+sub print_report() {
+  my $count_symbols = scalar keys %symbols;
+  my $count_exported_symbols = scalar keys %exported_symbols;
+  my $count_types = scalar keys %types;
+  my $count_reachable_types = scalar keys %seen_types;
+
+  print STDERR qq{symbols = $count_symbols
+exported symbols = $count_exported_symbols
+types = $count_types
+reachable types = $count_reachable_types
+};
+}
+
+my %drop_if_empty = map { $_ => undef } qw(
+  elf-variable-symbols
+  elf-function-symbols
+  namespace-decl
+  abi-instr
+  abi-corpus
+  abi-corpus-group
+);
+
+# This is another XML DOM traversal.
+sub prune($);
+sub prune($) {
+  my ($node) = @_;
+
+  my $node_name = $node->getName();
+  my $name;
+  my $id;
+  my $symbol;
+
+  if ($node->nodeType == XML_ELEMENT_NODE) {
+    $name = $node->getAttribute('name');
+    $id = $node->getAttribute('id');
+    $symbol = $node->getAttribute('mangled-name');
+    die if defined $id && defined $symbol;
+  }
+
+  # Keep / prune wanted / unwanted symbols
+  if (defined $name && $node_name eq 'elf-symbol') {
+    return !exists $exported_symbols{$name};
+  }
+
+  # Return if we know that this is a type or declaration to keep or
+  # drop in its entirety.
+  if (defined $id) {
+    return !exists $seen_types{$id};
+  }
+  if ($node_name eq 'var-decl' || $node_name eq 'function-decl') {
+    return !defined $symbol || !exists $seen_symbols{$symbol};
+  }
+
+  # Otherwise, this is not a type, declaration or part thereof, so
+  # process child elements.
+  my $empty = 1;
+  for my $child ($node->nonBlankChildNodes()) {
+    if (prune($child)) {
+      $node->removeChild($child);
+    } else {
+      $empty = 0;
+    }
+  }
+
+  # Empty container elements can be dropped.
+  return $empty && exists $drop_if_empty{$node_name};
+}
+
+# Drop unreachable nodes and any container nodes that end up empty.
+prune($dom);
+my $out = $dom->toString();
+# Remove blank line debris.
+$out =~ s;^ *\n;;mg;
+# Remove the XML declaration as abidiff doesn't like it.
+$out =~ s;^<\?xml .*\?>\n;;m;
+print $out;
-- 
2.31.0.rc2.261.g7f71774620-goog


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] Add an experimental ABI pruning utility.
  2021-03-12 16:51   ` Giuliano Procida
@ 2021-03-12 18:41     ` Ben Woodard
  2021-03-15 11:08       ` Giuliano Procida
  0 siblings, 1 reply; 17+ messages in thread
From: Ben Woodard @ 2021-03-12 18:41 UTC (permalink / raw)
  To: Giuliano Procida
  Cc: Giuliano Procida via Libabigail, Matthias Maennich, kernel-team

if you look at abicompat there are functions like: 
perform_compat_check_in_*_mode(options& opts,
                                    diff_context_sptr& ctxt,
                                    corpus_sptr app_corpus,
                                    corpus_sptr lib1_corpus,
                                    corpus_sptr lib2_corpus)
which ultimately does something like your perl script’s ELF reachability check when computing compatibility between libraries. It iterates through the undefined symbols and then calls get_sym_ids_of_*_to_keep(). 

I would think that you would be able to essentially do the same thing inside of abidw to have it prune you abixml file in a similar way to how your perl script does. 

(Paranthetical Note: one thing that really bothers me about those functions which I hope to work on as soon as I get a chance is I believe that they need to take const corpus_sptr’s for the app and lib1 lib2 corpuses. Of course this means that the data structures would have to change and the sym_ids_of_*_to_keep() would need to be some sort of external data structure rather than something inside the corpus itself, I’ve been calling this a subcorpus as I’ve been thinking about how to make this change happen. 

The problem with the way that it is now is when using libabigail as a library, rather than as it is used in a tool like abicompat you want to be able to load a collection of corpora which takes non-trivial time in many cases and use them over and over. For example say you want to load the corpora of many variations of libraries and find the ones that are compatible with each other. You are going to want to use the app corpus over and over without having to reload it from the ELF or abixml each and every time. Yeah as a first pass you could clear the vectors with the *_to_keep but then you would only be able to use the corpus in one thread at a time. I think it would make better library design to move those things to the side.)

> On Mar 12, 2021, at 8:51 AM, Giuliano Procida <gprocida@google.com> wrote:
> 
> Hi.
> 
> On Thu, 11 Mar 2021 at 20:39, Ben Woodard <woodard@redhat.com <mailto:woodard@redhat.com>> wrote:
> I think that you have pointed out a problem that I’ve been discussing with some users that I’m working with. The sheer volume of the abixml output is pretty extreme. I don’t think that they were really expecting it. However, they are not running into the particular situation that you are seeing since they don’t have the coding model of the problem library. They are more likely to fall in the 5-10% range that you reported above. 
> 
> We have been talking about it in terms of “ABI Surface” vs. “ABI Corpus”. The ELF reachability aspect that you point out seems to be part of the difference between those two concepts.
> 
> 
> We want a distilled description of the ABI (the surface), not what may lie on the other side. More precisely, we want a typed interface where "type" encompasses extra things like architecture, calling convention, ISA, ELF symbol properties etc.
> 
> One worry is that just chasing references from the exported ELF symbols may somehow miss something important, say for libraries that ship with bits of implementation in header files.

Right now given the implementation above, it looks like ELF reachability is the only thing that abicompat takes into consideration. What you are saying potentially has important consequences because it could suggest that abicompat may be missing things.

Do you know of or could you construct an example that demonstrates a case that would be missed by this ELF reachability? It may mean that we need to look at ways to enhance the checks in abicompat and in doing so they could also be part of the code that you use to prune the output of abidw.

The thing that pops into my mind would be inline functions defined in the header files. We may be able to find those by searching through the TUs but then there are things like cpp macros even with -g3 I’m not sure we could find those but I’m not entirely sure we would need to.

I really do believe that this topic needs additional investigation.

-ben

>  
> > On Mar 11, 2021, at 3:53 AM, Giuliano Procida via Libabigail <libabigail@sourceware.org <mailto:libabigail@sourceware.org>> wrote:
> > 
> > In response to internal reports about very large ABIs for libraires
> > implemented in C++ but exposing only a C API, I wrote a script as a
> > means of investigating the issue.
> > 
> > The script aims to remove parts of the XML that are irrelevant to
> > abidiff. It has been run on the current libabigail tests as well as
> > some locally-generated ABI files.
> > 
> > On most XML files, this results in a modest saving (5-10%). However,
> > for the library that sparked the invesigaton, the resulting XML was
> > more than 2300 times smaller.
> > 
> > The script may have some merit on its own but perhaps should be
> > implemented differently.
> > 
> > Question: How hard would it be to implement this functionality (under
> > flag control) within libabigail itself?
> 
> I can’t directly answer “how hard” that is one of the things which we would discover by actually doing it. As for should it be done. I think it should be carefully looked into. However, as for priority since it is an optimization rather than a functional problem, if it were me, I would file this as a bug and put examples of the test data and results you are seeing as well as the script below in that bz and then start working on the process of refining libabigail’s output so that it matches the results of your script.
> 
> 
> It's an optimisation analogous to DCE in a compiler. I'll certainly follow up with a bug. It's one library in particular that is ~50M before pruning and ~22k after.
> 
> I've already made changes to the script so that it can be given a list of symbols and produce a reduced ABI. At the moment, we achieve this by passing a list to abidw (with -w). In fact, we extract full and reduced ABIs concurrently in the Android kernel build as each run takes a while. It should be straightforward to reimplement the script in C++, though it's not too slow already.
> 
> In terms of implementation, it would need to work with libabigail IR and ir_node_visitor might be the right place to start.
> 
> Giuliano.
>  
> -ben
> 
> > 
> > A couple of tangential things worth mentioning (see the comments at
> > the very beginning and very end of the script):
> > 
> > - abidiff rejects XML files with an XML declaration at the top
> > - abidiff rejects completely empty files
> > 
> > Resolving the first would make the second moot.
> > 
> >       * scripts/abiprune.pl <http://abiprune.pl/>: Add script to prune ABI XML.
> > 
> > Signed-off-by: Giuliano Procida <gprocida@google.com <mailto:gprocida@google.com>>
> > ---
> > scripts/abiprune.pl <http://abiprune.pl/> | 323 ++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 323 insertions(+)
> > create mode 100755 scripts/abiprune.pl <http://abiprune.pl/>
> > 
> > diff --git a/scripts/abiprune.pl <http://abiprune.pl/> b/scripts/abiprune.pl <http://abiprune.pl/>
> > new file mode 100755
> > index 00000000..8e04a273
> > --- /dev/null
> > +++ b/scripts/abiprune.pl <http://abiprune.pl/>
> > @@ -0,0 +1,323 @@
> > +#!/usr/bin/perl
> > +
> > +# This script is intended to consume libabigail ABI XML as generated
> > +# by abidw and produce a possibly smaller representation that captures
> > +# the same ABI. In particular, the output should be such that abidiff
> > +# --harmless reports no differences (or is empty).
> > +#
> > +# It does this by modelling symbols and types dependencies as a
> > +# directed graph, marking nodes reachable from the ELF symbols and
> > +# dropping unreachable nodes.
> > +
> > +use strict;
> > +use warnings;
> > +no warnings 'recursion';
> > +
> > +use autodie;
> > +
> > +use XML::LibXML;
> > +
> > +# Making a graph from ABI XML.
> > +#
> > +# The following are the types of "node" we care about. The "edges" are
> > +# obtained from a few XML attributes as well as via XML element
> > +# containment.
> > +#
> > +#  ELF (exported) symbols
> > +#
> > +#   elf-symbol (has a name; the code here currently knows nothing
> > +#     about aliases)
> > +#
> > +#  Declarations (that mention a symbol)
> > +#
> > +#   These live in a scope. In C++ scopes can be nested and include
> > +#   namespaces and class types.
> > +#
> > +#  var-decl (also used for member variables)
> > +#    elf-symbol linked to symbol via mangled-name
> > +#    type-id links to a type
> > +#  function-decl (also used for member functions)
> > +#    elf-symbol linked to symbol via mangled-name
> > +#    parameter and return type-ids link to types
> > +#      (alas, not just a link to a function type)
> > +#
> > +#  Types
> > +#
> > +#   These occupy pretty much all the other elements, besides those
> > +#   that act as simple containers.
> > +#
> > +# XML elements and their roles in more detail
> > +#
> > +# ELF
> > +#
> > +# elf-needed - container
> > +#  dependency - names a library
> > +# elf-function-symbols - contains a list of symbols
> > +# elf-variable-symbols - contains a list of symbols
> > +#  elf-symbol - describes an ELF variable or function
> > +#
> > +# Grouping and scoping
> > +#
> > +# abi-corpus-group
> > +#  abi-corpus
> > +#   abi-instr - compilation unit containers
> > +#    namespace-decl - pure container, possibly named
> > +#
> > +# Types (some introduce scopes, only in C++)
> > +#
> > +# type-decl - defines a primitive type
> > +# typedef-decl - defines a type, links to a type
> > +# qualified-type-def - defines a type, links to a type
> > +# pointer-type-def - defines a type, links to a type
> > +# reference-type-def - defines a type, links to a type
> > +# array-type-def - defines a (multidimensional array) type, refers to element type, contains subranges
> > +#  subrange - contains array length, refers to element type; defines types (never referred to; duplicated)
> > +# function-type - defines a type
> > +#  parameter - belongs to function-type and -decl, links to a type
> > +#  return - belongs to function-type and -decl, links to a type
> > +# enum-decl - defines a type, names it, contains a list of enumerators and an underlying-type
> > +#  underlying-type - belongs to enum-decl
> > +#  enumerator - belongs to enum-decl
> > +# union-decl - defines and names a type, contains member elements linked to other things
> > +# class-decl - defines and names a type, contains base type, member elements linking to other things
> > +#  base-class - belongs to class-decl
> > +#  data-member - container for a member; holds access level
> > +#  member-function - container for a member; holds access level
> > +#  member-type - container for a type declaration; holds access level
> > +#  member-template - container for a (function?) template declaration; holds access level
> > +#
> > +# Higher order Things
> > +#
> > +# class-template-decl - defines a "type", but without instantiation this isn't usable
> > +# function-template-decl - defines a "type", but without instantiation this isn't usable
> > +#  template-type-parameter - defines a type (formal variable), perhaps one which should be excluded from the real type graph
> > +#  template-non-type-parameter - names a template parameter, links to a type
> > +#  template-parameter-type-composition - container?
> > +#
> > +# Values
> > +#
> > +# function-decl - names a function, can link to a symbol
> > +#     has same children as function-type, rather than linking to a type
> > +# var-decl - names a variable, can link to a symbol, links to a type
> > +
> > +# Graph nodes.
> > +my %symbols;
> > +my %exported_symbols;
> > +my %types;
> > +# Graph edges.
> > +my %type_type_edges;
> > +my %symbol_type_edges;
> > +my %type_symbol_edges;
> > +
> > +# Keep track of the types being defined.
> > +my @id_stack;
> > +# Keep track of the current (value) declaration.
> > +my @symbol_stack;
> > +sub process($);
> > +sub process($) {
> > +  my ($node) = @_;
> > +
> > +  # The XML attributes we care about.
> > +  my $name;
> > +  my $id;
> > +  my $type_id;
> > +  my $symbol;
> > +  my $naming_typedef_id;
> > +
> > +  # Not every node we encounter is an XML element.
> > +  if ($node->nodeType == XML_ELEMENT_NODE) {
> > +    $name = $node->getAttribute('name');
> > +    $id = $node->getAttribute('id');
> > +    $type_id = $node->getAttribute('type-id');
> > +    $symbol = $node->getAttribute('mangled-name');
> > +    $naming_typedef_id = $node->getAttribute('naming-typedef-id');
> > +    die if defined $id && defined $symbol;
> > +  }
> > +
> > +  if (defined $name && $node->getName() eq 'elf-symbol') {
> > +    $exported_symbols{$name} = undef;
> > +    # Early return is safe, but not necessary.
> > +    return;
> > +  }
> > +
> > +  if (defined $id) {
> > +    # This element defines a type (but there may be more than one
> > +    # defining the same type - we cannot rely on uniqueness).
> > +    $types{$id} = undef;
> > +    if (defined $naming_typedef_id) {
> > +      # This is an odd one, there can be a backwards link from an
> > +      # anonymous type to the typedef that refers to it, so we need to
> > +      # pull in the typedef, even if nothing else refers to it.
> > +      $type_type_edges{$id}{$naming_typedef_id}{naming} = undef;
> > +    }
> > +    if (@id_stack) {
> > +      # Type parent<->child dependencies; record dependencies both
> > +      # ways to avoid making holes in the XML types.
> > +      $type_type_edges{$id_stack[-1]}{$id}{contains} = undef;
> > +      $type_type_edges{$id}{$id_stack[-1]}{in} = undef;
> > +    }
> > +    push @id_stack, $id;
> > +  }
> > +
> > +  if (defined $symbol) {
> > +    # This element is a declaration linked to a symbol (whether or not
> > +    # exported).
> > +    $symbols{$symbol} = undef;
> > +    if (@id_stack) {
> > +      # The enclosing scope may include a C++ class, in which case we
> > +      # need to record the dependency.
> > +      $type_symbol_edges{$id_stack[-1]}{$symbol} = undef;
> > +    }
> > +    # The symbol depends on the types mentioned in this element, so
> > +    # record it.
> > +    push @symbol_stack, $symbol;
> > +    die "nested symbols: @symbol_stack\n" if scalar @symbol_stack > 1;
> > +  }
> > +
> > +  if (defined $type_id) {
> > +    if (@id_stack) {
> > +      # The enclosing type refers to this type, record the dependency.
> > +      $type_type_edges{$id_stack[-1]}{$type_id}{depends} = undef;
> > +    }
> > +    if (@symbol_stack) {
> > +      # The enclosing declaration (linked to a symbol) refers to this
> > +      # type, record the dependency.
> > +      $symbol_type_edges{$symbol_stack[-1]}{$type_id} = undef;
> > +    }
> > +  }
> > +
> > +  for my $child ($node->nonBlankChildNodes()) {
> > +    process($child);
> > +  }
> > +
> > +  if (defined $symbol) {
> > +    pop @symbol_stack;
> > +  }
> > +  if (defined $id) {
> > +    pop @id_stack;
> > +  }
> > +}
> > +
> > +# Load the XML.
> > +my $dom = XML::LibXML->load_xml(IO => \*STDIN);
> > +# Build a graph.
> > +process($dom);
> > +die if @id_stack or @symbol_stack;
> > +
> > +# DFS visited state.
> > +my %seen_types;
> > +my %seen_symbols;
> > +sub dfs_type($);
> > +sub dfs_symbol($);
> > +
> > +sub dfs_type($) {
> > +  my ($type) = @_;
> > +  return if exists $seen_types{$type};
> > +  $seen_types{$type} = undef;
> > +
> > +  my $types = $type_type_edges{$type};
> > +  if (defined $types) {
> > +    for my $type (keys %$types) {
> > +      for my $link (keys %{$types->{$type}}) {
> > +        if (exists $types->{$type}{$link}) {
> > +          dfs_type($type);
> > +        }
> > +      }
> > +    }
> > +  }
> > +  my $symbols = $type_symbol_edges{$type};
> > +  if (defined $symbols) {
> > +    for my $symbol (keys %$symbols) {
> > +      dfs_symbol($symbol);
> > +    }
> > +  }
> > +}
> > +
> > +sub dfs_symbol($) {
> > +  my ($symbol) = @_;
> > +  return if exists $seen_symbols{$symbol};
> > +  $seen_symbols{$symbol} = undef;
> > +
> > +  my $types = $symbol_type_edges{$symbol};
> > +  if (defined $types) {
> > +    for my $type (keys %$types) {
> > +      dfs_type($type);
> > +    }
> > +  } else {
> > +    warn "no declaration found for exported symbol $symbol\n";
> > +  }
> > +}
> > +
> > +# Traverse the graph, using the exported symbols as the roots.
> > +for my $symbol (keys %exported_symbols) {
> > +  dfs_symbol($symbol);
> > +}
> > +
> > +# Useful counts.
> > +sub print_report() {
> > +  my $count_symbols = scalar keys %symbols;
> > +  my $count_exported_symbols = scalar keys %exported_symbols;
> > +  my $count_types = scalar keys %types;
> > +  my $count_reachable_types = scalar keys %seen_types;
> > +
> > +  print STDERR qq{symbols = $count_symbols
> > +exported symbols = $count_exported_symbols
> > +types = $count_types
> > +reachable types = $count_reachable_types
> > +};
> > +}
> > +
> > +my %drop_if_empty = map { $_ => undef } qw(
> > +  namespace-decl
> > +  abi-instr
> > +  abi-corpus
> > +  abi-corpus-group
> > +);
> > +
> > +# This is another DFS traversal of the XML.
> > +sub prune($);
> > +sub prune($) {
> > +  my ($node) = @_;
> > +
> > +  my $name = $node->getName();
> > +  my $id;
> > +  my $symbol;
> > +
> > +  if ($node->nodeType == XML_ELEMENT_NODE) {
> > +    $id = $node->getAttribute('id');
> > +    $symbol = $node->getAttribute('mangled-name');
> > +    die if defined $id && defined $symbol;
> > +  }
> > +
> > +  # Return if we know that this is a type or declaration to keep or
> > +  # drop in its entirety.
> > +  if (defined $id) {
> > +    return !exists $seen_types{$id};
> > +  }
> > +  if ($name eq 'var-decl' || $name eq 'function-decl') {
> > +    return !defined $symbol || !exists $exported_symbols{$symbol};
> > +  }
> > +
> > +  # Otherwise, this is not a type, declaration or part thereof, so
> > +  # process child elements.
> > +  my $empty = 1;
> > +  for my $child ($node->nonBlankChildNodes()) {
> > +    if (prune($child)) {
> > +      $node->removeChild($child);
> > +    } else {
> > +      $empty = 0;
> > +    }
> > +  }
> > +
> > +  # Empty container elements can be dropped.
> > +  return $empty && exists $drop_if_empty{$name};
> > +}
> > +
> > +# Drop unreachable nodes and any container nodes that end up empty.
> > +prune($dom);
> > +my $out = $dom->toString();
> > +# Remove blank line debris.
> > +$out =~ s;^ *\n;;mg;
> > +# Remove the XML declaration as abidiff doesn't like it.
> > +$out =~ s;^<\?xml .*\?>\n;;m;
> > +print $out;
> > -- 
> > 2.31.0.rc2.261.g7f71774620-goog
> > 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] Add an experimental ABI pruning utility.
  2021-03-12 18:41     ` Ben Woodard
@ 2021-03-15 11:08       ` Giuliano Procida
  0 siblings, 0 replies; 17+ messages in thread
From: Giuliano Procida @ 2021-03-15 11:08 UTC (permalink / raw)
  To: Ben Woodard
  Cc: Giuliano Procida via Libabigail, Matthias Maennich, kernel-team

Hi.

On Fri, 12 Mar 2021 at 18:41, Ben Woodard <woodard@redhat.com> wrote:

> if you look at abicompat there are functions like:
> perform_compat_check_in_*_mode(options& opts,
>                                     diff_context_sptr& ctxt,
>                                     corpus_sptr app_corpus,
>                                     corpus_sptr lib1_corpus,
>                                     corpus_sptr lib2_corpus)
> which ultimately does something like your perl script’s ELF reachability
> check when computing compatibility between libraries. It iterates through
> the undefined symbols and then calls get_sym_ids_of_*_to_keep().
>
>
So abicompat is a bit like abidiff -w symbol_list where the symbol_list is
just taken from an application.


> I would think that you would be able to essentially do the same thing
> inside of abidw to have it prune you abixml file in a similar way to how
> your perl script does.
>
>
Here the symbol list would be taken to be the ELF symbols. The harder part
is implementing the reachability traversal.


> (Paranthetical Note: one thing that really bothers me about those
> functions which I hope to work on as soon as I get a chance is I believe
> that they need to take const corpus_sptr’s for the app and lib1 lib2
> corpuses. Of course this means that the data structures would have to
> change and the sym_ids_of_*_to_keep() would need to be some sort of
> external data structure rather than something inside the corpus itself,
> I’ve been calling this a subcorpus as I’ve been thinking about how to make
> this change happen.
>
> The problem with the way that it is now is when using libabigail as a
> library, rather than as it is used in a tool like abicompat you want to be
> able to load a collection of corpora which takes non-trivial time in many
> cases and use them over and over. For example say you want to load the
> corpora of many variations of libraries and find the ones that are
> compatible with each other. You are going to want to use the app corpus
> over and over without having to reload it from the ELF or abixml each and
> every time. Yeah as a first pass you could clear the vectors with the
> *_to_keep but then you would only be able to use the corpus in one thread
> at a time. I think it would make better library design to move those things
> to the side.)
>
>
Some of the existing traversals in libabigail also have the notion of
"visited" as part of another object (like diff_context) and so mutate this
object, rather than being instantiated just for the given traversal. I
agree that it would be good to separate out this kind of interpretive state
from underlying objects.

In the case of optionally pruning unreachable symbols and types, having
extra state in read_context would make what's a very large object even
larger.

Logically, what we'd want to happen is something like:

set<node> reachable(const set<node> & roots, const corpus & corpus) {
  set<node> visited;
  for (const auto & root : roots)
    visit(visited, root, corpus);
  return visited;
}

corpus = read_corpus();
if (prune) {
  auto roots = exported(corpus);
  auto reachable = reachable(roots, corpus);
  corpus = filter(reachable, corpus);
}
write_corpus(corpus);

On Mar 12, 2021, at 8:51 AM, Giuliano Procida <gprocida@google.com> wrote:
>
> Hi.
>
> On Thu, 11 Mar 2021 at 20:39, Ben Woodard <woodard@redhat.com> wrote:
>
>> I think that you have pointed out a problem that I’ve been discussing
>> with some users that I’m working with. The sheer volume of the abixml
>> output is pretty extreme. I don’t think that they were really expecting it.
>> However, they are not running into the particular situation that you are
>> seeing since they don’t have the coding model of the problem library. They
>> are more likely to fall in the 5-10% range that you reported above.
>>
>> We have been talking about it in terms of “ABI Surface” vs. “ABI Corpus”.
>> The ELF reachability aspect that you point out seems to be part of the
>> difference between those two concepts.
>>
>>
> We want a distilled description of the ABI (the surface), not what may lie
> on the other side. More precisely, we want a typed interface where "type"
> encompasses extra things like architecture, calling convention, ISA, ELF
> symbol properties etc.
>
> One worry is that just chasing references from the exported ELF symbols
> may somehow miss something important, say for libraries that ship with bits
> of implementation in header files.
>
>
> Right now given the implementation above, it looks like ELF reachability
> is the only thing that abicompat takes into consideration. What you are
> saying potentially has important consequences because it could suggest that
> abicompat may be missing things.
>
> Do you know of or could you construct an example that demonstrates a case
> that would be missed by this ELF reachability? It may mean that we need to
> look at ways to enhance the checks in abicompat and in doing so they could
> also be part of the code that you use to prune the output of abidw.
>
>
Not really. My main thoughts were constructors and (as you mention) inline
functions in general.

If we think abouts what's being pruned:

1. declarations of functions and variables, absent from the ELF symbol table

These are either unusable (link failures) or have an implementation
entirely within header files.

In the simplest cases, a constexpr value or the behaviour of an inline
function could change between library versions, or their types could
change. We cannot possibly detect all of these kinds of things. The changes
may well be incompatible.

Note that abidiff doesn't report any such changes; I checked both with
abidiff -w and using objcopy to strip unwanted symbols.

2. types that are not reachable from any of the ELF symbols

If the types are not POD and caused the emission of constructors etc., then
the functions are covered by 1.

For POD types, the user can have code that builds, consumes and mutates
objects of these types. But it has no way to pass such objects through the
library interface (symbols) unless the types are hidden behind things like
void *.

If in some future revision of libabigail, we had a mechanism to say "this
void could actually be one of these other types" in the ABI, then the types
would become reachable, diffable and also not pruned by abiprune.

The thing that pops into my mind would be inline functions defined in the
> header files. We may be able to find those by searching through the TUs but
> then there are things like cpp macros even with -g3 I’m not sure we could
> find those but I’m not entirely sure we would need to.
>
> I really do believe that this topic needs additional investigation.
>
>
I think the question to answer is roughly: what can we get from the library
headers that is not reachable from the export ELF symbols? And the answer
consists of the values of all defines, macros, constants, inline functions.

However, it may be necessary to subtract some of the OS "background" noise.
Just because a random dependency of the library introduces a new constant,
it doesn't mean we care. We'd also need to decide where to draw a line
(which header files are excluded).

At one end of the spectrum, we'd need a sophisticated parser and deep
knowledge of the language semantics and at the other we'd just offer up
header diffs. How much of this were you hoping would be in the TUs (bearing
in mind that the library may not even use everything it makes available to
client code)?


> -ben
>
>
Giuliano.


>
>> > On Mar 11, 2021, at 3:53 AM, Giuliano Procida via Libabigail <
>> libabigail@sourceware.org> wrote:
>> >
>> > In response to internal reports about very large ABIs for libraires
>> > implemented in C++ but exposing only a C API, I wrote a script as a
>> > means of investigating the issue.
>> >
>> > The script aims to remove parts of the XML that are irrelevant to
>> > abidiff. It has been run on the current libabigail tests as well as
>> > some locally-generated ABI files.
>> >
>> > On most XML files, this results in a modest saving (5-10%). However,
>> > for the library that sparked the invesigaton, the resulting XML was
>> > more than 2300 times smaller.
>> >
>> > The script may have some merit on its own but perhaps should be
>> > implemented differently.
>> >
>> > Question: How hard would it be to implement this functionality (under
>> > flag control) within libabigail itself?
>>
>> I can’t directly answer “how hard” that is one of the things which we
>> would discover by actually doing it. As for should it be done. I think it
>> should be carefully looked into. However, as for priority since it is an
>> optimization rather than a functional problem, if it were me, I would file
>> this as a bug and put examples of the test data and results you are seeing
>> as well as the script below in that bz and then start working on the
>> process of refining libabigail’s output so that it matches the results of
>> your script.
>>
>>
> It's an optimisation analogous to DCE in a compiler. I'll certainly follow
> up with a bug. It's one library in particular that is ~50M before pruning
> and ~22k after.
>
> I've already made changes to the script so that it can be given a list of
> symbols and produce a reduced ABI. At the moment, we achieve this by
> passing a list to abidw (with -w). In fact, we extract full and reduced
> ABIs concurrently in the Android kernel build as each run takes a while. It
> should be straightforward to reimplement the script in C++, though it's not
> too slow already.
>
> In terms of implementation, it would need to work with libabigail IR and
> ir_node_visitor might be the right place to start.
>
> Giuliano.
>
>
>> -ben
>>
>> >
>> > A couple of tangential things worth mentioning (see the comments at
>> > the very beginning and very end of the script):
>> >
>> > - abidiff rejects XML files with an XML declaration at the top
>> > - abidiff rejects completely empty files
>> >
>> > Resolving the first would make the second moot.
>> >
>> >       * scripts/abiprune.pl: Add script to prune ABI XML.
>> >
>> > Signed-off-by: Giuliano Procida <gprocida@google.com>
>> > ---
>> > scripts/abiprune.pl | 323 ++++++++++++++++++++++++++++++++++++++++++++
>> > 1 file changed, 323 insertions(+)
>> > create mode 100755 scripts/abiprune.pl
>> >
>> > diff --git a/scripts/abiprune.pl b/scripts/abiprune.pl
>> > new file mode 100755
>> > index 00000000..8e04a273
>> > --- /dev/null
>> > +++ b/scripts/abiprune.pl
>> > @@ -0,0 +1,323 @@
>> > +#!/usr/bin/perl
>> > +
>> > +# This script is intended to consume libabigail ABI XML as generated
>> > +# by abidw and produce a possibly smaller representation that captures
>> > +# the same ABI. In particular, the output should be such that abidiff
>> > +# --harmless reports no differences (or is empty).
>> > +#
>> > +# It does this by modelling symbols and types dependencies as a
>> > +# directed graph, marking nodes reachable from the ELF symbols and
>> > +# dropping unreachable nodes.
>> > +
>> > +use strict;
>> > +use warnings;
>> > +no warnings 'recursion';
>> > +
>> > +use autodie;
>> > +
>> > +use XML::LibXML;
>> > +
>> > +# Making a graph from ABI XML.
>> > +#
>> > +# The following are the types of "node" we care about. The "edges" are
>> > +# obtained from a few XML attributes as well as via XML element
>> > +# containment.
>> > +#
>> > +#  ELF (exported) symbols
>> > +#
>> > +#   elf-symbol (has a name; the code here currently knows nothing
>> > +#     about aliases)
>> > +#
>> > +#  Declarations (that mention a symbol)
>> > +#
>> > +#   These live in a scope. In C++ scopes can be nested and include
>> > +#   namespaces and class types.
>> > +#
>> > +#  var-decl (also used for member variables)
>> > +#    elf-symbol linked to symbol via mangled-name
>> > +#    type-id links to a type
>> > +#  function-decl (also used for member functions)
>> > +#    elf-symbol linked to symbol via mangled-name
>> > +#    parameter and return type-ids link to types
>> > +#      (alas, not just a link to a function type)
>> > +#
>> > +#  Types
>> > +#
>> > +#   These occupy pretty much all the other elements, besides those
>> > +#   that act as simple containers.
>> > +#
>> > +# XML elements and their roles in more detail
>> > +#
>> > +# ELF
>> > +#
>> > +# elf-needed - container
>> > +#  dependency - names a library
>> > +# elf-function-symbols - contains a list of symbols
>> > +# elf-variable-symbols - contains a list of symbols
>> > +#  elf-symbol - describes an ELF variable or function
>> > +#
>> > +# Grouping and scoping
>> > +#
>> > +# abi-corpus-group
>> > +#  abi-corpus
>> > +#   abi-instr - compilation unit containers
>> > +#    namespace-decl - pure container, possibly named
>> > +#
>> > +# Types (some introduce scopes, only in C++)
>> > +#
>> > +# type-decl - defines a primitive type
>> > +# typedef-decl - defines a type, links to a type
>> > +# qualified-type-def - defines a type, links to a type
>> > +# pointer-type-def - defines a type, links to a type
>> > +# reference-type-def - defines a type, links to a type
>> > +# array-type-def - defines a (multidimensional array) type, refers to
>> element type, contains subranges
>> > +#  subrange - contains array length, refers to element type; defines
>> types (never referred to; duplicated)
>> > +# function-type - defines a type
>> > +#  parameter - belongs to function-type and -decl, links to a type
>> > +#  return - belongs to function-type and -decl, links to a type
>> > +# enum-decl - defines a type, names it, contains a list of enumerators
>> and an underlying-type
>> > +#  underlying-type - belongs to enum-decl
>> > +#  enumerator - belongs to enum-decl
>> > +# union-decl - defines and names a type, contains member elements
>> linked to other things
>> > +# class-decl - defines and names a type, contains base type, member
>> elements linking to other things
>> > +#  base-class - belongs to class-decl
>> > +#  data-member - container for a member; holds access level
>> > +#  member-function - container for a member; holds access level
>> > +#  member-type - container for a type declaration; holds access level
>> > +#  member-template - container for a (function?) template declaration;
>> holds access level
>> > +#
>> > +# Higher order Things
>> > +#
>> > +# class-template-decl - defines a "type", but without instantiation
>> this isn't usable
>> > +# function-template-decl - defines a "type", but without instantiation
>> this isn't usable
>> > +#  template-type-parameter - defines a type (formal variable), perhaps
>> one which should be excluded from the real type graph
>> > +#  template-non-type-parameter - names a template parameter, links to
>> a type
>> > +#  template-parameter-type-composition - container?
>> > +#
>> > +# Values
>> > +#
>> > +# function-decl - names a function, can link to a symbol
>> > +#     has same children as function-type, rather than linking to a type
>> > +# var-decl - names a variable, can link to a symbol, links to a type
>> > +
>> > +# Graph nodes.
>> > +my %symbols;
>> > +my %exported_symbols;
>> > +my %types;
>> > +# Graph edges.
>> > +my %type_type_edges;
>> > +my %symbol_type_edges;
>> > +my %type_symbol_edges;
>> > +
>> > +# Keep track of the types being defined.
>> > +my @id_stack;
>> > +# Keep track of the current (value) declaration.
>> > +my @symbol_stack;
>> > +sub process($);
>> > +sub process($) {
>> > +  my ($node) = @_;
>> > +
>> > +  # The XML attributes we care about.
>> > +  my $name;
>> > +  my $id;
>> > +  my $type_id;
>> > +  my $symbol;
>> > +  my $naming_typedef_id;
>> > +
>> > +  # Not every node we encounter is an XML element.
>> > +  if ($node->nodeType == XML_ELEMENT_NODE) {
>> > +    $name = $node->getAttribute('name');
>> > +    $id = $node->getAttribute('id');
>> > +    $type_id = $node->getAttribute('type-id');
>> > +    $symbol = $node->getAttribute('mangled-name');
>> > +    $naming_typedef_id = $node->getAttribute('naming-typedef-id');
>> > +    die if defined $id && defined $symbol;
>> > +  }
>> > +
>> > +  if (defined $name && $node->getName() eq 'elf-symbol') {
>> > +    $exported_symbols{$name} = undef;
>> > +    # Early return is safe, but not necessary.
>> > +    return;
>> > +  }
>> > +
>> > +  if (defined $id) {
>> > +    # This element defines a type (but there may be more than one
>> > +    # defining the same type - we cannot rely on uniqueness).
>> > +    $types{$id} = undef;
>> > +    if (defined $naming_typedef_id) {
>> > +      # This is an odd one, there can be a backwards link from an
>> > +      # anonymous type to the typedef that refers to it, so we need to
>> > +      # pull in the typedef, even if nothing else refers to it.
>> > +      $type_type_edges{$id}{$naming_typedef_id}{naming} = undef;
>> > +    }
>> > +    if (@id_stack) {
>> > +      # Type parent<->child dependencies; record dependencies both
>> > +      # ways to avoid making holes in the XML types.
>> > +      $type_type_edges{$id_stack[-1]}{$id}{contains} = undef;
>> > +      $type_type_edges{$id}{$id_stack[-1]}{in} = undef;
>> > +    }
>> > +    push @id_stack, $id;
>> > +  }
>> > +
>> > +  if (defined $symbol) {
>> > +    # This element is a declaration linked to a symbol (whether or not
>> > +    # exported).
>> > +    $symbols{$symbol} = undef;
>> > +    if (@id_stack) {
>> > +      # The enclosing scope may include a C++ class, in which case we
>> > +      # need to record the dependency.
>> > +      $type_symbol_edges{$id_stack[-1]}{$symbol} = undef;
>> > +    }
>> > +    # The symbol depends on the types mentioned in this element, so
>> > +    # record it.
>> > +    push @symbol_stack, $symbol;
>> > +    die "nested symbols: @symbol_stack\n" if scalar @symbol_stack > 1;
>> > +  }
>> > +
>> > +  if (defined $type_id) {
>> > +    if (@id_stack) {
>> > +      # The enclosing type refers to this type, record the dependency.
>> > +      $type_type_edges{$id_stack[-1]}{$type_id}{depends} = undef;
>> > +    }
>> > +    if (@symbol_stack) {
>> > +      # The enclosing declaration (linked to a symbol) refers to this
>> > +      # type, record the dependency.
>> > +      $symbol_type_edges{$symbol_stack[-1]}{$type_id} = undef;
>> > +    }
>> > +  }
>> > +
>> > +  for my $child ($node->nonBlankChildNodes()) {
>> > +    process($child);
>> > +  }
>> > +
>> > +  if (defined $symbol) {
>> > +    pop @symbol_stack;
>> > +  }
>> > +  if (defined $id) {
>> > +    pop @id_stack;
>> > +  }
>> > +}
>> > +
>> > +# Load the XML.
>> > +my $dom = XML::LibXML->load_xml(IO => \*STDIN);
>> > +# Build a graph.
>> > +process($dom);
>> > +die if @id_stack or @symbol_stack;
>> > +
>> > +# DFS visited state.
>> > +my %seen_types;
>> > +my %seen_symbols;
>> > +sub dfs_type($);
>> > +sub dfs_symbol($);
>> > +
>> > +sub dfs_type($) {
>> > +  my ($type) = @_;
>> > +  return if exists $seen_types{$type};
>> > +  $seen_types{$type} = undef;
>> > +
>> > +  my $types = $type_type_edges{$type};
>> > +  if (defined $types) {
>> > +    for my $type (keys %$types) {
>> > +      for my $link (keys %{$types->{$type}}) {
>> > +        if (exists $types->{$type}{$link}) {
>> > +          dfs_type($type);
>> > +        }
>> > +      }
>> > +    }
>> > +  }
>> > +  my $symbols = $type_symbol_edges{$type};
>> > +  if (defined $symbols) {
>> > +    for my $symbol (keys %$symbols) {
>> > +      dfs_symbol($symbol);
>> > +    }
>> > +  }
>> > +}
>> > +
>> > +sub dfs_symbol($) {
>> > +  my ($symbol) = @_;
>> > +  return if exists $seen_symbols{$symbol};
>> > +  $seen_symbols{$symbol} = undef;
>> > +
>> > +  my $types = $symbol_type_edges{$symbol};
>> > +  if (defined $types) {
>> > +    for my $type (keys %$types) {
>> > +      dfs_type($type);
>> > +    }
>> > +  } else {
>> > +    warn "no declaration found for exported symbol $symbol\n";
>> > +  }
>> > +}
>> > +
>> > +# Traverse the graph, using the exported symbols as the roots.
>> > +for my $symbol (keys %exported_symbols) {
>> > +  dfs_symbol($symbol);
>> > +}
>> > +
>> > +# Useful counts.
>> > +sub print_report() {
>> > +  my $count_symbols = scalar keys %symbols;
>> > +  my $count_exported_symbols = scalar keys %exported_symbols;
>> > +  my $count_types = scalar keys %types;
>> > +  my $count_reachable_types = scalar keys %seen_types;
>> > +
>> > +  print STDERR qq{symbols = $count_symbols
>> > +exported symbols = $count_exported_symbols
>> > +types = $count_types
>> > +reachable types = $count_reachable_types
>> > +};
>> > +}
>> > +
>> > +my %drop_if_empty = map { $_ => undef } qw(
>> > +  namespace-decl
>> > +  abi-instr
>> > +  abi-corpus
>> > +  abi-corpus-group
>> > +);
>> > +
>> > +# This is another DFS traversal of the XML.
>> > +sub prune($);
>> > +sub prune($) {
>> > +  my ($node) = @_;
>> > +
>> > +  my $name = $node->getName();
>> > +  my $id;
>> > +  my $symbol;
>> > +
>> > +  if ($node->nodeType == XML_ELEMENT_NODE) {
>> > +    $id = $node->getAttribute('id');
>> > +    $symbol = $node->getAttribute('mangled-name');
>> > +    die if defined $id && defined $symbol;
>> > +  }
>> > +
>> > +  # Return if we know that this is a type or declaration to keep or
>> > +  # drop in its entirety.
>> > +  if (defined $id) {
>> > +    return !exists $seen_types{$id};
>> > +  }
>> > +  if ($name eq 'var-decl' || $name eq 'function-decl') {
>> > +    return !defined $symbol || !exists $exported_symbols{$symbol};
>> > +  }
>> > +
>> > +  # Otherwise, this is not a type, declaration or part thereof, so
>> > +  # process child elements.
>> > +  my $empty = 1;
>> > +  for my $child ($node->nonBlankChildNodes()) {
>> > +    if (prune($child)) {
>> > +      $node->removeChild($child);
>> > +    } else {
>> > +      $empty = 0;
>> > +    }
>> > +  }
>> > +
>> > +  # Empty container elements can be dropped.
>> > +  return $empty && exists $drop_if_empty{$name};
>> > +}
>> > +
>> > +# Drop unreachable nodes and any container nodes that end up empty.
>> > +prune($dom);
>> > +my $out = $dom->toString();
>> > +# Remove blank line debris.
>> > +$out =~ s;^ *\n;;mg;
>> > +# Remove the XML declaration as abidiff doesn't like it.
>> > +$out =~ s;^<\?xml .*\?>\n;;m;
>> > +print $out;
>> > --
>> > 2.31.0.rc2.261.g7f71774620-goog
>> >
>
>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH v3] Add an experimental ABI pruning utility.
  2021-03-12 16:59 ` [RFC PATCH v2] " Giuliano Procida
@ 2021-03-16 16:55   ` Giuliano Procida
  2021-03-25 21:51     ` [RFC PATCH 0/9] Utility to manipulate ABI XML Giuliano Procida
  0 siblings, 1 reply; 17+ messages in thread
From: Giuliano Procida @ 2021-03-16 16:55 UTC (permalink / raw)
  To: libabigail; +Cc: dodji, kernel-team, gprocida, maennich, woodard

In response to internal reports about very large ABIs for libraires
implemented in C++ but exposing only a C API, I wrote a script as a
means of investigating the issue.

The script aims to remove parts of the XML that are irrelevant to
abidiff. It has been run on current libabigail test data ABI files as
well as some locally-generated ones.

The script by default considers all exported ELF symbols as the roots
of interest (but this list can be overridden).

On most XML files, this results in a modest saving (5-10%). However,
for the library that sparked the invesigaton, the resulting XML was
more than 2300 times smaller.

The script may have some merit on its own but perhaps should be
implemented differently (say in C++).

Question: How hard would it be to implement this functionality (under
flag control) within libabigail itself?

A couple of tangential things worth mentioning (see the comments at
the very beginning and very end of the script):

- abidiff rejects XML files with an XML declaration at the top
- abidiff rejects completely empty files

Resolving the first would make the second moot.

	* scripts/abiprune.pl: Add script to prune ABI XML or restrict
	it to a subset of symbols.

Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 scripts/abiprune.pl | 356 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 356 insertions(+)
 create mode 100755 scripts/abiprune.pl

diff --git a/scripts/abiprune.pl b/scripts/abiprune.pl
new file mode 100755
index 00000000..e26d3115
--- /dev/null
+++ b/scripts/abiprune.pl
@@ -0,0 +1,356 @@
+#!/usr/bin/perl
+
+# This script is intended to consume libabigail ABI XML as generated
+# by abidw and produce a possibly smaller representation that captures
+# the same ABI. In particular, the output should be such that abidiff
+# --harmless reports no differences (or is empty).
+#
+# It does this by modelling symbols and types dependencies as a
+# directed graph, marking nodes reachable from the ELF symbols and
+# dropping unreachable nodes.
+
+use strict;
+use warnings;
+no warnings 'recursion';
+
+use autodie;
+
+use Getopt::Long;
+use IO::File;
+use XML::LibXML;
+
+# Making a graph from ABI XML.
+#
+# The following are the types of "node" we care about. The "edges" are
+# obtained from a few XML attributes as well as via XML element
+# containment.
+#
+#  ELF (exported) symbols
+#
+#   elf-symbol (has a name; the code here currently knows nothing
+#     about aliases)
+#
+#  Declarations (that mention a symbol)
+#
+#   These live in a scope. In C++ scopes can be nested and include
+#   namespaces and class types.
+#
+#  var-decl (also used for member variables)
+#    elf-symbol linked to symbol via mangled-name
+#    type-id links to a type
+#  function-decl (also used for member functions)
+#    elf-symbol linked to symbol via mangled-name
+#    parameter and return type-ids link to types
+#      (alas, not just a link to a function type)
+#
+#  Types
+#
+#   These occupy pretty much all the other elements, besides those
+#   that act as simple containers.
+#
+# XML elements and their roles in more detail
+#
+# ELF
+#
+# elf-needed - container
+#  dependency - names a library
+# elf-variable-symbols - contains a list of symbols
+# elf-function-symbols - contains a list of symbols
+#  elf-symbol - describes an ELF variable or function
+#
+# Grouping and scoping
+#
+# abi-corpus-group
+#  abi-corpus
+#   abi-instr - compilation unit containers
+#    namespace-decl - pure container, possibly named
+#
+# Types (some introduce scopes, only in C++)
+#
+# type-decl - defines a primitive type
+# typedef-decl - defines a type, links to a type
+# qualified-type-def - defines a type, links to a type
+# pointer-type-def - defines a type, links to a type
+# reference-type-def - defines a type, links to a type
+# array-type-def - defines a (multidimensional array) type, refers to element type, contains subranges
+#  subrange - contains array length, refers to element type; defines types (never referred to; duplicated)
+# function-type - defines a type
+#  parameter - belongs to function-type and -decl, links to a type
+#  return - belongs to function-type and -decl, links to a type
+# enum-decl - defines a type, names it, contains a list of enumerators and an underlying-type
+#  underlying-type - belongs to enum-decl
+#  enumerator - belongs to enum-decl
+# union-decl - defines and names a type, contains member elements linked to other things
+# class-decl - defines and names a type, contains base type, member elements linking to other things
+#  base-class - belongs to class-decl
+#  data-member - container for a member; holds access level
+#  member-function - container for a member; holds access level
+#  member-type - container for a type declaration; holds access level
+#  member-template - container for a (function?) template declaration; holds access level
+#
+# Higher order Things
+#
+# class-template-decl - defines a type (function), but without instantiation this isn't usable
+# function-template-decl - defines a type (function), but without instantiation this isn't usable
+#  template-type-parameter - defines a type (variable), perhaps one which should be excluded from the real type graph
+#  template-non-type-parameter - names a template parameter, links to a type
+#  template-parameter-type-composition - container?
+#
+# Values
+#
+# function-decl - names a function, can link to a symbol
+#     has same children as function-type, rather than linking to a type
+# var-decl - names a variable, can link to a symbol, links to a type
+
+my $symbols_file;
+GetOptions ("symbols=s" => \$symbols_file)
+  && scalar @ARGV == 0
+  or die "usage: $0 [-s symbols_file]\n";
+
+my %wanted_symbols;
+if ($symbols_file) {
+  my $fh = new IO::File $symbols_file, '<';
+  while (<$fh>) {
+    chomp;
+    $wanted_symbols{$_} = undef;
+  }
+  close $fh;
+}
+
+# Graph nodes.
+my %symbols;
+my %exported_symbols;
+my %types;
+# Graph edges.
+my %type_type_edges;
+my %symbol_type_edges;
+my %type_symbol_edges;
+
+# Keep track of the types being defined.
+my @id_stack;
+# Keep track of the current (value) declaration.
+my @symbol_stack;
+# Traverse the whole XML DOM.
+sub process($);
+sub process($) {
+  my ($node) = @_;
+
+  # The XML attributes we care about.
+  my $name;
+  my $id;
+  my $type_id;
+  my $symbol;
+  my $naming_typedef_id;
+
+  # Not every node we encounter is an XML element.
+  if ($node->nodeType == XML_ELEMENT_NODE) {
+    $name = $node->getAttribute('name');
+    $id = $node->getAttribute('id');
+    $type_id = $node->getAttribute('type-id');
+    $symbol = $node->getAttribute('mangled-name');
+    $naming_typedef_id = $node->getAttribute('naming-typedef-id');
+    die if defined $id && defined $symbol;
+  }
+
+  if (defined $name && $node->getName() eq 'elf-symbol') {
+    $exported_symbols{$name} = undef;
+    # Early return is safe, but not necessary.
+    return;
+  }
+
+  if (defined $id) {
+    # This element defines a type (but there may be more than one
+    # defining the same type - we cannot rely on uniqueness).
+    $types{$id} = undef;
+    if (defined $naming_typedef_id) {
+      # This is an odd one, there can be a backwards link from an
+      # anonymous type to the typedef that refers to it, so we need to
+      # pull in the typedef, even if nothing else refers to it.
+      $type_type_edges{$id}{$naming_typedef_id}{naming} = undef;
+    }
+    if (@id_stack) {
+      # Type parent<->child dependencies; record dependencies both
+      # ways to avoid making holes in the XML types.
+      $type_type_edges{$id_stack[-1]}{$id}{contains} = undef;
+      $type_type_edges{$id}{$id_stack[-1]}{in} = undef;
+    }
+    push @id_stack, $id;
+  }
+
+  if (defined $symbol) {
+    # This element is a declaration linked to a symbol (whether or not
+    # exported).
+    $symbols{$symbol} = undef;
+    if (@id_stack) {
+      # The enclosing scope may include a C++ class, in which case we
+      # need to record the dependency.
+      $type_symbol_edges{$id_stack[-1]}{$symbol} = undef;
+    }
+    # The symbol depends on the types mentioned in this element, so
+    # record it.
+    push @symbol_stack, $symbol;
+    die "nested symbols: @symbol_stack\n" if scalar @symbol_stack > 1;
+  }
+
+  if (defined $type_id) {
+    if (@id_stack) {
+      # The enclosing type refers to this type, record the dependency.
+      $type_type_edges{$id_stack[-1]}{$type_id}{depends} = undef;
+    }
+    if (@symbol_stack) {
+      # The enclosing declaration (linked to a symbol) refers to this
+      # type, record the dependency.
+      $symbol_type_edges{$symbol_stack[-1]}{$type_id} = undef;
+    }
+  }
+
+  for my $child ($node->nonBlankChildNodes()) {
+    process($child);
+  }
+
+  if (defined $symbol) {
+    pop @symbol_stack;
+  }
+  if (defined $id) {
+    pop @id_stack;
+  }
+}
+
+# Load the XML.
+my $dom = XML::LibXML->load_xml(IO => \*STDIN);
+# Build a graph.
+process($dom);
+die if @id_stack or @symbol_stack;
+
+%wanted_symbols = %exported_symbols unless defined $symbols_file;
+
+# DFS visited state.
+my %seen_types;
+my %seen_symbols;
+sub dfs_type($);
+sub dfs_symbol($);
+
+sub dfs_type($) {
+  my ($type) = @_;
+  if (not exists $types{$type}) {
+    warn "edge to unknown type $type\n";
+    return;
+  }
+  return if exists $seen_types{$type};
+  $seen_types{$type} = undef;
+
+  my $types = $type_type_edges{$type};
+  if (defined $types) {
+    for my $type (keys %$types) {
+      for my $link (keys %{$types->{$type}}) {
+        if (exists $types->{$type}{$link}) {
+          dfs_type($type);
+        }
+      }
+    }
+  }
+  my $symbols = $type_symbol_edges{$type};
+  if (defined $symbols) {
+    for my $symbol (keys %$symbols) {
+      dfs_symbol($symbol);
+    }
+  }
+}
+
+sub dfs_symbol($) {
+  my ($symbol) = @_;
+  return if exists $seen_symbols{$symbol};
+  $seen_symbols{$symbol} = undef;
+
+  my $types = $symbol_type_edges{$symbol};
+  if (defined $types) {
+    for my $type (keys %$types) {
+      dfs_type($type);
+    }
+  } else {
+    warn "no declaration found for exported symbol $symbol\n";
+  }
+}
+
+# Traverse the graph, starting from the wanted symbols.
+for my $symbol (keys %wanted_symbols) {
+  dfs_symbol($symbol);
+}
+
+# Useful counts.
+sub print_report() {
+  my $count_symbols = scalar keys %symbols;
+  my $count_exported_symbols = scalar keys %exported_symbols;
+  my $count_types = scalar keys %types;
+  my $count_reachable_types = scalar keys %seen_types;
+
+  print STDERR qq{symbols = $count_symbols
+exported symbols = $count_exported_symbols
+types = $count_types
+reachable types = $count_reachable_types
+};
+}
+
+my %drop_if_empty = map { $_ => undef } qw(
+  elf-variable-symbols
+  elf-function-symbols
+  namespace-decl
+  abi-instr
+  abi-corpus
+  abi-corpus-group
+);
+
+# This is another XML DOM traversal.
+sub prune($);
+sub prune($) {
+  my ($node) = @_;
+
+  my $node_name = $node->getName();
+  my $name;
+  my $id;
+  my $symbol;
+
+  if ($node->nodeType == XML_ELEMENT_NODE) {
+    $name = $node->getAttribute('name');
+    $id = $node->getAttribute('id');
+    $symbol = $node->getAttribute('mangled-name');
+    die if defined $id && defined $symbol;
+  }
+
+  # Keep / prune wanted / unwanted symbols
+  if (defined $name && $node_name eq 'elf-symbol') {
+    return !exists $seen_symbols{$name};
+  }
+
+  # Return if we know that this is a type or declaration to keep or
+  # drop in its entirety.
+  if (defined $id) {
+    return !exists $seen_types{$id};
+  }
+  if ($node_name eq 'var-decl' || $node_name eq 'function-decl') {
+    return !defined $symbol || !exists $seen_symbols{$symbol};
+  }
+
+  # Otherwise, this is not a type, declaration or part thereof, so
+  # process child elements.
+  my $empty = 1;
+  for my $child ($node->nonBlankChildNodes()) {
+    if (prune($child)) {
+      $node->removeChild($child);
+    } else {
+      $empty = 0;
+    }
+  }
+
+  # Empty container elements can be dropped.
+  return $empty && exists $drop_if_empty{$node_name};
+}
+
+# Drop unreachable nodes and any container nodes that end up empty.
+prune($dom);
+my $out = $dom->toString();
+# Remove blank line debris.
+$out =~ s;^ *\n;;mg;
+# Remove the XML declaration as abidiff doesn't like it.
+$out =~ s;^<\?xml .*\?>\n;;m;
+print $out;
-- 
2.31.0.rc2.261.g7f71774620-goog


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH 0/9] Utility to manipulate ABI XML
  2021-03-16 16:55   ` [RFC PATCH v3] " Giuliano Procida
@ 2021-03-25 21:51     ` Giuliano Procida
  2021-03-25 21:51       ` [RFC PATCH 1/9] Add ABI tidying utility Giuliano Procida
                         ` (8 more replies)
  0 siblings, 9 replies; 17+ messages in thread
From: Giuliano Procida @ 2021-03-25 21:51 UTC (permalink / raw)
  To: libabigail; +Cc: dodji, kernel-team, gprocida, maennich, woodard

Hi Dodji.

Following up from the script which would prune unreachable parts of
the ABI, I refactored this into more useful pieces and added a few
different processing passes.

Each pass performs a specific function.

* don't give anonymous types (differing) names
* duplicate (type id) detection and resolution
  * one kind of duplicate (member-type fragments) looks easy to resolve
  * there are at least two other kinds, probably IR / canonicalisation bugs
* duplicate (type name) resolution
  * some opportunities for applying ODR for kernel code are being missed
* post-processing XML for fun and profit
  * reachability analysis and ABI pruning
  * symbol list filtering as an XML transformation
  * stabilisation of type and declaration order within a corpus

The script serves as a vehicle for analysis and bug reporting - I've
already opened a few bugs and have a couple more pending - as well a
means of prototyping XML transformation via DOM manipulation. It's a
work in progress.

I've run this over every .xml and .abi in tests, followed by abidiff,
and the results are good with these caveats:

* the script finds conflicting definitions of type ids
* renaming anonymous types causes harmless diffs
* the XML reader gives anonymous types linkage names which triggered
  an assertion in XML late canonicalisation

I hope this is interesting if not immediately useful.

Regards,
Giuliano.

Giuliano Procida (9):
  Add ABI tidying utility
  Add pass to drop empty XML elements
  Add pass to prune unreachable parts of the ABI
  Add pass to filter symbols
  Add pass to normalise anonymous type names
  Add pass to report duplicate type ids
  Add pass to eliminate duplicate member-type fragments
  Add pass to stabilise types and declarations
  Add pass to resolve stray forward type declarations

 scripts/abitidy.pl | 707 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 707 insertions(+)
 create mode 100755 scripts/abitidy.pl

-- 
2.31.0.291.g576ba9dcdaf-goog


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH 1/9] Add ABI tidying utility
  2021-03-25 21:51     ` [RFC PATCH 0/9] Utility to manipulate ABI XML Giuliano Procida
@ 2021-03-25 21:51       ` Giuliano Procida
  2021-03-25 21:51       ` [RFC PATCH 2/9] Add pass to drop empty XML elements Giuliano Procida
                         ` (7 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Giuliano Procida @ 2021-03-25 21:51 UTC (permalink / raw)
  To: libabigail; +Cc: dodji, kernel-team, gprocida, maennich, woodard

This initial version:

- reads XML into a DOM
- strips all text (whitespace) from elements
- reindents, assuming empty elements are emitted as a single tag
- writes out XML, excluding the XML declaration

Removing text elements makes other manipulation of the XML DON easier.

This should be a semantics-preserving transformation, but is not. See
https://sourceware.org/bugzilla/show_bug.cgi?id=27616.

	* scripts/abitidy.pl (stript_text): New function to remove
	text nodes from a DOM. (indent): New function to add whitespace
	nodes to reindent a DOM. (...): The rest of script consists of
	top-level comments, option handling and DOM read / write.

Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 scripts/abitidy.pl | 146 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 146 insertions(+)
 create mode 100755 scripts/abitidy.pl

diff --git a/scripts/abitidy.pl b/scripts/abitidy.pl
new file mode 100755
index 00000000..66d636d7
--- /dev/null
+++ b/scripts/abitidy.pl
@@ -0,0 +1,146 @@
+#!/usr/bin/perl
+
+# This script is intended to consume libabigail ABI XML as generated
+# by abidw and produce a possibly smaller representation that captures
+# the same ABI. In particular, the output should be such that abidiff
+# --harmless reports no differences (or is empty).
+
+use v5.32.0;
+use strict;
+use warnings;
+use experimental 'signatures';
+
+use autodie;
+
+use Data::Dumper;
+use Getopt::Long;
+use IO::File;
+use XML::LibXML;
+
+# Overview of ABI XML elements and their roles
+#
+# ELF
+#
+# elf-needed - container
+#  dependency - names a library
+# elf-variable-symbols - contains a list of symbols
+# elf-function-symbols - contains a list of symbols
+#  elf-symbol - describes an ELF variable or function
+#
+# Grouping and scoping
+#
+# abi-corpus-group
+#  abi-corpus
+#   abi-instr - compilation unit containers
+#    namespace-decl - pure container, possibly named
+#
+# Types (some introduce scopes, only in C++)
+#
+# type-decl - defines a primitive type
+# typedef-decl - defines a type, links to a type
+# qualified-type-def - defines a type, links to a type
+# pointer-type-def - defines a type, links to a type
+# reference-type-def - defines a type, links to a type
+# array-type-def - defines a (multidimensional array) type, refers to element type, contains subranges
+#  subrange - contains array length, refers to element type; defines types (never referred to; duplicated)
+# function-type - defines a type
+#  parameter - belongs to function-type and -decl, links to a type
+#  return - belongs to function-type and -decl, links to a type
+# enum-decl - defines a type, names it, contains a list of enumerators and an underlying-type
+#  underlying-type - belongs to enum-decl
+#  enumerator - belongs to enum-decl
+# union-decl - defines and names a type, contains member elements linked to other things
+# class-decl - defines and names a type, contains base type, member elements linking to other things
+#  base-class - belongs to class-decl
+#  data-member - container for a member; holds access level
+#  member-function - container for a member; holds access level
+#  member-type - container for a type declaration; holds access level
+#  member-template - container for a (function?) template declaration; holds access level
+#
+# Higher order Things
+#
+# class-template-decl - defines a type (function), but without instantiation this isn't usable
+# function-template-decl - defines a type (function), but without instantiation this isn't usable
+#  template-type-parameter - defines a type (variable), perhaps one which should be excluded from the real type graph
+#  template-non-type-parameter - names a template parameter, links to a type
+#  template-parameter-type-composition - container?
+#
+# Values
+#
+# var-decl - names a variable, can link to a symbol, links to a type
+# function-decl - names a function, can link to a symbol
+#     has same children as function-type, rather than linking to a type
+
+# Remove all text nodes.
+sub strip_text($dom) {
+  for my $node ($dom->findnodes('//text()')) {
+    $node->unbindNode();
+  }
+}
+
+# Make XML nicely indented. We could make the code a bit less inside
+# out by passing the parent node as an extra argument. Efforts in this
+# direction ran into trouble.
+sub indent;
+sub indent($indent, $node) {
+  if ($node->nodeType == XML_ELEMENT_NODE) {
+    my @children = $node->childNodes();
+    return unless @children;
+    my $more_indent = $indent + 2;
+    # The ordering of operations here is incidental. The outcomes we
+    # want are 1. an extra newline after the opening tag and
+    # reindenting the closing tag to match, and 2. indentation for the
+    # children.
+    $node->insertBefore(new XML::LibXML::Text("\n"), $children[0]);
+    for my $child (@children) {
+      $node->insertBefore(new XML::LibXML::Text(' ' x $more_indent), $child);
+      indent($more_indent, $child);
+      $node->insertAfter(new XML::LibXML::Text("\n"), $child);
+    }
+    $node->appendText(' ' x $indent);
+  } else {
+    for my $child ($node->childNodes()) {
+      indent($indent, $child);
+    }
+  }
+}
+
+# Parse arguments.
+my $input_opt;
+my $output_opt;
+my $all_opt;
+GetOptions('i|input=s' => \$input_opt,
+           'o|output=s' => \$output_opt,
+           'a|all' => sub {
+             1
+           },
+  ) and !@ARGV or die("usage: $0",
+                      map { (' ', $_) } (
+                        '[-i|--input file]',
+                        '[-o|--output file]',
+                        '[-a|--all]',
+                      ), "\n");
+
+exit 0 unless defined $input_opt;
+
+# Load the XML.
+my $input = $input_opt eq '-' ? \*STDIN : new IO::File $input_opt, '<';
+my $dom = XML::LibXML->load_xml(IO => $input);
+close $input;
+
+# This simplifies DOM analysis and manipulation.
+strip_text($dom);
+
+exit 0 unless defined $output_opt;
+
+# Reformat for human consumption.
+indent(0, $dom);
+
+# Emit the XML, removing the XML declaration.
+my $output = $output_opt eq '-' ? \*STDOUT : new IO::File $output_opt, '>';
+my $out = $dom->toString();
+$out =~ s;^<\?xml .*\?>\n;;m;
+print $output $out;
+close $output;
+
+exit 0;
-- 
2.31.0.291.g576ba9dcdaf-goog


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH 2/9] Add pass to drop empty XML elements
  2021-03-25 21:51     ` [RFC PATCH 0/9] Utility to manipulate ABI XML Giuliano Procida
  2021-03-25 21:51       ` [RFC PATCH 1/9] Add ABI tidying utility Giuliano Procida
@ 2021-03-25 21:51       ` Giuliano Procida
  2021-03-25 21:51       ` [RFC PATCH 3/9] Add pass to prune unreachable parts of the ABI Giuliano Procida
                         ` (6 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Giuliano Procida @ 2021-03-25 21:51 UTC (permalink / raw)
  To: libabigail; +Cc: dodji, kernel-team, gprocida, maennich, woodard

Certain elements in ABI XML are effectvely containers and can be
dropped if empty and their attributes don't carry ABI information.

- elf-variable-symbols: pure container
- elf-function-symbols: pure container
- namespace-decl: has a name
- abi-instr: compilation unit (path etc.)
- abi-corpus: binary object (architecture)
- abi-corpus-group: binary objects (architecture)

It could be argued that abi-corpus (or abi-corpus-group) should be
kept around to hold the architecture of an object or set of objects.
However, if a binary object has no symbols (say, if it is empty), it
hardly matters what the architecture is.

Note that:

- abidiff rejects XML files with an XML declaration at the top
- abidiff rejects completely empty files

Resolving the first would make the second moot. In the meantime, we
avoid dropping top-level elements.

	* scripts/abitidy.pl (drop_if_empty): New variable containing
	the tags of elements that can be dropped if empty.
	(drop_empty): New Function that removes empty elements, except
	top-level ones.

Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 scripts/abitidy.pl | 43 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/scripts/abitidy.pl b/scripts/abitidy.pl
index 66d636d7..1f74e267 100755
--- a/scripts/abitidy.pl
+++ b/scripts/abitidy.pl
@@ -105,20 +105,58 @@ sub indent($indent, $node) {
   }
 }
 
+# Remove an XML element and any preceeding comment.
+sub remove_node($node) {
+  my $prev = $node->previousSibling();
+  if ($prev && $prev->nodeType == XML_COMMENT_NODE) {
+    $prev->unbindNode();
+  }
+  $node->unbindNode();
+}
+
+# These container elements can be dropped if empty.
+my %drop_if_empty = map { $_ => undef } qw(
+  elf-variable-symbols
+  elf-function-symbols
+  namespace-decl
+  abi-instr
+  abi-corpus
+  abi-corpus-group
+);
+
+# This is a XML DOM traversal as we want post-order traversal so we
+# delete nodes that become empty during the process.
+sub drop_empty;
+sub drop_empty($node) {
+  my $node_name = $node->getName();
+  for my $child ($node->childNodes()) {
+    drop_empty($child);
+  }
+  if (!$node->hasChildNodes() && $node->nodeType == XML_ELEMENT_NODE && exists $drop_if_empty{$node->getName()}) {
+    # Until abidiff accepts empty ABIs, avoid dropping top-level elements.
+    if ($node->parentNode->nodeType == XML_ELEMENT_NODE) {
+      remove_node($node);
+    }
+  }
+}
+
 # Parse arguments.
 my $input_opt;
 my $output_opt;
 my $all_opt;
+my $drop_opt;
 GetOptions('i|input=s' => \$input_opt,
            'o|output=s' => \$output_opt,
            'a|all' => sub {
-             1
+             $drop_opt = 1
            },
+           'd|drop-empty!' => \$drop_opt,
   ) and !@ARGV or die("usage: $0",
                       map { (' ', $_) } (
                         '[-i|--input file]',
                         '[-o|--output file]',
                         '[-a|--all]',
+                        '[-d|--[no-]drop-empty]',
                       ), "\n");
 
 exit 0 unless defined $input_opt;
@@ -131,6 +169,9 @@ close $input;
 # This simplifies DOM analysis and manipulation.
 strip_text($dom);
 
+# Drop empty elements.
+drop_empty($dom) if $drop_opt;
+
 exit 0 unless defined $output_opt;
 
 # Reformat for human consumption.
-- 
2.31.0.291.g576ba9dcdaf-goog


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH 3/9] Add pass to prune unreachable parts of the ABI
  2021-03-25 21:51     ` [RFC PATCH 0/9] Utility to manipulate ABI XML Giuliano Procida
  2021-03-25 21:51       ` [RFC PATCH 1/9] Add ABI tidying utility Giuliano Procida
  2021-03-25 21:51       ` [RFC PATCH 2/9] Add pass to drop empty XML elements Giuliano Procida
@ 2021-03-25 21:51       ` Giuliano Procida
  2021-03-25 21:51       ` [RFC PATCH 4/9] Add pass to filter symbols Giuliano Procida
                         ` (5 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Giuliano Procida @ 2021-03-25 21:51 UTC (permalink / raw)
  To: libabigail; +Cc: dodji, kernel-team, gprocida, maennich, woodard

In response to internal reports about very large ABIs for libraires
implemented in C++ but exposing only a C API, I wrote a script as a
means of investigating the issue. This has been adapted here.

The aim is to remove parts of the XML that are irrelevant to
abidiff. ELF symbols (and shared object dependencies) are taken to be
the graph roots for the purpose of reachability analysis.

On most XML files, this results in a modest saving (5-10%). However,
for the library that sparked the invesigaton, the resulting XML was
more than 2300 times smaller.

If this functionality is implemented in libagail tself, this pass will
become a no-op.

	* scripts/abitidy.pl (prune_unreachable): New function to
	determine the reachable declaration and type nodes in an XNL
	ABI and remove all unreachable ones.

Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 scripts/abitidy.pl | 215 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 214 insertions(+), 1 deletion(-)

diff --git a/scripts/abitidy.pl b/scripts/abitidy.pl
index 1f74e267..c9f93ed8 100755
--- a/scripts/abitidy.pl
+++ b/scripts/abitidy.pl
@@ -140,23 +140,233 @@ sub drop_empty($node) {
   }
 }
 
+# Remove unreachable declarations and types.
+#
+# When making a graph from ABI XML, the following are the types of
+# "node" we care about. The "edges" are obtained from a few XML
+# attributes as well as via XML element containment.
+#
+#  ELF (exported) symbols
+#
+#   elf-symbol (has a name; the code here currently knows nothing
+#     about aliases)
+#
+#  Declarations (that mention a symbol)
+#
+#   These live in a scope. In C++ scopes can be nested and include
+#   namespaces and class types.
+#
+#  var-decl (also used for member variables)
+#    elf-symbol linked to symbol via mangled-name
+#    type-id links to a type
+#  function-decl (also used for member functions)
+#    elf-symbol linked to symbol via mangled-name
+#    parameter and return type-ids link to types
+#      (alas, not just a link to a function type)
+#
+#  Types
+#
+#   These occupy pretty much all the other elements, besides those
+#   that act as simple containers.
+sub prune_unreachable($dom) {
+  my %elf_symbols;
+  # Graph vertices (only needed for statistics).
+  my %vertices;
+  # Graph edges.
+  my %edges;
+
+  # Keep track of type / symbol nesting.
+  my @stack;
+
+  # Traverse the whole XML DOM.
+  my sub make_graph($node) {
+    # The XML attributes we care about.
+    my $name;
+    my $id;
+    my $type_id;
+    my $symbol;
+    my $naming_typedef_id;
+
+    # Not every node we encounter is an XML element.
+    if ($node->nodeType == XML_ELEMENT_NODE) {
+      $name = $node->getAttribute('name');
+      $id = $node->getAttribute('id');
+      $type_id = $node->getAttribute('type-id');
+      $symbol = $node->getAttribute('mangled-name');
+      $naming_typedef_id = $node->getAttribute('naming-typedef-id');
+      die if defined $id && defined $symbol;
+    }
+
+    if (defined $name && $node->getName() eq 'elf-symbol') {
+      $elf_symbols{$name} = undef;
+      # Early return is safe, but not necessary.
+      return;
+    }
+
+    if (defined $id) {
+      my $vertex = "type:$id";
+      # This element defines a type (but there may be more than one
+      # defining the same type - we cannot rely on uniqueness).
+      $vertices{$vertex} = undef;
+      if (defined $naming_typedef_id) {
+        # This is an odd one, there can be a backwards link from an
+        # anonymous type to the typedef that refers to it, so we need to
+        # pull in the typedef, even if nothing else refers to it.
+        $edges{$vertex}{"type:$naming_typedef_id"} = undef;
+      }
+      if (@stack) {
+        # Parent<->child dependencies; record dependencies both
+        # ways to avoid holes in XML types and declarations.
+        $edges{$stack[-1]}{$vertex} = undef;
+        $edges{$vertex}{$stack[-1]} = undef;
+      }
+      push @stack, $vertex;
+    }
+
+    if (defined $symbol) {
+      my $vertex = "symbol:$symbol";
+      # This element is a declaration linked to a symbol (whether or not
+      # exported).
+      $vertices{$vertex} = undef;
+      if (@stack) {
+        # Parent<->child dependencies; record dependencies both ways
+        # to avoid holes in XML types and declarations.
+        #
+        # Symbols exist outside of the type hierarchy, so choosing to
+        # make them depend on a containing type scope and vice versa
+        # is conservative and probably not necessary.
+        $edges{$stack[-1]}{$vertex} = undef;
+        $edges{$vertex}{$stack[-1]} = undef;
+      }
+      # The symbol depends on the types mentioned in this element, so
+      # record it.
+      push @stack, $vertex;
+      # In practice there will be at most one symbol on the stack; we
+      # could verify this here, but it wouldn't achieve anything.
+    }
+
+    if (defined $type_id) {
+      if (@stack) {
+        # The enclosing type or symbol refers to another type.
+        $edges{$stack[-1]}{"type:$type_id"} = undef;
+      }
+    }
+
+    for my $child ($node->childNodes()) {
+      __SUB__->($child);
+    }
+
+    if (defined $symbol) {
+      pop @stack;
+    }
+    if (defined $id) {
+      pop @stack;
+    }
+  }
+
+  # Build a graph.
+  make_graph($dom);
+  die if @stack;
+  #warn Dumper(\%elf_symbols, \%vertices, \%edges);
+
+  # DFS visited state. Would be nicer with a flat namespace of nodes.
+  my %seen;
+  my sub dfs($vertex) {
+    no warnings 'recursion';
+    return if exists $seen{$vertex};
+    $seen{$vertex} = undef;
+
+    my $tos = $edges{$vertex};
+    if (defined $tos) {
+      for my $to (keys %$tos) {
+        __SUB__->($to);
+      }
+    }
+  }
+
+  # Traverse the graph, starting from the exported symbols.
+  for my $symbol (keys %elf_symbols) {
+    my $vertex = "symbol:$symbol";
+    if (exists $vertices{$vertex}) {
+      dfs($vertex);
+    } else {
+      warn "no declaration found for ELF symbol $symbol\n";
+    }
+  }
+
+  #warn Dumper(\%seen);
+
+  # Useful counts.
+  my sub print_report() {
+    my $count_elf_symbols = scalar keys %elf_symbols;
+    my $count_vertices = scalar keys %vertices;
+    my $count_seen = scalar keys %seen;
+
+    warn qq{ELF = $count_elf_symbols
+vertices = $count_vertices
+seen = $count_seen
+};
+  }
+
+  #print_report();
+
+  # XPath selection is too slow as we end up enumerating lots of
+  # nested items whose preservation is entirely determined by their
+  # containing items. DFS with early stopping for the win.
+  my sub remove_unwanted($node) {
+    my $node_name = $node->getName();
+    my $name;
+    my $id;
+    my $symbol;
+
+    if ($node->nodeType == XML_ELEMENT_NODE) {
+      $name = $node->getAttribute('name');
+      $id = $node->getAttribute('id');
+      $symbol = $node->getAttribute('mangled-name');
+      die if defined $id && defined $symbol;
+    }
+
+    # Return if we know that this is a type or declaration to keep or
+    # drop in its entirety.
+    if (defined $id) {
+      remove_node($node) unless exists $seen{"type:$id"};
+      return;
+    }
+    if ($node_name eq 'var-decl' || $node_name eq 'function-decl') {
+      remove_node($node) unless defined $symbol && exists $seen{"symbol:$symbol"};
+      return;
+    }
+
+    # Otherwise, this is not a type, declaration or part thereof, so
+    # process child elements.
+    for my $child ($node->childNodes()) {
+      __SUB__->($child);
+    }
+  }
+
+  remove_unwanted($dom);
+}
+
 # Parse arguments.
 my $input_opt;
 my $output_opt;
 my $all_opt;
 my $drop_opt;
+my $prune_opt;
 GetOptions('i|input=s' => \$input_opt,
            'o|output=s' => \$output_opt,
            'a|all' => sub {
-             $drop_opt = 1
+             $drop_opt = $prune_opt = 1
            },
            'd|drop-empty!' => \$drop_opt,
+           'p|prune-unreachable!' => \$prune_opt,
   ) and !@ARGV or die("usage: $0",
                       map { (' ', $_) } (
                         '[-i|--input file]',
                         '[-o|--output file]',
                         '[-a|--all]',
                         '[-d|--[no-]drop-empty]',
+                        '[-p|--[no-]prune-unreachable]',
                       ), "\n");
 
 exit 0 unless defined $input_opt;
@@ -169,6 +379,9 @@ close $input;
 # This simplifies DOM analysis and manipulation.
 strip_text($dom);
 
+# Prune unreachable elements.
+prune_unreachable($dom) if $prune_opt;
+
 # Drop empty elements.
 drop_empty($dom) if $drop_opt;
 
-- 
2.31.0.291.g576ba9dcdaf-goog


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH 4/9] Add pass to filter symbols
  2021-03-25 21:51     ` [RFC PATCH 0/9] Utility to manipulate ABI XML Giuliano Procida
                         ` (2 preceding siblings ...)
  2021-03-25 21:51       ` [RFC PATCH 3/9] Add pass to prune unreachable parts of the ABI Giuliano Procida
@ 2021-03-25 21:51       ` Giuliano Procida
  2021-03-25 21:51       ` [RFC PATCH 5/9] Add pass to normalise anonymous type names Giuliano Procida
                         ` (4 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Giuliano Procida @ 2021-03-25 21:51 UTC (permalink / raw)
  To: libabigail; +Cc: dodji, kernel-team, gprocida, maennich, woodard

Currently symbol lists can be applied at the point of
extraction (abidw) or comparison (abidiff). This commmit enables
symbol lists to be applied directly to ABI XML files which will
simplify workflows that use multiple symbol lists.

	* scripts/abitidy.pl (symbols): New variable to hold optional
	symbols to filter by. (filter_symbols): New function to filter
	out unlisted symbols.

Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 scripts/abitidy.pl | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/scripts/abitidy.pl b/scripts/abitidy.pl
index c9f93ed8..468eeac4 100755
--- a/scripts/abitidy.pl
+++ b/scripts/abitidy.pl
@@ -347,14 +347,35 @@ seen = $count_seen
   remove_unwanted($dom);
 }
 
+# Read symbols from a file.
+sub read_symbols($file) {
+  my %symbols;
+  my $fh = new IO::File $file, '<';
+  while (<$fh>) {
+    chomp;
+    $symbols{$_} = undef;
+  }
+  close $fh;
+  return \%symbols;
+}
+
+# Remove unlisted ELF symbols,
+sub filter_symbols($symbols, $dom) {
+  for my $node ($dom->findnodes('elf-symbol')) {
+    remove_node($node) unless exists $symbols->{$node->getAttribute('name')};
+  }
+}
+
 # Parse arguments.
 my $input_opt;
 my $output_opt;
+my $symbols_opt;
 my $all_opt;
 my $drop_opt;
 my $prune_opt;
 GetOptions('i|input=s' => \$input_opt,
            'o|output=s' => \$output_opt,
+           's|symbols=s' => \$symbols_opt,
            'a|all' => sub {
              $drop_opt = $prune_opt = 1
            },
@@ -364,6 +385,7 @@ GetOptions('i|input=s' => \$input_opt,
                       map { (' ', $_) } (
                         '[-i|--input file]',
                         '[-o|--output file]',
+                        '[-s|--symbols file]',
                         '[-a|--all]',
                         '[-d|--[no-]drop-empty]',
                         '[-p|--[no-]prune-unreachable]',
@@ -379,6 +401,9 @@ close $input;
 # This simplifies DOM analysis and manipulation.
 strip_text($dom);
 
+# Remove unlisted symbols.
+filter_symbols(read_symbols($symbols_opt), $dom) if defined $symbols_opt;
+
 # Prune unreachable elements.
 prune_unreachable($dom) if $prune_opt;
 
-- 
2.31.0.291.g576ba9dcdaf-goog


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH 5/9] Add pass to normalise anonymous type names
  2021-03-25 21:51     ` [RFC PATCH 0/9] Utility to manipulate ABI XML Giuliano Procida
                         ` (3 preceding siblings ...)
  2021-03-25 21:51       ` [RFC PATCH 4/9] Add pass to filter symbols Giuliano Procida
@ 2021-03-25 21:51       ` Giuliano Procida
  2021-03-25 21:51       ` [RFC PATCH 6/9] Add pass to report duplicate type ids Giuliano Procida
                         ` (3 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Giuliano Procida @ 2021-03-25 21:51 UTC (permalink / raw)
  To: libabigail; +Cc: dodji, kernel-team, gprocida, maennich, woodard

Currently libabigail exposes the internal names used for anonymous
types in ABI XML. The names are not stable - they are subject to
renumbering - and cause "harmless" diffs which in turn can contribute
to hard-to-read and verbose abidiff --harmless output.

	* scripts/abitidy.pl (normalise_anonymous_type_names): New
	function to normalise anonymous type names by stripping off
	numerical suffices.

Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 scripts/abitidy.pl | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/scripts/abitidy.pl b/scripts/abitidy.pl
index 468eeac4..321363d7 100755
--- a/scripts/abitidy.pl
+++ b/scripts/abitidy.pl
@@ -366,6 +366,16 @@ sub filter_symbols($symbols, $dom) {
   }
 }
 
+sub normalise_anonymous_type_names($) {
+  my ($doc) = @_;
+  my $name_path = new XML::LibXML::XPathExpression('//abi-instr//*[@name]');
+  for my $node ($doc->findnodes($name_path)) {
+    my $value = $node->getAttribute('name');
+    $value =~ s;^(__anonymous_[a-z]+__)\d*$;$1;;
+    $node->setAttribute('name', $value);
+  }
+}
+
 # Parse arguments.
 my $input_opt;
 my $output_opt;
@@ -373,14 +383,16 @@ my $symbols_opt;
 my $all_opt;
 my $drop_opt;
 my $prune_opt;
+my $normalise_opt;
 GetOptions('i|input=s' => \$input_opt,
            'o|output=s' => \$output_opt,
            's|symbols=s' => \$symbols_opt,
            'a|all' => sub {
-             $drop_opt = $prune_opt = 1
+             $drop_opt = $prune_opt = $normalise_opt = 1
            },
            'd|drop-empty!' => \$drop_opt,
            'p|prune-unreachable!' => \$prune_opt,
+           'n|normalise-anonymous!' => \$normalise_opt,
   ) and !@ARGV or die("usage: $0",
                       map { (' ', $_) } (
                         '[-i|--input file]',
@@ -389,6 +401,7 @@ GetOptions('i|input=s' => \$input_opt,
                         '[-a|--all]',
                         '[-d|--[no-]drop-empty]',
                         '[-p|--[no-]prune-unreachable]',
+                        '[-n|--[no-]normalise-anonymous]',
                       ), "\n");
 
 exit 0 unless defined $input_opt;
@@ -404,6 +417,9 @@ strip_text($dom);
 # Remove unlisted symbols.
 filter_symbols(read_symbols($symbols_opt), $dom) if defined $symbols_opt;
 
+# Normalise anonymous type names.
+normalise_anonymous_type_names($dom) if $normalise_opt;
+
 # Prune unreachable elements.
 prune_unreachable($dom) if $prune_opt;
 
-- 
2.31.0.291.g576ba9dcdaf-goog


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH 6/9] Add pass to report duplicate type ids
  2021-03-25 21:51     ` [RFC PATCH 0/9] Utility to manipulate ABI XML Giuliano Procida
                         ` (4 preceding siblings ...)
  2021-03-25 21:51       ` [RFC PATCH 5/9] Add pass to normalise anonymous type names Giuliano Procida
@ 2021-03-25 21:51       ` Giuliano Procida
  2021-03-25 21:51       ` [RFC PATCH 7/9] Add pass to eliminate duplicate member-type fragments Giuliano Procida
                         ` (2 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Giuliano Procida @ 2021-03-25 21:51 UTC (permalink / raw)
  To: libabigail; +Cc: dodji, kernel-team, gprocida, maennich, woodard

Duplicate type ids sometimes appear ABI XML files. If these relate to
subrange elements, they are innocuous, otherwise they represent some
duplication or even inconsistency in libabigail output. See
https://sourceware.org/bugzilla/show_bug.cgi?id=26591.

	* scripts/abitidy.pl (report_duplicate_types): New function to
	report duplicate types.

Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 scripts/abitidy.pl | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/scripts/abitidy.pl b/scripts/abitidy.pl
index 321363d7..d5ddd7ea 100755
--- a/scripts/abitidy.pl
+++ b/scripts/abitidy.pl
@@ -376,6 +376,22 @@ sub normalise_anonymous_type_names($) {
   }
 }
 
+sub report_duplicate_types($dom) {
+  my %hash;
+  for my $type ($dom->findnodes('*[@id]')) {
+    # subranges are not really types and shouldn't be considered
+    next if $type->getName() eq 'subrange';
+    my $id = $type->getAttribute('id');
+    for my $ids ($hash{$id}) {
+      $ids //= [];
+      push @$ids, $type;
+    }
+  }
+  for my $id (keys %hash) {
+    warn "residual duplicated types with id $id\n";
+  }
+}
+
 # Parse arguments.
 my $input_opt;
 my $output_opt;
@@ -384,15 +400,17 @@ my $all_opt;
 my $drop_opt;
 my $prune_opt;
 my $normalise_opt;
+my $report_opt;
 GetOptions('i|input=s' => \$input_opt,
            'o|output=s' => \$output_opt,
            's|symbols=s' => \$symbols_opt,
            'a|all' => sub {
-             $drop_opt = $prune_opt = $normalise_opt = 1
+             $drop_opt = $prune_opt = $normalise_opt = $report_opt = 1
            },
            'd|drop-empty!' => \$drop_opt,
            'p|prune-unreachable!' => \$prune_opt,
            'n|normalise-anonymous!' => \$normalise_opt,
+           'r|report-duplicates!' => \$report_opt,
   ) and !@ARGV or die("usage: $0",
                       map { (' ', $_) } (
                         '[-i|--input file]',
@@ -402,6 +420,7 @@ GetOptions('i|input=s' => \$input_opt,
                         '[-d|--[no-]drop-empty]',
                         '[-p|--[no-]prune-unreachable]',
                         '[-n|--[no-]normalise-anonymous]',
+                        '[-r|--[no-]report-duplicates]',
                       ), "\n");
 
 exit 0 unless defined $input_opt;
@@ -420,6 +439,9 @@ filter_symbols(read_symbols($symbols_opt), $dom) if defined $symbols_opt;
 # Normalise anonymous type names.
 normalise_anonymous_type_names($dom) if $normalise_opt;
 
+# Check for duplicate types.
+report_duplicate_types($dom) if $report_opt;
+
 # Prune unreachable elements.
 prune_unreachable($dom) if $prune_opt;
 
-- 
2.31.0.291.g576ba9dcdaf-goog


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH 7/9] Add pass to eliminate duplicate member-type fragments
  2021-03-25 21:51     ` [RFC PATCH 0/9] Utility to manipulate ABI XML Giuliano Procida
                         ` (5 preceding siblings ...)
  2021-03-25 21:51       ` [RFC PATCH 6/9] Add pass to report duplicate type ids Giuliano Procida
@ 2021-03-25 21:51       ` Giuliano Procida
  2021-03-25 21:51       ` [RFC PATCH 8/9] Add pass to stabilise types and declarations Giuliano Procida
  2021-03-25 21:51       ` [RFC PATCH 9/9] Add pass to resolve stray forward type declarations Giuliano Procida
  8 siblings, 0 replies; 17+ messages in thread
From: Giuliano Procida @ 2021-03-25 21:51 UTC (permalink / raw)
  To: libabigail; +Cc: dodji, kernel-team, gprocida, maennich, woodard

One common kind of duplicate type in ABI XML occurs when the XNL
writer emits a type declaration with its surrounding scope but
that scope actually includes one or more user-defined types.

An additional complication is that a nested member type's access can
be misapplied to its containing type.

The types themselves are later emitted in full and with correct member
access specifiers.

	* scripts/abitidy.pl (sub_tree): New function to determine if
	one XML node is a sub-tree of another (in the sense that the
	former can be overlaid without changing the latter), ignoring
	possibly-incorrect access attributes.
	(eliminate_duplicate_types): New function that checks nodes
	with the same type id, determines if there is a maximal node
	and, if so, remmoves the other nodes.

Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 scripts/abitidy.pl | 112 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 111 insertions(+), 1 deletion(-)

diff --git a/scripts/abitidy.pl b/scripts/abitidy.pl
index d5ddd7ea..35b3c054 100755
--- a/scripts/abitidy.pl
+++ b/scripts/abitidy.pl
@@ -12,9 +12,11 @@ use experimental 'signatures';
 
 use autodie;
 
+use Algorithm::Diff qw(diff);
 use Data::Dumper;
 use Getopt::Long;
 use IO::File;
+use List::Util qw(any all uniq);
 use XML::LibXML;
 
 # Overview of ABI XML elements and their roles
@@ -376,6 +378,108 @@ sub normalise_anonymous_type_names($) {
   }
 }
 
+sub sub_tree;
+sub sub_tree($l, $r) {
+  # Node name must be an exact match.
+  return 0 if $l->getName() ne $r->getName();
+  # Left attributes may be missing on the left, but otherwise must match.
+  # With one exception: libabigail emits the access specifier for the
+  # type it's trying to "emit in scope" rather than what may be a
+  # containing type; so allow access to differ on member-type nodes.
+  for my $k (keys %$l) {
+    my $lv = $l->{$k};
+    my $rv = $r->{$k};
+    return 0 unless defined $rv;
+    next if $k eq 'access' && $l->getName() eq 'member-type';
+    return 0 unless $lv eq $rv;
+  }
+  # Left elements must be a subsequence of right elements.
+  my @lc = grep { $_->nodeType != XML_COMMENT_NODE } $l->childNodes();
+  my @rc = grep { $_->nodeType != XML_COMMENT_NODE } $r->childNodes();
+  unless (scalar @lc <= scalar @rc) {
+    return 0;
+  }
+  # Only handle three forms of subsequence. Otherwise need some kind
+  # of backtracking check.
+  unless (@lc) {
+    # empty
+    return 1;
+  }
+  if (scalar @lc == 1) {
+    # singleton
+    return any { $_ } map { sub_tree($lc[0], $_) } @rc;
+  }
+  if (scalar @lc == scalar @rc) {
+    # same length
+    return all { $_ } map { sub_tree($lc[$_], $rc[$_]) } (0..$#lc);
+  }
+  warn "XML elements have interestingly different subelements\n",
+    $l->toString(), "\n", $r->toString(), "\n";
+  return 0;
+}
+
+sub get_scope($node) {
+  my @names;
+  while (1) {
+    $node = $node->parentNode;
+    last unless defined $node && $node->nodeType == XML_ELEMENT_NODE;
+    my $kind = $node->getName();
+    last if $kind eq 'abi-instr';
+    # Some C++ enums introduce scope but we don't need to worry.
+    next unless $kind eq 'class-decl' || $kind eq 'union-decl' || $kind eq 'namespace-decl';
+    # Anonymous type are not permitted to contain type members (but
+    # this still works with g++ -fpermissive).
+    my $name = $node->getAttribute('name');
+    unshift @names, $name;
+  }
+  return @names;
+}
+
+sub eliminate_duplicate_types($dom) {
+  my %hash;
+  # Collect all (unnested) types and their namespace scopes.
+  for my $type ($dom->findnodes('//abi-corpus/abi-instr/*[@id] | //abi-corpus/abi-instr//namespace-decl/*[@id]')) {
+    my $id = $type->getAttribute('id');
+    my $scope = join(':', get_scope($type));
+    for my $list ($hash{$id}{$scope}) {
+      $list //= [];
+      push @$list, $type;
+    }
+  }
+  for my $id (keys %hash) {
+    my $scopes = $hash{$id};
+    if (scalar keys %$scopes > 1) {
+      warn "inconsistent scopes found for duplicate types with id $id\n";
+      next;
+    }
+    my ($ns) = keys %$scopes;
+    my $types = $scopes->{$ns};
+    next if scalar @$types == 1;
+    # Find a potential maximal candidate.
+    my $candidate = 0;
+    for my $ix (1..$#$types) {
+      $candidate = $ix if sub_tree($types->[$candidate], $types->[$ix]);
+    }
+    # Verify it is indeed maximal.
+    my @losers = grep { $_ != $candidate } (0..$#$types);
+    for my $ix (@losers) {
+      unless (sub_tree($types->[$ix], $types->[$candidate])) {
+        warn "conflicting duplicate types with id $id\n";
+        my @strs = map { $types->[$_]->toString() } ($ix, $candidate);
+        map { $_ =~ s;><;>\n<;g } @strs;
+        my @lines = map { [split("\n", $_)] } @strs;
+        warn Dumper(diff(@lines));
+        $candidate = undef;
+        last;
+      }
+    }
+    if (defined $candidate) {
+      map { remove_node($types->[$_]) } @losers;
+      warn "successfully eliminated duplicate types with id $id\n";
+    }
+  }
+}
+
 sub report_duplicate_types($dom) {
   my %hash;
   for my $type ($dom->findnodes('*[@id]')) {
@@ -400,16 +504,18 @@ my $all_opt;
 my $drop_opt;
 my $prune_opt;
 my $normalise_opt;
+my $eliminate_opt;
 my $report_opt;
 GetOptions('i|input=s' => \$input_opt,
            'o|output=s' => \$output_opt,
            's|symbols=s' => \$symbols_opt,
            'a|all' => sub {
-             $drop_opt = $prune_opt = $normalise_opt = $report_opt = 1
+             $drop_opt = $prune_opt = $normalise_opt = $eliminate_opt = $report_opt = 1
            },
            'd|drop-empty!' => \$drop_opt,
            'p|prune-unreachable!' => \$prune_opt,
            'n|normalise-anonymous!' => \$normalise_opt,
+           'e|eliminate-duplicates!' => \$eliminate_opt,
            'r|report-duplicates!' => \$report_opt,
   ) and !@ARGV or die("usage: $0",
                       map { (' ', $_) } (
@@ -420,6 +526,7 @@ GetOptions('i|input=s' => \$input_opt,
                         '[-d|--[no-]drop-empty]',
                         '[-p|--[no-]prune-unreachable]',
                         '[-n|--[no-]normalise-anonymous]',
+                        '[-e|--[no-]eliminate-duplicates]',
                         '[-r|--[no-]report-duplicates]',
                       ), "\n");
 
@@ -439,6 +546,9 @@ filter_symbols(read_symbols($symbols_opt), $dom) if defined $symbols_opt;
 # Normalise anonymous type names.
 normalise_anonymous_type_names($dom) if $normalise_opt;
 
+# Eliminate complete duplicates and extra fragments of types.
+eliminate_duplicate_types($dom) if $eliminate_opt;
+
 # Check for duplicate types.
 report_duplicate_types($dom) if $report_opt;
 
-- 
2.31.0.291.g576ba9dcdaf-goog


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH 8/9] Add pass to stabilise types and declarations
  2021-03-25 21:51     ` [RFC PATCH 0/9] Utility to manipulate ABI XML Giuliano Procida
                         ` (6 preceding siblings ...)
  2021-03-25 21:51       ` [RFC PATCH 7/9] Add pass to eliminate duplicate member-type fragments Giuliano Procida
@ 2021-03-25 21:51       ` Giuliano Procida
  2021-03-25 21:51       ` [RFC PATCH 9/9] Add pass to resolve stray forward type declarations Giuliano Procida
  8 siblings, 0 replies; 17+ messages in thread
From: Giuliano Procida @ 2021-03-25 21:51 UTC (permalink / raw)
  To: libabigail; +Cc: dodji, kernel-team, gprocida, maennich, woodard

This commit adds a pass which consolidates all types and declarations
within an abi-corpus into a replacement abi-instr. These elements are
then ordered by id and name respectively, with types before
declarations.

	* scripts/abitidy.pl (stabilise_types_and_declarations): New
	function to order types and declarations as deterministically
	as possible within an abi-corpus.

Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 scripts/abitidy.pl | 84 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 83 insertions(+), 1 deletion(-)

diff --git a/scripts/abitidy.pl b/scripts/abitidy.pl
index 35b3c054..67fe3a69 100755
--- a/scripts/abitidy.pl
+++ b/scripts/abitidy.pl
@@ -496,6 +496,82 @@ sub report_duplicate_types($dom) {
   }
 }
 
+# Stabilise types and declarations.
+sub stabilise_types_and_declarations($dom) {
+  my $corpus_path = new XML::LibXML::XPathExpression('//abi-corpus');
+  my $instr_path = new XML::LibXML::XPathExpression('abi-instr');
+  my $type_or_decl_path = new XML::LibXML::XPathExpression('*[@id]|*[@name]');
+
+  my @corpora = $dom->findnodes($corpus_path);
+  for my $coprus (@corpora) {
+    # Let's squish it. We expect its abi-instr elements to have
+    # consistent attributes, ignoring path.
+    my %attrs;
+    my @children = $coprus->findnodes($instr_path);
+    for my $child (@children) {
+      for my $attr (keys %$child) {
+        my $value = $child->{$attr};
+        $attrs{$attr}{$value} = undef;
+      }
+    }
+    next unless scalar keys %attrs;
+
+    # Create a replacement abi-instr node.
+    my $replacement = new XML::LibXML::Element('abi-instr');
+    # Original attribute ordering is lost, may as well be deterministic here.
+    for my $attr (sort keys %attrs) {
+      # Check attribute consistency.
+      for my $values ($attrs{$attr}) {
+        if (scalar keys %{$values} > 1) {
+          die "unexpected non-constant abi-instr attribute $attr\n"
+            unless $attr eq 'path' || $attr eq 'comp-dir-path' || $attr eq 'language';
+          $values = { 'various' => undef };
+        }
+        for my $value (keys %$values) {
+          $replacement->setAttribute($attr, $value);
+        }
+      }
+    }
+
+    # Gather sorted types and decls.
+    my @types_and_decls = sort {
+      my $a_id = $a->{id};
+      my $a_name = $a->{name};
+      die unless defined $a_id || defined $a_name;
+      my $b_id = $b->{id};
+      my $b_name = $b->{name};
+      die unless defined $b_id || defined $b_name;
+      # types before declarations
+      # order types by id
+      # order declarations by name
+      defined $a_id != defined $b_id ? !defined $a_id <=> !defined $b_id
+        : defined $a_id ? $a_id cmp $b_id
+        : $a_name cmp $b_name
+    } map { $_->findnodes($type_or_decl_path) } @children;
+
+    # Add them to replacement abi-instr
+    map {
+      my $prev = $_->previousSibling();
+      if ($prev && $prev->nodeType == XML_COMMENT_NODE) {
+        $prev->unbindNode();
+        $replacement->appendChild($prev);
+      }
+      $_->unbindNode();
+      $replacement->appendChild($_)
+    } @types_and_decls;
+    # Remove the old abi-instr nodes.
+    for my $child (@children) {
+      if ($child->hasChildNodes()) {
+        warn "failed to evacuate abi-instr: ", $child->toString(), "\n";
+        next;
+      }
+      remove_node($child);
+    }
+    # Add the replacement abi-instr node to the abi-corpus.
+    $coprus->appendChild($replacement);
+  }
+}
+
 # Parse arguments.
 my $input_opt;
 my $output_opt;
@@ -505,17 +581,19 @@ my $drop_opt;
 my $prune_opt;
 my $normalise_opt;
 my $eliminate_opt;
+my $stabilise_opt;
 my $report_opt;
 GetOptions('i|input=s' => \$input_opt,
            'o|output=s' => \$output_opt,
            's|symbols=s' => \$symbols_opt,
            'a|all' => sub {
-             $drop_opt = $prune_opt = $normalise_opt = $eliminate_opt = $report_opt = 1
+             $drop_opt = $prune_opt = $normalise_opt = $eliminate_opt = $stabilise_opt = $report_opt = 1
            },
            'd|drop-empty!' => \$drop_opt,
            'p|prune-unreachable!' => \$prune_opt,
            'n|normalise-anonymous!' => \$normalise_opt,
            'e|eliminate-duplicates!' => \$eliminate_opt,
+           't|stabilise-order!' => \$stabilise_opt,
            'r|report-duplicates!' => \$report_opt,
   ) and !@ARGV or die("usage: $0",
                       map { (' ', $_) } (
@@ -527,6 +605,7 @@ GetOptions('i|input=s' => \$input_opt,
                         '[-p|--[no-]prune-unreachable]',
                         '[-n|--[no-]normalise-anonymous]',
                         '[-e|--[no-]eliminate-duplicates]',
+                        '[-t|--[no-]stabilise-order]',
                         '[-r|--[no-]report-duplicates]',
                       ), "\n");
 
@@ -555,6 +634,9 @@ report_duplicate_types($dom) if $report_opt;
 # Prune unreachable elements.
 prune_unreachable($dom) if $prune_opt;
 
+# Stabilise types and declarations.
+stabilise_types_and_declarations($dom) if $stabilise_opt;
+
 # Drop empty elements.
 drop_empty($dom) if $drop_opt;
 
-- 
2.31.0.291.g576ba9dcdaf-goog


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH 9/9] Add pass to resolve stray forward type declarations
  2021-03-25 21:51     ` [RFC PATCH 0/9] Utility to manipulate ABI XML Giuliano Procida
                         ` (7 preceding siblings ...)
  2021-03-25 21:51       ` [RFC PATCH 8/9] Add pass to stabilise types and declarations Giuliano Procida
@ 2021-03-25 21:51       ` Giuliano Procida
  8 siblings, 0 replies; 17+ messages in thread
From: Giuliano Procida @ 2021-03-25 21:51 UTC (permalink / raw)
  To: libabigail; +Cc: dodji, kernel-team, gprocida, maennich, woodard

This can be used to improve the precision of ABI reporting for C++ and
Linux kernel which both have some kind of One Definition Rule when it
comes to types.

TODO: handle naming-typedef-id references as well

	* scripts/abitidy.pl (substitute_type_ids): New function to
	perform renaming of type-id attributes within XML elements.
	(resolve_forward_declarations): New function that resolves
	forward declarations of types to their definitions, assuming a
	consistent universe of type names.

Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 scripts/abitidy.pl | 62 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 57 insertions(+), 5 deletions(-)

diff --git a/scripts/abitidy.pl b/scripts/abitidy.pl
index 67fe3a69..d45a82bb 100755
--- a/scripts/abitidy.pl
+++ b/scripts/abitidy.pl
@@ -464,11 +464,8 @@ sub eliminate_duplicate_types($dom) {
     my @losers = grep { $_ != $candidate } (0..$#$types);
     for my $ix (@losers) {
       unless (sub_tree($types->[$ix], $types->[$candidate])) {
-        warn "conflicting duplicate types with id $id\n";
         my @strs = map { $types->[$_]->toString() } ($ix, $candidate);
-        map { $_ =~ s;><;>\n<;g } @strs;
-        my @lines = map { [split("\n", $_)] } @strs;
-        warn Dumper(diff(@lines));
+        warn "conflicting duplicate types with id $id:\n", map { "  $_\n" } @strs, "\n";
         $candidate = undef;
         last;
       }
@@ -572,6 +569,55 @@ sub stabilise_types_and_declarations($dom) {
   }
 }
 
+# Substitute a set of type ids with another.
+sub substitute_type_ids($winner, $losers, $dom) {
+  for my $ref ($dom->findnodes('//*[@type-id]')) {
+    my $type_id = $ref->getAttribute('type-id');
+    $ref->setAttribute('type-id', $winner) if exists $losers->{$type_id};
+  }
+}
+
+# Find definitions and declarations for the same thing and replace
+# references to the latter with the former. naming-typedef-id may be
+# an added complication.
+sub resolve_forward_declarations($dom) {
+  for my $corpus ($dom->findnodes('//abi-corpus')) {
+    my %synonyms;
+    # Safe to extend to deeper-nested types? Need get_scopes.
+    for my $type ($corpus->findnodes('abi-instr/*[@id]')) {
+      my $kind = $type->getName();
+      my $name = $type->getAttribute('name');
+      next unless defined $name;
+      next if $name =~ m;^__anonymous_;;
+      my $key = "$kind:$name";
+      $synonyms{$key} //= [];
+      push @{$synonyms{$key}}, $type;
+    }
+
+    for my $key (keys %synonyms) {
+      my $types = $synonyms{$key};
+      next if scalar(@$types) == 1;
+      my @decls = grep { $_->hasAttribute('is-declaration-only') } @$types;
+      my @defns = grep { !$_->hasAttribute('is-declaration-only') } @$types;
+      next unless @decls and @defns;
+      # Have declarations and definitions, check that top-level ids
+      # are the only differences.
+      my ($kind, $name) = split(':', $key);
+      my @decl_strs = uniq map { my $str = $_->toString(); my $id = $_->getAttribute('id'); $str =~ s; id='$id';;g; $str } @decls;
+      my @defn_strs = uniq map { my $str = $_->toString(); my $id = $_->getAttribute('id'); $str =~ s; id='$id';;g; $str } @defns;
+      unless (scalar @decl_strs == 1 && scalar @defn_strs == 1) {
+        warn "cannot resolve duplicate $kind types with name $name\n";
+        next;
+      }
+      my $winner = $defns[0]->getAttribute('id');
+      my @losers = grep { $_ ne $winner } map { $_->getAttribute('id') } @$types;
+      warn "resolved $kind $name: substituting @losers with $winner\n";
+      substitute_type_ids($winner, {map { $_ => undef } @losers}, $dom);
+      map { remove_node($_) } (@defns[1..$#defns], @decls);
+    }
+  }
+}
+
 # Parse arguments.
 my $input_opt;
 my $output_opt;
@@ -582,18 +628,20 @@ my $prune_opt;
 my $normalise_opt;
 my $eliminate_opt;
 my $stabilise_opt;
+my $forward_opt;
 my $report_opt;
 GetOptions('i|input=s' => \$input_opt,
            'o|output=s' => \$output_opt,
            's|symbols=s' => \$symbols_opt,
            'a|all' => sub {
-             $drop_opt = $prune_opt = $normalise_opt = $eliminate_opt = $stabilise_opt = $report_opt = 1
+             $drop_opt = $prune_opt = $normalise_opt = $eliminate_opt = $stabilise_opt = $forward_opt = $report_opt = 1
            },
            'd|drop-empty!' => \$drop_opt,
            'p|prune-unreachable!' => \$prune_opt,
            'n|normalise-anonymous!' => \$normalise_opt,
            'e|eliminate-duplicates!' => \$eliminate_opt,
            't|stabilise-order!' => \$stabilise_opt,
+           'f|resolve-forward!' => \$forward_opt,
            'r|report-duplicates!' => \$report_opt,
   ) and !@ARGV or die("usage: $0",
                       map { (' ', $_) } (
@@ -606,6 +654,7 @@ GetOptions('i|input=s' => \$input_opt,
                         '[-n|--[no-]normalise-anonymous]',
                         '[-e|--[no-]eliminate-duplicates]',
                         '[-t|--[no-]stabilise-order]',
+                        '[-f|--[no-]resolve-forward]',
                         '[-r|--[no-]report-duplicates]',
                       ), "\n");
 
@@ -631,6 +680,9 @@ eliminate_duplicate_types($dom) if $eliminate_opt;
 # Check for duplicate types.
 report_duplicate_types($dom) if $report_opt;
 
+# Check for types which are both declared and defined.
+resolve_forward_declarations($dom) if $forward_opt;
+
 # Prune unreachable elements.
 prune_unreachable($dom) if $prune_opt;
 
-- 
2.31.0.291.g576ba9dcdaf-goog


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-03-25 21:52 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-11 11:53 [RFC PATCH] Add an experimental ABI pruning utility Giuliano Procida
2021-03-11 20:39 ` Ben Woodard
2021-03-12 16:51   ` Giuliano Procida
2021-03-12 18:41     ` Ben Woodard
2021-03-15 11:08       ` Giuliano Procida
2021-03-12 16:59 ` [RFC PATCH v2] " Giuliano Procida
2021-03-16 16:55   ` [RFC PATCH v3] " Giuliano Procida
2021-03-25 21:51     ` [RFC PATCH 0/9] Utility to manipulate ABI XML Giuliano Procida
2021-03-25 21:51       ` [RFC PATCH 1/9] Add ABI tidying utility Giuliano Procida
2021-03-25 21:51       ` [RFC PATCH 2/9] Add pass to drop empty XML elements Giuliano Procida
2021-03-25 21:51       ` [RFC PATCH 3/9] Add pass to prune unreachable parts of the ABI Giuliano Procida
2021-03-25 21:51       ` [RFC PATCH 4/9] Add pass to filter symbols Giuliano Procida
2021-03-25 21:51       ` [RFC PATCH 5/9] Add pass to normalise anonymous type names Giuliano Procida
2021-03-25 21:51       ` [RFC PATCH 6/9] Add pass to report duplicate type ids Giuliano Procida
2021-03-25 21:51       ` [RFC PATCH 7/9] Add pass to eliminate duplicate member-type fragments Giuliano Procida
2021-03-25 21:51       ` [RFC PATCH 8/9] Add pass to stabilise types and declarations Giuliano Procida
2021-03-25 21:51       ` [RFC PATCH 9/9] Add pass to resolve stray forward type declarations Giuliano Procida

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).