From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [216.205.24.124]) by sourceware.org (Postfix) with ESMTP id BFD8D383E804 for ; Thu, 11 Mar 2021 20:39:51 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org BFD8D383E804 Received: from mail-qv1-f71.google.com (mail-qv1-f71.google.com [209.85.219.71]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-107-M7IP7Y07P1eQ1b0kSKaKMQ-1; Thu, 11 Mar 2021 15:39:49 -0500 X-MC-Unique: M7IP7Y07P1eQ1b0kSKaKMQ-1 Received: by mail-qv1-f71.google.com with SMTP id iy2so16011236qvb.22 for ; Thu, 11 Mar 2021 12:39:49 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=aZDpuJT/lrjWa7hF0iAkgrVSSc5rMtBW/P/AEbVgxyg=; b=jWQvHNTOYB2ykhbycM8hCHL2dTL2oiDWmHO7vu6ksq3/r/7s9ZWKgPcJS28kpLZ5/N p7xX8QR7xozDG8T1XZEApx1wZu7mR88bQzTZv92v5MSiKY3R3hdIGQsCxKmq6eh3PlEP fqVLsU6znc/jOlWitwo7A2N9Bec4Ur1My6nTTA/Pk63fluRqXOo3Xq4iGWb6jcCl/W1d 54PbLQYUT785uJgtzc6TZsohrm9kH6UiGaC1kItPgkVUIfxeUHIXkqnpWVnosXqpXGY9 HhmOX4Rx0MH664BQFcbxZikqDGUkRY1S/K51jecGHYpFHeNBNfdG0DCfK69EJEci6TUB fucQ== X-Gm-Message-State: AOAM532cDygpZJ5M/SgWFM1/SVugBA4UAMYKPsn0USJtoAQPj393ZXPs clXVT2xDc6FFQZDBxC+kibCJF6iaWAhRtUbddw9X3768l1IUkotXwMFEvgnwPW+hzyXakdV3t4A SDtJ+KUvHHfC4vJt1EWM7 X-Received: by 2002:a05:620a:1443:: with SMTP id i3mr9435844qkl.354.1615495188971; Thu, 11 Mar 2021 12:39:48 -0800 (PST) X-Google-Smtp-Source: ABdhPJxMlhqWmHzvSdlmGHQkoBICbp8yTE24VqRKeJwxT/1XclnxVkhbp0kNE/hf/W2cqQfD28YNDg== X-Received: by 2002:a05:620a:1443:: with SMTP id i3mr9435794qkl.354.1615495188257; Thu, 11 Mar 2021 12:39:48 -0800 (PST) Received: from [192.168.1.234] (47-208-193-143.trckcmtc01.res.dyn.suddenlink.net. [47.208.193.143]) by smtp.gmail.com with ESMTPSA id h67sm2809414qkd.112.2021.03.11.12.39.46 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Thu, 11 Mar 2021 12:39:47 -0800 (PST) Mime-Version: 1.0 (Mac OS X Mail 14.0 \(3654.60.0.2.21\)) Subject: Re: [RFC PATCH] Add an experimental ABI pruning utility. From: Ben Woodard In-Reply-To: <20210311115340.3033687-1-gprocida@google.com> Date: Thu, 11 Mar 2021 12:39:44 -0800 Cc: libabigail@sourceware.org, Matthias Maennich , kernel-team@android.com Message-Id: References: <20210311115340.3033687-1-gprocida@google.com> To: Giuliano Procida X-Mailer: Apple Mail (2.3654.60.0.2.21) X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-12.3 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: libabigail@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Mailing list of the Libabigail project List-Unsubscribe: , List-Archive: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 11 Mar 2021 20:39:54 -0000 I think that you have pointed out a problem that I=E2=80=99ve been discussi= ng with some users that I=E2=80=99m working with. The sheer volume of the a= bixml output is pretty extreme. I don=E2=80=99t think that they were really= expecting it. However, they are not running into the particular situation = that you are seeing since they don=E2=80=99t have the coding model of the p= roblem library. They are more likely to fall in the 5-10% range that you re= ported above.=20 We have been talking about it in terms of =E2=80=9CABI Surface=E2=80=9D vs.= =E2=80=9CABI Corpus=E2=80=9D. The ELF reachability aspect that you point o= ut seems to be part of the difference between those two concepts. > On Mar 11, 2021, at 3:53 AM, Giuliano Procida via Libabigail wrote: >=20 > In response to internal reports about very large ABIs for libraires > implemented in C++ but exposing only a C API, I wrote a script as a > means of investigating the issue. >=20 > The script aims to remove parts of the XML that are irrelevant to > abidiff. It has been run on the current libabigail tests as well as > some locally-generated ABI files. >=20 > On most XML files, this results in a modest saving (5-10%). However, > for the library that sparked the invesigaton, the resulting XML was > more than 2300 times smaller. >=20 > The script may have some merit on its own but perhaps should be > implemented differently. >=20 > Question: How hard would it be to implement this functionality (under > flag control) within libabigail itself? I can=E2=80=99t directly answer =E2=80=9Chow hard=E2=80=9D that is one of t= he things which we would discover by actually doing it. As for should it be= done. I think it should be carefully looked into. However, as for priority= since it is an optimization rather than a functional problem, if it were m= e, I would file this as a bug and put examples of the test data and results= you are seeing as well as the script below in that bz and then start worki= ng on the process of refining libabigail=E2=80=99s output so that it matche= s the results of your script. -ben >=20 > A couple of tangential things worth mentioning (see the comments at > the very beginning and very end of the script): >=20 > - abidiff rejects XML files with an XML declaration at the top > - abidiff rejects completely empty files >=20 > Resolving the first would make the second moot. >=20 > =09* scripts/abiprune.pl: Add script to prune ABI XML. >=20 > Signed-off-by: Giuliano Procida > --- > scripts/abiprune.pl | 323 ++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 323 insertions(+) > create mode 100755 scripts/abiprune.pl >=20 > diff --git a/scripts/abiprune.pl b/scripts/abiprune.pl > new file mode 100755 > index 00000000..8e04a273 > --- /dev/null > +++ b/scripts/abiprune.pl > @@ -0,0 +1,323 @@ > +#!/usr/bin/perl > + > +# This script is intended to consume libabigail ABI XML as generated > +# by abidw and produce a possibly smaller representation that captures > +# the same ABI. In particular, the output should be such that abidiff > +# --harmless reports no differences (or is empty). > +# > +# It does this by modelling symbols and types dependencies as a > +# directed graph, marking nodes reachable from the ELF symbols and > +# dropping unreachable nodes. > + > +use strict; > +use warnings; > +no warnings 'recursion'; > + > +use autodie; > + > +use XML::LibXML; > + > +# Making a graph from ABI XML. > +# > +# The following are the types of "node" we care about. The "edges" are > +# obtained from a few XML attributes as well as via XML element > +# containment. > +# > +# ELF (exported) symbols > +# > +# elf-symbol (has a name; the code here currently knows nothing > +# about aliases) > +# > +# Declarations (that mention a symbol) > +# > +# These live in a scope. In C++ scopes can be nested and include > +# namespaces and class types. > +# > +# var-decl (also used for member variables) > +# elf-symbol linked to symbol via mangled-name > +# type-id links to a type > +# function-decl (also used for member functions) > +# elf-symbol linked to symbol via mangled-name > +# parameter and return type-ids link to types > +# (alas, not just a link to a function type) > +# > +# Types > +# > +# These occupy pretty much all the other elements, besides those > +# that act as simple containers. > +# > +# XML elements and their roles in more detail > +# > +# ELF > +# > +# elf-needed - container > +# dependency - names a library > +# elf-function-symbols - contains a list of symbols > +# elf-variable-symbols - contains a list of symbols > +# elf-symbol - describes an ELF variable or function > +# > +# Grouping and scoping > +# > +# abi-corpus-group > +# abi-corpus > +# abi-instr - compilation unit containers > +# namespace-decl - pure container, possibly named > +# > +# Types (some introduce scopes, only in C++) > +# > +# type-decl - defines a primitive type > +# typedef-decl - defines a type, links to a type > +# qualified-type-def - defines a type, links to a type > +# pointer-type-def - defines a type, links to a type > +# reference-type-def - defines a type, links to a type > +# array-type-def - defines a (multidimensional array) type, refers to el= ement type, contains subranges > +# subrange - contains array length, refers to element type; defines typ= es (never referred to; duplicated) > +# function-type - defines a type > +# parameter - belongs to function-type and -decl, links to a type > +# return - belongs to function-type and -decl, links to a type > +# enum-decl - defines a type, names it, contains a list of enumerators a= nd an underlying-type > +# underlying-type - belongs to enum-decl > +# enumerator - belongs to enum-decl > +# union-decl - defines and names a type, contains member elements linked= to other things > +# class-decl - defines and names a type, contains base type, member elem= ents linking to other things > +# base-class - belongs to class-decl > +# data-member - container for a member; holds access level > +# member-function - container for a member; holds access level > +# member-type - container for a type declaration; holds access level > +# member-template - container for a (function?) template declaration; h= olds access level > +# > +# Higher order Things > +# > +# class-template-decl - defines a "type", but without instantiation this= isn't usable > +# function-template-decl - defines a "type", but without instantiation t= his isn't usable > +# template-type-parameter - defines a type (formal variable), perhaps o= ne which should be excluded from the real type graph > +# template-non-type-parameter - names a template parameter, links to a = type > +# template-parameter-type-composition - container? > +# > +# Values > +# > +# function-decl - names a function, can link to a symbol > +# has same children as function-type, rather than linking to a type > +# var-decl - names a variable, can link to a symbol, links to a type > + > +# Graph nodes. > +my %symbols; > +my %exported_symbols; > +my %types; > +# Graph edges. > +my %type_type_edges; > +my %symbol_type_edges; > +my %type_symbol_edges; > + > +# Keep track of the types being defined. > +my @id_stack; > +# Keep track of the current (value) declaration. > +my @symbol_stack; > +sub process($); > +sub process($) { > + my ($node) =3D @_; > + > + # The XML attributes we care about. > + my $name; > + my $id; > + my $type_id; > + my $symbol; > + my $naming_typedef_id; > + > + # Not every node we encounter is an XML element. > + if ($node->nodeType =3D=3D XML_ELEMENT_NODE) { > + $name =3D $node->getAttribute('name'); > + $id =3D $node->getAttribute('id'); > + $type_id =3D $node->getAttribute('type-id'); > + $symbol =3D $node->getAttribute('mangled-name'); > + $naming_typedef_id =3D $node->getAttribute('naming-typedef-id'); > + die if defined $id && defined $symbol; > + } > + > + if (defined $name && $node->getName() eq 'elf-symbol') { > + $exported_symbols{$name} =3D undef; > + # Early return is safe, but not necessary. > + return; > + } > + > + if (defined $id) { > + # This element defines a type (but there may be more than one > + # defining the same type - we cannot rely on uniqueness). > + $types{$id} =3D undef; > + if (defined $naming_typedef_id) { > + # This is an odd one, there can be a backwards link from an > + # anonymous type to the typedef that refers to it, so we need to > + # pull in the typedef, even if nothing else refers to it. > + $type_type_edges{$id}{$naming_typedef_id}{naming} =3D undef; > + } > + if (@id_stack) { > + # Type parent<->child dependencies; record dependencies both > + # ways to avoid making holes in the XML types. > + $type_type_edges{$id_stack[-1]}{$id}{contains} =3D undef; > + $type_type_edges{$id}{$id_stack[-1]}{in} =3D undef; > + } > + push @id_stack, $id; > + } > + > + if (defined $symbol) { > + # This element is a declaration linked to a symbol (whether or not > + # exported). > + $symbols{$symbol} =3D undef; > + if (@id_stack) { > + # The enclosing scope may include a C++ class, in which case we > + # need to record the dependency. > + $type_symbol_edges{$id_stack[-1]}{$symbol} =3D undef; > + } > + # The symbol depends on the types mentioned in this element, so > + # record it. > + push @symbol_stack, $symbol; > + die "nested symbols: @symbol_stack\n" if scalar @symbol_stack > 1; > + } > + > + if (defined $type_id) { > + if (@id_stack) { > + # The enclosing type refers to this type, record the dependency. > + $type_type_edges{$id_stack[-1]}{$type_id}{depends} =3D undef; > + } > + if (@symbol_stack) { > + # The enclosing declaration (linked to a symbol) refers to this > + # type, record the dependency. > + $symbol_type_edges{$symbol_stack[-1]}{$type_id} =3D undef; > + } > + } > + > + for my $child ($node->nonBlankChildNodes()) { > + process($child); > + } > + > + if (defined $symbol) { > + pop @symbol_stack; > + } > + if (defined $id) { > + pop @id_stack; > + } > +} > + > +# Load the XML. > +my $dom =3D XML::LibXML->load_xml(IO =3D> \*STDIN); > +# Build a graph. > +process($dom); > +die if @id_stack or @symbol_stack; > + > +# DFS visited state. > +my %seen_types; > +my %seen_symbols; > +sub dfs_type($); > +sub dfs_symbol($); > + > +sub dfs_type($) { > + my ($type) =3D @_; > + return if exists $seen_types{$type}; > + $seen_types{$type} =3D undef; > + > + my $types =3D $type_type_edges{$type}; > + if (defined $types) { > + for my $type (keys %$types) { > + for my $link (keys %{$types->{$type}}) { > + if (exists $types->{$type}{$link}) { > + dfs_type($type); > + } > + } > + } > + } > + my $symbols =3D $type_symbol_edges{$type}; > + if (defined $symbols) { > + for my $symbol (keys %$symbols) { > + dfs_symbol($symbol); > + } > + } > +} > + > +sub dfs_symbol($) { > + my ($symbol) =3D @_; > + return if exists $seen_symbols{$symbol}; > + $seen_symbols{$symbol} =3D undef; > + > + my $types =3D $symbol_type_edges{$symbol}; > + if (defined $types) { > + for my $type (keys %$types) { > + dfs_type($type); > + } > + } else { > + warn "no declaration found for exported symbol $symbol\n"; > + } > +} > + > +# Traverse the graph, using the exported symbols as the roots. > +for my $symbol (keys %exported_symbols) { > + dfs_symbol($symbol); > +} > + > +# Useful counts. > +sub print_report() { > + my $count_symbols =3D scalar keys %symbols; > + my $count_exported_symbols =3D scalar keys %exported_symbols; > + my $count_types =3D scalar keys %types; > + my $count_reachable_types =3D scalar keys %seen_types; > + > + print STDERR qq{symbols =3D $count_symbols > +exported symbols =3D $count_exported_symbols > +types =3D $count_types > +reachable types =3D $count_reachable_types > +}; > +} > + > +my %drop_if_empty =3D map { $_ =3D> undef } qw( > + namespace-decl > + abi-instr > + abi-corpus > + abi-corpus-group > +); > + > +# This is another DFS traversal of the XML. > +sub prune($); > +sub prune($) { > + my ($node) =3D @_; > + > + my $name =3D $node->getName(); > + my $id; > + my $symbol; > + > + if ($node->nodeType =3D=3D XML_ELEMENT_NODE) { > + $id =3D $node->getAttribute('id'); > + $symbol =3D $node->getAttribute('mangled-name'); > + die if defined $id && defined $symbol; > + } > + > + # Return if we know that this is a type or declaration to keep or > + # drop in its entirety. > + if (defined $id) { > + return !exists $seen_types{$id}; > + } > + if ($name eq 'var-decl' || $name eq 'function-decl') { > + return !defined $symbol || !exists $exported_symbols{$symbol}; > + } > + > + # Otherwise, this is not a type, declaration or part thereof, so > + # process child elements. > + my $empty =3D 1; > + for my $child ($node->nonBlankChildNodes()) { > + if (prune($child)) { > + $node->removeChild($child); > + } else { > + $empty =3D 0; > + } > + } > + > + # Empty container elements can be dropped. > + return $empty && exists $drop_if_empty{$name}; > +} > + > +# Drop unreachable nodes and any container nodes that end up empty. > +prune($dom); > +my $out =3D $dom->toString(); > +# Remove blank line debris. > +$out =3D~ s;^ *\n;;mg; > +# Remove the XML declaration as abidiff doesn't like it. > +$out =3D~ s;^<\?xml .*\?>\n;;m; > +print $out; > --=20 > 2.31.0.rc2.261.g7f71774620-goog >=20