From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ua1-x92b.google.com (mail-ua1-x92b.google.com [IPv6:2607:f8b0:4864:20::92b]) by sourceware.org (Postfix) with ESMTPS id 38E9D3858D29 for ; Mon, 15 Mar 2021 11:09:06 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 38E9D3858D29 Received: by mail-ua1-x92b.google.com with SMTP id h34so4055143uah.5 for ; Mon, 15 Mar 2021 04:09:06 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=0hpoKcHV2RUaAN4Vg4/9L+BNPs7LmL6tu740lcaSizE=; b=nX8aIxjMMsVJHXGxpzwRszTCNqHPmWxNlj0a8kA43zqXTeAt2gYzX4XQYpfEFJKVzZ SV9mEsu8virJXrpqLXc0rpmrrFPGWDcnvT5FcOvphEGQ9ntYsDz7F+6IcuX4dMyu0O3n VOWItZBCcEZGetJ2fBPbLLipYvhynNMhmvn83x8vkLUp4/DRtUZIcNTInYkDWVE+WU+Q WpmlEtUePTBie9cC+15d33SX/cyqHZsAh5s2KLPqloYyaZaiVHs3iRBWqV0o4XBvVkGs NyIOJspYQsxe7HHahO8XcA8YlbpJSmL06PJdV8Bv//bpBg7fzVJrdpRcBUxuzwgwLxqW LenQ== X-Gm-Message-State: AOAM5315Hf0ddRZ6YvWrOkWVXiq2iFoNJ7E4hI5ZzcSTwLTP6F6IvSyA DGigA1EDFSB12zoqvL5PbUU1KgKjugFFpjyl8nD3bw== X-Google-Smtp-Source: ABdhPJxvozntFoLdqdLip0ccyTUyoeRaAzMhzoSkrq73kOfqj6CZ0F9vs4ikbGnA6zLSycKi9sQIuwkNYS6cG+4LNIs= X-Received: by 2002:ab0:5381:: with SMTP id k1mr4457595uaa.84.1615806545475; Mon, 15 Mar 2021 04:09:05 -0700 (PDT) MIME-Version: 1.0 References: <20210311115340.3033687-1-gprocida@google.com> <0716D5B1-4285-4C0E-9231-14663D092188@redhat.com> In-Reply-To: <0716D5B1-4285-4C0E-9231-14663D092188@redhat.com> From: Giuliano Procida Date: Mon, 15 Mar 2021 11:08:28 +0000 Message-ID: Subject: Re: [RFC PATCH] Add an experimental ABI pruning utility. To: Ben Woodard Cc: Giuliano Procida via Libabigail , Matthias Maennich , kernel-team@android.com X-Spam-Status: No, score=-29.6 required=5.0 tests=BAYES_00, DKIMWL_WL_MED, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, ENV_AND_HDR_SPF_MATCH, GIT_PATCH_0, HTML_MESSAGE, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP, USER_IN_DEF_DKIM_WL, USER_IN_DEF_SPF_WL autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.29 X-BeenThere: libabigail@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Mailing list of the Libabigail project List-Unsubscribe: , List-Archive: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 15 Mar 2021 11:09:10 -0000 Hi. On Fri, 12 Mar 2021 at 18:41, Ben Woodard wrote: > if you look at abicompat there are functions like: > perform_compat_check_in_*_mode(options& opts, > diff_context_sptr& ctxt, > corpus_sptr app_corpus, > corpus_sptr lib1_corpus, > corpus_sptr lib2_corpus) > which ultimately does something like your perl script=E2=80=99s ELF reach= ability > check when computing compatibility between libraries. It iterates through > the undefined symbols and then calls get_sym_ids_of_*_to_keep(). > > So abicompat is a bit like abidiff -w symbol_list where the symbol_list is just taken from an application. > I would think that you would be able to essentially do the same thing > inside of abidw to have it prune you abixml file in a similar way to how > your perl script does. > > Here the symbol list would be taken to be the ELF symbols. The harder part is implementing the reachability traversal. > (Paranthetical Note: one thing that really bothers me about those > functions which I hope to work on as soon as I get a chance is I believe > that they need to take const corpus_sptr=E2=80=99s for the app and lib1 l= ib2 > corpuses. Of course this means that the data structures would have to > change and the sym_ids_of_*_to_keep() would need to be some sort of > external data structure rather than something inside the corpus itself, > I=E2=80=99ve been calling this a subcorpus as I=E2=80=99ve been thinking = about how to make > this change happen. > > The problem with the way that it is now is when using libabigail as a > library, rather than as it is used in a tool like abicompat you want to b= e > able to load a collection of corpora which takes non-trivial time in many > cases and use them over and over. For example say you want to load the > corpora of many variations of libraries and find the ones that are > compatible with each other. You are going to want to use the app corpus > over and over without having to reload it from the ELF or abixml each and > every time. Yeah as a first pass you could clear the vectors with the > *_to_keep but then you would only be able to use the corpus in one thread > at a time. I think it would make better library design to move those thin= gs > to the side.) > > Some of the existing traversals in libabigail also have the notion of "visited" as part of another object (like diff_context) and so mutate this object, rather than being instantiated just for the given traversal. I agree that it would be good to separate out this kind of interpretive state from underlying objects. In the case of optionally pruning unreachable symbols and types, having extra state in read_context would make what's a very large object even larger. Logically, what we'd want to happen is something like: set reachable(const set & roots, const corpus & corpus) { set visited; for (const auto & root : roots) visit(visited, root, corpus); return visited; } corpus =3D read_corpus(); if (prune) { auto roots =3D exported(corpus); auto reachable =3D reachable(roots, corpus); corpus =3D filter(reachable, corpus); } write_corpus(corpus); On Mar 12, 2021, at 8:51 AM, Giuliano Procida wrote: > > Hi. > > On Thu, 11 Mar 2021 at 20:39, Ben Woodard wrote: > >> I think that you have pointed out a problem that I=E2=80=99ve been discu= ssing >> with some users that I=E2=80=99m working with. The sheer volume of the a= bixml >> output is pretty extreme. I don=E2=80=99t think that they were really ex= pecting it. >> However, they are not running into the particular situation that you are >> seeing since they don=E2=80=99t have the coding model of the problem lib= rary. They >> are more likely to fall in the 5-10% range that you reported above. >> >> We have been talking about it in terms of =E2=80=9CABI Surface=E2=80=9D = vs. =E2=80=9CABI Corpus=E2=80=9D. >> The ELF reachability aspect that you point out seems to be part of the >> difference between those two concepts. >> >> > We want a distilled description of the ABI (the surface), not what may li= e > on the other side. More precisely, we want a typed interface where "type" > encompasses extra things like architecture, calling convention, ISA, ELF > symbol properties etc. > > One worry is that just chasing references from the exported ELF symbols > may somehow miss something important, say for libraries that ship with bi= ts > of implementation in header files. > > > Right now given the implementation above, it looks like ELF reachability > is the only thing that abicompat takes into consideration. What you are > saying potentially has important consequences because it could suggest th= at > abicompat may be missing things. > > Do you know of or could you construct an example that demonstrates a case > that would be missed by this ELF reachability? It may mean that we need t= o > look at ways to enhance the checks in abicompat and in doing so they coul= d > also be part of the code that you use to prune the output of abidw. > > Not really. My main thoughts were constructors and (as you mention) inline functions in general. If we think abouts what's being pruned: 1. declarations of functions and variables, absent from the ELF symbol tabl= e These are either unusable (link failures) or have an implementation entirely within header files. In the simplest cases, a constexpr value or the behaviour of an inline function could change between library versions, or their types could change. We cannot possibly detect all of these kinds of things. The changes may well be incompatible. Note that abidiff doesn't report any such changes; I checked both with abidiff -w and using objcopy to strip unwanted symbols. 2. types that are not reachable from any of the ELF symbols If the types are not POD and caused the emission of constructors etc., then the functions are covered by 1. For POD types, the user can have code that builds, consumes and mutates objects of these types. But it has no way to pass such objects through the library interface (symbols) unless the types are hidden behind things like void *. If in some future revision of libabigail, we had a mechanism to say "this void could actually be one of these other types" in the ABI, then the types would become reachable, diffable and also not pruned by abiprune. The thing that pops into my mind would be inline functions defined in the > header files. We may be able to find those by searching through the TUs b= ut > then there are things like cpp macros even with -g3 I=E2=80=99m not sure = we could > find those but I=E2=80=99m not entirely sure we would need to. > > I really do believe that this topic needs additional investigation. > > I think the question to answer is roughly: what can we get from the library headers that is not reachable from the export ELF symbols? And the answer consists of the values of all defines, macros, constants, inline functions. However, it may be necessary to subtract some of the OS "background" noise. Just because a random dependency of the library introduces a new constant, it doesn't mean we care. We'd also need to decide where to draw a line (which header files are excluded). At one end of the spectrum, we'd need a sophisticated parser and deep knowledge of the language semantics and at the other we'd just offer up header diffs. How much of this were you hoping would be in the TUs (bearing in mind that the library may not even use everything it makes available to client code)? > -ben > > Giuliano. > >> > On Mar 11, 2021, at 3:53 AM, Giuliano Procida via Libabigail < >> libabigail@sourceware.org> wrote: >> > >> > In response to internal reports about very large ABIs for libraires >> > implemented in C++ but exposing only a C API, I wrote a script as a >> > means of investigating the issue. >> > >> > The script aims to remove parts of the XML that are irrelevant to >> > abidiff. It has been run on the current libabigail tests as well as >> > some locally-generated ABI files. >> > >> > On most XML files, this results in a modest saving (5-10%). However, >> > for the library that sparked the invesigaton, the resulting XML was >> > more than 2300 times smaller. >> > >> > The script may have some merit on its own but perhaps should be >> > implemented differently. >> > >> > Question: How hard would it be to implement this functionality (under >> > flag control) within libabigail itself? >> >> I can=E2=80=99t directly answer =E2=80=9Chow hard=E2=80=9D that is one o= f the things which we >> would discover by actually doing it. As for should it be done. I think i= t >> should be carefully looked into. However, as for priority since it is an >> optimization rather than a functional problem, if it were me, I would fi= le >> this as a bug and put examples of the test data and results you are seei= ng >> as well as the script below in that bz and then start working on the >> process of refining libabigail=E2=80=99s output so that it matches the r= esults of >> your script. >> >> > It's an optimisation analogous to DCE in a compiler. I'll certainly follo= w > up with a bug. It's one library in particular that is ~50M before pruning > and ~22k after. > > I've already made changes to the script so that it can be given a list of > symbols and produce a reduced ABI. At the moment, we achieve this by > passing a list to abidw (with -w). In fact, we extract full and reduced > ABIs concurrently in the Android kernel build as each run takes a while. = It > should be straightforward to reimplement the script in C++, though it's n= ot > too slow already. > > In terms of implementation, it would need to work with libabigail IR and > ir_node_visitor might be the right place to start. > > Giuliano. > > >> -ben >> >> > >> > A couple of tangential things worth mentioning (see the comments at >> > the very beginning and very end of the script): >> > >> > - abidiff rejects XML files with an XML declaration at the top >> > - abidiff rejects completely empty files >> > >> > Resolving the first would make the second moot. >> > >> > * scripts/abiprune.pl: Add script to prune ABI XML. >> > >> > Signed-off-by: Giuliano Procida >> > --- >> > scripts/abiprune.pl | 323 ++++++++++++++++++++++++++++++++++++++++++++ >> > 1 file changed, 323 insertions(+) >> > create mode 100755 scripts/abiprune.pl >> > >> > diff --git a/scripts/abiprune.pl b/scripts/abiprune.pl >> > new file mode 100755 >> > index 00000000..8e04a273 >> > --- /dev/null >> > +++ b/scripts/abiprune.pl >> > @@ -0,0 +1,323 @@ >> > +#!/usr/bin/perl >> > + >> > +# This script is intended to consume libabigail ABI XML as generated >> > +# by abidw and produce a possibly smaller representation that capture= s >> > +# the same ABI. In particular, the output should be such that abidiff >> > +# --harmless reports no differences (or is empty). >> > +# >> > +# It does this by modelling symbols and types dependencies as a >> > +# directed graph, marking nodes reachable from the ELF symbols and >> > +# dropping unreachable nodes. >> > + >> > +use strict; >> > +use warnings; >> > +no warnings 'recursion'; >> > + >> > +use autodie; >> > + >> > +use XML::LibXML; >> > + >> > +# Making a graph from ABI XML. >> > +# >> > +# The following are the types of "node" we care about. The "edges" ar= e >> > +# obtained from a few XML attributes as well as via XML element >> > +# containment. >> > +# >> > +# ELF (exported) symbols >> > +# >> > +# elf-symbol (has a name; the code here currently knows nothing >> > +# about aliases) >> > +# >> > +# Declarations (that mention a symbol) >> > +# >> > +# These live in a scope. In C++ scopes can be nested and include >> > +# namespaces and class types. >> > +# >> > +# var-decl (also used for member variables) >> > +# elf-symbol linked to symbol via mangled-name >> > +# type-id links to a type >> > +# function-decl (also used for member functions) >> > +# elf-symbol linked to symbol via mangled-name >> > +# parameter and return type-ids link to types >> > +# (alas, not just a link to a function type) >> > +# >> > +# Types >> > +# >> > +# These occupy pretty much all the other elements, besides those >> > +# that act as simple containers. >> > +# >> > +# XML elements and their roles in more detail >> > +# >> > +# ELF >> > +# >> > +# elf-needed - container >> > +# dependency - names a library >> > +# elf-function-symbols - contains a list of symbols >> > +# elf-variable-symbols - contains a list of symbols >> > +# elf-symbol - describes an ELF variable or function >> > +# >> > +# Grouping and scoping >> > +# >> > +# abi-corpus-group >> > +# abi-corpus >> > +# abi-instr - compilation unit containers >> > +# namespace-decl - pure container, possibly named >> > +# >> > +# Types (some introduce scopes, only in C++) >> > +# >> > +# type-decl - defines a primitive type >> > +# typedef-decl - defines a type, links to a type >> > +# qualified-type-def - defines a type, links to a type >> > +# pointer-type-def - defines a type, links to a type >> > +# reference-type-def - defines a type, links to a type >> > +# array-type-def - defines a (multidimensional array) type, refers to >> element type, contains subranges >> > +# subrange - contains array length, refers to element type; defines >> types (never referred to; duplicated) >> > +# function-type - defines a type >> > +# parameter - belongs to function-type and -decl, links to a type >> > +# return - belongs to function-type and -decl, links to a type >> > +# enum-decl - defines a type, names it, contains a list of enumerator= s >> and an underlying-type >> > +# underlying-type - belongs to enum-decl >> > +# enumerator - belongs to enum-decl >> > +# union-decl - defines and names a type, contains member elements >> linked to other things >> > +# class-decl - defines and names a type, contains base type, member >> elements linking to other things >> > +# base-class - belongs to class-decl >> > +# data-member - container for a member; holds access level >> > +# member-function - container for a member; holds access level >> > +# member-type - container for a type declaration; holds access level >> > +# member-template - container for a (function?) template declaration= ; >> holds access level >> > +# >> > +# Higher order Things >> > +# >> > +# class-template-decl - defines a "type", but without instantiation >> this isn't usable >> > +# function-template-decl - defines a "type", but without instantiatio= n >> this isn't usable >> > +# template-type-parameter - defines a type (formal variable), perhap= s >> one which should be excluded from the real type graph >> > +# template-non-type-parameter - names a template parameter, links to >> a type >> > +# template-parameter-type-composition - container? >> > +# >> > +# Values >> > +# >> > +# function-decl - names a function, can link to a symbol >> > +# has same children as function-type, rather than linking to a ty= pe >> > +# var-decl - names a variable, can link to a symbol, links to a type >> > + >> > +# Graph nodes. >> > +my %symbols; >> > +my %exported_symbols; >> > +my %types; >> > +# Graph edges. >> > +my %type_type_edges; >> > +my %symbol_type_edges; >> > +my %type_symbol_edges; >> > + >> > +# Keep track of the types being defined. >> > +my @id_stack; >> > +# Keep track of the current (value) declaration. >> > +my @symbol_stack; >> > +sub process($); >> > +sub process($) { >> > + my ($node) =3D @_; >> > + >> > + # The XML attributes we care about. >> > + my $name; >> > + my $id; >> > + my $type_id; >> > + my $symbol; >> > + my $naming_typedef_id; >> > + >> > + # Not every node we encounter is an XML element. >> > + if ($node->nodeType =3D=3D XML_ELEMENT_NODE) { >> > + $name =3D $node->getAttribute('name'); >> > + $id =3D $node->getAttribute('id'); >> > + $type_id =3D $node->getAttribute('type-id'); >> > + $symbol =3D $node->getAttribute('mangled-name'); >> > + $naming_typedef_id =3D $node->getAttribute('naming-typedef-id'); >> > + die if defined $id && defined $symbol; >> > + } >> > + >> > + if (defined $name && $node->getName() eq 'elf-symbol') { >> > + $exported_symbols{$name} =3D undef; >> > + # Early return is safe, but not necessary. >> > + return; >> > + } >> > + >> > + if (defined $id) { >> > + # This element defines a type (but there may be more than one >> > + # defining the same type - we cannot rely on uniqueness). >> > + $types{$id} =3D undef; >> > + if (defined $naming_typedef_id) { >> > + # This is an odd one, there can be a backwards link from an >> > + # anonymous type to the typedef that refers to it, so we need t= o >> > + # pull in the typedef, even if nothing else refers to it. >> > + $type_type_edges{$id}{$naming_typedef_id}{naming} =3D undef; >> > + } >> > + if (@id_stack) { >> > + # Type parent<->child dependencies; record dependencies both >> > + # ways to avoid making holes in the XML types. >> > + $type_type_edges{$id_stack[-1]}{$id}{contains} =3D undef; >> > + $type_type_edges{$id}{$id_stack[-1]}{in} =3D undef; >> > + } >> > + push @id_stack, $id; >> > + } >> > + >> > + if (defined $symbol) { >> > + # This element is a declaration linked to a symbol (whether or no= t >> > + # exported). >> > + $symbols{$symbol} =3D undef; >> > + if (@id_stack) { >> > + # The enclosing scope may include a C++ class, in which case we >> > + # need to record the dependency. >> > + $type_symbol_edges{$id_stack[-1]}{$symbol} =3D undef; >> > + } >> > + # The symbol depends on the types mentioned in this element, so >> > + # record it. >> > + push @symbol_stack, $symbol; >> > + die "nested symbols: @symbol_stack\n" if scalar @symbol_stack > 1= ; >> > + } >> > + >> > + if (defined $type_id) { >> > + if (@id_stack) { >> > + # The enclosing type refers to this type, record the dependency= . >> > + $type_type_edges{$id_stack[-1]}{$type_id}{depends} =3D undef; >> > + } >> > + if (@symbol_stack) { >> > + # The enclosing declaration (linked to a symbol) refers to this >> > + # type, record the dependency. >> > + $symbol_type_edges{$symbol_stack[-1]}{$type_id} =3D undef; >> > + } >> > + } >> > + >> > + for my $child ($node->nonBlankChildNodes()) { >> > + process($child); >> > + } >> > + >> > + if (defined $symbol) { >> > + pop @symbol_stack; >> > + } >> > + if (defined $id) { >> > + pop @id_stack; >> > + } >> > +} >> > + >> > +# Load the XML. >> > +my $dom =3D XML::LibXML->load_xml(IO =3D> \*STDIN); >> > +# Build a graph. >> > +process($dom); >> > +die if @id_stack or @symbol_stack; >> > + >> > +# DFS visited state. >> > +my %seen_types; >> > +my %seen_symbols; >> > +sub dfs_type($); >> > +sub dfs_symbol($); >> > + >> > +sub dfs_type($) { >> > + my ($type) =3D @_; >> > + return if exists $seen_types{$type}; >> > + $seen_types{$type} =3D undef; >> > + >> > + my $types =3D $type_type_edges{$type}; >> > + if (defined $types) { >> > + for my $type (keys %$types) { >> > + for my $link (keys %{$types->{$type}}) { >> > + if (exists $types->{$type}{$link}) { >> > + dfs_type($type); >> > + } >> > + } >> > + } >> > + } >> > + my $symbols =3D $type_symbol_edges{$type}; >> > + if (defined $symbols) { >> > + for my $symbol (keys %$symbols) { >> > + dfs_symbol($symbol); >> > + } >> > + } >> > +} >> > + >> > +sub dfs_symbol($) { >> > + my ($symbol) =3D @_; >> > + return if exists $seen_symbols{$symbol}; >> > + $seen_symbols{$symbol} =3D undef; >> > + >> > + my $types =3D $symbol_type_edges{$symbol}; >> > + if (defined $types) { >> > + for my $type (keys %$types) { >> > + dfs_type($type); >> > + } >> > + } else { >> > + warn "no declaration found for exported symbol $symbol\n"; >> > + } >> > +} >> > + >> > +# Traverse the graph, using the exported symbols as the roots. >> > +for my $symbol (keys %exported_symbols) { >> > + dfs_symbol($symbol); >> > +} >> > + >> > +# Useful counts. >> > +sub print_report() { >> > + my $count_symbols =3D scalar keys %symbols; >> > + my $count_exported_symbols =3D scalar keys %exported_symbols; >> > + my $count_types =3D scalar keys %types; >> > + my $count_reachable_types =3D scalar keys %seen_types; >> > + >> > + print STDERR qq{symbols =3D $count_symbols >> > +exported symbols =3D $count_exported_symbols >> > +types =3D $count_types >> > +reachable types =3D $count_reachable_types >> > +}; >> > +} >> > + >> > +my %drop_if_empty =3D map { $_ =3D> undef } qw( >> > + namespace-decl >> > + abi-instr >> > + abi-corpus >> > + abi-corpus-group >> > +); >> > + >> > +# This is another DFS traversal of the XML. >> > +sub prune($); >> > +sub prune($) { >> > + my ($node) =3D @_; >> > + >> > + my $name =3D $node->getName(); >> > + my $id; >> > + my $symbol; >> > + >> > + if ($node->nodeType =3D=3D XML_ELEMENT_NODE) { >> > + $id =3D $node->getAttribute('id'); >> > + $symbol =3D $node->getAttribute('mangled-name'); >> > + die if defined $id && defined $symbol; >> > + } >> > + >> > + # Return if we know that this is a type or declaration to keep or >> > + # drop in its entirety. >> > + if (defined $id) { >> > + return !exists $seen_types{$id}; >> > + } >> > + if ($name eq 'var-decl' || $name eq 'function-decl') { >> > + return !defined $symbol || !exists $exported_symbols{$symbol}; >> > + } >> > + >> > + # Otherwise, this is not a type, declaration or part thereof, so >> > + # process child elements. >> > + my $empty =3D 1; >> > + for my $child ($node->nonBlankChildNodes()) { >> > + if (prune($child)) { >> > + $node->removeChild($child); >> > + } else { >> > + $empty =3D 0; >> > + } >> > + } >> > + >> > + # Empty container elements can be dropped. >> > + return $empty && exists $drop_if_empty{$name}; >> > +} >> > + >> > +# Drop unreachable nodes and any container nodes that end up empty. >> > +prune($dom); >> > +my $out =3D $dom->toString(); >> > +# Remove blank line debris. >> > +$out =3D~ s;^ *\n;;mg; >> > +# Remove the XML declaration as abidiff doesn't like it. >> > +$out =3D~ s;^<\?xml .*\?>\n;;m; >> > +print $out; >> > -- >> > 2.31.0.rc2.261.g7f71774620-goog >> > > > >