From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by sourceware.org (Postfix) with ESMTP id 840F7386F035 for ; Fri, 12 Mar 2021 18:41:36 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 840F7386F035 Received: from mail-qt1-f200.google.com (mail-qt1-f200.google.com [209.85.160.200]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-162-r5uRoE8BPvqfsbjVGBIsXw-1; Fri, 12 Mar 2021 13:41:31 -0500 X-MC-Unique: r5uRoE8BPvqfsbjVGBIsXw-1 Received: by mail-qt1-f200.google.com with SMTP id w2so12684258qts.18 for ; Fri, 12 Mar 2021 10:41:31 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:message-id:mime-version:subject:date :in-reply-to:cc:to:references; bh=wF+M16/Bal4kQOYwlTZ/FmhdUCapGqUoLmA6uPih0WY=; b=AvFRgV+yZDwjIqET3PdmAU2VvnUwZKjYwSgu7mwZNltJ8a4ySc5cmJOD/5W4YD+xKL m/SNtLt2L7kQsq9CUzHTZq4Tdi5UB0zjkd+6jA7mTTuFzMslMYJ1SZYE/cOmNae+0O7e Zr8Fx0NgOHUnO0UmROVx4nhYpNzn/TIp2WJ3UOIqfZnlVCQilz0r071/zUXSQx/AoDRq RHWR4SkxPbM6XJ6hyCvLRtN0PvRFGXzzXFjSTXqfO5Fz5M1RDpyR8Z/W4zQ8zLe51RH1 UcsGH9AgYfmAPXETM7d33C1pLHO/MQNQNDyTRUA3njoLoX6b/kD9teou5vOT+qQSno9D ZWmQ== X-Gm-Message-State: AOAM5317vrXwnxyb5+wRN+lSr7lW9grpspEyE0pzjJDemIJ1cwVZuWpG HxNi6v8Lg5OWStXUkDjZOkC0n23p4EQrMxLzGZXr/7rRIPasHcSwIr+yVWhCrdQVMrWCDZK7zL5 /MZXMmNz7cF2o+0bp1wM8 X-Received: by 2002:ac8:b4b:: with SMTP id m11mr13459101qti.254.1615574490501; Fri, 12 Mar 2021 10:41:30 -0800 (PST) X-Google-Smtp-Source: ABdhPJz2nlxWN5x2DjUZ7gEXPAqRyp9F/yue7ihfh4L8TVtqHjkNxmePRDSAqeMNtEoWuVrjDuJ97g== X-Received: by 2002:ac8:b4b:: with SMTP id m11mr13459059qti.254.1615574489965; Fri, 12 Mar 2021 10:41:29 -0800 (PST) Received: from [192.168.1.234] (47-208-193-143.trckcmtc01.res.dyn.suddenlink.net. [47.208.193.143]) by smtp.gmail.com with ESMTPSA id m21sm4786284qka.28.2021.03.12.10.41.27 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Fri, 12 Mar 2021 10:41:29 -0800 (PST) From: Ben Woodard Message-Id: <0716D5B1-4285-4C0E-9231-14663D092188@redhat.com> Mime-Version: 1.0 (Mac OS X Mail 14.0 \(3654.60.0.2.21\)) Subject: Re: [RFC PATCH] Add an experimental ABI pruning utility. Date: Fri, 12 Mar 2021 10:41:25 -0800 In-Reply-To: Cc: Giuliano Procida via Libabigail , Matthias Maennich , kernel-team@android.com To: Giuliano Procida References: <20210311115340.3033687-1-gprocida@google.com> X-Mailer: Apple Mail (2.3654.60.0.2.21) X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com X-Spam-Status: No, score=-12.5 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, HTML_MESSAGE, RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H4, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.29 X-BeenThere: libabigail@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Mailing list of the Libabigail project List-Unsubscribe: , List-Archive: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 12 Mar 2021 18:41:40 -0000 if you look at abicompat there are functions like:=20 perform_compat_check_in_*_mode(options& opts, diff_context_sptr& ctxt, corpus_sptr app_corpus, corpus_sptr lib1_corpus, corpus_sptr lib2_corpus) which ultimately does something like your perl script=E2=80=99s ELF reachab= ility check when computing compatibility between libraries. It iterates thr= ough the undefined symbols and then calls get_sym_ids_of_*_to_keep().=20 I would think that you would be able to essentially do the same thing insid= e of abidw to have it prune you abixml file in a similar way to how your pe= rl script does.=20 (Paranthetical Note: one thing that really bothers me about those functions= which I hope to work on as soon as I get a chance is I believe that they n= eed to take const corpus_sptr=E2=80=99s for the app and lib1 lib2 corpuses.= Of course this means that the data structures would have to change and the= sym_ids_of_*_to_keep() would need to be some sort of external data structu= re rather than something inside the corpus itself, I=E2=80=99ve been callin= g this a subcorpus as I=E2=80=99ve been thinking about how to make this cha= nge happen.=20 The problem with the way that it is now is when using libabigail as a libra= ry, rather than as it is used in a tool like abicompat you want to be able = to load a collection of corpora which takes non-trivial time in many cases = and use them over and over. For example say you want to load the corpora of= many variations of libraries and find the ones that are compatible with ea= ch other. You are going to want to use the app corpus over and over without= having to reload it from the ELF or abixml each and every time. Yeah as a = first pass you could clear the vectors with the *_to_keep but then you woul= d only be able to use the corpus in one thread at a time. I think it would = make better library design to move those things to the side.) > On Mar 12, 2021, at 8:51 AM, Giuliano Procida wrote= : >=20 > Hi. >=20 > On Thu, 11 Mar 2021 at 20:39, Ben Woodard > wrote: > I think that you have pointed out a problem that I=E2=80=99ve been discus= sing with some users that I=E2=80=99m working with. The sheer volume of the= abixml output is pretty extreme. I don=E2=80=99t think that they were real= ly expecting it. However, they are not running into the particular situatio= n that you are seeing since they don=E2=80=99t have the coding model of the= problem library. They are more likely to fall in the 5-10% range that you = reported above.=20 >=20 > We have been talking about it in terms of =E2=80=9CABI Surface=E2=80=9D v= s. =E2=80=9CABI Corpus=E2=80=9D. The ELF reachability aspect that you point= out seems to be part of the difference between those two concepts. >=20 >=20 > We want a distilled description of the ABI (the surface), not what may li= e on the other side. More precisely, we want a typed interface where "type"= encompasses extra things like architecture, calling convention, ISA, ELF s= ymbol properties etc. >=20 > One worry is that just chasing references from the exported ELF symbols m= ay somehow miss something important, say for libraries that ship with bits = of implementation in header files. Right now given the implementation above, it looks like ELF reachability is= the only thing that abicompat takes into consideration. What you are sayin= g potentially has important consequences because it could suggest that abic= ompat may be missing things. Do you know of or could you construct an example that demonstrates a case t= hat would be missed by this ELF reachability? It may mean that we need to l= ook at ways to enhance the checks in abicompat and in doing so they could a= lso be part of the code that you use to prune the output of abidw. The thing that pops into my mind would be inline functions defined in the h= eader files. We may be able to find those by searching through the TUs but = then there are things like cpp macros even with -g3 I=E2=80=99m not sure we= could find those but I=E2=80=99m not entirely sure we would need to. I really do believe that this topic needs additional investigation. -ben > =20 > > On Mar 11, 2021, at 3:53 AM, Giuliano Procida via Libabigail > wrote: > >=20 > > In response to internal reports about very large ABIs for libraires > > implemented in C++ but exposing only a C API, I wrote a script as a > > means of investigating the issue. > >=20 > > The script aims to remove parts of the XML that are irrelevant to > > abidiff. It has been run on the current libabigail tests as well as > > some locally-generated ABI files. > >=20 > > On most XML files, this results in a modest saving (5-10%). However, > > for the library that sparked the invesigaton, the resulting XML was > > more than 2300 times smaller. > >=20 > > The script may have some merit on its own but perhaps should be > > implemented differently. > >=20 > > Question: How hard would it be to implement this functionality (under > > flag control) within libabigail itself? >=20 > I can=E2=80=99t directly answer =E2=80=9Chow hard=E2=80=9D that is one of= the things which we would discover by actually doing it. As for should it = be done. I think it should be carefully looked into. However, as for priori= ty since it is an optimization rather than a functional problem, if it were= me, I would file this as a bug and put examples of the test data and resul= ts you are seeing as well as the script below in that bz and then start wor= king on the process of refining libabigail=E2=80=99s output so that it matc= hes the results of your script. >=20 >=20 > It's an optimisation analogous to DCE in a compiler. I'll certainly follo= w up with a bug. It's one library in particular that is ~50M before pruning= and ~22k after. >=20 > I've already made changes to the script so that it can be given a list of= symbols and produce a reduced ABI. At the moment, we achieve this by passi= ng a list to abidw (with -w). In fact, we extract full and reduced ABIs con= currently in the Android kernel build as each run takes a while. It should = be straightforward to reimplement the script in C++, though it's not too sl= ow already. >=20 > In terms of implementation, it would need to work with libabigail IR and = ir_node_visitor might be the right place to start. >=20 > Giuliano. > =20 > -ben >=20 > >=20 > > A couple of tangential things worth mentioning (see the comments at > > the very beginning and very end of the script): > >=20 > > - abidiff rejects XML files with an XML declaration at the top > > - abidiff rejects completely empty files > >=20 > > Resolving the first would make the second moot. > >=20 > > * scripts/abiprune.pl : Add script to prune = ABI XML. > >=20 > > Signed-off-by: Giuliano Procida > > > --- > > scripts/abiprune.pl | 323 +++++++++++++++++++++++= +++++++++++++++++++++ > > 1 file changed, 323 insertions(+) > > create mode 100755 scripts/abiprune.pl > >=20 > > diff --git a/scripts/abiprune.pl b/scripts/abipru= ne.pl > > new file mode 100755 > > index 00000000..8e04a273 > > --- /dev/null > > +++ b/scripts/abiprune.pl > > @@ -0,0 +1,323 @@ > > +#!/usr/bin/perl > > + > > +# This script is intended to consume libabigail ABI XML as generated > > +# by abidw and produce a possibly smaller representation that captures > > +# the same ABI. In particular, the output should be such that abidiff > > +# --harmless reports no differences (or is empty). > > +# > > +# It does this by modelling symbols and types dependencies as a > > +# directed graph, marking nodes reachable from the ELF symbols and > > +# dropping unreachable nodes. > > + > > +use strict; > > +use warnings; > > +no warnings 'recursion'; > > + > > +use autodie; > > + > > +use XML::LibXML; > > + > > +# Making a graph from ABI XML. > > +# > > +# The following are the types of "node" we care about. The "edges" are > > +# obtained from a few XML attributes as well as via XML element > > +# containment. > > +# > > +# ELF (exported) symbols > > +# > > +# elf-symbol (has a name; the code here currently knows nothing > > +# about aliases) > > +# > > +# Declarations (that mention a symbol) > > +# > > +# These live in a scope. In C++ scopes can be nested and include > > +# namespaces and class types. > > +# > > +# var-decl (also used for member variables) > > +# elf-symbol linked to symbol via mangled-name > > +# type-id links to a type > > +# function-decl (also used for member functions) > > +# elf-symbol linked to symbol via mangled-name > > +# parameter and return type-ids link to types > > +# (alas, not just a link to a function type) > > +# > > +# Types > > +# > > +# These occupy pretty much all the other elements, besides those > > +# that act as simple containers. > > +# > > +# XML elements and their roles in more detail > > +# > > +# ELF > > +# > > +# elf-needed - container > > +# dependency - names a library > > +# elf-function-symbols - contains a list of symbols > > +# elf-variable-symbols - contains a list of symbols > > +# elf-symbol - describes an ELF variable or function > > +# > > +# Grouping and scoping > > +# > > +# abi-corpus-group > > +# abi-corpus > > +# abi-instr - compilation unit containers > > +# namespace-decl - pure container, possibly named > > +# > > +# Types (some introduce scopes, only in C++) > > +# > > +# type-decl - defines a primitive type > > +# typedef-decl - defines a type, links to a type > > +# qualified-type-def - defines a type, links to a type > > +# pointer-type-def - defines a type, links to a type > > +# reference-type-def - defines a type, links to a type > > +# array-type-def - defines a (multidimensional array) type, refers to = element type, contains subranges > > +# subrange - contains array length, refers to element type; defines t= ypes (never referred to; duplicated) > > +# function-type - defines a type > > +# parameter - belongs to function-type and -decl, links to a type > > +# return - belongs to function-type and -decl, links to a type > > +# enum-decl - defines a type, names it, contains a list of enumerators= and an underlying-type > > +# underlying-type - belongs to enum-decl > > +# enumerator - belongs to enum-decl > > +# union-decl - defines and names a type, contains member elements link= ed to other things > > +# class-decl - defines and names a type, contains base type, member el= ements linking to other things > > +# base-class - belongs to class-decl > > +# data-member - container for a member; holds access level > > +# member-function - container for a member; holds access level > > +# member-type - container for a type declaration; holds access level > > +# member-template - container for a (function?) template declaration;= holds access level > > +# > > +# Higher order Things > > +# > > +# class-template-decl - defines a "type", but without instantiation th= is isn't usable > > +# function-template-decl - defines a "type", but without instantiation= this isn't usable > > +# template-type-parameter - defines a type (formal variable), perhaps= one which should be excluded from the real type graph > > +# template-non-type-parameter - names a template parameter, links to = a type > > +# template-parameter-type-composition - container? > > +# > > +# Values > > +# > > +# function-decl - names a function, can link to a symbol > > +# has same children as function-type, rather than linking to a typ= e > > +# var-decl - names a variable, can link to a symbol, links to a type > > + > > +# Graph nodes. > > +my %symbols; > > +my %exported_symbols; > > +my %types; > > +# Graph edges. > > +my %type_type_edges; > > +my %symbol_type_edges; > > +my %type_symbol_edges; > > + > > +# Keep track of the types being defined. > > +my @id_stack; > > +# Keep track of the current (value) declaration. > > +my @symbol_stack; > > +sub process($); > > +sub process($) { > > + my ($node) =3D @_; > > + > > + # The XML attributes we care about. > > + my $name; > > + my $id; > > + my $type_id; > > + my $symbol; > > + my $naming_typedef_id; > > + > > + # Not every node we encounter is an XML element. > > + if ($node->nodeType =3D=3D XML_ELEMENT_NODE) { > > + $name =3D $node->getAttribute('name'); > > + $id =3D $node->getAttribute('id'); > > + $type_id =3D $node->getAttribute('type-id'); > > + $symbol =3D $node->getAttribute('mangled-name'); > > + $naming_typedef_id =3D $node->getAttribute('naming-typedef-id'); > > + die if defined $id && defined $symbol; > > + } > > + > > + if (defined $name && $node->getName() eq 'elf-symbol') { > > + $exported_symbols{$name} =3D undef; > > + # Early return is safe, but not necessary. > > + return; > > + } > > + > > + if (defined $id) { > > + # This element defines a type (but there may be more than one > > + # defining the same type - we cannot rely on uniqueness). > > + $types{$id} =3D undef; > > + if (defined $naming_typedef_id) { > > + # This is an odd one, there can be a backwards link from an > > + # anonymous type to the typedef that refers to it, so we need to > > + # pull in the typedef, even if nothing else refers to it. > > + $type_type_edges{$id}{$naming_typedef_id}{naming} =3D undef; > > + } > > + if (@id_stack) { > > + # Type parent<->child dependencies; record dependencies both > > + # ways to avoid making holes in the XML types. > > + $type_type_edges{$id_stack[-1]}{$id}{contains} =3D undef; > > + $type_type_edges{$id}{$id_stack[-1]}{in} =3D undef; > > + } > > + push @id_stack, $id; > > + } > > + > > + if (defined $symbol) { > > + # This element is a declaration linked to a symbol (whether or not > > + # exported). > > + $symbols{$symbol} =3D undef; > > + if (@id_stack) { > > + # The enclosing scope may include a C++ class, in which case we > > + # need to record the dependency. > > + $type_symbol_edges{$id_stack[-1]}{$symbol} =3D undef; > > + } > > + # The symbol depends on the types mentioned in this element, so > > + # record it. > > + push @symbol_stack, $symbol; > > + die "nested symbols: @symbol_stack\n" if scalar @symbol_stack > 1; > > + } > > + > > + if (defined $type_id) { > > + if (@id_stack) { > > + # The enclosing type refers to this type, record the dependency. > > + $type_type_edges{$id_stack[-1]}{$type_id}{depends} =3D undef; > > + } > > + if (@symbol_stack) { > > + # The enclosing declaration (linked to a symbol) refers to this > > + # type, record the dependency. > > + $symbol_type_edges{$symbol_stack[-1]}{$type_id} =3D undef; > > + } > > + } > > + > > + for my $child ($node->nonBlankChildNodes()) { > > + process($child); > > + } > > + > > + if (defined $symbol) { > > + pop @symbol_stack; > > + } > > + if (defined $id) { > > + pop @id_stack; > > + } > > +} > > + > > +# Load the XML. > > +my $dom =3D XML::LibXML->load_xml(IO =3D> \*STDIN); > > +# Build a graph. > > +process($dom); > > +die if @id_stack or @symbol_stack; > > + > > +# DFS visited state. > > +my %seen_types; > > +my %seen_symbols; > > +sub dfs_type($); > > +sub dfs_symbol($); > > + > > +sub dfs_type($) { > > + my ($type) =3D @_; > > + return if exists $seen_types{$type}; > > + $seen_types{$type} =3D undef; > > + > > + my $types =3D $type_type_edges{$type}; > > + if (defined $types) { > > + for my $type (keys %$types) { > > + for my $link (keys %{$types->{$type}}) { > > + if (exists $types->{$type}{$link}) { > > + dfs_type($type); > > + } > > + } > > + } > > + } > > + my $symbols =3D $type_symbol_edges{$type}; > > + if (defined $symbols) { > > + for my $symbol (keys %$symbols) { > > + dfs_symbol($symbol); > > + } > > + } > > +} > > + > > +sub dfs_symbol($) { > > + my ($symbol) =3D @_; > > + return if exists $seen_symbols{$symbol}; > > + $seen_symbols{$symbol} =3D undef; > > + > > + my $types =3D $symbol_type_edges{$symbol}; > > + if (defined $types) { > > + for my $type (keys %$types) { > > + dfs_type($type); > > + } > > + } else { > > + warn "no declaration found for exported symbol $symbol\n"; > > + } > > +} > > + > > +# Traverse the graph, using the exported symbols as the roots. > > +for my $symbol (keys %exported_symbols) { > > + dfs_symbol($symbol); > > +} > > + > > +# Useful counts. > > +sub print_report() { > > + my $count_symbols =3D scalar keys %symbols; > > + my $count_exported_symbols =3D scalar keys %exported_symbols; > > + my $count_types =3D scalar keys %types; > > + my $count_reachable_types =3D scalar keys %seen_types; > > + > > + print STDERR qq{symbols =3D $count_symbols > > +exported symbols =3D $count_exported_symbols > > +types =3D $count_types > > +reachable types =3D $count_reachable_types > > +}; > > +} > > + > > +my %drop_if_empty =3D map { $_ =3D> undef } qw( > > + namespace-decl > > + abi-instr > > + abi-corpus > > + abi-corpus-group > > +); > > + > > +# This is another DFS traversal of the XML. > > +sub prune($); > > +sub prune($) { > > + my ($node) =3D @_; > > + > > + my $name =3D $node->getName(); > > + my $id; > > + my $symbol; > > + > > + if ($node->nodeType =3D=3D XML_ELEMENT_NODE) { > > + $id =3D $node->getAttribute('id'); > > + $symbol =3D $node->getAttribute('mangled-name'); > > + die if defined $id && defined $symbol; > > + } > > + > > + # Return if we know that this is a type or declaration to keep or > > + # drop in its entirety. > > + if (defined $id) { > > + return !exists $seen_types{$id}; > > + } > > + if ($name eq 'var-decl' || $name eq 'function-decl') { > > + return !defined $symbol || !exists $exported_symbols{$symbol}; > > + } > > + > > + # Otherwise, this is not a type, declaration or part thereof, so > > + # process child elements. > > + my $empty =3D 1; > > + for my $child ($node->nonBlankChildNodes()) { > > + if (prune($child)) { > > + $node->removeChild($child); > > + } else { > > + $empty =3D 0; > > + } > > + } > > + > > + # Empty container elements can be dropped. > > + return $empty && exists $drop_if_empty{$name}; > > +} > > + > > +# Drop unreachable nodes and any container nodes that end up empty. > > +prune($dom); > > +my $out =3D $dom->toString(); > > +# Remove blank line debris. > > +$out =3D~ s;^ *\n;;mg; > > +# Remove the XML declaration as abidiff doesn't like it. > > +$out =3D~ s;^<\?xml .*\?>\n;;m; > > +print $out; > > --=20 > > 2.31.0.rc2.261.g7f71774620-goog > >=20