public inbox for libabigail@sourceware.org
* Thoughts on ABI XML type ids
@ 2020-06-09 11:34 Giuliano Procida
From: Giuliano Procida @ 2020-06-09 11:34 UTC (permalink / raw)
  To: libabigail; +Cc: Matthias Männich, Dodji Seketeli, Mark J. Wielaard

Hi all.

Mark posted [https://sourceware.org/pipermail/libabigail/2020q2/001973.html]
a patch for review that aimed to change the type ids used in ABI XML
to something

* more stable
* more humanly comprehensible.

Unstable type ids mean that diffing (large) ABI XML files is often not
useful, as the relevant part of the diff is drowned out by thousands of
changes caused by type id renumbering. If the XML is stored in a VCS
this has a non-trivial impact on storage. If the diffs go via mail or
are held in a web-based review system, these also become unwieldy.

libabigail only outputs full type names (in XML comments) with the
--annotate option. These names are not part of the ABI.

There are some obvious candidates for canonical type names:

* C++ mangled names
[https://itanium-cxx-abi.github.io/cxx-abi/abi.html#mangling]
* libabigail's "internal" type names [get_cached_pretty_representation(true)]

The latter should be close to c++filt of the former.

Canonical here means recognising struct foo and class foo as the same
thing, standardising on foo(void) vs foo() and ignoring irrelevancies
such as parameter names in function types etc.
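
To make the idea concrete, here is a minimal sketch of that kind of
canonicalisation, working purely on name strings. It is only an
illustration: libabigail would do this on its IR rather than with
regular expressions, the function name canonicalise is hypothetical,
and the choice of foo() over foo(void) is an assumption. Parameter
names are not handled here.

```python
import re

def canonicalise(name: str) -> str:
    """Normalise a type name string (illustrative only)."""
    # Treat "struct foo" and "class foo" as the same thing: plain "foo".
    name = re.sub(r"\b(?:struct|class)\s+", "", name)
    # Standardise "foo(void)" as "foo()"; "(void*)" is left untouched.
    name = re.sub(r"\(\s*void\s*\)", "()", name)
    return name

print(canonicalise("struct foo"))
print(canonicalise("int bar(void)"))
```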

libabigail doesn't do quite as well as compilers with anonymous type
naming. GCC creates unique mangled names for anonymous types (as
described in the link) and libabigail could do something analogous to
this. [_ZZZ1giEN1S1fE_2iEUt1_ -> g(int)::S::f(int)::{unnamed type#3}]

DWARF information doesn't contain mangled type names, except
incompletely as part of function symbol linkage names. Implementing
get_mangled_name() would be a very significant undertaking with the
dubious benefit of allowing the XML or XML diffs to be post-processed
with c++filt to produce something close or identical to libabigail's
existing pretty representations of types. One real benefit of mangled
names is that they don't require escaping for XML. Mangled type names
can be arbitrarily long but should be perfectly stable if implemented
perfectly.

C++ type names include XML special characters (<) and will need to be
escaped or neutered (such as by replacing "<>" with "{}") for
inclusion in XML attributes. Apart from this issue, they are the most
human-readable. If the pretty representation changes between
libabigail versions, so will the type ids.
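
The two options (escaping versus neutering) could look roughly like
this. This is a sketch, not libabigail code: the helper names
escape_type_name and neuter_type_name are hypothetical, and the
escaping is delegated to Python's standard xml.sax.saxutils.escape.

```python
from xml.sax.saxutils import escape

def escape_type_name(name: str) -> str:
    """Escape XML special characters for use in an attribute value."""
    # escape() handles &, < and > by default; the extra entry covers
    # double quotes, which delimit the attribute.
    return escape(name, {'"': "&quot;"})

def neuter_type_name(name: str) -> str:
    """Replace angle brackets with braces, as suggested above."""
    return name.replace("<", "{").replace(">", "}")

name = "std::vector<int, std::allocator<int> >"
print(escape_type_name(name))
print(neuter_type_name(name))
```

Note that neutering the angle brackets alone is not enough for
reference types, since "&" is also special in XML and would still
need escaping.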

The most compact stable representation of a type is likely to be a
hash (prefix) of some function of the type. Existing hashing within
libabigail uses addresses so is not stable. Hashing could be
implemented recursively over the type IR (a moderate to large
undertaking) or could be done by hashing the internal type name. If
the chosen function or hashing changes between libabigail versions,
all type ids will change.
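
The cheaper of the two options, hashing the internal type name, can
be sketched as follows. The function name type_id_from_name, the use
of SHA-256 and the default prefix length of 8 hex digits are all
assumptions made for illustration; any hash that depends only on the
name bytes would give the same stability property.

```python
import hashlib

def type_id_from_name(pretty_name: str, prefix_len: int = 8) -> str:
    """Derive a compact id from a hash prefix of the type's name."""
    # The digest depends only on the bytes of the name, never on
    # addresses, so it is the same on every run.
    digest = hashlib.sha256(pretty_name.encode("utf-8")).hexdigest()
    return digest[:prefix_len]

print(type_id_from_name("unsigned int"))
print(type_id_from_name("g(int)::S::f(int)::{unnamed type#3}"))
```

A short prefix keeps the XML small but makes collisions possible, so
a uniqueness check is still needed on top of this.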

Whatever naming mechanisms are chosen must respect libabigail's notion
of type identity. They must be made injective over the types in a
corpus. The only guaranteed way to achieve this is with a secondary
check for collision and disambiguation. This can be achieved with an
extra map within the XML writer. Mark's patch re-uses the string
intern store which may not be as reliable or may be just fine.

Disambiguation (while still trying to maintain stability) can be
achieved in various ways, all of which may require multiple probes in
the case that the types have seemingly identical names.

* add/increment a unique suffix
* increment the hash value itself
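
A minimal sketch of the first option, using an extra map in the
writer to detect collisions and probe with a numeric suffix. The
class name TypeIdAllocator and the "-N" suffix format are
hypothetical, and type_key stands in for whatever libabigail uses to
decide type identity (here, any hashable value).

```python
class TypeIdAllocator:
    """Hand out unique ids, probing with a suffix on collision."""

    def __init__(self):
        self._id_to_type = {}  # id -> type that owns it
        self._type_to_id = {}  # type -> id already assigned

    def id_for(self, type_key, candidate: str) -> str:
        # A type that has already been seen keeps its id.
        if type_key in self._type_to_id:
            return self._type_to_id[type_key]
        type_id = candidate
        suffix = 0
        # Probe until an id not owned by another type is found.
        while type_id in self._id_to_type:
            suffix += 1
            type_id = f"{candidate}-{suffix}"
        self._id_to_type[type_id] = type_key
        self._type_to_id[type_key] = type_id
        return type_id

alloc = TypeIdAllocator()
print(alloc.id_for("type A", "foo"))
print(alloc.id_for("type B", "foo"))
```

With this scheme the suffix a colliding type receives depends on the
order in which types are encountered, so ids for such types are only
as stable as the traversal order.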

*Proposals*

We should investigate various options for stable type ids. We
shouldn't preclude multiple implementations for different use-cases,
so long as uniqueness can be guaranteed via a disambiguation step.

Mark's existing proposal needs some further work for anonymous type
ids. It may be worth investing time into giving libabigail anonymous
type naming logic similar to that used by compilers as this could have
other benefits.

Matthias and I would like to see something compact, fast and maximally
stable. We are dealing with very large ABIs which are probably not
libabigail's typical use-case.

If get_cached_pretty_representation is already being called, then
using this is free apart from the IO and space costs, with a very
simple transformation for use in XML. We can also look at using it in
combination with a fast hash to reduce the size blow-up (and reading
cost).

If it's not being called and is expensive, then it may be worth
investing time in a stable hash over the type IR instead.

Comments welcome!

Regards,
Giuliano.
