public inbox for libabigail@sourceware.org
 help / color / mirror / Atom feed
From: Matthias Maennich <maennich@google.com>
To: Mark Wielaard <mark@klomp.org>
Cc: Dodji Seketeli <dodji@seketeli.org>, libabigail@sourceware.org
Subject: Re: [PATCH 1/4] Add named-types-ids to use name ids after the type name instead of numbers.
Date: Fri, 24 Apr 2020 19:11:55 +0200	[thread overview]
Message-ID: <20200424171155.GG80434@google.com> (raw)
In-Reply-To: <cea8dd80ef9b566aa78d919047abeafdc75de21c.camel@klomp.org>

Hi Mark!

On Fri, Apr 24, 2020 at 05:26:24PM +0200, Mark Wielaard wrote:
>Hi Matthias,
>
>On Fri, 2020-04-24 at 14:57 +0200, Matthias Maennich wrote:
>> On Thu, Apr 23, 2020 at 07:48:12PM +0200, Dodji Seketeli wrote:
>> > "Mark J. Wielaard" <mark@klomp.org> a écrit:
>> > [...]
>> > diff --git a/src/abg-libxml-utils.cc b/src/abg-libxml-utils.cc
>> > index 2c46aad8..7370afd5 100644
>> > --- a/src/abg-libxml-utils.cc
>> > +++ b/src/abg-libxml-utils.cc
>> > @@ -249,6 +249,65 @@ escape_xml_string(const std::string& str)
>> >   return result;
>> > }
>> >
>> > +/// Replace the various special characters in a type string as used in
>> > +/// a named type-id attribute.
>> > +///
>> > +/// The characters '<', '>', ''', '"' and ' ' are all replaced by '_'.
>> > +/// The characters '&' and '*' at the end of a string are simply dropped,
>> > +/// otherwise they are also replaced by an '_'.
>> > +///
>> > +/// The result is not reversible.
>> > +///
>> > +//// @param str the input string to read to search for the characters
>> > +//// to replace.
>> > +////
>> > +//// @param replaced the output string where to write the resulting
>> > +//// string that contains the replaced characters,
>> > +void
>> > +escape_xml_string_as_named_type_id(const std::string& str,
>> > +				   std::string& replaced)
>> > +{
>> > +  for (std::string::const_iterator i = str.begin(); i != str.end(); ++i)
>> > +    switch (*i)
>> > +      {
>> > +      case ' ':
>> > +      case '<':
>> > +      case '>':
>> > +      case '"':
>> > +      case '\'':
>> > +	replaced += '_';
>> > +	break;
>> > +      case '&':
>> > +      case '*':
>>
>> Would this not make references types and pointer types
>> indistinguishable?
>
>Yes, which is why in the original patch prefix the named-type-id with
>the type-kind.
>
>> Consider
>>
>>    | template<typename T>
>>    | class Test{};
>>    |
>>    | Test<int&> a;
>>    | Test<int*> b;
>
>So "int&" would come out as "ref-int" and "int*" would come out as
>"ptr-int".
>
>> While we add underscores until we have a unique match, that will later
>>
>> Also, why would we need to replace those characters at all? Except the
>> quote itself, we should be fine?! What am I missing? Can we reuse the
>> mangled name? Can replace less of those characters?
>
>As far as I understand '<' and '&' always need to be escaped in XML
>attributes. '>' doesn't, but it is really confusing to replace only '<'
>and not '>', so to "balance it out" I replaced them both with the same
>char.
>
>> In the textual representation those are replaced differently:
>>
>>    name='Test&lt;int&amp;&gt;'
>>    name='Test&lt;int*&gt;'
>
>I found that to be really hard to read.
>I admit that my tests have all be plain C, for which the replacement
>turns out very nicely. I'll try some C++ code to see if I can make the
>named-type-ids easier to read. Suggestions welcome. But maybe --named-
>type-ids only really makes sense for plain C libraries...?

When I was thinking of addressing this myself, I had in mind to do this
based on the type name and the type dependency graph.

The typename can maybe be mangled like C++. That should give a nice
string suitable to put into the xml attribute. Even for C. (And c++filt
could maybe make it readable if it needs to be). The format is well
thought through, battle tested and we only need underscores if the
typenames are exactly the same.

The above types Test<int&> and Test<int*> become 4TestIRiE and
4TestIPiE and  `cat abi.xml | c++filt -t` makes them human readable in
the representation file. (https://godbolt.org/z/WyA4xi)

Sorting those types topological and alphabetical and merging translation
units (as you did in patch 4/4) should yield a stable xml even across
larger refactorings.

How do you like this approach. Not necessarily the sorting, but the
encoding?

>
>> > +	if (i + 1 != str.end())
>> > +	  replaced += '_';
>> > +	break;
>> > +      default:
>> > +	replaced += *i;
>> > +      }
>> > +}
>> > +
>> > +/// Replace the various special characters in a type string as
>> > used in
>> > +/// a type-id attribute.
>> > +///
>> > +/// The characters '<', '>', ''', '"' are all replaced by '_'.
>> > +/// The character '&' is replaced by the string "-ref".
>>
>> Is this really what is happening? I seem to miss the part where & is
>> transformed into '-ref'.
>
>Bad comment, the same function right above has the correct comment.
>Instead of prefixes I originally used postfix -ref and -ptr, but that
>didn't distinguish other types, which don't have special characters in
>their names (like typedefs, which are often named after the underlying
>type). I'll fix up the comment.
>
>> > diff --git a/src/abg-writer.cc b/src/abg-writer.cc
>> > index 74fa1a8c..5f85810b 100644
>> > --- a/src/abg-writer.cc
>> > +++ b/src/abg-writer.cc
>> > @@ -127,6 +127,29 @@ public:
>> >     ABG_ASSERT(env);
>> >     return env->intern(o.str());
>> >   }
>> > +
>> > +  /// Return a unique string representing a type representation.  The
>> > +  /// returned string will be made unique by postfixing underscores if
>> > +  /// necessary.
>> > +  ///
>> > +  /// @param type to create an unique id string for
>> > +  interned_string
>> > +  get_named_type_id(const type_base* type) const
>> > +  {
>> > +    string named_type_id =
>> > +      xml::escape_xml_string_as_named_type_id(get_pretty_representation(type,
>> > +									/*internal=*/true));
>> > +
>> > +    /* We want to make sure the id is unique. See if it is already
>> > +       interned in this environment, if it is, it isn't unique and we
>> > +       add some underscores to it till it is.  */
>> > +    const environment* env = get_environment();
>> > +    ABG_ASSERT(env);
>> > +    while (env->is_interned_string(named_type_id))
>> > +      named_type_id = named_type_id + "_";
>>
>> Postfixing the underscore has the advantage that this is not
>> reproducibly producing the same output for the same input. That would
>> defeat (a bit) the purpose of that patch to stabilize the output.
>
>I am afraid I don't fully understand what you are suggesting here.

Sorry, that was a very confusing sentence (not sure who wrote that :-)).
What I meant to say is: If we have similar types (like the c++ example
above) that will convert into the same underscore representation and we
need add underscores to disambiguate, then we depend on the order we see
the translation units for the output format. If think I would like to
eliminate the underscore suffixing down to only the cases where the type
name is exactly the same.

>
>There is an issue with anonymous types. Since they have the same names.
>I don't have a good solution for that, but think it might be possible
>to disambiguate them in most cases.
>
>Another issue is using the exact same type name defined in different
>translation units that are not the same type. That issue violates ODR
>for C++, but can happen for C code. If it happens, it does depend on
>the order we see the translation unit. But in the cases I am interested
>in you would use --named-type-ids together with --drop-private-types.
>Which makes it more likely that such internal same name, but different
>type, issues pop up.

I think --named-type-ids is useful for everyone that puts the xml
description in the sources, updates it regularly and wants small human
readable diffs when doing so. I would say this is independent of the
language in use, but I agree that this is much less of a problem for C.

Cheers,
Matthias

>
>Cheers,
>
>Mark

  reply	other threads:[~2020-04-24 17:11 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-21 12:28 More simplifications of abi XML file Mark J. Wielaard
2020-04-21 12:28 ` [PATCH 1/4] Add named-types-ids to use name ids after the type name instead of numbers Mark J. Wielaard
2020-04-23 17:48   ` Dodji Seketeli
2020-04-24 12:57     ` Matthias Maennich
2020-04-24 15:26       ` Mark Wielaard
2020-04-24 17:11         ` Matthias Maennich [this message]
2020-04-24 14:26     ` Mark Wielaard
2020-04-21 12:28 ` [PATCH 2/4] Add no-parameter-names to drop function parameter names Mark J. Wielaard
2020-04-24 13:35   ` Dodji Seketeli
2020-04-21 12:28 ` [PATCH 3/4] Add --no-elf-needed option to drop DT_NEEDED list from corpus Mark J. Wielaard
2020-04-24 14:17   ` Dodji Seketeli
2020-04-21 12:28 ` [PATCH 4/4] Add --merge-translation-units option Mark J. Wielaard
2020-05-07 12:27   ` Matthias Maennich

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200424171155.GG80434@google.com \
    --to=maennich@google.com \
    --cc=dodji@seketeli.org \
    --cc=libabigail@sourceware.org \
    --cc=mark@klomp.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).