public inbox for libabigail@sourceware.org
 help / color / mirror / Atom feed
From: Matthias Maennich <maennich@google.com>
To: Dodji Seketeli <dodji@seketeli.org>
Cc: "Mark J. Wielaard" <mark@klomp.org>, libabigail@sourceware.org
Subject: Re: [PATCH 1/4] Add named-types-ids to use name ids after the type name instead of numbers.
Date: Fri, 24 Apr 2020 14:57:00 +0200	[thread overview]
Message-ID: <20200424125700.GF80434@google.com> (raw)
In-Reply-To: <87v9lqayar.fsf@seketeli.org>

Hi!

Thanks both for working on that!

On Thu, Apr 23, 2020 at 07:48:12PM +0200, Dodji Seketeli wrote:
>Hello Mark,
>
>First of all, thanks for working on this, it's really appreciated.
>
>"Mark J. Wielaard" <mark@klomp.org> a écrit:
>
>> To make the XML output more readable and a bit more stable generate
>> type ids based on the underlying type name. When types are inserted,
>> removed or rearranged the type ids will (mostly) be the same so that
>> references can stay the same. This also makes it easier to read and
>> compare the XML corpus representation.
>
>Note that abidw has the option --annotate that makes it easy to read the
>abixml file, for cases where people want to read it.
>
>Nevertheless, this new --named-type-ids is definitely a welcomed improvement for the
>stability of abixml files.
>
>I made some light changes to this patch and so I have attached the
>resulting patch to this one, so that you can test it in your environment
>and see if it still suits you.
>
>I am thinking that we should add some basic testing for this option as
>well, I am about to modify test-read-dwarf.cc to make it support this.
>I haven't done it yet as I didn't want to delay this review any further.
>
>[...]
>
>> diff --git a/src/abg-ir.cc b/src/abg-ir.cc
>> index 27831352..f4fee60a 100644
>> --- a/src/abg-ir.cc
>> +++ b/src/abg-ir.cc
>> @@ -2884,6 +2884,13 @@ interned_string
>>  environment::intern(const string& s) const
>>  {return const_cast<environment*>(this)->priv_->string_pool_.create_string(s);}
>>
>> +bool
>> +environment::is_interned_string(const string& s) const
>> +{
>
>I have added a comment for this function.
>
>> +  const char *c = s.c_str();
>> +  return const_cast<environment*>(this)->priv_->string_pool_.has_string(c);
>> +}
>> +
>>  // </environment stuff>
>>
>>  // <type_or_decl_base stuff>
>> diff --git a/src/abg-libxml-utils.cc b/src/abg-libxml-utils.cc
>> index 2c46aad8..e8c1c1b8 100644
>> --- a/src/abg-libxml-utils.cc
>> +++ b/src/abg-libxml-utils.cc
>> @@ -249,6 +249,64 @@ escape_xml_string(const std::string& str)
>>    return result;
>>  }
>>
>> +/// Replace the various special characters in a type string as used in
>> +/// a type-id attribute.
>> +///
>> +/// The characters '<', '>', ''', '"' and ' ' are all replaced by '_'.
>> +/// The characters '&' and '*' at the end of a string are simply dropped,
>> +/// otherwise they are also replaced by an '_'.
>> +///
>> +/// The result is not reversible.
>> +///
>> +//// @param str the input string to read to search for the characters
>> +//// to replace.
>> +////
>> +//// @param replaced the output string where to write the resulting
>> +//// string that contains the replaced characters,
>> +void
>> +replace_xml_type_string(const std::string& str, std::string& replaced)
>> +{
>
>I thinking you could have just used the xml::escape_xml_string function
>that is in this abg-libxml-utils.cc file.  I understand that this new
>one that you wrote makes the result more pleasant to read, though.  So I
>am not arguing against your choice.
>
>I have however changed the name of this function to
>escape_xml_string_as_named_type_id to make the name a bit more
>"self-documented" and more in-line with escape_xml_string that exists
>already.
>
>[...]
>
>> --- a/src/abg-writer.cc
>> +++ b/src/abg-writer.cc
>> @@ -127,6 +127,54 @@ public:
>>      ABG_ASSERT(env);
>>      return env->intern(o.str());
>>    }
>> +
>> +  /// Return a unique string representing a name, prefixed by a type
>> +  /// string. The returned string will be made unique by postfixing
>> +  /// underscores if necessary.
>> +  ///
>> +  /// @param type to create an unique id string for
>> +  interned_string
>> +  get_id_for_type(const type_base* type) const
>> +  {
>> +    ostringstream o;
>> +    /* Try to find an appropriate prefix. */
>> +    if (is_type_decl(type))
>> +      o << "type-";
>> +    else if (is_class_type(type))
>> +      o << "class-";
>> +    else if (is_union_type(type))
>> +      o << "union-";
>> +    else if (is_enum_type(type))
>> +      o << "enum-";
>> +    else if (is_typedef(type))
>> +      o << "typedef-";
>> +    else if (is_qualified_type(type))
>> +      o << "qual-";
>> +    else if (is_pointer_type(type))
>> +      o << "ptr-";
>> +    else if (is_reference_type(type))
>> +      o << "ref-";
>> +    else if (is_array_type(type))
>> +      o << "array-";
>> +    else if (is_subrange_type(type))
>> +      o << "subrng-";
>> +    else if (is_function_type(type))
>> +      o << "func-";
>> +    else
>> +      ABG_ASSERT_NOT_REACHED;
>
>Here, have replaced the above by a call to
>abigail::ir::get_pretty_representation().  It does what you think it
>does for all the types supported in the internal representation (IR).
>Whenever a new type representation is added to the IR, this should
>hopefully just works as well.  I have also renamed this function as
>get_named_type_id() to make it a bit more self-documented.
>
>[...]
>
>> +
>> +    string name = xml::replace_xml_type_string(get_type_name(type));
>> +    o << name;
>> +
>> +    /* We want to make sure the id is unique. See if it is already
>> +       interned in this environment, if it is, it isn't unique and we
>> +       add some underscores to it till it is.  */
>> +    const environment* env = get_environment();
>> +    ABG_ASSERT(env);
>> +    while (env->is_interned_string(o.str()))
>
>Btw, I like this neat little trick you added (the
>environment::is_interned_string function) to ensure the uniqueness of
>the type id.  Well done.
>
>> +      o << "_";
>> +    return env->intern(o.str());
>> +  }
>>  };
>>
>
>[...]
>
>So please, find my amended patch below and I hope I haven't botched it
>too much.  Please let me know if it works for you.
>
>Cheers,
>

From ecde139657bc8bde2aa4e98f4542be38f576c77d Mon Sep 17 00:00:00 2001
>From: Mark Wielaard <mark@klomp.org>
>Date: Tue, 21 Apr 2020 14:28:18 +0200
>Subject: [PATCH] Add named-types-ids to use name ids after the type name
> instead of numbers.
>
>To make the XML output more readable and a bit more stable generate
>type ids based on the textual representation of the type. When types
>are inserted, removed or rearranged the type ids will (mostly) be the
>same so that references can stay the same. This also makes it easier
>to read and compare the XML corpus representation.
>
>	* doc/manuals/abidw.rst: Document --named-type-ids.
>	* include/abg-ir.h (is_interned_string): New method.
>	* include/abg-libxml-utils.h (replace_xml_type_string): Likewise.
>	* include/abg-writer.h (set_named_type_ids): New function.
>	(set_common_options): Call it.
>	* src/abg-ir.cc (environment::is_interned_string): New method.
>	* src/abg-libxml-utils.cc (escape_xml_string_as_named_type_id):
>	New function.
>	* src/abg-writer.cc (id_manager::get_named_type_id): Add new
>	method.
>	(write_context::m_named_type_ids): New data member.
>	(write_context::write_context): Initialize it.
>	(write_context::{g,s}et_named_type_ids}): Add accessors.
>	(write_context::get_id_for_type): Use the new
>	id_manager::get_named_type_id if get_named_type_ids().
>	* tools/abidw.cc (options): Add named_type_ids.
>	(display_usage): Describe --named-type-ids.
>	(parse_command_line): Parse --named-type-ids.
>
>Signed-off-by: Mark Wielaard <mark@klomp.org>
>Signed-off-by: Dodji Seketeli <dodji@redhat.com>
>---
> doc/manuals/abidw.rst      |  8 +++++
> include/abg-ir.h           |  3 ++
> include/abg-libxml-utils.h |  2 ++
> include/abg-writer.h       |  4 +++
> src/abg-ir.cc              | 14 ++++++++
> src/abg-libxml-utils.cc    | 59 +++++++++++++++++++++++++++++++++
> src/abg-writer.cc          | 68 ++++++++++++++++++++++++++++++++++++--
> tools/abidw.cc             |  7 +++-
> 8 files changed, 161 insertions(+), 4 deletions(-)
>
>diff --git a/doc/manuals/abidw.rst b/doc/manuals/abidw.rst
>index 6cc4693c..1e427d32 100644
>--- a/doc/manuals/abidw.rst
>+++ b/doc/manuals/abidw.rst
>@@ -178,6 +178,14 @@ Options
>    In the emitted ABI representation, do not show file, line or column
>    where ABI artifacts are defined.
> 
>+  * ``--named-type-ids``
>+
>+    Without this option ids used to reference types in the XML file
>+    use simple numbers.  With this option the ids used are derived
>+    from the type name to make it easier to see which type is
>+    referenced and make the XML file more stable in case new types are
>+    added (without this option that might mean all id numbers change).
>+
>   * ``--check-alternate-debug-info-base-name`` <*elf-path*>
> 
> 
>diff --git a/include/abg-ir.h b/include/abg-ir.h
>index fda10de5..406a1719 100644
>--- a/include/abg-ir.h
>+++ b/include/abg-ir.h
>@@ -209,6 +209,9 @@ public:
>   interned_string
>   intern(const string&) const;
> 
>+  bool
>+  is_interned_string(const string&) const;
>+
>   friend class class_or_union;
>   friend class class_decl;
>   friend class function_type;
>diff --git a/include/abg-libxml-utils.h b/include/abg-libxml-utils.h
>index 6331bde5..7f85d5cb 100644
>--- a/include/abg-libxml-utils.h
>+++ b/include/abg-libxml-utils.h
>@@ -120,6 +120,8 @@ unescape_xml_comment(const std::string& str,
> std::string
> unescape_xml_comment(const std::string& str);
> 
>+std::string
>+escape_xml_string_as_named_type_id(const std::string& str);
> }//end namespace xml
> }//end namespace abigail
> #endif //__ABG_LIBXML_UTILS_H__
>diff --git a/include/abg-writer.h b/include/abg-writer.h
>index ed739ef1..f1598a15 100644
>--- a/include/abg-writer.h
>+++ b/include/abg-writer.h
>@@ -65,6 +65,9 @@ set_write_comp_dir(write_context& ctxt, bool flag);
> void
> set_short_locs(write_context& ctxt, bool flag);
> 
>+void
>+set_named_type_ids(write_context& ctxt, bool flag);
>+
> /// A convenience generic function to set common options (usually used
> /// by Libabigail tools) from a generic options carrying-object, into
> /// a given @ref write_context.
>@@ -84,6 +87,7 @@ set_common_options(write_context& ctxt, const OPTS& opts)
>   set_write_corpus_path(ctxt, opts.write_corpus_path);
>   set_write_comp_dir(ctxt, opts.write_comp_dir);
>   set_short_locs(ctxt, opts.short_locs);
>+  set_named_type_ids(ctxt, opts.named_type_ids);
> }
> 
> void
>diff --git a/src/abg-ir.cc b/src/abg-ir.cc
>index 27831352..e68c59cc 100644
>--- a/src/abg-ir.cc
>+++ b/src/abg-ir.cc
>@@ -2884,6 +2884,20 @@ interned_string
> environment::intern(const string& s) const
> {return const_cast<environment*>(this)->priv_->string_pool_.create_string(s);}
> 
>+/// Test if a given string has already been interned in the current
>+/// environment.
>+///
>+/// @param s the string to consider.
>+///
>+/// @return true iff @p s has already been intered in the current
>+/// environment.
>+bool
>+environment::is_interned_string(const string& s) const
>+{
>+  const char *c = s.c_str();
>+  return const_cast<environment*>(this)->priv_->string_pool_.has_string(c);
>+}
>+
> // </environment stuff>
> 
> // <type_or_decl_base stuff>
>diff --git a/src/abg-libxml-utils.cc b/src/abg-libxml-utils.cc
>index 2c46aad8..7370afd5 100644
>--- a/src/abg-libxml-utils.cc
>+++ b/src/abg-libxml-utils.cc
>@@ -249,6 +249,65 @@ escape_xml_string(const std::string& str)
>   return result;
> }
> 
>+/// Replace the various special characters in a type string as used in
>+/// a named type-id attribute.
>+///
>+/// The characters '<', '>', ''', '"' and ' ' are all replaced by '_'.
>+/// The characters '&' and '*' at the end of a string are simply dropped,
>+/// otherwise they are also replaced by an '_'.
>+///
>+/// The result is not reversible.
>+///
>+//// @param str the input string to read to search for the characters
>+//// to replace.
>+////
>+//// @param replaced the output string where to write the resulting
>+//// string that contains the replaced characters,
>+void
>+escape_xml_string_as_named_type_id(const std::string& str,
>+				   std::string& replaced)
>+{
>+  for (std::string::const_iterator i = str.begin(); i != str.end(); ++i)
>+    switch (*i)
>+      {
>+      case ' ':
>+      case '<':
>+      case '>':
>+      case '"':
>+      case '\'':
>+	replaced += '_';
>+	break;
>+      case '&':
>+      case '*':

Would this not make references types and pointer types
indistinguishable?

Consider

   | template<typename T>
   | class Test{};
   |
   | Test<int&> a;
   | Test<int*> b;

While we add underscores until we have a unique match, that will later 

Also, why would we need to replace those characters at all? Except the
quote itself, we should be fine?! What am I missing? Can we reuse the
mangled name? Can replace less of those characters?


In the textual representation those are replaced differently:

   name='Test&lt;int&amp;&gt;'
   name='Test&lt;int*&gt;'

>+	if (i + 1 != str.end())
>+	  replaced += '_';
>+	break;
>+      default:
>+	replaced += *i;
>+      }
>+}
>+
>+/// Replace the various special characters in a type string as used in
>+/// a type-id attribute.
>+///
>+/// The characters '<', '>', ''', '"' are all replaced by '_'.
>+/// The character '&' is replaced by the string "-ref".

Is this really what is happening? I seem to miss the part where & is
transformed into '-ref'.

>+///
>+/// The result is not reversible.
>+///
>+//// @param str the input string to read to search for the characters
>+//// to replace.
>+////
>+//// @return the resulting string that contains the string with the
>+//// replaced characters.
>+std::string
>+escape_xml_string_as_named_type_id(const std::string& str)
>+{
>+  std::string result;
>+  escape_xml_string_as_named_type_id(str, result);
>+  return result;
>+}
>+
> /// Escape the '-' character, to avoid having a '--' in a comment.
> ///
> /// The resulting entity for '-' is '&#45;'.
>diff --git a/src/abg-writer.cc b/src/abg-writer.cc
>index 74fa1a8c..5f85810b 100644
>--- a/src/abg-writer.cc
>+++ b/src/abg-writer.cc
>@@ -127,6 +127,29 @@ public:
>     ABG_ASSERT(env);
>     return env->intern(o.str());
>   }
>+
>+  /// Return a unique string representing a type representation.  The
>+  /// returned string will be made unique by postfixing underscores if
>+  /// necessary.
>+  ///
>+  /// @param type to create an unique id string for
>+  interned_string
>+  get_named_type_id(const type_base* type) const
>+  {
>+    string named_type_id =
>+      xml::escape_xml_string_as_named_type_id(get_pretty_representation(type,
>+									/*internal=*/true));
>+
>+    /* We want to make sure the id is unique. See if it is already
>+       interned in this environment, if it is, it isn't unique and we
>+       add some underscores to it till it is.  */
>+    const environment* env = get_environment();
>+    ABG_ASSERT(env);
>+    while (env->is_interned_string(named_type_id))
>+      named_type_id = named_type_id + "_";

Postfixing the underscore has the advantage that this is not
reproducibly producing the same output for the same input. That would
defeat (a bit) the purpose of that patch to stabilize the output.

Again, thanks for working on that! :-)

Cheers,
Matthias

>+
>+    return env->intern(named_type_id);
>+  }
> };
> 
> /// A hashing functor that should be as fast as possible.
>@@ -174,6 +197,7 @@ class write_context
>   bool					m_write_corpus_path;
>   bool					m_write_comp_dir;
>   bool					m_short_locs;
>+  bool					m_named_type_ids;
>   mutable type_ptr_map			m_type_id_map;
>   mutable type_ptr_set_type		m_emitted_type_set;
>   type_ptr_set_type			m_emitted_decl_only_set;
>@@ -208,7 +232,8 @@ public:
>       m_write_architecture(true),
>       m_write_corpus_path(true),
>       m_write_comp_dir(true),
>-      m_short_locs(false)
>+      m_short_locs(false),
>+      m_named_type_ids(false)
>   {}
> 
>   /// Getter of the environment we are operating from.
>@@ -306,6 +331,20 @@ public:
>   set_short_locs(bool f)
>   {m_short_locs = f;}
> 
>+  /// Getter of the named-type-ids option.
>+  ///
>+  /// @return true iff named type ids shall be emitted
>+  bool
>+  get_named_type_ids() const
>+  {return m_named_type_ids;}
>+
>+  /// Setter of the named-type-ids option
>+  ///
>+  /// @param f the new value of the flag.
>+  void
>+  set_named_type_ids(bool f)
>+  {m_named_type_ids = f;}
>+
>   /// Getter of the "show-locs" option.
>   ///
>   /// When this option is true then the XML writer emits location
>@@ -357,6 +396,10 @@ public:
>   /// in a hash table, hashing the type.  So if the type has no id
>   /// associated to it, create a new one and return it.  Otherwise,
>   /// return the existing id for that type.
>+  ///
>+  /// @param t the type to get the id for.
>+  ///
>+  /// @return an interned string representing the id of the type.
>   interned_string
>   get_id_for_type(const type_base_sptr& t)
>   {return get_id_for_type(t.get());}
>@@ -365,6 +408,10 @@ public:
>   /// in a hash table, hashing the type.  So if the type has no id
>   /// associated to it, create a new one and return it.  Otherwise,
>   /// return the existing id for that type.
>+  ///
>+  /// @param t the type to get the id for.
>+  ///
>+  /// @return an interned string representing the id of the type.
>   interned_string
>   get_id_for_type(const type_base* t) const
>   {
>@@ -375,8 +422,11 @@ public:
>     type_ptr_map::const_iterator it = m_type_id_map.find(c);
>     if (it == m_type_id_map.end())
>       {
>-	interned_string id =
>-	  get_id_manager().get_id_with_prefix("type-id-");
>+	interned_string id;
>+	if (get_named_type_ids())
>+	  id = get_id_manager().get_named_type_id(c);
>+	else
>+	  id = get_id_manager().get_id_with_prefix("type-id-");
> 	m_type_id_map[c] = id;
> 	return id;
>       }
>@@ -2033,6 +2083,18 @@ void
> set_short_locs(write_context& ctxt, bool flag)
> {ctxt.set_short_locs(flag);}
> 
>+/// Set the 'named-type-ids' flag.
>+///
>+/// When this flag is set then the XML writer will emit type ids
>+/// based on the name of types, instead of numbered ids.
>+///
>+/// @param ctxt the context to set this flag on to.
>+///
>+/// @param flag the new value of the 'named-type_ids' flag.
>+void
>+set_named_type_ids(write_context& ctxt, bool flag)
>+{ctxt.set_named_type_ids(flag);}
>+
> /// Serialize the canonical types of a given scope.
> ///
> /// @param scope the scope to consider.
>diff --git a/tools/abidw.cc b/tools/abidw.cc
>index 72a8b0f1..2190e8aa 100644
>--- a/tools/abidw.cc
>+++ b/tools/abidw.cc
>@@ -111,6 +111,7 @@ struct options
>   bool			do_log;
>   bool			drop_private_types;
>   bool			drop_undefined_syms;
>+  bool			named_type_ids;
> 
>   options()
>     : display_version(),
>@@ -130,7 +131,8 @@ struct options
>       annotate(),
>       do_log(),
>       drop_private_types(false),
>-      drop_undefined_syms(false)
>+      drop_undefined_syms(false),
>+      named_type_ids(false)
>   {}
> 
>   ~options()
>@@ -164,6 +166,7 @@ display_usage(const string& prog_name, ostream& out)
>     << "  --short-locs  only print filenames rather than paths\n"
>     << "  --drop-private-types  drop private types from representation\n"
>     << "  --drop-undefined-syms  drop undefined symbols from representation\n"
>+    << "  --named-type-ids  use id attributes based on type names in XML file\n"
>     << "  --no-comp-dir-path  do not show compilation path information\n"
>     << "  --check-alternate-debug-info <elf-path>  check alternate debug info "
>     "of <elf-path>\n"
>@@ -304,6 +307,8 @@ parse_command_line(int argc, char* argv[], options& opts)
> 	opts.drop_private_types = true;
>       else if (!strcmp(argv[i], "--drop-undefined-syms"))
> 	opts.drop_undefined_syms = true;
>+      else if (!strcmp(argv[i], "--named-type-ids"))
>+	opts.named_type_ids = true;
>       else if (!strcmp(argv[i], "--no-linux-kernel-mode"))
> 	opts.linux_kernel_mode = false;
>       else if (!strcmp(argv[i], "--abidiff"))
>-- 
>2.26.1
>

>
>-- 
>		Dodji


  reply	other threads:[~2020-04-24 12:57 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-21 12:28 More simplifications of abi XML file Mark J. Wielaard
2020-04-21 12:28 ` [PATCH 1/4] Add named-types-ids to use name ids after the type name instead of numbers Mark J. Wielaard
2020-04-23 17:48   ` Dodji Seketeli
2020-04-24 12:57     ` Matthias Maennich [this message]
2020-04-24 15:26       ` Mark Wielaard
2020-04-24 17:11         ` Matthias Maennich
2020-04-24 14:26     ` Mark Wielaard
2020-04-21 12:28 ` [PATCH 2/4] Add no-parameter-names to drop function parameter names Mark J. Wielaard
2020-04-24 13:35   ` Dodji Seketeli
2020-04-21 12:28 ` [PATCH 3/4] Add --no-elf-needed option to drop DT_NEEDED list from corpus Mark J. Wielaard
2020-04-24 14:17   ` Dodji Seketeli
2020-04-21 12:28 ` [PATCH 4/4] Add --merge-translation-units option Mark J. Wielaard
2020-05-07 12:27   ` Matthias Maennich

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200424125700.GF80434@google.com \
    --to=maennich@google.com \
    --cc=dodji@seketeli.org \
    --cc=libabigail@sourceware.org \
    --cc=mark@klomp.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).