From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gprocida@google.com>
Received: from mail-vs1-xe33.google.com (mail-vs1-xe33.google.com
 [IPv6:2607:f8b0:4864:20::e33])
 by sourceware.org (Postfix) with ESMTPS id 77EAA386102F
 for <libabigail@sourceware.org>; Tue,  6 Apr 2021 14:36:30 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 77EAA386102F
Received: by mail-vs1-xe33.google.com with SMTP id d65so3795801vsd.6
 for <libabigail@sourceware.org>; Tue, 06 Apr 2021 07:36:30 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc;
 bh=cJVNOuCZJv5DSmwjeDJWeYPm0XZx6hso2IoL3gJNWEo=;
 b=oQJRLUw9BXOTu/c7g5mGteWYJX1Iqqhzk8QTIMve3U/h1+J/yWc2TGohjrtQVs3XeK
 Ka5CqpHYGssSPFHnlnhfOp6o9jvYiusx7U+NETgogW/RBpbDd5b8urffqB6mnMjAh+W3
 zbLo2K8izYtAo40DqMuNLg+wDWQ6PMCdFSObCMjGoTljpKyi03BuK+7hwrt/c1IRU0Be
 ptt6m+fkH5hDAu++is+iw49FKO9qGkh4DIMwoVafommzl8tzCzOMEKOK24bCCaNJnCv7
 kBSQ2p9UP4gwNfHcc9/b448CX57LHY3Es7wxcpFI3+e05xlhv8fCIFmpYDmpmbWVmNHO
 hTMQ==
X-Gm-Message-State: AOAM531fEXBdbEXs2Mug1wMNwQPWrOlSH8LhHigMdEzXpUNEPeB9ZOnK
 kX4bPHLLYM6oT4SI8s033ILLMPO8rVMVGtyqmApOYg==
X-Google-Smtp-Source: ABdhPJwCRxi+fsLMurd+/MuN4vGMI8MMivgPKg6II9hUBGGczBE1uq86CuPariyJFY8mfZ/n0/lYB3k/Eg8OZzLc94A=
X-Received: by 2002:a05:6102:ed0:: with SMTP id
 m16mr3956839vst.4.1617719789744; 
 Tue, 06 Apr 2021 07:36:29 -0700 (PDT)
MIME-Version: 1.0
References: <20210331092352.619148-1-gprocida@google.com>
 <20210331170454.951900-1-gprocida@google.com>
 <8735w3vdkv.fsf@seketeli.org>
In-Reply-To: <8735w3vdkv.fsf@seketeli.org>
From: Giuliano Procida <gprocida@google.com>
Date: Tue, 6 Apr 2021 15:35:52 +0100
Message-ID: <CAGvU0HkGVVHwWj-c=wqyX+n9hLsPDuxkdePRrW_-9wikWPS9wQ@mail.gmail.com>
Subject: Re: [PATCH v2] XML reader: improve XML node traversal
To: Dodji Seketeli <dodji@seketeli.org>
Cc: libabigail@sourceware.org, kernel-team@android.com, maennich@google.com
X-Spam-Status: No, score=-29.3 required=5.0 tests=BAYES_00, DKIMWL_WL_MED,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, ENV_AND_HDR_SPF_MATCH,
 GIT_PATCH_0, HTML_MESSAGE, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP,
 USER_IN_DEF_DKIM_WL,
 USER_IN_DEF_SPF_WL autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Content-Filtered-By: Mailman/MimeDel 2.1.29
X-BeenThere: libabigail@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Mailing list of the Libabigail project <libabigail.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libabigail>,
 <mailto:libabigail-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libabigail/>
List-Help: <mailto:libabigail-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libabigail>,
 <mailto:libabigail-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Tue, 06 Apr 2021 14:36:33 -0000

Hi.

Thanks for reviewing.

On Tue, 6 Apr 2021 at 12:48, Dodji Seketeli <dodji@seketeli.org> wrote:

> Hello Giuliano,
>
> Giuliano Procida <gprocida@google.com> a =C3=A9crit:
>
> [...]
>
> Thanks for looking into this.
>
> Sadly, as several other patches got in before I got to review this, this
> patch doesn't apply to the current state of master anymore.  So overall,
> I think it would need a rebasing.  Sorry about that :-(
>
>
No problem. So long as the public master is the same as your working
master, I can easily rebase and resend. My v2 posting reflects the
current public master.


> [...]
>
> > The XML reader uses libxml's Reader API for parsing XML and traverses
> > high-level XML nodes using reader cursor advancement. Low-level nodes
> > are read into complete node trees and are traversed following
> > children, next and (in one instance) parent pointers. Some
> > intermediate-level nodes can be parsed both ways.
> >
> > A lot of code cares about node types, typically skipping over
> > non-element nodes. Unfortunately, in a few places, the tracked "corpus
> > node" needs to land on text (whitespace) within elements for the XML
> > reader to work; stripping text nodes out causes some elements to be
> > skipped.
>
> Can you please give an example of this?  It doesn't seem obvious to me.
>

Of course. The contributed test case serves as an example (it just has all
unnecessary space stripped out). It wasn't obvious to me either!

I'm sure there is a much smaller fix that could be made to address each
case of sensitivity to whitespace. However, the main contribution here are
the new primitives which make, I hope, the parsing code easier to read and
understand and harder to hide bugs in.


>
> > Almost all node tree traversals are very straightforward and
> > can be expressed using simpler primitves than at present.
> >
> > This commit changes all the node tree traversals to use two node
> > iteration primitives consistently. These are "first_child" and
> > "next_child", both of which automatically skip text nodes. They
> > replace the existing "advance_to_next_sibling_element" and *all*
> > direct references to "children" and "next" nodes. The unwanted
> > dependency on text within elements has been fixed.
>
> I think providing an example of the initial problem would help to
> understand how this fixes the initial issue.
>
>
Please see the test case. We'd like to have confidence in the semantics of
libabigail XML, independent of the current reader and writer
implementations. Transformations that do not affect the XML element
structure should not change XML meaning, but do at present.


> [...]
>
> > For-loops constructed with the new primitives present only element
> > nodes to the loop body. This allows a lot of redundant checking for
> > null / non-element nodes to be removed.
> >
> > These changes have been tested by using abidiff to comparing all
> > current test XML files with copies where all text has been removed.
> > One such test case is included in this commit.
> >
> >       * include/abg-libxml-utils.h
> >       (advance_to_next_sibling_element): Remove this function.
> >       (first_child): Add new function. (next_child): Likewise. This
>
> Please note how the "(next_child)" should go on the next line.
>
> There are several instances of this issue in the ChangeLog part of the
> commit log.
>
>
Noted. I may have been doing this inconsistently for a while. I'll correct
this.


> [...]
>
> >       is simpler than but functionally identical to
> >       advance_to_next_sibling_element.
> >       * src/abg-libxml-utils.cc (skip_non_elements): New function to
> >       skip non-element nodes in a sibling node list. (first_child):
> >       New function to find the first child element of a node.
> >       (next_child): New function to find the next sibling element of
> >       a node.
> >       * src/abg-reader.cc (walk_xml_node_to_map_type_ids): Remove
> >       redundant node checks. Iterate over children with first_child
> >       and next_child helpers. (read_translation_unit): Likewise.
> >       (read_translation_unit_from_input): Remove a no-longer-needed
> >       ->next and redundant node checks. Advance "corpus node" after
> >       reading the translation unit. (read_symbol_db_from_input):
> >       Remove a no-longer-needed ->next and advance "corpus node" at
> >       the end of the for-loop.  Remove redundant node checks.
> >       (build_needed): Remove redundant node checks. Iterate over
> >       children with first_child and next_child.
> >       (read_elf_needed_from_input): Remove redundant node checks and
> >       simplify logic. Move to next element node after reading
> >       elf-needed node. (read_corpus_from_input): Remove no-children
> >       early return (which was inhibited by text nodes anyway). Set
> >       initial "corpus node" using first_child instead of children
> >       pointer. Remove redundant "corpus node" updates at end of
> >       function as the caller takes care of this and the node can
> >       become null with the new primitives.
> >       (read_corpus_group_from_input): Iterate over children using
> >       first_child and next child. Set "corpus node" for each corpus,
> >       not just the first. Incidentally keep going even if one corpus
> >       parse fails. (build_namespace_decl): Remove redundant node
> >       checks. Iterate over children using first_child and
> >       next_child. (build_elf_symbol): Remove redundant node checks.
> >       (build_elf_symbol_db): Iterate over children using first_child
> >       and next_child. (build_function_decl): Remove redundant node
> >       checks. Iterate over children using first_child and
> >       next_child. (build_function_type): Likewise.
> >       (build_array_type_def): Likewise. (build_enum_type_decl):
> >       Likewise. (build_class_decl): Remove redundant node
> >       checks. Iterate over children using first_child and
> >       next_child. Add comment about skipping children of
> >       declaration-only types (build_union_decl): Likewise.
> >       (build_function_tdecl): Remove redundant node checks.
> >       Iterate over children using first_child and next_child.
> >       (build_type_composition): Likewise.
> >       (build_template_tparameter): Likewise.
> >       * tests/data/Makefile.am: Add new test files.
> >       * tests/data/test-abidiff-exit/test-squished-report.txt: Add
> >       empty diff report.
> >       * tests/data/test-abidiff-exit/test-squished-v0.abi: Add test
> >       ABI file, containing all ELF elements.
> >       * tests/data/test-abidiff-exit/test-squished-v1.abi: Likewise
> >       and identical, except that all text has been stripped.
> >       * tests/test-abidiff-exit.cc: Add new test case.
> >
> > Reviewed-by: Matthias Maennich <maennich@google.com>
> > Signed-off-by: Giuliano Procida <gprocida@google.com>
> > ---
> >  include/abg-libxml-utils.h                    |   5 +-
> >  src/abg-libxml-utils.cc                       |  47 ++--
> >  src/abg-reader.cc                             | 260 ++++++------------
> >  tests/data/Makefile.am                        |   3 +
> >  .../test-squished-report.txt                  |   0
> >  .../test-abidiff-exit/test-squished-v0.abi    |  43 +++
> >  .../test-abidiff-exit/test-squished-v1.abi    |   1 +
> >  tests/test-abidiff-exit.cc                    |  11 +
> >  8 files changed, 167 insertions(+), 203 deletions(-)
> >  create mode 100644 tests/data/test-abidiff-exit/test-squished-report.t=
xt
> >  create mode 100644 tests/data/test-abidiff-exit/test-squished-v0.abi
> >  create mode 100644 tests/data/test-abidiff-exit/test-squished-v1.abi
> >
> > diff --git a/include/abg-libxml-utils.h b/include/abg-libxml-utils.h
> > index 69e940f6..f87ee5a0 100644
> > --- a/include/abg-libxml-utils.h
> > +++ b/include/abg-libxml-utils.h
> > @@ -82,7 +82,10 @@ int get_xml_node_depth(xmlNodePtr);
> >    reinterpret_cast<char*>(xml_char_str.get())
> >
> >  xmlNodePtr
> > -advance_to_next_sibling_element(xmlNodePtr node);
> > +first_child(xmlNodePtr node);
> > +
> > +xmlNodePtr
> > +next_child(xmlNodePtr node);
> >
> >  void
> >  escape_xml_string(const std::string& str,
> > diff --git a/src/abg-libxml-utils.cc b/src/abg-libxml-utils.cc
> > index a1acff1f..baea25ac 100644
> > --- a/src/abg-libxml-utils.cc
> > +++ b/src/abg-libxml-utils.cc
> > @@ -413,40 +413,39 @@ unescape_xml_comment(const std::string& str)
> >    return result;
> >  }
> >
> > -/// Maybe get the next sibling element node of an XML node, or stay to
> the sam
> > +/// Skip until we reach an XML element node or run out of (child) node=
s.
> >  ///
> > -/// If there is no next sibling xml element node, the function returns
> > -/// the initial node.
> > +/// @param a pointer to a node.
> >  ///
> > -/// @param node the initial node to consider.
> > -///
> > -/// @return the next sibling node or the initial node @p node.
> > -static xmlNodePtr
> > -go_to_next_sibling_element_or_stay(xmlNodePtr node)
>
> Removal of this is not documented in the ChangeLog part of the commit log=
.
>
> > +/// @return a pointer to an XML element node or nil.
> > +xmlNodePtr
> > +skip_non_elements(xmlNodePtr node)
>
> This does exactly the same as the removed
> go_to_next_sibling_element_or_stay or am I missing something?
>
>
The only (but crucial) difference is that skip_non_elements is happy to
return nil when it reaches the end of the list while
go_to_next_sibling_element_or_stay will not go past the last element.

For example, consider a sibling (->next) chain: whitespace1 -> elf-symbol
-> whitespace2 -> nil that might occur in a XML file containing a single
ELF symbol.

Once the parser has consumed the symbol, we want the next element (but
there is none).

The new code does: next_child which calls skip_non_elements(node->next) and
both return nil.
The old code does: advance_to_next_sibling_element(node) which
calls go_to_next_sibling_element_or_stay(node->next) which returns
whitespace2 and then an extra check is needed on the node type to
eventually return nil.

So the new code can be thought of as factoring the ? : conditional logic
in go_to_next_sibling_element_or_stay out into the
calling advance_to_next_sibling_element and renaming the two functions.
Then there's adding first_child which uses the newly factored function in a
consistent fashion. Each node's type is tested exactly once.


> >  {
> > -  xmlNodePtr n;
> > -  for (n =3D node; n; n =3D n->next)
> > -    {
> > -      if (n->type =3D=3D XML_ELEMENT_NODE)
> > -     break;
> > -    }
> > -  return n ? n : node;
> > +  while (node && node->type !=3D XML_ELEMENT_NODE)
> > +    node =3D node->next;
> > +  return node;
> >  }
> >
> > -/// Get the next sibling element node of an XML node.
> > +/// Get the first child element of an XML node.
> >  ///
> > -/// If there is no next sibling xml element node, the function returns
> nil.
> > +/// @param a pointer to the node whose children are to be perused.
> >  ///
> > -/// @param node the XML node to consider.
> > +/// @return a pointer to the first XML element child of @p node or nil=
.
> > +xmlNodePtr
> > +first_child(xmlNodePtr node)
>
> Please call this first_element_child.  Otherwise, it can be misleading,
> I think.
>
>
The short names are to aid readability of the for-loops. Unless we are
going to start caring about non-element nodes, "element" is redundant.

One step (much) further would be to define an adaptor object and iterators
so the loops could look like:

for (const auto* child : children(node)) { ... }

or even

for (const auto& child : children(node)) { ... }

but that's more machinery and coding to worry about.

> +{
> > +  return skip_non_elements(node->children);
> > +}
> > +
> > +/// Get the next sibling element of an XML node.
> >  ///
> > -/// @return the next sibling element node or nil.
> > +/// @param a pointer to the XML node whose following siblings are to b=
e
> perused.
> > +///
> > +/// @return a pointer to the next sibling element of @p node or nil.
> >  xmlNodePtr
> > -advance_to_next_sibling_element(xmlNodePtr node)
> > +next_child(xmlNodePtr node)
>
> Looking at the code of this function I think it should rather be called
> next_element_sibling.  Or what am I missing?
>
>
As above, we don't want to even care about non-element nodes outside
skip_non_elements and mentioning "element" is redundant. All the rest of
the code works with nodes of type XML_ELEMENT_NODE only (and this constant
now appears exactly once in the code).

"child" vs "sibling" are close to interchangeable in the context of
traversing the children of a given node:
* next sibling =3D next child of the same parent
* "child" is 2 characters shorter than sibling
* "first sibling" means the wrong thing

>  {
> > -  xmlNodePtr n =3D go_to_next_sibling_element_or_stay(node->next);
> > -  if (n =3D=3D 0 || n->type !=3D XML_ELEMENT_NODE)
> > -    return 0;
> > -  return n;
> > +  return skip_non_elements(node->next);
> >  }
>
> [...]
>
> Cheers,
>
>
> --
>                 Dodji
>

Regards,
Giuliano.