public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
From: Martin Sebor <msebor@gmail.com>
To: David Malcolm <dmalcolm@redhat.com>, gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Initial implementation of -Whomoglyph [PR preprocessor/103027]
Date: Tue, 2 Nov 2021 13:49:42 -0600	[thread overview]
Message-ID: <ab873f70-9a83-2467-975d-887beef861b4@gmail.com> (raw)
In-Reply-To: <20211101211412.1123930-1-dmalcolm@redhat.com>

On 11/1/21 3:14 PM, David Malcolm via Gcc-patches wrote:
> [Resending to get around mailing list size limit; see notes below]
> 
> This patch implements a new -Whomoglyph diagnostic, enabled by default.
> 
> Internally it implements the "skeleton" algorithm from:
>    http://www.unicode.org/reports/tr39/#Confusable_Detection
> so that every new identifier is mapped to a "skeleton", and if
> the skeleton is already in use by a different identifier, issue
> a -Whomoglyph diagnostic.
> It uses the data from:
>    https://www.unicode.org/Public/security/13.0.0/confusables.txt
> to determine which characters are confusable.
> 
> For example, given the example of CVE-2021-42694 at
> https://trojansource.codes/, with this patch we emit:
> 
> t.cc:7:1: warning: identifier ‘sayНello’ (‘say\u041dello’)... [CWE-1007] [-Whomoglyph]
>      7 | void say<U+041D>ello() {
>        | ^~~~
> t.cc:3:1: note: ...confusable with non-equal identifier ‘sayHello’ here
>      3 | void sayHello() {
>        | ^~~~
> 
> (the precise location of the token isn't quite right; the
> identifiers should be underlined, rather than the "void" tokens)
> 
> This takes advantage of:
>    "diagnostics: escape non-ASCII source bytes for certain diagnostics"
>      https://gcc.gnu.org/pipermail/gcc-patches/2021-November/583020.html
> to escape non-ASCII characters when printing a source line for -Whomoglyph,
> so that we print "say<U+041D>ello" when quoting the source line, making it
> clearer that this is not "sayHello".
> 
> In order to implement "skeleton", I had to implement NFD support, so the
> patch also contains some UTF-32 support code.
> 
> Known issues:
> - I'm doing an extra hash_table lookup on every identifier lookup.
>    I haven't yet measured the impact on the speed of the compiler.
>    If this is an issue, is there a good place to stash an extra
>    pointer in every identifier?
> - doesn't yet bootstrap, as the confusables.txt data contains ASCII
>    to ASCII confusables, leading to warnings such as:
> ../../.././gcc/options.h:11273:3: warning: identifier ‘OPT_l’... [CWE-1007] [-Whomoglyph]
> ../../.././gcc/options.h:9959:3: note: ...confusable with non-equal identifier ‘OPT_I’ (‘OPT_I’) here
>    Perhaps the option should have levels, where we don't complain about
>    pure ASCII confusables at the default level?
> - no docs yet
> - as noted above the location_t of the token isn't quite right
> - map_identifier_to_skeleton and map_skeleton_to_first_use aren't
>    yet integrated with the garbage collector
> - some other FIXMEs in the patch
> 
> [I had to trim the patch for space to get it to pass the size filter on the
> mailing list; I trimmed:
>    contrib/unicode/confusables.txt,
>    gcc/testsuite/selftests/NormalizationTest.txt
> which can be downloaded from the URLs in the ChangeLog, and:
>    gcc/confusables.inc
>    gcc/decomposition.inc
> which can be generated using the scripts in the patch ]
> 
> Thoughts?

None from me on the actual feature -- even after our discussion
this morning I remain comfortably ignorant of the problem :)
I just have a quick comment on the two new string classes:

...
> +
> +/* A class for manipulating UTF-32 strings.  */
> +
> +class utf32_string
> +{
...
> + private:
...
> +  cppchar_t *m_buf;
> +  size_t m_alloc_len;
> +  size_t m_len;
> +};
> +
> +/* A class for constructing UTF-8 encoded strings.
> +   These are not NUL-terminated.  */
> +
> +class utf8_string
> +{
...
> + private:
> +  uchar  *m_buf;
> +  size_t m_alloc_sz;
> +  size_t m_len;
> +};

There are container abstractions both in C++ and in GCC that
these classes look like they could be implemented in terms of:
I'm thinking of std::string, std::vector, vec, and auto_vec.
They have the additional advantage of being safely copyable
and assignable, and of course, of having already been tested.
I see that the classes in your patch provide additional
functionality that the abstractions don't.  I'd expect it
be doable on top of the abstractions and without
reimplementing all the basic buffer management.

Martin

      parent reply	other threads:[~2021-11-02 19:49 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-01 21:14 David Malcolm
2021-11-02 11:56 ` Jakub Jelinek
2021-11-02 12:06   ` Jakub Jelinek
2021-11-02 19:49 ` Martin Sebor [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ab873f70-9a83-2467-975d-887beef861b4@gmail.com \
    --to=msebor@gmail.com \
    --cc=dmalcolm@redhat.com \
    --cc=gcc-patches@gcc.gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).