public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
From: David Malcolm <dmalcolm@redhat.com>
To: Marek Polacek <polacek@redhat.com>,
	GCC Patches <gcc-patches@gcc.gnu.org>
Cc: Joseph Myers <joseph@codesourcery.com>,
	Jakub Jelinek <jakub@redhat.com>,
	David Malcolm <dmalcolm@redhat.com>
Subject: [PATCH 0/2] Re: [PATCH] libcpp: Implement -Wbidirectional for CVE-2021-42574 [PR103026]
Date: Tue,  2 Nov 2021 16:57:59 -0400	[thread overview]
Message-ID: <20211102205801.1202228-1-dmalcolm@redhat.com> (raw)
In-Reply-To: <20211101163652.36794-1-polacek@redhat.com>

On Mon, 2021-11-01 at 12:36 -0400, Marek Polacek via Gcc-patches wrote:
> From a link below:
> "An issue was discovered in the Bidirectional Algorithm in the
> Unicode
> Specification through 14.0. It permits the visual reordering of
> characters via control sequences, which can be used to craft source
> code
> that renders different logic than the logical ordering of tokens
> ingested by compilers and interpreters. Adversaries can leverage this
> to
> encode source code for compilers accepting Unicode such that targeted
> vulnerabilities are introduced invisibly to human reviewers."
> 
> More info:
> https://nvd.nist.gov/vuln/detail/CVE-2021-42574
> https://trojansource.codes/
> 
> This is not a compiler bug.  However, to mitigate the problem, this
> patch
> implements -Wbidirectional=[none|unpaired|any] to warn about possibly
> misleading Unicode bidirectional characters the preprocessor may
> encounter.
> 
> The default is =unpaired, which warns about improperly terminated
> bidirectional characters; e.g. a LRE without its appertaining PDF. 
> The
> level =any warns about any use of bidirectional characters.
> 
> This patch handles both UCNs and UTF-8 characters.  UCNs designating
> bidi characters in identifiers are accepted since r204886.  Then
> r217144
> enabled -fextended-identifiers by default.  Extended characters in
> C/C++
> identifiers have been accepted since r275979.  However, this patch
> still
> warns about mixing UTF-8 and UCN bidi characters; there seems to be
> no
> good reason to allow mixing them.
> 
> We warn in different contexts: comments (both C and C++-style),
> string
> literals, character constants, and identifiers.  Expectedly, UCNs are
> ignored
> in comments and raw string literals.  The bidirectional characters
> can nest
> so this patch handles that as well.
> 
> I have not included nor tested this at all with Fortran (which also
> has
> string literals and line comments).
> 
> Dave M. posted patches improving diagnostic involving Unicode
> characters.
> This patch does not make use of this new infrastructure yet.

Challenge accepted :)

Here are a couple of patches on top of the v1 version of your patch
to make use of that new infrastructure.

The first patch is relatively non-invasive; the second patch reworks
things quite a bit to capture location_t values for the bidirectional
control characters, and use them in the diagnostics, with labelled
ranges, giving e.g.:

$ ./xgcc -B. -S ../../src/gcc/testsuite/c-c++-common/Wbidirectional-2.c -fdiagnostics-escape-format=bytes
../../src/gcc/testsuite/c-c++-common/Wbidirectional-2.c: In function ‘main’:
../../src/gcc/testsuite/c-c++-common/Wbidirectional-2.c:5:28: warning: unpaired UTF-8 bidirectional character detected [-Wbidirectional=]
    5 |     /* Say hello; newline<e2><81><a7>/*/ return 0 ;
      |                          ~~~~~~~~~~~~  ^
      |                          |             |
      |                          |             end of bidirectional context
      |                          U+2067 (RIGHT-TO-LEFT ISOLATE)

There's a more complicated example in the test case.

Not yet bootstrapped, but hopefully gives you some ideas on future
versions of the patch.

Note that the precise location_t values aren't going to make much sense
without the escaping feature [1], and I don't think that's backportable
to GCC 11, so these UX tweaks might be for GCC 12+ only.

Hope this is constructive
Dave

[1] what is a "column number" in a line of bidirectional text?  Right now
it's a 1-based offset w.r.t. the logical ordering of the characters, but
respecting tabs and counting certain characters as occupying two columns,
but it's not at all clear to me that there's such a thing as a
"column number" in bidirectional text.


David Malcolm (2):
  Flag CPP_W_BIDIRECTIONAL so that source lines are escaped
  Capture locations of bidi chars and underline ranges

 .../c-c++-common/Wbidirectional-ranges.c      |  54 ++++
 libcpp/lex.c                                  | 254 ++++++++++++++----
 2 files changed, 261 insertions(+), 47 deletions(-)
 create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-ranges.c

-- 
2.26.3


  parent reply	other threads:[~2021-11-02 20:58 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-01 16:36 Marek Polacek
2021-11-01 22:10 ` Joseph Myers
2021-11-02 17:18   ` [PATCH v2] " Marek Polacek
2021-11-02 19:20     ` Martin Sebor
2021-11-02 19:52       ` Marek Polacek
2021-11-08 21:33         ` Marek Polacek
2021-11-15 17:28           ` [PATCH] libcpp: Implement -Wbidi-chars " Marek Polacek
2021-11-15 23:15             ` David Malcolm
2021-11-16 19:50               ` [PATCH v2] " Marek Polacek
2021-11-16 23:00                 ` David Malcolm
2021-11-17  0:37                   ` [PATCH v3] " Marek Polacek
2021-11-17  2:28                     ` David Malcolm
2021-11-17  3:05                       ` Marek Polacek
2021-11-17 22:45                         ` [committed] libcpp: escape non-ASCII source bytes in -Wbidi-chars= [PR103026] David Malcolm
2021-11-17 22:45                           ` [PATCH 2/2] libcpp: capture and underline ranges " David Malcolm
2021-11-17 23:01                             ` Marek Polacek
2021-11-30  8:38             ` [PATCH] libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026] Stephan Bergmann
2021-11-30 13:26               ` Marek Polacek
2021-11-30 15:00                 ` Stephan Bergmann
2021-11-30 15:27                   ` Marek Polacek
2022-01-14  9:23                     ` Stephan Bergmann
2022-01-14 13:28                       ` Marek Polacek
2022-01-14 14:52                         ` Stephan Bergmann
2021-11-02 20:57 ` David Malcolm [this message]
2021-11-02 20:58   ` [PATCH 1/2] Flag CPP_W_BIDIRECTIONAL so that source lines are escaped David Malcolm
2021-11-02 21:07     ` David Malcolm
2021-11-02 20:58   ` [PATCH 2/2] Capture locations of bidi chars and underline ranges David Malcolm

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20211102205801.1202228-1-dmalcolm@redhat.com \
    --to=dmalcolm@redhat.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=jakub@redhat.com \
    --cc=joseph@codesourcery.com \
    --cc=polacek@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).