From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dmalcolm@redhat.com>
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
 by sourceware.org (Postfix) with ESMTPS id DA6833858405
 for <gcc-patches@gcc.gnu.org>; Mon, 15 Nov 2021 23:15:46 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org DA6833858405
Received: from mail-qk1-f198.google.com (mail-qk1-f198.google.com
 [209.85.222.198]) (Using TLS) by relay.mimecast.com with ESMTP id
 us-mta-544-lHZrQNc0P5yTHn4RduhOkw-1; Mon, 15 Nov 2021 18:15:43 -0500
X-MC-Unique: lHZrQNc0P5yTHn4RduhOkw-1
Received: by mail-qk1-f198.google.com with SMTP id
 s184-20020ae9dec1000000b00462de7f85b9so12317674qkf.3
 for <gcc-patches@gcc.gnu.org>; Mon, 15 Nov 2021 15:15:43 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=x-gm-message-state:message-id:subject:from:to:cc:date:in-reply-to
 :references:user-agent:mime-version:content-transfer-encoding;
 bh=bX3C1i5WsZ7SJqCO6bpk/3blkgJVcB2nQR+k35beLig=;
 b=Z8DtMGS9FFauiMwLS/MNLFTPkFW2l/DGxM2D+IQ89c4dIfeiMZXKnXRewsp01ssyVs
 1cxNvUn+pdX0zQ5G5ZBIBYigpZH4XZ5UJ93kFQbXsr7JKZ/dMx2+dORQBRVbJO7yTV9Z
 qRddY2xx2BNx0/CqPIWZlWetlU4U2Z8hoWxqc7qu18hR72O1eB4ca/NhcX2euEISwlwP
 FrtV6wpeue0McuPCHXjdvebg/S1yr3/K5xAicXNw5EQzbGQx9wI329SczcEfUs81mJit
 nABq5aHMXMpcqk8AKe2YIe9qfx0I/2lP6eshx6pBe++MLG/UuQc0ydRpRcudG2RW+uaH
 tWYw==
X-Gm-Message-State: AOAM531uI35fKljclzkgnTVoQVUgnZPbJ4l7b4wovD/K6R+qhdOTjxU+
 XMqQO5fgFugCygFLD117vsooam6DN2RmKK5GgpexN74HtJdMiTY6TvQ5tU238PC0A283M1wbW8k
 Fbso68+EZURJMBORY6A==
X-Received: by 2002:a05:6214:2482:: with SMTP id
 gi2mr40333832qvb.59.1637018142703; 
 Mon, 15 Nov 2021 15:15:42 -0800 (PST)
X-Google-Smtp-Source: ABdhPJxoosGU2ov5aFhd6tHEswefDT1l3bQ9FkvLBXyqL//LQZcVGZ6MNAfqS9rEI9snfENeqi5P/A==
X-Received: by 2002:a05:6214:2482:: with SMTP id
 gi2mr40333793qvb.59.1637018142371; 
 Mon, 15 Nov 2021 15:15:42 -0800 (PST)
Received: from t14s.localdomain (c-73-69-212-193.hsd1.ma.comcast.net.
 [73.69.212.193])
 by smtp.gmail.com with ESMTPSA id w14sm91033qkp.54.2021.11.15.15.15.41
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Mon, 15 Nov 2021 15:15:41 -0800 (PST)
Message-ID: <a4463efe9cd50165a6fcd1ca05470a03595792f3.camel@redhat.com>
Subject: Re: [PATCH] libcpp: Implement -Wbidi-chars for CVE-2021-42574
 [PR103026]
From: David Malcolm <dmalcolm@redhat.com>
To: Marek Polacek <polacek@redhat.com>, Joseph Myers <joseph@codesourcery.com>
Cc: Jakub Jelinek <jakub@redhat.com>, Martin Sebor <msebor@redhat.com>, GCC
 Patches <gcc-patches@gcc.gnu.org>
Date: Mon, 15 Nov 2021 18:15:40 -0500
In-Reply-To: <YZKY2b8utf/5/f6o@redhat.com>
References: <20211101163652.36794-1-polacek@redhat.com>
 <alpine.DEB.2.22.394.2111012158330.1978253@digraph.polyomino.org.uk>
 <YYFzA3Z4rPwTo927@redhat.com>
 <49fc8d97-4cc7-5c78-d65a-02ea1419f483@gmail.com>
 <YYGW5eiZhSxq6aHl@redhat.com> <YYmXt6jB3Qq3qPIW@redhat.com>
 <YZKY2b8utf/5/f6o@redhat.com>
User-Agent: Evolution 3.38.4 (3.38.4-1.fc33)
MIME-Version: 1.0
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, KAM_SHORT,
 RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H2, SPF_HELO_NONE, SPF_NONE,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Mon, 15 Nov 2021 23:15:49 -0000

> On Mon, Nov 08, 2021 at 04:33:43PM -0500, Marek Polacek wrote:
> > Ping, can we conclude on the name?   IMHO, -Wbidirectional is just fine,
> > but changing the name is a trivial operation. 
> 
> Here's a patch with a better name (suggested by Jonathan W.).  Otherwise no
> changes.

Thanks for implementing this.

> 
> Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?
> 
> -- >8 --
> From a link below:
> "An issue was discovered in the Bidirectional Algorithm in the Unicode
> Specification through 14.0. It permits the visual reordering of
> characters via control sequences, which can be used to craft source code
> that renders different logic than the logical ordering of tokens
> ingested by compilers and interpreters. Adversaries can leverage this to
> encode source code for compilers accepting Unicode such that targeted
> vulnerabilities are introduced invisibly to human reviewers."
> 
> More info:
> https://nvd.nist.gov/vuln/detail/CVE-2021-42574
> https://trojansource.codes/
> 
> This is not a compiler bug.  However, to mitigate the problem, this patch
> implements -Wbidi-chars=[none|unpaired|any] to warn about possibly
> misleading Unicode bidirectional characters the preprocessor may encounter.
> 
> The default is =unpaired, which warns about improperly terminated
> bidirectional characters; e.g. a LRE without its appertaining PDF.  The

I like the default.

Wording nit: maybe use "corresponding" rather than "appertaining"; I
believe the latter has a sense that one is part of the other, when they
are more like peers.

> level =any warns about any use of bidirectional characters.

Terminology nit:
The patch is referring to "bidirectional characters", but I think the
term "bidirectional control characters" would be better.

For example, a passage of text containing both numbers and characters
in a right-to-left script could be considered "bidirectional", since
the numbers are written from left-to-right.

Specifically, the patch looks for these specific characters:
  * U+202A LEFT-TO-RIGHT EMBEDDING
  * U+202B RIGHT-TO-LEFT EMBEDDING
  * U+202C POP DIRECTIONAL FORMATTING
  * U+202D LEFT-TO-RIGHT OVERRIDE
  * U+202E RIGHT-TO-LEFT OVERRIDE
  * U+2066 LEFT-TO-RIGHT ISOLATE
  * U+2067 RIGHT-TO-LEFT ISOLATE
  * U+2068 FIRST STRONG ISOLATE
  * U+2069 POP DIRECTIONAL ISOLATE

However, the following characters could also be considered as
"bidirectional control characters":
  * U+200E ‎LEFT-TO-RIGHT MARK (UTF-8: E2 80 8E)
  * U+200F ‎RIGHT-TO-LEFT MARK (UTF-8: E2 80 8F)
but aren't checked for in the patch.  Should they be?  I can imagine
ways in which they could be abused, so I think so.

 
[...snip...]

> diff --git a/gcc/c-family/c.opt b/gcc/c-family/c.opt
> index 06457ac739e..b047df0f125 100644
> --- a/gcc/c-family/c.opt
> +++ b/gcc/c-family/c.opt
> @@ -374,6 +374,30 @@ Wbad-function-cast
>  C ObjC Var(warn_bad_function_cast) Warning
>  Warn about casting functions to incompatible types.
>  
> +Wbidi-chars
> +C ObjC C++ ObjC++ Warning Alias(Wbidi-chars=,any,none)
> +;
> +
> +Wbidi-chars=
> +C ObjC C++ ObjC++ RejectNegative Joined Warning CPP(cpp_warn_bidirectional) CppReason(CPP_W_BIDIRECTIONAL) Var(warn_bidirectional) Init(bidirectional_unpaired) Enum(cpp_bidirectional_level)
> +-Wbidi-chars=[none|unpaired|any] Warn about UTF-8 bidirectional characters.

"control characters"

[...snip...]

>  
> +@item -Wbidi-chars=@r{[}none@r{|}unpaired@r{|}any@r{]}
> +@opindex Wbidi-chars=
> +@opindex Wbidi-chars
> +@opindex Wno-bidi-chars
> +Warn about possibly misleading UTF-8 bidirectional characters in comments,

(and here again)

> +string literals, character constants, and identifiers.  Such characters can
> +change left-to-right writing direction into right-to-left (and vice versa),
> +which can cause confusion between the logical order and visual order.  This
> +may be dangerous; for instance, it may seem that a piece of code is not
> +commented out, whereas it in fact is.
> +
> +There are three levels of warning supported by GCC@.  The default is
> +@option{-Wbidi-chars=unpaired}, which warns about improperly terminated
> +bidi contexts.  @option{-Wbidi-chars=none} turns the warning off.
> +@option{-Wbidi-chars=any} warns about any use of bidirectional characters.

(and again)

[...snip...]


> diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-4.c b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c
> new file mode 100644
> index 00000000000..9fd4bc535ca
> --- /dev/null
> +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c
> @@ -0,0 +1,166 @@
> +/* PR preprocessor/103026 */
> +/* { dg-do compile } */
> +/* { dg-options "-Wbidi-chars=any -Wno-multichar -Wno-overflow" } */
> +/* Test all bidi chars in various contexts (identifiers, comments,
> +   string literals, character constants), both UCN and UTF-8.  The bidi
> +   chars here are properly terminated, except for the character constants.  */
> +
> +/* a b c LRE‪ 1 2 3 PDF‬ x y z */
> +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */
> +/* a b c RLE‫ 1 2 3 PDF‬ x y z */
> +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */
> +/* a b c LRO‭ 1 2 3 PDF‬ x y z */
> +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */
> +/* a b c RLO‮ 1 2 3 PDF‬ x y z */
> +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */
> +/* a b c LRI⁦ 1 2 3 PDI⁩ x y z */
> +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */
> +/* a b c RLI⁧ 1 2 3 PDI⁩ x y */
> +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */
> +/* a b c FSI⁨ 1 2 3 PDI⁩ x y z */
> +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */

AIUI the Unicode bidirectionality algorithm works at the line level,
and so each line in a block comment should be checked individually for
unclossed bidi control chars, rather than a block comment as a whole. 
Hence I think the test case needs to have block comment test coverage
for:
- single line blocks
- first line of a multiline block comment
- middle line of a multiline block comment
- final line of a multiline block comment
but I think the patch as it stands is only checking for the first of
these four cases.

[...snip...]

> diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-5.c b/gcc/testsuite/c-c++-common/Wbidi-chars-5.c
> new file mode 100644
> index 00000000000..efb26309b68
> --- /dev/null
> +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-5.c
> @@ -0,0 +1,166 @@
> +/* PR preprocessor/103026 */
> +/* { dg-do compile } */
> +/* { dg-options "-Wbidi-chars=unpaired -Wno-multichar -Wno-overflow" } */
> +/* Test all bidi chars in various contexts (identifiers, comments,
> +   string literals, character constants), both UCN and UTF-8.  The bidi
> +   chars here are properly terminated, except for the character constants.  */

Similar comment as above re block comments

[...snip...]

> diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h
> index 176f8c5bbce..60cf08ddd35 100644
> --- a/libcpp/include/cpplib.h
> +++ b/libcpp/include/cpplib.h
> @@ -319,6 +319,17 @@ enum cpp_main_search
>    CMS_system,  /* Search the system INCLUDE path.  */
>  };
>  
> +/* The possible bidirectional characters checking levels, from least
> +   restrictive to most.  */
> +enum cpp_bidirectional_level {
> +  /* No checking.  */
> +  bidirectional_none,
> +  /* Only detect unpaired uses of bidirectional characters.  */
> +  bidirectional_unpaired,
> +  /* Detect any use of bidirectional characters.  */
> +  bidirectional_any
> +};

As before, "control characters".

[...snip...]

> @@ -539,6 +550,10 @@ struct cpp_options
>    /* True if warn about differences between C++98 and C++11.  */
>    bool cpp_warn_cxx11_compat;
>  
> +  /* Nonzero of bidirectional characters checking is on.  See enum

s/of/if/

and usual nit about "control characters".

> +     cpp_bidirectional_level.  */
> +  unsigned char cpp_warn_bidirectional;
> +
>    /* Dependency generation.  */
>    struct
>    {

[...snip...]

> diff --git a/libcpp/lex.c b/libcpp/lex.c
> index fa2253d41c3..3fb518e202b 100644
> --- a/libcpp/lex.c
> +++ b/libcpp/lex.c
> @@ -1164,6 +1164,300 @@ _cpp_process_line_notes (cpp_reader *pfile, int in_comment)
>      }
>  }
>  
> +namespace bidi {
> +  enum class kind {
> +    NONE, LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI
> +  };
> +
> +  /* All the UTF-8 encodings of bidi characters start with E2.  */
> +  constexpr uchar utf8_start = 0xe2;

Is there a difference between "constexpr" vs "const" here?  (sorry for
my ignorance)

> +
> +  /* A vector holding currently open bidi contexts.  We use a char for
> +     each context, its LSB is 1 if it represents a PDF context, 0 if it
> +     represents a PDI context.  The next bit is 1 if this context was open
> +     by a bidi character written as a UCN, and 0 when it was UTF-8.  */
> +  semi_embedded_vec <unsigned char, 16> vec;
> +
> +  /* Close the whole comment/identifier/string literal/character constant
> +     context.  */
> +  void on_close ()
> +  {
> +    vec.truncate (0);
> +  }
> +
> +  /* Pop the last element in the vector.  */
> +  void pop ()
> +  {
> +    unsigned int len = vec.count ();
> +    gcc_checking_assert (len > 0);
> +    vec.truncate (len - 1);
> +  }
> +
> +  /* Return the context of the Ith element.  */
> +  kind ctx_at (unsigned int i)
> +  {
> +    return (vec[i] & 1) ? kind::PDF : kind::PDI;
> +  }
> +
> +  /* Return which context is currently opened.  */
> +  kind current_ctx ()
> +  {
> +    unsigned int len = vec.count ();
> +    if (len == 0)
> +      return kind::NONE;
> +    return ctx_at (len - 1);
> +  }
> +
> +  /* Return true if the current context comes from a UCN origin, that is,
> +     the bidi char which started this bidi context was written as a UCN.  */
> +  bool current_ctx_ucn_p ()
> +  {
> +    unsigned int len = vec.count ();
> +    gcc_checking_assert (len > 0);
> +    return (vec[len - 1] >> 1) & 1;
> +  }
> +
> +  /* We've read a bidi char, update the current vector as necessary.  */
> +  void on_char (kind k, bool ucn_p)
> +  {
> +    switch (k)
> +      {
> +      case kind::LRE:
> +      case kind::RLE:
> +      case kind::LRO:
> +      case kind::RLO:
> +	vec.push (ucn_p ? 3u : 1u);
> +	break;
> +      case kind::LRI:
> +      case kind::RLI:
> +      case kind::FSI:
> +	vec.push (ucn_p ? 2u : 0u);
> +	break;

I don't like the hand-coded bit fields here, where bit 1 and bit 2 in
the above have special meaning, but aren't clearly labelled as such.

Please can you at least use some kind of constant/define to make clear
the meaning of the bits.  Even clearer would be bitfields; is there a
performance reason for not using them?  (though this code is only
called on bidi control chars, which presumably is a rare occurrence). 
My patch here:
  "[PATCH 2/2] Capture locations of bidi chars and underline ranges"
    https://gcc.gnu.org/pipermail/gcc-patches/2021-November/583160.html
did some refactoring of this patch, replacing hand-coded bit
manipulation with bitfields in a struct (as well as then using that as
a good place to stach location_t values, and then using these
locations).
Would it be helpful if I split that part of my patch out?


[...snip...]

> +/* Parse a sequence of 3 bytes starting with P and return its bidi code.  */
> +
> +static bidi::kind
> +get_bidi_utf8 (const unsigned char *const p)
> +{
> +  gcc_checking_assert (p[0] == bidi::utf8_start);
> +
> +  if (p[1] == 0x80)
> +    switch (p[2])

get_bidi_utf8 accesss up to 2 bytes beyond "p"...

[...snip...]

...and is called in various places such as...

> @@ -1218,6 +1519,13 @@ _cpp_skip_block_comment (cpp_reader *pfile)
>  
>  	  cur = buffer->cur;
>  	}
> +      /* If this is a beginning of a UTF-8 encoding, it might be
> +	 a bidirectional character.  */
> +      else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p)
> +	{
> +	  bidi::kind kind = get_bidi_utf8 (cur - 1);
> +	  maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/false);
> +	}

Are we guaranteed to have a '\n' at the end of the buffer?  (even for
the final line of the file)  That would ensure that we don't read past
the end of the buffer.

Can we have testcases involving malformed UTF-8, in which, say:
- the final byte of the input file is 0xe2
- the final two bytes of the input file are 0xe2 0x80
for each of block comment, C++-style comment, string-literal,
identifier, etc?
(or is that overkill?)

[...snip...]

> @@ -1505,13 +1855,17 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn,
>      {
>        /* Slower version for identifiers containing UCNs
>  	 or extended chars (including $).  */
> -      do {
> -	while (ISIDNUM (*pfile->buffer->cur))
> -	  {
> -	    NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur);
> -	    pfile->buffer->cur++;
> -	  }
> -      } while (forms_identifier_p (pfile, false, nst));
> +      do
> +	{
> +	  while (ISIDNUM (*pfile->buffer->cur))
> +	    {
> +	      NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur);
> +	      pfile->buffer->cur++;
> +	    }
> +	}
> +      while (forms_identifier_p (pfile, false, nst));

Is the above purely a whitespace change?


> +      if (warn_bidi_p)
> +	maybe_warn_bidi_on_close (pfile, pfile->buffer->cur);
>        result = _cpp_interpret_identifier (pfile, base,
>  					  pfile->buffer->cur - base);
>        *spelling = cpp_lookup (pfile, base, pfile->buffer->cur - base);
> @@ -1758,6 +2112,8 @@ static void
>  lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
>  {
>    const uchar *pos = base;
> +  const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional)
> +			    != bidirectional_none);

There are lots of places where the patch uses:

const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional)
> +			    != bidirectional_none);

Maybe make it an inline member function:

  bool warn_bidi_p () const
  {
    return CPP_OPTION (this, cpp_warn_bidirectional) != bidirectional_none;
  }

so that these can all be:

  const bool warn_bidi_p = pfile->warn_bidi_p ();

?

[...snip...]

Hope this is constructive; thanks again for the patch
Dave