From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dmalcolm@redhat.com>
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.129.124])
 by sourceware.org (Postfix) with ESMTPS id 609D13858037
 for <gcc-patches@gcc.gnu.org>; Tue, 16 Nov 2021 23:01:20 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 609D13858037
Received: from mail-qk1-f197.google.com (mail-qk1-f197.google.com
 [209.85.222.197]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
 us-mta-141-wLwrGIWBMSSohKAWq_ZgqA-1; Tue, 16 Nov 2021 18:01:15 -0500
X-MC-Unique: wLwrGIWBMSSohKAWq_ZgqA-1
Received: by mail-qk1-f197.google.com with SMTP id
 s184-20020ae9dec1000000b00462de7f85b9so404334qkf.3
 for <gcc-patches@gcc.gnu.org>; Tue, 16 Nov 2021 15:01:15 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=x-gm-message-state:message-id:subject:from:to:cc:date:in-reply-to
 :references:user-agent:mime-version:content-transfer-encoding;
 bh=1ob4RPfOxhrH49RdRM1YWr+++7DsXa1f26hTMw+D78c=;
 b=reGJu5TcUwJiy96bRwVDFQcz/Dea8hYCq1UYzvAMifE1YNF7BlQvbR5mTfBrtKsOfO
 biz09PvC+UWJxp5DnTdTfSePECqcs5agaBpP4pLVzXY8tbAgc9UevT/0m3PHD1KwOpUh
 /2pIkwqdD+GaAVMxqAXbFKAxdqg1rn2JOIh93h52OmoI4zvf/PHcQkHhpa1rBCIlvF0y
 qMirx9m73vzoExJ7fJ4ERHtZSqglMBSKceFpVbxFga8VXMZ2W2owal/U5F+Hiwx7B1l/
 scP6bV/XGUvASHlhB/NkRgjuO06xQ1r7Y2nX2ZmiLlMsm1LWkRQSyy0Iqwu7CDr6jDbK
 KyuA==
X-Gm-Message-State: AOAM533Q76kMsGdVrlAhhFRWbavplOjWFAuK5MSrsHXHeQ1lYNH90LqK
 a3ajyerLry1hwbuo/P64btYDmzwxgGQe+zuArqzV18moNgYEmGi//2UwNzDwjC8A8FKIn39oj7B
 dlqpFCpq6Dmt2t637kQ==
X-Received: by 2002:ad4:5bc6:: with SMTP id t6mr49972771qvt.15.1637103674067; 
 Tue, 16 Nov 2021 15:01:14 -0800 (PST)
X-Google-Smtp-Source: ABdhPJwcasqZLeEd+QjrsmwTxO8FTBDEUhwHnGMF6uoA1ovCC4ZYyNSELvywDA08uyGHYba866ZvBA==
X-Received: by 2002:ad4:5bc6:: with SMTP id t6mr49972714qvt.15.1637103673650; 
 Tue, 16 Nov 2021 15:01:13 -0800 (PST)
Received: from t14s.localdomain (c-73-69-212-193.hsd1.ma.comcast.net.
 [73.69.212.193])
 by smtp.gmail.com with ESMTPSA id m68sm9055487qkb.105.2021.11.16.15.00.59
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Tue, 16 Nov 2021 15:01:07 -0800 (PST)
Message-ID: <5422a6074f9a8877158b733c5acb2a9eecd4c000.camel@redhat.com>
Subject: Re: [PATCH v2] libcpp: Implement -Wbidi-chars for CVE-2021-42574
 [PR103026]
From: David Malcolm <dmalcolm@redhat.com>
To: Marek Polacek <polacek@redhat.com>
Cc: Joseph Myers <joseph@codesourcery.com>, Jakub Jelinek
 <jakub@redhat.com>,  Martin Sebor <msebor@redhat.com>, GCC Patches
 <gcc-patches@gcc.gnu.org>
Date: Tue, 16 Nov 2021 18:00:58 -0500
In-Reply-To: <YZQLlHfq740IOtje@redhat.com>
References: <20211101163652.36794-1-polacek@redhat.com>
 <alpine.DEB.2.22.394.2111012158330.1978253@digraph.polyomino.org.uk>
 <YYFzA3Z4rPwTo927@redhat.com>
 <49fc8d97-4cc7-5c78-d65a-02ea1419f483@gmail.com>
 <YYGW5eiZhSxq6aHl@redhat.com> <YYmXt6jB3Qq3qPIW@redhat.com>
 <YZKY2b8utf/5/f6o@redhat.com>
 <a4463efe9cd50165a6fcd1ca05470a03595792f3.camel@redhat.com>
 <YZQLlHfq740IOtje@redhat.com>
User-Agent: Evolution 3.38.4 (3.38.4-1.fc33)
MIME-Version: 1.0
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, KAM_SHORT,
 RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H4, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE,
 SPF_NONE, TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Tue, 16 Nov 2021 23:01:23 -0000

> On Mon, Nov 15, 2021 at 06:15:40PM -0500, David Malcolm wrote:
> > > On Mon, Nov 08, 2021 at 04:33:43PM -0500, Marek Polacek wrote:
> > > > Ping, can we conclude on the name?   IMHO, -Wbidirectional is just fine,
> > > > but changing the name is a trivial operation. 
> > > 
> > > Here's a patch with a better name (suggested by Jonathan W.).  Otherwise no
> > > changes.
> > 
> > Thanks for implementing this.
> > 
> > > 
> > > Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?
> > > 
> > > -- >8 --
> > > From a link below:
> > > "An issue was discovered in the Bidirectional Algorithm in the Unicode
> > > Specification through 14.0. It permits the visual reordering of
> > > characters via control sequences, which can be used to craft source code
> > > that renders different logic than the logical ordering of tokens
> > > ingested by compilers and interpreters. Adversaries can leverage this to
> > > encode source code for compilers accepting Unicode such that targeted
> > > vulnerabilities are introduced invisibly to human reviewers."
> > > 
> > > More info:
> > > https://nvd.nist.gov/vuln/detail/CVE-2021-42574
> > > https://trojansource.codes/
> > > 
> > > This is not a compiler bug.  However, to mitigate the problem, this patch
> > > implements -Wbidi-chars=[none|unpaired|any] to warn about possibly
> > > misleading Unicode bidirectional characters the preprocessor may encounter.

[...snip...]

> > 
> > Terminology nit:
> > The patch is referring to "bidirectional characters", but I think the
> > term "bidirectional control characters" would be better.
> 
> Adjusted.

Thanks.

I wonder if the warning should be -Wbidi-control-chars, but I don't
care enough to insist on it being changed.

>  
> > For example, a passage of text containing both numbers and characters
> > in a right-to-left script could be considered "bidirectional", since
> > the numbers are written from left-to-right.
> > 
> > Specifically, the patch looks for these specific characters:
> >   * U+202A LEFT-TO-RIGHT EMBEDDING
> >   * U+202B RIGHT-TO-LEFT EMBEDDING
> >   * U+202C POP DIRECTIONAL FORMATTING
> >   * U+202D LEFT-TO-RIGHT OVERRIDE
> >   * U+202E RIGHT-TO-LEFT OVERRIDE
> >   * U+2066 LEFT-TO-RIGHT ISOLATE
> >   * U+2067 RIGHT-TO-LEFT ISOLATE
> >   * U+2068 FIRST STRONG ISOLATE
> >   * U+2069 POP DIRECTIONAL ISOLATE
> > 
> > However, the following characters could also be considered as
> > "bidirectional control characters":
> >   * U+200E ‎LEFT-TO-RIGHT MARK (UTF-8: E2 80 8E)
> >   * U+200F ‎RIGHT-TO-LEFT MARK (UTF-8: E2 80 8F)
> > but aren't checked for in the patch.  Should they be?  I can imagine
> > ways in which they could be abused, so I think so.
> 
> I'd only intended to check the bidi chars described in the original
> trojan source pdf, but I added checking for U+200E/U+200F too, since
> it was easy enough.  AFAIK they aren't popped by a PDF/PDI like the
> rest, so don't need to go on the vec, and so we only warn with =any.
> Tests: Wbidi-chars-16.c + Wbidi-chars-17.c

Thanks.  I took a look through the revised patch and I think you
updated things correctly.

[...snip...]

> > > diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-4.c b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c
> > > new file mode 100644
> > > index 00000000000..9fd4bc535ca
> > > --- /dev/null
> > > +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c
> > > @@ -0,0 +1,166 @@
> > > +/* PR preprocessor/103026 */
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-Wbidi-chars=any -Wno-multichar -Wno-overflow" } */
> > > +/* Test all bidi chars in various contexts (identifiers, comments,
> > > +   string literals, character constants), both UCN and UTF-8.  The bidi
> > > +   chars here are properly terminated, except for the character constants.  */
> > > +
> > > +/* a b c LRE‪ 1 2 3 PDF‬ x y z */
> > > +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */
> > > +/* a b c RLE‫ 1 2 3 PDF‬ x y z */
> > > +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */
> > > +/* a b c LRO‭ 1 2 3 PDF‬ x y z */
> > > +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */
> > > +/* a b c RLO‮ 1 2 3 PDF‬ x y z */
> > > +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */
> > > +/* a b c LRI⁦ 1 2 3 PDI⁩ x y z */
> > > +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */
> > > +/* a b c RLI⁧ 1 2 3 PDI⁩ x y */
> > > +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */
> > > +/* a b c FSI⁨ 1 2 3 PDI⁩ x y z */
> > > +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */
> > 
> > AIUI the Unicode bidirectionality algorithm works at the line level,
> > and so each line in a block comment should be checked individually for
> > unclossed bidi control chars, rather than a block comment as a whole. 
> > Hence I think the test case needs to have block comment test coverage
> > for:
> > - single line blocks
> > - first line of a multiline block comment
> > - middle line of a multiline block comment
> > - final line of a multiline block comment
> > but I think the patch as it stands is only checking for the first of
> > these four cases.
> 
> The patch handles all of them, because of:
> 1534           if (warn_bidi_p)
> 1535             maybe_warn_bidi_on_close (pfile, cur);
> in _cpp_skip_block_comment, but I was lacking some more testing, so I've
> added some testing, and included a new test: Wbidi-chars-15.c.

All of the cases in Wbidi-chars-15.c only test for unparired chars in a
middle line of a multiline block comment; I don't think the patch has
any explicit coverage for unpaired control chars happening in the first
line and last lines of *multiline* block comments.  So it would be good
if Wbidi-chars-15.c could gain some coverage for that (don't have to
handle all the different chars).

[...snip...]

> > > diff --git a/libcpp/lex.c b/libcpp/lex.c
> > > index fa2253d41c3..3fb518e202b 100644
> > > --- a/libcpp/lex.c
> > > +++ b/libcpp/lex.c
> > > @@ -1164,6 +1164,300 @@ _cpp_process_line_notes (cpp_reader *pfile, int in_comment)
> > >      }
> > >  }
> > >  
> > > +namespace bidi {
> > > +  enum class kind {
> > > +    NONE, LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI
> > > +  };
> > > +
> > > +  /* All the UTF-8 encodings of bidi characters start with E2.  */
> > > +  constexpr uchar utf8_start = 0xe2;
> > 
> > Is there a difference between "constexpr" vs "const" here?  (sorry for
> > my ignorance)
> 
> I just wanted to make sure that utf8_start will be usable in contexts where
> an integral constant expression is required.  'const' objects need not be
> initialized with compile-time constants, but 'constexpr' objects do.

Thanks.

> > > +
> > > +  /* A vector holding currently open bidi contexts.  We use a char for
> > > +     each context, its LSB is 1 if it represents a PDF context, 0 if it
> > > +     represents a PDI context.  The next bit is 1 if this context was open
> > > +     by a bidi character written as a UCN, and 0 when it was UTF-8.  */
> > > +  semi_embedded_vec <unsigned char, 16> vec;
> > > +
> > > +  /* Close the whole comment/identifier/string literal/character constant
> > > +     context.  */
> > > +  void on_close ()
> > > +  {
> > > +    vec.truncate (0);
> > > +  }
> > > +
> > > +  /* Pop the last element in the vector.  */
> > > +  void pop ()
> > > +  {
> > > +    unsigned int len = vec.count ();
> > > +    gcc_checking_assert (len > 0);
> > > +    vec.truncate (len - 1);
> > > +  }
> > > +
> > > +  /* Return the context of the Ith element.  */
> > > +  kind ctx_at (unsigned int i)
> > > +  {
> > > +    return (vec[i] & 1) ? kind::PDF : kind::PDI;
> > > +  }
> > > +
> > > +  /* Return which context is currently opened.  */
> > > +  kind current_ctx ()
> > > +  {
> > > +    unsigned int len = vec.count ();
> > > +    if (len == 0)
> > > +      return kind::NONE;
> > > +    return ctx_at (len - 1);
> > > +  }
> > > +
> > > +  /* Return true if the current context comes from a UCN origin, that is,
> > > +     the bidi char which started this bidi context was written as a UCN.  */
> > > +  bool current_ctx_ucn_p ()
> > > +  {
> > > +    unsigned int len = vec.count ();
> > > +    gcc_checking_assert (len > 0);
> > > +    return (vec[len - 1] >> 1) & 1;
> > > +  }
> > > +
> > > +  /* We've read a bidi char, update the current vector as necessary.  */
> > > +  void on_char (kind k, bool ucn_p)
> > > +  {
> > > +    switch (k)
> > > +      {
> > > +      case kind::LRE:
> > > +      case kind::RLE:
> > > +      case kind::LRO:
> > > +      case kind::RLO:
> > > +	vec.push (ucn_p ? 3u : 1u);
> > > +	break;
> > > +      case kind::LRI:
> > > +      case kind::RLI:
> > > +      case kind::FSI:
> > > +	vec.push (ucn_p ? 2u : 0u);
> > > +	break;
> > 
> > I don't like the hand-coded bit fields here, where bit 1 and bit 2 in
> > the above have special meaning, but aren't clearly labelled as such.
> > 
> > Please can you at least use some kind of constant/define to make clear
> > the meaning of the bits.  Even clearer would be bitfields; is there a
> > performance reason for not using them?  (though this code is only
> > called on bidi control chars, which presumably is a rare occurrence). 
> > My patch here:
> >   "[PATCH 2/2] Capture locations of bidi chars and underline ranges"
> >     https://gcc.gnu.org/pipermail/gcc-patches/2021-November/583160.html
> > did some refactoring of this patch, replacing hand-coded bit
> > manipulation with bitfields in a struct (as well as then using that as
> > a good place to stach location_t values, and then using these
> > locations).
> > Would it be helpful if I split that part of my patch out?
>  
> I think they are just fine here, given they are used only in bidi:: and
> not outside of it.  And I could just use a simple unsigned char in
> semi_embedded_vec instead of inventing a new struct.
> 
> Your diagnostic patch changes it because you need to remember a location
> too, so we're changing it anyway, so I left it be so that you have fewer
> conflicts.

Fair enough; I'll post an updated version of my followup once yours
goes in.

> > [...snip...]
> > 
> > > +/* Parse a sequence of 3 bytes starting with P and return its bidi code.  */
> > > +
> > > +static bidi::kind
> > > +get_bidi_utf8 (const unsigned char *const p)
> > > +{
> > > +  gcc_checking_assert (p[0] == bidi::utf8_start);
> > > +
> > > +  if (p[1] == 0x80)
> > > +    switch (p[2])
> > 
> > get_bidi_utf8 accesss up to 2 bytes beyond "p"...
> > 
> > [...snip...]
> > 
> > ...and is called in various places such as...
> > 
> > > @@ -1218,6 +1519,13 @@ _cpp_skip_block_comment (cpp_reader *pfile)
> > >  
> > >  	  cur = buffer->cur;
> > >  	}
> > > +      /* If this is a beginning of a UTF-8 encoding, it might be
> > > +	 a bidirectional character.  */
> > > +      else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p)
> > > +	{
> > > +	  bidi::kind kind = get_bidi_utf8 (cur - 1);
> > > +	  maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/false);
> > > +	}
> > 
> > Are we guaranteed to have a '\n' at the end of the buffer?  (even for
> > the final line of the file)  That would ensure that we don't read past
> > the end of the buffer.
> 
> We've discussed this in our internal thread; I think I said that you
> will always get a '\n' so this is not going to read past the end of
> the buffer.  To be sure...
>  
> > Can we have testcases involving malformed UTF-8, in which, say:
> > - the final byte of the input file is 0xe2
> > - the final two bytes of the input file are 0xe2 0x80
> > for each of block comment, C++-style comment, string-literal,
> > identifier, etc?
> > (or is that overkill?)
> 
> ...I'd crafted a malformed text file using hexedit but couldn't get
> it to crash.  I'd rather not include it in the testsuite though.

Fair enough.

> > [...snip...]
> > 
> > > @@ -1505,13 +1855,17 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn,
> > >      {
> > >        /* Slower version for identifiers containing UCNs
> > >  	 or extended chars (including $).  */
> > > -      do {
> > > -	while (ISIDNUM (*pfile->buffer->cur))
> > > -	  {
> > > -	    NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur);
> > > -	    pfile->buffer->cur++;
> > > -	  }
> > > -      } while (forms_identifier_p (pfile, false, nst));
> > > +      do
> > > +	{
> > > +	  while (ISIDNUM (*pfile->buffer->cur))
> > > +	    {
> > > +	      NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur);
> > > +	      pfile->buffer->cur++;
> > > +	    }
> > > +	}
> > > +      while (forms_identifier_p (pfile, false, nst));
> > 
> > Is the above purely a whitespace change?
> 
> Yes.

If I'm reading things correctly, these lines in the existing code were
correctly indented, so is there a purpose to this change?  If not,
please can you remove this change from the patch (to minimize the
change to the history).

[...snip...]

> +/* We're closing a bidi context, that is, we've encountered a newline,
> +   are closing a C-style comment, or are at the end of a string literal,
> +   character constant, or identifier.  Warn if this context was not
> +   properly terminated by a PDI or PDF.  P points to the last character
> +   in this context.  */
> +
> +static void
> +maybe_warn_bidi_on_close (cpp_reader *pfile, const uchar *p)
> +{
> +  if (CPP_OPTION (pfile, cpp_warn_bidirectional) == bidirectional_unpaired
> +      && bidi::vec.count () > 0)
> +    {
> +      const location_t loc
> +	= linemap_position_for_column (pfile->line_table,
> +				       CPP_BUF_COLUMN (pfile->buffer, p));
> +      cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0,
> +			     "unpaired UTF-8 bidirectional character "
> +			     "detected");
> +    }

Sorry, I missed this one in my initial review, should be "control
character" here.

[...snip...]

OK for trunk with the above nits fixed.

Thanks
Dave