From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by sourceware.org (Postfix) with ESMTPS id 609D13858037 for ; Tue, 16 Nov 2021 23:01:20 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 609D13858037 Received: from mail-qk1-f197.google.com (mail-qk1-f197.google.com [209.85.222.197]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-141-wLwrGIWBMSSohKAWq_ZgqA-1; Tue, 16 Nov 2021 18:01:15 -0500 X-MC-Unique: wLwrGIWBMSSohKAWq_ZgqA-1 Received: by mail-qk1-f197.google.com with SMTP id s184-20020ae9dec1000000b00462de7f85b9so404334qkf.3 for ; Tue, 16 Nov 2021 15:01:15 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:message-id:subject:from:to:cc:date:in-reply-to :references:user-agent:mime-version:content-transfer-encoding; bh=1ob4RPfOxhrH49RdRM1YWr+++7DsXa1f26hTMw+D78c=; b=reGJu5TcUwJiy96bRwVDFQcz/Dea8hYCq1UYzvAMifE1YNF7BlQvbR5mTfBrtKsOfO biz09PvC+UWJxp5DnTdTfSePECqcs5agaBpP4pLVzXY8tbAgc9UevT/0m3PHD1KwOpUh /2pIkwqdD+GaAVMxqAXbFKAxdqg1rn2JOIh93h52OmoI4zvf/PHcQkHhpa1rBCIlvF0y qMirx9m73vzoExJ7fJ4ERHtZSqglMBSKceFpVbxFga8VXMZ2W2owal/U5F+Hiwx7B1l/ scP6bV/XGUvASHlhB/NkRgjuO06xQ1r7Y2nX2ZmiLlMsm1LWkRQSyy0Iqwu7CDr6jDbK KyuA== X-Gm-Message-State: AOAM533Q76kMsGdVrlAhhFRWbavplOjWFAuK5MSrsHXHeQ1lYNH90LqK a3ajyerLry1hwbuo/P64btYDmzwxgGQe+zuArqzV18moNgYEmGi//2UwNzDwjC8A8FKIn39oj7B dlqpFCpq6Dmt2t637kQ== X-Received: by 2002:ad4:5bc6:: with SMTP id t6mr49972771qvt.15.1637103674067; Tue, 16 Nov 2021 15:01:14 -0800 (PST) X-Google-Smtp-Source: ABdhPJwcasqZLeEd+QjrsmwTxO8FTBDEUhwHnGMF6uoA1ovCC4ZYyNSELvywDA08uyGHYba866ZvBA== X-Received: by 2002:ad4:5bc6:: with SMTP id t6mr49972714qvt.15.1637103673650; Tue, 16 Nov 2021 15:01:13 -0800 (PST) Received: from t14s.localdomain (c-73-69-212-193.hsd1.ma.comcast.net. [73.69.212.193]) by smtp.gmail.com with ESMTPSA id m68sm9055487qkb.105.2021.11.16.15.00.59 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 16 Nov 2021 15:01:07 -0800 (PST) Message-ID: <5422a6074f9a8877158b733c5acb2a9eecd4c000.camel@redhat.com> Subject: Re: [PATCH v2] libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026] From: David Malcolm To: Marek Polacek Cc: Joseph Myers , Jakub Jelinek , Martin Sebor , GCC Patches Date: Tue, 16 Nov 2021 18:00:58 -0500 In-Reply-To: References: <20211101163652.36794-1-polacek@redhat.com> <49fc8d97-4cc7-5c78-d65a-02ea1419f483@gmail.com> User-Agent: Evolution 3.38.4 (3.38.4-1.fc33) MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, KAM_SHORT, RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H4, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_NONE, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 16 Nov 2021 23:01:23 -0000 > On Mon, Nov 15, 2021 at 06:15:40PM -0500, David Malcolm wrote: > > > On Mon, Nov 08, 2021 at 04:33:43PM -0500, Marek Polacek wrote: > > > > Ping, can we conclude on the name? IMHO, -Wbidirectional is just fine, > > > > but changing the name is a trivial operation. > > > > > > Here's a patch with a better name (suggested by Jonathan W.). Otherwise no > > > changes. > > > > Thanks for implementing this. > > > > > > > > Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk? > > > > > > -- >8 -- > > > From a link below: > > > "An issue was discovered in the Bidirectional Algorithm in the Unicode > > > Specification through 14.0. It permits the visual reordering of > > > characters via control sequences, which can be used to craft source code > > > that renders different logic than the logical ordering of tokens > > > ingested by compilers and interpreters. Adversaries can leverage this to > > > encode source code for compilers accepting Unicode such that targeted > > > vulnerabilities are introduced invisibly to human reviewers." > > > > > > More info: > > > https://nvd.nist.gov/vuln/detail/CVE-2021-42574 > > > https://trojansource.codes/ > > > > > > This is not a compiler bug. However, to mitigate the problem, this patch > > > implements -Wbidi-chars=[none|unpaired|any] to warn about possibly > > > misleading Unicode bidirectional characters the preprocessor may encounter. [...snip...] > > > > Terminology nit: > > The patch is referring to "bidirectional characters", but I think the > > term "bidirectional control characters" would be better. > > Adjusted. Thanks. I wonder if the warning should be -Wbidi-control-chars, but I don't care enough to insist on it being changed. > > > For example, a passage of text containing both numbers and characters > > in a right-to-left script could be considered "bidirectional", since > > the numbers are written from left-to-right. > > > > Specifically, the patch looks for these specific characters: > > * U+202A LEFT-TO-RIGHT EMBEDDING > > * U+202B RIGHT-TO-LEFT EMBEDDING > > * U+202C POP DIRECTIONAL FORMATTING > > * U+202D LEFT-TO-RIGHT OVERRIDE > > * U+202E RIGHT-TO-LEFT OVERRIDE > > * U+2066 LEFT-TO-RIGHT ISOLATE > > * U+2067 RIGHT-TO-LEFT ISOLATE > > * U+2068 FIRST STRONG ISOLATE > > * U+2069 POP DIRECTIONAL ISOLATE > > > > However, the following characters could also be considered as > > "bidirectional control characters": > > * U+200E ‎LEFT-TO-RIGHT MARK (UTF-8: E2 80 8E) > > * U+200F ‎RIGHT-TO-LEFT MARK (UTF-8: E2 80 8F) > > but aren't checked for in the patch. Should they be? I can imagine > > ways in which they could be abused, so I think so. > > I'd only intended to check the bidi chars described in the original > trojan source pdf, but I added checking for U+200E/U+200F too, since > it was easy enough. AFAIK they aren't popped by a PDF/PDI like the > rest, so don't need to go on the vec, and so we only warn with =any. > Tests: Wbidi-chars-16.c + Wbidi-chars-17.c Thanks. I took a look through the revised patch and I think you updated things correctly. [...snip...] > > > diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-4.c b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c > > > new file mode 100644 > > > index 00000000000..9fd4bc535ca > > > --- /dev/null > > > +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c > > > @@ -0,0 +1,166 @@ > > > +/* PR preprocessor/103026 */ > > > +/* { dg-do compile } */ > > > +/* { dg-options "-Wbidi-chars=any -Wno-multichar -Wno-overflow" } */ > > > +/* Test all bidi chars in various contexts (identifiers, comments, > > > + string literals, character constants), both UCN and UTF-8. The bidi > > > + chars here are properly terminated, except for the character constants. */ > > > + > > > +/* a b c LRE‪ 1 2 3 PDF‬ x y z */ > > > +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ > > > +/* a b c RLE‫ 1 2 3 PDF‬ x y z */ > > > +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ > > > +/* a b c LRO‭ 1 2 3 PDF‬ x y z */ > > > +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ > > > +/* a b c RLO‮ 1 2 3 PDF‬ x y z */ > > > +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ > > > +/* a b c LRI⁦ 1 2 3 PDI⁩ x y z */ > > > +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ > > > +/* a b c RLI⁧ 1 2 3 PDI⁩ x y */ > > > +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ > > > +/* a b c FSI⁨ 1 2 3 PDI⁩ x y z */ > > > +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ > > > > AIUI the Unicode bidirectionality algorithm works at the line level, > > and so each line in a block comment should be checked individually for > > unclossed bidi control chars, rather than a block comment as a whole. > > Hence I think the test case needs to have block comment test coverage > > for: > > - single line blocks > > - first line of a multiline block comment > > - middle line of a multiline block comment > > - final line of a multiline block comment > > but I think the patch as it stands is only checking for the first of > > these four cases. > > The patch handles all of them, because of: > 1534 if (warn_bidi_p) > 1535 maybe_warn_bidi_on_close (pfile, cur); > in _cpp_skip_block_comment, but I was lacking some more testing, so I've > added some testing, and included a new test: Wbidi-chars-15.c. All of the cases in Wbidi-chars-15.c only test for unparired chars in a middle line of a multiline block comment; I don't think the patch has any explicit coverage for unpaired control chars happening in the first line and last lines of *multiline* block comments. So it would be good if Wbidi-chars-15.c could gain some coverage for that (don't have to handle all the different chars). [...snip...] > > > diff --git a/libcpp/lex.c b/libcpp/lex.c > > > index fa2253d41c3..3fb518e202b 100644 > > > --- a/libcpp/lex.c > > > +++ b/libcpp/lex.c > > > @@ -1164,6 +1164,300 @@ _cpp_process_line_notes (cpp_reader *pfile, int in_comment) > > > } > > > } > > > > > > +namespace bidi { > > > + enum class kind { > > > + NONE, LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI > > > + }; > > > + > > > + /* All the UTF-8 encodings of bidi characters start with E2. */ > > > + constexpr uchar utf8_start = 0xe2; > > > > Is there a difference between "constexpr" vs "const" here? (sorry for > > my ignorance) > > I just wanted to make sure that utf8_start will be usable in contexts where > an integral constant expression is required. 'const' objects need not be > initialized with compile-time constants, but 'constexpr' objects do. Thanks. > > > + > > > + /* A vector holding currently open bidi contexts. We use a char for > > > + each context, its LSB is 1 if it represents a PDF context, 0 if it > > > + represents a PDI context. The next bit is 1 if this context was open > > > + by a bidi character written as a UCN, and 0 when it was UTF-8. */ > > > + semi_embedded_vec vec; > > > + > > > + /* Close the whole comment/identifier/string literal/character constant > > > + context. */ > > > + void on_close () > > > + { > > > + vec.truncate (0); > > > + } > > > + > > > + /* Pop the last element in the vector. */ > > > + void pop () > > > + { > > > + unsigned int len = vec.count (); > > > + gcc_checking_assert (len > 0); > > > + vec.truncate (len - 1); > > > + } > > > + > > > + /* Return the context of the Ith element. */ > > > + kind ctx_at (unsigned int i) > > > + { > > > + return (vec[i] & 1) ? kind::PDF : kind::PDI; > > > + } > > > + > > > + /* Return which context is currently opened. */ > > > + kind current_ctx () > > > + { > > > + unsigned int len = vec.count (); > > > + if (len == 0) > > > + return kind::NONE; > > > + return ctx_at (len - 1); > > > + } > > > + > > > + /* Return true if the current context comes from a UCN origin, that is, > > > + the bidi char which started this bidi context was written as a UCN. */ > > > + bool current_ctx_ucn_p () > > > + { > > > + unsigned int len = vec.count (); > > > + gcc_checking_assert (len > 0); > > > + return (vec[len - 1] >> 1) & 1; > > > + } > > > + > > > + /* We've read a bidi char, update the current vector as necessary. */ > > > + void on_char (kind k, bool ucn_p) > > > + { > > > + switch (k) > > > + { > > > + case kind::LRE: > > > + case kind::RLE: > > > + case kind::LRO: > > > + case kind::RLO: > > > + vec.push (ucn_p ? 3u : 1u); > > > + break; > > > + case kind::LRI: > > > + case kind::RLI: > > > + case kind::FSI: > > > + vec.push (ucn_p ? 2u : 0u); > > > + break; > > > > I don't like the hand-coded bit fields here, where bit 1 and bit 2 in > > the above have special meaning, but aren't clearly labelled as such. > > > > Please can you at least use some kind of constant/define to make clear > > the meaning of the bits. Even clearer would be bitfields; is there a > > performance reason for not using them? (though this code is only > > called on bidi control chars, which presumably is a rare occurrence). > > My patch here: > > "[PATCH 2/2] Capture locations of bidi chars and underline ranges" > > https://gcc.gnu.org/pipermail/gcc-patches/2021-November/583160.html > > did some refactoring of this patch, replacing hand-coded bit > > manipulation with bitfields in a struct (as well as then using that as > > a good place to stach location_t values, and then using these > > locations). > > Would it be helpful if I split that part of my patch out? > > I think they are just fine here, given they are used only in bidi:: and > not outside of it. And I could just use a simple unsigned char in > semi_embedded_vec instead of inventing a new struct. > > Your diagnostic patch changes it because you need to remember a location > too, so we're changing it anyway, so I left it be so that you have fewer > conflicts. Fair enough; I'll post an updated version of my followup once yours goes in. > > [...snip...] > > > > > +/* Parse a sequence of 3 bytes starting with P and return its bidi code. */ > > > + > > > +static bidi::kind > > > +get_bidi_utf8 (const unsigned char *const p) > > > +{ > > > + gcc_checking_assert (p[0] == bidi::utf8_start); > > > + > > > + if (p[1] == 0x80) > > > + switch (p[2]) > > > > get_bidi_utf8 accesss up to 2 bytes beyond "p"... > > > > [...snip...] > > > > ...and is called in various places such as... > > > > > @@ -1218,6 +1519,13 @@ _cpp_skip_block_comment (cpp_reader *pfile) > > > > > > cur = buffer->cur; > > > } > > > + /* If this is a beginning of a UTF-8 encoding, it might be > > > + a bidirectional character. */ > > > + else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p) > > > + { > > > + bidi::kind kind = get_bidi_utf8 (cur - 1); > > > + maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/false); > > > + } > > > > Are we guaranteed to have a '\n' at the end of the buffer? (even for > > the final line of the file) That would ensure that we don't read past > > the end of the buffer. > > We've discussed this in our internal thread; I think I said that you > will always get a '\n' so this is not going to read past the end of > the buffer. To be sure... > > > Can we have testcases involving malformed UTF-8, in which, say: > > - the final byte of the input file is 0xe2 > > - the final two bytes of the input file are 0xe2 0x80 > > for each of block comment, C++-style comment, string-literal, > > identifier, etc? > > (or is that overkill?) > > ...I'd crafted a malformed text file using hexedit but couldn't get > it to crash. I'd rather not include it in the testsuite though. Fair enough. > > [...snip...] > > > > > @@ -1505,13 +1855,17 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn, > > > { > > > /* Slower version for identifiers containing UCNs > > > or extended chars (including $). */ > > > - do { > > > - while (ISIDNUM (*pfile->buffer->cur)) > > > - { > > > - NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur); > > > - pfile->buffer->cur++; > > > - } > > > - } while (forms_identifier_p (pfile, false, nst)); > > > + do > > > + { > > > + while (ISIDNUM (*pfile->buffer->cur)) > > > + { > > > + NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur); > > > + pfile->buffer->cur++; > > > + } > > > + } > > > + while (forms_identifier_p (pfile, false, nst)); > > > > Is the above purely a whitespace change? > > Yes. If I'm reading things correctly, these lines in the existing code were correctly indented, so is there a purpose to this change? If not, please can you remove this change from the patch (to minimize the change to the history). [...snip...] > +/* We're closing a bidi context, that is, we've encountered a newline, > + are closing a C-style comment, or are at the end of a string literal, > + character constant, or identifier. Warn if this context was not > + properly terminated by a PDI or PDF. P points to the last character > + in this context. */ > + > +static void > +maybe_warn_bidi_on_close (cpp_reader *pfile, const uchar *p) > +{ > + if (CPP_OPTION (pfile, cpp_warn_bidirectional) == bidirectional_unpaired > + && bidi::vec.count () > 0) > + { > + const location_t loc > + = linemap_position_for_column (pfile->line_table, > + CPP_BUF_COLUMN (pfile->buffer, p)); > + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, > + "unpaired UTF-8 bidirectional character " > + "detected"); > + } Sorry, I missed this one in my initial review, should be "control character" here. [...snip...] OK for trunk with the above nits fixed. Thanks Dave