From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <sbergman@redhat.com>
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
 by sourceware.org (Postfix) with ESMTPS id 49B8E3858408
 for <gcc-patches@gcc.gnu.org>; Tue, 30 Nov 2021 15:00:08 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 49B8E3858408
Received: from mail-wm1-f72.google.com (mail-wm1-f72.google.com
 [209.85.128.72]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
 us-mta-450-cp20vJ2iM4iohHfbYoOlpw-1; Tue, 30 Nov 2021 10:00:05 -0500
X-MC-Unique: cp20vJ2iM4iohHfbYoOlpw-1
Received: by mail-wm1-f72.google.com with SMTP id
 145-20020a1c0197000000b0032efc3eb9bcso13922600wmb.0
 for <gcc-patches@gcc.gnu.org>; Tue, 30 Nov 2021 07:00:04 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=x-gm-message-state:message-id:date:mime-version:user-agent:subject
 :content-language:to:cc:references:from:in-reply-to
 :content-transfer-encoding;
 bh=29Bz505DOxBNoMEPH9FkARJBvI1y74r9VmmeRii1+9E=;
 b=caK4LoDi4PwolGj1zt8pwh+fkb3XZ+7c4DIWWM5+/GoyoiFn6FIVLf1iVcp8NGY2L6
 JbGS4cT8va8ths4HGT6/CUFBjw2dMf4dtT7On35AD5ONYXwa4a/ZFdcbRwc1ZfTdd37J
 o0GYRwOVcKrT2P4Fncsxf8zjaJj3KzUwv2jNOFgUaIKvjLF8ZFHO195GyPTKf+Xd+IIe
 seMFAC47HSQTvwzeDUA7PUjY5DIpAE2dHtVtqAVzIzIw2IkZmA1rR9XVTFd7yi9plWfc
 4Twj8pq98eMi2X2FvAzlVsByoVBxcxEdOyl4Zk/H4Rd3YrrIOCFIK4VOcGD4VY99Wg7F
 7U4Q==
X-Gm-Message-State: AOAM530uqunxC2cJK8U5qh2h+pcwDTBQXbFPqvWEPlRdYWfy6R36ZY0O
 RLE2o19HAEGoKXZhhB1vY4T494pRcRKS7hTWolUcWfEsTiNzJO8X08ny0tCfCxvy4gsKBLvZD9A
 tGxuC1IVqPNnPHj88eA==
X-Received: by 2002:a5d:6085:: with SMTP id w5mr43009118wrt.122.1638284403349; 
 Tue, 30 Nov 2021 07:00:03 -0800 (PST)
X-Google-Smtp-Source: ABdhPJyAE78RHceKoV5BqTCxO8n7teAXj+cV3Sx4tYgcZPEL7/lP3l5Orp4yqTQJ1i5oOyFXpkvDtQ==
X-Received: by 2002:a5d:6085:: with SMTP id w5mr43009086wrt.122.1638284403066; 
 Tue, 30 Nov 2021 07:00:03 -0800 (PST)
Received: from [192.168.188.47]
 (dynamic-095-112-025-191.95.112.pool.telefonica.de. [95.112.25.191])
 by smtp.gmail.com with ESMTPSA id s63sm2838757wme.22.2021.11.30.07.00.02
 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
 Tue, 30 Nov 2021 07:00:02 -0800 (PST)
Message-ID: <e33e0c38-5a27-b3da-8483-136496223568@redhat.com>
Date: Tue, 30 Nov 2021 16:00:01 +0100
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
 Thunderbird/91.3.0
Subject: Re: [PATCH] libcpp: Implement -Wbidi-chars for CVE-2021-42574
 [PR103026]
To: Marek Polacek <polacek@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>, Martin Sebor <msebor@redhat.com>,
 GCC Patches <gcc-patches@gcc.gnu.org>, Joseph Myers <joseph@codesourcery.com>
References: <20211101163652.36794-1-polacek@redhat.com>
 <alpine.DEB.2.22.394.2111012158330.1978253@digraph.polyomino.org.uk>
 <YYFzA3Z4rPwTo927@redhat.com>
 <49fc8d97-4cc7-5c78-d65a-02ea1419f483@gmail.com>
 <YYGW5eiZhSxq6aHl@redhat.com> <YYmXt6jB3Qq3qPIW@redhat.com>
 <YZKY2b8utf/5/f6o@redhat.com>
 <1ab9d323-af24-c2a0-0c6a-596b3f68c605@redhat.com>
 <YaYmoSN/8PBEGavw@redhat.com>
From: Stephan Bergmann <sbergman@redhat.com>
In-Reply-To: <YaYmoSN/8PBEGavw@redhat.com>
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Language: en-US
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=-4.3 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, NICE_REPLY_A,
 RCVD_IN_BARRACUDACENTRAL, RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H2, SPF_HELO_NONE,
 SPF_NONE, TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Tue, 30 Nov 2021 15:00:11 -0000

On 30/11/2021 14:26, Marek Polacek wrote:
> On Tue, Nov 30, 2021 at 09:38:57AM +0100, Stephan Bergmann wrote:
>> On 15/11/2021 18:28, Marek Polacek via Gcc-patches wrote:
>>> On Mon, Nov 08, 2021 at 04:33:43PM -0500, Marek Polacek wrote:
>>>> Ping, can we conclude on the name?   IMHO, -Wbidirectional is just fine,
>>>> but changing the name is a trivial operation.
>>>
>>> Here's a patch with a better name (suggested by Jonathan W.).  Otherwise no
>>> changes.
>>>
>>> Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?
>>>
>>> -- >8 --
>>>   From a link below:
>>> "An issue was discovered in the Bidirectional Algorithm in the Unicode
>>> Specification through 14.0. It permits the visual reordering of
>>> characters via control sequences, which can be used to craft source code
>>> that renders different logic than the logical ordering of tokens
>>> ingested by compilers and interpreters. Adversaries can leverage this to
>>> encode source code for compilers accepting Unicode such that targeted
>>> vulnerabilities are introduced invisibly to human reviewers."
>>>
>>> More info:
>>> https://nvd.nist.gov/vuln/detail/CVE-2021-42574
>>> https://trojansource.codes/
>>>
>>> This is not a compiler bug.  However, to mitigate the problem, this patch
>>> implements -Wbidi-chars=[none|unpaired|any] to warn about possibly
>>> misleading Unicode bidirectional characters the preprocessor may encounter.
>>>
>>> The default is =unpaired, which warns about improperly terminated
>>> bidirectional characters; e.g. a LRE without its appertaining PDF.  The
>>> level =any warns about any use of bidirectional characters.
>>>
>>> This patch handles both UCNs and UTF-8 characters.  UCNs designating
>>> bidi characters in identifiers are accepted since r204886.  Then r217144
>>> enabled -fextended-identifiers by default.  Extended characters in C/C++
>>> identifiers have been accepted since r275979.  However, this patch still
>>> warns about mixing UTF-8 and UCN bidi characters; there seems to be no
>>> good reason to allow mixing them.
>>
>> I wonder what the rationale is to warn about UCNs, like in
>>
>>>          aText = u"\u202D" + aText;
>>
>> (as found in the LibreOffice source code).
> 
> Is this line mixing a UCN and a UTF-8?  Or is it just that you're
> prepending a LRO to aText?  We warn because the LRO is not "closed"
> in the context of its string literal, which was part of the Trojan
> source attack.  So "\u202D ... \u202C" would not warn.
> 
> I'm not sure what workaround I could offer.  Maybe provide an option not to
> warn about UCNs at all, though even that is potentially dangerous -- while
> you can see UCNs in the source code, if you print strings containing them,
> they won't be visible anymore.

I'm not sure what you mean with "mixing a UCN and a UTF-8", but what the 
code apparently does is programmatically constructing a larger piece of 
text by prepending LRO to an existing piece of text.

My understanding is that Trojan Source is concerned with presentation of 
program source code and not with properties of Unicode text constructed 
during the execution of such a program, and from the documentation 
quoted above I understand that -Wbidi-chars is meant to address Trojan 
Source, so I don't understand why you're concerned here with what 
happens "if you print strings containing [UCNs in the source code]".

Short of a source code viewer that interprets UCNs in C/C++ source code 
and renders them in the same way as their corresponding Unicode 
characters, I don't think that UCNs are relevant for Trojan Source, and 
don't understand why -Wbidi-chars would warn about them.

(Also, I noticed that it doesn't work to silence -Werror=bidi-chars= with a

> #pragma GCC diagnostic ignored "-Wbidi-chars"

?)