Date: Tue, 2 Nov 2021 13:06:05 +0100
From: Jakub Jelinek
To: David Malcolm
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Initial implementation of -Whomoglyph [PR preprocessor/103027]
Message-ID: <20211102120605.GB3230972@tucnak>
References: <20211101211412.1123930-1-dmalcolm@redhat.com> <20211102115652.GD304296@tucnak>
In-Reply-To: <20211102115652.GD304296@tucnak>

On Tue, Nov 02, 2021 at 12:56:53PM +0100, Jakub Jelinek wrote:
> Consider attached testcases Whomoglyph1.C and Whomoglyph2.C.
> On Whomoglyph1.C testcase, I'd expect a warning, because there is a clear
> confusion for the reader, something that isn't visible in any of emacs, vim,
> joe editors or on the terminal, when f3 uses scope identifier, the casual
> reader will expect that it uses N1::N2::scope, but there is no such
> variable, only one N1::N2::ѕсоре that visually looks the same, but has
> different UTF-8 chars in it. So, name lookup will instead find N1::scope
> and use that.
> But Whomoglyph2.C will emit warnings that are IMHO not appropriate,
> I believe there is no confusion at all there, e.g. for both C and C++,
> the f5/f6 case, it doesn't really matter how each of the function names its
> own parameter, one can never access another function's parameter.
> Ditto for different namespace provided that both namespaces aren't searched
> in the same name lookup, or similarly classes etc.
> So, IMNSHO that warning belongs to name-lookup (cp/name-lookup.c for the C++
> FE).
> And, another important thing is that most users don't really use unicode in
> identifiers, I bet over 99.9% of identifiers don't have any >= 0x80
> characters in it and even when people do use them, confusable identifiers
> during the same lookup are even far more unlikely.
> So, I think we should optimize for the common case, ASCII only identifiers
> and spend as little compile time as possible on this stuff.

If we keep doing it in the stringpool, then e.g. one couldn't
#include <zlib.h> without triggering the warning in a program with
Russian/Ukrainian/Serbian etc. identifiers where some parameter or automatic
variable in some function in that file is called с (Cyrillic small letter
es), just because in zlib.h one of the arguments in one of the function
prototypes is called c (Latin small letter c).
I'm afraid most of the users who actually want to use UTF-8 or UCNs in
their identifiers would then just have to disable this warning...

	Jakub
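
For illustration, here is a minimal sketch of the two situations discussed
above. It is not the attached Whomoglyph1.C or Whomoglyph2.C; the names
lib_put and user_fn are hypothetical, and lib_put merely stands in for a
prototype from a third-party header such as zlib.h. The Cyrillic identifiers
are written as UCNs so that the difference from the ASCII spelling is visible
in the source; in an editor the UTF-8 spelling would look identical to the
ASCII one.

```cpp
#include <cassert>

// Hypothetical stand-in for a prototype from a third-party header:
// its parameter is named with Latin small letter c (U+0063).
int lib_put(int c);

namespace N1 {
int scope = 1;  // plain ASCII "scope"

namespace N2 {
// Cyrillic look-alike of "scope": U+0455 U+0441 U+043E U+0440 U+0435.
int \u0455\u0441\u043e\u0440\u0435 = 2;

// The ASCII spelling is used here. N2 has no ASCII "scope", so
// unqualified lookup continues outward and finds N1::scope.
// This is the case where a warning would help the reader.
int f3() { return scope; }
}  // namespace N2
}  // namespace N1

// Parameter named Cyrillic small letter es (U+0441). It lives in its own
// function scope, so no name lookup can ever confuse it with the
// parameter c of lib_put. Warning here would only be noise.
int user_fn(int \u0441) { return \u0441 + 1; }

int lib_put(int c) { return c; }
```

Calling N1::N2::f3() yields the value of N1::scope, not of the look-alike
variable in N2. The two parameter names, on the other hand, are both entered
into the same identifier table, which is why a check done there would flag
them even though they never meet in a single lookup.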