From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 2 Nov 2021 13:06:05 +0100
From: Jakub Jelinek <jakub@redhat.com>
To: David Malcolm <dmalcolm@redhat.com>
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Initial implementation of -Whomoglyph [PR
 preprocessor/103027]
Message-ID: <20211102120605.GB3230972@tucnak>
Reply-To: Jakub Jelinek <jakub@redhat.com>
References: <20211101211412.1123930-1-dmalcolm@redhat.com>
 <20211102115652.GD304296@tucnak>
MIME-Version: 1.0
In-Reply-To: <20211102115652.GD304296@tucnak>

On Tue, Nov 02, 2021 at 12:56:53PM +0100, Jakub Jelinek wrote:
> Consider attached testcases Whomoglyph1.C and Whomoglyph2.C.
> On Whomoglyph1.C testcase, I'd expect a warning, because there is a clear
> confusion for the reader, something that isn't visible in any of emacs, vim,
> joe editors or on the terminal, when f3 uses scope identifier, the casual
> reader will expect that it uses N1::N2::scope, but there is no such
> variable, only one N1::N2::ѕсоре that visually looks the same, but has
> different UTF-8 chars in it.  So, name lookup will instead find N1::scope
> and use that.
> But Whomoglyph2.C will emit warnings that are IMHO not appropriate,
> I believe there is no confusion at all there, e.g. for both C and C++,
> the f5/f6 case, it doesn't really matter how each of the function names its
> own parameter, one can never access another function's parameter.
> Ditto for different namespace provided that both namespaces aren't searched
> in the same name lookup, or similarly classes etc.
> So, IMNSHO that warning belongs to name-lookup (cp/name-lookup.c for the C++
> FE).
> And, another important thing is that most users don't really use Unicode in
> identifiers, I bet over 99.9% of identifiers don't have any >= 0x80
> characters in them and even when people do use them, confusable identifiers
> during the same lookup are far less likely still.
> So, I think we should optimize for the common case, ASCII only identifiers
> and spend as little compile time as possible on this stuff.

If we keep doing it in the stringpool, then e.g. one couldn't
#include <zlib.h>
in a program with Russian/Ukrainian/Serbian etc. identifiers without
triggering the warning as soon as some parameter or automatic variable
in some function in that program is called с (Cyrillic small letter es),
just because one of the parameters in one of the zlib.h function
prototypes is called c (Latin small letter c).
I'm afraid most of the users who actually want to use UTF-8 or UCNs in
their identifiers would then just have to disable this warning...
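
To make that concrete, a minimal sketch in C.  put_byte is a made-up
stand-in for a prototype from a header like zlib.h, not a real zlib
function, and count_items is equally hypothetical.  The Cyrillic
identifier is written as the UCN \u0441 so that the difference is visible
here; in real sources it would typically be raw UTF-8 and look exactly
like c.

```c
/* Hypothetical stand-in for a prototype from a third-party header such
   as zlib.h.  The parameter is LATIN SMALL LETTER C (U+0063).  */
int put_byte (int c);

int
put_byte (int c)
{
  return c & 0xff;
}

/* User code.  The automatic variable is \u0441, CYRILLIC SMALL LETTER ES
   (U+0441), which renders the same as Latin 'c' in most fonts.  It lives
   in a different scope than put_byte's parameter, so the two names can
   never be candidates in the same name lookup.  */
int
count_items (void)
{
  int \u0441 = 0;
  for (int i = 0; i < 3; i++)
    \u0441 += put_byte (i + 1) != 0;
  return \u0441;
}
```

As I read it, a check keyed on the stringpool would see both spellings in
one translation unit and complain about the pair, while a check done at
name lookup time would have nothing to say here, which is what I'd expect.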

	Jakub