Date: Sat, 18 Mar 2023 10:28:13 +0100
From: Jakub Jelinek
To: Raiki Tamura
Cc: Jonathan Wakely, Mark Wielaard, Thomas Schwinge, Philip Herron, "gcc@gcc.gnu.org", gcc-rust@gcc.gnu.org, David Edelsohn, Arthur Cohen, Arsen Arsenović
Subject: Re: [GSoC] gccrs Unicode support

On Sat, Mar 18, 2023 at 05:59:34PM +0900, Raiki Tamura wrote:
> On Sat, Mar 18, 2023 at 17:47, Jonathan Wakely wrote:
>
> > On Sat, 18 Mar 2023, 08:32 Raiki Tamura via Gcc, wrote:
> >
> >> Thank you everyone for your advice.
> >> Some kinds of names are restricted to unicode alphabetic/numeric in Rust.
> >>
> >
> > Doesn't it use the same rules as C++, based on XID_Start and XID_Continue?
> > That should already be supported.
> >
>
> Yes, C++ and Rust use the same rules for identifiers (described in UAX#31)
> and we can reuse it in the lexer of gccrs.
> I was talking about values of Rust's crate_name attributes, which only
> allow Unicode alphabetic/numeric characters.
> (Ref:
> https://doc.rust-lang.org/reference/crates-and-source-files.html#the-crate_name-attribute
> )

That is a pretty simple thing, so no need to use an extra library for that.
As documented in contrib/unicode/README, the Unicode *.txt files are
already checked in, and there are several table generators.
libcpp/makeucnid.cc already creates tables based on the UnicodeData.txt,
DerivedNormalizationProps.txt and DerivedCoreProperties.txt files,
including NFC/NFKC. It is true that it doesn't currently compute whether
a character is alphanumeric. A character qualifies if it either has the
Alphabetic property in DerivedCoreProperties.txt or, for numeric, has the
Nd, Nl or No category (3rd column) in UnicodeData.txt.

It should take only a few lines to add that support to
libcpp/makeucnid.cc; the only question is whether differentiating on
another ALPHANUM flag would make the ucnranges array much larger. If it
doesn't grow too much, let's put it there; if it would grow too much,
perhaps we should emit it in a separate table.

	Jakub
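To make the rule above concrete, here is a minimal sketch of how a generator could derive an alphanumeric flag from single lines of the two data files. It is not the actual libcpp/makeucnid.cc code: the names mark_numeric, mark_alphabetic and count_runs are hypothetical, and the sketch assumes that the numeric categories never appear in the "First>/Last>" range pairs of UnicodeData.txt, so each numeric character has its own line.

```cpp
// Hypothetical sketch only; not taken from libcpp/makeucnid.cc.
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

constexpr unsigned long kMaxCodePoint = 0x10FFFF;

// Split a line on ';' and trim surrounding blanks from each field.
std::vector<std::string> split_fields(const std::string &line)
{
  std::vector<std::string> out;
  std::istringstream in(line);
  std::string field;
  while (std::getline(in, field, ';')) {
    std::size_t b = field.find_first_not_of(" \t");
    std::size_t e = field.find_last_not_of(" \t\r");
    out.push_back(b == std::string::npos ? "" : field.substr(b, e - b + 1));
  }
  return out;
}

// UnicodeData.txt: the general category is the 3rd field.
// Sets the flag when that category is Nd, Nl or No.
void mark_numeric(const std::string &line, std::vector<bool> &alnum)
{
  std::vector<std::string> f = split_fields(line);
  if (f.size() < 3)
    return;
  if (f[2] != "Nd" && f[2] != "Nl" && f[2] != "No")
    return;
  unsigned long cp = std::strtoul(f[0].c_str(), nullptr, 16);
  assert(cp <= kMaxCodePoint);
  alnum[cp] = true;
}

// DerivedCoreProperties.txt: "XXXX[..YYYY] ; Property # comment".
// Sets the flag for every code point of an Alphabetic range.
void mark_alphabetic(const std::string &line, std::vector<bool> &alnum)
{
  // Drop the trailing '#' comment before splitting.
  std::vector<std::string> f = split_fields(line.substr(0, line.find('#')));
  if (f.size() < 2 || f[1] != "Alphabetic")
    return;
  std::size_t dots = f[0].find("..");
  // strtoul stops at the first '.', so this reads the range start.
  unsigned long first = std::strtoul(f[0].c_str(), nullptr, 16);
  unsigned long last = dots == std::string::npos
    ? first
    : std::strtoul(f[0].c_str() + dots + 2, nullptr, 16);
  assert(first <= last && last <= kMaxCodePoint);
  for (unsigned long cp = first; cp <= last; ++cp)
    alnum[cp] = true;
}

// Number of maximal runs of flagged code points: a rough indicator of
// how many extra range boundaries a new flag could introduce.
std::size_t count_runs(const std::vector<bool> &alnum)
{
  std::size_t runs = 0;
  bool prev = false;
  for (bool b : alnum) {
    if (b && !prev)
      ++runs;
    prev = b;
  }
  return runs;
}
```

A caller would allocate std::vector<bool>(kMaxCodePoint + 1), feed every line of both files through the two mark_* functions, and then look at count_runs. That count is only a rough hint for the size question: the real ucnranges array splits on the combination of all its flags, so the actual growth has to be measured in the generator itself.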