From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <mark@klomp.org>
Received: from gnu.wildebeest.org (wildebeest.demon.nl [212.238.236.112])
 by sourceware.org (Postfix) with ESMTPS id 80D4D3858039;
 Sun, 18 Jul 2021 13:23:00 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 80D4D3858039
Authentication-Results: sourceware.org;
 dmarc=none (p=none dis=none) header.from=klomp.org
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=klomp.org
Received: from reform (77-167-121-15.hybrid.kpn.net [77.167.121.15])
 (using TLSv1.2 with cipher ADH-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by gnu.wildebeest.org (Postfix) with ESMTPSA id BD2C6300066F;
 Sun, 18 Jul 2021 15:22:58 +0200 (CEST)
Received: by reform (Postfix, from userid 1000)
 id D12A52E804E0; Sun, 18 Jul 2021 15:22:56 +0200 (CEST)
Date: Sun, 18 Jul 2021 15:22:56 +0200
From: Mark Wielaard <mark@klomp.org>
To: gcc@gcc.gnu.org, gcc-rust@gcc.gnu.org
Subject: rust frontend and UTF-8/unicode processing/properties
Message-ID: <YPQrMBHyu3wRpT5o@wildebeest.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
X-Spam-Status: No, score=-4.4 required=5.0 tests=BAYES_00, JMQ_SPF_NEUTRAL,
 KAM_DMARC_STATUS, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=no autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc-rust@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: gcc-rust mailing list <gcc-rust.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-rust>,
 <mailto:gcc-rust-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-rust/>
List-Post: <mailto:gcc-rust@gcc.gnu.org>
List-Help: <mailto:gcc-rust-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-rust>,
 <mailto:gcc-rust-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Sun, 18 Jul 2021 13:23:01 -0000

Hi,

For the gcc rust frontend I was thinking of importing a couple of
gnulib modules to help with UTF-8 processing, conversion to/from
unicode codepoints and determining various properties of those
codepoints. But it seems gcc doesn't yet have any gnulib modules
imported, and maybe other frontends already have helpers for this that
the gcc rust frontend could reuse.

Rust only accepts valid UTF-8 encoded source files, which may or may
not start with a UTF-8 BOM character. Whitespace is any codepoint with
the Pattern_White_Space property. Identifiers start with any codepoint
with the XID_Start property, followed by zero or more codepoints with
the XID_Continue property. It isn't required, but it is highly desirable,
to detect confusable identifiers according to tr39/Confusable_Detection.
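
As a rough sketch of the lexing rules above (written in Rust purely as
an illustration of the behaviour, not as a proposal for the frontend
code; all function names are made up): the XID_Start/XID_Continue
predicates are passed in as parameters, since those tables are not in
the Rust standard library and would have to come from gnulib or
similar. Pattern_White_Space is small enough to list inline.

```rust
/// Pattern_White_Space is a small, fixed set of 11 codepoints.
fn is_pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}' // tab, LF, VT, FF, CR
            | '\u{0020}' // space
            | '\u{0085}' // next line
            | '\u{200E}' // left-to-right mark
            | '\u{200F}' // right-to-left mark
            | '\u{2028}' // line separator
            | '\u{2029}' // paragraph separator
    )
}

/// Validate the raw bytes as UTF-8 and drop one leading BOM (U+FEFF),
/// if present. Returns None for invalid UTF-8.
fn decode_source(bytes: &[u8]) -> Option<&str> {
    let text = std::str::from_utf8(bytes).ok()?;
    Some(text.strip_prefix('\u{FEFF}').unwrap_or(text))
}

/// Identifier shape: one "start" codepoint, then zero or more
/// "continue" codepoints. The predicates are hypothetical stand-ins
/// for real XID_Start / XID_Continue lookups.
fn has_identifier_shape(
    s: &str,
    is_start: impl Fn(char) -> bool,
    is_continue: impl Fn(char) -> bool,
) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(first) if is_start(first) => chars.all(is_continue),
        _ => false,
    }
}
```

Note that U+00A0 (no-break space) has the White_Space property but not
Pattern_White_Space, so a generic "is whitespace" helper is not a
drop-in replacement for the first function.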

Other names might be constrained to the Alphabetic and/or Number
categories (Nd, Nl, No). Textual types can only contain Unicode Scalar
Values (any Unicode codepoint except high and low surrogates). Strings
in source code can contain unicode escapes (24 bit, codepoints of up to
6 hex digits) but are internally stored as UTF-8 (and must not encode
any surrogates).
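
To illustrate the scalar value constraint on unicode escapes, here is
a minimal sketch (again in Rust, with a made-up function name) that
takes only the hex digits of an escape and either yields a Unicode
Scalar Value or rejects the escape. It assumes the surrounding escape
syntax has already been stripped by the lexer.

```rust
/// Decode the hex digits of a unicode escape into a scalar value.
/// Returns None for bad digits, surrogates, or out-of-range values.
fn decode_unicode_escape(hex: &str) -> Option<char> {
    // One to six hex digits. The digits are checked explicitly because
    // from_str_radix on its own would also accept a leading '+'.
    if hex.is_empty() || hex.len() > 6 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let value = u32::from_str_radix(hex, 16).ok()?;
    // char::from_u32 rejects surrogates (U+D800..=U+DFFF) and anything
    // above U+10FFFF, which is the Unicode Scalar Value definition.
    char::from_u32(value)
}
```

Six hex digits can express values up to 0xFFFFFF, so the range check
is still needed even after the digit count has been limited. Pushing
the resulting char onto a String stores it as UTF-8, e.g. U+20AC
becomes the three bytes E2 82 AC.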

Do other gcc frontends handle any of the above already in a way that
might be reusable for other frontends?

Thanks,

Mark