From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <jason@redhat.com>
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
 by sourceware.org (Postfix) with ESMTP id 8CB3038654B5
 for <gcc-rust@gcc.gnu.org>; Sun, 18 Jul 2021 22:24:13 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 8CB3038654B5
Received: from mail-pf1-f200.google.com (mail-pf1-f200.google.com
 [209.85.210.200]) (Using TLS) by relay.mimecast.com with ESMTP id
 us-mta-314-zI5CKJ0cOiO91GIdThcoWQ-1; Sun, 18 Jul 2021 18:24:11 -0400
X-MC-Unique: zI5CKJ0cOiO91GIdThcoWQ-1
Received: by mail-pf1-f200.google.com with SMTP id
 h6-20020a62b4060000b02903131bc4a1acso12035787pfn.4
 for <gcc-rust@gcc.gnu.org>; Sun, 18 Jul 2021 15:24:11 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc;
 bh=ZeJ3GQ0nA3kt3dGtM3KB7S6hbQR3HGu40xgClD+OHjI=;
 b=BTP4tThfoR0+BNvZfY9bjKtUz/TsNLwQSoYmMggw/bgtC/5UZlUBGeAXx5L/N+Dvff
 8IUAIZntwyL1UJZz7VOIi8pVlkXYvokidgjfVFec1K7vy6lOSNSjzeywxLNkAy2ttSUr
 8M+dI09VPt5Im6lTOFtu+PzfxuA6EIXW70EqyBXLGoyxrOZ8VsbtWd7eMTkqNyyY9cP1
 kUwTwB/0mlqad8KkBX28Z7HJN9CE0utFnXhmPruxDa5APNUcgeVYMEF9NbPUV61iqjFm
 YIiSv3ga+j3839J3dfz4LJuhJ/znOf8A+1TwPGPp/ZpUvEgGjLAT9hvfHXF6NgPdCS9O
 Z6nQ==
X-Gm-Message-State: AOAM531TTFqgVnSexcMWNKXQKSJs9BKAf0DnwiU/wUYBDuLLREdLymfy
 0HdL4uFzBjxgLL6DgPk/1SvAbTjvm0aryiJGClky8ZtXShkFnrz/KtuNoBPNsV5EtoYKwe7O0Ph
 jpMpICGuBrdhCFYPNVzpO7C4shqdJ7g==
X-Received: by 2002:a63:4242:: with SMTP id p63mr22274372pga.185.1626647050256; 
 Sun, 18 Jul 2021 15:24:10 -0700 (PDT)
X-Google-Smtp-Source: ABdhPJymRMBH4SsjYv1g9hqF6cyVBoKXX5QPFuNFvtf1uSMxVAqUt/4aJG5KOfrygVcTB/hi24BDCM4ecDFcmWr+rlc=
X-Received: by 2002:a63:4242:: with SMTP id p63mr22274353pga.185.1626647049976; 
 Sun, 18 Jul 2021 15:24:09 -0700 (PDT)
MIME-Version: 1.0
References: <YPQrMBHyu3wRpT5o@wildebeest.org>
 <CAKOQZ8zB2L-u61KFgftyoV20TJ8oxdXO6D_v7LiwxgsJ7bxPLg@mail.gmail.com>
In-Reply-To: <CAKOQZ8zB2L-u61KFgftyoV20TJ8oxdXO6D_v7LiwxgsJ7bxPLg@mail.gmail.com>
From: Jason Merrill <jason@redhat.com>
Date: Sun, 18 Jul 2021 15:23:59 -0700
Message-ID: <CADzB+2n6SP7fBnO1GkU+AAWFa85eYb04XXGSEAyHPT-qmGDSOw@mail.gmail.com>
Subject: Re: rust frontend and UTF-8/unicode processing/properties
To: Ian Lance Taylor <iant@google.com>
Cc: Mark Wielaard <mark@klomp.org>, gcc Mailing List <gcc@gcc.gnu.org>,
 gcc-rust@gcc.gnu.org
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: multipart/alternative; boundary="00000000000096d4d505c76d471a"
X-Spam-Status: No, score=-6.6 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, HTML_MESSAGE,
 RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H4, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE,
 SPF_NONE, TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc-rust@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: gcc-rust mailing list <gcc-rust.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-rust>,
 <mailto:gcc-rust-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-rust/>
List-Post: <mailto:gcc-rust@gcc.gnu.org>
List-Help: <mailto:gcc-rust-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-rust>,
 <mailto:gcc-rust-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Sun, 18 Jul 2021 22:24:15 -0000

--00000000000096d4d505c76d471a
Content-Type: text/plain; charset="UTF-8"

On Sun, Jul 18, 2021 at 1:13 PM Ian Lance Taylor via Gcc <gcc@gcc.gnu.org>
wrote:

> On Sun, Jul 18, 2021 at 6:23 AM Mark Wielaard <mark@klomp.org> wrote:
> >
> > For the gcc rust frontend I was thinking of importing a couple of
> > gnulib modules to help with UTF-8 processing, conversion to/from
> > unicode codepoints and determining various properties of those
> > codepoints. But it seems gcc doesn't yet have any gnulib modules
> > imported, and maybe other frontends already have helpers for this that
> > the gcc rust frontend could reuse.
> >
> > Rust only accepts valid UTF-8 encoded source files, which may or may
> > not start with a UTF-8 BOM character. Whitespace is any codepoint with
> > the Pattern_White_Space property. Identifiers can start with any
> > codepoint with the XID_start property, followed by zero or more
> > codepoints with the XID_continue property. It isn't required, but it
> > is highly desirable to detect confusable identifiers according to
> > tr39/Confusable_Detection.
> >
> > Other names might be constrained to Alphabetic and/or Number categories
> > (Nd, Nl, No), textual types can only contain Unicode Scalar Values
> > (any Unicode codepoint except high-surrogate and low-surrogates),
> > strings in source code can contain unicode escapes (24 bit, up to 6
> > digits codepoints) but are internally stored as UTF-8 (and must not
> > encode any surrogates).
> >
> > Do other gcc frontends handle any of the above already in a way that
> > might be reusable for other frontends?
>
> I don't know that this is particularly helpful, but the Go frontend
> has this kind of code in gcc/go/gofrontend/lex.cc.  E.g.,
> Lex::fetch_char, Lex::advance_one_utf8_char, unicode_space,
> unicode_digits, unicode_letters, Lex::is_unicode_space, etc.  But you
> probably won't be able to use the code directly, and the code in the
> gofrontend directory is also shared with GoLLVM so it can't trivially
> be moved.
>

I believe the UTF-8 handling for the C family front ends is all in libcpp;
I don't think it's factored in a way to be useful to other front ends.
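
As a rough illustration of the encoding rules quoted above (valid UTF-8
only, optional BOM, no surrogates), a standalone decoder could look
something like the sketch below. This is not code from libcpp or the Go
frontend; the names decode_utf8 and is_valid_rust_source_encoding are
hypothetical, and it assumes C++17. It does not cover the property lookups
(Pattern_White_Space, XID_start, XID_continue), which need Unicode tables.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper, not taken from libcpp or gofrontend.
// Decodes one UTF-8 sequence starting at input[pos].  On success it
// advances pos and returns the codepoint.  It returns std::nullopt for
// malformed or truncated input, overlong encodings, surrogates
// (U+D800..U+DFFF) and values above U+10FFFF; pos is then left unchanged.
inline std::optional<char32_t>
decode_utf8 (std::string_view input, std::size_t &pos)
{
  if (pos >= input.size ())
    return std::nullopt;

  const unsigned char lead = static_cast<unsigned char> (input[pos]);
  std::size_t len;
  char32_t cp;
  char32_t min; // smallest codepoint allowed for this sequence length

  if (lead < 0x80)
    { len = 1; cp = lead; min = 0; }
  else if ((lead & 0xE0) == 0xC0)
    { len = 2; cp = lead & 0x1F; min = 0x80; }
  else if ((lead & 0xF0) == 0xE0)
    { len = 3; cp = lead & 0x0F; min = 0x800; }
  else if ((lead & 0xF8) == 0xF0)
    { len = 4; cp = lead & 0x07; min = 0x10000; }
  else
    return std::nullopt; // stray continuation byte or invalid lead byte

  if (input.size () - pos < len)
    return std::nullopt; // truncated sequence

  for (std::size_t i = 1; i < len; ++i)
    {
      const unsigned char c = static_cast<unsigned char> (input[pos + i]);
      if ((c & 0xC0) != 0x80)
        return std::nullopt; // not a continuation byte
      cp = (cp << 6) | (c & 0x3F);
    }

  // Reject overlong forms, out-of-range values and surrogates.
  if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return std::nullopt;

  pos += len;
  return cp;
}

// Checks a whole buffer, skipping an optional leading BOM (EF BB BF).
inline bool
is_valid_rust_source_encoding (std::string_view input)
{
  std::size_t pos = 0;
  if (input.substr (0, 3) == "\xEF\xBB\xBF")
    pos = 3;
  while (pos < input.size ())
    if (!decode_utf8 (input, pos))
      return false;
  return true;
}
```

A lexer would call decode_utf8 in a loop and then classify each returned
codepoint; the classification tables are the part that would most likely
have to come from something like gnulib or generated Unicode data.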

Jason

--00000000000096d4d505c76d471a--