From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by sourceware.org (Postfix) with ESMTP id 8CB3038654B5 for ; Sun, 18 Jul 2021 22:24:13 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 8CB3038654B5 Received: from mail-pf1-f200.google.com (mail-pf1-f200.google.com [209.85.210.200]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-314-zI5CKJ0cOiO91GIdThcoWQ-1; Sun, 18 Jul 2021 18:24:11 -0400 X-MC-Unique: zI5CKJ0cOiO91GIdThcoWQ-1 Received: by mail-pf1-f200.google.com with SMTP id h6-20020a62b4060000b02903131bc4a1acso12035787pfn.4 for ; Sun, 18 Jul 2021 15:24:11 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=ZeJ3GQ0nA3kt3dGtM3KB7S6hbQR3HGu40xgClD+OHjI=; b=BTP4tThfoR0+BNvZfY9bjKtUz/TsNLwQSoYmMggw/bgtC/5UZlUBGeAXx5L/N+Dvff 8IUAIZntwyL1UJZz7VOIi8pVlkXYvokidgjfVFec1K7vy6lOSNSjzeywxLNkAy2ttSUr 8M+dI09VPt5Im6lTOFtu+PzfxuA6EIXW70EqyBXLGoyxrOZ8VsbtWd7eMTkqNyyY9cP1 kUwTwB/0mlqad8KkBX28Z7HJN9CE0utFnXhmPruxDa5APNUcgeVYMEF9NbPUV61iqjFm YIiSv3ga+j3839J3dfz4LJuhJ/znOf8A+1TwPGPp/ZpUvEgGjLAT9hvfHXF6NgPdCS9O Z6nQ== X-Gm-Message-State: AOAM531TTFqgVnSexcMWNKXQKSJs9BKAf0DnwiU/wUYBDuLLREdLymfy 0HdL4uFzBjxgLL6DgPk/1SvAbTjvm0aryiJGClky8ZtXShkFnrz/KtuNoBPNsV5EtoYKwe7O0Ph jpMpICGuBrdhCFYPNVzpO7C4shqdJ7g== X-Received: by 2002:a63:4242:: with SMTP id p63mr22274372pga.185.1626647050256; Sun, 18 Jul 2021 15:24:10 -0700 (PDT) X-Google-Smtp-Source: ABdhPJymRMBH4SsjYv1g9hqF6cyVBoKXX5QPFuNFvtf1uSMxVAqUt/4aJG5KOfrygVcTB/hi24BDCM4ecDFcmWr+rlc= X-Received: by 2002:a63:4242:: with SMTP id p63mr22274353pga.185.1626647049976; Sun, 18 Jul 2021 15:24:09 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Jason Merrill Date: Sun, 18 Jul 2021 15:23:59 -0700 Message-ID: Subject: Re: rust frontend and UTF-8/unicode 
 processing/properties
To: Ian Lance Taylor
Cc: Mark Wielaard, gcc Mailing List, gcc-rust@gcc.gnu.org
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: multipart/alternative; boundary="00000000000096d4d505c76d471a"
X-Spam-Status: No, score=-6.6 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, HTML_MESSAGE,
 RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H4, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE,
 SPF_NONE, TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org
X-BeenThere: gcc-rust@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: gcc-rust mailing list
X-List-Received-Date: Sun, 18 Jul 2021 22:24:15 -0000

--00000000000096d4d505c76d471a
Content-Type: text/plain; charset="UTF-8"

On Sun, Jul 18, 2021 at 1:13 PM Ian Lance Taylor via Gcc wrote:

> On Sun, Jul 18, 2021 at 6:23 AM Mark Wielaard wrote:
> >
> > For the gcc rust frontend I was thinking of importing a couple of
> > gnulib modules to help with UTF-8 processing, conversion to/from
> > unicode codepoints and determining various properties of those
> > codepoints. But it seems gcc doesn't yet have any gnulib modules
> > imported, and maybe other frontends already have helpers for this
> > that the gcc rust frontend could reuse.
> >
> > Rust only accepts valid UTF-8 encoded source files, which may or may
> > not start with a UTF-8 BOM character. Whitespace is any codepoint
> > with the Pattern_White_Space property. Identifiers can start with any
> > codepoint with the XID_Start property, followed by zero or more
> > codepoints with the XID_Continue property. It isn't required, but
> > highly desirable, to detect confusable identifiers according to
> > tr39/Confusable_Detection.
> >
> > Other names might be constrained to Alphabetic and/or Number
> > categories (Nd, Nl, No), textual types can only contain Unicode
> > Scalar Values (any Unicode codepoint except high-surrogates and
> > low-surrogates), strings in source code can contain unicode escapes
> > (24 bit, codepoints of up to 6 hex digits) but are internally stored
> > as UTF-8 (and must not encode any surrogates).
> >
> > Do other gcc frontends handle any of the above already in a way that
> > might be reusable for other frontends?
>
> I don't know that this is particularly helpful, but the Go frontend
> has this kind of code in gcc/go/gofrontend/lex.cc.  E.g.,
> Lex::fetch_char, Lex::advance_one_utf8_char, unicode_space,
> unicode_digits, unicode_letters, Lex::is_unicode_space, etc.  But you
> probably won't be able to use the code directly, and the code in the
> gofrontend directory is also shared with GoLLVM so it can't trivially
> be moved.
>

I believe the UTF-8 handling for the C family front ends is all in
libcpp; I don't think it's factored in a way to be useful to other
front ends.

Jason

--00000000000096d4d505c76d471a
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
--00000000000096d4d505c76d471a--
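
For illustration only, here is a minimal sketch of the strict UTF-8 decoding
the thread describes: input must be valid UTF-8, and every decoded value must
be a Unicode scalar value (no surrogates, nothing above U+10FFFF). The names
decode_utf8_scalar and is_valid_utf8 are hypothetical; they are not taken from
libcpp, gofrontend, or gnulib. Property lookups such as XID_Start or
Pattern_White_Space need Unicode data tables and are out of scope here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from libcpp, gofrontend, or gnulib).
// Decodes one UTF-8 sequence starting at s[pos]. On success, stores the
// Unicode scalar value in `out`, advances `pos` past the sequence, and
// returns true. On malformed input, returns false and leaves `pos` alone.
bool decode_utf8_scalar(const std::string &s, std::size_t &pos, char32_t &out)
{
  if (pos >= s.size())
    return false;

  const unsigned char lead = static_cast<unsigned char>(s[pos]);
  std::size_t len;  // total bytes in this sequence
  char32_t cp;      // payload bits accumulated so far
  char32_t min;     // smallest value that legitimately needs `len` bytes

  if (lead < 0x80)                { len = 1; cp = lead;        min = 0; }
  else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
  else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
  else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
  else
    return false;  // stray continuation byte or invalid lead byte

  if (s.size() - pos < len)
    return false;  // sequence is truncated

  for (std::size_t i = 1; i < len; ++i)
    {
      const unsigned char c = static_cast<unsigned char>(s[pos + i]);
      if ((c & 0xC0) != 0x80)
        return false;  // not a continuation byte
      cp = (cp << 6) | (c & 0x3F);
    }

  if (cp < min)
    return false;  // overlong encoding
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return false;  // surrogates are not Unicode scalar values
  if (cp > 0x10FFFF)
    return false;  // beyond the Unicode codespace

  out = cp;
  pos += len;
  return true;
}

// Returns true if the whole string is well-formed UTF-8 made up only of
// Unicode scalar values.
bool is_valid_utf8(const std::string &s)
{
  std::size_t pos = 0;
  char32_t cp = 0;
  while (pos < s.size())
    if (!decode_utf8_scalar(s, pos, cp))
      return false;
  return true;
}
```

A lexer built along these lines would call decode_utf8_scalar in a loop and
then classify each scalar value; the sketch uses only C++11 features and the
standard library, which is an assumption about what a GCC front end could
accept, not a statement about how any existing front end does it.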