From: Philip Herron
To: Jason Merrill, Ian Lance Taylor
Cc: gcc Mailing List, Mark Wielaard, gcc-rust@gcc.gnu.org
Subject: Re: rust frontend and UTF-8/unicode processing/properties
Date: Fri, 23 Jul 2021 12:29:45 +0100

On 18/07/2021 23:23, Jason Merrill via Gcc-rust wrote:
On Sun, Jul 18, 2021 at 1:13 PM Ian Lance Taylor via Gcc <gcc@gcc.gnu.org> wrote:
On Sun, Jul 18, 2021 at 6:23 AM Mark Wielaard <mark@klomp.org> wrote:
>
> For the gcc rust frontend I was thinking of importing a couple of
> gnulib modules to help with UTF-8 processing, conversion to/from
> unicode codepoints and determining various properties of those
> codepoints. But it seems gcc doesn't yet have any gnulib modules
> imported, and maybe other frontends already have helpers for this that
> the gcc rust frontend could reuse.
>
> Rust only accepts valid UTF-8 encoded source files, which may or may
> not start with a UTF-8 BOM character. Whitespace is any codepoint with
> the Pattern_White_Space property. Identifiers can start with any
> codepoint with the XID_Start property, followed by zero or more
> codepoints with the XID_Continue property. It isn't required, but it is
> highly desirable to detect confusable identifiers according to
> tr39/Confusable_Detection.
>
> Other names might be constrained to Alphabetic and/or Number categories
> (Nd, Nl, No). Textual types can only contain Unicode Scalar Values
> (any Unicode codepoint except high and low surrogates). Strings in
> source code can contain unicode escapes (24 bit, codepoints of up to 6
> hex digits) but are internally stored as UTF-8 (and must not encode
> any surrogates).
>
> Do other gcc frontends handle any of the above already in a way that
> might be reusable for other frontends?
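
Two of the properties listed above are small enough to check by hand. The following is only a minimal C++ sketch, assuming code points have already been decoded to 32-bit integers. The function names are hypothetical and do not exist in GCC. Pattern_White_Space is a small fixed set of 11 code points, so it is hard-coded here; XID_Start and XID_Continue need tables generated from the Unicode Character Database and are not covered.

```cpp
#include <cstdint>

// Hypothetical helper names; none of these exist in GCC.

// A Unicode scalar value is any code point in [0, 0x10FFFF] that is not a
// surrogate (U+D800..U+DFFF).
constexpr bool
is_unicode_scalar_value (std::uint32_t cp)
{
  return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Code points carrying the Pattern_White_Space property.
constexpr bool
is_pattern_white_space (std::uint32_t cp)
{
  return (cp >= 0x0009 && cp <= 0x000D) // TAB, LF, VT, FF, CR
         || cp == 0x0020                // SPACE
         || cp == 0x0085                // NEXT LINE
         || cp == 0x200E                // LEFT-TO-RIGHT MARK
         || cp == 0x200F                // RIGHT-TO-LEFT MARK
         || cp == 0x2028                // LINE SEPARATOR
         || cp == 0x2029;               // PARAGRAPH SEPARATOR
}
```

Note that U+00A0 (NO-BREAK SPACE) is not in the set, so a lexer using this check would not treat it as whitespace.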

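The unicode escape rule in the quoted message (up to 6 hex digits, no surrogates) can be sketched in a few lines. In Rust source such an escape is written `\u{...}`. The C++ sketch below assumes the caller has already isolated the characters between the braces. The function name is hypothetical, this is not GCC code, and it is deliberately simplified to accept plain hex digits only.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not GCC code.  Parses the hex digits found between
// the braces of a `\u{...}` escape.  Returns the code point, or -1 if the
// digits are malformed or do not name a Unicode scalar value.
constexpr std::int32_t
parse_unicode_escape_digits (const char *digits, std::size_t len)
{
  if (len < 1 || len > 6)
    return -1; // one to six hex digits are allowed

  std::uint32_t cp = 0;
  for (std::size_t i = 0; i < len; ++i)
    {
      const char c = digits[i];
      std::uint32_t d = 0;
      if (c >= '0' && c <= '9')
        d = static_cast<std::uint32_t> (c - '0');
      else if (c >= 'a' && c <= 'f')
        d = static_cast<std::uint32_t> (c - 'a' + 10);
      else if (c >= 'A' && c <= 'F')
        d = static_cast<std::uint32_t> (c - 'A' + 10);
      else
        return -1; // not a hex digit
      cp = cp * 16 + d;
    }

  // Six hex digits can express values up to 0xFFFFFF, so the upper bound
  // and the surrogate range both need an explicit check.
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return -1;
  return static_cast<std::int32_t> (cp);
}
```

Returning -1 keeps the sketch free of library dependencies; a real lexer would more likely report a diagnostic with a source location.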
I don't know that this is particularly helpful, but the Go frontend
has this kind of code in gcc/go/gofrontend/lex.cc. E.g.,
Lex::fetch_char, Lex::advance_one_utf8_char, unicode_space,
unicode_digits, unicode_letters, Lex::is_unicode_space, etc. But you
probably won't be able to use the code directly, and the code in the
gofrontend directory is also shared with GoLLVM so it can't trivially
be moved.
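
To make concrete what a front end needs from such a lexer helper, here is a minimal C++ sketch of decoding one UTF-8 sequence and detecting an optional leading BOM. It is an independent illustration written for this thread's requirements (valid UTF-8 only, no surrogates); it is not the code in gcc/go/gofrontend/lex.cc, and every name in it is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Independent sketch; this is NOT the gofrontend implementation and all
// names are hypothetical.

struct Utf8Result
{
  std::uint32_t codepoint; // decoded value, meaningful only if length > 0
  std::size_t length;      // bytes consumed; 0 on invalid or truncated input
};

// Decode the UTF-8 sequence starting at p[0], reading at most len bytes.
constexpr Utf8Result
decode_utf8 (const char *p, std::size_t len)
{
  if (len == 0)
    return {0, 0};

  const std::uint32_t b0 = static_cast<unsigned char> (p[0]);
  if (b0 < 0x80)
    return {b0, 1}; // single-byte (ASCII) case

  std::size_t n = 0;     // total sequence length implied by the lead byte
  std::uint32_t cp = 0;  // payload bits accumulated so far
  std::uint32_t min = 0; // smallest value that legitimately needs n bytes
  if ((b0 & 0xE0) == 0xC0)
    { n = 2; cp = b0 & 0x1F; min = 0x80; }
  else if ((b0 & 0xF0) == 0xE0)
    { n = 3; cp = b0 & 0x0F; min = 0x800; }
  else if ((b0 & 0xF8) == 0xF0)
    { n = 4; cp = b0 & 0x07; min = 0x10000; }
  else
    return {0, 0}; // stray continuation byte or invalid lead byte

  if (len < n)
    return {0, 0}; // truncated sequence

  for (std::size_t i = 1; i < n; ++i)
    {
      const std::uint32_t b = static_cast<unsigned char> (p[i]);
      if ((b & 0xC0) != 0x80)
        return {0, 0}; // continuation bytes must look like 10xxxxxx
      cp = (cp << 6) | (b & 0x3F);
    }

  // Reject overlong encodings, surrogates and values above U+10FFFF.
  if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return {0, 0};
  return {cp, n};
}

// Number of bytes to skip for an optional leading UTF-8 BOM (EF BB BF).
constexpr std::size_t
utf8_bom_length (const char *p, std::size_t len)
{
  return (len >= 3
          && static_cast<unsigned char> (p[0]) == 0xEF
          && static_cast<unsigned char> (p[1]) == 0xBB
          && static_cast<unsigned char> (p[2]) == 0xBF) ? 3 : 0;
}
```

A lexer loop would call utf8_bom_length once at the start of the file, then call decode_utf8 repeatedly and advance by the returned length, treating a length of 0 as an encoding error.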

I believe the UTF-8 handling for the C family front ends is all in libcpp; I don't think it's factored in a way to be useful to other front ends.

Jason

I think it would be ideal to reuse code if possible. This project
could be an opportunity to at least try to write patches that reuse
code from other parts of GCC. I bootstrapped the front-end code from
gccgo, which has really helped the project get going, and I would like
to give back where I can.

Thanks

--Phil
