public inbox for gcc-rust@gcc.gnu.org
From: Philip Herron <philip.herron@embecosm.com>
To: Jason Merrill <jason@redhat.com>, Ian Lance Taylor <iant@google.com>
Cc: gcc Mailing List <gcc@gcc.gnu.org>,
	Mark Wielaard <mark@klomp.org>,
	gcc-rust@gcc.gnu.org
Subject: Re: rust frontend and UTF-8/unicode processing/properties
Date: Fri, 23 Jul 2021 12:29:45 +0100
Message-ID: <d3d2ad5c-d5a1-f16e-d56c-c4e3352d4f7e@embecosm.com>
In-Reply-To: <CADzB+2n6SP7fBnO1GkU+AAWFa85eYb04XXGSEAyHPT-qmGDSOw@mail.gmail.com>

On 18/07/2021 23:23, Jason Merrill via Gcc-rust wrote:
> On Sun, Jul 18, 2021 at 1:13 PM Ian Lance Taylor via Gcc
> <gcc@gcc.gnu.org> wrote:
>
>     On Sun, Jul 18, 2021 at 6:23 AM Mark Wielaard <mark@klomp.org>
>     wrote:
>     >
>     > For the gcc rust frontend I was thinking of importing a couple of
>     > gnulib modules to help with UTF-8 processing, conversion to/from
>     > unicode codepoints and determining various properties of those
>     > codepoints. But it seems gcc doesn't yet have any gnulib modules
>     > imported, and maybe other frontends already have helpers for this
>     > that the gcc rust frontend could reuse.
>     >
>     > Rust only accepts valid UTF-8 encoded source files, which may or may
>     > not start with a UTF-8 BOM character. Whitespace is any codepoint
>     > with the Pattern_White_Space property. Identifiers can start with
>     > any codepoint with the XID_start property plus zero or more
>     > codepoints with the XID_continue property. It isn't required, but
>     > highly desirable to detect confusable identifiers according to
>     > tr39/Confusable_Detection.
>     >
>     > Other names might be constrained to Alphabetic and/or Number
>     > categories (Nd, Nl, No), textual types can only contain Unicode
>     > Scalar Values (any Unicode codepoint except high-surrogates and
>     > low-surrogates), strings in source code can contain unicode escapes
>     > (24 bit, codepoints of up to 6 hex digits) but are internally stored
>     > as UTF-8 (and must not encode any surrogates).
>     >
>     > Do other gcc frontends handle any of the above already in a way that
>     > might be reusable for other frontends?
>
>     I don't know that this is particularly helpful, but the Go frontend
>     has this kind of code in gcc/go/gofrontend/lex.cc.  E.g.,
>     Lex::fetch_char, Lex::advance_one_utf8_char, unicode_space,
>     unicode_digits, unicode_letters, Lex::is_unicode_space, etc.  But you
>     probably won't be able to use the code directly, and the code in the
>     gofrontend directory is also shared with GoLLVM so it can't trivially
>     be moved.
>
>
> I believe the UTF-8 handling for the C family front ends is all in
> libcpp; I don't think it's factored in a way to be useful to other
> front ends.
>
> Jason
>
I think it would be ideal to reuse code where possible. This project
could be an opportunity to try to write patches that reuse code from
other parts of GCC. I bootstrapped the front-end code from gccgo, which
has really helped the project get going, and I would like to give back
where I can.
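
To make the requirements Mark lists above a bit more concrete, here is a
minimal C++11 sketch of the table-free parts: strict UTF-8 decoding to
Unicode scalar values (rejecting overlong forms, surrogates and anything
above U+10FFFF), skipping an optional BOM, and the Pattern_White_Space
check. The function names are hypothetical and do not exist in GCC, libcpp
or the Go frontend; this is only an illustration of the rules, not a
proposed patch. XID_start/XID_continue and confusable detection need
generated Unicode data tables and are left out.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode one UTF-8 sequence starting at s[pos].
// On success store the scalar value in `out`, advance `pos`, return true.
// Returns false (leaving `pos` untouched) for truncated or overlong
// sequences, surrogates, and values above U+10FFFF.
bool decode_utf8(const std::string &s, std::size_t &pos, char32_t &out)
{
  if (pos >= s.size())
    return false;
  const unsigned char b0 = static_cast<unsigned char>(s[pos]);
  std::size_t len;
  char32_t cp;
  char32_t min; // smallest value that legitimately needs `len` bytes
  if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
  else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
  else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
  else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
  else
    return false; // stray continuation byte or invalid lead byte
  if (s.size() - pos < len)
    return false; // truncated sequence
  for (std::size_t i = 1; i < len; ++i)
    {
      const unsigned char b = static_cast<unsigned char>(s[pos + i]);
      if ((b & 0xC0) != 0x80)
        return false; // not a continuation byte
      cp = (cp << 6) | (b & 0x3F);
    }
  if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return false; // overlong, out of range, or surrogate
  pos += len;
  out = cp;
  return true;
}

// Pattern_White_Space is a small, stable set, so no table is needed.
bool is_pattern_white_space(char32_t cp)
{
  return (cp >= 0x09 && cp <= 0x0D) || cp == 0x20 || cp == 0x85
         || cp == 0x200E || cp == 0x200F || cp == 0x2028 || cp == 0x2029;
}

// Hypothetical whole-file check: optional BOM, then only valid UTF-8.
bool is_valid_rust_source(const std::string &src)
{
  std::size_t pos = 0;
  if (src.compare(0, 3, "\xEF\xBB\xBF") == 0)
    pos = 3; // skip the UTF-8 BOM
  char32_t cp;
  while (pos < src.size())
    if (!decode_utf8(src, pos, cp))
      return false;
  return true;
}
```

A lexer would call decode_utf8 in a loop and classify each scalar value;
the identifier properties would then be one more lookup on the decoded
value, wherever those tables end up coming from.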

Thanks

--Phil