On Sun, Jul 18, 2021 at 1:13 PM Ian Lance Taylor via Gcc <gcc@gcc.gnu.org> wrote:
On Sun, Jul 18, 2021 at 6:23 AM Mark Wielaard <mark@klomp.org> wrote:
>
> For the gcc rust frontend I was thinking of importing a couple of
> gnulib modules to help with UTF-8 processing, conversion to/from
> unicode codepoints and determining various properties of those
> codepoints. But it seems gcc doesn't yet have any gnulib modules
> imported, and maybe other frontends already have helpers for this that
> the gcc rust frontend could reuse.
>
> Rust only accepts valid UTF-8 encoded source files, which may or may
> not start with a UTF-8 BOM character. Whitespace is any codepoint with
> the Pattern_White_Space property. Identifiers can start with any
> codepoint with the XID_start property plus zero or more codepoints with
> XID_continue property. It isn't required, but highly desirable to
> detect confusable identifiers according to tr39/Confusable_Detection.
>
> Other names might be constrained to Alphabetic and/or Number categories
> (Nd, Nl, No), textual types can only contain Unicode Scalar Values
> (any Unicode codepoint except high-surrogate and low-surrogates),
> strings in source code can contain unicode escapes (24 bit, up to 6
> digits codepoints) but are internally stored as UTF-8 (and must not
> encode any surrogates).
>
> Do other gcc frontends handle any of the above already in a way that
> might be reusable for other frontends?
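
To make the requirements quoted above concrete, here is a minimal C++
sketch of the byte-level part: skipping an optional BOM, decoding one
UTF-8 sequence while rejecting overlong forms and surrogates, and
encoding a scalar value (for example one taken from a unicode escape)
back to UTF-8. The names skip_bom, is_scalar_value, decode_utf8 and
encode_utf8 are hypothetical and do not exist in GCC, libcpp, gnulib or
the Go frontend. Property lookups such as Pattern_White_Space or
XID_start need Unicode data tables and are not covered.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// All names below are hypothetical, for illustration only.

// Return 3 if src starts with the UTF-8 BOM (EF BB BF), otherwise 0.
std::size_t skip_bom(const std::string &src)
{
  if (src.size() >= 3
      && static_cast<unsigned char>(src[0]) == 0xEF
      && static_cast<unsigned char>(src[1]) == 0xBB
      && static_cast<unsigned char>(src[2]) == 0xBF)
    return 3;
  return 0;
}

// Unicode scalar value: any code point up to U+10FFFF except surrogates.
bool is_scalar_value(std::uint32_t cp)
{
  return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Decode one scalar value starting at src[pos]. On success store it in
// *cp and return the number of bytes consumed. Return 0 on invalid input:
// truncated sequence, bad lead or continuation byte, overlong form,
// surrogate, or value above U+10FFFF.
std::size_t decode_utf8(const std::string &src, std::size_t pos,
                        std::uint32_t *cp)
{
  if (pos >= src.size())
    return 0;
  unsigned char b0 = static_cast<unsigned char>(src[pos]);
  std::size_t len;
  std::uint32_t value, min;  // min: smallest value allowed for this length
  if (b0 < 0x80)                { len = 1; value = b0;        min = 0; }
  else if ((b0 & 0xE0) == 0xC0) { len = 2; value = b0 & 0x1F; min = 0x80; }
  else if ((b0 & 0xF0) == 0xE0) { len = 3; value = b0 & 0x0F; min = 0x800; }
  else if ((b0 & 0xF8) == 0xF0) { len = 4; value = b0 & 0x07; min = 0x10000; }
  else
    return 0;  // stray continuation byte or invalid lead byte
  if (src.size() - pos < len)
    return 0;  // truncated sequence
  for (std::size_t i = 1; i < len; ++i)
    {
      unsigned char b = static_cast<unsigned char>(src[pos + i]);
      if ((b & 0xC0) != 0x80)
        return 0;
      value = (value << 6) | (b & 0x3F);
    }
  if (value < min || !is_scalar_value(value))
    return 0;  // overlong encoding, surrogate, or out of range
  *cp = value;
  return len;
}

// Append the UTF-8 encoding of cp to *out. Return false, leaving *out
// unchanged, if cp is not a scalar value.
bool encode_utf8(std::uint32_t cp, std::string *out)
{
  if (!is_scalar_value(cp))
    return false;
  if (cp < 0x80)
    out->push_back(static_cast<char>(cp));
  else if (cp < 0x800)
    {
      out->push_back(static_cast<char>(0xC0 | (cp >> 6)));
      out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
  else if (cp < 0x10000)
    {
      out->push_back(static_cast<char>(0xE0 | (cp >> 12)));
      out->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
  else
    {
      out->push_back(static_cast<char>(0xF0 | (cp >> 18)));
      out->push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
      out->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
  return true;
}
```

A lexer built on this would call skip_bom once, then decode_utf8 in a
loop, treating a return value of 0 as a hard error since only valid
UTF-8 is accepted. The same is_scalar_value check serves for values
parsed from unicode escapes before they are passed to encode_utf8.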

I don't know that this is particularly helpful, but the Go frontend
has this kind of code in gcc/go/gofrontend/lex.cc.  E.g.,
Lex::fetch_char, Lex::advance_one_utf8_char, unicode_space,
unicode_digits, unicode_letters, Lex::is_unicode_space, etc.  But you
probably won't be able to use the code directly, and the code in the
gofrontend directory is also shared with GoLLVM so it can't trivially
be moved.

I believe the UTF-8 handling for the C family front ends is all in
libcpp; I don't think it's factored in a way to be useful to other
front ends.

Jason