On Sun, Jul 18, 2021 at 1:13 PM Ian Lance Taylor via Gcc wrote:
> On Sun, Jul 18, 2021 at 6:23 AM Mark Wielaard wrote:
> >
> > For the gcc rust frontend I was thinking of importing a couple of
> > gnulib modules to help with UTF-8 processing, conversion to/from
> > unicode codepoints and determining various properties of those
> > codepoints. But it seems gcc doesn't yet have any gnulib modules
> > imported, and maybe other frontends already have helpers for this
> > that the gcc rust frontend could reuse.
> >
> > Rust only accepts valid UTF-8 encoded source files, which may or may
> > not start with a UTF-8 BOM character. Whitespace is any codepoint with
> > the Pattern_White_Space property. Identifiers can start with any
> > codepoint with the XID_start property plus zero or more codepoints
> > with the XID_continue property. It isn't required, but highly
> > desirable to detect confusable identifiers according to
> > tr39/Confusable_Detection.
> >
> > Other names might be constrained to Alphabetic and/or Number
> > categories (Nd, Nl, No), textual types can only contain Unicode
> > Scalar Values (any Unicode codepoint except high-surrogate and
> > low-surrogates), strings in source code can contain unicode escapes
> > (24 bit, up to 6 digits codepoints) but are internally stored as
> > UTF-8 (and must not encode any surrogates).
> >
> > Do other gcc frontends handle any of the above already in a way that
> > might be reusable for other frontends?
>
> I don't know that this is particularly helpful, but the Go frontend
> has this kind of code in gcc/go/gofrontend/lex.cc. E.g.,
> Lex::fetch_char, Lex::advance_one_utf8_char, unicode_space,
> unicode_digits, unicode_letters, Lex::is_unicode_space, etc. But you
> probably won't be able to use the code directly, and the code in the
> gofrontend directory is also shared with GoLLVM so it can't trivially
> be moved.

I believe the UTF-8 handling for the C family front ends is all in
libcpp; I don't think it's factored in a way to be useful to other
front ends.

Jason
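
For illustration only, the source-level rules quoted above (valid UTF-8 only, an optional leading BOM, no surrogates) can be sketched in standalone C++. This is a minimal sketch under stated assumptions, not code from libcpp, gofrontend, or gnulib: the names `decode_utf8` and `is_valid_utf8_source` are hypothetical. Property lookups such as XID_start or Pattern_White_Space need Unicode data tables and are deliberately left out.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper (not from any GCC frontend).
// Decodes one Unicode scalar value from `s` starting at `pos`.
// On success, advances `pos` past the sequence and returns the codepoint.
// On malformed input, returns std::nullopt and leaves `pos` unchanged.
std::optional<char32_t> decode_utf8(std::string_view s, std::size_t &pos)
{
  if (pos >= s.size())
    return std::nullopt;

  auto byte = [&](std::size_t i) { return static_cast<unsigned char>(s[i]); };

  unsigned char b0 = byte(pos);
  std::size_t len;   // total bytes in this sequence
  char32_t cp;       // codepoint accumulated so far
  char32_t min;      // smallest value allowed for this length

  if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
  else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
  else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
  else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
  else
    return std::nullopt;  // stray continuation byte or invalid lead byte

  if (s.size() - pos < len)
    return std::nullopt;  // truncated sequence

  for (std::size_t i = 1; i < len; ++i)
    {
      unsigned char b = byte(pos + i);
      if ((b & 0xC0) != 0x80)
        return std::nullopt;  // not a continuation byte
      cp = (cp << 6) | (b & 0x3F);
    }

  if (cp < min)
    return std::nullopt;  // overlong encoding
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return std::nullopt;  // surrogates are not Unicode scalar values
  if (cp > 0x10FFFF)
    return std::nullopt;  // beyond the Unicode range

  pos += len;
  return cp;
}

// Hypothetical helper: true if the whole buffer is valid UTF-8,
// skipping one optional BOM (EF BB BF) at the start.
bool is_valid_utf8_source(std::string_view s)
{
  std::size_t pos = 0;
  if (s.substr(0, 3) == "\xEF\xBB\xBF")
    pos = 3;
  while (pos < s.size())
    if (!decode_utf8(s, pos))
      return false;
  return true;
}
```

A lexer would call `decode_utf8` in a loop and then classify each returned codepoint against whatever property tables it has available; the validation here only covers the encoding itself.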