public inbox for gcc-rust@gcc.gnu.org
* rust frontend and UTF-8/unicode processing/properties
@ 2021-07-18 13:22 Mark Wielaard
  2021-07-18 20:12 ` Ian Lance Taylor
       [not found] ` <d5e7434b-80e8-2817-ed87-a23ef2ac0cbb@uma.es>
  0 siblings, 2 replies; 16+ messages in thread
From: Mark Wielaard @ 2021-07-18 13:22 UTC (permalink / raw)
  To: gcc, gcc-rust

Hi,

For the gcc rust frontend I was thinking of importing a couple of
gnulib modules to help with UTF-8 processing, conversion to/from
unicode codepoints and determining various properties of those
codepoints. But it seems gcc doesn't yet have any gnulib modules
imported, and maybe other frontends already have helpers for this that
the gcc rust frontend could reuse.

Rust only accepts valid UTF-8 encoded source files, which may or may
not start with a UTF-8 BOM. Whitespace is any codepoint with
the Pattern_White_Space property. Identifiers start with a codepoint
that has the XID_Start property, followed by zero or more codepoints
with the XID_Continue property. It isn't required, but it is highly
desirable, to detect confusable identifiers according to
tr39/Confusable_Detection.
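
As a rough illustration of the first two requirements (strict UTF-8 with
an optional BOM, and Pattern_White_Space), here is a minimal Python
sketch. The function names are hypothetical and are not taken from gccrs,
gnulib or libcpp; the whitespace set is the fixed list of codepoints that
carry the Pattern_White_Space property.

```python
# Sketch only: decode_source and is_pattern_white_space are hypothetical
# helper names, not functions from gccrs, gnulib or libcpp.

# Codepoints with the Pattern_White_Space property (a small, fixed set).
PATTERN_WHITE_SPACE = frozenset(
    [0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20, 0x85, 0x200E, 0x200F, 0x2028, 0x2029]
)

UTF8_BOM = b"\xef\xbb\xbf"


def decode_source(data: bytes) -> str:
    """Decode a source file, dropping a leading BOM if one is present."""
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    # Strict decoding raises UnicodeDecodeError on malformed input,
    # which includes encoded surrogates and overlong sequences.
    return data.decode("utf-8", errors="strict")


def is_pattern_white_space(ch: str) -> bool:
    """Return True if the single character ch is Pattern_White_Space."""
    return ord(ch) in PATTERN_WHITE_SPACE
```

Note that U+00A0 (no-break space) is deliberately absent: it has the
White_Space property but not Pattern_White_Space.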

Other names might be constrained to the Alphabetic property and/or the
Number categories (Nd, Nl, No). Textual types can only contain Unicode
Scalar Values (any Unicode codepoint except high and low surrogates).
Strings in source code can contain unicode escapes (24 bit, up to 6
hex digits per codepoint) but are stored internally as UTF-8 (and must
not encode any surrogates).
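
The scalar-value and escape rules can be sketched as follows. This is a
simplified illustration in Python with hypothetical names; it assumes the
`\u{...}` escape syntax and does not handle the `_` separators that Rust
permits between the hex digits.

```python
import re

MAX_CODE_POINT = 0x10FFFF


def is_unicode_scalar_value(cp: int) -> bool:
    """Any codepoint up to U+10FFFF except the surrogates U+D800..U+DFFF."""
    return 0 <= cp <= MAX_CODE_POINT and not (0xD800 <= cp <= 0xDFFF)


# One to six hex digits between the braces (simplified: no '_' separators).
_ESCAPE = re.compile(r"\\u\{([0-9a-fA-F]{1,6})\}")


def decode_unicode_escape(text: str) -> bytes:
    """Turn a single \\u{...} escape into its UTF-8 encoding."""
    match = _ESCAPE.fullmatch(text)
    if match is None:
        raise ValueError("not a well-formed unicode escape: %r" % text)
    cp = int(match.group(1), 16)
    if not is_unicode_scalar_value(cp):
        raise ValueError("escape is not a Unicode scalar value: %r" % text)
    return chr(cp).encode("utf-8")
```

Six hex digits can express values above U+10FFFF, which is why the range
check is needed in addition to the digit-count limit.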

Do other gcc frontends handle any of the above already in a way that
might be reusable for other frontends?

Thanks,

Mark


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: rust frontend and UTF-8/unicode processing/properties
  2021-07-18 13:22 rust frontend and UTF-8/unicode processing/properties Mark Wielaard
@ 2021-07-18 20:12 ` Ian Lance Taylor
  2021-07-18 22:23   ` Jason Merrill
       [not found] ` <d5e7434b-80e8-2817-ed87-a23ef2ac0cbb@uma.es>
  1 sibling, 1 reply; 16+ messages in thread
From: Ian Lance Taylor @ 2021-07-18 20:12 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: gcc, gcc-rust

On Sun, Jul 18, 2021 at 6:23 AM Mark Wielaard <mark@klomp.org> wrote:
>
> For the gcc rust frontend I was thinking of importing a couple of
> gnulib modules to help with UTF-8 processing, conversion to/from
> unicode codepoints and determining various properties of those
> codepoints. But it seems gcc doesn't yet have any gnulib modules
> imported, and maybe other frontends already have helpers to this that
> the gcc rust frontend could reuse.
>
> Rust only accepts valid UTF-8 encoded source files, which may or may
> not start with UTF-8 BOM character. Whitespace is any codepoint with
> the Pattern_White_Space property. Identifiers can start with any
> codepoint with the XID_start property plus zero or one codepoints with
> XID_continue property. It isn't required, but highly desirable to
> detect confusable identifiers according to tr39/Confusable_Detection.
>
> Other names might be constraint to Alphabetic and/or Number categories
> (Nd, Nl, No), textual types can only contain Unicode Scalar Values
> (any Unicode codepoint except high-surrogate and low-surrogates),
> strings in source code can contain unicode escapes (24 bit, up to 6
> digits codepoints) but are internally stored as UTF-8 (and must not
> encode any surrogates).
>
> Do other gcc frontends handle any of the above already in a way that
> might be reusable for other frontends?

I don't know that this is particularly helpful, but the Go frontend
has this kind of code in gcc/go/gofrontend/lex.cc.  E.g.,
Lex::fetch_char, Lex::advance_one_utf8_char, unicode_space,
unicode_digits, unicode_letters, Lex::is_unicode_space, etc.  But you
probably won't be able to use the code directly, and the code in the
gofrontend directory is also shared with GoLLVM so it can't trivially
be moved.

Ian

* Re: rust frontend and UTF-8/unicode processing/properties
  2021-07-18 20:12 ` Ian Lance Taylor
@ 2021-07-18 22:23   ` Jason Merrill
  2021-07-23 11:29     ` Philip Herron
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Merrill @ 2021-07-18 22:23 UTC (permalink / raw)
  To: Ian Lance Taylor; +Cc: Mark Wielaard, gcc Mailing List, gcc-rust

On Sun, Jul 18, 2021 at 1:13 PM Ian Lance Taylor via Gcc <gcc@gcc.gnu.org>
wrote:

> On Sun, Jul 18, 2021 at 6:23 AM Mark Wielaard <mark@klomp.org> wrote:
> >
> > For the gcc rust frontend I was thinking of importing a couple of
> > gnulib modules to help with UTF-8 processing, conversion to/from
> > unicode codepoints and determining various properties of those
> > codepoints. But it seems gcc doesn't yet have any gnulib modules
> > imported, and maybe other frontends already have helpers to this that
> > the gcc rust frontend could reuse.
> >
> > Rust only accepts valid UTF-8 encoded source files, which may or may
> > not start with UTF-8 BOM character. Whitespace is any codepoint with
> > the Pattern_White_Space property. Identifiers can start with any
> > codepoint with the XID_start property plus zero or one codepoints with
> > XID_continue property. It isn't required, but highly desirable to
> > detect confusable identifiers according to tr39/Confusable_Detection.
> >
> > Other names might be constraint to Alphabetic and/or Number categories
> > (Nd, Nl, No), textual types can only contain Unicode Scalar Values
> > (any Unicode codepoint except high-surrogate and low-surrogates),
> > strings in source code can contain unicode escapes (24 bit, up to 6
> > digits codepoints) but are internally stored as UTF-8 (and must not
> > encode any surrogates).
> >
> > Do other gcc frontends handle any of the above already in a way that
> > might be reusable for other frontends?
>
> I don't know that this is particularly helpful, but the Go frontend
> has this kind of code in gcc/go/gofrontend/lex.cc.  E.g.,
> Lex::fetch_char, Lex::advance_one_utf8_char, unicode_space,
> unicode_digits, unicode_letters, Lex::is_unicode_space, etc.  But you
> probably won't be able to use the code directly, and the code in the
> gofrontend directory is also shared with GoLLVM so it can't trivially
> be moved.
>

I believe the UTF-8 handling for the C family front ends is all in libcpp;
I don't think it's factored in a way to be useful to other front ends.

Jason

* Re: rust frontend and UTF-8/unicode processing/properties
  2021-07-18 22:23   ` Jason Merrill
@ 2021-07-23 11:29     ` Philip Herron
  0 siblings, 0 replies; 16+ messages in thread
From: Philip Herron @ 2021-07-23 11:29 UTC (permalink / raw)
  To: Jason Merrill, Ian Lance Taylor; +Cc: gcc Mailing List, Mark Wielaard, gcc-rust


On 18/07/2021 23:23, Jason Merrill via Gcc-rust wrote:
> On Sun, Jul 18, 2021 at 1:13 PM Ian Lance Taylor via Gcc
> <gcc@gcc.gnu.org <mailto:gcc@gcc.gnu.org>> wrote:
>
>     On Sun, Jul 18, 2021 at 6:23 AM Mark Wielaard <mark@klomp.org
>     <mailto:mark@klomp.org>> wrote:
>     >
>     > For the gcc rust frontend I was thinking of importing a couple of
>     > gnulib modules to help with UTF-8 processing, conversion to/from
>     > unicode codepoints and determining various properties of those
>     > codepoints. But it seems gcc doesn't yet have any gnulib modules
>     > imported, and maybe other frontends already have helpers to this
>     that
>     > the gcc rust frontend could reuse.
>     >
>     > Rust only accepts valid UTF-8 encoded source files, which may or may
>     > not start with UTF-8 BOM character. Whitespace is any codepoint with
>     > the Pattern_White_Space property. Identifiers can start with any
>     > codepoint with the XID_start property plus zero or one
>     codepoints with
>     > XID_continue property. It isn't required, but highly desirable to
>     > detect confusable identifiers according to
>     tr39/Confusable_Detection.
>     >
>     > Other names might be constraint to Alphabetic and/or Number
>     categories
>     > (Nd, Nl, No), textual types can only contain Unicode Scalar Values
>     > (any Unicode codepoint except high-surrogate and low-surrogates),
>     > strings in source code can contain unicode escapes (24 bit, up to 6
>     > digits codepoints) but are internally stored as UTF-8 (and must not
>     > encode any surrogates).
>     >
>     > Do other gcc frontends handle any of the above already in a way that
>     > might be reusable for other frontends?
>
>     I don't know that this is particularly helpful, but the Go frontend
>     has this kind of code in gcc/go/gofrontend/lex.cc.  E.g.,
>     Lex::fetch_char, Lex::advance_one_utf8_char, unicode_space,
>     unicode_digits, unicode_letters, Lex::is_unicode_space, etc.  But you
>     probably won't be able to use the code directly, and the code in the
>     gofrontend directory is also shared with GoLLVM so it can't trivially
>     be moved.
>
>
> I believe the UTF-8 handling for the C family front ends is all in
> libcpp; I don't think it's factored in a way to be useful to other
> front ends.
>
> Jason
>
I think it would be ideal to reuse code if possible. This project
could be an opportunity to write patches that make code from other
parts of GCC reusable. I bootstrapped the front-end code from gccgo,
which has really helped the project get going, and I would like to
give back where I can.

Thanks

--Phil


* Re: [GSoC] gccrs Unicode support
       [not found]   ` <CAOWUKr0Sd3RRSy2cuqMLj--KTWqOz=nQMxmx7ahM8YunrFzEig@mail.gmail.com>
@ 2023-03-15 11:00     ` Philip Herron
  2023-03-15 14:53       ` Arsen Arsenović
  2023-03-15 15:18       ` Jakub Jelinek
  0 siblings, 2 replies; 16+ messages in thread
From: Philip Herron @ 2023-03-15 11:00 UTC (permalink / raw)
  To: Raiki Tamura; +Cc: gcc, gcc-rust, David Edelsohn, Arthur Cohen

Hi Raiki

Excellent work on getting up to speed on the rust front-end. From my
perspective, I am interested to see what the wider GCC community thinks
about using the https://www.gnu.org/software/libunistring/ library within
GCC instead of rolling our own; it would mean another dependency for GCC.

The other option: there is already code in the other front-ends that does
this, so in the worst case it should be possible to extract something from
them and possibly turn it into a shared piece of functionality, which we
can mentor you through.

Thanks

--Phil

On Mon, 13 Mar 2023 at 16:19, Raiki Tamura via Gcc <gcc@gcc.gnu.org> wrote:

> Hello,
>
> My name is Raiki Tamura, an undergraduate student at Kyoto University in
> Japan and I want to work on Unicode support in gccrs this year.
> I have already written my proposal (linked below) and shared it with the
> gccrs team in Zulip.
> In the project, I am planning to use the GNU unistring library to handle
> Unicode characters and the GNU IDN library to normalize identifiers.
> According to my potential mentor, it would provide Unicode libraries for
> all frontends in GCC. If there are concerns or feedback about this, please
> tell me about it.
> Thank you.
>
> Link to my proposal:
>
> https://docs.google.com/document/d/1MgsbJMF-p-ndgrX2iKeWDR5KPSWw9Z7onsHIiZ2pPKs/edit?usp=sharing
>
> Raiki Tamura
>

* Re: [GSoC] gccrs Unicode support
  2023-03-15 11:00     ` [GSoC] gccrs Unicode support Philip Herron
@ 2023-03-15 14:53       ` Arsen Arsenović
  2023-03-15 15:18       ` Jakub Jelinek
  1 sibling, 0 replies; 16+ messages in thread
From: Arsen Arsenović @ 2023-03-15 14:53 UTC (permalink / raw)
  To: Philip Herron; +Cc: Raiki Tamura, gcc-rust, David Edelsohn, Arthur Cohen, gcc

Philip Herron via Gcc <gcc@gcc.gnu.org> writes:

> Hi Raiki

Welcome, Raiki!

> Excellent work on getting up to speed on the rust front-end. From my
> perspective I am interested to see what the wider GCC community thinks
> about using https://www.gnu.org/software/libunistring/ library within GCC
> instead of rolling our own, this means it will be another dependency on GCC.

As my $0.02, it is likely best not to create yet another
re-implementation.  There's already precedent for including dependencies
that can do a very complex job well, like GMP and MPFR.

Text handling is deceptively simple, and in practice, nobody seems to
get it fully right.  The effort is minimized, and yet most effectively
shared, if done in a library.

(note: I don't have a horse in the race wrt which specific library to
use, as I'm no expert, but I suspect libunistring could work well)

Have a wonderful day!

> The other option is there is already code in the other front-ends to do
> this so in the worst case it should be possible to extract something out of
> them and possibly make this a shared piece of functionality which we can
> mentor you through.
>
> Thanks
>
> --Phil
>
> On Mon, 13 Mar 2023 at 16:19, Raiki Tamura via Gcc <gcc@gcc.gnu.org> wrote:
>
>> Hello,
>>
>> My name is Raiki Tamura, an undergraduate student at Kyoto University in
>> Japan and I want to work on Unicode support in gccrs this year.
>> I have already written my proposal (linked below) and shared it with the
>> gccrs team in Zulip.
>> In the project, I am planning to use the GNU unistring library to handle
>> Unicode characters and the GNU IDN library to normalize identifiers.
>> According to my potential mentor, it would provide Unicode libraries for
>> all frontends in GCC. If there are concerns or feedback about this, please
>> tell me about it.
>> Thank you.
>>
>> Link to my proposal:
>>
>> https://docs.google.com/document/d/1MgsbJMF-p-ndgrX2iKeWDR5KPSWw9Z7onsHIiZ2pPKs/edit?usp=sharing
>>
>> Raiki Tamura
>>


-- 
Arsen Arsenović

* Re: [GSoC] gccrs Unicode support
  2023-03-15 11:00     ` [GSoC] gccrs Unicode support Philip Herron
  2023-03-15 14:53       ` Arsen Arsenović
@ 2023-03-15 15:18       ` Jakub Jelinek
  2023-03-16  8:57         ` Raiki Tamura
  2023-03-16  9:28         ` Thomas Schwinge
  1 sibling, 2 replies; 16+ messages in thread
From: Jakub Jelinek @ 2023-03-15 15:18 UTC (permalink / raw)
  To: Philip Herron; +Cc: Raiki Tamura, gcc, gcc-rust, David Edelsohn, Arthur Cohen

On Wed, Mar 15, 2023 at 11:00:19AM +0000, Philip Herron via Gcc wrote:
> Excellent work on getting up to speed on the rust front-end. From my
> perspective I am interested to see what the wider GCC community thinks
> about using https://www.gnu.org/software/libunistring/ library within GCC
> instead of rolling our own, this means it will be another dependency on GCC.
> 
> The other option is there is already code in the other front-ends to do
> this so in the worst case it should be possible to extract something out of
> them and possibly make this a shared piece of functionality which we can
> mentor you through.

I don't know what exactly Rust FE needs in this area, but e.g. libcpp
already handles whatever C/C++ need from Unicode support POV and can handle
it without any extra libraries.
So, if we could avoid the extra dependency, it would be certainly better,
unless you really need massive amounts of code from those libraries.
libcpp already e.g. provides mapping of unicode character names to code
points, determining which unicode characters can appear at the start or
in the middle of identifiers, etc.
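
For readers unfamiliar with this kind of table, here is a minimal Python
sketch of a range-table lookup for "can start an identifier" and "can
continue an identifier". The layout (sorted ranges keyed by their last
code point, plus a flag word) and all names are assumptions made for the
illustration, not the actual libcpp data structure, and the sample table
only covers U+0000..U+00FF. The leading-underscore rule comes from the
Rust Reference, not from libcpp.

```python
from bisect import bisect_left

START = 1      # may begin an identifier (XID_Start)
CONTINUE = 2   # may follow the first character (XID_Continue)
BOTH = START | CONTINUE

# Hypothetical sample table covering only U+0000..U+00FF.
# Each entry is (last code point of the range, flags); a range begins one
# past the previous entry's last code point.
RANGES = [
    (0x2F, 0), (0x39, CONTINUE),  # digits 0-9
    (0x40, 0), (0x5A, BOTH),      # A-Z
    (0x5E, 0), (0x5F, CONTINUE),  # underscore
    (0x60, 0), (0x7A, BOTH),      # a-z
    (0xA9, 0), (0xAA, BOTH),      # feminine ordinal indicator
    (0xB4, 0), (0xB5, BOTH),      # micro sign
    (0xB6, 0), (0xB7, CONTINUE),  # middle dot
    (0xB9, 0), (0xBA, BOTH),      # masculine ordinal indicator
    (0xBF, 0), (0xD6, BOTH),      # Latin-1 capital letters before U+00D7
    (0xD7, 0), (0xF6, BOTH),      # multiplication sign excluded
    (0xF7, 0), (0xFF, BOTH),      # division sign excluded
]
_ENDS = [end for end, _ in RANGES]


def flags_for(cp: int) -> int:
    """Binary-search the table for the range containing cp."""
    if cp < 0 or cp > _ENDS[-1]:
        raise LookupError("code point outside the sample table")
    return RANGES[bisect_left(_ENDS, cp)][1]


def is_identifier_shape(name: str) -> bool:
    """XID_Start XID_Continue*, or '_' followed by at least one XID_Continue."""
    if not name:
        return False
    rest = name[1:]
    if name[0] == "_":
        start_ok = len(rest) >= 1
    else:
        start_ok = bool(flags_for(ord(name[0])) & START)
    return start_ok and all(flags_for(ord(c)) & CONTINUE for c in rest)
```

A real table would cover the whole codespace; this one raises LookupError
for anything above U+00FF rather than guessing.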

	Jakub


* Re: [GSoC] gccrs Unicode support
  2023-03-15 15:18       ` Jakub Jelinek
@ 2023-03-16  8:57         ` Raiki Tamura
  2023-03-16  9:28         ` Thomas Schwinge
  1 sibling, 0 replies; 16+ messages in thread
From: Raiki Tamura @ 2023-03-16  8:57 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Arthur Cohen, David Edelsohn, Philip Herron, gcc, gcc-rust

Sorry for resending this email. I forgot to use “Reply All”.

Thank you for your response, Arsen and Jakub.
I did not know C++ also supports Unicode identifiers.
I looked a little into C++ and found C++ accepts the same form of
identifiers as Rust.
So I will do further investigation of libcpp with the hope that it can also
be used in the Rust frontend.

Raiki Tamura

On Thu, Mar 16, 2023 at 0:18 Jakub Jelinek <jakub@redhat.com> wrote:

> On Wed, Mar 15, 2023 at 11:00:19AM +0000, Philip Herron via Gcc wrote:
> > Excellent work on getting up to speed on the rust front-end. From my
> > perspective I am interested to see what the wider GCC community thinks
> > about using https://www.gnu.org/software/libunistring/ library within
> GCC
> > instead of rolling our own, this means it will be another dependency on
> GCC.
> >
> > The other option is there is already code in the other front-ends to do
> > this so in the worst case it should be possible to extract something out
> of
> > them and possibly make this a shared piece of functionality which we can
> > mentor you through.
>
> I don't know what exactly Rust FE needs in this area, but e.g. libcpp
> already handles whatever C/C++ need from Unicode support POV and can handle
> it without any extra libraries.
> So, if we could avoid the extra dependency, it would be certainly better,
> unless you really need massive amounts of code from those libraries.
> libcpp already e.g. provides mapping of unicode character names to code
> points, determining which unicode characters can appear at the start or
> in the middle of identifiers, etc.
>
>         Jakub
>
>

* Re: [GSoC] gccrs Unicode support
  2023-03-15 15:18       ` Jakub Jelinek
  2023-03-16  8:57         ` Raiki Tamura
@ 2023-03-16  9:28         ` Thomas Schwinge
  2023-03-16 12:58           ` Mark Wielaard
  1 sibling, 1 reply; 16+ messages in thread
From: Thomas Schwinge @ 2023-03-16  9:28 UTC (permalink / raw)
  To: Raiki Tamura, Jakub Jelinek, Philip Herron
  Cc: gcc, gcc-rust, David Edelsohn, Arthur Cohen, Arsen Arsenović,
	Mark Wielaard

Hi!

(By the way, this GSoC project is being discussed in GCC/Rust Zulip:
<https://gcc-rust.zulipchat.com/#narrow/stream/327528-GSoC/topic/Unicode.20support>.)

I'm now also putting Mark Wielaard in CC; he once also started discussing
this topic, "thinking of importing a couple of gnulib modules to help
with UTF-8 processing [unless] other gcc frontends handle [these things]
already in a way that might be reusable".  See the thread starting at
<https://inbox.sourceware.org/gcc/YPQrMBHyu3wRpT5o@wildebeest.org>
"rust frontend and UTF-8/unicode processing/properties".

On 2023-03-15T16:18:18+0100, Jakub Jelinek via Gcc <gcc@gcc.gnu.org> wrote:
> On Wed, Mar 15, 2023 at 11:00:19AM +0000, Philip Herron via Gcc wrote:
>> Excellent work on getting up to speed on the rust front-end. From my
>> perspective I am interested to see what the wider GCC community thinks
>> about using https://www.gnu.org/software/libunistring/ library within GCC
>> instead of rolling our own, this means it will be another dependency on GCC.
>>
>> The other option is there is already code in the other front-ends to do
>> this so in the worst case it should be possible to extract something out of
>> them and possibly make this a shared piece of functionality which we can
>> mentor you through.
>
> I don't know what exactly Rust FE needs in this area, but e.g. libcpp
> already handles whatever C/C++ need from Unicode support POV and can handle
> it without any extra libraries.
> So, if we could avoid the extra dependency, it would be certainly better,
> unless you really need massive amounts of code from those libraries.
> libcpp already e.g. provides mapping of unicode character names to code
> points, determining which unicode characters can appear at the start or
> in the middle of identifiers, etc.

So that's exactly the answer that I supposed you or someone else would
give.  ;-)

That means, GCC/Rust has some investigation to do: whether what libcpp
contains is (a) sufficient for its needs, and (b) whether that code can
be reused/extracted/refactored in a sensible way, into GCC-level shared
source code file, to be used by several front ends (possibly via libcpp).
(I suppose GCC/Rust shouldn't link in libcpp directly.)


Thanks for the input, all!


Regards
 Thomas

* Re: [GSoC] gccrs Unicode support
  2023-03-16  9:28         ` Thomas Schwinge
@ 2023-03-16 12:58           ` Mark Wielaard
  2023-03-16 13:07             ` Jakub Jelinek
  2023-03-18  8:31             ` Raiki Tamura
  0 siblings, 2 replies; 16+ messages in thread
From: Mark Wielaard @ 2023-03-16 12:58 UTC (permalink / raw)
  To: Thomas Schwinge, Raiki Tamura, Jakub Jelinek, Philip Herron
  Cc: gcc, gcc-rust, David Edelsohn, Arthur Cohen, Arsen Arsenović

Hi,

On Thu, 2023-03-16 at 10:28 +0100, Thomas Schwinge wrote:
> I'm now also putting Mark Wielaard in CC; he once also started discussing
> this topic, "thinking of importing a couple of gnulib modules to help
> with UTF-8 processing [unless] other gcc frontends handle [these things]
> already in a way that might be reusable".  See the thread starting at
> <https://inbox.sourceware.org/gcc/YPQrMBHyu3wRpT5o@wildebeest.org>
> "rust frontend and UTF-8/unicode processing/properties".

Thanks. BTW, I am not currently working on this.
Note the responses in the above thread by Ian and Jason who pointed out
that some of the requirements of the gccrs frontend might be covered in
the go frontend and libcpp, but not really in a reusable way.

One other thing you might want to coordinate on is NFC normalization
and Confusable Detection for identifiers.
https://unicode.org/reports/tr39/#Confusable_Detection
There has been some work on this by David Malcolm and Marek Polacek
https://developers.redhat.com/articles/2022/01/12/prevent-trojan-source-attacks-gcc-12
But that is on a slightly higher source level (not specific to
identifiers).

You might want to research whether NFC normalization of identifiers is
required to be done by the lexer or parser in Rust and how it interacts
with proc macros.
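
To make the normalization question concrete, here is a small Python
sketch using the standard unicodedata module. The helper name is
hypothetical; it only shows why two differently encoded spellings of the
same name need NFC before they compare equal, and says nothing about
where in the Rust pipeline this should happen.

```python
import unicodedata


def normalize_identifier(name: str) -> str:
    """Return the NFC form of an identifier (hypothetical helper)."""
    return unicodedata.normalize("NFC", name)


# The same visible name, written two ways:
composed = "caf\u00e9"      # 'e with acute' as one precomposed codepoint
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# The raw strings differ, so a lexer comparing them codepoint by codepoint
# would treat them as two distinct identifiers.
raw_equal = composed == decomposed

# After NFC, both spellings collapse to the precomposed form.
nfc_equal = normalize_identifier(composed) == normalize_identifier(decomposed)
```

Here raw_equal is False and nfc_equal is True.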

Cheers,

Mark

* Re: [GSoC] gccrs Unicode support
  2023-03-16 12:58           ` Mark Wielaard
@ 2023-03-16 13:07             ` Jakub Jelinek
  2023-03-18  8:31             ` Raiki Tamura
  1 sibling, 0 replies; 16+ messages in thread
From: Jakub Jelinek @ 2023-03-16 13:07 UTC (permalink / raw)
  To: Mark Wielaard
  Cc: Thomas Schwinge, Raiki Tamura, Philip Herron, gcc, gcc-rust,
	David Edelsohn, Arthur Cohen, Arsen Arsenović

On Thu, Mar 16, 2023 at 01:58:57PM +0100, Mark Wielaard wrote:
> On Thu, 2023-03-16 at 10:28 +0100, Thomas Schwinge wrote:
> > I'm now also putting Mark Wielaard in CC; he once also started discussing
> > this topic, "thinking of importing a couple of gnulib modules to help
> > with UTF-8 processing [unless] other gcc frontends handle [these things]
> > already in a way that might be reusable".  See the thread starting at
> > <https://inbox.sourceware.org/gcc/YPQrMBHyu3wRpT5o@wildebeest.org>
> > "rust frontend and UTF-8/unicode processing/properties".
> 
> Thanks. BTW. I am not currently working on this.
> Note the responses in the above thread by Ian and Jason who pointed out
> that some of the requirements of the gccrs frontend might be covered in
> the go frontend and libcpp, but not really in a reusable way.

libcpp can be certainly linked into the gccrs FE and specific functions
called from it even if libcpp isn't used as a preprocessor for the language.
Small changes to libcpp are obviously possible as well to make it work.

	Jakub


* Re: [GSoC] gccrs Unicode support
  2023-03-16 12:58           ` Mark Wielaard
  2023-03-16 13:07             ` Jakub Jelinek
@ 2023-03-18  8:31             ` Raiki Tamura
  2023-03-18  8:47               ` Jonathan Wakely
  1 sibling, 1 reply; 16+ messages in thread
From: Raiki Tamura @ 2023-03-18  8:31 UTC (permalink / raw)
  To: Mark Wielaard
  Cc: Thomas Schwinge, Jakub Jelinek, Philip Herron, gcc, gcc-rust,
	David Edelsohn, Arthur Cohen, Arsen Arsenović

Thank you everyone for your advice.
Some kinds of names in Rust are restricted to Unicode alphabetic/numeric
characters. The table currently defined in libcpp/ucnid.h lacks some
rows indicating which characters are alphabetic/numeric.
That is not a problem, though, because it seems easy to add the missing
rows to the table and use it in the Rust frontend.

On Thu, Mar 16, 2023 at 21:59 Mark Wielaard <mark@klomp.org> wrote:

> You might want to research whether NFC normalization of identifiers is
> required to be done by the lexer or parser in Rust and how it interacts
> with proc macros.


Yes, NFC normalization must be done by the lexer, which may be complex and
hard to implement.
libunistring can also be used for normalization, so would it be acceptable
to use libunistring only for the normalization step?

Raiki Tamura

* Re: [GSoC] gccrs Unicode support
  2023-03-18  8:31             ` Raiki Tamura
@ 2023-03-18  8:47               ` Jonathan Wakely
  2023-03-18  8:59                 ` Raiki Tamura
  0 siblings, 1 reply; 16+ messages in thread
From: Jonathan Wakely @ 2023-03-18  8:47 UTC (permalink / raw)
  To: Raiki Tamura
  Cc: Mark Wielaard, Thomas Schwinge, Jakub Jelinek, Philip Herron,
	gcc, gcc-rust, David Edelsohn, Arthur Cohen, Arsen Arsenović

On Sat, 18 Mar 2023, 08:32 Raiki Tamura via Gcc, <gcc@gcc.gnu.org> wrote:

> Thank you everyone for your advice.
> Some kinds of names are restricted to unicode alphabetic/numeric in Rust.
>

Doesn't it use the same rules as C++, based on XID_Start and XID_Continue?
That should already be supported.

* Re: [GSoC] gccrs Unicode support
  2023-03-18  8:47               ` Jonathan Wakely
@ 2023-03-18  8:59                 ` Raiki Tamura
  2023-03-18  9:28                   ` Jakub Jelinek
  0 siblings, 1 reply; 16+ messages in thread
From: Raiki Tamura @ 2023-03-18  8:59 UTC (permalink / raw)
  To: Jonathan Wakely
  Cc: Mark Wielaard, Thomas Schwinge, Jakub Jelinek, Philip Herron,
	gcc, gcc-rust, David Edelsohn, Arthur Cohen, Arsen Arsenović

On Sat, Mar 18, 2023 at 17:47 Jonathan Wakely <jwakely.gcc@gmail.com> wrote:

> On Sat, 18 Mar 2023, 08:32 Raiki Tamura via Gcc, <gcc@gcc.gnu.org> wrote:
>
>> Thank you everyone for your advice.
>> Some kinds of names are restricted to unicode alphabetic/numeric in Rust.
>>
>
> Doesn't it use the same rules as C++, based on XID_Start and XID_Continue?
> That should already be supported.
>

Yes, C++ and Rust use the same rules for identifiers (described in UAX#31)
and we can reuse it in the lexer of gccrs.
I was talking about values of Rust's crate_name attributes, which only
allow Unicode alphabetic/numeric characters.
(Ref:
https://doc.rust-lang.org/reference/crates-and-source-files.html#the-crate_name-attribute
)
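
A rough Python sketch of such a crate_name check follows. It is an
approximation under stated assumptions: the standard library does not
expose the Alphabetic derived property, so the letter categories (L*)
stand in for it here, and "_" is assumed to be allowed in addition to
alphanumeric characters. The function names are hypothetical.

```python
import unicodedata

NUMERIC_CATEGORIES = {"Nd", "Nl", "No"}


def is_alphanumeric_approx(ch: str) -> bool:
    """Approximate 'Unicode alphabetic or numeric' for one character.

    Assumption: General_Category L* is used in place of the Alphabetic
    derived property, which also covers some marks and symbols.
    """
    category = unicodedata.category(ch)
    return category.startswith("L") or category in NUMERIC_CATEGORIES


def is_valid_crate_name(name: str) -> bool:
    """Non-empty, and every character is '_' or (approximately) alphanumeric."""
    return bool(name) and all(
        ch == "_" or is_alphanumeric_approx(ch) for ch in name
    )
```

An exact check would need a table generated from the Unicode data files
rather than the category shortcut used here.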

Raiki Tamura

* Re: [GSoC] gccrs Unicode support
  2023-03-18  8:59                 ` Raiki Tamura
@ 2023-03-18  9:28                   ` Jakub Jelinek
  2023-03-20 10:19                     ` Raiki Tamura
  0 siblings, 1 reply; 16+ messages in thread
From: Jakub Jelinek @ 2023-03-18  9:28 UTC (permalink / raw)
  To: Raiki Tamura
  Cc: Jonathan Wakely, Mark Wielaard, Thomas Schwinge, Philip Herron,
	gcc, gcc-rust, David Edelsohn, Arthur Cohen, Arsen Arsenović

On Sat, Mar 18, 2023 at 05:59:34PM +0900, Raiki Tamura wrote:
> On Sat, Mar 18, 2023 at 17:47 Jonathan Wakely <jwakely.gcc@gmail.com> wrote:
> 
> > On Sat, 18 Mar 2023, 08:32 Raiki Tamura via Gcc, <gcc@gcc.gnu.org> wrote:
> >
> >> Thank you everyone for your advice.
> >> Some kinds of names are restricted to unicode alphabetic/numeric in Rust.
> >>
> >
> > Doesn't it use the same rules as C++, based on XID_Start and XID_Continue?
> > That should already be supported.
> >
> 
> Yes, C++ and Rust use the same rules for identifiers (described in UAX#31)
> and we can reuse it in the lexer of gccrs.
> I was talking about values of Rust's crate_name attributes, which only
> allow Unicode alphabetic/numeric characters.
> (Ref:
> https://doc.rust-lang.org/reference/crates-and-source-files.html#the-crate_name-attribute
> )

That is a pretty simple thing, so there is no need to use an extra library
for that. As documented in contrib/unicode/README, the Unicode *.txt files
are already checked in and there are several table generators.
libcpp/makeucnid.cc already creates tables based on the UnicodeData.txt,
DerivedNormalizationProps.txt and DerivedCoreProperties.txt files,
including NFC/NFKC; it is true that it doesn't currently compute whether
a character is alphanumeric.  That is either the Alphabetic property in
DerivedCoreProperties.txt, or, for numeric, the Nd, Nl or No category
(3rd column) in UnicodeData.txt.  It should take only a few lines to add
that support to libcpp/makeucnid.cc; the only question is whether
differentiating on another ALPHANUM flag would make the ucnranges array
much larger. If it doesn't grow too much, let's put it there; if it would
grow too much, perhaps we should emit it in a separate table.
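
The computation described above can be sketched in Python as follows.
This is not makeucnid.cc; all names are hypothetical, the input is a few
sample lines in the format of the two data files, and the First/Last
range pairs that UnicodeData.txt uses for large blocks are ignored.
Counting the resulting ranges gives a rough feel for how much a table
would grow.

```python
NUMERIC_CATEGORIES = ("Nd", "Nl", "No")


def parse_numeric(unicode_data_lines):
    """Code points whose General_Category (3rd column) is Nd, Nl or No."""
    result = set()
    for line in unicode_data_lines:
        fields = line.split(";")
        if len(fields) > 2 and fields[2] in NUMERIC_CATEGORIES:
            result.add(int(fields[0], 16))
    return result


def parse_alphabetic(derived_lines):
    """Code points listed with the Alphabetic property."""
    result = set()
    for line in derived_lines:
        body = line.split("#", 1)[0].strip()   # drop trailing comment
        parts = [p.strip() for p in body.split(";")]
        if len(parts) < 2 or parts[1] != "Alphabetic":
            continue
        first, _, last = parts[0].partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        result.update(range(lo, hi + 1))
    return result


def to_ranges(code_points):
    """Merge a set of code points into sorted, inclusive (first, last) ranges."""
    ranges = []
    for cp in sorted(code_points):
        if ranges and ranges[-1][1] == cp - 1:
            ranges[-1] = (ranges[-1][0], cp)
        else:
            ranges.append((cp, cp))
    return ranges


def alphanumeric_ranges(unicode_data_lines, derived_lines):
    combined = parse_numeric(unicode_data_lines) | parse_alphabetic(derived_lines)
    return to_ranges(combined)


# Sample lines in the format of the two files (not the full data).
UNICODE_DATA_SAMPLE = """\
0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
00B2;SUPERSCRIPT TWO;No;0;EN;<super> 0032;;2;2;N;SUPERSCRIPT DIGIT TWO;;;;
2160;ROMAN NUMERAL ONE;Nl;0;L;<compat> 0049;;;1;N;;;;2170;
"""

DERIVED_SAMPLE = """\
# Derived Property: Alphabetic
0041..005A    ; Alphabetic # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Alphabetic # Lo       FEMININE ORDINAL INDICATOR
0061..007A    ; Lowercase # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
"""

SAMPLE_RANGES = alphanumeric_ranges(
    UNICODE_DATA_SAMPLE.splitlines(), DERIVED_SAMPLE.splitlines()
)
```

Running the same merge over the full data files, and comparing the range
count with and without the extra flag, would be one way to answer the
table-size question.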

	Jakub


* Re: [GSoC] gccrs Unicode support
  2023-03-18  9:28                   ` Jakub Jelinek
@ 2023-03-20 10:19                     ` Raiki Tamura
  0 siblings, 0 replies; 16+ messages in thread
From: Raiki Tamura @ 2023-03-20 10:19 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Jonathan Wakely, Mark Wielaard, Thomas Schwinge, Philip Herron,
	gcc, gcc-rust, David Edelsohn, Arthur Cohen, Arsen Arsenović

On Sat, Mar 18, 2023 at 18:28 Jakub Jelinek <jakub@redhat.com> wrote:

> That is a pretty simple thing, so no need to use an extra library for that.
> As is documented in contrib/unicode/README, the Unicode *.txt files are
> already checked in and there are several generators of tables.
> libcpp/makeucnid.cc already creates tables based on the
> UnicodeData.txt DerivedNormalizationProps.txt DerivedCoreProperties.txt
> files, including NFC/NFKC, it is true it doesn't currently compute
> whether a character is alphanumeric.  That is either Alphabetic
> DerivedCoreProperties.txt property, or for numeric Nd, Nl or No category
> (3rd column) in UnicodeData.txt.  Should be a few lines to add that support
> to libcpp/makeucnid.cc, the only question is if it won't make the ucnranges
> array much larger if it differentiates based on another ALPHANUM flag.
> If it doesn't grow too much, let's put it there, if it would grow too much,
> perhaps we should emit it in a separate table.
>

Sounds good. I now have a concrete idea for the implementation.
Thank you everyone for giving your advice.

Sincerely yours,
Raiki Tamura
