* [Bug c/94990] New: NFC / NFD in identifiers
@ 2020-05-07 22:54 Arfrever.FTA at GMail dot Com
  2020-05-07 23:08 ` [Bug c/94990] " joseph at codesourcery dot com
  0 siblings, 1 reply; 2+ messages in thread
From: Arfrever.FTA at GMail dot Com @ 2020-05-07 22:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94990

            Bug ID: 94990
           Summary: NFC / NFD in identifiers
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: Arfrever.FTA at GMail dot Com
  Target Milestone: ---

GCC 10 introduced support for non-ASCII characters in identifiers.
However, that support is incomplete with respect to NFC / NFD normalization [1]:

$ gcc -o /tmp/test -x c - <<<"int ś = 123; int main() {return ś;}"
<stdin>: In function ‘main’:
<stdin>:1:34: warning: `s\U00000301' is not in NFC [-Wnormalized=]
<stdin>:1:34: error: ‘ś’ undeclared (first use in this function)
<stdin>:1:34: note: each undeclared identifier is reported only once for each function it appears in
$ 
(The first occurrence uses [LATIN SMALL LETTER S WITH ACUTE]; the second uses
[LATIN SMALL LETTER S, COMBINING ACUTE ACCENT].)
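
To make the difference between the two spellings concrete, here is a minimal Python sketch using the standard `unicodedata` module. The variable and function names are illustrative only; it just shows that the two strings differ code point by code point yet become equal once both are put into the same normalization form.

```python
import unicodedata

# The two spellings from the reproducer above.
precomposed = "\u015b"   # LATIN SMALL LETTER S WITH ACUTE
decomposed = "s\u0301"   # LATIN SMALL LETTER S + COMBINING ACUTE ACCENT


def names(text):
    """Return the Unicode character name of each code point in text."""
    return [unicodedata.name(ch) for ch in text]


# Compared code point by code point, the two spellings differ.
same_raw = precomposed == decomposed

# After normalizing to a common form, they compare equal.
same_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_nfd = unicodedata.normalize("NFD", precomposed) == decomposed

print(names(precomposed))
print(names(decomposed))
print(same_raw, same_nfc, same_nfd)
```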


Since many base-plus-combining-character sequences have no precomposed
equivalent [2][3], it would make more sense for GCC to normalize all
identifiers to NFD [4].

[1] https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
[2] https://en.wikipedia.org/wiki/Precomposed_character
[3] https://en.wikipedia.org/wiki/List_of_precomposed_Latin_characters_in_Unicode
[4] https://en.wikipedia.org/wiki/Combining_character
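
As a rough illustration of what the request amounts to, the sketch below keys a toy symbol table on the NFD form of each name. This is not how GCC is implemented; the `SymbolTable` class and its methods are hypothetical and exist only to show the idea of normalizing identifiers before lookup.

```python
import unicodedata


class SymbolTable:
    """Toy symbol table (hypothetical, not GCC's) keyed on NFD names."""

    def __init__(self):
        self._symbols = {}

    @staticmethod
    def _key(name):
        # Canonically equivalent spellings map to the same NFD string.
        return unicodedata.normalize("NFD", name)

    def declare(self, name, value):
        self._symbols[self._key(name)] = value

    def lookup(self, name):
        return self._symbols[self._key(name)]


table = SymbolTable()
table.declare("\u015b", 123)       # declared with the precomposed spelling
result = table.lookup("s\u0301")   # looked up with the decomposed spelling
print(result)
```

With such a scheme the reproducer above would resolve both spellings to the same variable, which is the behavior the report asks for.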


For comparison, Python, at least, normalizes identifiers:
$ python -c 'ś = 123; print(ś)'
123
$ 
Python converts identifiers to NFKC while parsing: sequences with a precomposed
equivalent are composed, and the rest keep their combining characters:
$ python -c $'á = 1\nb́ = 1\nimport unicodedata\nfor k, v in dict(globals()).items():\n if v == 1:\n  print(k, [unicodedata.name(c) for c in k])'
á ['LATIN SMALL LETTER A WITH ACUTE']
b́ ['LATIN SMALL LETTER B', 'COMBINING ACUTE ACCENT']
$


* [Bug c/94990] NFC / NFD in identifiers
  2020-05-07 22:54 [Bug c/94990] New: NFC / NFD in identifiers Arfrever.FTA at GMail dot Com
@ 2020-05-07 23:08 ` joseph at codesourcery dot com
  0 siblings, 0 replies; 2+ messages in thread
From: joseph at codesourcery dot com @ 2020-05-07 23:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94990

--- Comment #1 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---
Note that ISO C references ISO 10646, not Unicode, so normalization forms 
are not part of the C notion of identifier characters and differently 
normalized forms are different identifiers as far as C is concerned.

The reason the -Wnormalized= options prefer NFC and don't have an option 
-Wnormalized=nfd is that many characters were only valid in C99 in the 
precomposed forms (C11 added more combining characters to the set allowed 
in identifiers).  Any Unicode character sequence can of course be 
converted to an NFC form if desired; some characters there may use 
precomposed forms and some may use combining characters.
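
The mixed outcome described here can be seen in a short Python sketch using the standard `unicodedata` module. The sample identifier is made up for illustration: its first letter has a precomposed form and its second does not.

```python
import unicodedata

# Illustrative identifier: "s" + acute accent, then "b" + acute accent.
decomposed = "s\u0301b\u0301"

nfc = unicodedata.normalize("NFC", decomposed)
code_points = [f"U+{ord(ch):04X}" for ch in nfc]

# "s" + acute composes to U+015B; "b" + acute has no precomposed form,
# so the combining accent U+0301 remains in the NFC result.
print(code_points)  # ['U+015B', 'U+0062', 'U+0301']
```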

If you wish to use NFD in your code, you should probably set your editor 
to generate NFD source files and compile with -Wno-normalized.
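
If the editor cannot emit NFD directly, a small conversion step could do the same job. The following Python sketch is only one possible approach; the `to_nfd` helper and the sample source string are hypothetical. Note that it normalizes the whole text, so string literals and comments are rewritten as well as identifiers.

```python
import unicodedata


def to_nfd(source_text):
    """Return source_text with every character sequence in NFD."""
    return unicodedata.normalize("NFD", source_text)


# Sample C source mixing the precomposed and decomposed spellings.
c_source = "int \u015b = 123; int main(void) { return s\u0301; }\n"
converted = to_nfd(c_source)

# After conversion both uses of the identifier share one spelling.
consistent = converted.count("s\u0301") == 2 and "\u015b" not in converted
print(consistent)
```

The converted text would then be written back as UTF-8 and compiled with -Wno-normalized, as suggested above.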

(A separate issue is that the Unicode data used in GCC for -Wnormalized= 
was last updated in 2013 and needs updating to a newer version of Unicode.  
Since the update I did in 2013 introduced automated generation of the 
relevant code from Unicode data, such an update to use newer Unicode data 
should be straightforward.)

