public inbox for gcc-bugs@sourceware.org
* [Bug c/94990] New: NFC / NFD in identifiers
From: Arfrever.FTA at GMail dot Com @ 2020-05-07 22:54 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94990
Bug ID: 94990
Summary: NFC / NFD in identifiers
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: Arfrever.FTA at GMail dot Com
Target Milestone: ---
GCC 10 introduced support for non-ASCII characters in identifiers.
However, this support is incomplete with respect to NFC / NFD [1]:
$ gcc -o /tmp/test -x c - <<<"int ś = 123; int main() {return ś;}"
<stdin>: In function ‘main’:
<stdin>:1:34: warning: `s\U00000301' is not in NFC [-Wnormalized=]
<stdin>:1:34: error: ‘ś’ undeclared (first use in this function)
<stdin>:1:34: note: each undeclared identifier is reported only once for each function it appears in
$
(The first occurrence uses [LATIN SMALL LETTER S WITH ACUTE]; the second uses
[LATIN SMALL LETTER S, COMBINING ACUTE ACCENT].)
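
As an illustration (not part of the original report), the two spellings can be compared with Python's standard `unicodedata` module. This is a minimal sketch; the variable names are chosen here for the example only.

```python
import unicodedata

# The two spellings from the report, written with explicit escapes.
precomposed = "\u015b"   # LATIN SMALL LETTER S WITH ACUTE
decomposed = "s\u0301"   # LATIN SMALL LETTER S + COMBINING ACUTE ACCENT

# As raw code point sequences they differ, which is why a compiler that
# compares identifiers without normalizing sees two different names.
print(precomposed == decomposed)                                  # False

# After normalization to a common form they compare equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)    # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)    # True
```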
Since many base-plus-combining sequences have no precomposed equivalent [2][3],
it would make more sense for GCC to perform NFD normalization [4] of all
identifiers.
[1] https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
[2] https://en.wikipedia.org/wiki/Precomposed_character
[3] https://en.wikipedia.org/wiki/List_of_precomposed_Latin_characters_in_Unicode
[4] https://en.wikipedia.org/wiki/Combining_character
For comparison, Python at least performs some normalization of
identifiers:
$ python -c 'ś = 123; print(ś)'
123
$
Python normalizes identifiers to NFKC while parsing, so a precomposed
character is used where one exists and combining characters are kept otherwise:
$ python -c $'á = 1\nb́ = 1\nimport unicodedata\nfor k, v in dict(globals()).items():\n  if v == 1:\n    print(k, [unicodedata.name(c) for c in k])'
á ['LATIN SMALL LETTER A WITH ACUTE']
b́ ['LATIN SMALL LETTER B', 'COMBINING ACUTE ACCENT']
$
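
The same behaviour can be checked without relying on how the shell or terminal encodes the characters. The sketch below is an illustration only: it builds a small source string with explicit escapes, declares the name in precomposed form, uses it in decomposed form, and inspects the resulting namespace. The names `source`, `namespace` and `result` are made up for the example.

```python
import unicodedata

# Declaration uses the precomposed character; the use site spells the
# same name as "s" followed by COMBINING ACUTE ACCENT.
source = "\u015b = 123\nresult = s\u0301\n"

namespace = {}
exec(compile(source, "<example>", "exec"), namespace)

# Both spellings resolved to the same variable.
print(namespace["result"])                       # 123

# The stored name is the normalized (precomposed) spelling.
names = [k for k in namespace if not k.startswith("__") and k != "result"]
print([unicodedata.name(c) for c in names[0]])   # ['LATIN SMALL LETTER S WITH ACUTE']
```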
* [Bug c/94990] NFC / NFD in identifiers
From: joseph at codesourcery dot com @ 2020-05-07 23:08 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94990
--- Comment #1 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---
Note that ISO C references ISO 10646, not Unicode, so normalization forms
are not part of the C notion of identifier characters and differently
normalized forms are different identifiers as far as C is concerned.
The reason the -Wnormalized= options prefer NFC and don't have an option
-Wnormalized=nfd is that many characters were only valid in C99 in the
precomposed forms (C11 added more combining characters to the set allowed
in identifiers). Any Unicode character sequence can of course be
converted to an NFC form if desired; some characters there may use
precomposed forms and some may use combining characters.
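
To illustrate the last point (this example is not from the original comment), Python's `unicodedata` module can convert a fully decomposed string to NFC. In the result, "a" with acute becomes a single precomposed character, while "b" with acute keeps its combining accent because no precomposed form exists. `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata

# Fully decomposed input: a + COMBINING ACUTE, b + COMBINING ACUTE.
text = "a\u0301b\u0301"
nfc = unicodedata.normalize("NFC", text)

# List the code points of the NFC result.
for ch in nfc:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+00E1 LATIN SMALL LETTER A WITH ACUTE
# U+0062 LATIN SMALL LETTER B
# U+0301 COMBINING ACUTE ACCENT

# The mixed result is valid NFC even though it contains a combining character.
print(unicodedata.is_normalized("NFC", nfc))     # True
```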
If you wish to use NFD in your code, you should probably set your editor
to generate NFD source files and compile with -Wno-normalized.
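
Where an editor cannot be configured this way, one possible alternative is to convert the source text before compiling. The following is only a sketch under stated assumptions: `to_nfd` is a hypothetical helper, not a GCC or editor feature, and it normalizes the whole text, so string literals and comments are rewritten along with identifiers.

```python
import unicodedata

def to_nfd(source_text: str) -> str:
    """Return source_text with every character in NFD (hypothetical helper).

    Caution: this also changes string literals and comments, not only
    identifiers, which may or may not be acceptable for a given code base.
    """
    return unicodedata.normalize("NFD", source_text)

# Both occurrences of the identifier are written in precomposed form here.
c_source = "int \u015b = 123; int main(void) { return \u015b; }\n"
converted = to_nfd(c_source)

# After conversion both occurrences use the decomposed spelling.
print(converted.count("s\u0301"))    # 2
print("\u015b" in converted)         # False
```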
(A separate issue is that the Unicode data used in GCC for -Wnormalized=
was last updated in 2013 and needs updating to a newer version of Unicode.
Since the update I did in 2013 introduced automated generation of the
relevant code from Unicode data, such an update to use newer Unicode data
should be straightforward.)