From: Joseph Myers <joseph@codesourcery.com>
To: Lewis Hyatt <lhyatt@gmail.com>
Cc: <gcc-patches@gcc.gnu.org>
Subject: Re: Patch to support extended characters in C/C++ identifiers
Date: Tue, 10 Sep 2019 23:47:00 -0000
Message-ID: <alpine.DEB.2.21.1909102334390.25537@digraph.polyomino.org.uk>
In-Reply-To: <20190812220121.GA9251@ldh.local>
On Mon, 12 Aug 2019, Lewis Hyatt wrote:
> Hello-
>
> The attached patch for libcpp adds support for extended characters (e.g. UTF-8)
> in identifiers. A preliminary version of the patch was posted on PR c/67224 as
> Comment 26 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224#c26) and
> discussed with Joseph Myers. Here is an updated patch incorporating all
> feedback received so far. I hope it is suitable now; please let me know if I
> can do anything else to make it ready for you to apply. I am happy to work on
> it further, whatever is needed. I can't easily test on anything other than
> x86_64-linux though. I did bootstrap all languages and run all tests on that
> platform, everything was good.
>
> The (relatively short) changes to libcpp are included inline here. I attached
> the test cases as a gzipped patch to avoid any problems with the encoding (the
> test cases contain some invalid UTF-8 and also other encodings such as latin-1
> as part of the testing).
>
> Thanks for taking a look at it!

Thanks, I think this is OK with a few updates to the documentation.
Specifically:

cpp.texi says:

    In the 1999 C standard, identifiers may contain letters which are not
    part of the ``basic source character set'', at the implementation's
    discretion (such as accented Latin letters, Greek letters, or Chinese
    ideograms). This may be done with an extended character set, or the
    @samp{\u} and @samp{\U} escape sequences. GCC only accepts such
    characters in the @samp{\u} and @samp{\U} forms.

and it's no longer accurate to say that only the \u and \U forms are
accepted.

cpp.texi, section "Implementation-defined behavior", discusses
implementation-defined characters in identifiers. It should say that GCC
accepts exactly those multibyte characters that correspond to UCNs for
characters permitted by the chosen version of the C or C++ standard.

cppopts.texi documents -fextended-identifiers as "Accept universal
character names in identifiers.". That needs to say the characters are
also accepted directly in the identifiers.
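
To make the documentation point concrete, here is a minimal sketch in C.
It assumes a compiler with this patch applied (or any other compiler that
accepts UTF-8 directly in identifiers) and a source file saved as UTF-8.
The function names are made up for illustration. The two spellings should
denote the same identifier, because the UTF-8 character corresponds to the
UCN \u00e9, which the C standard permits in identifiers.

```c
/* Assumption: compiled by a compiler that accepts extended characters
   in identifiers, with this file encoded as UTF-8.  The names below are
   hypothetical. */

/* Declared with the UCN spelling of U+00E9. */
int caf\u00e9_count(void)
{
    return 3;
}

/* Called with the same character written directly in UTF-8; this is
   expected to resolve to the function defined above. */
int total_cups(void)
{
    return café_count() + 1;
}
```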

I should also note that a few of the tests added by the patch are testing
things that are properties of the implementation that might arguably be
bugs, rather than standard features, and so perhaps should at least have
comments added saying they are testing those implementation properties.

gcc/testsuite/gcc.dg/cpp/ucnid-7-utf8.c, testing invalid UTF-8, is relying
on GCC, in its default -finput-charset=utf-8 mode, not actually checking
that the input is valid UTF-8. It's clear that avoiding such a check
makes sense in strings and comments, both as a matter of efficiency and
because it's likely to do the right thing for a lot of user programs that
use non-UTF-8 character sets in those places and just need the bytes in
the strings to be passed through to the compiler output (rather than
requiring users to specify -finput-charset and -fexec-charset for those
programs). Outside those contexts it's less obvious what's the best way
to behave (this sort of test, where the stray non-UTF-8 bytes are in text
that disappears as a result of macro expansion, is certainly a corner
case).
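
For illustration only, a well-formedness check of the kind GCC does not
perform on its input could look roughly like the following C sketch. The
function name utf8_is_valid is hypothetical; this is not code from libcpp,
just a way to show what "checking that the input is valid UTF-8" would
involve for each byte sequence.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of GCC: returns true if the n bytes at s
   are well-formed UTF-8 (no stray or missing continuation bytes, no
   overlong forms, no surrogates, nothing above U+10FFFF). */
bool utf8_is_valid(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char c = s[i];
        size_t len;
        unsigned long cp, min;

        if (c < 0x80) {                 /* ASCII byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            len = 2; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            len = 3; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            len = 4; cp = c & 0x07; min = 0x10000;
        } else {
            return false;               /* stray continuation or bad lead byte */
        }

        if (n - i < len)
            return false;               /* sequence truncated at end of input */

        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;           /* expected a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;               /* overlong, out of range, or surrogate */

        i += len;
    }
    return true;
}
```

A Latin-1 byte such as 0xE9 on its own is rejected by a check like this,
which is the kind of input that is currently passed through unchanged in
strings and comments.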

gcc/testsuite/g++.dg/cpp/ucnid-2-utf8.C and
gcc/testsuite/g++.dg/cpp/ucnid-3-utf8.C are testing double stringizing in
C++, where strictly the results they expect show that GCC does not conform
to the C++ standard requirement to convert all extended characters to UCNs
(because C++ does not have the special C rule making it
implementation-defined whether the \ of a UCN in a string literal is
doubled when stringizing).

--
Joseph S. Myers
joseph@codesourcery.com