public inbox for gcc-bugs@sourceware.org
* [Bug preprocessor/94168] New: [10 Regression] error: extended character § is not valid in an identifier since r10-3309-g7d112d6670a0e0e6
@ 2020-03-13 13:44 marxin at gcc dot gnu.org
  2020-03-13 13:45 ` [Bug preprocessor/94168] " marxin at gcc dot gnu.org
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: marxin at gcc dot gnu.org @ 2020-03-13 13:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94168

            Bug ID: 94168
           Summary: [10 Regression] error: extended character § is not
                    valid in an identifier since
                    r10-3309-g7d112d6670a0e0e6
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Keywords: rejects-valid
          Severity: normal
          Priority: P3
         Component: preprocessor
          Assignee: unassigned at gcc dot gnu.org
          Reporter: marxin at gcc dot gnu.org
  Target Milestone: ---

I see the following regression:

$ cat red.cc
#ifdef WINDOWS
§
#endif

$ g++-9 red.cc -c
$ g++ red.cc -c
red.cc:2:1: error: extended character § is not valid in an identifier
    2 | ��
      | ^


* [Bug preprocessor/94168] [10 Regression] error: extended character § is not valid in an identifier since r10-3309-g7d112d6670a0e0e6
  2020-03-13 13:44 [Bug preprocessor/94168] New: [10 Regression] error: extended character § is not valid in an identifier since r10-3309-g7d112d6670a0e0e6 marxin at gcc dot gnu.org
@ 2020-03-13 13:45 ` marxin at gcc dot gnu.org
  2020-03-13 16:44 ` lhyatt at gmail dot com
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: marxin at gcc dot gnu.org @ 2020-03-13 13:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94168

Martin Liška <marxin at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2020-03-13
   Target Milestone|---                         |10.0
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
      Known to fail|                            |10.0
      Known to work|                            |9.3.0
                 CC|                            |lhyatt at gmail dot com

--- Comment #1 from Martin Liška <marxin at gcc dot gnu.org> ---
It's reduced from the hfst-ospell package.


* [Bug preprocessor/94168] [10 Regression] error: extended character § is not valid in an identifier since r10-3309-g7d112d6670a0e0e6
  2020-03-13 13:44 [Bug preprocessor/94168] New: [10 Regression] error: extended character § is not valid in an identifier since r10-3309-g7d112d6670a0e0e6 marxin at gcc dot gnu.org
  2020-03-13 13:45 ` [Bug preprocessor/94168] " marxin at gcc dot gnu.org
@ 2020-03-13 16:44 ` lhyatt at gmail dot com
  2020-03-13 18:16 ` joseph at codesourcery dot com
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: lhyatt at gmail dot com @ 2020-03-13 16:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94168

--- Comment #2 from Lewis Hyatt <lhyatt at gmail dot com> ---
(In reply to Martin Liška from comment #0)
> I see the following regression:
> 
> $ cat red.cc
> #ifdef WINDOWS
> §
> #endif
> 
> $ g++-9 red.cc -c
> $ g++ red.cc -c
> red.cc:2:1: error: extended character § is not valid in an identifier
>     2 | ��
>       | ^

The corrupted colorization in the diagnostics is a bug that I submitted a patch
for already. David prefers that fix to wait for GCC 11.

Regarding the behavior, if you replace the § with the equivalent UCN:

#ifdef WINDOWS
\u00A7
#endif

then you get the same error with GCC versions that predate my patch too. My
patch causes the UTF-8 to be lexed as part of an identifier rather than as a
stray token, hence it ends up with the same error.
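
To make the equivalence concrete: § is encoded in UTF-8 as the two bytes
0xC2 0xA7, which decode to code point U+00A7, the same value the UCN \u00A7
names. Below is a minimal sketch in C of that decoding step. The helper
decode_utf8_2byte is hypothetical and written only for illustration; it is
not a libcpp function, and it handles two-byte sequences only.

```c
/* Hypothetical helper, not part of libcpp: decode a two-byte UTF-8
   sequence (110xxxxx 10xxxxxx) into its code point.  Returns -1 if the
   bytes are not a well-formed two-byte sequence.  */
static long
decode_utf8_2byte (const unsigned char *s)
{
  /* Check the lead-byte and continuation-byte bit patterns.  */
  if ((s[0] & 0xE0) != 0xC0 || (s[1] & 0xC0) != 0x80)
    return -1;

  /* Five payload bits from the lead byte, six from the continuation.  */
  long cp = ((long) (s[0] & 0x1F) << 6) | (long) (s[1] & 0x3F);

  /* Reject overlong encodings of code points below U+0080.  */
  return cp < 0x80 ? -1 : cp;
}
```

Passing the bytes { 0xC2, 0xA7 } to this helper yields 0xA7, which is why the
UTF-8 spelling and the UCN spelling can end up on the same diagnostic path.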

As it happens, if you compile your test in C mode, it will succeed, because the
UTF-8 logic for C mode treats the invalid character as a stray token rather
than part of an identifier, and that token is then discarded along with the
rest of the skipped block. In C++, it is a syntax error by design, so it
triggers this error. With UCN syntax, it is an error in both C and C++, so it
fails either way.

Looking at the relevant code in charset.c (_cpp_valid_ucn and _cpp_valid_utf8)
... I think it is probably just a matter of checking pfile->state.skipping in
more places. I wrote _cpp_valid_utf8 to preserve the analogous behavior of the
existing _cpp_valid_ucn. It seems that _cpp_valid_ucn checks
pfile->state.skipping in some cases, such as $ in identifiers, but not in
others, such as the invalid character case.
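
The idea of gating a diagnostic on the skipping state can be sketched with a
toy model. Everything below is hypothetical (the struct, its fields, and the
function name); it is not libcpp code, and it only mirrors the shape of the
check discussed here, assuming the error is specific to C++ mode.

```c
#include <stdbool.h>

/* Toy stand-in for the reader state.  The real cpp_reader in libcpp has
   many more fields; all names here are hypothetical.  */
struct toy_reader
{
  bool skipping;   /* inside a conditional group that is being skipped */
  bool cplusplus;  /* lexing C++ rather than C */
  int errors;      /* number of diagnostics issued so far */
};

/* Called when the lexer meets an extended character that is not
   permitted in an identifier.  If gate_on_skipping is true, the
   diagnostic is suppressed inside skipped groups; otherwise it is
   issued regardless.  Returns true if an error was issued.  */
static bool
toy_report_invalid_char (struct toy_reader *r, bool gate_on_skipping)
{
  if (gate_on_skipping && r->skipping)
    return false;

  /* In this model, C treats the character as a stray token instead.  */
  if (!r->cplusplus)
    return false;

  r->errors++;
  return true;
}
```

With skipping and cplusplus both set, the gated call issues nothing while the
ungated call counts one error, which is the difference between the two
behaviors under discussion.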

I am happy to submit a patch to fix this, but I am not sure in exactly which
cases it is correct to skip the error. For instance, this code can be made to
trigger a diagnostic too (a warning here), in C90 mode:

$ cat t.c
#ifdef WINDOWS
int \u00E4;
#endif

$ gcc-8 -c t.c -std=c90 -fextended-identifiers
t.c:2:5: warning: universal character names are only valid in C++ and C99
 int \u00E4;
     ^

That is because _cpp_valid_ucn doesn't check pfile->state.skipping for this
case either. I think, especially in C++, there are probably at least some cases
where an error should be triggered even in conditionally compiled code, but I
don't know enough offhand to say for sure.

FWIW, the patch below fixes the present issue, but it doesn't tackle the
equivalent UCN behavior or fix the related issues... I just need some guidance
as to the expected behavior to do that.

-Lewis

diff --git a/libcpp/charset.c b/libcpp/charset.c
index d9281c5fb97..129f234349e 100644
--- a/libcpp/charset.c
+++ b/libcpp/charset.c
@@ -1260,7 +1260,7 @@ _cpp_valid_utf8 (cpp_reader *pfile,
             way).  In C, this byte rather becomes grammatically a separate
             token.  */

-         if (CPP_OPTION (pfile, cplusplus))
+         if (!pfile->state.skipping && CPP_OPTION (pfile, cplusplus))
            cpp_error (pfile, CPP_DL_ERROR,
                       "extended character %.*s is not valid in an identifier",
                       (int) (*pstr - base), base);
@@ -1273,7 +1273,7 @@ _cpp_valid_utf8 (cpp_reader *pfile,
          break;

        case 2:
-         if (identifier_pos == 1)
+         if (!pfile->state.skipping && identifier_pos == 1)
            {
              /* This is treated the same way in C++ or C99 -- lexed as an
                 identifier which is then invalid because an identifier is


* [Bug preprocessor/94168] [10 Regression] error: extended character § is not valid in an identifier since r10-3309-g7d112d6670a0e0e6
  2020-03-13 13:44 [Bug preprocessor/94168] New: [10 Regression] error: extended character § is not valid in an identifier since r10-3309-g7d112d6670a0e0e6 marxin at gcc dot gnu.org
  2020-03-13 13:45 ` [Bug preprocessor/94168] " marxin at gcc dot gnu.org
  2020-03-13 16:44 ` lhyatt at gmail dot com
@ 2020-03-13 18:16 ` joseph at codesourcery dot com
  2020-03-16  8:06 ` marxin at gcc dot gnu.org
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: joseph at codesourcery dot com @ 2020-03-13 18:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94168

--- Comment #3 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---
The reasoning for rejecting this (for UCNs in both C and C++, for other 
characters in C++ because of the C++ rule that such characters get 
converted to UCNs) is that the constraints on permitted characters in 
identifiers appear to me to apply to any pp-token that matches the syntax 
productions for "identifier", whether or not that pp-token ends up getting 
converted to a token.  This is similar to

#if 0
"multiline
string"
#endif

being disallowed as well.
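
As an illustration of what "permitted characters in identifiers" means in
practice, here is a rough sketch of a range check. The table is a partial
excerpt (Latin-1 block only) of the ranges listed in C11 Annex D.1,
transcribed as an assumption for illustration; the names are hypothetical,
and the authoritative data is the standard itself or libcpp's own tables.

```c
#include <stdbool.h>
#include <stddef.h>

struct cp_range
{
  unsigned long lo, hi;
};

/* Assumption: partial excerpt of C11 Annex D.1, Latin-1 block only.
   Not the table GCC uses; check the standard before relying on it.  */
static const struct cp_range latin1_ident_ranges[] = {
  { 0x00A8, 0x00A8 }, { 0x00AA, 0x00AA }, { 0x00AD, 0x00AD },
  { 0x00AF, 0x00AF }, { 0x00B2, 0x00B5 }, { 0x00B7, 0x00BA },
  { 0x00BC, 0x00BE }, { 0x00C0, 0x00D6 }, { 0x00D8, 0x00F6 },
  { 0x00F8, 0x00FF },
};

/* Return true if CP falls in one of the excerpted ranges.  */
static bool
latin1_ident_char_p (unsigned long cp)
{
  size_t n = sizeof latin1_ident_ranges / sizeof latin1_ident_ranges[0];
  for (size_t i = 0; i < n; i++)
    if (cp >= latin1_ident_ranges[i].lo && cp <= latin1_ident_ranges[i].hi)
      return true;
  return false;
}
```

Under this excerpt, U+00A7 (§) is not permitted, while U+00E4 (the character
in the t.c example earlier in the thread) is.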


* [Bug preprocessor/94168] [10 Regression] error: extended character § is not valid in an identifier since r10-3309-g7d112d6670a0e0e6
  2020-03-13 13:44 [Bug preprocessor/94168] New: [10 Regression] error: extended character § is not valid in an identifier since r10-3309-g7d112d6670a0e0e6 marxin at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2020-03-13 18:16 ` joseph at codesourcery dot com
@ 2020-03-16  8:06 ` marxin at gcc dot gnu.org
  2020-03-16 10:29 ` marxin at gcc dot gnu.org
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: marxin at gcc dot gnu.org @ 2020-03-16  8:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94168

--- Comment #4 from Martin Liška <marxin at gcc dot gnu.org> ---
Thank you for the analysis and suggested patch.
The original source code looks like this:

#ifdef WINDOWS
static std::string wide_string_to_string(const std::wstring & wstr)
{
  int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0],
(int)§wstr.size(), NULL, 0, NULL, NULL);
  std::string str( size_needed, 0 );
  WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), &str[0],
size_needed, NULL, NULL);
  return str;
}
#endif


* [Bug preprocessor/94168] [10 Regression] error: extended character § is not valid in an identifier since r10-3309-g7d112d6670a0e0e6
  2020-03-13 13:44 [Bug preprocessor/94168] New: [10 Regression] error: extended character § is not valid in an identifier since r10-3309-g7d112d6670a0e0e6 marxin at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2020-03-16  8:06 ` marxin at gcc dot gnu.org
@ 2020-03-16 10:29 ` marxin at gcc dot gnu.org
  2020-04-01  7:52 ` rguenth at gcc dot gnu.org
  2020-04-01  7:56 ` marxin at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: marxin at gcc dot gnu.org @ 2020-03-16 10:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94168

--- Comment #5 from Martin Liška <marxin at gcc dot gnu.org> ---
I reported that upstream as well:
https://github.com/hfst/hfst-ospell/issues/49


* [Bug preprocessor/94168] [10 Regression] error: extended character § is not valid in an identifier since r10-3309-g7d112d6670a0e0e6
  2020-03-13 13:44 [Bug preprocessor/94168] New: [10 Regression] error: extended character § is not valid in an identifier since r10-3309-g7d112d6670a0e0e6 marxin at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2020-03-16 10:29 ` marxin at gcc dot gnu.org
@ 2020-04-01  7:52 ` rguenth at gcc dot gnu.org
  2020-04-01  7:56 ` marxin at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-04-01  7:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94168

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |INVALID

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
I read Joseph's comment as saying that GCC is correct to reject the code, and
Martin's quoting of the original testcase shows it's likely a genuine bug in
the program (a typo of some sort).

Thus closing as INVALID.  Please reopen if any of the above is wrong.


* [Bug preprocessor/94168] [10 Regression] error: extended character § is not valid in an identifier since r10-3309-g7d112d6670a0e0e6
  2020-03-13 13:44 [Bug preprocessor/94168] New: [10 Regression] error: extended character § is not valid in an identifier since r10-3309-g7d112d6670a0e0e6 marxin at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2020-04-01  7:52 ` rguenth at gcc dot gnu.org
@ 2020-04-01  7:56 ` marxin at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: marxin at gcc dot gnu.org @ 2020-04-01  7:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94168

--- Comment #7 from Martin Liška <marxin at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #6)
> I read Joseph's comment as saying that GCC is correct to reject the code, and
> Martin's quoting of the original testcase shows it's likely a genuine bug in
> the program (a typo of some sort).

Yes, a real bug that has already been fixed.


