[Bug other/102099] New: Diagnostics do not consider the user's locale when printing source lines

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug other/102099] New: Diagnostics do not consider the user's locale when printing source lines
@ 2021-08-27 15:13 lhyatt at gcc dot gnu.org
  2021-08-27 15:13 ` [Bug other/102099] " lhyatt at gcc dot gnu.org
  0 siblings, 1 reply; 2+ messages in thread
From: lhyatt at gcc dot gnu.org @ 2021-08-27 15:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102099

            Bug ID: 102099
           Summary: Diagnostics do not consider the user's locale when
                    printing source lines
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
          Assignee: unassigned at gcc dot gnu.org
          Reporter: lhyatt at gcc dot gnu.org
  Target Milestone: ---

If the user has a non-UTF-8 locale configured, they will currently still
receive UTF-8 output from GCC's stderr under some conditions:

-If a filename for which diagnostics are issued contains extended characters
-If a source line for which diagnostics are issued contains extended
characters.

When a source line contains identifier names with extended characters, the
C/C++ front ends take care to convert them to the user's locale, by always
passing them to identifier_to_locale() before output. However, this only
affects the diagnostics messages generated by the front end, and does not
affect the source line itself which is printed separately.

Example:

$ cat á.cpp
int á = 0;
int á = 1;

#in UTF-8 locale, looks fine
$ gcc -c á.cpp
á.cpp:2:5: error: redefinition of ‘int á’
    2 | int á = 1;
      |     ^
á.cpp:1:5: note: ‘int á’ previously defined here
    1 | int á = 0;
      |     ^

#in C locale, only partially converted to UCNs
$ LC_ALL=C gcc -c á.cpp
á.cpp:2:5: error: redefinition of 'int \U000000e1'
    2 | int á = 1;
      |     ^
á.cpp:1:5: note: 'int \U000000e1' previously defined here
    1 | int á = 0;
      |     ^

The attached patch arranges for the output to be rather:

#corrected by this patch
$  LC_ALL=C gcc -c á.cpp
\U000000e1.cpp:2:5: error: redefinition of 'int \U000000e1'
    2 | int \U000000e1 = 1;
      |     ^
\U000000e1.cpp:1:5: note: 'int \U000000e1' previously defined here
    1 | int \U000000e1 = 0;
      |     ^

In the above example I showed C locale, where the extended characters need to
be escaped, but the patch would also handle e.g. latin1 locale, where it would
output as expected, using iconv to convert to the output charset.

The patch is pretty complete, and bootstraps all languages with no regression.
However there are a couple potential issues with it that may need to be
discussed before it's ready to be used, so I have held off submitting to
gcc-patches for now. The two main points of concern are:

1. Diagnostics recently acquired a lot of infrastructure to know the correct
display width of extended characters, so that things like carets and label
lines show up at the correct place. This infrastructure however is not
currently able to handle locale dependence of the display width. Changing that
is rather complicated, because determining that the display width of "á" is
actually 10 columns instead of 1 (in case of UCN escape), requires attempting
to convert the character to the user's locale (perhaps with iconv), and
determining if it can be displayed or requires an escape. So the process of
determining the display width becomes an expensive operation that should be
optimized and performed once for the line, not something that can be computed
on the fly as is done now. This breaks the assumptions in the design of the
current approach and so would require it to be redone. That is certainly doable
but it seems unfortunate to make that process much more complicated, for what's
not probably a commonly needed use case. I suspect, that in many cases, users
with a C locale configured actually still see UTF-8 output fine in their
terminal anyway... The output with UCN escapes already looks bad, so perhaps
having misaligned labels and carets is not a big deal and it's fine as it is.

2. The testsuite always runs with LC_ALL=C currently. Therefore, after the
change in this patch, a test is no longer able to test for UTF-8 output in
diagnostics, it will be UCN escaped instead. There is one such test currently
(gcc.dg/diagnostic-input-charset-1.c). It doesn't seem suitable to change that
test to look for UCN escapes, because the purpose of that test is to confirm
that correct UTF-8 is generated when an input file is in another charset. So I
instead added a new option -fdiagnostics-format=force-utf8. This is the same as
-fdiagnostics-format=text except it disables the conversion to the user's
locale and restores the previous behavior. That seemed more simple than adding
ability to change the locale in the testsuite, plus I thought users may want
this option for themselves for some reason, if say they do not have access to a
UTF-8 locale somehow but their terminal still displays it fine. So that much
seems fine, however there is a wrinkle here that I am not sure how to fix. The
user probably expects that this new option will cause all diagnostics output to
be UTF-8 regardless of locale. But some of the output is not generated by the
diagnostics infrastructure at all. For example, localized strings are converted
by libintl and always come out in the current locale. I am not sure how
hard/easy it is to change this. But as an example, suppose you configure a
latin1 German locale and compile the above test case:

á.cpp:2:5: Fehler: Redefinition von »int á«
    2 | int á = 1;
      |     ^
á.cpp:1:5: Anmerkung: »int á« wurde bereits hier definiert
    1 | int á = 0;
      |     ^

In the above output, the quote characters "»«" are generated by the
internationalization library and are already converted to latin1 when
diagnostics infrastructure sees them. So currently, with the new
-fdiagnostics-format=force-utf8 option, the output is not what's expected...
the á is encoded in UTF-8, but the quote stays latin-1. I am not sure what's
the best way to address this. One option would be to make this new option
undocumented, and reserve it just for use by the testsuite, for which this
would not be an issue. Otherwise need to find a way to either disable the
conversion to the locale from the translation library, or translate it back
when this option is in effect.

^ permalink raw reply	[flat|nested] 2+ messages in thread

* [Bug other/102099] Diagnostics do not consider the user's locale when printing source lines
  2021-08-27 15:13 [Bug other/102099] New: Diagnostics do not consider the user's locale when printing source lines lhyatt at gcc dot gnu.org
@ 2021-08-27 15:13 ` lhyatt at gcc dot gnu.org
  0 siblings, 0 replies; 2+ messages in thread
From: lhyatt at gcc dot gnu.org @ 2021-08-27 15:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102099

--- Comment #1 from Lewis Hyatt <lhyatt at gcc dot gnu.org> ---
Created attachment 51365
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51365&action=edit
Tested patch

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2021-08-27 15:13 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-27 15:13 [Bug other/102099] New: Diagnostics do not consider the user's locale when printing source lines lhyatt at gcc dot gnu.org
2021-08-27 15:13 ` [Bug other/102099] " lhyatt at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).