From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
 id D69653858C60; Fri, 27 Aug 2021 15:13:12 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org D69653858C60
From: "lhyatt at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug other/102099] New: Diagnostics do not consider the user's
 locale when printing source lines
Date: Fri, 27 Aug 2021 15:13:12 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: other
X-Bugzilla-Version: unknown
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: normal
X-Bugzilla-Who: lhyatt at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status
 bug_severity priority component assigned_to reporter target_milestone
Message-ID: <bug-102099-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list <gcc-bugs.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Fri, 27 Aug 2021 15:13:12 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102099

            Bug ID: 102099
           Summary: Diagnostics do not consider the user's locale when
                    printing source lines
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
          Assignee: unassigned at gcc dot gnu.org
          Reporter: lhyatt at gcc dot gnu.org
  Target Milestone: ---

If the user has a non-UTF-8 locale configured, GCC currently still writes
UTF-8 to stderr under some conditions:

- If a filename for which diagnostics are issued contains extended characters.
- If a source line for which diagnostics are issued contains extended
  characters.

When a source line contains identifier names with extended characters, the
C/C++ front ends take care to convert them to the user's locale, by always
passing them to identifier_to_locale() before output. However, this only
affects the diagnostic messages generated by the front end, and does not
affect the source line itself, which is printed separately.

Example:

$ cat á.cpp
int á = 0;
int á = 1;

#in UTF-8 locale, looks fine
$ gcc -c á.cpp
á.cpp:2:5: error: redefinition of ‘int á’
    2 | int á = 1;
      |     ^
á.cpp:1:5: note: ‘int á’ previously defined here
    1 | int á = 0;
      |     ^

#in C locale, only partially converted to UCNs
$ LC_ALL=C gcc -c á.cpp
á.cpp:2:5: error: redefinition of 'int \U000000e1'
    2 | int á = 1;
      |     ^
á.cpp:1:5: note: 'int \U000000e1' previously defined here
    1 | int á = 0;
      |     ^

The attached patch arranges for the output to be the following instead:

#corrected by this patch
$ LC_ALL=C gcc -c á.cpp
\U000000e1.cpp:2:5: error: redefinition of 'int \U000000e1'
    2 | int \U000000e1 = 1;
      |     ^
\U000000e1.cpp:1:5: note: 'int \U000000e1' previously defined here
    1 | int \U000000e1 = 0;
      |     ^

In the above example I showed the C locale, where the extended characters need
to be escaped, but the patch also handles e.g. a latin1 locale, where the
output appears as expected, using iconv to convert to the output charset.
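To make the C-locale case concrete, here is a minimal sketch of the escaping
step, assuming the input is UTF-8 and that every non-ASCII character must be
escaped. The function name escape_non_ascii is hypothetical and this is not
the code from the patch; a real implementation would first try to convert
each character to the locale's charset (e.g. with iconv) and escape only on
failure.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper, not GCC's implementation: replace every non-ASCII
// character of a UTF-8 string with a \UXXXXXXXX escape. Bytes that do not
// form a structurally valid sequence are copied through unchanged.
std::string escape_non_ascii(const std::string &utf8)
{
  std::string out;
  std::size_t i = 0;
  while (i < utf8.size())
    {
      unsigned char lead = static_cast<unsigned char>(utf8[i]);
      std::size_t len = 0;
      std::uint32_t cp = 0;
      if (lead < 0x80)                { len = 1; cp = lead; }
      else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
      else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
      else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }

      // Check that all continuation bytes are present and well formed.
      bool ok = len != 0 && i + len <= utf8.size();
      for (std::size_t k = 1; ok && k < len; ++k)
        {
          unsigned char c = static_cast<unsigned char>(utf8[i + k]);
          if ((c & 0xC0) != 0x80)
            ok = false;
          else
            cp = (cp << 6) | (c & 0x3F);
        }

      if (!ok)
        {
          out += utf8[i];   // malformed byte: pass through as-is
          ++i;
          continue;
        }

      if (len == 1)
        out += utf8[i];     // plain ASCII needs no escape
      else
        {
          char buf[11];     // backslash, 'U', 8 hex digits, NUL
          std::snprintf(buf, sizeof buf, "\\U%08x",
                        static_cast<unsigned>(cp));
          out += buf;
        }
      i += len;
    }
  return out;
}
```

Under these assumptions, passing the source line "int á = 1;" through
escape_non_ascii gives the form shown in the corrected output above, and the
same applies to the filename.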

The patch is pretty complete, and bootstraps all languages with no
regressions. However, there are a couple of potential issues with it that may
need to be discussed before it's ready to be used, so I have held off
submitting it to gcc-patches for now. The two main points of concern are:

1. Diagnostics recently acquired a lot of infrastructure to know the correct
display width of extended characters, so that things like carets and label
lines show up at the correct place. This infrastructure, however, is not
currently able to handle locale dependence of the display width. Changing
that is rather complicated, because determining that the display width of "á"
is actually 10 columns instead of 1 (in the case of a UCN escape) requires
attempting to convert the character to the user's locale (perhaps with iconv)
and determining whether it can be displayed or requires an escape. So
determining the display width becomes an expensive operation that should be
optimized and performed once per line, not something that can be computed on
the fly as is done now. This breaks the assumptions in the design of the
current approach and so would require it to be redone. That is certainly
doable, but it seems unfortunate to make that process much more complicated
for what is probably not a commonly needed use case. I suspect that in many
cases, users with a C locale configured actually still see UTF-8 output fine
in their terminal anyway... The output with UCN escapes already looks bad, so
perhaps having misaligned labels and carets is not a big deal and it's fine
as it is.
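The width difference described in point 1 can be illustrated with a small
sketch. It is not GCC's display-width code: the function name display_width
is hypothetical, the boolean parameter stands in for the (expensive) check of
whether the locale can represent a character, and every directly printed
character is assumed to occupy one column, which ignores wide and combining
characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: count the display columns of a UTF-8 string.
// A non-ASCII character takes 1 column when printed directly and
// 10 columns when printed as a \UXXXXXXXX escape.
std::size_t display_width(const std::string &utf8, bool escape_non_ascii)
{
  std::size_t width = 0;
  for (unsigned char byte : utf8)
    {
      if ((byte & 0xC0) == 0x80)
        continue;           // continuation byte: part of the previous char
      bool is_ascii = byte < 0x80;
      width += (!is_ascii && escape_non_ascii) ? 10 : 1;
    }
  return width;
}
```

For the line "int á = 1;" this counts 10 columns when á is printed directly
and 19 columns when it is escaped, so a caret or label placed after the
escaped character would be off by 9 columns if the width were computed
without regard to the locale.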

2. The testsuite currently always runs with LC_ALL=C. Therefore, after the
change in this patch, a test is no longer able to test for UTF-8 output in
diagnostics; it will be UCN-escaped instead. There is one such test currently
(gcc.dg/diagnostic-input-charset-1.c). It doesn't seem suitable to change
that test to look for UCN escapes, because the purpose of that test is to
confirm that correct UTF-8 is generated when an input file is in another
charset. So I instead added a new option, -fdiagnostics-format=force-utf8.
This is the same as -fdiagnostics-format=text except that it disables the
conversion to the user's locale and restores the previous behavior. That
seemed simpler than adding the ability to change the locale in the testsuite,
plus I thought users may want this option for themselves for some reason, if
say they do not have access to a UTF-8 locale somehow but their terminal
still displays it fine. So that much seems fine; however, there is a wrinkle
here that I am not sure how to fix. The user probably expects that this new
option will cause all diagnostics output to be UTF-8 regardless of locale.
But some of the output is not generated by the diagnostics infrastructure at
all. For example, localized strings are converted by libintl and always come
out in the current locale. I am not sure how hard or easy it is to change
this. But as an example, suppose you configure a latin1 German locale and
compile the above test case:

á.cpp:2:5: Fehler: Redefinition von »int á«
    2 | int á = 1;
      |     ^
á.cpp:1:5: Anmerkung: »int á« wurde bereits hier definiert
    1 | int á = 0;
      |     ^

In the above output, the quote characters "»«" are generated by the
internationalization library and are already converted to latin1 when the
diagnostics infrastructure sees them. So currently, with the new
-fdiagnostics-format=force-utf8 option, the output is not what's expected:
the á is encoded in UTF-8, but the quotes stay latin1. I am not sure what's
the best way to address this. One option would be to make this new option
undocumented and reserve it just for use by the testsuite, for which this
would not be an issue. Otherwise, we need to find a way either to disable the
conversion to the locale from the translation library, or to translate it
back when this option is in effect.
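
To show why the mixed output is a problem, here is a small sketch that checks
whether a byte string is structurally valid UTF-8. The function name
is_valid_utf8 is hypothetical and the check is deliberately simplified (it
does not reject overlong forms or surrogates). The byte values are the
standard encodings: in latin1, » is 0xBB and « is 0xAB; in UTF-8, » is
0xC2 0xBB, « is 0xC2 0xAB and á is 0xC3 0xA1.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: structural UTF-8 check only. It verifies lead bytes
// and continuation bytes, nothing more.
bool is_valid_utf8(const std::string &s)
{
  std::size_t i = 0;
  while (i < s.size())
    {
      unsigned char lead = static_cast<unsigned char>(s[i]);
      std::size_t len;
      if (lead < 0x80)                len = 1;
      else if ((lead & 0xE0) == 0xC0) len = 2;
      else if ((lead & 0xF0) == 0xE0) len = 3;
      else if ((lead & 0xF8) == 0xF0) len = 4;
      else return false;    // stray continuation byte or invalid lead

      if (i + len > s.size())
        return false;       // sequence truncated at end of string
      for (std::size_t k = 1; k < len; ++k)
        {
          unsigned char c = static_cast<unsigned char>(s[i + k]);
          if ((c & 0xC0) != 0x80)
            return false;   // expected a continuation byte
        }
      i += len;
    }
  return true;
}
```

With latin1 quotes around a UTF-8 á, i.e. the bytes
"\xBBint \xC3\xA1\xAB", this check fails, whereas the all-UTF-8 form
"\xC2\xBBint \xC3\xA1\xC2\xAB" passes. A terminal or tool expecting UTF-8
would therefore see the quotes as invalid bytes in the mixed case.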