From: "lhyatt at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug other/102099] New: Diagnostics do not consider the user's locale when printing source lines
Date: Fri, 27 Aug 2021 15:13:12 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102099

            Bug ID: 102099
           Summary: Diagnostics do not consider the user's locale when
                    printing source lines
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
          Assignee: unassigned at gcc dot gnu.org
          Reporter: lhyatt at gcc dot gnu.org
  Target Milestone: ---

If the user has a non-UTF-8 locale configured, they will currently still
receive UTF-8 output from GCC's stderr under some conditions:

- If a filename for which diagnostics are issued contains extended characters
- If a source
line for which diagnostics are issued contains extended characters.

When a source line contains identifier names with extended characters, the
C/C++ front ends take care to convert them to the user's locale, by always
passing them to identifier_to_locale() before output. However, this only
affects the diagnostic messages generated by the front end; it does not
affect the source line itself, which is printed separately.

Example:

$ cat á.cpp
int á = 0;
int á = 1;

#in UTF-8 locale, looks fine
$ gcc -c á.cpp
á.cpp:2:5: error: redefinition of ‘int á’
    2 | int á = 1;
      |     ^
á.cpp:1:5: note: ‘int á’ previously defined here
    1 | int á = 0;
      |     ^

#in C locale, only partially converted to UCNs
$ LC_ALL=C gcc -c á.cpp
á.cpp:2:5: error: redefinition of 'int \U000000e1'
    2 | int á = 1;
      |     ^
á.cpp:1:5: note: 'int \U000000e1' previously defined here
    1 | int á = 0;
      |     ^

With the attached patch, the output becomes instead:

#corrected by this patch
$ LC_ALL=C gcc -c á.cpp
\U000000e1.cpp:2:5: error: redefinition of 'int \U000000e1'
    2 | int \U000000e1 = 1;
      |     ^
\U000000e1.cpp:1:5: note: 'int \U000000e1' previously defined here
    1 | int \U000000e1 = 0;
      |     ^

The example above uses the C locale, where the extended characters need to be
escaped, but the patch also handles e.g. a latin1 locale, where the output
appears as expected, using iconv to convert to the output charset.

The patch is fairly complete, and bootstraps all languages with no
regressions. However, there are a couple of potential issues with it that may
need to be discussed before it is ready to be used, so I have held off
submitting it to gcc-patches for now. The two main points of concern are:

1. Diagnostics recently acquired a lot of infrastructure to know the correct
display width of extended characters, so that things like carets and label
lines show up at the correct place.
This infrastructure, however, is not currently able to handle locale
dependence of the display width. Changing that is rather complicated, because
determining that the display width of "á" is actually 10 columns instead of 1
(in the case of a UCN escape) requires attempting to convert the character to
the user's locale (perhaps with iconv) and determining whether it can be
displayed or requires an escape. So determining the display width becomes an
expensive operation that should be optimized and performed once for the line,
not something that can be computed on the fly as is done now. This breaks the
assumptions in the design of the current approach, which would therefore need
to be redone. That is certainly doable, but it seems unfortunate to make that
process much more complicated for what is probably not a commonly needed use
case. I suspect that in many cases, users with a C locale configured actually
still see UTF-8 output fine in their terminal anyway... The output with UCN
escapes already looks bad, so perhaps having misaligned labels and carets is
not a big deal and it is fine as it is.

2. The testsuite currently always runs with LC_ALL=C. Therefore, after the
change in this patch, a test is no longer able to test for UTF-8 output in
diagnostics; the output will be UCN-escaped instead. There is currently one
such test (gcc.dg/diagnostic-input-charset-1.c). It doesn't seem suitable to
change that test to look for UCN escapes, because the purpose of that test is
to confirm that correct UTF-8 is generated when an input file is in another
charset. So I instead added a new option, -fdiagnostics-format=force-utf8.
This is the same as -fdiagnostics-format=text except that it disables the
conversion to the user's locale and restores the previous behavior.
That seemed simpler than adding the ability to change the locale in the
testsuite, plus I thought users might want this option for themselves for
some reason, if, say, they do not have access to a UTF-8 locale but their
terminal still displays UTF-8 fine.

So that much seems fine; however, there is a wrinkle here that I am not sure
how to fix. The user probably expects that this new option will cause all
diagnostics output to be UTF-8 regardless of locale. But some of the output
is not generated by the diagnostics infrastructure at all. For example,
localized strings are converted by libintl and always come out in the current
locale. I am not sure how hard or easy it is to change this. But as an
example, suppose you configure a latin1 German locale and compile the above
test case:

á.cpp:2:5: Fehler: Redefinition von »int á«
    2 | int á = 1;
      |     ^
á.cpp:1:5: Anmerkung: »int á« wurde bereits hier definiert
    1 | int á = 0;
      |     ^

In the above output, the quote characters "»«" are generated by the
internationalization library and are already converted to latin1 by the time
the diagnostics infrastructure sees them. So currently, with the new
-fdiagnostics-format=force-utf8 option, the output is not what's expected...
the á is encoded in UTF-8, but the quotes stay in latin1. I am not sure what
the best way to address this is. One option would be to make this new option
undocumented and reserve it just for use by the testsuite, for which this
would not be an issue. Otherwise, we need to find a way either to disable the
conversion to the locale in the translation library, or to translate the
text back when this option is in effect.
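
As a rough illustration of the two concerns above (this is not the code in
the attached patch, and it is Python rather than GCC's C++), the sketch below
models the conversion with Python's codecs instead of iconv. The names
to_locale_text, caret_column and force_utf8_bytes are hypothetical, and the
column count assumes one column per output character, which ignores wide and
zero-width characters.

```python
def to_locale_text(text: str, charset: str) -> str:
    """Return text with each character the charset cannot encode replaced
    by a \\UXXXXXXXX escape, the form shown in the report."""
    pieces = []
    for ch in text:
        try:
            ch.encode(charset)  # can the output charset represent it?
            pieces.append(ch)
        except UnicodeEncodeError:
            pieces.append("\\U%08x" % ord(ch))
    return "".join(pieces)


def caret_column(line: str, index: int, charset: str) -> int:
    """1-based output column of line[index] after conversion.

    Assumption: every output character occupies exactly one column.
    """
    return len(to_locale_text(line[:index], charset)) + 1


def force_utf8_bytes(translated: str, source_text: str, locale_charset: str) -> bytes:
    """Model of concern 2: translated text arrives already encoded in the
    locale charset, while the source text is emitted as UTF-8."""
    return translated.encode(locale_charset) + source_text.encode("utf-8")


if __name__ == "__main__":
    line = "int á = 1;"
    print(to_locale_text(line, "ascii"))    # int \U000000e1 = 1;
    print(to_locale_text(line, "latin-1"))  # int á = 1;

    # Concern 1: the column of "=" depends on whether "á" had to be escaped.
    eq = line.index("=")
    print(caret_column(line, eq, "latin-1"), caret_column(line, eq, "ascii"))  # 7 16

    # Concern 2: latin1 quotes mixed with UTF-8 source text.
    mixed = force_utf8_bytes("Redefinition von »", "int á", "latin-1")
    try:
        mixed.decode("utf-8")
    except UnicodeDecodeError:
        print("mixed output is not valid UTF-8")
```

The point of the sketch is only that the width of a character cannot be known
until the conversion to the output charset has been attempted, and that
concatenating text encoded in two different charsets yields a byte stream
that is valid in neither.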