Output of `locale -a` could be in mixed encodings?

public inbox for libc-locales@sourceware.org

* Output of `locale -a` could be in mixed encodings?
From: Carlos O'Donell @ 2015-01-21  1:38 UTC (permalink / raw)
  To: GNU C Library, libc-locales

I'm going to ramble a bit here because the problem is rambling.

The output of `locale -a` can't be easily grepped.

[carlos@athas intl]$ locale -a | grep bok
Binary file (standard input) matches

The names of various localizations are written in their respective
encodings, e.g. ISO-8859-1.

Thus the Bokmal name is output in ISO-8859-1 along with an ASCII
version. This makes it difficult to use grep to parse `locale -a`
output in anything but ISO-8859-1.

e.g.
[carlos@athas intl]$ export LANG=C
[carlos@athas intl]$ locale -a | grep bok
bokmal
bokm<E5>l
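
To illustrate why a UTF-8 aware tool might classify this output as
binary, here is a minimal sketch of a structural UTF-8 check. The
helper name is_valid_utf8 is hypothetical (it is not taken from grep or
glibc), and the check is simplified: it validates lead and continuation
bytes only and does not reject every overlong or surrogate encoding.

```c
#include <stddef.h>

/* Hypothetical helper: return 1 if the n bytes at s have the
   lead/continuation structure of UTF-8, 0 otherwise.  */
int
is_valid_utf8 (const unsigned char *s, size_t n)
{
  size_t i = 0;
  while (i < n)
    {
      unsigned char c = s[i];
      size_t need;              /* continuation bytes expected */
      if (c < 0x80)
        need = 0;               /* plain ASCII */
      else if (c >= 0xC2 && c <= 0xDF)
        need = 1;
      else if (c >= 0xE0 && c <= 0xEF)
        need = 2;
      else if (c >= 0xF0 && c <= 0xF4)
        need = 3;
      else
        return 0;               /* stray continuation or invalid lead */
      if (need > n - i - 1)
        return 0;               /* sequence runs past the end */
      for (size_t k = 1; k <= need; k++)
        if ((s[i + k] & 0xC0) != 0x80)
          return 0;             /* not a continuation byte */
      i += need + 1;
    }
  return 1;
}
```

With this check, the ISO-8859-1 bytes "bokm\xe5l" are rejected (0xE5
looks like the start of a three-byte sequence that never completes),
while the UTF-8 bytes "bokm\xc3\xa5l" are accepted.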

A naive fix is for `locale` to examine the present locale and
use iconv to convert the names to the target locale. So for example
if the user is using en_US.UTF8 then the above would get converted
to:

bokmal
bokmål

There is also one more ISO-8859-1 name in locale.alias with a 
diacritic:

français

The problem then is that if you took that UTF8 converted name of
`bokmål` and tried to call setlocale with that, it would fail.
It fails because the name in UTF8 doesn't match the name in
ISO-8859-1 that's stored as the alias or official locale name.

That is to say that you could have two apparently identical source
files, one works (encoded in ISO-8859-1) and one doesn't (encoded
UTF-8). This is because setlocale takes a `char *` as input for
the name of the locale.

e.g.

cat >> setlocale_iso88591.c <<EOF
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
/* "bokmål" in ISO-8859-1: the å is the single byte 0xE5.  */
char bokmal_iso88591[] = "bokm\xe5l";
int
main (void)
{
  char *result;
  result = setlocale(LC_ALL, bokmal_iso88591);
  if (result == NULL)
    {
      perror ("setlocale");
      exit(1);
    }
  printf ("setlocale() = %s\n", result);
  return 0;
}
EOF

cat >> setlocale_utf8.c <<EOF
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
char bokmal_utf8[] = "bokmål";
int
main (void)
{
  char *result;
  result = setlocale(LC_ALL, bokmal_utf8);
  if (result == NULL)
    {
      perror ("setlocale");
      exit(1);
    }
  printf ("setlocale() = %s\n", result);
  return 0;
}
EOF

cat >> build.sh <<EOF
gcc -Wall -pedantic -O0 -g3 -o setlocale_fail setlocale_utf8.c
gcc -Wall -pedantic -O0 -g3 -o setlocale_pass setlocale_iso88591.c
EOF

chmod u+x build.sh
./build.sh

[carlos@athas setlocale]$ ./setlocale_fail 
setlocale: No such file or directory
[carlos@athas setlocale]$ ./setlocale_pass 
setlocale() = bokm<E5>l

The literal bytes passed to setlocale for the name of the locale
must be in ISO-8859-1 in order to be identified as the
nb_NO.ISO-8859-1 locale that is eventually loaded.

This means that changing the output encoding from `locale -a`
would break programs trying to use that output to set a 
locale.
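
The mismatch is easy to see at the byte level. The following is only a
sketch; hex_bytes is a hypothetical helper written for this message,
not something from glibc.

```c
#include <stdio.h>

/* Hypothetical helper: write the bytes of the NUL-terminated string s
   into out as space-separated lowercase hex, and return out.  Output
   is cut short if out is too small.  */
char *
hex_bytes (const char *s, char *out, size_t outsz)
{
  size_t used = 0;
  if (outsz == 0)
    return out;
  out[0] = '\0';
  for (size_t i = 0; s[i] != '\0'; i++)
    {
      int w = snprintf (out + used, outsz - used, "%s%02x",
                        i ? " " : "", (unsigned char) s[i]);
      if (w < 0 || (size_t) w >= outsz - used)
        break;                  /* no room left in out */
      used += (size_t) w;
    }
  return out;
}
```

For the ISO-8859-1 name this gives "62 6f 6b 6d e5 6c" (6 bytes), and
for the UTF-8 name "62 6f 6b 6d c3 a5 6c" (7 bytes), so a byte-wise
comparison against the stored alias cannot succeed for the UTF-8 form.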

Using `locale -a -v` you can see that it's an ISO-8859-1 locale,
and surmise the name of the locale is encoded in ISO-8859-1, and
that you need to convert it to display it in UTF8 correctly.

e.g.
locale: bokm<E5>l          archive: /usr/lib/locale/locale-archive
-------------------------------------------------------------------------------
    title | Norwegian (Bokmal) locale for Norway
   source | Norsk Standardiseringsforbund
  address | University Library, Drammensveien 41, N-9242 Oslo, Norge
    email | bug-glibc-locales@gnu.org
 language | Norwegian, Bokm<E5>l
territory | Norway
 revision | 1.0
     date | 2000-06-29
  codeset | ISO-8859-1
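
For display purposes only, the conversion from ISO-8859-1 to UTF-8 is
mechanical, since each ISO-8859-1 byte is the Unicode code point of the
same value. A minimal sketch follows; latin1_to_utf8 is a hypothetical
helper, and in practice iconv would be the more general tool.

```c
#include <stddef.h>

/* Hypothetical helper: convert the NUL-terminated ISO-8859-1 string
   in to UTF-8 in out.  Returns the number of bytes written (not
   counting the NUL), or (size_t) -1 if out is too small.  */
size_t
latin1_to_utf8 (const char *in, char *out, size_t outsz)
{
  size_t o = 0;
  for (size_t i = 0; in[i] != '\0'; i++)
    {
      unsigned char c = (unsigned char) in[i];
      size_t need = c < 0x80 ? 1 : 2;
      /* Keep room for this character plus the trailing NUL.  */
      if (outsz < need + 1 || o > outsz - need - 1)
        return (size_t) -1;
      if (c < 0x80)
        out[o++] = (char) c;    /* ASCII is unchanged */
      else
        {
          /* U+0080..U+00FF become a two-byte sequence.  */
          out[o++] = (char) (0xC0 | (c >> 6));
          out[o++] = (char) (0x80 | (c & 0x3F));
        }
    }
  if (o >= outsz)
    return (size_t) -1;
  out[o] = '\0';
  return o;
}
```

The result is suitable for printing in a UTF-8 terminal, but as noted
above it is not the string to hand back to setlocale().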

In summary:

The output of `locale -a` could be in mixed encodings.

The locale name passed to setlocale() must match what `locale -a`
prints byte for byte.

You can't easily use grep to process the output of `locale -a`.

We should stop using aliases that are anything but ASCII to avoid
future problems.

Questions:

Can we make this any better?

Cheers,
Carlos.

* Re: Output of `locale -a` could be in mixed encodings?
From: Paul Eggert @ 2015-01-21  2:06 UTC (permalink / raw)
  To: Carlos O'Donell, GNU C Library, libc-locales

Carlos O'Donell wrote:
> We should stop using aliases that are anything but ASCII to avoid
> future problems.

Yes, that sounds right.  If you comment out the "français" and "bokmål" lines 
from locale.alias, I assume that fixes the problem?  If so, let's do that.  The 
fancier fixes you mention also sound nice, but omitting the non-ASCII aliases 
should be helpful anyway, for programs other than 'locale' itself.

* Re: Output of `locale -a` could be in mixed encodings?
From: Carlos O'Donell @ 2015-01-21  2:30 UTC (permalink / raw)
  To: Paul Eggert, GNU C Library, libc-locales

On 01/20/2015 09:04 PM, Paul Eggert wrote:
> Carlos O'Donell wrote:
>> We should stop using aliases that are anything but ASCII to avoid 
>> future problems.
> 
> Yes, that sounds right.  If you comment out the "français" and
> "bokmål" lines from locale.alias, I assume that fixes the problem?
> If so, let's do that.  The fancier fixes you mention also sound nice,
> but omitting the non-ASCII aliases should be helpful anyway, for
> programs other than 'locale' itself.

Won't such a fix break existing applications relying on this alias
to operate correctly?

What I need is a way to mark the locale deprecated, accept it as an
alias, but not display it in `locale -a`.

Thus I think the robust backwards compatible solution is just slightly
more complicated.

In that case we would add a `francais` option and hide `français`
and `bokmål` but still accept them. We would also add a test to validate
that all of the glibc locale names are within ASCII.
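
The core of such a test could be as small as the following sketch. The
function names are hypothetical, and how the list of names would be
gathered (from locale.alias or from the installed locales) is left
open here.

```c
#include <stddef.h>

/* Hypothetical helper: return 1 if every byte of name is 7-bit
   ASCII, 0 otherwise.  */
int
is_ascii_name (const char *name)
{
  for (size_t i = 0; name[i] != '\0'; i++)
    if ((unsigned char) name[i] > 0x7F)
      return 0;
  return 1;
}

/* Hypothetical helper: count the entries of a NULL-terminated list
   of names that contain at least one non-ASCII byte.  */
size_t
count_non_ascii (const char *const *names)
{
  size_t bad = 0;
  for (size_t i = 0; names[i] != NULL; i++)
    if (!is_ascii_name (names[i]))
      bad++;
  return bad;
}
```

A test would then fail whenever count_non_ascii returns anything other
than zero for the names that `locale -a` is allowed to display.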

Did I miss something?

Cheers,
Carlos.

* Re: Output of `locale -a` could be in mixed encodings?
From: Joseph Myers @ 2015-01-21  2:37 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: GNU C Library, libc-locales

On Tue, 20 Jan 2015, Carlos O'Donell wrote:

> The problem then is that if you took that UTF8 converted name of
> `bokmål` and tried to call setlocale with that, it would fail.
> It fails because the name in UTF8 doesn't match the name in
> ISO-8859-1 that's stored as the alias or official locale name.

This could be a bug in setlocale.

POSIX says the locale name is a "character string", which is defined as a 
sequence of multibyte characters.  So arguably it should be interpreted in 
the current locale's character set (and so work if the LC_CTYPE before 
setlocale is that of a UTF-8 locale, fail if it's ASCII or ISO-8859-1).  
Except that the statement about being a character string is not CX-shaded, 
so should not be taken as intending any semantics beyond those in ISO C, 
and I don't see ISO C requiring any such thing.  (That said, I think 
interpreting the locale name in the current locale makes sense anyway, and 
is at least consistent with ISO C, even if not required.)

Now, we should also probably say that all non-ASCII locale names are 
deprecated (so this would just be a matter of adding a few more aliases 
for this locale using different encodings).  And then we could say that 
the locale utility doesn't output any non-ASCII locale names - as long as 
each locale has a valid ASCII name, I think that's conforming to POSIX.  
In fact, these aliases are already deprecated (locale.alias says "This 
file is obsolete ... Nobody should rely on the names defined here").
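
As a rough sketch of what "a few more aliases in different encodings"
amounts to, consider a lookup table in which each byte spelling maps to
the same target. The table and resolve_alias are hypothetical and do
not reflect how glibc actually parses locale.alias; the target name is
the one given earlier in the thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical alias table: one entry per byte spelling.  */
struct alias
{
  const char *name;       /* exact bytes the caller passes */
  const char *canonical;  /* locale that gets loaded */
};

static const struct alias aliases[] = {
  { "bokmal",        "nb_NO.ISO-8859-1" },  /* ASCII */
  { "bokm\xe5l",     "nb_NO.ISO-8859-1" },  /* ISO-8859-1 spelling */
  { "bokm\xc3\xa5l", "nb_NO.ISO-8859-1" },  /* UTF-8 spelling */
};

/* Hypothetical helper: return the canonical name for an alias, or
   NULL if the bytes match no entry.  */
const char *
resolve_alias (const char *name)
{
  for (size_t i = 0; i < sizeof aliases / sizeof aliases[0]; i++)
    if (strcmp (aliases[i].name, name) == 0)
      return aliases[i].canonical;
  return NULL;
}
```

The lookup stays a plain byte comparison; accepting more spellings just
means listing more byte sequences, while only the ASCII one would need
to be shown to users.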

It's also the case that there's an existing weak deprecation of non-UTF-8 
locales (in the sense that every locale with a non-UTF-8 character set is 
supposed to have a corresponding locale with UTF-8 character set - if any 
don't, that's a bug unless there's some other reason for the locale to be 
deprecated whatever the character set - and the threshold for adding any 
new non-UTF-8 locales should be higher than for adding new UTF-8 locales).

>  language | Norwegian, Bokm<E5>l

That part of the output, however, should clearly be output in the user's 
locale character set - not in the character set of the locale in question.

-- 
Joseph S. Myers
joseph@codesourcery.com

* Re: Output of `locale -a` could be in mixed encodings?
From: Paul Eggert @ 2015-01-21  4:50 UTC (permalink / raw)
  To: Carlos O'Donell, GNU C Library, libc-locales

Carlos O'Donell wrote:

> Won't such a fix break existing applications relying on this alias
> to operate correctly?

There shouldn't be any such applications.  Applications are already on notice 
that they should not be relying on these obsolete aliases.

This bug is biting us now because recent versions of GNU grep are stricter about 
treating non-text input as binary.  Existing applications that apply 'grep' to 
locale.alias may have problems if locale.alias is incompatible with the current 
locale's encoding.  This suggests that locale.alias (including its comments) 
should be ASCII only.

> What I need is a way to mark the locale deprecated, accept it as an
> alias, but not display it in `locale -a`.

That would also be nice, though I expect it's lower priority.  I assume that it 
could be done with an ASCII-only locale.alias, somehow.

* Re: Output of `locale -a` could be in mixed encodings?
From: Martin Sebor @ 2015-01-21 16:19 UTC (permalink / raw)
  To: Carlos O'Donell, GNU C Library, libc-locales

On 01/20/2015 06:38 PM, Carlos O'Donell wrote:
> I'm going to ramble a bit here because the problem is rambling.
>
> The output of `locale -a` can't be easily grepped.
>
> [carlos@athas intl]$ locale -a | grep bok
> Binary file (standard input) matches
>
> The names of various localizations are written in their respective
> encodings, e.g. ISO-8859-1.
>
> Thus the Bokmal name is output in ISO-8859-1 along with an ASCII
> version. This makes it difficult to use grep to parse `locale -a`
> output in anything but ISO-8859-1.
>
> e.g.
> [carlos@athas intl]$ export LANG=C
> [carlos@athas intl]$ locale -a | grep bok
> bokmal
> bokm<E5>l

Alternatively, the GNU grep -a option works:

$ LC_ALL=$(locale -a | grep -a bokm | tail -n1) locale | grep -a LC_ALL
LC_ALL=bokm<E5>l

>
> A naive fix is for `locale` to examine the present locale and
> use iconv to convert the names to the target locale. So for example
> if the user is using en_US.UTF8 then the above would get converted
> to:

I'm not sure if POSIX intends to allow that when the -a option
is used or whether what the implementation does in that case
is unspecified:

   The application shall ensure that the LANG, LC_*, and [XSI]
   NLSPATH environment variables specify the current locale
   environment to be written out; they shall be used if the -a
   option is not specified.

Perhaps because (as you noted) converting the string to some
other encoding would make it unusable as the name of the same
locale.

> In summary:
>
> The output of `locale -a` could be in mixed encodings.
>
> The locale name must be exactly as `locale -a` prints it for it
> to work with setlocale(), those exact bytes.
>
> You can't easily use grep to process the output of `locale -a`.
>
> We should stop using aliases that are anything but ASCII to avoid
> future problems.

This seems like the same problem as with file names that
contain non-ASCII characters. The only robust solution is
to avoid using such characters. Both by the implementation
and by applications (e.g., in locale names created by users
via the localedef utility).

Martin
