The C locale

public inbox for cygwin@cygwin.com

* The C locale
@ 2009-08-30 16:59 Andy Koppe
  2009-08-31  0:53 ` Christopher Faylor
  0 siblings, 1 reply; 51+ messages in thread
From: Andy Koppe @ 2009-08-30 16:59 UTC (permalink / raw)
  To: cygwin

Trying to reply to Tuomo Valkonen's post about locale issues, I got
rather confused about the C locale. The manual and the POSIX standard
say that it supports ASCII only, so in theory anything above 0x7F
should be rejected. In practice though, both Cygwin 1.5 and 1.7 do
support characters above 0x7F in the C locale, which could be quite
useful. Trouble is, they do so rather inconsistently.

Both in 1.5 and 1.7, the mb conversion functions treat such characters
as ISO-8859-1. In other words, conversions between chars and wchars are
simple casts (except that wchars above 0xFF can't be converted). This
makes some sense.
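
A minimal Python sketch of the "simple cast" behaviour described above. The two helper names are hypothetical and only model the description; they are not newlib's code.

```python
# Model of the behaviour described above: in the "C" locale the conversions
# act like ISO-8859-1, i.e. byte value == code point.

def c_locale_mbtowc(byte):
    """char -> wchar as a plain cast (hypothetical helper)."""
    return chr(byte)

def c_locale_wctomb(ch):
    """wchar -> char; only code points up to 0xFF fit into one byte."""
    if ord(ch) > 0xFF:
        raise ValueError("cannot convert U+%04X" % ord(ch))
    return ord(ch)

# ISO-8859-1 is exactly this identity mapping for all 256 byte values.
assert all(bytes([b]).decode("latin-1") == c_locale_mbtowc(b) for b in range(256))

assert c_locale_wctomb("\u00e4") == 0xE4   # a-umlaut round-trips as one byte
try:
    c_locale_wctomb("\u20ac")              # euro sign lies above 0xFF
except ValueError as err:
    print(err)
```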

Filename handling is different though. Cygwin 1.5 translates filenames
according to the system's ANSI codepage. I guess the inconsistency
with the mb functions didn't really matter, as the mb functions were
pretty much useless anyway, and supporting the system codepage was
more important.

So, with Cygwin 1.7, I'd have expected filename handling in the C
locale to either use ISO-8859-1 for consistency with the mb functions,
or the ANSI codepage for compatibility with 1.5. In actual fact
though, it uses UTF-8.

Is this on purpose? If so, shouldn't the multibyte conversion
functions in the C locale use UTF-8 as well?

Andy

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


* Re: The C locale
  2009-08-30 16:59 The C locale Andy Koppe
@ 2009-08-31  0:53 ` Christopher Faylor
  2009-09-02  6:29   ` Andy Koppe
  0 siblings, 1 reply; 51+ messages in thread
From: Christopher Faylor @ 2009-08-31  0:53 UTC (permalink / raw)
  To: cygwin

On Sun, Aug 30, 2009 at 05:59:11PM +0100, Andy Koppe wrote:
>Trying to reply to Tuomo Valkonen's post about locale issues, I got
>rather confused about the C locale. The manual and the POSIX standard
>say that it supports ASCII only, so in theory anything above 0x7F
>should be rejected. In practice though, both Cygwin 1.5 and 1.7 do
>support characters above 0x7F in the C locale, which could be quite
>useful. Trouble is, they do so rather inconsistently.
>
>Both in 1.5 and 1.7, the mb conversion functions treat such characters
>as ISO-8859-1. In other words, conversions between chars and wchars are
>simple casts (except that wchars above 0xFF can't be converted). This
>makes some sense.
>
>Filename handling is different though. Cygwin 1.5 translates filenames
>according to the system's ANSI codepage. I guess the inconsistency
>with the mb functions didn't really matter, as the mb functions were
>pretty much useless anyway, and supporting the system codepage was
>more important.
>
>So, with Cygwin 1.7, I'd have expected filename handling in the C
>locale to either use ISO-8859-1 for consistency with the mb functions,
>or the ANSI codepage for compatibility with 1.5. In actual fact
>though, it uses UTF-8.
>
>Is this on purpose? If so, shouldn't the multibyte conversion
>functions in the C locale use UTF-8 as well?

Since Cygwin has a clear system that it is supposed to be emulating,
the real question is "What does Linux do?"

cgf


* Re: The C locale
  2009-08-31  0:53 ` Christopher Faylor
@ 2009-09-02  6:29   ` Andy Koppe
  2009-09-02 11:48     ` Eric Blake
  2009-09-02 13:56     ` IWAMURO Motonori
  0 siblings, 2 replies; 51+ messages in thread
From: Andy Koppe @ 2009-09-02  6:29 UTC (permalink / raw)
  To: cygwin

Christopher Faylor:
>Andy Koppe:
>>Trying to reply to [banned]'s post about locale issues, I got
>>rather confused about the C locale. The manual and the POSIX standard
>>say that it supports ASCII only, so in theory anything above 0x7F
>>should be rejected. In practice though, both Cygwin 1.5 and 1.7 do
>>support characters above 0x7F in the C locale, which could be quite
>>useful. Trouble is, they do so rather inconsistently.
>>
>>Both in 1.5 and 1.7, the mb conversion functions treat such characters
>>as ISO-8859-1. In other words, conversions between chars and wchars are
>>simple casts (except that wchars above 0xFF can't be converted). This
>>makes some sense.
>>
>>Filename handling is different though. Cygwin 1.5 translates filenames
>>according to the system's ANSI codepage. I guess the inconsistency
>>with the mb functions didn't really matter, as the mb functions were
>>pretty much useless anyway, and supporting the system codepage was
>>more important.
>>
>>So, with Cygwin 1.7, I'd have expected filename handling in the C
>>locale to either use ISO-8859-1 for consistency with the mb functions,
>>or the ANSI codepage for compatibility with 1.5. In actual fact
>>though, it uses UTF-8.
>>
>>Is this on purpose? If so, shouldn't the multibyte conversion
>>functions in the C locale use UTF-8 as well?
>
>Since Cygwin has a clear system that it is supposed to be emulating,
>the real question is "What does Linux do?"

Tried it on Debian and Suse: the multibyte conversion functions are
strict ASCII, i.e. anything beyond 0x7F is considered an encoding error.

POSIX requires that ASCII is supported in the C locale, but does not
actually outlaw ASCII-compatible extensions beyond that.

Locales don't affect filenames on Linux, i.e. any sequence of bytes
passed to open() goes straight to disk (except for the path
separator). This effectively means that filenames are encoded in
whatever charset happened to be active at the time the file was
created. Hence anyone accessing it with a different charset setting
will get gibberish.

POSIX is impressively unhelpful on the topic of filenames. All it
guarantees for filenames is the "portable filename character set":
ASCII letters and digits, plus the hyphen, dot, and underscore.

So altogether we've got no fewer than four choices here:
- strict ASCII (as with Linux mb functions)
- ISO-8859-1 (as with newlib mb functions)
- Default Windows ANSI/OEM codepage (as with Cygwin 1.5 filenames)
- UTF-8 (as with Cygwin 1.7 filenames)
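
To make the four choices concrete, here is a small Python sketch that encodes the same name under each of them. CP1252 and CP437 are assumed as the Windows ANSI and OEM codepages purely for illustration; the actual codepages depend on the system's regional settings.

```python
name = "b\u00e4h"   # 'b', a-umlaut, 'h'

# Labels follow the list above; the two Windows codepages are assumptions.
candidates = {
    "strict ASCII": "ascii",
    "ISO-8859-1": "latin-1",
    "Windows ANSI (CP1252 assumed)": "cp1252",
    "Windows OEM (CP437 assumed)": "cp437",
    "UTF-8": "utf-8",
}

encoded = {}
for label, codec in candidates.items():
    try:
        encoded[label] = name.encode(codec)
    except UnicodeEncodeError:
        encoded[label] = None   # not representable in this charset
    print(label, "->", encoded[label])
```

The a-umlaut is unrepresentable in ASCII, a single byte 0xE4 in ISO-8859-1 and CP1252, a different single byte in CP437, and two bytes in UTF-8, which is why mixing these choices between layers goes wrong.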

In Cygwin 1.5, both file operations and the console use the default
Windows codepage, which often contains all the characters a user cares
about. If you set up readline for 8-bit I/O and change the console
font to something useful, this works reasonably well, including
Cygwin-created filenames showing up correctly in Explorer.

A rather important exception is 'ls', which seems to have its own
hardcoded limitation to 7 bits for the C locale: anything non-ASCII is
shown as '?' there. Things do work correctly elsewhere though, e.g. in
bash tab completion or Midnight Commander.

A user with such a setup who upgrades to 1.7 will find that things
will no longer work as before, since filenames are translated to UTF-8
whereas the console now seems to use ISO-8859-1 (presumably via the mb
functions) by default. Hence a file called 'bäh' in Explorer (with
a-umlaut in the middle), will show as 'bÃ¤h' instead.
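
The garbled display can be reproduced with a short Python sketch: the name is encoded as UTF-8 (as the 1.7 filename layer is described to do) and the resulting bytes are then read as ISO-8859-1 (as the console is described to do). The variable names are illustrative only.

```python
explorer_name = "b\u00e4h"                    # as shown in Explorer

# The filename layer hands the name to the application as UTF-8 ...
utf8_bytes = explorer_name.encode("utf-8")

# ... and the console interprets each byte as one ISO-8859-1 character.
shown = utf8_bytes.decode("latin-1")

print(ascii(explorer_name), "->", utf8_bytes, "->", ascii(shown))
assert len(shown) == 4                        # one character became two
```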

And if you try to create 'bäh' in Cygwin 1.7, you actually get a file
called 'b', because the 'ä' (0xE4) in ISO-8859-1 turns into an
encoding error when interpreted as UTF-8, and the name simply seems to
be truncated at that point.
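
A Python sketch of the hypothesis in the paragraph above, namely that the ISO-8859-1 byte is not valid UTF-8 and the name is cut at the first invalid byte. The truncation step is an assumption that models the observed result; it is not taken from Cygwin's source.

```python
typed = "b\u00e4h".encode("latin-1")   # what an ISO-8859-1 console sends: b'b\xe4h'

valid_prefix = typed
try:
    typed.decode("utf-8")
except UnicodeDecodeError as err:
    # 0xE4 announces a three-byte UTF-8 sequence, but 'h' is not a
    # continuation byte, so decoding fails at offset err.start.
    valid_prefix = typed[:err.start]   # assumed truncation at the error

print(typed, "->", valid_prefix)
```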

I see two good solutions:
- Use the default Windows codepage for filenames, console, and
multibyte functions. This is what happens already if you specify a
locale with a language but no charset, e.g. "en". Maximum 1.5
compatibility.
- Use UTF-8 throughout. Full Unicode support out-of-the box.

And a cheap'n'nasty one:
- Restrict the multibyte functions and console to 7-bit ASCII. Still
means it's inconsistent with the filename conversions, but at least
non-ASCII characters wouldn't show up wrongly. Instead, they wouldn't
show at all.

Andy


* Re: The C locale
  2009-09-02  6:29   ` Andy Koppe
@ 2009-09-02 11:48     ` Eric Blake
  2009-09-02 20:10       ` Andy Koppe
  2009-09-02 13:56     ` IWAMURO Motonori
  1 sibling, 1 reply; 51+ messages in thread
From: Eric Blake @ 2009-09-02 11:48 UTC (permalink / raw)
  To: cygwin

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

According to Andy Koppe on 9/2/2009 12:29 AM:
> A rather important exception is 'ls', which seems to have its own
> hardcoded limitation to 7 bits for the C locale: anything non-ASCII is
> shown as '?' there.

That's only because the current build of cygwin ls pre-dates a lot of the
locale support.  I'm hoping that when I get time to build coreutils 7.5,
that ls will start printing characters marked printable in the current locale.

- --
Don't work too hard, make some time for fun as well!

Eric Blake             ebb9@byu.net
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Public key at home.comcast.net/~ericblake/eblake.gpg
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkqeW5oACgkQ84KuGfSFAYCorACgwpbJ4oKz8+iEiwj5CkFgDBi+
+fkAoMJBlo9tZIyVzArULs9ZBQXREaI1
=ESA8
-----END PGP SIGNATURE-----


* Re: The C locale
  2009-09-02 11:48     ` Eric Blake
@ 2009-09-02 20:10       ` Andy Koppe
  0 siblings, 0 replies; 51+ messages in thread
From: Andy Koppe @ 2009-09-02 20:10 UTC (permalink / raw)
  To: cygwin

Eric Blake:
>> A rather important exception is 'ls', which seems to have its own
>> hardcoded limitation to 7 bits for the C locale: anything non-ASCII is
>> shown as '?' there.
>
> That's only because the current build of cygwin ls pre-dates a lot of the
> locale support.  I'm hoping that when I get time to build coreutils 7.5,
> that ls will start printing characters marked printable in the current locale.

Don't worry, on 1.7 it already works fine in locales other than "C".
And it turns out that the restriction with the latter is due to newlib
being inconsistent: whereas the conversion functions use ISO-8859-1,
the ctype functions insist on ASCII, i.e. the is*() classification
functions return 0 for anything above 0x7F.

So in the C locale we've currently got UTF-8 for filenames, ISO-8859-1
for the console and multibyte conversions, and ASCII for the ctype
functions.
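
The mismatch between conversion and classification can be illustrated by analogy in Python, where the `bytes` classification methods are ASCII-only while `str` methods see the decoded character. This is only an analogy for the newlib behaviour described above, not a reproduction of it.

```python
raw = bytes([0xE4])                       # a-umlaut in ISO-8859-1

# Conversion layer: the byte maps to a real letter.
converted = raw.decode("latin-1")

# Classification on the raw byte is ASCII-only, like the ctype functions
# described above, so the same value is not considered alphabetic.
ctype_says_alpha = raw.isalpha()
unicode_says_alpha = converted.isalpha()

print(ctype_says_alpha, unicode_says_alpha)
```

A program such as 'ls' that asks the classification layer whether a byte is printable would therefore replace it, even though the conversion layer accepts it.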

Andy


* Re: The C locale
  2009-09-02  6:29   ` Andy Koppe
  2009-09-02 11:48     ` Eric Blake
@ 2009-09-02 13:56     ` IWAMURO Motonori
  2009-09-07 20:08       ` Andy Koppe
  1 sibling, 1 reply; 51+ messages in thread
From: IWAMURO Motonori @ 2009-09-02 13:56 UTC (permalink / raw)
  To: cygwin

Hi.

2009/9/2 Andy Koppe <andy.koppe@gmail.com>:
> I see two good solutions:
> - Use the default Windows codepage for filenames, console, and
> multibyte functions. This is what happens already if you specify a
> locale with a language but no charset, e.g. "en". Maximum 1.5
> compatibility.
> - Use UTF-8 throughout. Full Unicode support out-of-the box.

I want to use UTF-8 throughout.
Because:
- a lot of UNIX tools using network (e.g. rsync, scp, ...) treat the
file name as 8bit byte array.
- default locale of modern UNIX based OS is *.UTF-8.
- The file with the filename including the character outside the
codepage (e.g. files in iTunes folder) can be handled.
-- 
IWAMURO Motonori <http://vmi.jp/>


* Re: The C locale
  2009-09-02 13:56     ` IWAMURO Motonori
@ 2009-09-07 20:08       ` Andy Koppe
  2009-09-08 19:35         ` Corinna Vinschen
  0 siblings, 1 reply; 51+ messages in thread
From: Andy Koppe @ 2009-09-07 20:08 UTC (permalink / raw)
  To: cygwin

2009/9/2 IWAMURO Motonori:
> I want to use UTF-8 throughout.
> Because:
> - a lot of UNIX tools using network (e.g. rsync, scp, ...) treat the
> file name as 8bit byte array.
> - default locale of modern UNIX based OS is *.UTF-8.
> - The file with the filename including the character outside the
> codepage (e.g. files in iTunes folder) can be handled.

I'm minded to agree, but actually there's a big stumbling block here:
many interactive programs in Cygwin do not (yet) support UTF-8, e.g.
nano, mutt, and mc. If you try, you get all sorts of funny effects
with invalid characters and mispositioned cursors. That's not
acceptable as default.

Which leaves one apparently good solution for the "C" locale:
>> - Use the default Windows codepage for filenames, console, and
>> multibyte functions. This is what happens already if you specify a
>> locale with a language but no charset, e.g. "en". Maximum 1.5
>> compatibility.

On a closely related note, Debian are introducing a "C.UTF-8" locale
as a language-neutral locale with a UTF-8 character set. This is
useful for choosing UTF-8 without picking up language-specific stuff
like sorting rules. See here:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=522776. It's a rather
lengthy thread, but in the end they did decide to go for it.

Cygwin 1.7, through newlib, already has "C-UTF-8", as well as the
likes of "C-ISO-8859-1" or "C-SJIS". So how about replacing the "C-"
with "C." in those, considering that Cygwin has no backward
compatibility requirement regarding those?

Andy


* Re: The C locale
  2009-09-07 20:08       ` Andy Koppe
@ 2009-09-08 19:35         ` Corinna Vinschen
  2009-09-08 20:48           ` Andy Koppe
  2009-09-08 21:49           ` Andy Koppe
  0 siblings, 2 replies; 51+ messages in thread
From: Corinna Vinschen @ 2009-09-08 19:35 UTC (permalink / raw)
  To: cygwin

On Sep  7 21:08, Andy Koppe wrote:
> Which leaves one apparently good solution for the "C" locale:
> >> - Use the default Windows codepage for filenames, console, and
> >> multibyte functions. This is what happens already if you specify a
> >> locale with a language but no charset, e.g. "en". Maximum 1.5
> >> compatibility.

UTF-8 has been chosen because it has the advantage that every UTF-16
Windows filename will result in a valid multibyte string.  Every choice
has its advantage and its trade-offs.  Maximum 1.5 compatibility
(what for and how long?) vs. maximum default usability in the long run
(at least I hope so).

> On a closely related note, Debian are introducing a "C.UTF-8" locale
> as a language-neutral locale with a UTF-8 character set. This is
> useful for choosing UTF-8 without picking up language-specific stuff
> like sorting rules. See here:
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=522776. It's a rather
> lengthy thread, but in the end they did decide to go for it.

Doesn't just setting LC_CTYPE=fo_ba.UTF-8 have the same result?

> Cygwin 1.7, through newlib, already has "C-UTF-8", as well as the
> likes of "C-ISO-8859-1" or "C-SJIS". So how about replacing the "C-"
> with "C." in those, considering that Cygwin has no backward
> compatibility requirement regarding those?

No, but newlib has.  That was the only reason to keep these specifiers.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat


* Re: The C locale
  2009-09-08 19:35         ` Corinna Vinschen
@ 2009-09-08 20:48           ` Andy Koppe
  2009-09-08 21:49           ` Andy Koppe
  1 sibling, 0 replies; 51+ messages in thread
From: Andy Koppe @ 2009-09-08 20:48 UTC (permalink / raw)
  To: cygwin

2009/9/8 Corinna Vinschen:
>> Which leaves one apparently good solution for the "C" locale:
>> >> - Use the default Windows codepage for filenames, console, and
>> >> multibyte functions. This is what happens already if you specify a
>> >> locale with a language but no charset, e.g. "en". Maximum 1.5
>> >> compatibility.
>
> UTF-8 has been chosen because it has the advantage that every UTF-16
> Windows filename will result in a valid multibyte string.

Fair enough, if the console and the character conversion functions
used UTF-8 as well (and if applications such as mc, nano and mutt were
rebuilt with UTF-8 support).

Unfortunately, they use ISO-8859-1, so out-of-the box the support for
non-ASCII characters in Cygwin 1.7 is effectively broken. Please see
posts earlier in this thread for the problems caused by this.

Yes, users can set a locale variable to get this working, but hacking
Cygwin.bat or finding the Windows environment variable dialog isn't
exactly intuitive. And they didn't have to do that in 1.5 to at least
get the Windows "ANSI" codepage working.

> Every choice has its advantage and its trade-offs.

The current choices have nothing but disadvantages, due to mixing of
UTF-8 and ISO-8859-1.

Besides, regarding the Windows codepage, wasn't the ^N scheme
introduced to deal with filename characters outside the current
charset?

>> On a closely related note, Debian are introducing a "C.UTF-8" locale
>> as a language-neutral locale with a UTF-8 character set. This is
>> useful for choosing UTF-8 without picking up language-specific stuff
>> like sorting rules. See here:
>> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=522776. It's a rather
>> lengthy thread, but in the end they did decide to go for it.
>
> Doesn't just setting LC_CTYPE=fo_ba.UTF-8 have the same result?

For newlib, yes, because it doesn't (yet) care about the language
part. But the language part nevertheless matters for many programs,
and it may also matter when connecting to other hosts, e.g. by
changing the sort order in 'ls'.

"C.charset" would mean: give me all the default behaviours, except
that I want this specific charset.
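
As a rough illustration of the naming question, the following Python sketch splits a locale name into a language part and a charset part, accepting both the "C.charset" form proposed here and the existing newlib-style "C-charset" form. The function `split_locale` is hypothetical and ignores modifiers such as "@euro".

```python
def split_locale(name):
    """Return (language, charset) for a locale name (hypothetical helper).

    Accepts 'lang.charset', the newlib-style 'C-charset', or a bare
    language, in which case the charset is None.
    """
    if name.startswith("C-"):
        return "C", name[2:]
    language, dot, charset = name.partition(".")
    return language, (charset if dot else None)

for example in ("C", "C.UTF-8", "C-SJIS", "en", "en_GB.UTF-8"):
    print(example, "->", split_locale(example))
```

Under this reading, "C.UTF-8" and "C-UTF-8" select the same thing: default behaviours everywhere, plus an explicit charset.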

>> Cygwin 1.7, through newlib, already has "C-UTF-8", as well as the
>> likes of "C-ISO-8859-1" or "C-SJIS". So how about replacing the "C-"
>> with "C." in those, considering that Cygwin has no backward
>> compatibility requirement regarding those?
>
> No, but newlib has.

Understood. I meant a __CYGWIN__-guarded change.

Andy


* Re: The C locale
  2009-09-08 19:35         ` Corinna Vinschen
  2009-09-08 20:48           ` Andy Koppe
@ 2009-09-08 21:49           ` Andy Koppe
  2009-09-21 10:38             ` Corinna Vinschen
  1 sibling, 1 reply; 51+ messages in thread
From: Andy Koppe @ 2009-09-08 21:49 UTC (permalink / raw)
  To: cygwin

ps:
> Maximum 1.5 compatibility (what for and how long?)  vs. maximum
> default usability in the long run (at least I hope so).

Compatibility for users upgrading to 1.7, who are used to being able to
use the non-ASCII chars in their ANSI codepage, which is usually all
they care about. And who have files encoded in that codepage, while
being blissfully unaware what stuff like "LC_CTYPE" or "CP1251" means.
And who are therefore going to complain about Cygwin 1.7 breaking
their files.

Using UTF-8 throughout is a worthwhile aim of course, but it's a bumpy
road to get there, with lots of apps not yet ready. Moreover, is there
actually any other OS where the "C" locale uses UTF-8? Afaik, Linuxes
just set LANG to *.UTF-8 somewhere in the startup scripts.

Andy


* Re: The C locale
  2009-09-08 21:49           ` Andy Koppe
@ 2009-09-21 10:38             ` Corinna Vinschen
  2009-09-21 13:08               ` Lapo Luchini
                                 ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread
From: Corinna Vinschen @ 2009-09-21 10:38 UTC (permalink / raw)
  To: cygwin

On Sep  8 22:49, Andy Koppe wrote:
> ps:
> > Maximum 1.5 compatibility (what for and how long?)  vs. maximum
> > default usability in the long run (at least I hope so).
> 
> Compatibility for users upgrading to 1.7, who are used to being able to
> use the non-ASCII chars in their ANSI codepage, which is usually all
> they care about. And who have files encoded in that codepage, while
> being blissfully unaware what stuff like "LC_CTYPE" or "CP1251" means.
> And who are therefore going to complain about Cygwin 1.7 breaking
> their files.
> 
> Using UTF-8 throughout is a worthwhile aim of course, but it's a bumpy
> road to get there, with lots of apps not yet ready. Moreover, is there
> actually any other OS where the "C" locale uses UTF-8? Afaik, Linuxes
> just set LANG to *.UTF-8 somewhere in the startup scripts.

Back from vacation I re-read this thread now and I have to say I just
don't know what is the best course of action here.

The idea to use UTF-8 for filename and console operations by default was
to get the least problems converting from UTF-16 to multibyte, so that
readdir() always returns a valid filename.  Since the filename is
supposed to be just a NUL-terminated stream of bytes, the application
shouldn't care what the filename looks like, it should just always use
it as is.  In contrast to Linux filesystems, where the filename actually
*is* a simple byte stream, we have to convert the filename back and
forth from and to UTF-16.
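
A Python sketch of why UTF-8 avoids conversion failures here: any well-formed UTF-16 filename can be expressed in UTF-8, while a single-byte codepage covers only a small subset. The helper `to_multibyte` is hypothetical, and CP1252 merely stands in for an ANSI codepage.

```python
def to_multibyte(utf16_name, charset):
    """Convert a UTF-16LE filename to a multibyte charset (hypothetical helper)."""
    return utf16_name.decode("utf-16-le").encode(charset)

# A Windows filename containing Japanese characters, stored as UTF-16.
windows_name = "\u65e5\u672c\u8a9e.txt".encode("utf-16-le")

utf8_name = to_multibyte(windows_name, "utf-8")   # works for well-formed UTF-16
try:
    ansi_name = to_multibyte(windows_name, "cp1252")
except UnicodeEncodeError:
    ansi_name = None                              # no CP1252 representation

print(utf8_name, ansi_name)
```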

As for the conversion of filenames, you get the same problem on Linux if
the filename contains non-ASCII bytes and these bytes are not a valid
multibyte character in the current locale.

Referring to another of your mails in this thread:

> A user with such a setup who upgrades to 1.7 will find that things
> will no longer work as before, since filenames are translated to UTF-8
> whereas the console now seems to use ISO-8859-1 (presumably via the mb
> functions) by default. Hence a file called 'b\344h' in Explorer (with
> a-umlaut in the middle), will show as 'bÃ¤h' instead.

That's because the console uses the ascii conversion by default which
is the newlib implementation just passing through all bytes unconverted,
even the >=0x80 ones.  That's ISO-8859-1 coincidentally.  However, that
means the console uses the same conversion as the application.  Only the
filename conversion uses UTF-8.

> And if you try to create 'b\344h' in Cygwin 1.7, you actually get a file
> called 'b', because the '\344' (0xE4) in ISO-8859-1 turns into an
> encoding error when interpreted as UTF-8, and the name simply seems to
> be truncated at that point.

Yes, that *is* a problem.

> I see two good solutions:
> - Use the default Windows codepage for filenames, console, and
> multibyte functions. This is what happens already if you specify a
> locale with a language but no charset, e.g. "en". Maximum 1.5
> compatibility.

Hmm, yes, that might be an option.  Allowing the C.UTF-8 locale
could work around the remaining problems.

> - Use UTF-8 throughout. Full Unicode support out-of-the box.

What does "throughout" mean?  Do you want ASCII multibyte conversion to
use UTF-8 as well?  Of course that will still result in problems if
a shell script has a filename hardcoded in, say, CP1252.

> And a cheap'n'nasty one:
> - Restrict the multibyte functions and console to 7-bit ASCII. Still
> means it's inconsistent with the filename conversions, but at least
> non-ASCII characters wouldn't show up wrongly. Instead, they wouldn't
> show at all.

I remember having seen this on Linux as well in some GUI applications.

Apart from that, the fourth solution is to stick to the current
implementation to use UTF-8 for filenames by default and relaxed ASCII
(ISO-8859-1) as provided by newlib for everything else.

The problem is, I don't know for sure what the best approach is, and it
seems nobody except you and Iwamuro is actually interested in discussing
this.  And the two of you hold opposing opinions on the matter.

Personally I have no problem with the current approach.  I understand
the potential problems, but, as usual, solving it one way results in
problems in another scenario and vice versa.

Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat


* Re: The C locale
  2009-09-21 10:38             ` Corinna Vinschen
@ 2009-09-21 13:08               ` Lapo Luchini
  2009-09-21 14:39               ` Charles Wilson
  2009-09-21 21:20               ` Andy Koppe
  2 siblings, 0 replies; 51+ messages in thread
From: Lapo Luchini @ 2009-09-21 13:08 UTC (permalink / raw)
  To: cygwin


Corinna Vinschen wrote:
>> And if you try to create 'b\344h' in Cygwin 1.7, you actually get a file
>> called 'b', because the '\344' (0xE4) in ISO-8859-1 turns into an
>> encoding error when interpreted as UTF-8, and the name simply seems to
>> be truncated at that point.
> 
> Yes, that *is* a problem.

It doesn't seem to be exactly that simple: it doesn't stop at the FIRST
non-UTF8 character, but just before the LAST one.
So I guess it's not because it's an encoding error (I doubt the
conversion is made from the end to the start?) but something more complex.

http://cygwin.com/ml/cygwin/2009-09/msg00329.html

> Personally I have no problem with the current approach.  I understand
> the potential problems, but, as usual, solving it one way results in
> problems in another scenario and vice versa.

FWIW I do like the current approach: for example I can transfer with
rsync and commit and checkout with monotone any filename including
Japanese characters...
(well, except the names of the aforementioned thread, but that's a bug
which can be solved, not something implied in the current approach)

-- 
Lapo Luchini - http://lapo.it/

“C is quirky, flawed, and an enormous success.” (Dennis M. Ritchie)


* Re: The C locale
  2009-09-21 10:38             ` Corinna Vinschen
  2009-09-21 13:08               ` Lapo Luchini
@ 2009-09-21 14:39               ` Charles Wilson
  2009-09-21 21:20               ` Andy Koppe
  2 siblings, 0 replies; 51+ messages in thread
From: Charles Wilson @ 2009-09-21 14:39 UTC (permalink / raw)
  To: cygwin

Corinna Vinschen wrote:
> The problem is, I don't know for sure what the best approach is, and it
> seems nobody except you and Iwamuro is actually interested in discussing
> this.

I don't know about anyone else, but I haven't chimed in because I don't
know enough about the issue to have an intelligent opinion.  I think the
problem is that the already limited universe of cygwin contributors
becomes very tiny when those ignorant of NLS/char-encoding issues are
self-excluded. Sorry.

I've just been hoping that those of you who DO know enough about this
issue can reach a mutually satisfactory compromise/solution.

--
Chuck


* Re: The C locale
  2009-09-21 10:38             ` Corinna Vinschen
  2009-09-21 13:08               ` Lapo Luchini
  2009-09-21 14:39               ` Charles Wilson
@ 2009-09-21 21:20               ` Andy Koppe
  2009-09-22  5:59                 ` Lapo Luchini
  2009-09-24  7:03                 ` IWAMURO Motonori
  2 siblings, 2 replies; 51+ messages in thread
From: Andy Koppe @ 2009-09-21 21:20 UTC (permalink / raw)
  To: cygwin

2009/9/21 Corinna Vinschen:
> Back from vacation I re-read this thread now and I have to say I just
> don't know what is the best course of action here.

I'm afraid I can only reiterate what I said previously.

Let's use the Windows "ANSI" codepage as the character set for the C
locale, for both the conversion functions and filenames. This means
CP1252 on Western systems, CP1251 on Cyrillic ones, CP932 on Japanese
ones, and so on.

This way, the non-ASCII needs of most users are covered
out-of-the-box, and compatibility with Cygwin 1.5 and users'
ANSI-encoded files is ensured. Applications that still assume that a
byte and a character are the same thing work correctly (except that
they'll treat East Asian doublebyte chars as two characters, but a
different default charset won't cure that).

Filenames created on the Cygwin side show up correctly in Explorer.
Windows filenames show up correctly in Cygwin as long as they're
limited to the ANSI codepage. The ^N encoding nevertheless ensures
that UTF-16 characters outside that codepage are uniquely represented.

Beyond that, encourage maintainers to make their applications
UTF-8-capable and encourage users to choose a UTF-8 locale. Consider
adding a locale setting to setup.exe that gets written to cygwin.bat.

> The idea to use UTF-8 for filename and console operations by default was
> to get the least problems converting from UTF-16 to multibyte, so that
> readdir() always returns a valid filename.

But the ^N scheme does ensure that for any charset anyway, doesn't it?

> As for the conversion of filenames, you get the same problem on Linux if
> the filename contains non-ASCII bytes and these bytes are not a valid
> multibyte character in the current locale.

Yes, but Cygwin does actually have a big advantage here. Unlike Linux,
where the filename encoding is basically undefined, we *know* that
Windows filenames are always encoded as UTF-16. Therefore, the Cygwin
file functions do have the chance to always translate filenames
correctly into the application's locale.

And with any locale except "C" and "POSIX",  this is working very
well, due to your great work implementing all the difficult bits such
as the ^N and 0xDC?? encodings and UTF-16 surrogates (and
notwithstanding the issue with translating 0xDC??s to charsets other
than UTF-8).

>> I see two good solutions:
>> - Use the default Windows codepage for filenames, console, and
>> multibyte functions. This is what happens already if you specify a
>> locale with a language but no charset, e.g. "en". Maximum 1.5
>> compatibility.
>
> Hmm, yes, that might be an option.  Allowing the C.UTF-8 locale
> could workaround the remaining problems.

Not sure that the C.UTF-8 locale is necessary for that, but it would
be nice to have, and it's easy to implement.

>> - Use UTF-8 throughout. Full Unicode support out-of-the box.
>
> What means "throughout"?  Do you want ASCII multibyte conversion to
> use UTF-8 as well?

Yep, that was the idea, but later on I realised that it's not a good
one, because too many applications still assume that a byte and a
character are the same thing. For example, start nano in a UTF-8
locale, enter a few umlauts, and move the cursor around, and you'll
see some weird effects. Similarly, filenames with non-ASCII chars will
corrupt midnight commander's display.

Andy


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-21 21:20               ` Andy Koppe
@ 2009-09-22  5:59                 ` Lapo Luchini
  2009-09-22  6:23                   ` Lapo Luchini
  2009-09-22  6:47                   ` Andy Koppe
  2009-09-24  7:03                 ` IWAMURO Motonori
  1 sibling, 2 replies; 51+ messages in thread
From: Lapo Luchini @ 2009-09-22  5:59 UTC (permalink / raw)
  To: cygwin

[-- Attachment #1: Type: text/plain, Size: 1454 bytes --]

Andy Koppe wrote:
> This way, the non-ASCII needs of most users are covered
> out-of-the-box [...]
> Windows filenames show up correctly in Cygwin as long as they're
> limited to the ANSI codepage.

I fail to see how that is a desirable thing.
Filesystem is UTF-16, Cygwin is now Unicode-aware, but anything that
doesn't fit ANSI is thrown away for the sake of retro-compatibility of
Cygwin-1.5 which was not Unicode-aware?

As a user, the ability to show correctly formatted UTF-8 filenames is
one of the features I most appreciated in Cygwin-1.7 and reverting that
would be a serious setback... even writing that in a ChangeLog would be
a bit troublesome... "we added support for Unicode - except you can't
use for anything you couldn't already do before when it was not there,
since we're using ANSI as an intermediate format anyways"?

> For example, start nano in a UTF-8 locale, enter a few umlauts, and
> move the cursor around, and you'll see some weird effects.

IMHO a bit of "weird effects" while moving the cursor is a much less severe
problem than being unable to write the filename "like I think it is".
Using ANSI, anything over U+100 Unicode would be an ugly ^N-encoded
3-bytes-per-char ugly stuff which no human can "see" as the filename he
intended to use.

just my 2c

-- 
Lapo Luchini - http://lapo.it/

“You don't have to distrust the government to want to use cryptography.”
(Phil Zimmermann)

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 896 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-22  5:59                 ` Lapo Luchini
@ 2009-09-22  6:23                   ` Lapo Luchini
  2009-09-22  6:50                     ` Andy Koppe
  2009-09-22  6:47                   ` Andy Koppe
  1 sibling, 1 reply; 51+ messages in thread
From: Lapo Luchini @ 2009-09-22  6:23 UTC (permalink / raw)
  To: cygwin

Lapo Luchini wrote:
> I fail to see how that is a desirable thing.
> Filesystem is UTF-16, Cygwin is now Unicode-aware, but anything that
> doesn't fit ANSI is thrown away for the sake of retro-compatibility of
> Cygwin-1.5 which was not Unicode-aware?

On a second reading, I guess you meant that *ONLY for LANG=C* and leave
the current usage for LANG=xx_XX.UTF-8, is that so?

In that case, the "forced ANSI retro-compatibility" would only bite
people with a missing (or messed up) environment and that's an
ill-defined situation where nobody can really argue if he gets
suboptimal results.
Also, when used for scripts with no user interaction, being able to
save every filename and read it again in the same format is a sure pro.
If, OTOH, you're in an interactive shell and your user likely wants to "see
stuff" well, he better set a LANG env.

PS: this would work around the "LANG=C ls" file-not-found issues, but a
solution for the "LANG=xx_XX.UTF-8 ls" issue would still be needed.
(same goes for `find $dir -delete`)

-- 
Lapo Luchini - http://lapo.it/

“Two can keep a secret if one is dead.” (anonymous)


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-22  6:23                   ` Lapo Luchini
@ 2009-09-22  6:50                     ` Andy Koppe
  0 siblings, 0 replies; 51+ messages in thread
From: Andy Koppe @ 2009-09-22  6:50 UTC (permalink / raw)
  To: cygwin

2009/9/22 Lapo Luchini:
> On a second reading, I guess you meant that *ONLY for LANG=C* and leave
> the current usage for LANG=xx_XX.UTF-8, is that so?

Yes, this thread is solely about the C locale.

Andy


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-22  5:59                 ` Lapo Luchini
  2009-09-22  6:23                   ` Lapo Luchini
@ 2009-09-22  6:47                   ` Andy Koppe
  2009-09-22  8:43                     ` Lapo Luchini
  1 sibling, 1 reply; 51+ messages in thread
From: Andy Koppe @ 2009-09-22  6:47 UTC (permalink / raw)
  To: cygwin

2009/9/22 Lapo Luchini:
> Andy Koppe wrote:
>> This way, the non-ASCII needs of most users are covered
>> out-of-the-box [...]
>> Windows filenames show up correctly in Cygwin as long as they're
>> limited to the ANSI codepage.
>
> I fail to see how that is a desirable thing.
> Filesystem is UTF-16, Cygwin is now Unicode-aware, but anything that
> doesn't fit ANSI is thrown away [...]?

No, it isn't. UTF-16 filename characters that can't be represented in
the current charset are encoded by a ^N followed by the character's
UTF-8 representation.
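
The ^N scheme just described can be sketched roughly as below. This is
only an illustration of the idea, not Cygwin's actual code: the function
name encode_filename is hypothetical, and the sketch ignores surrogates
and the 0xDC?? encoding mentioned elsewhere in this thread.

```python
SO = b"\x0e"  # ^N, the ASCII "shift out" control character


def encode_filename(name, charset):
    """Encode name in charset; characters the charset lacks become ^N + UTF-8.

    Hypothetical helper for illustration only.
    """
    out = bytearray()
    for ch in name:
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            # Not representable in the target charset: escape it.
            out += SO + ch.encode("utf-8")
    return bytes(out)


# a-umlaut exists in ISO-8859-1, so it is encoded directly.
print(encode_filename("b\u00e4h", "iso-8859-1"))
# The Euro sign does not, so it becomes ^N followed by its UTF-8 bytes.
print(encode_filename("a\u20ac", "iso-8859-1"))
```

With CP1252 as the charset, the Euro sign would need no escape, since
CP1252 has it at 0x80.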

The current C locale, on the other hand, simply represents all
non-ASCII characters as UTF-8, even though the application charset is
ISO-8859-1. This means that even those characters that can be
represented in the application charset show up incorrectly. For
example, a Windows filename "bäh" turns into "bÅ¤h" in the C locale,
while it shows up correctly with explicitly set ISO-8859-1 or CP1252.

> As a user, the ability to show correctly formatted UTF-8 filenames is
> one of the features I most appreciated in Cygwin-1.7

That ability isn't going anywhere. As before, you need to set your
locale to one with a UTF-8 charset to get full UTF-8 support.

Btw, are you actually using the C locale?

Andy


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-22  6:47                   ` Andy Koppe
@ 2009-09-22  8:43                     ` Lapo Luchini
  2009-09-22 12:50                       ` Andy Koppe
  0 siblings, 1 reply; 51+ messages in thread
From: Lapo Luchini @ 2009-09-22  8:43 UTC (permalink / raw)
  To: cygwin

Andy Koppe wrote:
> No, it isn't. UTF-16 filename characters that can't be represented in
> the current charset are encoded by a ^N followed by the character's
> UTF-8 representation.

OK, right.

> For example, a Windows filename "bäh" turns into "bÅ¤h" in the C locale,
> while it shows up correctly with explicitly set ISO-8859-1 or CP1252.

Uh? Doesn't seem so to me: if I create "bäh" in WindowsExplorer, then
open up an UTF-8 mintty console I have a consistent output with both
LANG=C and LANG=it_IT.UTF-8 (of course, since right now C is UTF-8):

% LANG=C ls -l|egrep b.h
-rw-r--r-- 1 lapo None     0 Sep 22 09:53 bäh
% LANG=it_IT.UTF-8 ls -l|egrep b.h
-rw-r--r-- 1 lapo None     0 22 Sep 09:53 bäh

So I'm not sure what do you mean with 'a Windows filename "bäh" turns
into "bÅ¤h" in the C locale'... you mean that a script sees it as
62C3A468 as opposed as 62E468? Or that actual "bÅ¤h" is shown somewhere?

As "bÅ¤h" is just a representation, and it depends on the charset the
console expects (and in fact in this UTF-8-encoded message, it will be
probably represented with 62C385C2A468)... if the console is UTF-8,
what's currently shown is what I'd expect.
If OTOH we're talking what it is in raw form and not of what is shown
(i.e. about "3 bytes" vs a "4 bytes" string) well, that's a different
issue, and I'm not sure why a program should prefer a 3-byte
representations as opposed to a 4-byte one...?
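
The byte sequences quoted above can be checked with a few lines of
Python. This is just a sanity check of the hex values, assuming Python's
standard "utf-8" and "iso-8859-1" codecs; it says nothing about how
Cygwin itself does the conversion.

```python
name = "b\u00e4h"  # "b", a-umlaut, "h"

utf8_bytes = name.encode("utf-8")
latin1_bytes = name.encode("iso-8859-1")

print(utf8_bytes.hex())    # 62c3a468 (4 bytes)
print(latin1_bytes.hex())  # 62e468   (3 bytes)

# Reading the UTF-8 bytes as ISO-8859-1 yields 4 characters instead of 3,
# which is what a byte-per-character application would see.
misread = utf8_bytes.decode("iso-8859-1")
print(len(name), len(misread))  # 3 4

# The 4-character string "b", A-ring, currency sign, "h" in UTF-8:
print("b\u00c5\u00a4h".encode("utf-8").hex())  # 62c385c2a468
```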

But OTOH as far as "not caring" goes, it sure can be a nice feature to
be retro-compatible in that single case, since the behavior is not
well-defined anyways...
But again, if a script creates a filename that happens to contain
Japanese characters (or even umlauts or r-quotes/l-quotes) I would
expect to see that on the filesystem too, and not some random-looking
escaped-sequence...

> Btw, are you actually using the C locale?

Not usually, but it happens from time to time (mostly in script, or in
cases such as the monotone "make check" unit tests; one which tries to
create UTF-8 filenames and then ISO-8859-1 filenames currently fail).

-- 
Lapo Luchini - http://lapo.it/

“Endure. In enduring, grow strong.” (Dak'kon, videogame "Torment", 1999)


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-22  8:43                     ` Lapo Luchini
@ 2009-09-22 12:50                       ` Andy Koppe
  2009-09-22 16:26                         ` Lapo Luchini
  0 siblings, 1 reply; 51+ messages in thread
From: Andy Koppe @ 2009-09-22 12:50 UTC (permalink / raw)
  To: cygwin

2009/9/22 Lapo Luchini:
>> For example, a Windows filename "bäh" turns into "bÅ¤h" in the C locale,
>> while it shows up correctly with explicitly set ISO-8859-1 or CP1252.
>
> Uh? Doesn't seem so to me: if I create "bäh" in WindowsExplorer, then
> open up an UTF-8 mintty console I have a consistent output with both
> LANG=C and LANG=it_IT.UTF-8 (of course, since right now C is UTF-8):
>
> % LANG=C ls -l|egrep b.h
> -rw-r--r-- 1 lapo None     0 Sep 22 09:53 bäh
> % LANG=it_IT.UTF-8 ls -l|egrep b.h
> -rw-r--r-- 1 lapo None     0 22 Sep 09:53 bäh

You've presumably got mintty set to UTF-8, hence mintty's output
conversion turned ls's ISO-8859-1 "Å¤" (i.e. "\xC3\xA4") into "ä".


> So I'm not sure what do you mean with 'a Windows filename "bäh" turns
> into "bÅ¤h" in the C locale'... you mean that a script sees it as
> 62C3A468 as opposed as 62E468? Or that actual "bÅ¤h" is shown somewhere?

Both. For the latter, try it in the default Cygwin console, without
any locale variables set.


> But OTOH as far as "not caring" goes, it sure can be a nice feature to
> be retro-compatible in that single case

Thanks. Unfortunately the "C" locale is rather important though,
because that's what people will be using unless they go to the effort
of finding out how to set a different locale.

Andy


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-22 12:50                       ` Andy Koppe
@ 2009-09-22 16:26                         ` Lapo Luchini
  2009-09-22 16:49                           ` Mark J. Reed
  2009-09-22 22:11                           ` Thorsten Kampe
  0 siblings, 2 replies; 51+ messages in thread
From: Lapo Luchini @ 2009-09-22 16:26 UTC (permalink / raw)
  To: cygwin

Andy Koppe wrote:
> You've presumably got mintty set to UTF-8, hence mintty's output
> conversion turned ls's ISO-8859-1 "Å¤" (i.e. "\xC3\xA4") into "ä".

There never was any ISO-8859-1 "Å¤" in the first place, only one
a-umlaut entered in WindowsExplorer (in the expected way) and correctly
interpreted by a UTF8-capable terminal which is doing his job.

Nobody ever intended to write a Latin1 string with the meaning of
"A-ring + currency symbol" which has been translated by chance in a
a-umlaut...

>> you mean that a script sees it as 62C3A468 as opposed as 62E468?
>> Or that actual "bÅ¤h" is shown somewhere?
> 
> Both. For the latter, try it in the default Cygwin console, without
> any locale variables set.

OK, if you consider "what is shown in cmd.exe" as "the real stuff" then
I agree with you.

But cmd.exe isn't even capable of printing the Euro sign (no cygwin
involved, I mean the plain Windows Prompt), I guess there's no hope to
ever seeing in there anything but a very limited output...
(which surprises me a bit: Euro sign is present in CP1252)

I agree with you that the "default console" installed by the default
installation SHOULD be able to show the more common accents at the very
least (àèéìòù in Italy, umlauts and ß in Germany and so on), but
wouldn't it be possible to offer the user *something better* than plain
limited cmd.exe, in the default installation?

-- 
Lapo Luchini - http://lapo.it/

“There is no reason anyone would want a computer in their home.” (Ken
Olson, founder of DEC, 1977)


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-22 16:26                         ` Lapo Luchini
@ 2009-09-22 16:49                           ` Mark J. Reed
  2009-09-22 17:04                             ` Lapo Luchini
  2009-09-22 22:11                           ` Thorsten Kampe
  1 sibling, 1 reply; 51+ messages in thread
From: Mark J. Reed @ 2009-09-22 16:49 UTC (permalink / raw)
  To: cygwin

On Tue, Sep 22, 2009 at 12:26 PM, Lapo Luchini wrote:
> There never was any ISO-8859-1 "Å¤" in the first place, only one
> a-umlaut entered in WindowsExplorer (in the expected way) and correctly
> interpreted by a UTF8-capable terminal which is doing his job.
>
> Nobody ever intended to write a Latin1 string with the meaning of
> "A-ring + currency symbol" which has been translated by chance in a
> a-umlaut...

Yes, but it's working because you (1) lied about your locale (using C
when your terminal is set to UTF-8) and (2) happen to have your
terminal set to UTF-8, which is how Cygwin happens to be encoding the
character.   It's a big accident and stops working if you were
actually using a non-UTF-8 terminal and locale (hopefully matching
ones).

-- 
Mark J. Reed <markjreed@gmail.com>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-22 16:49                           ` Mark J. Reed
@ 2009-09-22 17:04                             ` Lapo Luchini
  0 siblings, 0 replies; 51+ messages in thread
From: Lapo Luchini @ 2009-09-22 17:04 UTC (permalink / raw)
  To: cygwin

Mark J. Reed wrote:
> Yes, but it's working because you (1) lied about your locale (using C
> when your terminal is set to UTF-8) and (2) happen to have your
> terminal set to UTF-8, which is how Cygwin happens to be encoding the
> character.   It's a big accident and stops working if you were
> actually using a non-UTF-8 terminal and locale (hopefully matching
> ones).

I'm very sorry, but I still can't see your point... =(

It's true, "by accident" my terminal is using the more general
ASCII-compatible charset possible (that is, UTF-8) and Cygwin is
currently using that as a default as well, ok.
So LANG=C works essentially because my terminal uses THE SAME charset as
Cygwin uses by default (and not specifically because that's UTF-8).

But OTOH if LANG=C used CP1252 it would only work only if my terminal
"by accident" was using the very same CP1252 and would stop working if I
were using a non-CP1252 terminal and matching locale.
How is this a fundamentally different case?

In the first case I have to match my terminal, but I can see *any*
character really and never get any "surprise".
In the second case I can use default cmd.exe, but I get a crippled
output in many possible usecases.

The main reason I see for using CP1252 (or anything that's the default
CP, CP1252 is just an example) is that cygwin-in-cmd.exe would show the
*same* crippledness shown by the default native WindowsPrompt, so even
if very limited, the user would get the least surprise. And as far a
traffic on cygwin@cygwin.com goes, I see that's a VERY valid issue.

-- 
Lapo Luchini - http://lapo.it/

“Premature optimisation is the root of all evil in programming.” (C. A.
R. Hoare)


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-22 16:26                         ` Lapo Luchini
  2009-09-22 16:49                           ` Mark J. Reed
@ 2009-09-22 22:11                           ` Thorsten Kampe
  2009-09-23  5:12                             ` Lapo Luchini
  1 sibling, 1 reply; 51+ messages in thread
From: Thorsten Kampe @ 2009-09-22 22:11 UTC (permalink / raw)
  To: cygwin

* Lapo Luchini (Tue, 22 Sep 2009 18:26:32 +0200)
> But cmd.exe isn't even capable of printing the Euro sign (no cygwin
> involved, I mean the plain Windows Prompt), I guess there's no hope to
> ever seeing in there anything but a very limited output... (which
> surprises me a bit: Euro sign is present in CP1252)

Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

thorsten@HOMBRE[C:\Users\thorsten]> echo €
€

thorsten@HOMBRE[C:\Users\thorsten]> chcp
Active code page: 437



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-22 22:11                           ` Thorsten Kampe
@ 2009-09-23  5:12                             ` Lapo Luchini
  2009-09-23  9:04                               ` Thorsten Kampe
  0 siblings, 1 reply; 51+ messages in thread
From: Lapo Luchini @ 2009-09-23  5:12 UTC (permalink / raw)
  To: cygwin

Thorsten Kampe wrote:
> * Lapo Luchini (Tue, 22 Sep 2009 18:26:32 +0200)
>> But cmd.exe isn't even capable of printing the Euro sign (no cygwin
>> involved, I mean the plain Windows Prompt), I guess there's no hope to
>> ever seeing in there anything but a very limited output... (which
>> surprises me a bit: Euro sign is present in CP1252)
> 
> Microsoft Windows [Version 6.1.7600]
> Copyright (c) 2009 Microsoft Corporation.  All rights reserved.
> 
> thorsten@HOMBRE[C:\Users\thorsten]> echo €
> €
> 
> thorsten@HOMBRE[C:\Users\thorsten]> chcp
> Active code page: 437

OK, so it *can* display it.
But why it does not, when showing a filename?
(which was what I did in the previous message to check)

http://img223.imageshack.us/img223/7821/winprompt.png

I created a file "aà€私.txt" in WinExplorer, and then:

C:\>dir *.txt
 Volume in drive C is Primary
 Volume Serial Number is 8437-B5FC

 Directory of C:\

23/09/2009  06.58                 0 aà??.txt
               1 File(s)              0 bytes


Errr.... actually I can't even reproduce your example.
If I write "echo €" at the prompt I get this:

C:\>echo ?
?

C:\>chcp
Active code page: 437

-- 
Lapo Luchini - http://lapo.it/

“I think, therefore I am… I think.” (Nordom, videogame "Torment", 1999)



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-23  5:12                             ` Lapo Luchini
@ 2009-09-23  9:04                               ` Thorsten Kampe
  2009-09-23 10:48                                 ` Lapo Luchini
  0 siblings, 1 reply; 51+ messages in thread
From: Thorsten Kampe @ 2009-09-23  9:04 UTC (permalink / raw)
  To: cygwin

* Lapo Luchini (Wed, 23 Sep 2009 07:11:48 +0200)
> 
> Thorsten Kampe wrote:
> > * Lapo Luchini (Tue, 22 Sep 2009 18:26:32 +0200)
> >> But cmd.exe isn't even capable of printing the Euro sign (no cygwin
> >> involved, I mean the plain Windows Prompt), I guess there's no hope to
> >> ever seeing in there anything but a very limited output... (which
> >> surprises me a bit: Euro sign is present in CP1252)
> > 
> > Microsoft Windows [Version 6.1.7600]
> > Copyright (c) 2009 Microsoft Corporation.  All rights reserved.
> > 
> > thorsten@HOMBRE[C:\Users\thorsten]> echo €
> > €
> > 
> > thorsten@HOMBRE[C:\Users\thorsten]> chcp
> > Active code page: 437
> 
> OK, so it *can* display it.
> But why it does not, when showing a filename?
> (which was what I did in the previous message to check)
> 
> http://img223.imageshack.us/img223/7821/winprompt.png
> 
> I created a file "aà€?.txt" in WinExplorer, and then:
> 
> C:\>dir *.txt
>  Volume in drive C is Primary
>  Volume Serial Number is 8437-B5FC
> 
>  Directory of C:\
> 
> 23/09/2009  06.58                 0 aà??.txt
>                1 File(s)              0 bytes

Works for me, too. Maybe not only the codepage but also the GUI locale 
settings are involved. This is on Windows 7.

Thorsten



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-23  9:04                               ` Thorsten Kampe
@ 2009-09-23 10:48                                 ` Lapo Luchini
  2009-09-23 12:04                                   ` Andy Koppe
  2009-09-24  7:58                                   ` Thorsten Kampe
  0 siblings, 2 replies; 51+ messages in thread
From: Lapo Luchini @ 2009-09-23 10:48 UTC (permalink / raw)
  To: cygwin

Thorsten Kampe wrote:
>> I created a file "aà€私.txt" in WinExplorer, and then:
>>
>> 23/09/2009  06.58                 0 aà??.txt
>>                1 File(s)              0 bytes
> 
> Works for me, too. Maybe not only the codepage but also the GUI locale 
> settings are involved. This is on Windows 7.

Oh, that's interesting, it may be they improved the console in Win7?
Did you see only the euro or also the Japanese character?

Uh, nope. I still get "aà??.txt" in my Win7 virtual machine... both with
chcp 850 and 437. Did you do anything regarding the console settings?
(not that it does seem to have any)

-- 
Lapo Luchini - http://lapo.it/

“The best way to predict the future is to implement it.” (David
Heinemeier Hansson)



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-23 10:48                                 ` Lapo Luchini
@ 2009-09-23 12:04                                   ` Andy Koppe
  2009-09-23 15:16                                     ` Mark J. Reed
  2009-09-24  7:58                                   ` Thorsten Kampe
  1 sibling, 1 reply; 51+ messages in thread
From: Andy Koppe @ 2009-09-23 12:04 UTC (permalink / raw)
  To: cygwin

2009/9/23 Lapo Luchini:
>> Works for me, too. Maybe not only the codepage but also the GUI locale
>> settings are involved. This is on Windows 7.
>
> Oh, that's interesting, it may be they improved the console in Win7?
> Did you see only the euro or also the Japanese character?
>
> Uh, nope. I still get "aà??.txt" in my Win7 virtual machine... both with
> chcp 850 and 437. Did you do anything regarding the console settings?
> (not that it does seem to have any)

I think it depends on the font that's selected in the console
properties. With the "Raster Font", only the "OEM" codepage (i.e. 437,
850, ...) is supported. With "Lucida Console" or "Consolas", however,
it seems to switch to UTF-16 mode.

(Well, more UCS2 than UTF-16 actually, since surrogates aren't supported.)
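
The UCS-2 versus UTF-16 distinction can be illustrated with a short
Python check. This is only a sketch: utf16_units is a hypothetical
helper name, and the musical G clef (U+1D11E) is just an arbitrary
example of a character outside the Basic Multilingual Plane.

```python
def utf16_units(ch):
    """Number of 16-bit code units needed to encode ch in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


# The Euro sign (U+20AC) fits in one 16-bit unit, so UCS-2 can hold it.
print(utf16_units("\u20ac"))       # 1

# U+1D11E needs a surrogate pair, which plain UCS-2 cannot express.
print(utf16_units("\U0001D11E"))   # 2
```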

Andy


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-23 12:04                                   ` Andy Koppe
@ 2009-09-23 15:16                                     ` Mark J. Reed
  0 siblings, 0 replies; 51+ messages in thread
From: Mark J. Reed @ 2009-09-23 15:16 UTC (permalink / raw)
  To: cygwin

If I switch the console font to Lucida, I can see the Euro sign, too
(even on XP Pro).  But mixing and matching with Cygwin doesn't work
well:

H:\>echo € | c:\cygwin\bin\od -t x1
0000000 3f 20 0d 0a

(the Cygwin process saw the Euro sign as a question mark)

but

H:\>c:\cygwin\bin\echo € | c:\cygwin\bin\od -t x1
0000000 e2 82 ac 0a

which is the proper UTF-8 encoding of the Euro sign.  So the output of
a Windows process coming in through a pipe is treated differently than
input from the Windows console.
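
The two od results can be reproduced in Python. This is a sketch of one
plausible reading, namely that the Windows-side echo went through OEM
codepage 437, which has no Euro sign; the codec names are Python's, and
whether Windows actually takes this path is an assumption.

```python
euro = "\u20ac"  # the Euro sign

# UTF-8 encoding, matching the "e2 82 ac" seen from Cygwin's echo.
print(euro.encode("utf-8").hex(" "))                  # e2 82 ac

# CP1252 does have the Euro sign, as a single byte.
print(euro.encode("cp1252").hex())                    # 80

# CP437 does not; a lossy conversion substitutes "?" (0x3f).
print(euro.encode("cp437", errors="replace").hex())   # 3f
```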


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-23 10:48                                 ` Lapo Luchini
  2009-09-23 12:04                                   ` Andy Koppe
@ 2009-09-24  7:58                                   ` Thorsten Kampe
  1 sibling, 0 replies; 51+ messages in thread
From: Thorsten Kampe @ 2009-09-24  7:58 UTC (permalink / raw)
  To: cygwin

* Lapo Luchini (Wed, 23 Sep 2009 12:48:03 +0200)
> Thorsten Kampe wrote:
> >> I created a file "aà€?.txt" in WinExplorer, and then:
> >>
> >> 23/09/2009  06.58                 0 aà??.txt
> >>                1 File(s)              0 bytes
> > 
> > Works for me, too. Maybe not only the codepage but also the GUI locale 
> > settings are involved. This is on Windows 7.
> 
> Oh, that's interesting, it may be they improved the console in Win7?
> Did you see only the euro or also the Japanese character?

The Japanese character displayed as a question mark in my newsreader - 
so I had to skip that character when creating a file.
 
> Uh, nope. I still get "aà??.txt" in my Win7 virtual machine... both with
> chcp 850 and 437. Did you do anything regarding the console settings?
> (not that it does seem to have any)

The only thing I changed was setting the console font to Dejavu Sans 
Mono. I created a text file with Japanese characters from Wikipedia. 
These displayed in Windows Explorer but not in Cmd.exe (empty squares - 
but no question marks).

Thorsten



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-21 21:20               ` Andy Koppe
  2009-09-22  5:59                 ` Lapo Luchini
@ 2009-09-24  7:03                 ` IWAMURO Motonori
  2009-09-24  7:34                   ` Corinna Vinschen
  1 sibling, 1 reply; 51+ messages in thread
From: IWAMURO Motonori @ 2009-09-24  7:03 UTC (permalink / raw)
  To: cygwin

2009/9/22 Andy Koppe <andy.koppe@gmail.com>:
> Let's use the Windows "ANSI" codepage as the character set for the C
> locale, for both the conversion functions and filenames. This means
> CP1252 on Western systems, CP1251 on Cyrillic ones, CP932 on Japanese
> ones, and so on.

I oppose the approach (the ANSI codepage is used at C locale) because
CP932 (the codepage for Japanese) is hostile to the UNIX-like tools.

The reason is that the CP932 format contains a lot of meta characters
as follows.

  single character of CP932:
/[\x00-\x7F\xA0-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/

This has a ruined influence to the tools that don't see locale.
-- 
IWAMURO Motonori <http://vmi.jp/>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-24  7:03                 ` IWAMURO Motonori
@ 2009-09-24  7:34                   ` Corinna Vinschen
  2009-09-24  9:39                     ` IWAMURO Motonori
  0 siblings, 1 reply; 51+ messages in thread
From: Corinna Vinschen @ 2009-09-24  7:34 UTC (permalink / raw)
  To: cygwin

On Sep 24 16:03, IWAMURO Motonori wrote:
> 2009/9/22 Andy Koppe <andy.koppe@gmail.com>:
> > Let's use the Windows "ANSI" codepage as the character set for the C
> > locale, for both the conversion functions and filenames. This means
> > CP1252 on Western systems, CP1251 on Cyrillic ones, CP932 on Japanese
> > ones, and so on.
> 
> I oppose the approach (the ANSI codepage is used at C locale) because
> CP932 (the codepage for Japanese) is hostile to the UNIX-like tools.
> 
> The reason is that the CP932 format contains a lot of meta characters
> as follows.
> 
>   single character of CP932:
> /[\x00-\x7F\xA0-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/

I don't understand.  Are you saying that the single character in CP932
consists of 12 bytes?  As far as I can see, CP932 is S-JIS, which
is a just a simple double byte character set.  What am I missing.

> This has a ruined influence to the tools that don't see locale.

Can you please try to explain the problem in a bit more detail for
those of us not fluent in eastern asian languages?  What do you
mean with "hostile" and "ruined influence"?


Thanks,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-24  7:34                   ` Corinna Vinschen
@ 2009-09-24  9:39                     ` IWAMURO Motonori
  2009-09-24  9:57                       ` Corinna Vinschen
  0 siblings, 1 reply; 51+ messages in thread
From: IWAMURO Motonori @ 2009-09-24  9:39 UTC (permalink / raw)
  To: cygwin

2009/9/24 Corinna Vinschen <corinna-cygwin@cygwin.com>:
> On Sep 24 16:03, IWAMURO Motonori wrote:
>> 2009/9/22 Andy Koppe <andy.koppe@gmail.com>:
>> > Let's use the Windows "ANSI" codepage as the character set for the C
>> > locale, for both the conversion functions and filenames. This means
>> > CP1252 on Western systems, CP1251 on Cyrillic ones, CP932 on Japanese
>> > ones, and so on.
>>
>> I oppose the approach (the ANSI codepage is used at C locale) because
>> CP932 (the codepage for Japanese) is hostile to the UNIX-like tools.
>>
>> The reason is that the CP932 format contains a lot of meta characters
>> as follows.
>>
>>   single character of CP932:
>> /[\x00-\x7F\xA0-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/
>
> I don't understand.  Are you saying that the single character in CP932
> consists of 12 bytes?  As far as I can see, CP932 is S-JIS, which
> is a just a simple double byte character set.  What am I missing.

- CP932 (Shift_JIS) has 1-byte characters and 2-byte characters.

- The range of a 1-byte character is 0x00-0x7F and 0xA0-0xDF.

- The range of the first byte of a 2-byte character is 0x81-0x9F and
  0xE0-0xFC.

- The range of the second byte of a 2-byte character is 0x40-0x7E and
  0x80-0xFC.  This includes "[", "\", "]", "^", "`", "{", "|", "}".

Many problems in tools that ignore the locale and process escaped
strings, globs or regexps are caused by the last fact:

- Files or directories cannot be opened.
- Filenames get corrupted.
- Files get lost.

For example:

Case 1: The CP932 byte sequence of "項目表.xls" is 8D 80 96 DA 95 *5C*
(=='\') 2E 78 6C 73. When a tool that ignores the locale processes this
string as one containing backslash escapes, the 0x5C byte is consumed
as an escape character and disappears.

Case 2: When I use the regexp /スポット/, I expect it to match strings
that contain "スポット". But tools that ignore the locale treat it as
/ス\x83|ット/, because the byte sequence of "スポット" is 83 58 83
*7C* (=='|') 83 62 83 67. As a result, unexpected strings are
matched.

Case 3: When I use the glob "データ0[0-9].dat", it is treated as
"デ\x81[\x83^0[0-9].dat". As a result, the expected files are not
matched.
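
The three cases can be reproduced with a short Python sketch. It is not
part of the original message: the names METACHARS and
find_ascii_trail_bytes are made up for illustration, and it assumes
Python 3.8 or later with the standard "cp932" codec. It only reports
which 2-byte characters carry an ASCII metacharacter as their second
byte, which is what a tool that ignores the locale would trip over.

```python
# Illustrative sketch only; METACHARS and find_ascii_trail_bytes are
# hypothetical names, not taken from Cygwin or newlib.
METACHARS = set(b"[\\]^`{|}")


def find_ascii_trail_bytes(text):
    """Return (character, trail byte as str) for every 2-byte CP932
    character whose second byte is an ASCII metacharacter."""
    hits = []
    for ch in text:
        encoded = ch.encode("cp932")
        if len(encoded) == 2 and encoded[1] in METACHARS:
            hits.append((ch, chr(encoded[1])))
    return hits


for name in ["項目表.xls", "スポット", "データ0[0-9].dat"]:
    # Show the raw bytes next to the characters that hide a metacharacter.
    print(name, name.encode("cp932").hex(" "), find_ascii_trail_bytes(name))
```

The literal "[0-9]" in the last name is not reported, because those are
genuine 1-byte characters; only the hidden trail bytes are listed.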
-- 
IWAMURO Motnori <http://vmi.jp/>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-24  9:39                     ` IWAMURO Motonori
@ 2009-09-24  9:57                       ` Corinna Vinschen
  2009-09-24 10:00                         ` Corinna Vinschen
  2009-09-27  3:44                         ` IWAMURO Motonori
  0 siblings, 2 replies; 51+ messages in thread
From: Corinna Vinschen @ 2009-09-24  9:57 UTC (permalink / raw)
  To: cygwin

On Sep 24 18:37, IWAMURO Motonori wrote:
> 2009/9/24 Corinna Vinschen <corinna-cygwin@cygwin.com>:
> > On Sep 24 16:03, IWAMURO Motonori wrote:
> >> 2009/9/22 Andy Koppe <andy.koppe@gmail.com>:
> >> > Let's use the Windows "ANSI" codepage as the character set for the C
> >> > locale, for both the conversion functions and filenames. This means
> >> > CP1252 on Western systems, CP1251 on Cyrillic ones, CP932 on Japanese
> >> > ones, and so on.
> >>
> >> I oppose the approach (the ANSI codepage is used at C locale) because
> >> CP932 (the codepage for Japanese) is hostile to the UNIX-like tools.
> >>
> >> The reason is that the CP932 format contains a lot of meta characters
> >> as follows.
> >>
> >>   single character of CP932:
> >> /[\x00-\x7F\xA0-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/
> >
> > I don't understand.  Are you saying that the single character in CP932
> > consists of 12 bytes?  As far as I can see, CP932 is S-JIS, which
> > is a just a simple double byte character set.  What am I missing.
> 
> - CP932 (Shift_JIS) has 1byte character and 2bytes character.
> 
> - The range of 1byte character is 0x00-0x7F and 0xA0-0xDF.
> 
> - The range of first byte of 2byte character is 0x80-0x9F and 0xE0-0xFC.
> 
> - The range of second byte of 2byte character is 0x40-7E and 0x80-0xFC.
>   This includes "[", "\", "]", "^", "`", "{", "|", "}".

Ok, thanks for your examples, they show neatly where the problem is.

As you might know, the codepage 20932 (EUC-JP) is also not the same
as the UNIX EUC_JP implementation.  The JIS-X-0212 three byte codes
are folded into two-byte sequences as described in a comment in
strfuncs.cc:

  /* Unfortunately, the Windows eucJP codepage 20932 is not really 100%
     compatible to eucJP.  It's a cute approximation which makes it a
     doublebyte codepage.
     The JIS-X-0212 three byte codes (0x8f,0xa1-0xfe,0xa1-0xfe) are folded
     into two byte codes as follows: The 0x8f is stripped, the next byte is
     taken as is, the third byte is mapped into the lower 7-bit area by
     masking it with 0x7f.  So, for instance, the eucJP code 0x8f,0xdd,0xf8
     becomes 0xdd,0x78 in CP 20932.

     To be really eucJP compatible, we have to map the JIS-X-0212 characters
     between CP 20932 and eucJP ourselves. */
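
The folding described in that comment can be sketched in a few lines of
Python. This is only an illustration of the rule as worded in the
comment, not the code in strfuncs.cc; the function names are
hypothetical, and the reverse direction assumes that a second byte in
the 7-bit range 0x21-0x7e marks a folded JIS-X-0212 character.

```python
# Hypothetical helpers illustrating the rule from the comment above;
# they are not the functions used by Cygwin.

def fold_eucjp_to_cp20932(seq):
    """Fold a three-byte eucJP JIS-X-0212 code into two bytes:
    drop 0x8f, keep the second byte, mask the third with 0x7f."""
    if len(seq) != 3 or seq[0] != 0x8F:
        raise ValueError("expected a three-byte code starting with 0x8f")
    if not (0xA1 <= seq[1] <= 0xFE and 0xA1 <= seq[2] <= 0xFE):
        raise ValueError("second and third byte must be in 0xa1-0xfe")
    return bytes([seq[1], seq[2] & 0x7F])


def unfold_cp20932_to_eucjp(seq):
    """Assumed inverse: restore the 0x8f prefix and set the high bit
    of the second byte again."""
    if len(seq) != 2 or not (0xA1 <= seq[0] <= 0xFE and 0x21 <= seq[1] <= 0x7E):
        raise ValueError("not a folded JIS-X-0212 code")
    return bytes([0x8F, seq[0], seq[1] | 0x80])


folded = fold_eucjp_to_cp20932(bytes([0x8F, 0xDD, 0xF8]))
print(folded.hex(" "))
print(unfold_cp20932_to_eucjp(folded).hex(" "))
```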

My question is this:  Is the S-JIS implementation on UNIX systems
also using a different implementation to avoid using characters
from the ASCII range?  If so, can't we change the __sjis_wctomb
and __sjis_mbtowc functions in the same manner as the __eucjp_wctomb
and __eucjp_mbtowc functions to get a safer implementation?


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-24  9:57                       ` Corinna Vinschen
@ 2009-09-24 10:00                         ` Corinna Vinschen
  2009-09-26  9:15                           ` Corinna Vinschen
  2009-09-27  3:44                         ` IWAMURO Motonori
  1 sibling, 1 reply; 51+ messages in thread
From: Corinna Vinschen @ 2009-09-24 10:00 UTC (permalink / raw)
  To: cygwin

On Sep 24 11:57, Corinna Vinschen wrote:
> On Sep 24 18:37, IWAMURO Motonori wrote:
> > - CP932 (Shift_JIS) has 1byte character and 2bytes character.
> > 
> > - The range of 1byte character is 0x00-0x7F and 0xA0-0xDF.
> > 
> > - The range of first byte of 2byte character is 0x80-0x9F and 0xE0-0xFC.
> > 
> > - The range of second byte of 2byte character is 0x40-7E and 0x80-0xFC.
> >   This includes "[", "\", "]", "^", "`", "{", "|", "}".
> 
> Ok, thanks for your examples, they show neatly where the problem is.
> 
> As you might know, the codepage 20932 (EUC-JP) is also not the same
> as the UNIX EUC_JP implementation.  The JIS-X-0212 three byte codes
> are folded into two-byte sequences as described in a comment in
> strfuncs.cc:
> 
>   /* Unfortunately, the Windows eucJP codepage 20932 is not really 100%
>      compatible to eucJP.  It's a cute approximation which makes it a
>      doublebyte codepage.
>      The JIS-X-0212 three byte codes (0x8f,0xa1-0xfe,0xa1-0xfe) are folded
>      into two byte codes as follows: The 0x8f is stripped, the next byte is
>      taken as is, the third byte is mapped into the lower 7-bit area by
>      masking it with 0x7f.  So, for instance, the eucJP code 0x8f,0xdd,0xf8
>      becomes 0xdd,0x78 in CP 20932.
> 
>      To be really eucJP compatible, we have to map the JIS-X-0212 characters
>      between CP 20932 and eucJP ourselves. */
> 
> My question is this:  Is the S-JIS implementation on UNIX systems
> also using a different implementation to avoid using characters
> from the ASCII range?  If so, can't we change the __sjis_wctomb
> and __sjis_mbtowc functions in the same manner as the __eucjp_wctomb
> and __eucjp_mbtowc functions to get a safer implementation?

Hmm, as far as I can see from Wikipedia, S-JIS is simply defined
that way.  Bah.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-24 10:00                         ` Corinna Vinschen
@ 2009-09-26  9:15                           ` Corinna Vinschen
  2009-09-27  3:21                             ` IWAMURO Motonori
  0 siblings, 1 reply; 51+ messages in thread
From: Corinna Vinschen @ 2009-09-26  9:15 UTC (permalink / raw)
  To: cygwin

On Sep 24 12:00, Corinna Vinschen wrote:
> On Sep 24 11:57, Corinna Vinschen wrote:
> > On Sep 24 18:37, IWAMURO Motonori wrote:
> > > - CP932 (Shift_JIS) has 1byte character and 2bytes character.
> > > 
> > > - The range of 1byte character is 0x00-0x7F and 0xA0-0xDF.
> > > 
> > > - The range of first byte of 2byte character is 0x80-0x9F and 0xE0-0xFC.
> > > 
> > > - The range of second byte of 2byte character is 0x40-7E and 0x80-0xFC.
> > >   This includes "[", "\", "]", "^", "`", "{", "|", "}".
> > 
> > Ok, thanks for your examples, they show neatly where the problem is.
> > 
> > As you might know, the codepage 20932 (EUC-JP) is also not the same
> > as the UNIX EUC_JP implementation.  The JIS-X-0212 three byte codes
> > are folded into two-byte sequences as described in a comment in
> > strfuncs.cc:
> > 
> >   /* Unfortunately, the Windows eucJP codepage 20932 is not really 100%
> >      compatible to eucJP.  It's a cute approximation which makes it a
> >      doublebyte codepage.
> >      The JIS-X-0212 three byte codes (0x8f,0xa1-0xfe,0xa1-0xfe) are folded
> >      into two byte codes as follows: The 0x8f is stripped, the next byte is
> >      taken as is, the third byte is mapped into the lower 7-bit area by
> >      masking it with 0x7f.  So, for instance, the eucJP code 0x8f,0xdd,0xf8
> >      becomes 0xdd,0x78 in CP 20932.
> > 
> >      To be really eucJP compatible, we have to map the JIS-X-0212 characters
> >      between CP 20932 and eucJP ourselves. */
> > 
> > My question is this:  Is the S-JIS implementation on UNIX systems
> > also using a different implementation to avoid using characters
> > from the ASCII range?  If so, can't we change the __sjis_wctomb
> > and __sjis_mbtowc functions in the same manner as the __eucjp_wctomb
> > and __eucjp_mbtowc functions to get a safer implementation?
> 
> Hmm, as far as I can see from wikipedia, S-JIS is simply defined
> that way.  Bah.

This leads me to another question to you and other users working with
Japanese systems.

As far as I understood this, the default ANSI and OEM codepage on
Japanese Windows systems is 932/SJIS, right?  And your examples show
nicely how bad codepage 932/SJIS is from a usability perspective.

Right now, if you specify a locale like "ja_JP" on your machine, that
is, without specifying the charset, Cygwin will fetch the ANSI codepage
from Windows and use that as your charset.  That means, LANG="ja_JP"
will result in using the charset SJIS.

The question is this:  Wouldn't it be better from a usability perspective
to avoid SJIS in this case, and to switch Cygwin to EUCJP instead?

So, for a Japanese user:

  LANG="C"          -> UTF-8
  LANG="ja"         -> EUCJP
  LANG="ja_JP"      -> EUCJP
  LANG="ja_JP.SJIS" -> SJIS

That would mean Cygwin actually uses SJIS *only* when SJIS is specified
explicitly.

Is that a feasible approach?


Thanks,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-26  9:15                           ` Corinna Vinschen
@ 2009-09-27  3:21                             ` IWAMURO Motonori
  2009-09-28 16:03                               ` IWAMURO Motonori
  0 siblings, 1 reply; 51+ messages in thread
From: IWAMURO Motonori @ 2009-09-27  3:21 UTC (permalink / raw)
  To: cygwin

Hi.

> the default ANSI and OEM codepage on Japanese Windows systems is
> 932/SJIS, right?

Yes.

> LANG="C" -> UTF-8
(snip)
> LANG="ja_JP.SJIS" -> SJIS

It's good.

> LANG="ja" -> EUCJP
> LANG="ja_JP" -> EUCJP

Hmmm, it is a difficult problem.

I think selecting UTF-8 is good because eucJP is legacy.

But for interoperability with other UNIX-like systems (*), I don't
think selecting UTF-8 is good.

* Solaris: ja, ja_JP -> eucJP
* Linux (Debian): ja -> Unknown, ja_JP -> eucJP

I need to think more...


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-27  3:21                             ` IWAMURO Motonori
@ 2009-09-28 16:03                               ` IWAMURO Motonori
  2009-09-28 16:16                                 ` Corinna Vinschen
  0 siblings, 1 reply; 51+ messages in thread
From: IWAMURO Motonori @ 2009-09-28 16:03 UTC (permalink / raw)
  To: cygwin

2009/9/27 IWAMURO Motonori <deenheart@gmail.com>:
>> LANG="ja" -> EUCJP
>> LANG="ja_JP" -> EUCJP
>
> Hmmm, It is a difficult problem.
>
> I think selecting UTF-8 is good because eucJP is legacy.
>
> But, for interoperability with other UNIX-like system(*), I don't
> think selecting UTF-8 is good.
>
> * Solaris: ja, ja_JP -> eucJP
> * Linux (Debian): ja -> Unknown, ja_JP -> eucJP
>
> I need to think more...

After hearing other Japanese users' opinions, my conclusion is as
follows:

LANG=ja -> UTF-8
LANG=ja_JP -> UTF-8

This is because we can specify "eucJP" explicitly when we need it.
-- 
IWAMURO Motnori <http://vmi.jp/>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-28 16:03                               ` IWAMURO Motonori
@ 2009-09-28 16:16                                 ` Corinna Vinschen
  2009-09-29  0:23                                   ` wynfield
                                                     ` (4 more replies)
  0 siblings, 5 replies; 51+ messages in thread
From: Corinna Vinschen @ 2009-09-28 16:16 UTC (permalink / raw)
  To: cygwin

On Sep 29 01:03, IWAMURO Motonori wrote:
> 2009/9/27 IWAMURO Motonori <deenheart@gmail.com>:
> >> LANG="ja" -> EUCJP
> >> LANG="ja_JP" -> EUCJP
> >
> > Hmmm, It is a difficult problem.
> >
> > I think selecting UTF-8 is good because eucJP is legacy.
> >
> > But, for interoperability with other UNIX-like system(*), I don't
> > think selecting UTF-8 is good.
> >
> > * Solaris: ja, ja_JP -> eucJP
> > * Linux (Debian): ja -> Unknown, ja_JP -> eucJP
> >
> > I need to think more...
> 
> My conclusion is as follows as a result of hearing other Japanese
> people's opinion:
> 
> LANG=ja -> UTF-8
> LANG=ja_JP -> UTF-8
> 
> Because, we specify "eucJP" explicitly when we need it.

Hmm.

That's an interesting point.

In theory this sounds like a good idea to be used for all locales which
don't specify the charset explicitly, because that results in using the
same charset, "UTF-8", for all such locales.  "C", "ja" or "en_US"
would all default to UTF-8.

The downside is that a user who needs to work under the default ANSI
codepage for some reason has to know the name of the default ANSI
codepage.  Right now any user who needs the default ANSI codepage can
simply set LANG to some language code and go ahead, without having to
know the number.  With your solution, that wouldn't be possible anymore
and the user would have to figure out the default ANSI codepage on the
system before being able to use it.

I honestly don't know if that's really a problem, though.  But I don't
want to take that feature away for now.  Anybody having a strong opinion
on this issue?

Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-28 16:16                                 ` Corinna Vinschen
@ 2009-09-29  0:23                                   ` wynfield
  2009-09-29  4:04                                     ` Andy Koppe
  2009-09-29 13:55                                     ` IWAMURO Motonori
  2009-09-29  4:27                                   ` Andy Koppe
                                                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 51+ messages in thread
From: wynfield @ 2009-09-29  0:23 UTC (permalink / raw)
  To: cygwin


Though I'm not up on the details involved here, I will give you
feedback on the request for information about the locale issue, because
it affects how quickly and easily Japanese-language documents can be
accessed and used.

Either of the following two values would be acceptable, but I feel that
the UTF-8 charset is being adopted more and more.
        LANG=ja -> UTF-8
     LANG=ja_JP -> UTF-8

The following would also be suitable, if possible:
        LANG=ja -> iso-2022-jp
     LANG=ja_JP -> iso-2022-jp


Regards:


> On Sep 29 01:03, IWAMURO Motonori wrote:
> > >  
> > > ..... <snipped>
> > >
> > > I think selecting UTF-8 is good because eucJP is legacy.
> > 
>> <and>
> >
> > My conclusion is as follows as a result of hearing other Japanese
> > people's opinion:
> > 
> > LANG=ja -> UTF-8
> > LANG=ja_JP -> UTF-8
> > 
> > Because, we specify "eucJP" explicitly when we need it.

 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-29  0:23                                   ` wynfield
@ 2009-09-29  4:04                                     ` Andy Koppe
  2009-09-29 13:55                                     ` IWAMURO Motonori
  1 sibling, 0 replies; 51+ messages in thread
From: Andy Koppe @ 2009-09-29  4:04 UTC (permalink / raw)
  To: cygwin

2009/9/29 wynfield:
>
> Though I'm not an up on the details involved here, I will give
> you feedback to the request for information about the locale issue, because it affects the quick accessability and usage of Japanese language documents.
>
> Either of the two follow values would be acceptable, but I feel that the UTF-8 charset is becoming more and more adopted.
>        LANG=ja -> UTF-8
>     LANG=ja_JP -> UTF-8
>
> Also the following be suitable if possible..
>        LANG=ja -> iso-2022-jp
>     LANG=ja_JP -> iso-2022-jp

Thanks for the feedback!

Now, Windows knows three different variants of iso-2022-jp. Do you
know which one's the preferred one?

CP50220: ISO 2022 Japanese with no halfwidth Katakana; Japanese (JIS)
CP50221: ISO 2022 Japanese with halfwidth Katakana; Japanese
(JIS-Allow 1 byte Kana)
CP50222: ISO 2022 Japanese JIS X 0201-1989; Japanese (JIS-Allow 1 byte
Kana - SO/SI)

Also, Wikipedia has this to say:

"Since ISO 2022 is a stateful encoding, a program can not jump in the
middle of a block of text to search, insert or delete characters. This
makes manipulation of the text very cumbersome and slow when compared
to non-stateful encodings. Any jump in the middle of the text may
require a back up to the previous escape sequence before the bytes
following the escape sequence can be interpreted."

Doesn't that make it very difficult to use with standard Unix tools?
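
A small Python sketch of what the quote describes, using the standard
"iso2022_jp" codec (which may not correspond exactly to any of the three
Windows codepages above). The slicing assumes a three-byte escape
sequence at each end of the encoded string, which is what this codec
produces for a short all-kanji string.

```python
# Illustration of the stateful nature of ISO-2022-JP; the sample text
# and variable names are arbitrary.
text = "日本"
encoded = text.encode("iso2022_jp")
print(encoded)  # escape sequence, two 2-byte characters, escape sequence

# Cut out the bytes between the two escape sequences. Without the
# leading escape sequence the decoder stays in its initial ASCII state.
middle = encoded[3:-3]
print(middle.decode("iso2022_jp"))  # plain ASCII, not the original text
```

A tool that starts reading at an arbitrary offset sees only those ASCII
bytes, which here even include "|" and "\".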

Andy


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-29  0:23                                   ` wynfield
  2009-09-29  4:04                                     ` Andy Koppe
@ 2009-09-29 13:55                                     ` IWAMURO Motonori
  1 sibling, 0 replies; 51+ messages in thread
From: IWAMURO Motonori @ 2009-09-29 13:55 UTC (permalink / raw)
  To: cygwin

2009/9/29  <wynfield@gmail.com>:
> Also the following be suitable if possible..
>        LANG=ja -> iso-2022-jp
>     LANG=ja_JP -> iso-2022-jp

Hmmm, I think that is unrealistic.
-- 
IWAMURO Motnori <http://vmi.jp/>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-28 16:16                                 ` Corinna Vinschen
  2009-09-29  0:23                                   ` wynfield
@ 2009-09-29  4:27                                   ` Andy Koppe
  2009-09-29  7:03                                     ` Corinna Vinschen
  2009-09-29 10:55                                   ` Lapo Luchini
                                                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 51+ messages in thread
From: Andy Koppe @ 2009-09-29  4:27 UTC (permalink / raw)
  To: cygwin

2009/9/28 Corinna Vinschen
>> My conclusion is as follows as a result of hearing other Japanese
>> people's opinion:
>>
>> LANG=ja -> UTF-8
>> LANG=ja_JP -> UTF-8
>>
>> Because, we specify "eucJP" explicitly when we need it.
>
> Hmm.
>
> That's an interesting point.
>
> In theory this sounds like a good idea to be used for all locales which
> don't specify the charset explicitely, because that results in using the
> same charset, "UTF-8", for all such locales.  "C", "ja" or "en_US"
> would all default to UTF-8.

Hmm, there's much to be said for that.

> The downside is that a user, who needs to work under the default ANSI
> codepage for some reason, has to know the name of the default ANSI
> codepage.  Right now any user who needs the default ANSI codepage can
> simply set LANG to some language code and go ahead, without having to
> know the number.  With your solution, that wouldn't be possible anymore
> and the user would have to figure out the default ANSI codepage on the
> system before being able to use it.

How about an explicit "ANSI" charset that maps to GetACP()? And "OEM"
for GetOEMCP()? Those would make easy replacements for the
CYGWIN=codepage:[ansi|oem] option.

Andy


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-29  4:27                                   ` Andy Koppe
@ 2009-09-29  7:03                                     ` Corinna Vinschen
  0 siblings, 0 replies; 51+ messages in thread
From: Corinna Vinschen @ 2009-09-29  7:03 UTC (permalink / raw)
  To: cygwin

On Sep 29 05:27, Andy Koppe wrote:
> 2009/9/28 Corinna Vinschen
> > The downside is that a user, who needs to work under the default ANSI
> > codepage for some reason, has to know the name of the default ANSI
> > codepage.  Right now any user who needs the default ANSI codepage can
> > simply set LANG to some language code and go ahead, without having to
> > know the number.  With your solution, that wouldn't be possible anymore
> > and the user would have to figure out the default ANSI codepage on the
> > system before being able to use it.
> 
> How about an explicit "ANSI" charset that maps to GetACP()? And "OEM"
> for GetOEMCP()? Those would make easy replacements for the
> CYGWIN=codepage:[ansi|oem] option.

Not for 1.7.1.  Maybe later, if there's any actual demand.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-28 16:16                                 ` Corinna Vinschen
  2009-09-29  0:23                                   ` wynfield
  2009-09-29  4:27                                   ` Andy Koppe
@ 2009-09-29 10:55                                   ` Lapo Luchini
  2009-09-29 11:12                                   ` Thomas Wolff
  2009-09-29 14:13                                   ` IWAMURO Motonori
  4 siblings, 0 replies; 51+ messages in thread
From: Lapo Luchini @ 2009-09-29 10:55 UTC (permalink / raw)
  To: cygwin

[-- Attachment #1: Type: text/plain, Size: 427 bytes --]

Corinna Vinschen wrote:
> The downside is that a user, who needs to work under the default ANSI
> codepage for some reason, has to know the name of the default ANSI
> codepage.

Mhhh... IMHO any user interested in this probably knows his own ANSI
codepage all too well (CP1252 for me), but maybe that's a programmer's
point of view and many users can have those issues as well.

-- 
Lapo Luchini - http://lapo.it/


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 898 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-28 16:16                                 ` Corinna Vinschen
                                                     ` (2 preceding siblings ...)
  2009-09-29 10:55                                   ` Lapo Luchini
@ 2009-09-29 11:12                                   ` Thomas Wolff
  2009-09-29 12:12                                     ` Corinna Vinschen
  2009-09-29 14:13                                   ` IWAMURO Motonori
  4 siblings, 1 reply; 51+ messages in thread
From: Thomas Wolff @ 2009-09-29 11:12 UTC (permalink / raw)
  To: cygwin

Corinna Vinschen wrote:
> On Sep 29 01:03, IWAMURO Motonori wrote:
>   
>> 2009/9/27 IWAMURO Motonori <deenheart@gmail.com>:
>>     
>>>> LANG="ja" -> EUCJP
>>>> LANG="ja_JP" -> EUCJP
>>>>         
>>> Hmmm, It is a difficult problem.
>>>
>>> I think selecting UTF-8 is good because eucJP is legacy.
>>>
>>> But, for interoperability with other UNIX-like system(*), I don't
>>> think selecting UTF-8 is good.
>>>
>>> * Solaris: ja, ja_JP -> eucJP
>>> * Linux (Debian): ja -> Unknown, ja_JP -> eucJP
>>>
>>> I need to think more...
>>>       
>> My conclusion is as follows as a result of hearing other Japanese
>> people's opinion:
>>
>> LANG=ja -> UTF-8
>> LANG=ja_JP -> UTF-8
>>
>> Because, we specify "eucJP" explicitly when we need it.
>>     
>
> Hmm.
>
> That's an interesting point.
>
> In theory this sounds like a good idea to be used for all locales which
> don't specify the charset explicitely, because that results in using the
> same charset, "UTF-8", for all such locales.  "C", "ja" or "en_US"
> would all default to UTF-8.
>   
The keyword here again should be compatibility. That means, 
unfortunately, that I do not think this is a good idea.
A number of locales have been established on common systems that do not 
specify their encoding explicitly (i.e. in their name).
Since there is now more or less a common set of such locales among 
various Linux and Unix systems, this seems to be
a de-facto standard although I am not aware of any more formal 
definition/listing/description of this.
On a modern Linux system, use the following command to get a list (not 
sure if it's appropriate to attach it here):
    for l in `locale -a`
    do      echo "$l        `LC_ALL=$l locale charmap`"
    done

I have also tried to incorporate a best guess assembly of mappings from 
modern systems in my editor mined so it can
derive the encoding from the locale name, so you could also take a 
working list from there.

I think this list should be used for reference to define the 
locale/encoding mapping, other choices may be more attractive
but only raise problems.

> The downside is that a user, who needs to work under the default ANSI
> codepage for some reason, has to know the name of the default ANSI
> codepage.  Right now any user who needs the default ANSI codepage can
> simply set LANG to some language code and go ahead, without having to
> know the number.  With your solution, that wouldn't be possible anymore
> and the user would have to figure out the default ANSI codepage on the
> system before being able to use it.
>
> I honestly don't know if that's really a problem, though.  But I don't
> want to take that feature away for now.  Anybody having a strong opinion
> on this issue?
>   
I wasn't quite aware that the old "codepage:oem" setting didn't strictly 
mean "CP850" or "CP437" but apparently the respective system locale.
If that is really needed, maybe the "C" locale should get you there, or 
some "OEM" as (I think) Andy proposed. If someone feels the need
to combine a specific language setting with the unspecific "system 
locale", well, maybe a pseudo encoding name could be invented to form
names like "en_GB.OEM". Just leaving out the encoding suffix should not 
have that effect as I argued above.

Kind regards,
Thomas


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: The C locale
  2009-09-29 11:12                                   ` Thomas Wolff
@ 2009-09-29 12:12                                     ` Corinna Vinschen
  2009-09-29 14:30                                       ` IWAMURO Motonori
  0 siblings, 1 reply; 51+ messages in thread
From: Corinna Vinschen @ 2009-09-29 12:12 UTC (permalink / raw)
  To: cygwin

On Sep 29 13:05, Thomas Wolff wrote:
> Corinna Vinschen wrote:
>> In theory this sounds like a good idea to be used for all locales which
>> don't specify the charset explicitely, because that results in using the
>> same charset, "UTF-8", for all such locales.  "C", "ja" or "en_US"
>> would all default to UTF-8.
>>   
> The keyword here again should be compatibility. That means,  
> unfortunately, that I do not think this is a good idea.
> A number of locales have been established on common systems that do not  
> specify their encoding explicitly (i.e. in their name).
> Since there is now more or less a common set of such locales among  
> various Linux and Unix systems, this seems to be
> a de-facto standard although I am not aware of any more formal  
> definition/listing/description of this.
> On a modern Linux system, use the following command to get a list (not  
> sure if it's appropriate to attach it here):
>    for l in `locale -a`
>    do      echo "$l        `LC_ALL=$l locale charmap`"
>    done
>
> I have also tried to incorporate a best guess assembly of mappings from  
> modern systems in my editor mined so it can
> derive the encoding from the locale name, so you could also take a  
> working list from there.
>
> I think this list should be used for reference to define the  
> locale/encoding mapping, other choices may be more attractive
> but only raise problems.

This isn't feasible for now.  As I described in the documentation, the
actual content of the language and territory part is not evaluated at
present.  *Only* the charset part (and the cjknarrow modifier, FWIW) has
a meaning for newlib/Cygwin so far.  What happens is that Cygwin calls
a function which fetches the ANSI codepage and generates the current
charset from it.  So this is what happens:

   LANG="C"               -> UTF-8
   LANG="xx"              -> charset equivalent to ANSI codepage
   LANG="xx_XX"           -> ditto
   LANG="xx_XX.CHARSET"   -> Use charset CHARSET
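
The mapping above can be sketched in C roughly as follows.  This is an
illustration only, not the actual newlib/Cygwin code: the function name
charset_for_lang, the buffer handling, and the fallback for an empty or
oversized charset part are assumptions, and the charset equivalent to
the ANSI codepage is passed in by the caller instead of being queried
from Windows.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not Cygwin's real code): choose the charset for a
   LANG value according to the four cases listed above.  ansi_charset
   stands in for the charset derived from the ANSI codepage. */
const char *
charset_for_lang (const char *lang, const char *ansi_charset,
                  char *buf, size_t buflen)
{
  const char *dot, *end;
  size_t len;

  /* LANG="C" -> UTF-8 */
  if (strcmp (lang, "C") == 0)
    return "UTF-8";

  /* LANG="xx" or LANG="xx_XX" -> charset equivalent to ANSI codepage.
     The language and territory parts themselves are not inspected. */
  dot = strchr (lang, '.');
  if (dot == NULL)
    return ansi_charset;

  /* LANG="xx_XX.CHARSET" -> CHARSET, stopping at an '@' modifier. */
  end = strchr (dot + 1, '@');
  len = end ? (size_t) (end - (dot + 1)) : strlen (dot + 1);
  if (len == 0 || len >= buflen)
    return ansi_charset;	/* assumption: fall back if unusable */
  memcpy (buf, dot + 1, len);
  buf[len] = '\0';
  return buf;
}
```

With "SJIS" passed as ansi_charset, "ja" and "ja_JP" both yield "SJIS",
while "ja_JP.EUCJP" yields "EUCJP" regardless of the codepage.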

We won't add extra functionality.  In the long run it would be nice to
change the setlocale functionality to use actual locale files in every
respect, but that's wishful thinking for now.

To return to the original problem which started this request:

I asked if the default charset for the Japanese language should be set
to EUCJP rather than SJIS.  The actual implementation would have been
like this:

  if (lang="xx" or lang="xx_XX" with x in [a-z] and X in [A-Z]?)
    set_charset_from_codepage()

  set_charset_from_codepage()
  {
    switch (GetANSI ())
    [...]
    case 932:
      charset="EUCJP"    <-- Instead of the current charset="SJIS"
    [...]
  }
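
Spelled out as compilable C, the switch above might look like the sketch
below.  Only the codepage 932 case comes from this discussion.  The
function signature, the prefer_eucjp flag, and the "CP<number>" naming
used for all other codepages are assumptions made for illustration, and
the codepage is taken as a parameter so that the sketch does not depend
on the Windows API.

```c
#include <stdio.h>

/* Hypothetical sketch of the codepage-to-charset step.  prefer_eucjp
   selects between the current behaviour (SJIS) and the proposed one
   (EUCJP) for codepage 932. */
const char *
charset_from_codepage (unsigned codepage, int prefer_eucjp,
                       char *buf, size_t buflen)
{
  switch (codepage)
    {
    case 932:
      return prefer_eucjp ? "EUCJP" : "SJIS";
    default:
      /* Assumption: every other codepage is simply named "CP<number>". */
      snprintf (buf, buflen, "CP%u", codepage);
      return buf;
    }
}
```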

Everything going beyond this in complexity is out of the question for now.
  

Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

* Re: The C locale
  2009-09-29 12:12                                     ` Corinna Vinschen
@ 2009-09-29 14:30                                       ` IWAMURO Motonori
  0 siblings, 0 replies; 51+ messages in thread
From: IWAMURO Motonori @ 2009-09-29 14:30 UTC (permalink / raw)
  To: cygwin

2009/9/29 Corinna Vinschen <corinna-cygwin@cygwin.com>:
> I asked if the default charset for the Japanese language should be set
> to EUCJP rather than SJIS.  The actual implementation would have been
> like this
>
>  if (lang="xx" or lang="xx_XX" with x in [a-z] and X in [A-Z]?)
>    set_charset_from_codepage()
>
>  set_charset_from_codepage()
>  {
>    switch (GetANSI ())
>    [...]
>    case 932:
>      charset="EUCJP"    <-- Instead of the current charset="SJIS"
>    [...]
>  }

I don't think that is good for Japanese users, because EUCJP is not a
substitute for SJIS.
-- 
IWAMURO Motnori <http://vmi.jp/>

* Re: The C locale
  2009-09-28 16:16                                 ` Corinna Vinschen
                                                     ` (3 preceding siblings ...)
  2009-09-29 11:12                                   ` Thomas Wolff
@ 2009-09-29 14:13                                   ` IWAMURO Motonori
  2009-09-29 14:55                                     ` Corinna Vinschen
  4 siblings, 1 reply; 51+ messages in thread
From: IWAMURO Motonori @ 2009-09-29 14:13 UTC (permalink / raw)
  To: cygwin

2009/9/29 Corinna Vinschen <corinna-cygwin@cygwin.com>:
> The downside is that a user, who needs to work under the default ANSI
> codepage for some reason, has to know the name of the default ANSI
> codepage.

If this is a 1.5->1.7 migration problem, how about building a wizard
into setup.exe that sets the locale environment variable?
Wouldn't that be a proper solution?
-- 
IWAMURO Motnori <http://vmi.jp/>

* Re: The C locale
  2009-09-29 14:13                                   ` IWAMURO Motonori
@ 2009-09-29 14:55                                     ` Corinna Vinschen
  0 siblings, 0 replies; 51+ messages in thread
From: Corinna Vinschen @ 2009-09-29 14:55 UTC (permalink / raw)
  To: cygwin

On Sep 29 23:13, IWAMURO Motonori wrote:
> 2009/9/29 Corinna Vinschen <corinna-cygwin@cygwin.com>:
> > The downside is that a user, who needs to work under the default ANSI
> > codepage for some reason, has to know the name of the default ANSI
> > codepage.
> 
> If this is a 1.5->1.7 migration problem, how about building a wizard
> into setup.exe that sets the locale environment variable?
> Wouldn't that be a proper solution?

We don't want to enforce the usage of the ANSI codepage after
installation.  Default should be "C" with UTF-8 charset.  Setting
LC_ALL/LC_CTYPE/LANG is the choice of the user.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

* Re: The C locale
  2009-09-24  9:57                       ` Corinna Vinschen
  2009-09-24 10:00                         ` Corinna Vinschen
@ 2009-09-27  3:44                         ` IWAMURO Motonori
  1 sibling, 0 replies; 51+ messages in thread
From: IWAMURO Motonori @ 2009-09-27  3:44 UTC (permalink / raw)
  To: cygwin

2009/9/24 Corinna Vinschen <corinna-cygwin@cygwin.com>:
> My question is this:  Is the S-JIS implementation on UNIX systems
> also using a different implementation to avoid using characters
> from the ASCII range?  If so, can't we change the __sjis_wctomb
> and __sjis_mbtowc functions in the same manner as the __eucjp_wctomb
> and __eucjp_mbtowc functions to get a safer implementation?

I don't think it is necessary to consider that.

The eucJP problem does not arise in an SJIS environment, because
SJIS doesn't support JIS-X-0212.
-- 
IWAMURO Motnori <http://vmi.jp/>

Thread overview: 51+ messages
2009-08-30 16:59 The C locale Andy Koppe
2009-08-31  0:53 ` Christopher Faylor
2009-09-02  6:29   ` Andy Koppe
2009-09-02 11:48     ` Eric Blake
2009-09-02 20:10       ` Andy Koppe
2009-09-02 13:56     ` IWAMURO Motonori
2009-09-07 20:08       ` Andy Koppe
2009-09-08 19:35         ` Corinna Vinschen
2009-09-08 20:48           ` Andy Koppe
2009-09-08 21:49           ` Andy Koppe
2009-09-21 10:38             ` Corinna Vinschen
2009-09-21 13:08               ` Lapo Luchini
2009-09-21 14:39               ` Charles Wilson
2009-09-21 21:20               ` Andy Koppe
2009-09-22  5:59                 ` Lapo Luchini
2009-09-22  6:23                   ` Lapo Luchini
2009-09-22  6:50                     ` Andy Koppe
2009-09-22  6:47                   ` Andy Koppe
2009-09-22  8:43                     ` Lapo Luchini
2009-09-22 12:50                       ` Andy Koppe
2009-09-22 16:26                         ` Lapo Luchini
2009-09-22 16:49                           ` Mark J. Reed
2009-09-22 17:04                             ` Lapo Luchini
2009-09-22 22:11                           ` Thorsten Kampe
2009-09-23  5:12                             ` Lapo Luchini
2009-09-23  9:04                               ` Thorsten Kampe
2009-09-23 10:48                                 ` Lapo Luchini
2009-09-23 12:04                                   ` Andy Koppe
2009-09-23 15:16                                     ` Mark J. Reed
2009-09-24  7:58                                   ` Thorsten Kampe
2009-09-24  7:03                 ` IWAMURO Motonori
2009-09-24  7:34                   ` Corinna Vinschen
2009-09-24  9:39                     ` IWAMURO Motonori
2009-09-24  9:57                       ` Corinna Vinschen
2009-09-24 10:00                         ` Corinna Vinschen
2009-09-26  9:15                           ` Corinna Vinschen
2009-09-27  3:21                             ` IWAMURO Motonori
2009-09-28 16:03                               ` IWAMURO Motonori
2009-09-28 16:16                                 ` Corinna Vinschen
2009-09-29  0:23                                   ` wynfield
2009-09-29  4:04                                     ` Andy Koppe
2009-09-29 13:55                                     ` IWAMURO Motonori
2009-09-29  4:27                                   ` Andy Koppe
2009-09-29  7:03                                     ` Corinna Vinschen
2009-09-29 10:55                                   ` Lapo Luchini
2009-09-29 11:12                                   ` Thomas Wolff
2009-09-29 12:12                                     ` Corinna Vinschen
2009-09-29 14:30                                       ` IWAMURO Motonori
2009-09-29 14:13                                   ` IWAMURO Motonori
2009-09-29 14:55                                     ` Corinna Vinschen
2009-09-27  3:44                         ` IWAMURO Motonori
