public inbox for cygwin@cygwin.com
* Default locale for Russian/Russia should be ru_RU.CP1251
@ 2015-12-24 15:41 Andrey ``Bass'' Shcheglov
  2015-12-24 18:22 ` Marco Atzeri
  2015-12-24 19:15 ` Corinna Vinschen
  0 siblings, 2 replies; 4+ messages in thread
From: Andrey ``Bass'' Shcheglov @ 2015-12-24 15:41 UTC
  To: cygwin

Hi,

I'm running Cygwin 2.2.0 on an English Windows 8.1 box:

> CYGWIN_NT-6.3 UNIT-725 2.2.0(0.289/5/3) 2015-08-03 12:51 x86_64 Cygwin

Windows regional settings are set to Russian/Russia.

In the absence of any settings in bashrc/bash_profile, the `locale`
command outputs the following:

> LANG=ru_RU
> LC_CTYPE="ru_RU"
> LC_NUMERIC="ru_RU"
> LC_TIME="ru_RU"
> LC_COLLATE="ru_RU"
> LC_MONETARY="ru_RU"
> LC_MESSAGES="ru_RU"
> LC_ALL=

This is perfectly fine, except that "no charset" in the locale output
means "ISO charset", which is ISO-8859-5 for Russian/Russia and has
never been used (historically, DOS used CP866, Windows used the CP1251
ANSI codepage, and various Unices stuck to KOI8-R before the rise of
the Unicode era).

The above is consistent with locale charmap output, which is again
ISO-8859-5.


A short C example also confirms that ISO-8859-5 is used:

> #include <stdio.h>
> 
> #include <locale.h>
> #include <langinfo.h>
> 
> int main() {
>     const char *locale = setlocale(LC_ALL, "");
>     const char *codeset = nl_langinfo(CODESET);
>     printf("locale: %s\n", locale);
>     printf("codeset: %s\n", codeset);
> 
>     return 0;
> }

outputs

> locale: ru_RU/ru_RU/ru_RU/ru_RU/ru_RU/C
> codeset: ISO-8859-5


Cygwin docs state that

> Starting with Cygwin 1.7.2, the default character set is determined by the default Windows ANSI codepage for this language and territory.

which is not true in my case (Windows ANSI codepage for Cyrillic is
CP1251, not ISO-8859-5!). Surprisingly, for the Belarusian (a.k.a.
Belorussian, an East Slavic language very close to Russian) "be_BY"
locale the default charset is indeed CP1251, which is in accordance with
both the documentation and common sense.


Additionally, in `strace locale -u` output, I see multiple
> __get_lcid_from_locale: LCID=0x0419 
lines.

"0x0419" corresponds to Russian/Russia (see
<https://msdn.microsoft.com/en-us/library/windows/desktop/dd318693%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396>).

Despite that, $(locale -u) returns "en_GB", even though all regional
settings are set to Russian/Russia. I believe this is not correct
either, and needs to be fixed.


Regards,
Andrey.



* Re: Default locale for Russian/Russia should be ru_RU.CP1251
  2015-12-24 15:41 Default locale for Russian/Russia should be ru_RU.CP1251 Andrey ``Bass'' Shcheglov
@ 2015-12-24 18:22 ` Marco Atzeri
  2015-12-24 19:15 ` Corinna Vinschen
  1 sibling, 0 replies; 4+ messages in thread
From: Marco Atzeri @ 2015-12-24 18:22 UTC
  To: cygwin

On 24/12/2015 16:40, Andrey ``Bass'' Shcheglov wrote:
> Hi,
>
> I'm running Cygwin 2.2.0 on an English Windows 8.1 box:
>
>> CYGWIN_NT-6.3 UNIT-725 2.2.0(0.289/5/3) 2015-08-03 12:51 x86_64 Cygwin
>
> Windows regional settings are set to Russian/Russia.
>
> In the absence of any settings in bashrc/bash_profile, the `locale`
> command outputs the following:
>
>> LANG=ru_RU
>> LC_CTYPE="ru_RU"
>> LC_NUMERIC="ru_RU"
>> LC_TIME="ru_RU"
>> LC_COLLATE="ru_RU"
>> LC_MONETARY="ru_RU"
>> LC_MESSAGES="ru_RU"
>> LC_ALL=
>
> This is perfectly fine, except that "no charset" in the locale output
> means "ISO charset", which is ISO-8859-5 for Russian/Russia and has
> never been used (historically, DOS used CP866, Windows used the CP1251
> ANSI codepage, and various Unices stuck to KOI8-R before the rise of
> the Unicode era).
>
> The above is consistent with locale charmap output, which is again
> ISO-8859-5.
>
>
> A short C example also confirms that ISO-8859-5 is used:
>
>> #include <stdio.h>
>>
>> #include <locale.h>
>> #include <langinfo.h>
>>
>> int main() {
>>      const char *locale = setlocale(LC_ALL, "");
>>      const char *codeset = nl_langinfo(CODESET);
>>      printf("locale: %s\n", locale);
>>      printf("codeset: %s\n", codeset);
>>
>>      return 0;
>> }
>
> outputs
>
>> locale: ru_RU/ru_RU/ru_RU/ru_RU/ru_RU/C
>> codeset: ISO-8859-5
>
>
> Cygwin docs state that
>
>> Starting with Cygwin 1.7.2, the default character set is determined by the default Windows ANSI codepage for this language and territory.
>
> which is not true in my case (Windows ANSI codepage for Cyrillic is
> CP1251, not ISO-8859-5!). Surprisingly, for the Belarusian (a.k.a.
> Belorussian, an East Slavic language very close to Russian) "be_BY"
> locale the default charset is indeed CP1251, which is in accordance with
> both the documentation and common sense.
>
>
> Additionally, in `strace locale -u` output, I see multiple
>> __get_lcid_from_locale: LCID=0x0419
> lines.
>
> "0x0419" corresponds to Russian/Russia (see
> <https://msdn.microsoft.com/en-us/library/windows/desktop/dd318693%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396>).
>
> Despite that, $(locale -u) returns "en_GB", even though all regional
> settings are set to Russian/Russia. I believe this is not correct
> either, and needs to be fixed.

The current code in
   winsup/cygwin/nlsfuncs.cc

is responsible for the ISO-8859-5 default:
--------------------------------------------------------------
     case 1251:
       if (lcid == 0x0c1a                /* sr_CS (Serbian Language/Former
                                                   Serbia and Montenegro) */
           || lcid == 0x1c1a             /* sr_BA (Serbian Language/Bosnia
                                                   and Herzegovina) */
           || lcid == 0x281a             /* sr_RS (Serbian Language/Serbia) */
           || lcid == 0x301a             /* sr_ME (Serbian Language/Montenegro) */
           || lcid == 0x0440             /* ky_KG (Kyrgyz/Kyrgyzstan) */
           || lcid == 0x0843             /* uz_UZ (Uzbek/Uzbekistan) */
                                         /* tt_RU (Tatar/Russia),
                                                  IQTElif alphabet */
           || (lcid == 0x0444 && has_modifier ("@iqtelif"))
           || lcid == 0x0450)            /* mn_MN (Mongolian/Mongolia) */
         cs = "UTF-8";
       else if (lcid == 0x0423)          /* be_BY (Belarusian/Belarus) */
         cs = has_modifier ("@latin") ? "UTF-8" : "CP1251";
       else if (lcid == 0x0402)          /* bg_BG (Bulgarian/Bulgaria) */
         cs = "CP1251";
       else if (lcid == 0x0422)          /* uk_UA (Ukrainian/Ukraine) */
         cs = "KOI8-U";
       else
         cs = "ISO-8859-5";
--------------------------------------------------------------

> Regards,
> Andrey.

As a temporary workaround, can you use UTF-8?

export LANG=ru_RU.UTF-8

Regards
Marco





--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


* Re: Default locale for Russian/Russia should be ru_RU.CP1251
  2015-12-24 15:41 Default locale for Russian/Russia should be ru_RU.CP1251 Andrey ``Bass'' Shcheglov
  2015-12-24 18:22 ` Marco Atzeri
@ 2015-12-24 19:15 ` Corinna Vinschen
  2015-12-25 10:51   ` Andrey ``Bass'' Shcheglov
  1 sibling, 1 reply; 4+ messages in thread
From: Corinna Vinschen @ 2015-12-24 19:15 UTC
  To: cygwin

On Dec 24 18:40, Andrey ``Bass'' Shcheglov wrote:
> Hi,
> 
> I'm running Cygwin 2.2.0 on an English Windows 8.1 box:
> 
> > CYGWIN_NT-6.3 UNIT-725 2.2.0(0.289/5/3) 2015-08-03 12:51 x86_64 Cygwin
> 
> Windows regional settings are set to Russian/Russia.
> 
> In the absence of any settings in bashrc/bash_profile, the `locale`
> command outputs the following:
> 
> > LANG=ru_RU
> > LC_CTYPE="ru_RU"
> > LC_NUMERIC="ru_RU"
> > LC_TIME="ru_RU"
> > LC_COLLATE="ru_RU"
> > LC_MONETARY="ru_RU"
> > LC_MESSAGES="ru_RU"
> > LC_ALL=
> 
> This is perfectly fine, except that "no charset" in the locale output
> means "ISO charset", which is ISO-8859-5 for Russian/Russia and has
> never been used (historically, DOS used CP866, Windows used the CP1251
> ANSI codepage, and various Unices stuck to KOI8-R before the rise of
> the Unicode era).

Well, not quite.  Cygwin is following Linux here:

  linux$ locale -av
  [...]
  locale: ru_RU           archive: /usr/lib/locale/locale-archive
  ----------------------------------------------------------------------
      title | Russian locale for Russia
     source | RAP
    address | Sankt Jorgens Alle 8, DK-1615 Kobenhavn V, Danmark
      email | bug-glibc-locales@gnu.org
   language | Russian
  territory | Russia
   revision | 1.0
       date | 2000-06-29
    codeset | ISO-8859-5

  cygwin$ locale -av
  [...]
  locale: ru_RU           archive: /mnt/c/WINDOWS/system32/KERNEL32.DLL
  ----------------------------------------------------------------------
   language | Russian
  territory | Russia
    codeset | ISO-8859-5

> Cygwin docs state that
> 
> > Starting with Cygwin 1.7.2, the default character set is determined by the default Windows ANSI codepage for this language and territory.

You stopped reading too early:

  Cygwin uses a character set which is the typical Unix-equivalent to
  the Windows ANSI codepage.  For instance: [...]

> which is not true in my case (Windows ANSI codepage for Cyrillic is
> CP1251, not ISO-8859-5!).

Rephrasing the above, Cygwin only uses the ANSI codepage to look up the
default Linux codeset.  Maybe the documentation is a bit fuzzy, but it
doesn't say the charset is set *to* the Windows ANSI charset, it just
*uses* the information to compute and set the codeset to the equivalent
Linux codeset.

> Surprisingly, for the Belarusian (a.k.a.
> Belorussian, an East Slavic language very close to Russian) "be_BY"
> locale the default charset is indeed CP1251, which is in accordance with
> both the documentation and common sense.

See the docs:

  The default charset of the "be_BY" locale (Belarusian/Belarus) is CP1251.
  With the "@latin" modifier it's UTF-8.

Just as on Linux.

> Despite that, $(locale -u) returns "en_GB", even though all regional
> settings are set to Russian/Russia. I believe this is not correct
> either, and needs to be fixed.

The locale is directly taken from the Windows system function
GetUserDefaultUILanguage() in case of the -u option(*), and from
GetUserDefaultLCID() in case of the -f option(**).  This value is then
fed into the Windows function GetLocaleInfo()(***) to fetch language and
territory codes and that's what locale -u/-f prints.

So, it looks like you're using a UK-English system with just the region
settings changed to Russia.

In general, UTF-8 is the preferred codeset, so setting LANG to ru_RU.utf8
(locale -fU should work for you) is the better choice.


Corinna

(*) https://sourceware.org/git/?p=newlib-cygwin.git;a=blob;f=winsup/utils/locale.cc;h=fadf3f3dacedad6474c92aabe826620b2677e494;hb=HEAD#l805

(**) https://sourceware.org/git/?p=newlib-cygwin.git;a=blob;f=winsup/utils/locale.cc;h=fadf3f3dacedad6474c92aabe826620b2677e494;hb=HEAD#l812

(***) https://sourceware.org/git/?p=newlib-cygwin.git;a=blob;f=winsup/utils/locale.cc;h=fadf3f3dacedad6474c92aabe826620b2677e494;hb=HEAD#l114

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat


* Re: Default locale for Russian/Russia should be ru_RU.CP1251
  2015-12-24 19:15 ` Corinna Vinschen
@ 2015-12-25 10:51   ` Andrey ``Bass'' Shcheglov
  0 siblings, 0 replies; 4+ messages in thread
From: Andrey ``Bass'' Shcheglov @ 2015-12-25 10:51 UTC
  To: cygwin


Thank you for the clarification, Corinna.

Regards,
Andrey.

On 24.12.2015 22:15, Corinna Vinschen wrote:
> You stopped reading too early:
> 
> Cygwin uses a character set which is the typical Unix-equivalent
> to the Windows ANSI codepage.  For instance: [...]
> 
> Rephrasing the above, Cygwin only uses the ANSI codepage to look up
> the default Linux codeset.  Maybe the documentation is a bit fuzzy,
> but it doesn't say the charset is set *to* the Windows ANSI charset,
> it just *uses* the information to compute and set the codeset to the
> equivalent Linux codeset.
> 

