From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-locales-return-6509-listarch-libc-locales=sources.redhat.com@sourceware.org>
Received: (qmail 51395 invoked by alias); 1 Dec 2018 22:09:39 -0000
Mailing-List: contact libc-locales-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-locales.sourceware.org>
List-Subscribe: <mailto:libc-locales-subscribe@sourceware.org>
List-Post: <mailto:libc-locales@sourceware.org>
List-Help: <mailto:libc-locales-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: libc-locales-owner@sourceware.org
Received: (qmail 51373 invoked by uid 89); 1 Dec 2018 22:09:39 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=3.1 required=5.0 tests=BAYES_00,BODY_8BITS,GARBLED_BODY,KAM_LAZY_DOMAIN_SECURITY,RCVD_IN_DNSWL_NONE autolearn=no version=3.3.2 spammy=iso88591, iso-8859-1, H*Ad:U*libc-locales, HTo:U*libc-locales
X-HELO: shared-ano163.rev.nazwa.pl
X-Spam-Score: 0
Date: Sat, 01 Dec 2018 22:09:00 -0000
From: Rafal Luzynski <digitalfreak@lingonborough.com>
To: Marko Myllynen <myllynen@redhat.com>, Egor Kobylkin <egor@kobylkin.com>,
	libc-alpha@sourceware.org, libc-locales@sourceware.org
Message-ID: <1441622134.517912.1543702039942@poczta.nazwa.pl>
In-Reply-To: <5a247161-c498-ed50-ff4a-58f2ecf974f0@redhat.com>
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com>
 <20180412224352.GB2911@altlinux.org>
 <b82fe65b-b880-a2b5-c97d-2a6aae9c1165@kobylkin.com>
 <837001401.21346.1542406647888@poczta.nazwa.pl>
 <bef63562-09d1-3306-aae9-20002ccf4130@kobylkin.com>
 <5a247161-c498-ed50-ff4a-58f2ecf974f0@redhat.com>
Subject: Re: [PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ
 #2872]
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-SW-Source: 2018-q4/txt/msg00121.txt.bz2

19.11.2018 08:13 Marko Myllynen <myllynen@redhat.com> wrote:
> [...]
> Given the amount of questions above I think the way forward is to try
> follow the relevant standards as closely as possible and also check what
> the other implementations (i.e., uconv(1)) do. For example, checking the
> case earlier mentioned case may or may not give some hints:
>=20
> $ echo =D0=A8=D0=B5=D0=BC=D0=B0  | uconv -f UTF-8 -t UTF-8 -x cyrillic-la=
tin
> =C5=A0ema
> $ echo =D0=A1=D1=85=D0=B5=D0=BC=D0=B0 | uconv -f UTF-8 -t UTF-8 -x cyrill=
ic-latin
> Shema
> $ uconv -V
> uconv v2.1  ICU 50.1.2

I've played a little with uconv and unfortunately it does not look good
to me.

It does not have any fallback transliteration to plain ASCII.  When it says
that '=D0=A8' is transliterated to '=C5=A0', it always uses '=C5=A0',
and if the target charset does not have this character, the conversion
fails with an error:

$ echo =D0=A8=D0=B5=D0=BC=D0=B0  | uconv -f UTF-8 -t ASCII -x cyrillic-latin
Conversion from Unicode to codepage failed at output byte position 0.
Unicode: 0160 Error: Invalid character found
$ echo =D0=A8=D0=B5=D0=BC=D0=B0  | uconv -f UTF-8 -t ISO-8859-1 -x cyrillic=
-latin
Conversion from Unicode to codepage failed at output byte position 0.
Unicode: 0160 Error: Invalid character found
$ echo =D0=A8=D0=B5=D0=BC=D0=B0  | uconv -f UTF-8 -t ISO-8859-2 -x cyrillic=
-latin
=EF=BF=BDema
$ echo =D0=A8=D0=B5=D0=BC=D0=B0  | uconv -f UTF-8 -t ISO-8859-2 -x cyrillic=
-latin | uconv -f ISO-8859-2 -t UTF-8
=C5=A0ema

It seems to follow ISO 9 (GOST 7.79) System A.  However, the transliteration
of the hard sign is rather strange:

$ echo =D0=BD=D1=8A=D0=B5  | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
n=CA=BAe

The above is correct, but the uppercase hard sign gets an extra combining
character:

$ echo =D0=9D=D0=AA=D0=95  | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
N=CA=BA=CC=B1E
$ echo =D0=AA  | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
=CA=BA=CC=B1
$ echo =D0=AA  | uconv -f UTF-8 -t UTF-16 -x cyrillic-latin | hexdump -x
0000000    feff    02ba    0331    000a
0000008

So this generates:
02BA  MODIFIER LETTER DOUBLE PRIME
0331  COMBINING MACRON BELOW
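
As a cross-check, the two code points can be looked up with Python's
standard unicodedata module.  This is only a sketch: the hex string below
is typed in from the dump above (without the BOM and the trailing newline),
it is not captured from uconv.

```python
import unicodedata

# Code units copied from the hexdump above, without the BOM (feff)
# and without the trailing newline (000a).
raw = bytes.fromhex("02ba0331")
text = raw.decode("utf-16-be")

names = [unicodedata.name(ch) for ch in text]
for ch, name in zip(text, names):
    print("U+%04X  %s" % (ord(ch), name))
```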

There are more transliteration methods, for example Russian-Latin/BGN:

$ echo =D0=A8=D0=B5=D0=BC=D0=B0  | uconv -f UTF-8 -t UTF-8 -x Russian-Latin=
/BGN
Shema
$ echo =D0=A1=D1=85=D0=B5=D0=BC=D0=B0  | uconv -f UTF-8 -t UTF-8 -x Russian=
-Latin/BGN
Skhema

Converting '=D1=85' to 'kh' seems to be common in English transliteration b=
ut
it does not follow any ISO standard.

$ echo =D0=A5=D0=90 =D1=85=D0=B0 | uconv -f UTF-8 -t UTF-8 -x Russian-Latin=
/BGN
KHA kha

This means that the choice of whether a digraph in the output is all
uppercase (as in "KHA") or upper+lower (as in "Shema") depends on the
context, which is something we probably cannot implement.  It is
definitely a good thing, though.
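
To make it clearer what "context based" means here, below is a minimal
Python sketch.  It is not how ICU implements it; the TABLE contents, the
translit() name and the rule itself ("upper+lower if an adjacent character
is lowercase, otherwise all uppercase") are my assumptions, chosen only so
that the two examples above come out the same.

```python
# Hypothetical mini table, only the letters used in the examples above.
TABLE = {
    "\u0445": "kh",  # CYRILLIC SMALL LETTER HA
    "\u0448": "sh",  # CYRILLIC SMALL LETTER SHA
    "\u0430": "a",
    "\u0435": "e",
    "\u043c": "m",
}


def translit(text):
    out = []
    for i, ch in enumerate(text):
        latin = TABLE.get(ch.lower())
        if latin is None:
            out.append(ch)  # not in the table: pass through
        elif ch.islower():
            out.append(latin)
        else:
            # Uppercase source letter: peek at the adjacent characters.
            near = text[max(i - 1, 0):i] + text[i + 1:i + 2]
            if any(c.islower() for c in near):
                out.append(latin.capitalize())  # "Sh", "Kh"
            else:
                out.append(latin.upper())  # "SH", "KH"
    return "".join(out)


print(translit("\u0425\u0410 \u0445\u0430"))
print(translit("\u0428\u0435\u043c\u0430"))
```

With this table the two strings come out as "KHA kha" and "Shema".  The
point is that the function has to look at the neighbours of each letter,
which a plain one-character-to-string table cannot do.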

Two more tests:

$ echo =D0=95=D1=89=D1=91 | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
Yeshch=C3=AB
$ echo =D0=95=D1=89=D1=91 | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
Conversion from Unicode to codepage failed at output byte position 6.
Unicode: 00eb Error: Invalid character found

So the output is not plain ASCII.

$ echo =D0=B5 =D0=B6=D0=B5 =D0=BB=D0=B5 =D0=BD=D0=B5 | uconv -f UTF-8 -t AS=
CII -x Russian-Latin/BGN
ye zhe le ne

Again this means that transliteration of '=D0=B5' is context based:
it is 'ye' at the beginning of a word and 'e' otherwise.
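
A word-initial rule like this could be sketched in Python as follows.
Again this is only an illustration under my own assumptions: TABLE,
WORD_INITIAL and the function names are made up, only lowercase letters
from the test above are covered, and the real BGN rules may have more
cases than "beginning of a word".

```python
import re

# Hypothetical mini table covering only the letters in the test above.
TABLE = {
    "\u0435": "e",   # CYRILLIC SMALL LETTER IE
    "\u0436": "zh",  # CYRILLIC SMALL LETTER ZHE
    "\u043b": "l",   # CYRILLIC SMALL LETTER EL
    "\u043d": "n",   # CYRILLIC SMALL LETTER EN
}
WORD_INITIAL = {"\u0435": "ye"}  # override used at the start of a word


def translit_word(match):
    word = match.group(0)
    out = []
    for i, ch in enumerate(word):
        if i == 0 and ch in WORD_INITIAL:
            out.append(WORD_INITIAL[ch])
        else:
            out.append(TABLE.get(ch, ch))
    return "".join(out)


def translit(text):
    # \w+ matches runs of letters, including Cyrillic, in Python 3.
    return re.sub(r"\w+", translit_word, text)


print(translit("\u0435 \u0436\u0435 \u043b\u0435 \u043d\u0435"))
```

This prints "ye zhe le ne", the same as the uconv output above.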

The version which I've tested:

$ uconv -V
uconv v2.1  ICU 60.2

It seems that uconv is not a good reference for transliterating
to plain ASCII.

Also, the difference between uconv and iconv is that with iconv we can
provide multiple transliterations for any source character, but we can't
group them into standards, so we can't tell iconv to use one system or
another.  It will just choose the one that best fits the current output
character set, and the only thing we can choose is the locale.
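
The selection described above can be sketched in Python.  This is not
glibc code: ALTERNATIVES and first_fitting() are hypothetical names, the
two alternatives for U+0428 are only an example, and I assume here that
the first alternative which the target charset can represent is used.

```python
# Hypothetical alternatives for one source character, in priority order.
ALTERNATIVES = {
    "\u0428": ["\u0160", "Sh"],  # SHA: S with caron first, then "Sh"
}


def first_fitting(ch, charset):
    for candidate in ALTERNATIVES.get(ch, [ch]):
        try:
            candidate.encode(charset)
        except UnicodeEncodeError:
            continue  # not representable, try the next alternative
        return candidate
    return "?"  # nothing fits the target charset


for charset in ("utf-8", "iso-8859-2", "iso-8859-1", "ascii"):
    # ascii() keeps the printed result readable on any terminal.
    print(charset, ascii(first_fitting("\u0428", charset)))
```

For UTF-8 and ISO-8859-2 this picks U+0160, for ISO-8859-1 and ASCII it
falls back to "Sh".  Note that the caller can only pick the charset;
there is no parameter to say "use this system instead of that one".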

This makes me think: should we add a locale like ru_RU@SystemA or
ru_RU@SystemB?

Regards,

Rafal