From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 51395 invoked by alias); 1 Dec 2018 22:09:39 -0000
Mailing-List: contact libc-locales-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id:
List-Subscribe:
List-Post:
List-Help: ,
Sender: libc-locales-owner@sourceware.org
Received: (qmail 51373 invoked by uid 89); 1 Dec 2018 22:09:39 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=3.1 required=5.0 tests=BAYES_00,BODY_8BITS,GARBLED_BODY,KAM_LAZY_DOMAIN_SECURITY,RCVD_IN_DNSWL_NONE autolearn=no version=3.3.2 spammy=iso88591, iso-8859-1, H*Ad:U*libc-locales, HTo:U*libc-locales
X-HELO: shared-ano163.rev.nazwa.pl
X-Spam-Score: 0
Date: Sat, 01 Dec 2018 22:09:00 -0000
From: Rafal Luzynski
To: Marko Myllynen , Egor Kobylkin , libc-alpha@sourceware.org, libc-locales@sourceware.org
Message-ID: <1441622134.517912.1543702039942@poczta.nazwa.pl>
In-Reply-To: <5a247161-c498-ed50-ff4a-58f2ecf974f0@redhat.com>
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com> <20180412224352.GB2911@altlinux.org> <837001401.21346.1542406647888@poczta.nazwa.pl> <5a247161-c498-ed50-ff4a-58f2ecf974f0@redhat.com>
Subject: Re: [PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-SW-Source: 2018-q4/txt/msg00121.txt.bz2

19.11.2018 08:13 Marko Myllynen wrote:
> [...]
> Given the amount of questions above I think the way forward is to try
> follow the relevant standards as closely as possible and also check what
> the other implementations (i.e., uconv(1)) do.
For example, checking the > case earlier mentioned case may or may not give some hints: >=20 > $ echo =D0=A8=D0=B5=D0=BC=D0=B0 | uconv -f UTF-8 -t UTF-8 -x cyrillic-la= tin > =C5=A0ema > $ echo =D0=A1=D1=85=D0=B5=D0=BC=D0=B0 | uconv -f UTF-8 -t UTF-8 -x cyrill= ic-latin > Shema > $ uconv -V > uconv v2.1 ICU 50.1.2 I've played a little with uconv and unfortunately it does not look good to me. It does not have any fallback transliteration to plain ASCII. When it says that '=D0=A8' is transliterated to '=C5=A0' then it always uses '=C5=A0' an= d if the target charset does not have this character then crashes: $ echo =D0=A8=D0=B5=D0=BC=D0=B0 | uconv -f UTF-8 -t ASCII -x cyrillic-latin Conversion from Unicode to codepage failed at output byte position 0. Unicode: 0160 Error: Invalid character found $ echo =D0=A8=D0=B5=D0=BC=D0=B0 | uconv -f UTF-8 -t ISO-8859-1 -x cyrillic= -latin Conversion from Unicode to codepage failed at output byte position 0. Unicode: 0160 Error: Invalid character found $ echo =D0=A8=D0=B5=D0=BC=D0=B0 | uconv -f UTF-8 -t ISO-8859-2 -x cyrillic= -latin =EF=BF=BDema $ echo =D0=A8=D0=B5=D0=BC=D0=B0 | uconv -f UTF-8 -t ISO-8859-2 -x cyrillic= -latin | uconv -f ISO-8859-2 -t UTF-8 =C5=A0ema It seems to follow ISO 9 (GOST 7.79) System A. 
However, the transliteration of the hard sign is rather strange:

$ echo нъе | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
nʺe

The above was correct but:

$ echo НЪЕ | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
Nʺ̱E
$ echo Ъ | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
ʺ̱
$ echo Ъ | uconv -f UTF-8 -t UTF-16 -x cyrillic-latin| hexdump -x
0000000    feff    02ba    0331    000a
0000008

So this generates:

02BA MODIFIER LETTER DOUBLE PRIME
0331 COMBINING MACRON BELOW

There are more transliteration methods, for example Russian-Latin/BGN:

$ echo Шема | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
Shema
$ echo Схема | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
Skhema

Converting 'х' to 'kh' seems to be common in English transliteration but
it does not follow any ISO standard.

$ echo ХА ха | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
KHA kha

This means that the choice whether a digraph in the output should be all
uppercase or maybe upper+lower is context based, something which we
probably cannot implement. But definitely a good thing.

Two more tests:

$ echo Ещё | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
Yeshchë
$ echo Ещё | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
Conversion from Unicode to codepage failed at output byte position 6. Unicode: 00eb Error: Invalid character found

So the output is not plain ASCII.

$ echo е же ле не | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
ye zhe le ne

Again this means that transliteration of 'е' is context based: it is
'ye' in the beginning of a word and 'e' otherwise.
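
To make the two context-dependent behaviours observed above concrete,
here is a minimal Python sketch. It is not ICU's rule set and not a
proposal for glibc: the table is a tiny illustrative subset, the names
TABLE, transliterate_word and transliterate are hypothetical, and the
rules are only the simplified ones inferred from the tests above ('ye'
for a word-initial 'е', and a digraph written all uppercase only when a
neighbouring letter is uppercase).

```python
# Illustrative subset only: lowercase Cyrillic -> Latin (assumed mapping,
# chosen to match the few uconv outputs quoted in this thread).
TABLE = {"а": "a", "ж": "zh", "л": "l", "м": "m", "н": "n",
         "с": "s", "х": "kh", "ш": "sh"}
CYRILLIC_IE = "\u0435"  # Cyrillic small letter 'е' (not Latin 'e')


def transliterate_word(word: str) -> str:
    """Transliterate one word using context (position and neighbours)."""
    out = []
    for i, ch in enumerate(word):
        lower = ch.lower()
        if lower == CYRILLIC_IE:
            # Simplified rule: 'ye' at the start of a word, 'e' otherwise.
            latin = "ye" if i == 0 else "e"
        else:
            # Characters outside the table are passed through unchanged.
            latin = TABLE.get(lower, ch)
        if ch.isupper():
            # Look at the letters directly before and after this one.
            neighbours = word[max(i - 1, 0):i] + word[i + 1:i + 2]
            if any(c.isupper() for c in neighbours):
                latin = latin.upper()       # e.g. 'Х' in 'ХА' -> 'KH'
            else:
                latin = latin.capitalize()  # e.g. 'Ш' in 'Шема' -> 'Sh'
        out.append(latin)
    return "".join(out)


def transliterate(text: str) -> str:
    """Transliterate space-separated words independently."""
    return " ".join(transliterate_word(w) for w in text.split(" "))


print(transliterate("ХА ха"))        # KHA kha
print(transliterate("е же ле не"))   # ye zhe le ne
```

The point of the sketch is that each output depends on the neighbouring
characters, which a plain one-character-to-one-string table cannot
express.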
The version which I've tested:

$ uconv -V
uconv v2.1 ICU 60.2

It seems that uconv will not be a good hint about transliterating to
plain ASCII. Also, the difference between uconv and iconv is that with
iconv we can provide multiple transliterations for any source character
but we can't group them into standards, so we can't tell iconv to use
this or another system. It will just choose the one best fitting the
current output character set, and the only thing we can choose is the
locale. This makes me think: should we add a locale like ru_RU@SystemA
or ru_RU@SystemB?

Regards,

Rafal