From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-locales-return-6755-listarch-libc-locales=sources.redhat.com@sourceware.org>
Received: (qmail 77554 invoked by alias); 10 Jun 2019 22:44:37 -0000
Mailing-List: contact libc-locales-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-locales.sourceware.org>
List-Subscribe: <mailto:libc-locales-subscribe@sourceware.org>
List-Post: <mailto:libc-locales@sourceware.org>
List-Help: <mailto:libc-locales-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: libc-locales-owner@sourceware.org
Received: (qmail 77453 invoked by uid 89); 10 Jun 2019 22:44:36 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: =?ISO-8859-1?Q?No, score=-4.0 required=5.0 tests=AWL,BAYES_00,BODY_8BITS,GARBLED_BODY,KAM_ASCII_DIVIDERS,KAM_MANYTO,RCVD_IN_DNSWL_NONE autolearn=no version=3.3.1 spammy==d0=a8=d0, =d0=bc=d0=b0, HTo:U*siddhesh, handing?=
X-HELO: shared-ano163.rev.nazwa.pl
X-Spam-Score: -0.9
Date: Mon, 10 Jun 2019 22:44:00 -0000
From: Rafal Luzynski <digitalfreak@lingonborough.com>
To: Carlos O'Donell <codonell@redhat.com>,
	Marko Myllynen <myllynen@redhat.com>,
	"Diego (Egor) Kobylkin" <egor@kobylkin.com>,
	"libc-alpha@sourceware.org" <libc-alpha@sourceware.org>,
	"libc-locales@sourceware.org" <libc-locales@sourceware.org>,
	Siddhesh Poyarekar <siddhesh@gotplt.org>
Cc: Mike Fabian <mfabian@redhat.com>
Message-ID: <1728627823.1766022.1560206406480@poczta.nazwa.pl>
In-Reply-To: <761147fe-75d8-fbbf-b75a-1b58323254f9@redhat.com>
References: <DDiRMB942zU2NTs_1xTsb-zTgRD2L6AOaaJW-a0-0YJ3O5voZt2GeTjQJQ0c_hExTwcJKvBMiXIeyHsdieM2Q1m61oOpU27Msj09zowycVM=@kobylkin.com>
 <2030695416.914859.1559778544120@poczta.nazwa.pl>
 <f392d6b1-39e7-c883-ae5e-2a8636231181@redhat.com>
 <1640311749.1550210.1559856673283@poczta.nazwa.pl>
 <054f3b06-3ca8-00b0-ee07-1ff86a4106dc@redhat.com>
 <956159024.1658672.1559904734686@poczta.nazwa.pl>
 <761147fe-75d8-fbbf-b75a-1b58323254f9@redhat.com>
Subject: Re: [PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration
 [BZ #2872]
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-SW-Source: 2019-q2/txt/msg00091.txt.bz2

7.06.2019 14:35 Carlos O'Donell <codonell@redhat.com> wrote:
> On 6/7/19 6:52 AM, Rafal Luzynski wrote:
> [...]
> >=20
> > uconv implements a smart algorithm to adjust the upper/lower case:
> >=20
> > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> > $ echo "=D0=A1=D1=85=D0=B5=D0=BC=D0=B0" | uconv -f UTF-8 -t ASCII -x ru=
-ru_Latn/BGN
> > Skhema
> >=20
> > $ echo "=D0=A8=D0=B5=D0=BC=D0=B0" | uconv -f UTF-8 -t ASCII -x Russian-=
Latin/BGN
> > Shema
> >=20
> > $ echo "=D0=A8=D0=95=D0=9C=D0=90" | uconv -f UTF-8 -t ASCII -x ru-ru_La=
tn/BGN
> > SHEMA
> >=20
> > $ echo "=D0=A8=D0=95=D0=BC=D0=B0" | uconv -f UTF-8 -t ASCII -x ru-ru_La=
tn/BGN
> > SHEma
> >=20
> > $ echo "=D0=A8 =D0=95=D0=BC=D0=B0" | uconv -f UTF-8 -t ASCII -x ru-ru_L=
atn/BGN
> > SH Yema
> > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> >=20
> > Also for them it is easier because they decided that "=D0=A5" should be
> > transliterated to "KH" (I think this is the common thing when
> > transliterating to English) while ISO 9 says it should be transliterated
> > to "H" and GOST says it should be "X".  We can't implement this
> > fallback in glibc because the glibc algorithm is very simple.
>=20
> *Sigh*
>=20
> I should have known you could find enough examples that contradict
> each other :-)

Sorry about the confusion.  My aim was to demonstrate how uconv adjusts
the letter case depending on context, so that "=D0=A8" sometimes becomes
"SH" and sometimes "Sh".


7.06.2019 14:59 "Diego (Egor) Kobylkin" <egor@kobylkin.com> wrote:
> [...]
> It's quite simple really - suppose you have a list of pages in
> Wikipedia.=20
>=20
> For example there are these two entries in Russian:
> 1.=D0=A8=D0=B5=D0=BC=D0=B0
> https://ru.wikipedia.org/w/index.php?title=3D%D0%A8%D0%B5%D0%BC%D0%B0&red=
irect=3Dno
>=20
>=20
> 2.=D0=A1=D1=85=D0=B5=D0=BC=D0=B0 https://ru.wikipedia.org/wiki/%D0%A1%D1%=
85%D0%B5%D0%BC%D0%B0=20
>=20
>=20
> So you want to scrape wikipedia and write them out to files: =D0=A8=
=D0=B5=D0=BC=D0=B0.txt and
> =D0=A1=D1=85=D0=B5=D0=BC=D0=B0.txt
> But the target system doesn't support Russian locale and so you must
> transliterate the filenames.

Speaking of the filesystem: I think the problem is not that it does
not support the Russian locale but that it tries to handle it and
fails.  If the filesystem accepted any byte string as a file name,
wouldn't it also accept a byte string that encodes valid Cyrillic
characters in UTF-8, without any transliteration?

> If "=D0=A8"->"Sh" and "=D0=A1=D1=85"->"Sh", both of them will be written =
into the same
> file "Shema.txt". With no other special handling the first file will be
> overwritten and its data lost.
>=20
> If "=D0=A8"->"SH" and "=D0=A1=D1=85"->"Sh" - there will be two separate f=
iles 1.
> SHema.txt 2. Shema.txt . No data loss in this case.=20
> [...]
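
The collision described above can be reproduced with a few lines of
Python.  This is only a sketch: the helper names and the tiny mapping
tables are made up for the illustration and cover just the letters in
the two example file names.  Cyrillic letters are written as \u escapes
to keep the message ASCII-only.

```python
# Hypothetical mini-tables, NOT the tables from the patch under review.
COMMON = {
    "\u0421": "S",   # CYRILLIC CAPITAL LETTER ES
    "\u0445": "h",   # CYRILLIC SMALL LETTER HA
    "\u0435": "e",   # CYRILLIC SMALL LETTER IE
    "\u043c": "m",   # CYRILLIC SMALL LETTER EM
    "\u0430": "a",   # CYRILLIC SMALL LETTER A
}
TITLE_SHA = dict(COMMON, **{"\u0428": "Sh"})  # capital SHA -> "Sh"
UPPER_SHA = dict(COMMON, **{"\u0428": "SH"})  # capital SHA -> "SH"


def transliterate(name, table):
    """Replace each character found in the table, keep the rest."""
    return "".join(table.get(ch, ch) for ch in name)


def collisions(names, table):
    """Return {output name: [source names]} for outputs produced twice."""
    seen = {}
    for name in names:
        seen.setdefault(transliterate(name, table), []).append(name)
    return {out: srcs for out, srcs in seen.items() if len(srcs) > 1}


names = [
    "\u0428\u0435\u043c\u0430.txt",        # the first example file name
    "\u0421\u0445\u0435\u043c\u0430.txt",  # the second example file name
]
print(collisions(names, TITLE_SHA))  # both names map to "Shema.txt"
print(collisions(names, UPPER_SHA))  # {} (no two names map to one output)
```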

The problem lies exclusively in a limitation of glibc itself.
In fact, no standard says that "=D0=A8" should be transliterated as "Sh"
(or "SH") and "=D0=A5" as "H" (consequently, "=D0=A1=D1=85" as "Sh").  ISO 9
says that "=D0=A8" should be "=C5=A0" and "=D0=A5" should be "H"
(consequently, "=D0=A1=D1=85" should be "Sh", but that would never be
confused with "=D0=A8").
GOST 7.79 says that "=D0=A8" should be "SH" (or "Sh") and "=D0=A5" should
be "X" (consequently, "=D0=A1=D1=85" should be "Sx").  There is no
confusion in either case.  The problem is that we can't express all
these rules in the language of glibc transliterations; the rule:

    =D0=A5    "H";"X"

will not work because it would choose "X" only if "H" were not
available in the target charset (which never happens), while we want
it to choose "X" whenever "=C5=A0" is not available.
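
To make the limitation concrete, here is a minimal Python model (not
glibc code) of the selection behaviour described above: the
alternatives are tried in order and the first one representable in the
target charset wins.  The function name first_representable is made up
for this sketch, and Python codecs merely stand in for the target
charset.

```python
def first_representable(alternatives, charset):
    """Return the first alternative that encodes in charset, else None."""
    for alt in alternatives:
        try:
            alt.encode(charset)
        except UnicodeEncodeError:
            continue
        return alt
    return None


S_CARON = "\u0160"  # LATIN CAPITAL LETTER S WITH CARON

# Capital SHA: prefer S-with-caron, fall back to "SH".
print(first_representable([S_CARON, "SH"], "iso8859_2") == S_CARON)  # True
print(first_representable([S_CARON, "SH"], "ascii"))                 # SH

# Capital HA with the rule "H";"X": "H" is plain ASCII, so it is picked
# for every target charset and "X" is never reached.
print(first_representable(["H", "X"], "iso8859_2"))  # H
print(first_representable(["H", "X"], "ascii"))      # H
```

Each letter is decided on its own, so the choice for one letter cannot
depend on whether the preferred form of another letter was available.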


7.06.2019 23:17 Carlos O'Donell <codonell@redhat.com> wrote:
> [...]
> I also think your point about "technical" is relevant here, nobody
> really wants to read the transliterated results, they want to read
> the original, and providing any hint about the original form has
> value.

It looks like I totally misunderstood the purpose.  I always thought
the aim was to produce a transliteration system for real natural
language texts and to achieve the same output a human writer would
produce.  I still think that is possible, at least partially and
not necessarily in the current development cycle.  If you guys only
want a technical hint and are willing to relax the linguistic
rules, then Egor's patches are mostly sufficient.

> In glibc we don't have any framework for an intelligent conversion.
> We would have to write specific code to handle this case and add
> it into the translit code for special handling in this case.

My suggestion was to add such an intelligent conversion.  The rule
should be simple: if an uppercase letter is followed by a lowercase
letter, it should become title case (Sh); otherwise it should become
uppercase (SH).  But this may break Egor's requirement to keep them
always uppercase.
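
A rough Python sketch of that rule, just to show how small it is.
This is neither glibc nor ICU code; the table LOWER is a made-up
subset covering only the letters from the uconv examples, with "kh"
taken from the uconv output quoted earlier.  Cyrillic letters are
written as \u escapes to keep the message ASCII-only.

```python
# Hypothetical lowercase table; uppercase forms are derived by the rule.
LOWER = {
    "\u0448": "sh",  # CYRILLIC SMALL LETTER SHA
    "\u0441": "s",   # CYRILLIC SMALL LETTER ES
    "\u0445": "kh",  # CYRILLIC SMALL LETTER HA
    "\u0435": "e",   # CYRILLIC SMALL LETTER IE
    "\u043c": "m",   # CYRILLIC SMALL LETTER EM
    "\u0430": "a",   # CYRILLIC SMALL LETTER A
}


def transliterate(text):
    out = []
    for i, ch in enumerate(text):
        latin = LOWER.get(ch.lower())
        if latin is None:
            out.append(ch)  # not in the table: copy unchanged
            continue
        if ch.isupper():
            nxt = text[i + 1] if i + 1 < len(text) else ""
            # Followed by a lowercase letter: title case ("Sh").
            # Otherwise (uppercase, space, end of text): uppercase ("SH").
            latin = latin.capitalize() if nxt.islower() else latin.upper()
        out.append(latin)
    return "".join(out)


print(transliterate("\u0428\u0435\u043c\u0430"))        # Shema
print(transliterate("\u0428\u0415\u041c\u0410"))        # SHEMA
print(transliterate("\u0428\u0415\u043c\u0430"))        # SHEma
print(transliterate("\u0421\u0445\u0435\u043c\u0430"))  # Skhema
```

The sketch does not implement the word-initial "Ye" seen in the last
uconv example ("SH Yema"); it only covers the case adjustment.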

> I think we should today leave "=D0=A8"->"SH" and "=D0=A1=D1=85"->"Sh", si=
nce it's
> the most conservative position that avoids ambiguity, and then we
> can discuss the aesthetics of this and the other impacts and solutions.
>=20
> I appreciate Rafal's position, but I think being conservative here,
> even if it's not as pretty as uconv, is a good guiding idea.

Just to summarize: if you want to apply the relaxed rules, more
technical than linguistic, then I am more willing to accept these
patches.

Regards,

Rafal