From: Rafal Luzynski
To: Carlos O'Donell, Marko Myllynen, "Diego (Egor) Kobylkin",
    "libc-alpha@sourceware.org", "libc-locales@sourceware.org",
    Siddhesh Poyarekar
Cc: Mike Fabian
Date: Mon, 10 Jun 2019 22:44:00 -0000
Subject: Re: [PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
Message-ID: <1728627823.1766022.1560206406480@poczta.nazwa.pl>
In-Reply-To: <761147fe-75d8-fbbf-b75a-1b58323254f9@redhat.com>
References: <2030695416.914859.1559778544120@poczta.nazwa.pl>
    <1640311749.1550210.1559856673283@poczta.nazwa.pl>
    <054f3b06-3ca8-00b0-ee07-1ff86a4106dc@redhat.com>
    <956159024.1658672.1559904734686@poczta.nazwa.pl>
    <761147fe-75d8-fbbf-b75a-1b58323254f9@redhat.com>

7.06.2019 14:35 Carlos O'Donell wrote:
> On 6/7/19 6:52 AM, Rafal Luzynski wrote:
> [...]
> >
> > uconv implements a smart algorithm to adjust the upper/lower case:
> >
> > ==================================================================
> > $ echo "Схема" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
> > Skhema
> >
> > $ echo "Шема" | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
> > Shema
> >
> > $ echo "ШЕМА" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
> > SHEMA
> >
> > $ echo "ШЕма" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
> > SHEma
> >
> > $ echo "Ш Ема" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
> > SH Yema
> > ==================================================================
> >
> > Also for them it is easier because they decided that "Х" should be
> > transliterated to "KH" (I think this is the common thing when
> > transliterating to English) while ISO 9 says it should be transliterated
> > to "H" and GOST says it should be "X".  We can't implement this
> > fallback in glibc because the glibc algorithm is very simple.
>
> *Sigh*
>
> I should have known you could find enough examples that contradict
> each other :-)

Sorry about the confusion.  My aim was to demonstrate how uconv adjusts
upper/lower case depending on the context, so that "Ш" sometimes becomes
"SH" and sometimes "Sh".

7.06.2019 14:59 "Diego (Egor) Kobylkin" wrote:
> [...]
> It's quite simple really - suppose you have a list of pages in a
> wikipedia.
>
> For example there are these two entries in Russian:
> 1.Шема
> https://ru.wikipedia.org/w/index.php?title=%D0%A8%D0%B5%D0%BC%D0%B0&redirect=no
>
>
> 2.Схема https://ru.wikipedia.org/wiki/%D0%A1%D1%85%D0%B5%D0%BC%D0%B0
>
>
> So you want to scrape wikipedia and write them out to files: Шема.txt and
> Схема.txt
> But the target system doesn't support Russian locale and so you must
> transliterate the filenames.

While talking about the filesystem: I think the problem is not that
it does not support the Russian locale but that it tries to handle it
and fails at this.  If the filesystem accepted any byte string as
a file name, wouldn't it accept a byte string which forms correct
Cyrillic characters in UTF-8, without any transliteration?

> If "Ш"->"Sh" and "Сх"->"Sh", both of them will be written into the same
> file "Shema.txt". With no other special handling the first file will be
> overwritten and its data lost.
>
> If "Ш"->"SH" and "Сх"->"Sh" - there will be two separate files 1.
> SHema.txt 2. Shema.txt . No data loss in this case.
> [...]

The problem is exclusively in the limitation of glibc itself.  In fact
no standard says that "Ш" should be transliterated as "Sh" (or "SH")
and "Х" as "H" (consequently, "Сх" as "Sh").  ISO 9 says that "Ш" should
be "Š" and "Х" should be "H" (consequently, "Сх" should be "Sh" but that
would never be confused with "Ш").  GOST 7.79 says that "Ш" should be
"SH" (or "Sh") and "Х" should be "X" (consequently, "Сх" should be "Sx").
There is no confusion in any case.
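
To make the collision concrete, here is a minimal Python sketch.  The
tables are hypothetical and cover only the letters of the two titles;
they are neither the glibc locale data nor complete ISO 9 / GOST 7.79
tables, and the scheme names are made up for illustration.

```python
# Letters shared by all schemes (only what "Шема" and "Схема" need).
COMMON = {"С": "S", "е": "e", "м": "m", "а": "a"}

# Hypothetical, partial tables for the mappings discussed in this thread.
SCHEMES = {
    "naive": {**COMMON, "Ш": "Sh", "х": "h"},         # Ш->Sh, х->h
    "uppercase_sh": {**COMMON, "Ш": "SH", "х": "h"},  # Ш->SH, х->h
    "gost": {**COMMON, "Ш": "Sh", "х": "x"},          # GOST 7.79 style: х->x
    "iso9": {**COMMON, "Ш": "Š", "х": "h"},           # ISO 9 style: Ш->Š
}


def transliterate(word, table):
    """Replace each character using the table; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in word)


def collides(table):
    """True if both Wikipedia titles end up with the same transliteration."""
    return transliterate("Шема", table) == transliterate("Схема", table)


for name, table in SCHEMES.items():
    print(name, transliterate("Шема", table), transliterate("Схема", table),
          "COLLISION" if collides(table) else "ok")
```

Under these assumptions only the "naive" table maps both titles to the
same string; the other three keep them distinct.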
The problem is that we can't express all these rules in the language
of glibc transliterations; the rule:

Х "H";"X"

will not work because it would choose the transliteration "X" only if
"H" was not available in the target charset (which never happens) while
we want it to choose "X" if "Š" is not available.

7.06.2019 23:17 Carlos O'Donell wrote:
> [...]
> I also think your point about "technical" is relevant here, nobody
> really wants to read the transliterated results, they want to read
> the original, and providing any hint about the original form has
> value.

It looks like I totally misunderstood the purpose.  I always thought
the aim was to produce a transliteration system for real natural language
texts and to achieve the same output as a human writer would produce.
I still think this is possible, at least partially and not necessarily
in the current development cycle.  If you guys want only a technical
hint and want to relax the linguistic rules then Egor's patches are
mostly sufficient.

> In glibc we don't have any framework for an intelligent conversion.
> We would have to write specific code to handle this case and add
> it into the translit code for special handling in this case.

My suggestion was to add such an intelligent conversion.  The rule
should be simple: if a letter is followed by a lowercase letter it
should be titlecase (Sh), otherwise it should be uppercase (SH).  But
this may break Egor's requirement to keep them always uppercase.

> I think we should today leave "Ш"->"SH" and "Сх"->"Sh", since it's
> the most conservative position that avoids ambiguity, and then we
> can discuss the aesthetics of this and the other impacts and solutions.
>
> I appreciate Rafal's position, but I think being conservative here,
> even if it's not as pretty as uconv, is a good guiding idea.

Just to summarize: if you want to apply the relaxed rules, more
technical than linguistic, then I am more willing to accept these patches.
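
For reference, the case rule suggested above (titlecase when followed
by a lowercase letter, uppercase otherwise) could be sketched in Python
as follows.  This is only an illustration under stated assumptions: the
tables and the function name are hypothetical, it is not glibc code,
and it does not model other BGN rules such as word-initial "Е" -> "Ye".

```python
# Hypothetical table of uppercase letters with multi-letter targets.
MULTI = {"Ш": "SH"}

# Hypothetical one-to-one mappings for the other letters in the examples.
SIMPLE = {"Е": "E", "М": "M", "А": "A", "е": "e", "м": "m", "а": "a"}


def transliterate_with_case(text):
    """Emit "Sh" before a lowercase letter and "SH" in every other context."""
    out = []
    for i, ch in enumerate(text):
        if ch in MULTI:
            # Look one character ahead; the end of input counts as "not lowercase".
            nxt = text[i + 1] if i + 1 < len(text) else ""
            target = MULTI[ch]
            out.append(target.capitalize() if nxt.islower() else target)
        else:
            out.append(SIMPLE.get(ch, ch))
    return "".join(out)


for word in ("Шема", "ШЕМА", "ШЕма"):
    print(word, "->", transliterate_with_case(word))
```

For these three inputs the sketch gives "Shema", "SHEMA" and "SHEma",
the same as the uconv results quoted earlier.  With this rule "Шема"
and "Схема" would both become "Shema" whenever "х" maps to "h", which
is the ambiguity the always-uppercase "SH" avoids.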

Regards,

Rafal