From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-locales-return-6692-listarch-libc-locales=sources.redhat.com@sourceware.org>
Received: (qmail 94142 invoked by alias); 19 Apr 2019 22:24:26 -0000
Mailing-List: contact libc-locales-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-locales.sourceware.org>
List-Subscribe: <mailto:libc-locales-subscribe@sourceware.org>
List-Post: <mailto:libc-locales@sourceware.org>
List-Help: <mailto:libc-locales-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: libc-locales-owner@sourceware.org
Received: (qmail 94122 invoked by uid 89); 19 Apr 2019 22:24:26 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: =?ISO-8859-1?Q?No, score=-4.5 required=5.0 tests=AWL,BAYES_00,BODY_8BITS,GARBLED_BODY,KAM_NUMSUBJECT,RCVD_IN_DNSWL_NONE autolearn=no version=3.3.1 spammy==d0=a1=d1, H*x:Mailer, H*UA:Mailer, Latin?=
X-HELO: shared-ano163.rev.nazwa.pl
X-Spam-Score: -0.4
Date: Fri, 19 Apr 2019 22:24:00 -0000
From: Rafal Luzynski <digitalfreak@lingonborough.com>
To: Marko Myllynen <myllynen@redhat.com>, Egor Kobylkin <egor@kobylkin.com>,
	libc-alpha@sourceware.org, libc-locales@sourceware.org,
	Carlos O'Donell <carlos@redhat.com>
Cc: Siddhesh Poyarekar <siddhesh@gotplt.org>,
	Mike Fabian <mfabian@redhat.com>
Message-ID: <32577152.740058.1555712661835@poczta.nazwa.pl>
In-Reply-To: <cf4a6fd5-e3f4-ea19-4a05-1afb72a110f7@redhat.com>
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com>
 <20180412224352.GB2911@altlinux.org>
 <a1db6ae3-2847-1482-b849-dd383e8c85aa@kobylkin.com>
 <2124833400.35614.1546698902753@poczta.nazwa.pl>
 <908ed415-cfe4-804c-f421-4351ef062edc@kobylkin.com>
 <e3013c9d-b307-0a2e-4736-7bfccd0fb8fb@redhat.com>
 <6d076299-babd-406a-b1fe-87778f54bf36@kobylkin.com>
 <e560eb2e-ec8c-212a-c38a-cd44847ba8df@redhat.com>
 <41aff10b-9cf1-638c-4fbc-8c4f4122f2e9@kobylkin.com>
 <cf4a6fd5-e3f4-ea19-4a05-1afb72a110f7@redhat.com>
Subject: Re: [PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ
 #2872] ping for 2.30
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-SW-Source: 2019-q2/txt/msg00028.txt.bz2

Thank you, Siddhesh and Carlos, for your involvement in testing this
patch.  I apologize to Egor, Marko, and everyone else who needs this
patch pushed for my poor involvement.  I'd like to reply to this email
from Marko because it summarizes all the issues.  I also hope to explain
the problems that got me stuck.

14.02.2019 17:48 Marko Myllynen <myllynen@redhat.com> wrote:
> [...]
> 1) Built-in C locale doesn't read/use any translit_* files and it can't
> have any fallback mechanisms and it only supports ASCII so using GOST
> 7.79 System B in locale/C-translit.h.in (as per patch v12) would seem to
> be the appropriate way to implement Cyrillic transliteration for the
> built-in C locale (it adds some 8KB to the binary).

This sounds like a good idea.

Also, the C locale is probably a good way to enforce plain ASCII
transliteration without any fallback.

> 2) Other locales read/use translit_* files and with them fallbacks and
> non-ASCII are possible so it would seem preferable to first try ISO 9 /
> GOST 7.79 System A

OK, we agree here.

> and only if that fails then use GOST 7.79 System B
> (in which case the end result should match with the built-in C locale).

This is impossible because of the following case.  System A transliterates
the Cyrillic "Х" to Latin "H", System B transliterates it to Latin "X".
Transliteration as implemented in glibc supports only a simple fallback
algorithm: transliterate the letter "X" to "YY" but if it is not available
then to "ZZ".  It can't support the complex algorithm which we need here:
transliterate "X" to "YY" but if "Q" cannot be transliterated to "RR" then
transliterate "X" to "ZZ".  In our case we would like to transliterate "Х"
to "X" if "Ш" cannot be transliterated to "Š".  The only thing we can
implement is a fallback transliteration which is similar to System B but
not 100% compatible.
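
To make the limitation concrete, here is a minimal Python sketch.  It is
only a simplified model of the per-character fallback described above,
not glibc code: the FALLBACK table, the helper names and the choice of
ISO-8859-2 as an example charset containing "Š" are all assumptions made
for illustration.

```python
# Simplified model of per-character fallback transliteration.
# This is NOT glibc code; FALLBACK and the helpers are hypothetical.

# Each Cyrillic letter lists its candidates in order of preference:
# System A (ISO 9) first, then System B.
FALLBACK = {
    "С": ["S"], "с": ["s"],
    "Х": ["H", "X"], "х": ["h", "x"],
    "Ш": ["Š", "Sh"], "ш": ["š", "sh"],
    "е": ["e"], "м": ["m"], "а": ["a"],
}

# Pure System B: always take the last (ASCII-only) candidate.
SYSTEM_B = {letter: cands[-1] for letter, cands in FALLBACK.items()}


def encodable(text, charset):
    """Return True if text can be represented in the target charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def translit_fallback(text, charset):
    """Pick, for each letter independently, the first encodable candidate."""
    out = []
    for ch in text:
        for cand in FALLBACK.get(ch, [ch]):
            if encodable(cand, charset):
                out.append(cand)
                break
        else:
            out.append("?")  # no candidate fits the target charset
    return "".join(out)


def translit_system_b(text):
    """Transliterate with System B only, no fallback involved."""
    return "".join(SYSTEM_B.get(ch, ch) for ch in text)


print(translit_fallback("Схема", "iso8859_2"))  # Shema
print(translit_fallback("Шема", "iso8859_2"))   # Šema
print(translit_fallback("Схема", "ascii"))      # Shema
print(translit_fallback("Шема", "ascii"))       # Shema
print(translit_system_b("Схема"))               # Sxema
print(translit_system_b("Шема"))                # Shema
```

With an ASCII target the fallback still picks "h" for "х", because that
choice cannot depend on whether "ш" found its "š".  The result is a mix of
both systems in which two different words collide, while pure System B
keeps them apart.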

This is not a problem if we implement only System B in the C locale,
because we already know that "Š" is unavailable, so we always have to
transliterate "Х" to "X".

> For this the translit_cyrillic file should be added (as per patch v9 +
> changes mentioned in patches v10 and v12).
>
> 3) Individual locale files can then be updated to use translit_cyrillic
> as appropriate (see patch v9) and language/national specific conventions
> (e.g., SFS 4900 for fi_FI) can be applied on per-locale basis.

Sometimes I wonder whether any locale other than one for a language
written in the Cyrillic script would really want a Cyrillic
transliteration, but on the other hand, why not.

Also I'd like to reiterate other disagreements which we have here:

1. How to handle upper/lower case in System B?  Should we transliterate
   "Ш" to "SH" or "Sh"?  Should we maybe implement a smart context-based
   casing algorithm first?  I mean an algorithm which would detect whether
   an uppercase letter is the first letter of an otherwise lowercase
   word, and so should be transliterated as "Sh", or appears in a fully
   uppercase word, and so should be transliterated as "SH".
   I think that uconv implements this algorithm.
2. How to handle ambiguous transliterations like "Схема" -> "Shema"
   vs. "Шема" -> "Shema"? "SHema"?
3. How to handle the characters which are proper letters in Cyrillic
   and have an upper and lower case like a hard and soft sign but are
   transliterated to punctuation characters (grave accent "`")?
   Should we transliterate upper and lower case to the same character
   or should we mark them somehow?  uconv adds Unicode combining low
   line to the grave accent (so the output is "`̲") if the original
   Cyrillic character was uppercase.  But this is unavailable if
   our target charset is ASCII.
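
The context-based casing idea from item 1 could be sketched roughly as
follows.  This is only an illustration in Python under my own assumptions:
it is not uconv's algorithm and not a proposal for the glibc
implementation, the table covers just a few System B letters, and the
names SYSTEM_B_LOWER and translit_cased are hypothetical.

```python
# Hypothetical sketch of context-based casing for System B digraphs.
# Not uconv's algorithm; only a handful of letters are covered.

SYSTEM_B_LOWER = {
    "ш": "sh", "ж": "zh", "ч": "ch",
    "с": "s", "х": "x", "е": "e", "м": "m", "а": "a",
}


def translit_cased(word):
    """Transliterate one word, choosing "SH" or "Sh" from the neighbours."""
    out = []
    for i, ch in enumerate(word):
        latin = SYSTEM_B_LOWER.get(ch.lower())
        if latin is None:
            out.append(ch)  # letter not in the table: keep unchanged
            continue
        if ch.isupper():
            nxt = word[i + 1] if i + 1 < len(word) else ""
            prev = word[i - 1] if i > 0 else ""
            # All-caps context: the next letter is uppercase, or there is
            # no next letter and the previous one is uppercase.
            in_caps = nxt.isupper() or (not nxt.isalpha() and prev.isupper())
            latin = latin.upper() if in_caps else latin.capitalize()
        out.append(latin)
    return "".join(out)


print(translit_cased("Шема"))  # Shema
print(translit_cased("ШЕМА"))  # SHEMA
print(translit_cased("шема"))  # shema
print(translit_cased("Ш"))     # Sh
```

A single uppercase letter with no context, as in the last call, stays
ambiguous; this sketch arbitrarily picks "Sh".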

Regarding the test cases which I mentioned the other day: I discussed
this with Dmitry and he convinced me that requiring them would set the
bar too high, so I agree we no longer need to require them.

Regards,

Rafal