From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-locales-return-6498-listarch-libc-locales=sources.redhat.com@sourceware.org>
Received: (qmail 96952 invoked by alias); 16 Nov 2018 22:17:42 -0000
Mailing-List: contact libc-locales-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-locales.sourceware.org>
List-Subscribe: <mailto:libc-locales-subscribe@sourceware.org>
List-Post: <mailto:libc-locales@sourceware.org>
List-Help: <mailto:libc-locales-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: libc-locales-owner@sourceware.org
Received: (qmail 96927 invoked by uid 89); 16 Nov 2018 22:17:41 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: =?ISO-8859-1?Q?No, score=1.8 required=5.0 tests=AWL,BAYES_00,BODY_8BITS,GARBLED_BODY,KAM_LAZY_DOMAIN_SECURITY,RCVD_IN_DNSWL_NONE autolearn=no version=3.3.2 spammy==d0=a1=d1, letter?=
X-HELO: shared-ano163.rev.nazwa.pl
X-Spam-Score: 0
Date: Fri, 16 Nov 2018 22:17:00 -0000
From: Rafal Luzynski <digitalfreak@lingonborough.com>
To: Egor Kobylkin <egor@kobylkin.com>, libc-alpha@sourceware.org,
	libc-locales@sourceware.org
Message-ID: <837001401.21346.1542406647888@poczta.nazwa.pl>
In-Reply-To: <b82fe65b-b880-a2b5-c97d-2a6aae9c1165@kobylkin.com>
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com>
 <20180412224352.GB2911@altlinux.org>
 <b82fe65b-b880-a2b5-c97d-2a6aae9c1165@kobylkin.com>
Subject: Re: [PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ
 #2872]
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-SW-Source: 2018-q4/txt/msg00110.txt.bz2

Thank you for working on this, Egor.

Before I start reviewing I would like to summarize the issues which
I think block this patch.

1. I think we need tests for transliteration.  Currently there is only
   one test program which is similar to what we need,
   localedata/bug-iconv-trans.c.  It is old and it is not quite clear
   what bug it is trying to test.  Therefore I think we need a new
   framework to test transliteration.  Is it a good idea to base the
   test on the iconv(1) command line utility which is part of glibc?

2. I ran a few tests on the command line and it seems to me that the
   transliteration from "З" to "Z" (and the lowercase equivalent) in
   uk_UA does not work, and has not worked for some time: I checked
   some older systems as well and the result is always the same.  I
   think the reason is that uk_UA defines multiple transliteration
   rules for "З" depending on which letter follows it.  This does not
   seem to work.  AFAIK the reason is that the syntax of
   transliteration rules says that a single non-Latin character may
   map to one or more Latin strings, each consisting of one or more
   characters.  There cannot be a rule transliterating multiple source
   characters into one or multiple destination characters.  Is it a
   bug in the transliteration implementation?  Or maybe in the
   specification, including the POSIX standard?  The definition of
   transliteration says that it is a one-to-one mapping of graphemes,
   and a grapheme may be one or multiple characters, so it does not
   always have to be a character-to-character mapping.  Should we fix
   this bug first, make uk_UA transliteration work, and only then add
   a generic Cyrillic transliteration?  Egor's patch already contains
   a transliteration of "У" + combining acute accent to "Ú" which most
   probably will not work.
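
To make item 1 a bit more concrete, a check driven by iconv(1) could
look roughly like the sketch below.  It is written in Python only for
illustration; the function names are made up, and it assumes that
iconv is on PATH and that the locale under test (for example
uk_UA.UTF-8) is installed.  It does not claim anything about what
iconv actually prints for a given input.

```python
import os
import subprocess


def iconv_translit_command(from_charset="UTF-8", to_charset="ASCII"):
    """Build the iconv(1) argument list that requests transliteration."""
    return ["iconv", "-f", from_charset, "-t", to_charset + "//TRANSLIT"]


def transliterate(text, locale_name):
    """Run iconv(1) under the given locale and return its output.

    Assumption: iconv is on PATH and locale_name is installed.
    Neither is verified here.
    """
    env = dict(os.environ, LC_ALL=locale_name)
    result = subprocess.run(
        iconv_translit_command(),
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        env=env,
        check=True,
    )
    return result.stdout.decode("ascii")


def check(cases, locale_name):
    """Return the (source, expected, actual) triples that do not match."""
    failures = []
    for source, expected in cases:
        actual = transliterate(source, locale_name)
        if actual != expected:
            failures.append((source, expected, actual))
    return failures
```

The uk_UA case from item 2 would then be expressed as
check([("\u0417", "Z")], "uk_UA.UTF-8"), where U+0417 is the Cyrillic
capital letter Ze; an empty list would mean the rule works.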

I still think that in the longer term all existing custom transliterations
of Cyrillic alphabets should be ported to a modification of your patch.

Egor, while at it, I was thinking about your idea to transliterate
letters like "Ш" (uppercase) to "SH" (always uppercase) in order to
distinguish between "Шема" (-> "SHema") and "Схема" (-> "Shema" or
"Sxema").  You also include a rule to transliterate "Х" to "H" or "X"
depending on which destination characters are available.  As I already
told you, this will not work: both "H" and "X" are always available, so
only the first rule will ever be used.  I still don't like the idea of
putting two uppercase letters at the beginning of a titlecase word only
to indicate that there was originally a single letter.  What if we:

* drop the rule of transliterating "Х" to "H" and always transliterate
  it to "X",
* transliterate uppercase "Ш" to "Sh" (so it will work fine for
  titlecase words)?

As a result the Latin letter "h" will only appear as part of a digraph
and never as a transliteration of "Х", and therefore will never cause a
conflict.
Examples:

* "Шема" -> "Shema",
* "Схема" -> "Sxema".

Will this solve the problem?
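
To illustrate what I mean, here is a small Python model of the rules.
It is not how glibc implements transliteration; the tables contain only
the letters needed for the two example words, and all names are
invented for this sketch.  It shows that a rule with the alternatives
"H" and "X" always yields "H" when the target is ASCII, that mapping
"х" to "h" makes both words come out the same, and that the proposed
mapping keeps them apart.

```python
def pick(candidates, target="ascii"):
    """Return the first candidate that the target charset can encode."""
    for candidate in candidates:
        try:
            candidate.encode(target)
        except UnicodeEncodeError:
            continue
        return candidate
    return "?"


def transliterate(text, table):
    """Replace each character found in table; pass the rest through."""
    return "".join(pick(table[ch]) if ch in table else ch for ch in text)


# Hypothetical rule with two alternatives for U+0425 (capital Kha).
TWO_ALTERNATIVES = {"\u0425": ["H", "X"]}

# Illustrative subset of the proposed scheme, not the full table.
PROPOSED = {
    "\u0428": ["Sh"],  # capital Sha
    "\u0425": ["X"],   # capital Kha
    "\u0445": ["x"],   # small kha
    "\u0421": ["S"],   # capital Es
    "\u0435": ["e"],   # small ie
    "\u043c": ["m"],   # small em
    "\u0430": ["a"],   # small a
}

# Same subset, but with small kha mapped to "h" for comparison.
KHA_AS_H = dict(PROPOSED)
KHA_AS_H["\u0445"] = ["h"]

SHEMA = "\u0428\u0435\u043c\u0430"        # Sha, ie, em, a
SKHEMA = "\u0421\u0445\u0435\u043c\u0430"  # Es, kha, ie, em, a

if __name__ == "__main__":
    print(transliterate("\u0425", TWO_ALTERNATIVES))
    print(transliterate(SHEMA, KHA_AS_H), transliterate(SKHEMA, KHA_AS_H))
    print(transliterate(SHEMA, PROPOSED), transliterate(SKHEMA, PROPOSED))
```

Because "X" can never be reached in the first table, listing it as an
alternative has no effect, which is why I suggest dropping "H" for
that letter altogether.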

Regards,

Rafal