From: Rafal Luzynski
To: Egor Kobylkin, libc-alpha@sourceware.org, libc-locales@sourceware.org
Date: Fri, 16 Nov 2018 22:17:00 -0000
Subject: Re: [PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
Message-ID: <837001401.21346.1542406647888@poczta.nazwa.pl>
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com> <20180412224352.GB2911@altlinux.org>

Thank you for working on this, Egor.

Before I start reviewing I would like to summarize the things which I think are blocking for this patch.

1.
I think we need tests for transliteration. Currently there is only one test program which is similar to what we need, localedata/bug-iconv-trans.c. It is old and it is not quite clear what bug it is trying to test. Therefore I think we need a new framework to test transliteration. Is it a good idea to base the test on the iconv(1) command line utility which is part of glibc?

2.
I ran a few tests on the command line and it seems to me that the transliteration from "З" to "Z" (and the lowercase equivalent) in uk_UA does not work. It has not been working for some time: I checked some older systems as well and the result is always the same.

I think the reason is that uk_UA defines multiple transliteration rules for "З" depending on the letter that follows it, and this does not seem to work. AFAIK the syntax of transliteration rules says that a single non-Latin character may map to one or more Latin strings, each consisting of one or more characters. There cannot be a rule transliterating multiple source characters into one or more destination characters.

Is this a bug in the transliteration implementation? Or maybe in the specification, including the POSIX standard? The definition of transliteration says that it is a one-to-one mapping of graphemes, and a grapheme may be one or multiple characters. It does not always have to be a one-to-one mapping of characters.

Should we fix this bug first, make the uk_UA transliteration work, and only then add a generic Cyrillic transliteration? Egor's patch already contains a transliteration of "У" + combining acute accent to "Ú", which most probably will not work.

I still think that in the longer term all existing custom transliterations of Cyrillic alphabets should be ported to a modification of your patch.

Egor, while at it, I was thinking about your idea to transliterate letters like "Ш" (uppercase) to "SH" (always uppercase) in order to distinguish between "Шема" (-> "SHema") and "Схема" (-> "Shema" or "Sxema"). You also include a rule to transliterate "Х" to "H" or "X" depending on which destination characters are available. As I told you already, that will not work: both "H" and "X" are always available, so only the first alternative will ever be used.
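To make the last point concrete, here is a rough sketch in Python. It is only a toy model of "use the first alternative that the target charset can encode"; it is an assumption for illustration, not glibc's code or its locale file syntax. The names RULES, pick_first_encodable and translit are made up, and the "Ш" rule with a non-ASCII first choice is hypothetical.

```python
# Toy model of ordered fallback rules. This is an assumption made for
# illustration, not glibc's implementation or its locale file syntax.
RULES = {
    "\u0425": ["H", "X"],        # Х: the two-alternative rule under discussion
    "\u0445": ["h", "x"],        # х
    "\u0428": ["\u0160", "Sh"],  # Ш: hypothetical rule, non-ASCII first choice
    "\u0421": ["S"],             # С
    "\u0435": ["e"],             # е
    "\u043c": ["m"],             # м
    "\u0430": ["a"],             # а
}


def pick_first_encodable(candidates, charset):
    """Return the first candidate that the target charset can encode."""
    for candidate in candidates:
        try:
            candidate.encode(charset)
        except UnicodeEncodeError:
            continue
        return candidate
    return "?"  # nothing in the list fits the target charset


def translit(text, charset="ascii"):
    """Transliterate character by character using RULES."""
    return "".join(
        pick_first_encodable(RULES.get(ch, [ch]), charset) for ch in text
    )


if __name__ == "__main__":
    print(translit("\u0425"))                          # Х
    print(translit("\u0421\u0445\u0435\u043c\u0430"))  # Схема
    print(translit("\u0428\u0435\u043c\u0430"))        # Шема
```

In this model "X" is never reached for "Х", because "H" is encodable in every target charset that also has "X". A fallback only matters when the first alternative is missing from the target charset, as with the hypothetical "Ш" rule. With these rules both "Схема" and "Шема" come out as "Shema" in ASCII, which is the conflict described above.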
I still don't like the idea of putting two uppercase letters at the beginning of a titlecase word only to indicate that there was originally a single letter. What if we:

* drop the rule transliterating "Х" to "H" and always transliterate it to "X",
* transliterate uppercase "Ш" to "Sh" (so it will work fine for titlecase words)?

As a result the Latin letter "h" will only appear as part of a digraph and never as a transliteration of "Х", and therefore will never cause a conflict. Examples:

* "Шема" -> "Shema",
* "Схема" -> "Sxema".

Will this solve the problem?

Regards,

Rafal