From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-alpha-return-96337-listarch-libc-alpha=sources.redhat.com@sourceware.org>
Received: (qmail 75788 invoked by alias); 10 Oct 2018 11:23:07 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Received: (qmail 75670 invoked by uid 89); 10 Oct 2018 11:23:06 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-1.1 required=5.0 tests=AWL,BAYES_00,KAM_NUMSUBJECT,RCVD_IN_DNSWL_NONE autolearn=no version=3.3.2 spammy=ups, instantly, 10102018, H*r:sk:i8-v6so
X-HELO: mail-wm1-f66.google.com
Return-Path: <myllynen@redhat.com>
Reply-To: Marko Myllynen <myllynen@redhat.com>
Subject: Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ
 #2872] re-submission for 2.29
To: Egor Kobylkin <egor@kobylkin.com>,
 Rafal Luzynski <digitalfreak@lingonborough.com>
Cc: Keld Simonsen <keld@keldix.com>, libc-alpha@sourceware.org,
 libc-locales@sourceware.org, "Dmitry V. Levin" <ldv@altlinux.org>,
 Volodymyr Lisivka <vlisivka@gmail.com>, Carlos O'Donell <carlos@redhat.com>,
 Max Kutny <mkutny@gmail.com>, danilo@gnome.org
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com>
 <16e785f3-2e9f-ceb2-698f-dc33c91a5d5e@kobylkin.com>
 <ac4c9b3e-aeae-30de-23ef-24d8f53d7bc4@kobylkin.com>
 <20181003091949.GA21486@rap.rap.dk>
 <21d872b2-613e-d1f5-26c0-baa4b5721df9@kobylkin.com>
 <1485772360.805333.1538731225156@poczta.nazwa.pl>
 <deacdf31-d0bb-a92d-1de3-934d6b4cb158@kobylkin.com>
 <bda2ca60-18f1-3b19-91e5-c9ad144bc834@redhat.com>
 <bb4e1ba5-5fa5-2986-2573-7d27be226124@kobylkin.com>
 <69e26cab-810e-824b-3b16-b75ac44d8b0c@redhat.com>
 <b8f02fe9-f911-487f-b50b-9b0c43191cb6@kobylkin.com>
 <f51992ad-008b-03a4-8880-4c12edced53b@redhat.com>
 <246390048.827062.1539037422672@poczta.nazwa.pl>
 <4db1ce91-3184-cf45-01c5-80667fc4cf65@kobylkin.com>
 <f6b530b0-53b7-bd90-9bb9-864d0a477f50@kobylkin.com>
 <a9af47d8-bf3d-e607-38e1-a6e765a604d3@kobylkin.com>
 <1198370378.413479.1539123456488@poczta.nazwa.pl>
 <70c29e42-0fd3-4f10-fafb-44d67190d870@kobylkin.com>
 <c89f0ac3-6ccb-3e41-dc26-75ef03d9afa1@kobylkin.com>
From: Marko Myllynen <myllynen@redhat.com>
Message-ID: <9edcf6f2-607c-91ac-8eaf-ffbc973fe597@redhat.com>
Date: Wed, 10 Oct 2018 12:12:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.9.1
MIME-Version: 1.0
In-Reply-To: <c89f0ac3-6ccb-3e41-dc26-75ef03d9afa1@kobylkin.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-SW-Source: 2018-10/txt/msg00157.txt.bz2

Hi,

On 2018-10-10 01:42, Egor Kobylkin wrote:
> Ups, sorry, wrong link to the patch
> correct link https://sourceware.org/bugzilla/attachment.cgi?id=11303

Although I haven't checked every rule, this looks very good in general
(but see below). I'm not sure whether we want to add the few missing
characters mentioned at
https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode; for example,
one instantly notices that U+0400 is missing. (I wouldn't add the more
exotic characters, like the historic ones, at least initially, though.)
Perhaps filing a bug or two for these cases, for separate consideration,
would be OK.
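
For what it's worth, gaps like this could be listed mechanically. The
following is only a rough sketch in Python: the function name
missing_cyrillic and the `covered` set are made up for illustration (the
set is a placeholder for whatever code points the table handles, not the
actual contents of the patch), and the range is just the basic Cyrillic
block.

```python
import unicodedata


def missing_cyrillic(covered, first=0x0400, last=0x04FF):
    """Return (code point, name) pairs assigned in [first, last] but not in `covered`."""
    missing = []
    for cp in range(first, last + 1):
        # unicodedata.name() returns the default (None) for unassigned code points.
        name = unicodedata.name(chr(cp), None)
        if name is not None and cp not in covered:
            missing.append((cp, name))
    return missing


# Hypothetical coverage: only the basic Russian letters U+0410..U+044F.
covered = set(range(0x0410, 0x0450))
for cp, name in missing_cyrillic(covered)[:5]:
    print(f"U+{cp:04X} {name}")
```

Feeding it the real set of left-hand-side code points from the table
would give a list to file bugs against.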

> On 10.10.2018 00:40, Egor Kobylkin wrote:
>> On 10.10.2018 00:17, Rafal Luzynski wrote:
>>> 9.10.2018 20:34 Egor Kobylkin <egor@kobylkin.com> wrote:
>>>>
>>>> The culprits were the "" around the "<U0423><U0301>" (<U00DA>) and
>>>> "<U0443><U0301>" (<U00FA>).
>>>> It works now with
>>>> % CYRILLIC UNDEFINED
>>>> <U0423><U0301> <U00DA>;"<U0055><U0060>"
>>>> % CYRILLIC UNDEFINED
>>>> <U0443><U0301> <U00FA>;"<U0075><U0060>"
>>>>
>>>> [...]
>>>
>>> I wonder why you need Cyrillic U with acute, and why you comment it
>>> as "undefined" at all.  I know that any Cyrillic vowel may appear with
>>> an acute accent but "the diacritic is used only in dictionaries, children's
>>> books, resources for foreign-language learners (...)". [1]  So maybe
>>> all vowels with an acute accent should be handled (which I think is fine)
>>> rather than just U.
>>
>> I have just taken the https://en.wikipedia.org/wiki/ISO_9 table and
>> implemented it on Marko's suggestion. Personally I have no opinion on
>> what letters should be included and under what name. These funny Us just
>> happened to be in the ISO9 table.
>>
>> There is no codepoint and no name for <U0423><U0301> and <U0443><U0301>
>> in Unicode. That's why it's coming through that way from my worksheet as
>> it does a reverse lookup on the names based on the Unicode codepoints.
>>
>> Manually we can change it to whatever you'd suggest in the
>> translit_cyrillic. I just don't know the right name.

I'm not sure this will work: no existing rule in the translit_* files
contains two characters, so I'd assume that the rule for U+0423 is
applied first and the rule below is then never used.

% CYRILLIC UNDEFINED
<U0423><U0301> <U00DA>;"<U0055><U0060>"

Perhaps this should be commented out or removed altogether if it's not
working as intended.
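
To illustrate the concern, here is a toy model in Python. It is only a
sketch of two possible lookup strategies and says nothing about what
glibc actually does; the RULES table and both function names are
invented for the example, and the single-character mapping of U+0423 to
"U" is an assumption.

```python
# Toy rule table: one single-character rule and the two-character rule in question.
RULES = {
    "\u0423": "U",          # assumed single-character rule for U+0423
    "\u0423\u0301": "U`",   # ASCII fallback of the rule quoted above
}


def translit_single(text, rules):
    """Look up one code point at a time; multi-character keys can never match."""
    return "".join(rules.get(ch, ch) for ch in text)


def translit_longest(text, rules):
    """At each position, try the longest matching key first."""
    max_len = max(len(key) for key in rules)
    out = []
    i = 0
    while i < len(text):
        for n in range(max_len, 0, -1):
            chunk = text[i:i + n]
            if chunk in rules:
                out.append(rules[chunk])
                i += len(chunk)
                break
        else:
            # No rule matched at this position: pass the character through.
            out.append(text[i])
            i += 1
    return "".join(out)


sample = "\u0423\u0301"
print(ascii(translit_single(sample, RULES)))
print(ascii(translit_longest(sample, RULES)))
```

With per-character lookup the combining acute is left over after "U",
which is the failure mode I'd expect here; only a longest-match lookup
would reach the two-character rule. Testing the patch with iconv on
real input would settle which case applies.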

Thanks,

-- 
Marko Myllynen