From mboxrd@z Thu Jan 1 00:00:00 1970
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Sender: libc-alpha-owner@sourceware.org
Reply-To: Marko Myllynen
Subject: Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29
To: Egor Kobylkin, Rafal Luzynski
Cc: Keld Simonsen, libc-alpha@sourceware.org, libc-locales@sourceware.org, "Dmitry V. Levin", Volodymyr Lisivka, Carlos O'Donell, Max Kutny, danilo@gnome.org
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com>
 <16e785f3-2e9f-ceb2-698f-dc33c91a5d5e@kobylkin.com>
 <20181003091949.GA21486@rap.rap.dk>
 <21d872b2-613e-d1f5-26c0-baa4b5721df9@kobylkin.com>
 <1485772360.805333.1538731225156@poczta.nazwa.pl>
 <69e26cab-810e-824b-3b16-b75ac44d8b0c@redhat.com>
 <246390048.827062.1539037422672@poczta.nazwa.pl>
 <4db1ce91-3184-cf45-01c5-80667fc4cf65@kobylkin.com>
 <1198370378.413479.1539123456488@poczta.nazwa.pl>
 <70c29e42-0fd3-4f10-fafb-44d67190d870@kobylkin.com>
From: Marko Myllynen
Message-ID: <9edcf6f2-607c-91ac-8eaf-ffbc973fe597@redhat.com>
Date: Wed, 10 Oct 2018 12:12:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-SW-Source: 2018-10/txt/msg00157.txt.bz2

Hi,

On 2018-10-10 01:42, Egor Kobylkin wrote:
> Ups, sorry, wrong link to the patch
> correct link https://sourceware.org/bugzilla/attachment.cgi?id=11303

Although
I haven't checked every rule, this looks very good in general (but see
below).

I'm not sure whether we want to add the few missing characters mentioned
at https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode , e.g., one
instantly notices that U+0400 is missing. (I wouldn't add the more
exotic characters, like the historic ones, at least initially.) Perhaps
filing a bug or two for these cases, for separate consideration, would
be OK.

> On 10.10.2018 00:40, Egor Kobylkin wrote:
>> On 10.10.2018 00:17, Rafal Luzynski wrote:
>>> 9.10.2018 20:34 Egor Kobylkin wrote:
>>>>
>>>> The culprits were the "" around the "" () and
>>>> "" ().
>>>> It works now with
>>>> % CYRILLIC UNDEFINED
>>>> ;""
>>>> % CYRILLIC UNDEFINED
>>>> ;""
>>>>
>>>> [...]
>>>
>>> I wonder why you need Cyrillic U with acute, and why you comment it
>>> as "undefined" at all. I know that any Cyrillic vowel may appear with
>>> an acute accent but "the diacritic is used only in dictionaries, children's
>>> books, resources for foreign-language learners (...)". [1] So maybe
>>> all vowels with an acute accent should be handled (which I think is fine)
>>> rather than just U.
>>
>> I have just taken the https://en.wikipedia.org/wiki/ISO_9 table and
>> implemented it on Marko's suggestion. Personally I have no opinion on
>> what letters should be included and under what name. These funny Us just
>> happened to be in the ISO9 table.
>>
>> There is no codepoint and no name for and
>> in Unicode. That’s why it’s coming through that way from my worksheet as
>> it does a reverse lookup on the names based on the Unicode codepoints.
>>
>> Manually we can change it to whatever you’d suggest in the
>> translit_cyrillic. I just don’t know the right name.

I'm not sure this will work: no existing rule in the translit_* files
contains two characters, so I'd assume that the rule for U+0423 is
applied first and then the below rule is never used.

% CYRILLIC UNDEFINED
;""

Perhaps this should be commented out or removed altogether if it's not
working as intended.

Thanks,

-- 
Marko Myllynen
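
To illustrate the concern about the two-character rule, here is a
minimal Python sketch. It assumes a lookup that goes one code point at a
time, which is only my guess about why the rule might never fire; it is
not glibc's implementation. RULES, translit_single and translit_longest
are hypothetical names, and the replacement strings are made up for the
example rather than taken from the patch.

```python
import unicodedata

# Hypothetical two-entry table; the replacement strings are illustrative
# and are NOT taken from the patch under review.
RULES = {
    "\u0423": "U",         # CYRILLIC CAPITAL LETTER U
    "\u0423\u0301": "U'",  # U followed by COMBINING ACUTE ACCENT
}


def translit_single(text, rules):
    """Look up one code point at a time; multi-character keys are never tried."""
    return "".join(rules.get(ch, ch) for ch in text)


def translit_longest(text, rules):
    """At each position, try the longest candidate key first."""
    max_len = max(len(key) for key in rules)
    out = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in rules:
                out.append(rules[chunk])
                i += n
                break
        else:
            # No rule matched at this position: copy the character through.
            out.append(text[i])
            i += 1
    return "".join(out)


sample = "\u0423\u0301"

# Unicode has no precomposed "Cyrillic U with acute", so NFC normalization
# leaves the pair as two separate code points.
assert len(unicodedata.normalize("NFC", sample)) == 2

print(ascii(translit_single(sample, RULES)))
print(ascii(translit_longest(sample, RULES)))
```

With the single-code-point lookup the U rule consumes the base letter
and the combining accent is passed through untouched, so the
two-character entry is dead weight. With longest-match-first the
two-character entry wins. Which of the two behaviours the real table
processing has is exactly what would need to be checked.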