From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 76151 invoked by alias); 9 Oct 2018 18:34:29 -0000
Mailing-List: contact libc-locales-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id:
List-Subscribe:
List-Post:
List-Help: ,
Sender: libc-locales-owner@sourceware.org
Received: (qmail 76111 invoked by uid 89); 9 Oct 2018 18:34:29 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-0.6 required=5.0
 tests=AWL,BAYES_00,KAM_LAZY_DOMAIN_SECURITY,KAM_NUMSUBJECT,RCVD_IN_DNSWL_NONE
 autolearn=no version=3.3.2
 spammy=upload, 09102018, uppercase, transcription
X-HELO: mout.kundenserver.de
Subject: Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table
 [BZ #2872] re-submission for 2.29
From: Egor Kobylkin
To: Rafal Luzynski, Marko Myllynen
Cc: Keld Simonsen, libc-alpha@sourceware.org, libc-locales@sourceware.org,
 "Dmitry V. Levin", Volodymyr Lisivka, Carlos O'Donell, Max Kutny,
 danilo@gnome.org
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com>
 <20180412224352.GB2911@altlinux.org>
 <16e785f3-2e9f-ceb2-698f-dc33c91a5d5e@kobylkin.com>
 <20181003091949.GA21486@rap.rap.dk>
 <21d872b2-613e-d1f5-26c0-baa4b5721df9@kobylkin.com>
 <1485772360.805333.1538731225156@poczta.nazwa.pl>
 <69e26cab-810e-824b-3b16-b75ac44d8b0c@redhat.com>
 <246390048.827062.1539037422672@poczta.nazwa.pl>
 <4db1ce91-3184-cf45-01c5-80667fc4cf65@kobylkin.com>
Openpgp: preference=signencrypt
Message-ID:
Date: Tue, 09 Oct 2018 18:34:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.9.1
MIME-Version: 1.0
In-Reply-To:
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-SW-Source: 2018-q4/txt/msg00031.txt.bz2

The culprits were the "" around the "" () and "" ().
It works now with

% CYRILLIC UNDEFINED
;""
% CYRILLIC UNDEFINED
;""

The is "combining" and obviously it doesn't work if enclosed in quotes
with the letter codepoint.
Please let me know if there is another explanation.
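A check along these lines might help spot such cases before compiling the locale. This is only a rough sketch, not part of the patch: it assumes the `<Uxxxx>` notation used in glibc locale sources and Python's standard `unicodedata` module, and the helper name `combining_in_quotes` is hypothetical. The sample line uses U+0306 (COMBINING BREVE) purely for illustration; it is not claimed to be the codepoint discussed above.

```python
import re
import unicodedata

# <Uxxxx> tokens as written in glibc locale source files.
UCS_TOKEN = re.compile(r"<U([0-9A-Fa-f]{4,8})>")
# Double-quoted strings on a table line.
QUOTED = re.compile(r'"([^"]*)"')


def combining_in_quotes(line):
    """Return the <Uxxxx> tokens inside double quotes that are combining marks.

    Hypothetical helper: it only reports candidates for manual review and
    makes no claim about how localedef or iconv treat them.
    """
    found = []
    for quoted in QUOTED.findall(line):
        for hex_cp in UCS_TOKEN.findall(quoted):
            # unicodedata.combining() returns a non-zero canonical
            # combining class for combining characters.
            if unicodedata.combining(chr(int(hex_cp, 16))) != 0:
                found.append(f"<U{hex_cp.upper()}>")
    return found


# Illustrative line only; the codepoints are assumptions.
sample = '<U040E> "<U0055><U0306>"'
print(combining_in_quotes(sample))
```

Running it over every non-comment line of a table would list the quoted strings worth a second look.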
I will now make those changes and generate the patch itself.

Egor

On 09.10.2018 15:18, Egor Kobylkin wrote:
> Hi,
>
> I have now implemented all the changes requested for translit_cyrillic
> file but started hitting what seems like a bug:
>
> - If the line ; is present in translit_cyrillic the
> locale compilation fails i.e. grep CYRILLIC < $testfile |
> LOCPATH=$workdir/compiled_locales/"$locale"/ LC_ALL="$locale".UTF-8
> iconv -f UTF-8 -t ASCII//TRANSLIT is hanging frozen.
>
> - If the line ; is absent from translit_cyrillic
> everything works, just the transliteration of fails as expected
> (? is displayed)
>
> - If translit_cyrillic contains ; as the _only_
> line the transliteration of works again (others as ?).
>
> Would you have any idea into what direction should I look? The new
> translit_cyrillic is attached.
>
> ( is % CYRILLIC CAPITAL LETTER HA)
>
> Best regards,
> Egor
>
> On 09.10.2018 01:35, Egor Kobylkin wrote:
>> On 09.10.2018 00:23, Rafal Luzynski wrote:
>>> 8.10.2018 14:40 Marko Myllynen wrote:
>>>> Hi,
>>>>
>>>> Thanks for the update. I have few mostly cosmetic comments below,
>>>> hopefully we'll hear from others whether they agree with this direction.
>>>>
>>
>> Yeah, the earlier we have feedback the more productive we are. I'd be
>> happy to get much feedback on this as early as possible. So please
>> everybody concerned please chime in.
>>
>>>
>>>> - No duplicates:
>>>>
>>>> % CYRILLIC SMALL LETTER IE
>>>> ;
>>>>
>>>> should become:
>>>>
>>>> % CYRILLIC SMALL LETTER IE
>>>>
>>>>
>>>> - There are few issues with the definitions:
>>>>
>>>> % CYRILLIC CAPITAL LETTER U
>>>> ;
>>>> % CYRILLIC UNDEFINED
>>>> ; ""
>>>>
>>>> % CYRILLIC SMALL LETTER U
>>>> ;
>>>> % CYRILLIC UNDEFINED
>>>> ; ""
>>>
>>> Are the duplicates here because some Cyrillic letters may have multiple
>>> Latin transliterations depending on the context, for example Cyrillic IE
>>> must be transliterated sometimes as "e", sometimes as "ie", sometimes
>>> as "ye" or "je"?
>>> Can we provide rules for groups of characters instead?
>> No, the duplicates are just by design of my line generating logic. I
>> have fixed (removed) them. The varying transcription between
>> languages/locales can not be handled in one file at all as far as I
>> understood.
>>
>>>
>>>> I wonder would it be possible to automate generation of this file so
>>>> that issues like the above could be avoided? But perhaps that could be the
>>>> next step once this initial patch lands.
>>
>> I am generating the content part of the translit_cyrillic from the
>> LibreOffice Spreadsheet. Not sure if you had time to view it by now?
>> https://sourceware.org/bugzilla/attachment.cgi?id=11299
>>
>> Anyway I have just fixed the issues identified by Marko above in that
>> spreadsheet. I will do the changes for the below request and then upload
>> the new translit_cyrillic file to the bugzilla.
>>
>>
>>>> - Please add the standard glibc locale header (see the existing
>>>> translit_* files for reference)
>>>> - Consider wrapping the header lines at or around column 70-72
>>>> - Consider describing which characters, character ranges, or blocks are
>>>> supported (perhaps also describe why some of those are not included, see
>>>> e.g. https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode)
>>>> - Please remove trailing whitespaces and spaces after ;
>>>
>>> Thanks for this, Marko. While at this, in the ChangeLog and in the commit
>>> message these paths:
>>>
>>> * locales/aa_DJ: likewise
>>>
>>> 1. Should be a relative path starting in the root directory of glibc
>>> source, that is: "* localedata/locales/aa_DJ".
>>> 2. Should be "Likewise." (starting with an uppercase and ending with a
>>> dot).
>>
>> will do.
>>
>> Bests,
>> Egor
>>
>
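The cosmetic review points quoted above (no duplicate alternatives, no trailing whitespace, no spaces after ";") could be checked mechanically. What follows is a minimal sketch under stated assumptions, not the tooling actually used for the patch: the function name `lint_translit_line` is hypothetical, `%` is assumed to be the comment character, and each rule is assumed to have the shape "source, whitespace, alternatives separated by ;". The codepoints in the usage lines are illustrative only.

```python
import re


def lint_translit_line(line):
    """Return a list of cosmetic issues found on one table line.

    Hypothetical helper. Splitting on ';' is naive and would misread a
    quoted string that itself contains a semicolon.
    """
    issues = []
    line = line.rstrip("\n")

    if line != line.rstrip():
        issues.append("trailing whitespace")

    body = line.strip()
    if not body or body.startswith("%"):
        # Blank lines and comments have no alternatives to inspect.
        return issues

    if re.search(r";[ \t]", line):
        issues.append("space after ';'")

    parts = body.split(None, 1)
    if len(parts) == 2:
        alternatives = [alt.strip() for alt in parts[1].split(";")]
        if len(alternatives) != len(set(alternatives)):
            issues.append("duplicate alternative")

    return issues


# Illustrative lines; the codepoints are assumptions, not table content.
print(lint_translit_line("<U0435> <U0065>;<U0065>"))
print(lint_translit_line("<U0423> <U0055>; <U0059> "))
print(lint_translit_line("<U0435> <U0065>"))
```

Applied to each line of a generated file, this would flag the kinds of issues listed in the review before the file is uploaded.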