From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-locales-return-6419-listarch-libc-locales=sources.redhat.com@sourceware.org>
Received: (qmail 76151 invoked by alias); 9 Oct 2018 18:34:29 -0000
Mailing-List: contact libc-locales-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-locales.sourceware.org>
List-Subscribe: <mailto:libc-locales-subscribe@sourceware.org>
List-Post: <mailto:libc-locales@sourceware.org>
List-Help: <mailto:libc-locales-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: libc-locales-owner@sourceware.org
Received: (qmail 76111 invoked by uid 89); 9 Oct 2018 18:34:29 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-0.6 required=5.0 tests=AWL,BAYES_00,KAM_LAZY_DOMAIN_SECURITY,KAM_NUMSUBJECT,RCVD_IN_DNSWL_NONE autolearn=no version=3.3.2 spammy=upload, 09102018, uppercase, transcription
X-HELO: mout.kundenserver.de
Subject: Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ
 #2872] re-submission for 2.29
From: Egor Kobylkin <egor@kobylkin.com>
To: Rafal Luzynski <digitalfreak@lingonborough.com>,
 Marko Myllynen <myllynen@redhat.com>
Cc: Keld Simonsen <keld@keldix.com>, libc-alpha@sourceware.org,
 libc-locales@sourceware.org, "Dmitry V. Levin" <ldv@altlinux.org>,
 Volodymyr Lisivka <vlisivka@gmail.com>, Carlos O'Donell <carlos@redhat.com>,
 Max Kutny <mkutny@gmail.com>, danilo@gnome.org
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com>
 <20180412224352.GB2911@altlinux.org>
 <16e785f3-2e9f-ceb2-698f-dc33c91a5d5e@kobylkin.com>
 <ac4c9b3e-aeae-30de-23ef-24d8f53d7bc4@kobylkin.com>
 <20181003091949.GA21486@rap.rap.dk>
 <21d872b2-613e-d1f5-26c0-baa4b5721df9@kobylkin.com>
 <1485772360.805333.1538731225156@poczta.nazwa.pl>
 <deacdf31-d0bb-a92d-1de3-934d6b4cb158@kobylkin.com>
 <bda2ca60-18f1-3b19-91e5-c9ad144bc834@redhat.com>
 <bb4e1ba5-5fa5-2986-2573-7d27be226124@kobylkin.com>
 <69e26cab-810e-824b-3b16-b75ac44d8b0c@redhat.com>
 <b8f02fe9-f911-487f-b50b-9b0c43191cb6@kobylkin.com>
 <f51992ad-008b-03a4-8880-4c12edced53b@redhat.com>
 <246390048.827062.1539037422672@poczta.nazwa.pl>
 <4db1ce91-3184-cf45-01c5-80667fc4cf65@kobylkin.com>
 <f6b530b0-53b7-bd90-9bb9-864d0a477f50@kobylkin.com>
Openpgp: preference=signencrypt
Message-ID: <a9af47d8-bf3d-e607-38e1-a6e765a604d3@kobylkin.com>
Date: Tue, 09 Oct 2018 18:34:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.9.1
MIME-Version: 1.0
In-Reply-To: <f6b530b0-53b7-bd90-9bb9-864d0a477f50@kobylkin.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-SW-Source: 2018-q4/txt/msg00031.txt.bz2


The culprits were the double quotes around the source sequences
"<U0423><U0301>" (<U00DA>) and "<U0443><U0301>" (<U00FA>).
Without the quotes it works now:
% CYRILLIC UNDEFINED
<U0423><U0301> <U00DA>;"<U0055><U0060>"
% CYRILLIC UNDEFINED
<U0443><U0301> <U00FA>;"<U0075><U0060>"

The <U0301> is a combining character, and apparently the entry does not
work when it is enclosed in quotes together with the letter code point.
Please let me know if there is another explanation.
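
For reference, here is a minimal Python sketch of the Unicode side of
this. It only inspects the code points with the standard unicodedata
module; it does not reproduce what localedef or iconv do with the
quotes, so it illustrates the "combining" part and is not a
confirmation of the explanation above. The variable names are made up
for the example.

```python
import unicodedata

# CYRILLIC CAPITAL LETTER U followed by COMBINING ACUTE ACCENT,
# i.e. the source sequence <U0423><U0301> from translit_cyrillic.
cyrillic_seq = "\u0423\u0301"

# A non-zero canonical combining class marks a combining character.
combining_classes = [unicodedata.combining(ch) for ch in cyrillic_seq]

for ch in cyrillic_seq:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Unicode has no precomposed "Cyrillic U with acute", so NFC
# normalization leaves the pair as two separate code points.
cyrillic_nfc = unicodedata.normalize("NFC", cyrillic_seq)

# The Latin counterpart does have a precomposed form, <U00DA>.
latin_nfc = unicodedata.normalize("NFC", "U\u0301")

print(len(cyrillic_nfc), f"U+{ord(latin_nfc):04X}")
```

So the Cyrillic source always stays a two-code-point sequence, which is
why the table has to match it as a sequence rather than as one
character.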

I will now make those changes and generate the patch itself.
Egor

On 09.10.2018 15:18, Egor Kobylkin wrote:
> Hi,
> 
> I have now implemented all the changes requested for the
> translit_cyrillic file but started hitting what seems like a bug:
> 
> - If the line <U0425> <U0048>;<U0058> is present in translit_cyrillic, the
> locale compilation fails, i.e. grep CYRILLIC < $testfile |
> LOCPATH=$workdir/compiled_locales/"$locale"/ LC_ALL="$locale".UTF-8
> iconv -f UTF-8 -t ASCII//TRANSLIT hangs.
> 
> - If the line <U0425> <U0048>;<U0058> is absent from translit_cyrillic,
> everything works; only the transliteration of <U0425> fails, as expected
> (? is displayed).
> 
> - If translit_cyrillic contains <U0425> <U0048>;<U0058> as the _only_
> line, the transliteration of <U0425> works again (the others show as ?).
> 
> Do you have any idea in which direction I should look? The new
> translit_cyrillic is attached.
> 
> (<U0425> is % CYRILLIC CAPITAL LETTER HA)
> 
> Best regards,
> Egor
> 
> On 09.10.2018 01:35, Egor Kobylkin wrote:
>> On 09.10.2018 00:23, Rafal Luzynski wrote:
>>> 8.10.2018 14:40 Marko Myllynen <myllynen@redhat.com> wrote:
>>>> Hi,
>>>>
>>>> Thanks for the update. I have a few mostly cosmetic comments below;
>>>> hopefully we'll hear from others whether they agree with this direction.
>>>>
>>
>> Yeah, the earlier we have feedback, the more productive we are. I'd be
>> happy to get as much feedback on this as early as possible. So,
>> everybody concerned, please chime in.
>>
>>>
>>>> - No duplicates:
>>>>
>>>> % CYRILLIC SMALL LETTER IE
>>>> <U0435> <U0065>; <U0065>
>>>>
>>>> should become:
>>>>
>>>> % CYRILLIC SMALL LETTER IE
>>>> <U0435> <U0065>
>>>>
>>>> - There are a few issues with the definitions:
>>>>
>>>> % CYRILLIC CAPITAL LETTER U
>>>> <U0423> <U0055>; <U0055>
>>>> % CYRILLIC UNDEFINED
>>>> <U0423><U0423> <U00DA>; "<U0055><U0060>"
>>>>
>>>> % CYRILLIC SMALL LETTER U
>>>> <U0443> <U0075>; <U0075>
>>>> % CYRILLIC UNDEFINED
>>>> <U0443><U0443> <U00FA>; "<U0075><U0060>"
>>>
>>> Are the duplicates here because some Cyrillic letters may have multiple
>>> Latin transliterations depending on the context, for example Cyrillic IE
>>> must be transliterated sometimes as "e", sometimes as "ie", sometimes
>>> as "ye" or "je"?  Can we provide rules for groups of characters instead?
>> No, the duplicates are just an artifact of my line-generating logic. I
>> have fixed (removed) them. As far as I understand, the varying
>> transcription between languages/locales cannot be handled in one file
>> at all.
>>
>>>
>>>> I wonder whether it would be possible to automate generation of this
>>>> file so that issues like the above could be avoided? But perhaps that
>>>> could be the next step once this initial patch lands.
>>
>> I am generating the content part of translit_cyrillic from a
>> LibreOffice spreadsheet. Not sure if you have had time to view it yet?
>> https://sourceware.org/bugzilla/attachment.cgi?id=11299
>>
>> Anyway, I have just fixed the issues identified by Marko above in that
>> spreadsheet. I will make the changes for the requests below and then
>> upload the new translit_cyrillic file to the bugzilla.
>>
>>
>>>> - Please add the standard glibc locale header (see the existing
>>>> translit_* files for reference)
>>>> - Consider wrapping the header lines at or around column 70-72
>>>> - Consider describing which characters, character ranges, or blocks are
>>>> supported (perhaps also describe why some of those are not included, see
>>>> e.g. https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode)
>>>> - Please remove trailing whitespace and spaces after ;
>>>
>>> Thanks for this, Marko.  While at this, in the ChangeLog and in the commit
>>> message these paths:
>>>
>>> 	* locales/aa_DJ: likewise
>>>
>>> 1. Should be a relative path starting in the root directory of glibc
>>>    source, that is: "* localedata/locales/aa_DJ".
>>> 2. Should be "Likewise." (starting with an uppercase and ending with a
>>>    dot).
>>
>> Will do.
>>
>> Bests,
>> Egor
>>
>