From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 98168 invoked by alias); 5 Oct 2018 12:21:17 -0000
Mailing-List: contact libc-locales-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id:
List-Subscribe:
List-Post:
List-Help:
Sender: libc-locales-owner@sourceware.org
Received: (qmail 97795 invoked by uid 89); 5 Oct 2018 12:21:17 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-1.9 required=5.0 tests=AWL,BAYES_00,KAM_NUMSUBJECT,RCVD_IN_DNSWL_NONE autolearn=no version=3.3.2 spammy=prime, HX-Received:a5d, HX-Received:sk:b1-v6mr, H*r:sk:a13-v6s
X-HELO: mail-wr1-f67.google.com
Return-Path:
Reply-To: Marko Myllynen
Subject: Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29
To: Egor Kobylkin, Rafal Luzynski, Keld Simonsen
Cc: libc-alpha@sourceware.org, libc-locales@sourceware.org, "Dmitry V. Levin", Volodymyr Lisivka, Carlos O'Donell, Max Kutny, danilo@gnome.org
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com>
 <20180412224352.GB2911@altlinux.org>
 <16e785f3-2e9f-ceb2-698f-dc33c91a5d5e@kobylkin.com>
 <20181003091949.GA21486@rap.rap.dk>
 <21d872b2-613e-d1f5-26c0-baa4b5721df9@kobylkin.com>
 <1485772360.805333.1538731225156@poczta.nazwa.pl>
From: Marko Myllynen
Message-ID: <69e26cab-810e-824b-3b16-b75ac44d8b0c@redhat.com>
Date: Fri, 05 Oct 2018 12:21:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1
MIME-Version: 1.0
In-Reply-To:
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-SW-Source: 2018-q4/txt/msg00012.txt.bz2

Hi,

The scheme I proposed would also be ASCII compatible; consider this
example:

% CYRILLIC CAPITAL LETTER SHA
"";""

"printf \\u0428\\n | iconv -f UTF-8 -t ISO-8859-15//TRANSLIT | iconv -f ISO-8859-15 -t UTF-8"
would produce Š as per System A and
"printf \\u0428\\n | iconv -f UTF-8 -t ASCII//TRANSLIT"
would produce Sh as per System B.

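The fallback order described above can be sketched outside of glibc. The following is only a rough Python illustration, not how iconv or the translit_* files are actually implemented: the CANDIDATES table and the transliterate() helper are hypothetical names, and the single entry simply restates the SHA example (S with caron first, "Sh" second).

```python
# Hypothetical illustration only; this is not glibc's implementation.
# Each entry lists candidates in order of preference, mirroring the
# "first";"second" alternatives of a translit_* rule.
CANDIDATES = {
    # CYRILLIC CAPITAL LETTER SHA: S WITH CARON (System A), then "Sh" (System B)
    "\u0428": ["\u0160", "Sh"],
}


def can_encode(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def transliterate(text, charset):
    """Replace each unencodable character with its first encodable candidate."""
    out = []
    for ch in text:
        if can_encode(ch, charset):
            out.append(ch)
            continue
        for candidate in CANDIDATES.get(ch, []):
            if can_encode(candidate, charset):
                out.append(candidate)
                break
        else:
            out.append("?")  # assumption: mark characters with no usable candidate
    return "".join(out)


if __name__ == "__main__":
    # ascii() keeps the demo output printable on any terminal encoding.
    print(ascii(transliterate("\u0428", "iso8859-15")))
    print(ascii(transliterate("\u0428", "ascii")))
```

With ISO-8859-15 as the target the first candidate is kept, because that charset contains S with caron; with ASCII as the target the first candidate cannot be encoded, so the second one ("Sh") is used.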
Thanks,

On 2018-10-05 15:00, Egor Kobylkin wrote:
> Hi Marko,
>
> I have chosen System B because it is ASCII compatible. System A is
> not ASCII compatible (diacritics in target).
>
> https://en.wikipedia.org/wiki/ISO_9#ISO_9:1995,_or_GOST_7.79_System_A
> "GOST 7.79 contains two transliteration tables.
>
> System A
> one Cyrillic character to one Latin character, some with diacritics
> – identical to ISO 9:1995
>
> System B
> one Cyrillic character to one or many Latin characters without
> diacritics
> "
> Hope this helps,
> Egor
>
> On 05.10.2018 13:54, Marko Myllynen wrote:
>> Hi,
>>
>> Would it make sense to first use ISO 9:1995/GOST 7.79 System A if
>> possible and if not, then fall back to GOST 7.79 System B?
>>
>> Implementation-wise, the current translit_* files have a few examples
>> where a non-ASCII transliteration is tried first before an ASCII
>> fallback. These examples are from translit_neutral:
>>
>> % NARROW NO-BREAK SPACE
>> ;
>> % REVERSED TRIPLE PRIME
>> "";""
>>
>> Thanks,
>>
>> On 2018-10-05 13:29, Egor Kobylkin wrote:
>>> Keld, Marko, Rafal, other locale maintainers,
>>>
>>> all of this is written with the aim of a minimal viable fix for this
>>> bug asap. I want to avoid wasting maintainers' time by getting into
>>> fundamental discussions here (even if they are held for perfectly
>>> good reasons).
>>>
>>> I see three options:
>>> 1. those locale maintainers that are fine with using the ISO
>>> 9:1995/GOST_7.79_System_B Cyrillic transliteration table (Ru) include it
>>> in their locales (see attached screenshot of the table).
>>> 2. those that want to have a different table can create their own
>>> variety based on the spreadsheet I have prepared
>>> https://sourceware.org/bugzilla/attachment.cgi?id=8590 and include it in
>>> this patch.
>>> 3. those that want to omit a Cyrillic transliteration altogether for now
>>> state so and just carry over the bug #2872 from the year 2006.
>>>
>>> Does this make sense to you?
>>>
>>> Just to be super clear on this: the patch is a stopgap _ASCII_
>>> transliteration table. ASCII being the AMERICAN Standard Code for
>>> Information Interchange, it is obviously orthogonal to any
>>> transliteration rule of other countries. As such it is not explicitly
>>> targeting the transliteration standards of any country.
>>>
>>> The patch reflects the Russian variety of ISO
>>> 9:1995/GOST_7.79_System_B because a) ISO 9:1995/GOST_7.79_System_B is
>>> available and can be helpful to a majority of Cyrillic users and b) I
>>> have access to it, including by being proficient in Russian.
>>>
>>> It is offered to all the respective locale maintainers as a stopgap
>>> solution. Stopgap in the sense that it is better to have some
>>> transliteration than not to have any at all and carry over the bug from
>>> 2006. That it may be a somewhat officially correct transliteration for
>>> ru_RU is a bonus. In that sense I would dub the discussion on the
>>> correctness for other languages "offtopic". Let me know if this is not OK.
>>>
>>> You are all correctly pointing out the deficiencies of this approach.
>>> However, I couldn't find a better straightforward approach as of yet.
>>> Happy to hear from you as to how this could be handled.
>>>
>>> There is a danger of being caught in the web of language/country
>>> differences. I propose just pruning the locales that are not comfortable
>>> including this current table. We can address possible solutions in the
>>> second wave of patching.
>>>
>>> I am wary of getting into discussions on specific country variants just
>>> because of the sheer complexity of this topic. It is probably better
>>> addressed by the respective maintainers of their locales. I do not see a
>>> "one size fits all" solution being possible in this first wave.
>>>
>>> I would like to have this "three options plan of action" vetted first
>>> and then we could go to the specific detail. (Like, for instance, what
>>> characters should be included in the table, and in which
>>> transliteration form.)
>>>
>>> I am looking forward to your reply,
>>> Egor Kobylkin
>>>
>>> P.S. specifically as to how to address languages other than Ru included in
>>> GOST_7.79_System_B: we can take the first option left to right from that
>>> table (Ru,By,Uk,Bg,Mk). Then it will technically work for all those
>>> locales/languages but with errors where Ru supersedes their own variants.
>>>
>>>
>>> On 05.10.2018 11:20, Rafal Luzynski wrote:
>>>> 3.10.2018 11:32 Egor Kobylkin wrote:
>>>>>
>>>>> On 03.10.2018 11:19, Keld Simonsen wrote:
>>>>>> Hi
>>>>>>
>>>>>> Please note that transliteration of Cyrillic to Latin is not universal.
>>>>>> There are different schemes for, e.g., German, English and Danish, and
>>>>>> there is also an ISO standard for it.
>>>>>
>>>>> Thanks for your feedback, Keld!
>>>>>
>>>>> Could the locale maintainers that wouldn't like to include this patch
>>>>> explicitly state so here?
>>>>
>>>> I think it is about me so I must reply. I am sorry about that and the sole
>>>> reason is my lack of time. I'm just a volunteer here, that means it's not
>>>> my regular job to work on locale data nor anything in glibc nor in any other
>>>> open source project. I do these things only in my free time, of which I
>>>> don't have much. Of course you will see my contributions here and there but
>>>> they are either trivial or take me months to complete. Your patches are on
>>>> my radar but I can't give any ETA for them. Of course, there are other
>>>> people around here and they are all welcome to come and join.
>>>>
>>>>> That is:
>>>>> - In the case that there is a different preferred Cyrillic
>>>>> transliteration table for any specific locale, their maintainers may want
>>>>> to point me to it so I can supply a separate table/patch.
>>>>> - Or they could state explicitly that for some reason they would like to
>>>>> exclude their locale from the patch for a default Cyrillic
>>>>> transliteration altogether.
>>>>
>>>> As Keld wrote, there are probably separate rules for every language so
>>>> I don't think you should treat your rules as universal and include them
>>>> in every locale. At first sight, it seems to me they work only for English
>>>> (as a destination locale). Also, although it is called "transliteration
>>>> from Cyrillic" it seems that it covers only the Russian alphabet. What about
>>>> other languages which use the Cyrillic alphabet but add their own diacritic
>>>> characters? Think about Belarusian, Ukrainian, Serbian, Chechen, Chuvash,
>>>> Mari, Ossetian, Yakut, Tatar, and more. What about languages which use the
>>>> Cyrillic alphabet but transliterate their respective letters in a different
>>>> way than Russian? For example, Russian "Ъ" is (I think) usually skipped
>>>> in transliteration, I think you propose "``", but when transliterating from
>>>> Bulgarian they usually transliterate this as "ă".
>>>>
>>>> A few remarks:
>>>>
>>>> * I think you transliterate "щ" as "shh", wouldn't "shch" be better?
>>>> * You transliterate "ц" as "cz", wouldn't "ts" be better? By the way,
>>>> in Polish "cz" is a correct transliteration of "ч".
>>>> * You transliterate "й" as "j", this is fine in many languages but wouldn't
>>>> "y" be better in English?
>>>> * In the case of "е": how will you know if it is correct to transliterate it
>>>> to "e" or "ie" or "je" or "ye"?
>>>>
>>>> These remarks are obviously incomplete; your patch deserves a much more
>>>> careful review.
>>>>
>>>> Best regards,
>>>>
>>>> Rafal
>>>>
>>>
>>
>>
>

-- 
Marko Myllynen