Subject: Re: [PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
From: Carlos O'Donell
To: "Diego (Egor) Kobylkin"
Cc: Rafal Luzynski, Marko Myllynen, "libc-alpha@sourceware.org", "libc-locales@sourceware.org", Siddhesh Poyarekar, Mike Fabian
Date: Fri, 07 Jun 2019 21:17:00 -0000
Message-ID: <226594c6-f1fc-b6de-a145-57e66a4f1868@redhat.com>

On 6/7/19 8:59 AM, Diego (Egor) Kobylkin wrote:
> On Friday, June 7, 2019 2:35 PM, Carlos O'Donell wrote:
>> I'd like to hear what Egor has to say about the data loss aspects.
>
> It's quite simple really - suppose you have a list of pages in a
> wikipedia.
>
> For example there are these two entries in Russian:
>
> 1. Шема
> https://ru.wikipedia.org/w/index.php?title=%D0%A8%D0%B5%D0%BC%D0%B0&redirect=no
>
> 2. Схема
> https://ru.wikipedia.org/wiki/%D0%A1%D1%85%D0%B5%D0%BC%D0%B0
>
> So you want to scrape wikipedia and write them out to files: Шема.txt
> and Схема.txt. But the target system doesn't support a Russian locale
> and so you must transliterate the filenames.
>
> If "Ш"->"Sh" and "Сх"->"Sh", both of them will be written into the
> same file "Shema.txt". With no other special handling the first file
> will be overwritten and its data lost.
>
> If "Ш"->"SH" and "Сх"->"Sh" - there will be two separate files:
> 1. SHema.txt 2. Shema.txt. No data loss in this case.

Agreed.

> We can't exclude all data loss scenarios but at least shouldn't
> knowingly let the most basic ones happen just because of how SHema
> looks. Translit is mostly a technical field, at least in glibc, so the
> aesthetics would be the last thing I would care about here.
>
> Anyway I'm all for committing the patch this way or another and
> opening a new bug should anyone complain about Sh/SH. Until now we
> had a hard time getting any input from any outsider on this issue. I
> guess de-facto I am the only end-user that has an opinion on this
> :-)

I appreciate your input.

I expected this example; it's a classic problem with transliteration
that the conversion can result in non-unique representations.

I also think your point about "technical" is relevant here: nobody
really wants to read the transliterated results, they want to read
the original, and providing any hint about the original form has value.

In glibc we don't have any framework for an intelligent conversion.
We would have to write specific code for this case and add it to the
translit code as special handling.
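To make the collision concrete, here is a minimal Python sketch of the
two mappings discussed above. The tables are hypothetical and cover only
the letters in the example; they are not the glibc locale data from the
patch, and the helper names are made up for illustration.

```python
# Hypothetical, minimal transliteration tables for illustration only.
# They are NOT the glibc locale data from the patch.
BASE = {"С": "S", "х": "h", "е": "e", "м": "m", "а": "a"}
MIXED_CASE = {**BASE, "Ш": "Sh"}  # "Ш" -> "Sh"
UPPER_CASE = {**BASE, "Ш": "SH"}  # "Ш" -> "SH"


def transliterate(text, table):
    """Map each character through the table; keep unknown characters as-is."""
    return "".join(table.get(ch, ch) for ch in text)


def colliding_names(titles, table):
    """Group titles by transliterated filename and return only the clashes."""
    groups = {}
    for title in titles:
        name = transliterate(title, table) + ".txt"
        groups.setdefault(name, []).append(title)
    return {name: group for name, group in groups.items() if len(group) > 1}


titles = ["Шема", "Схема"]
print(colliding_names(titles, MIXED_CASE))  # both titles share one filename
print(colliding_names(titles, UPPER_CASE))  # no shared filenames
```

With "Ш"->"Sh" both titles end up as Shema.txt, so the second write
would overwrite the first. With "Ш"->"SH" the names differ (SHema.txt
and Shema.txt) on a case-sensitive filesystem.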
I think that for now we should leave "Ш"->"SH" and "Сх"->"Sh", since
it's the most conservative position that avoids ambiguity, and then we
can discuss the aesthetics of this and the other impacts and solutions.

I appreciate Rafal's position, but I think being conservative here,
even if it's not as pretty as uconv, is a good guiding idea.

--
Cheers,
Carlos.