Date: Fri, 07 Jun 2019 11:11:00 -0000
From: Rafal Luzynski
To: "Diego (Egor) Kobylkin", Carlos O'Donell
Cc: Marko Myllynen, "libc-alpha@sourceware.org", "libc-locales@sourceware.org", Siddhesh Poyarekar, Mike Fabian
Message-ID: <2088824303.1660312.1559905911624@poczta.nazwa.pl>
References: <2030695416.914859.1559778544120@poczta.nazwa.pl> <1640311749.1550210.1559856673283@poczta.nazwa.pl> <054f3b06-3ca8-00b0-ee07-1ff86a4106dc@redhat.com>
Subject: Re: [PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
X-SW-Source: 2019-q2/txt/msg00084.txt.bz2

7.06.2019 11:46 "Diego (Egor) Kobylkin" wrote:
>
> Hi Carlos et al.
>
> On Friday, June 7, 2019 2:57 AM, Carlos O'Donell
> wrote:
> > I have a weak preference for 1. However, I would change my preference if
> > someone showed me existing prior implementations that did 1 or 2.
>
> 1. glibc already translits letters and ligatures capitalized in
> locale/C-translit.h.in:
> "\x00c6" "AE" # LATIN CAPITAL LETTER AE
> "\x0132" "IJ" # LATIN CAPITAL LIGATURE IJ

I now lean toward thinking that this is wrong, because we don't have a smart
algorithm which would adjust the upper/lower case of the transliterated
letters.
I am not criticizing this particular transliteration rule; any fixed rule
here would be wrong and incomplete (e.g., "\x00c6" -> "Ae" could be good in
some cases but wrong in many others). As a real-life example, please correct
me if I'm wrong, but AFAIK in German the umlaut letters like "Ö" are
sometimes written (transliterated) as "OE", but when they appear as the
first letter in a title-cased word they are transliterated as "Oe", not as
"OE" (e.g., "Österreich" -> "Oesterreich", not "OEsterreich").

> 2. I would just like to quote myself from 2018:
>
>
> "collisions due to "one symbol capitalization" would cause irreversible
> damage to data.
>
> For a library like glibc this seems to be a very relevant issue to
> consider..."
> [...]

Could you please elaborate on why it is so important to ensure that the
output data is never ambiguous, and what damage to data that would cause?
OK, you mentioned the case of renaming files. I believe that a perfect
collision-free algorithm is impossible. A simple example where it would
never work is two files in the same directory: one with a name written in
Cyrillic and another with a name written in Latin which is exactly the
output of the transliteration algorithm for the first one.

Another question: why do you need to transliterate the file names at all?
Wouldn't it work perfectly for you if they were not transliterated? My guess
is that it might be useful when using files on some older systems which do
not support Unicode.

Maybe let's consider who should use any transliteration at all, and why.
What comes to my mind is:

1. Countries (languages) which use two writing systems and want to have
   an automatic transliteration of the text. Examples: Serbian, Kazakh.
2. Countries (languages) which use a non-Latin script but want to
   automatically provide some readable content for foreign visitors.
3. Backward compatibility with some older computer devices which are
   unable to handle Unicode.
Now we can think about what the requirements of these target groups are
and whether we can provide a solution which would work for all of them.

Regards,

Rafal
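
For illustration only, the two problems discussed in this message (the case of a multi-letter replacement, and name collisions) can be sketched in a few lines of Python. This is a minimal sketch, not glibc code: the two-entry table `FIXED_RULES` and the functions `translit_fixed`, `translit_case_aware` and `find_collisions` are hypothetical names introduced here, and the "look at the next letter" heuristic is only an assumption about what a smarter algorithm might do.

```python
# Hypothetical two-entry table for illustration; real locale data is far larger.
FIXED_RULES = {"Ö": "OE", "Æ": "AE"}


def translit_fixed(text):
    """Apply one fixed replacement per character, ignoring context."""
    return "".join(FIXED_RULES.get(ch, ch) for ch in text)


def translit_case_aware(text):
    """Emit "Oe" instead of "OE" when the following letter is lowercase.

    This heuristic is an assumption made for this sketch, not an
    existing glibc feature.
    """
    out = []
    for i, ch in enumerate(text):
        repl = FIXED_RULES.get(ch)
        if repl is None:
            out.append(ch)
            continue
        nxt = text[i + 1] if i + 1 < len(text) else ""
        # A lowercase neighbour suggests a title-cased word.
        out.append(repl.capitalize() if nxt.islower() else repl)
    return "".join(out)


def find_collisions(names, translit):
    """Group distinct input names that map to the same output name."""
    seen = {}
    for name in names:
        seen.setdefault(translit(name), []).append(name)
    return {out: srcs for out, srcs in seen.items() if len(srcs) > 1}


print(translit_fixed("Österreich"))       # OEsterreich
print(translit_case_aware("Österreich"))  # Oesterreich
print(translit_case_aware("ÖSTERREICH"))  # OESTERREICH
# A name that already equals the transliterated output always collides:
print(find_collisions(["Österreich", "Oesterreich"], translit_case_aware))
```

The last call reports that both input names map to "Oesterreich". The same happens with a Cyrillic file name and a Latin file name that already equals its transliteration, whichever case rule is chosen.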