From: Egor Kobylkin
Date: Fri, 02 Nov 2018 23:27:00 -0000
To: libc-alpha@sourceware.org, libc-locales@sourceware.org
Subject: Re: [PATCH v8] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]

Moving everybody from To: and CC: to BCC. At this stage it seems to be
just Rafal and me. This still goes to libc-alpha and libc-locales.
If you would like to be put back on CC, please let me know.

On 02.11.18 23:22, Rafal Luzynski wrote:
>> * Consistently transliterate single uppercase Cyrillic letters to
>>   sequences of all uppercase Latin letters in all languages
>>   (whenever a Cyrillic letter is transliterated to more than one
>>   Latin letter), for example "Ї" is now transliterated as "YI" rather
>>   than "Yi".
>
> I think you have not yet explained whether this is required by any
> existing standard (please provide links) or whether this is your
> genuine idea to distinguish between the cases like "Ш" transliterated
> to "Sh" and "Сх" also transliterated to "Sh".

I remember seeing this form of capitalization in actual transliterated
texts a long time ago, but I can't find a formal description right now.
I just don't want to claim it as my original idea.

>> The choice for YO, SH, YA, ZH etc. is to avoid naming collisions for
>> example for "Сх" and "Ш" that would both transliterate to Sh:
>> With SH:"Схема"->"Shema" but "Шема"->"SHema"
>> With Sh:"Схема"->"Shema" and "Шема"->"Shema". Collision!
>> This is important e.g. for renaming files, grouping as in using uniq
>> etc.

As for the users: I am a user, and I have demonstrated the use cases
where collisions due to "one symbol capitalization" would cause
irreversible damage to data. For a library like glibc this seems like a
relevant issue to consider.

The "two symbol capitalization", on the other hand, prevents the
collision and can easily be corrected in userspace if needed, with
something like:

    foo="SHema"
    # keep the first character, lowercase the rest
    foo="${foo:0:1}$(tr '[:upper:]' '[:lower:]' <<<"${foo:1}")"
    echo "$foo"    # prints: Shema

It looks like everyone who really uses transliteration for something
sensitive has already been doing it in userspace since at least 2006,
when this bug was first logged. So we won't break the official use
cases where the capitalization has to be done in a certain way. But we
will prevent new bugs due to collisions if we use the "two symbol
capitalization".

Happy to hear arguments to the contrary.

Bests,
Egor Kobylkin
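
The collision argument above can be reproduced with a minimal sketch. The two sed-based helpers below are hypothetical and map only the letters that occur in "Схема" and "Шема"; they stand in for the transliteration table and are not the glibc locale data or its API.

```shell
#!/bin/sh
# Hypothetical helpers: each maps only the Cyrillic letters used in the
# two example words. They are an illustration, not the real table.

# "One symbol capitalization": Ш -> Sh
translit_sh_mixed() {
    sed -e 's/Ш/Sh/g' -e 's/С/S/g' -e 's/х/h/g' \
        -e 's/е/e/g' -e 's/м/m/g' -e 's/а/a/g'
}

# "Two symbol capitalization": Ш -> SH
translit_sh_upper() {
    sed -e 's/Ш/SH/g' -e 's/С/S/g' -e 's/х/h/g' \
        -e 's/е/e/g' -e 's/м/m/g' -e 's/а/a/g'
}

# With Sh, two distinct names collapse into a single unique line ...
printf '%s\n' 'Схема' 'Шема' | translit_sh_mixed | sort -u | wc -l
# ... with SH, both names survive as two unique lines.
printf '%s\n' 'Схема' 'Шема' | translit_sh_upper | sort -u | wc -l
```

The same effect would apply to any step that keys on the transliterated name, such as renaming files or grouping with uniq: with "Sh" the second name silently replaces or merges with the first.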