From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-locales-return-6488-listarch-libc-locales=sources.redhat.com@sourceware.org>
Received: (qmail 81035 invoked by alias); 2 Nov 2018 23:27:31 -0000
Mailing-List: contact libc-locales-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-locales.sourceware.org>
List-Subscribe: <mailto:libc-locales-subscribe@sourceware.org>
List-Post: <mailto:libc-locales@sourceware.org>
List-Help: <mailto:libc-locales-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: libc-locales-owner@sourceware.org
Received: (qmail 81001 invoked by uid 89); 2 Nov 2018 23:27:29 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: =?ISO-8859-1?Q?No, score=3.1 required=5.0 tests=BAYES_00,BODY_8BITS,GARBLED_BODY,KAM_LAZY_DOMAIN_SECURITY,RCVD_IN_DNSWL_NONE autolearn=no version=3.3.2 spammy==d0=a1=d1, genuine, =d0=b5=d0=bc=d0=b0, brake?=
X-HELO: mout.kundenserver.de
Subject: Re: [PATCH v8] Locales: Cyrillic -> ASCII transliteration table [BZ
 #2872]
To: libc-alpha@sourceware.org, libc-locales@sourceware.org
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com>
 <20180412224352.GB2911@altlinux.org>
 <4c8c8fa7-5aa3-11c5-741d-33552b4c4c7c@kobylkin.com>
 <1847777640.296958.1541197328619@poczta.nazwa.pl>
From: Egor Kobylkin <egor@kobylkin.com>
Openpgp: preference=signencrypt
Message-ID: <ffbdaf2a-7d2c-c096-5762-7e91b6de19b2@kobylkin.com>
Date: Fri, 02 Nov 2018 23:27:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101
 Thunderbird/60.2.1
MIME-Version: 1.0
In-Reply-To: <1847777640.296958.1541197328619@poczta.nazwa.pl>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-SW-Source: 2018-q4/txt/msg00100.txt.bz2

Moving everybody from To: and CC: to BCC. At this stage it seems to be
just Rafal and me. This is still going to libc-alpha and libc-locales. If
you would like to be put back on CC, please let me know.

On 02.11.18 23:22, Rafal Luzynski wrote:
>> * Consistently transliterate single uppercase Cyrillic letters to 
>> sequences of all uppercase Latin letters in all languages
>> (whenever a Cyrillic letter is transliterated to more than one
>> Latin letter), for example "Ї" is now transliterated as "YI" rather
>> than "Yi".
> 
> I think you have not yet explained whether this is required by any
> existing standard (please provide links) or whether this is your
> genuine idea to distinguish between the cases like "Ш" transliterated
> to "Sh" and "Сх" also transliterated to "Sh".

I remember seeing this form of capitalization in actual
transliterated texts a long time ago but can't find a formal description
of it right now. I just don't want to claim it as my original idea.

>> The choice for YO, SH, YA, ZH etc. is to avoid naming collisions for
>> example for "Сх" and "Ш" that would both transliterate to Sh:
>> With SH:"Схема"->"Shema" but "Шема"->"SHema"
>> With Sh:"Схема"->"Shema" and "Шема"->"Shema". Collision!
>> This is important e.g. for renaming files, grouping as in using uniq
>> etc.

As for the users - I am a user and I have demonstrated the use cases
where the collisions due to "one symbol capitalization" would cause
irreversible damage to data. For a library like glibc this seems like a
relevant issue to consider.
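
A minimal sketch of the collision, for illustration only. The translit
helper below is hypothetical and covers just the letters of the two example
words; it is not the glibc transliteration table and does not go through
iconv. Its first argument selects the spelling used for uppercase "Ш".

```sh
# Hypothetical helper covering only the letters in the example words.
# $1: Latin spelling used for uppercase "Ш" ("Sh" or "SH"); $2: text.
translit() {
    printf '%s\n' "$2" | sed -e "s/Ш/$1/g" -e 's/С/S/g' -e 's/х/h/g' \
        -e 's/е/e/g' -e 's/м/m/g' -e 's/а/a/g'
}

# Count the distinct names left after transliterating both example words.
count_distinct() {
    for word in "Схема" "Шема"; do
        translit "$1" "$word"
    done | LC_ALL=C sort -u | wc -l | tr -d ' '
}

echo "Sh: $(count_distinct Sh) distinct name(s)"   # Sh: 1 distinct name(s)
echo "SH: $(count_distinct SH) distinct name(s)"   # SH: 2 distinct name(s)
```

With "Sh" the two different source words end up as one name, so a rename
would overwrite one file with the other; with "SH" both names survive.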

The "two symbol capitalization" on the other hand would prevent
collisions and can easily be corrected in userspace if needed
with something like

foo="SHema"
foo="${foo:0:1}$(tr '[:upper:]' '[:lower:]' <<<"${foo:1}")"
echo "$foo"
Shema
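
The snippet above lowercases everything after the first letter, so it
would also change a word written entirely in capitals. A slightly more
conservative sketch follows, assuming that only a leading pair of uppercase
letters followed by a lowercase letter needs correcting. The fix_caps name
is made up for this example, and sequences longer than two letters are not
handled.

```sh
# Hypothetical helper: lowercase the second letter of a word that starts
# with two uppercase letters followed by a lowercase one ("SHema" -> "Shema").
# The body runs in a subshell so its variables stay local.
fix_caps() (
    case $1 in
        [[:upper:]][[:upper:]][[:lower:]]*)
            rest=${1#?}              # word without its first character
            first=${1%"$rest"}       # first character
            tail=${rest#?}           # everything after the second character
            second=${rest%"$tail"}   # second character
            second=$(printf '%s' "$second" | tr '[:upper:]' '[:lower:]')
            printf '%s%s%s\n' "$first" "$second" "$tail"
            ;;
        *)
            # Anything else (all caps, already corrected, etc.) is unchanged.
            printf '%s\n' "$1"
            ;;
    esac
)

fix_caps "SHema"   # prints Shema
fix_caps "SHEMA"   # prints SHEMA
```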

It looks like everyone who really uses transliteration for something
sensitive has already been doing it in userspace since at least 2006, when
this bug was first logged. So we won't break the official use cases
where the capitalization has to be done in a certain way. But we will
prevent new bugs due to collisions if we use "two symbol capitalization".

Happy to hear arguments to the contrary.

Bests,
Egor Kobylkin