From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-locales-return-6746-listarch-libc-locales=sources.redhat.com@sourceware.org>
Received: (qmail 19890 invoked by alias); 7 Jun 2019 09:46:20 -0000
Mailing-List: contact libc-locales-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-locales.sourceware.org>
List-Subscribe: <mailto:libc-locales-subscribe@sourceware.org>
List-Post: <mailto:libc-locales@sourceware.org>
List-Help: <mailto:libc-locales-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: libc-locales-owner@sourceware.org
Received: (qmail 19847 invoked by uid 89); 7 Jun 2019 09:46:16 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-1.1 required=5.0 tests=AWL,BAYES_00,BODY_8BITS,GARBLED_BODY,GIT_PATCH_2,KAM_ASCII_DIVIDERS,RCVD_IN_DNSWL_LOW,SPF_HELO_PASS,SPF_PASS autolearn=ham version=3.3.1 spammy=H*f:@kobylkin.com, H*f:sk:DDiRMB9, H*f:sk:2030695, showed
X-HELO: mail-40132.protonmail.ch
Date: Fri, 07 Jun 2019 09:46:00 -0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kobylkin.com;
	s=protonmail; t=1559900771;
	bh=eWyEc7LGRPXG9MhxrgqnYfa+vgfnq2oZZrwPR1vs/gc=;
	h=Date:To:From:Cc:Reply-To:Subject:In-Reply-To:References:
	 Feedback-ID:From;
	b=Mb4Pnm0qR+8XND334PEvjKBqd8ShlbKEJnDzAPHqewFN/mFfdXrnbZHRn115i+KA6
	 HQEYRo9L3QOYvjdfg+YJ0uBN7w1rHz9sHhvuwl9rHgvimJ0mj/pByzB0ywY3SbV7BR
	 aIBBVfXJqlA80g6Kz6+5sQi1CH2dcgPv2Ws6RWQE=
To: Carlos O'Donell <codonell@redhat.com>
From: "Diego (Egor) Kobylkin" <egor@kobylkin.com>
Cc: Rafal Luzynski <digitalfreak@lingonborough.com>, Marko Myllynen <myllynen@redhat.com>, "libc-alpha@sourceware.org" <libc-alpha@sourceware.org>, "libc-locales@sourceware.org" <libc-locales@sourceware.org>, Siddhesh Poyarekar <siddhesh@gotplt.org>, Mike Fabian <mfabian@redhat.com>
Reply-To: "Diego (Egor) Kobylkin" <egor@kobylkin.com>
Subject: Re: [PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
Message-ID: <oxkea2EN74QNeSDcsfbIQzkTXQZCzuSb3pjMXr6RmDC8zgBm1bDUeE-9RsH4_kUfAJTv2Bwk0Cm-4_r6RM8WHoDBWYW09r1gKQWT1i7r3Mo=@kobylkin.com>
In-Reply-To: <054f3b06-3ca8-00b0-ee07-1ff86a4106dc@redhat.com>
References: <DDiRMB942zU2NTs_1xTsb-zTgRD2L6AOaaJW-a0-0YJ3O5voZt2GeTjQJQ0c_hExTwcJKvBMiXIeyHsdieM2Q1m61oOpU27Msj09zowycVM=@kobylkin.com>
 <2030695416.914859.1559778544120@poczta.nazwa.pl>
 <f392d6b1-39e7-c883-ae5e-2a8636231181@redhat.com>
 <1640311749.1550210.1559856673283@poczta.nazwa.pl>
 <054f3b06-3ca8-00b0-ee07-1ff86a4106dc@redhat.com>
MIME-Version: 1.0
Content-Type: multipart/signed; protocol="application/pgp-signature"; micalg=pgp-sha512; boundary="---------------------7a4e4ec4d4b010fb44778c0fcc5f7bbc"; charset=UTF-8
X-SW-Source: 2019-q2/txt/msg00082.txt.bz2

This is an OpenPGP/MIME signed message (RFC 4880 and 3156)
-----------------------7a4e4ec4d4b010fb44778c0fcc5f7bbc
Content-Type: multipart/mixed;boundary=---------------------20542a4481d69ae317e855edb81936f4

-----------------------20542a4481d69ae317e855edb81936f4
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;charset=utf-8
Content-length: 5133

Hi Carlos et al.=20


On Friday, June 7, 2019 2:57 AM, Carlos O'Donell <codonell@redhat.com> wrot=
e:
> I have a weak preference for 1. However, I would change my preference if
> someone showed me existing prior implementations that did 1 or 2.

1. glibc already transliterates letters and ligatures as all-uppercase in =
locale/C-translit.h.in:
"\x00c6"	"AE"	# <U00C6> LATIN CAPITAL LETTER AE
"\x0132"	"IJ"	# <U0132> LATIN CAPITAL LIGATURE IJ

2. I would just like to quote myself from 2018:=20


"collisions due to "one symbol capitalization" would cause irreversible dam=
age to data.=20

For a library like glibc this seems to be a very relevant issue to consider=
..."

On 03.11.18 00:27, Egor Kobylkin wrote:
> On 02.11.18 23:22, Rafal Luzynski wrote:
>>> * Consistently transliterate single uppercase Cyrillic letters to
>>> sequences of all uppercase Latin letters in all languages
>>> (whenever a Cyrillic letter is transliterated to more than one
>>> Latin letter), for example "=D0=87" is now transliterated as "YI" rather
>>> than "Yi".
>> I think you have not yet explained whether this is required by any
>> existing standard (please provide links) or whether this is your
>> genuine idea to distinguish between the cases like "=D0=A8"
>> transliterated to "Sh" and "=D0=A1=D1=85" also transliterated to "Sh".
>=20

> I remember seeing this form of capitalization in actual
> transliterated texts a long time ago but can't find a formal description
> as of now. Just don't want to claim this to be my original idea.
>=20

>>> The choice for YO, SH, YA, ZH etc. is to avoid naming collisions for
>>> example for "=D0=A1=D1=85" and "=D0=A8" that would both transliterate t=
o Sh:
>>> With SH:"=D0=A1=D1=85=D0=B5=D0=BC=D0=B0"->"Shema" but "=D0=A8=D0=B5=D0=
=BC=D0=B0"->"SHema"
>>> With Sh:"=D0=A1=D1=85=D0=B5=D0=BC=D0=B0"->"Shema" and "=D0=A8=D0=B5=D0=
=BC=D0=B0"->"Shema". Collision!
>>> This is important e.g. for renaming files, grouping as in using uniq,
>>> etc.
> As for the users - I am a user and I have demonstrated the use cases
> where the collisions due to "one symbol capitalization" would cause
> irreversible damage to data. For a library like glibc this seems like a
> relevant issue to consider.
>=20

> The "two symbol capitalization" on the other hand would prevent
> collision and can be easily corrected in the userspace if needed
> with something like
>=20

> foo=3D"SHema"
> foo=3D"${foo:0:1}$(tr '[:upper:]' '[:lower:]' <<<${foo:1})"
> echo "$foo"
> Shema
>=20

> It looks like everyone really using transliteration for something
> sensitive has already done it in userspace since at least 2006 when
> this bug was first logged. So we won't break the official use cases
> where the capitalization should be done in a certain way. But we will
> prevent new bugs due to collision if we use "two symbol capitalization"
> indeed.
>=20

> Happy to hear arguments to the contrary.
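
To make the collision concrete, here is a minimal sketch. It assumes the
two source words have already been transliterated: rule1 and rule2 are
hypothetical helpers that only print the output each candidate rule would
give (they do not call iconv or glibc), and sort -u stands in for any
tool that groups or deduplicates names.

```shell
#!/bin/sh
# Hypothetical helpers: each prints the ASCII output that the two
# Cyrillic source words would get under one candidate rule.
# Rule 1 (two symbol capitalization): SHema and Shema.
rule1() { printf '%s\n' SHema Shema; }
# Rule 2 (one symbol capitalization): Shema and Shema.
rule2() { printf '%s\n' Shema Shema; }

# Count how many distinct names survive after grouping.
distinct() { sort -u | wc -l | tr -d ' '; }

printf 'rule 1: %s distinct names\n' "$(rule1 | distinct)"
printf 'rule 2: %s distinct names\n' "$(rule2 | distinct)"
```

With these inputs rule 1 keeps two distinct names, while rule 2 collapses
them into one, which is the data loss I am worried about.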


Bests,
Diego

=E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90 Original Me=
ssage =E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90=E2=80=90
On Friday, June 7, 2019 2:57 AM, Carlos O'Donell <codonell@redhat.com> wrot=
e:

> On 6/6/19 5:31 PM, Rafal Luzynski wrote:
>=20

> > > > Possible answers (Cyrillic -> Latin Extended -> ASCII):
> > > >=20

> > > > 1.  "=D0=A8" -> "=C5=A0" -> "SH"
> > > >     e.g.: "=D0=A8=D0=B5=D0=BC=D0=B0" -> "=C5=A0ema" -> "SHema"
> > > >     "=D0=A1=D1=85=D0=B5=D0=BC=D0=B0" ----------> "Shema"
> > > >=20=20=20=20=20

> > > > 2.  "=D0=A8" -> "=C5=A0" -> "Sh"
> > > >     e.g.: "=D0=A8=D0=B5=D0=BC=D0=B0" -> "=C5=A0ema" -> "Shema"
> > > >     "=D0=A1=D1=85=D0=B5=D0=BC=D0=B0" ----------> "Shema"
> > > >=20=20=20=20=20

> > > >=20

> > > > Personally I don't like the answer 1. because "SHema" looks weird
> > > > to me. Egor in turn does not like the answer 2. because the output
> > > > string becomes ambiguous.
> > > > Should we maybe have a smart algorithm which would select the title
> > > > case or the upper case of the output characters depending on the
> > > > context in the word? Note that it would not resolve the problem of
> > > > the output text being ambiguous.
> > >=20

> > > It seems clear that there is no one right/wrong answer but it's a mat=
ter
> > > of preference, especially the way this currently works. It might be an
> > > improvement to output (for instance) SH instead of Sh if all the other
> > > letters of a word are upper-case as well but not sure what would help
> > > with the result being unambiguous.
> >=20

> > I think you refer to the idea of implementing a smart algorithm which w=
ould
> > adapt the lower/upper case depending on the context but indeed it would
> > not resolve the problem of ambiguity.
> > So, the smart algorithm aside, what should be the preferred translitera=
tion
> > rule?
>=20

> I have a weak preference for 1. However, I would change my preference if
> someone showed me existing prior implementations that did 1 or 2.
>=20

> -------------------------------------------------------------------------=
--------------------------------------------------------------------
>=20

> Cheers,
> Carlos.


-----------------------20542a4481d69ae317e855edb81936f4
Content-Type: application/pgp-keys; filename="publickey - egor@kobylkin.com - 0x01FEB4E8.asc"; name="publickey - egor@kobylkin.com - 0x01FEB4E8.asc"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="publickey - egor@kobylkin.com - 0x01FEB4E8.asc"; name="publickey - egor@kobylkin.com - 0x01FEB4E8.asc"
Content-length: 891

LS0tLS1CRUdJTiBQR1AgUFVCTElDIEtFWSBCTE9DSy0tLS0tDQpWZXJzaW9u
OiBPcGVuUEdQLmpzIHY0LjUuMQ0KQ29tbWVudDogaHR0cHM6Ly9vcGVucGdw
anMub3JnDQoNCnhqTUVYTGN4NkJZSkt3WUJCQUhhUnc4QkFRZEFUYVpYRStO
US9ZYXJYRk9jTEhJQk9DSWJ6TXNnNXpQZQ0KSTZ5VzR4OHBQVlhOSnlKbFoy
OXlRR3R2WW5sc2EybHVMbU52YlNJZ1BHVm5iM0pBYTI5aWVXeHJhVzR1DQpZ
Mjl0UHNKM0JCQVdDZ0FmQlFKY3R6SG9CZ3NKQndnREFnUVZDQW9DQXhZQ0FR
SVpBUUliQXdJZUFRQUsNCkNSQStPcVNEZ0FHcG9acmVBUDlOTUdxMXZ1UVJi
Y1hBbGhZbStvRU9XMGVWYXRyK0RJcDRBdGJoYzdkZw0KUUFFQXA1NjBKMFEz
RHpmK1BKY1pDdFBHeERlOWZWVkZyelBYUzN3MTBYN00wd2ZPT0FSY3R6SG9F
Z29yDQpCZ0VFQVpkVkFRVUJBUWRBb2RSbXRLSDkwV0ZMZzlwTHloS0c2b0Rv
ZWpIdWhjOEd0eTROSXlhRUxtd0QNCkFRZ0h3bUVFR0JZSUFBa0ZBbHkzTWVn
Q0d3d0FDZ2tRUGpxa2c0QUJxYUVtc2dFQTZnSWdWQ29jMVp0cw0KWWMyNVh6
MEtVWXNuMWtPNEZxZmwyd2pQNzVUYkxYZ0EvQW9odWdlc2xXZVFsRTdUQ2Fh
U3hFV0RXL2xYDQo4SmRlTEo4dFlIZFEvNU1MDQo9T0JwMQ0KLS0tLS1FTkQg
UEdQIFBVQkxJQyBLRVkgQkxPQ0stLS0tLQ0K

-----------------------20542a4481d69ae317e855edb81936f4--

-----------------------7a4e4ec4d4b010fb44778c0fcc5f7bbc
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"
Content-length: 249

-----BEGIN PGP SIGNATURE-----
Version: ProtonMail
Comment: https://protonmail.com

wl4EARYKAAYFAlz6Ml4ACgkQPjqkg4ABqaEEmAEA0Y7JNwcsffslelPVP+M2
gM2qwSCW5yw+pmHkvxWeTUcA/ix64YPWCp/HnuP7dppNLSSBslF9/CFPwWQx
uo2U9FEL
=uVhE
-----END PGP SIGNATURE-----


-----------------------7a4e4ec4d4b010fb44778c0fcc5f7bbc--