From mboxrd@z Thu Jan 1 00:00:00 1970
Mailing-List: contact libc-locales-help@sourceware.org; run by ezmlm
Sender: libc-locales-owner@sourceware.org
Date: Fri, 07 Jun 2019 09:46:00 -0000
To: Carlos O'Donell
From: "Diego (Egor) Kobylkin"
Cc: Rafal Luzynski, Marko Myllynen, "libc-alpha@sourceware.org", "libc-locales@sourceware.org", Siddhesh Poyarekar, Mike Fabian
Reply-To: "Diego (Egor) Kobylkin"
Subject: Re: [PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
In-Reply-To: <054f3b06-3ca8-00b0-ee07-1ff86a4106dc@redhat.com>
References: <2030695416.914859.1559778544120@poczta.nazwa.pl> <1640311749.1550210.1559856673283@poczta.nazwa.pl> <054f3b06-3ca8-00b0-ee07-1ff86a4106dc@redhat.com>

Hi Carlos et al.

On Friday, June 7, 2019 2:57 AM, Carlos O'Donell wrote:

> I have a weak preference for 1. However, I would change my preference if
> someone showed me existing prior implementations that did 1 or 2.

1. glibc already transliterates letters and ligatures to all-uppercase
sequences in locale/C-translit.h.in:

"\x00c6" "AE" # LATIN CAPITAL LETTER AE
"\x0132" "IJ" # LATIN CAPITAL LIGATURE IJ

2. I would just like to quote myself from 2018:

"collisions due to "one symbol capitalization" would cause irreversible
damage to data. For a library like glibc this seems to be a very relevant
issue to consider..."

On 03.11.18 00:27, Egor Kobylkin wrote:
> On 02.11.18 23:22, Rafal Luzynski wrote:
>>> * Consistently transliterate single uppercase Cyrillic letters to
>>> sequences of all uppercase Latin letters in all languages
>>> (whenever a Cyrillic letter is transliterated to more than one
>>> Latin letter), for example "Ї" is now transliterated as "YI" rather
>>> than "Yi".
>> I think you have not yet explained whether this is required by any
>> existing standard (please provide links) or whether this is your
>> genuine idea to distinguish between the cases like "Ш" transliterated
>> to "Sh" and "Сх" also transliterated to "Sh".
>
> I remember seeing this form of capitalization in actual
> transliterated texts a long time ago but can't find a formal description
> as of now. Just don't want to claim this to be my original idea.
>
>>> The choice for YO, SH, YA, ZH etc. is to avoid naming collisions for
>>> example for "Сх" and "Ш" that would both transliterate to Sh:
>>> With SH: "Схема"->"Shema" but "Шема"->"SHema"
>>> With Sh: "Схема"->"Shema" and "Шема"->"Shema". Collision!
>>> This is important e.g. for renaming files, grouping as in using uniq
>>> etc.
> As for the users - I am a user and I have demonstrated the use cases
> where the collisions due to "one symbol capitalization" would cause
> irreversible damage to data. For a library like glibc this seems like a
> relevant issue to consider.
>
> The "two symbol capitalization" on the other hand would prevent
> collision and can be easily corrected in userspace if needed
> with something like
>
> foo="SHema"
> foo="${foo:0:1}$(tr '[:upper:]' '[:lower:]' <<<"${foo:1}")"
> echo "$foo"
> Shema
>
> It looks like everyone really using transliteration for something
> sensitive has already done it in userspace since at least 2006 when
> this bug was first logged. So we won't break the official use cases
> where the capitalization should be done in a certain way. But we will
> prevent new bugs due to collision if we use "two symbol capitalization"
> indeed.
>
> Happy to hear arguments to the contrary.

Bests,
Diego

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, June 7, 2019 2:57 AM, Carlos O'Donell wrote:

> On 6/6/19 5:31 PM, Rafal Luzynski wrote:
>
> > > > Possible answers (Cyrillic -> Latin Extended -> ASCII):
> > > >
> > > > 1. "Ш" -> "Š" -> "SH"
> > > >    e.g.: "Шема" -> "Šema" -> "SHema"
> > > >          "Схема" ----------> "Shema"
> > > >
> > > > 2. "Ш" -> "Š" -> "Sh"
> > > >    e.g.: "Шема" -> "Šema" -> "Shema"
> > > >          "Схема" ----------> "Shema"
> > > >
> > > > Personally I don't like the answer 1. because "SHema" looks weird
> > > > to me. Egor in turn does not like the answer 2. because the output
> > > > string becomes ambiguous.
> > > > Should we maybe have a smart algorithm which would select the title
> > > > case or the upper case of the output characters depending on the
> > > > context in the word? Note that it would not resolve the problem of
> > > > the output text being ambiguous.
> > >
> > > It seems clear that there is no one right/wrong answer but it's a matter
> > > of preference, especially the way this currently works. It might be an
> > > improvement to output (for instance) SH instead of Sh if all the other
> > > letters of a word are upper-case as well but not sure what would help
> > > with the result being unambiguous.
> >
> > I think you refer to the idea of implementing a smart algorithm which would
> > adapt the lower/upper case depending on the context but indeed it would
> > not resolve the problem of ambiguity.
> > So, the smart algorithm aside, what should be the preferred transliteration
> > rule?
>
> I have a weak preference for 1. However, I would change my preference if
> someone showed me existing prior implementations that did 1 or 2.
>
> ------------------------------------------------------------------------
>
> Cheers,
> Carlos.
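
For anyone who wants to reproduce the collision argument from this thread,
here is a minimal sketch in Python. It is only an illustration under stated
assumptions: the tables COMMON, RULE_1 and RULE_2 and the helpers translit()
and collisions() are hypothetical names made up for this example, they cover
just the five letters needed for "Схема"/"Шема", and they are not the glibc
locale data or the patch itself.

```python
# Hypothetical mini-tables for illustration only; NOT the glibc locale data.
# They encode the two candidate rules for "Ш" discussed in this thread.
COMMON = {"С": "S", "х": "h", "е": "e", "м": "m", "а": "a"}
RULE_1 = {**COMMON, "Ш": "SH"}  # answer 1: all-uppercase digraph
RULE_2 = {**COMMON, "Ш": "Sh"}  # answer 2: title-case digraph


def translit(text, table):
    """Map each character through the table; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


def collisions(words, table):
    """Return outputs that more than one distinct input word maps to."""
    seen = {}
    for word in words:
        seen.setdefault(translit(word, table), set()).add(word)
    return {out: sorted(src) for out, src in seen.items() if len(src) > 1}


words = ["Схема", "Шема"]
print(collisions(words, RULE_1))  # {}
print(collisions(words, RULE_2))  # {'Shema': ['Схема', 'Шема']}
```

With the toy RULE_2 table both inputs end up as "Shema", which is the case
that matters when the output is used for file names or fed to uniq. With
RULE_1 the two words stay distinct for this pair of inputs.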