From: Siddhesh Poyarekar
To: Rafal Luzynski, libc-alpha@sourceware.org
Subject: Re: Help needed reviewing Cyrillic -> ASCII transliteration [BZ #2872]
Date: Wed, 02 Jan 2019 18:05:00 -0000
In-Reply-To: <1042796605.674608.1545262199252@poczta.nazwa.pl>

I've tried to overcome my general lack of confidence in commenting on
locale related issues to provide some opinions. I'd take those with an
appropriate dose of salt since like I've said before, I have little
experience in this area.

On 20/12/18 4:59 AM, Rafal Luzynski wrote:
> * Should we take the title of the bug literally and provide the
> transliteration exclusively to plain ASCII or should we support the
> transliteration to extended Latin (with some diacritic characters,
> as per ISO 9 [2]) and support plain ASCII only as a fallback?

We currently have a patch with ASCII fallback implemented and I reckon
implementing Latin fallback would just be additional work that could be
done in a second phase if really necessary. In that sense, I'd think
ASCII is sufficient as a first pass.

> * Should we agree for Cyrillic -> extended Latin -> ASCII even if the
> ASCII fallback does not fully conform with any existing standard?

I have no idea what standards govern this, so I have no opinion on it.

> * Should we implement Cyrillic -> plain ASCII as per GOST System B [3]
> and skip extended Latin if it is impossible to handle both for standards
> technical reasons?

Sounds reasonable.

> * Is the C builtin locale the correct place to put this transliteration?
> If yes, should we think about including the support of other alphabets
> as well (like extended Latin -> plain ASCII, Greek -> Latin, and so on)
> ever in future?

Yes on both counts, although this could result in bloating of the C
locale. If we are to provide additional transliteration of this sort,
we probably need to provide some way to trim it.

> * Should the Cyrillic transliteration work in every locale (possibly with
> few exceptions) or should we require that a locale actually using
> Cyrillic script must be used? (E.g., should it work when ru_RU is not
> installed? should it work if en_US is the only locale installed? Should
> it work when no locale is installed, even en_US?)

Would it matter if it was in the C builtin locale?

> * Is it required that transliteration produces unambiguous output which
> means that two different original strings never produce the same result?
> (As a consequence, the reverse transliteration could be possible).

I don't think so. Transliterations are approximations in the end and
striving for such guarantees might be overreach.

> Additionally we have a disagreement about how should we handle the case
> when a single original uppercase character transliterates into a digraph
> in ASCII. Should both ASCII characters be uppercase (which is good for
> all uppercase strings and also good to emphasize that the original character
> was single rather than two separate characters which accidentally
> transliterate into two characters making a digraph) or should only the first
> ASCII character be uppercase (which is good for the titlecase words which is
> common in natural texts)? An example is "Ш" - should it be "SH" or "Sh"?
> Note that "Сх" may also produce "Sh" ("S" + "h" -> "Sh").

Is that important?

> We are lucky that some of existing glibc locales already handle
> transliteration from Cyrillic to Latin, for example sr_RS and uk_UA.
> Unfortunately, they follow their national standards rather than ISO or
> GOST so they cannot be copied directly to ru_RU or applied universally
> to all locales.
>
> Also, taking Egor's work into account, can we include this bug into the
> list of desirable to be fixed in 2.29?

It's late for 2.29 (sorry, it's partly my fault for not being decisive
enough about it) but please continue reviewing so that it lands first
thing in 2.30. It also looks like something that can be safely
backported assuming that it does not affect translations, so you could
well do that for 2.29 or as far back as you like.

Siddhesh
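
The digraph casing question and the "Sh" collision raised in the quoted text can be made concrete with a small sketch. This is only an illustration: the four-entry table and the names `LOWER`, `translit` and `digraph_upper` are hypothetical, the "х" -> "h" mapping is taken from the example in the thread rather than from any standard, and none of it reflects the table in the patch under review.

```python
# Hypothetical, partial lowercase table for illustration only.
# The "х" -> "h" entry follows the example quoted in the thread,
# not ISO 9 or GOST, and is not the mapping used by the glibc patch.
LOWER = {
    "а": "a",
    "с": "s",
    "х": "h",
    "ш": "sh",
}


def translit(text, digraph_upper):
    """Transliterate text using LOWER.

    digraph_upper=True  -> an uppercase source letter gives "SH"
    digraph_upper=False -> an uppercase source letter gives "Sh"
    Characters missing from the table are passed through unchanged.
    """
    out = []
    for ch in text:
        low = ch.lower()
        if low not in LOWER:
            out.append(ch)
            continue
        latin = LOWER[low]
        if ch != low:  # the source character was uppercase
            latin = latin.upper() if digraph_upper else latin.capitalize()
        out.append(latin)
    return "".join(out)


if __name__ == "__main__":
    for word in ("Ш", "Сх", "ША"):
        print(word, translit(word, True), translit(word, False))
```

With the titlecase policy, "Ш" and "Сх" both come out as "Sh", so the output alone cannot be reversed. With the all-uppercase policy the two stay distinct ("SH" versus "Sh"), but the collision returns for fully uppercase input such as "СХ".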