From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-alpha-return-98965-listarch-libc-alpha=sources.redhat.com@sourceware.org>
Received: (qmail 17521 invoked by alias); 2 Jan 2019 18:05:54 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Received: (qmail 17507 invoked by uid 89); 2 Jan 2019 18:05:54 -0000
Subject: Re: Help needed reviewing Cyrillic -> ASCII transliteration [BZ
 #2872]
To: Rafal Luzynski <digitalfreak@lingonborough.com>, libc-alpha@sourceware.org
References: <1042796605.674608.1545262199252@poczta.nazwa.pl>
X-DH-BACKEND: pdx1-sub0-mail-a12
From: Siddhesh Poyarekar <siddhesh@gotplt.org>
Message-ID: <f9675959-bd60-62ec-0c7a-34b3fbc88b64@gotplt.org>
Date: Wed, 02 Jan 2019 18:05:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101
 Thunderbird/60.3.1
MIME-Version: 1.0
In-Reply-To: <1042796605.674608.1545262199252@poczta.nazwa.pl>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: quoted-printable
X-SW-Source: 2019-01/txt/msg00035.txt.bz2

I've tried to overcome my general lack of confidence in commenting on
locale-related issues to provide some opinions.  Take them with an
appropriate dose of salt since, as I've said before, I have little
experience in this area.

On 20/12/18 4:59 AM, Rafal Luzynski wrote:
> * Should we take the title of the bug literally and provide the
>    transliteration exclusively to plain ASCII or should we support the
>    transliteration to extended Latin (with some diacritic characters,
>    as per ISO 9 [2]) and support plain ASCII only as a fallback?

We currently have a patch that implements the ASCII fallback, and I
reckon implementing the extended Latin transliteration would just be
additional work that could be done in a second phase if really
necessary.  In that sense, I think ASCII is sufficient as a first pass.

> * Should we agree for Cyrillic -> extended Latin -> ASCII even if the
>    ASCII fallback does not fully conform with any existing standard?

I have no idea what standards govern this, so I have no opinion on it.

> * Should we implement Cyrillic -> plain ASCII as per GOST System B [3]
>    and skip extended Latin if it is impossible to handle both for standards
>    technical reasons?

Sounds reasonable.

> * Is the C builtin locale the correct place to put this transliteration?
>    If yes, should we think about including the support of other alphabets
>    as well (like extended Latin -> plain ASCII, Greek -> Latin, and so on)
>    ever in future?

Yes on both counts, although this could bloat the C locale.  If we are
to provide additional transliterations of this sort, we probably need
to provide some way to trim them.

> * Should the Cyrillic transliteration work in every locale (possibly with
>    few exceptions) or should we require that a locale actually using
>    Cyrillic script must be used? (E.g., should it work when ru_RU is not
>    installed? should it work if en_US is the only locale installed? Should
>    it work when no locale is installed, even en_US?)

Would it matter if it was in the C builtin locale?

> * Is it required that transliteration produces unambiguous output which
>    means that two different original strings never produce the same result?
>    (As a consequence, the reverse transliteration could be possible).

I don't think so.  Transliterations are approximations in the end and
striving for such guarantees might be overreach.
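
To make the ambiguity concrete, here is a minimal sketch.  The table
below is hypothetical and deliberately tiny: it is not the mapping from
the patch, GOST System B or ISO 9, and it only assumes that "х" maps to
"h", as in the example quoted further down.  The names TABLE, translit
and collisions are made up for illustration.

```python
# Hypothetical mapping for illustration only; NOT the table from the patch.
TABLE = {"а": "a", "с": "s", "х": "h", "ш": "sh"}


def translit(text):
    """Replace each character via TABLE; unknown characters pass through."""
    return "".join(TABLE.get(ch, ch) for ch in text)


def collisions(words):
    """Group inputs by transliterated output; keep outputs hit more than once."""
    seen = {}
    for word in words:
        seen.setdefault(translit(word), []).append(word)
    return {out: srcs for out, srcs in seen.items() if len(srcs) > 1}


print(collisions(["ша", "сха", "са"]))
```

With this table "ша" and "сха" both come out as "sha", so nothing can
map the result back to a unique original.  Avoiding that would mean
picking outputs that never overlap, which is the kind of guarantee I
doubt is worth striving for.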

> Additionally we have a disagreement about how should we handle the case
> when a single original uppercase character transliterates into a digraph
> in ASCII.  Should both ASCII characters be uppercase (which is good for
> all uppercase strings and also good to emphasize that the original character
> was single rather than two separate characters which accidentally
> transliterate into two characters making a digraph) or should only the first
> ASCII character be uppercase (which is good for the titlecase words which is
> common in natural texts)?  An example is "Ш" - should it be "SH" or "Sh"?
> Note that "Сх" may also produce "Sh" ("S" + "h" -> "Sh").

Is that important?
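
For reference, the two casing options can be sketched as follows.  This
is only an illustration under assumptions: the table is a hypothetical
subset (with "х" mapped to "h" as in the quoted example), and the
digraph_case parameter is a name I made up, not anything in the patch
or in glibc.

```python
# Hypothetical lowercase table; uppercase input is handled by the policy below.
TABLE = {"а": "a", "к": "k", "п": "p", "с": "s", "х": "h", "ш": "sh"}


def translit(text, digraph_case="title"):
    """Transliterate text; digraph_case is "upper" (SH) or "title" (Sh)."""
    out = []
    for ch in text:
        latin = TABLE.get(ch.lower())
        if latin is None:
            out.append(ch)  # not in the table: pass through unchanged
        elif ch.isupper():
            # Only multi-letter outputs differ between the two policies.
            if digraph_case == "upper":
                out.append(latin.upper())
            else:
                out.append(latin.capitalize())
        else:
            out.append(latin)
    return "".join(out)


for word in ("Шапка", "ШАПКА"):
    print(word, translit(word, "upper"), translit(word, "title"))
```

Under these assumptions "Шапка" gives "SHapka" or "Shapka" and "ШАПКА"
gives "SHAPKA" or "ShAPKA", so each policy looks wrong for one of the
two common cases.  The "title" policy also makes "Ш" and "Сх" both come
out as "Sh", whereas "upper" keeps them apart.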

> We are lucky that some of existing glibc locales already handle
> transliteration
> from Cyrillic to Latin, for example sr_RS and uk_UA.  Unfortunately, they
> follow their national standards rather than ISO or GOST so they cannot
> be copied directly to ru_RU or applied universally to all locales.
>
> Also, taking Egor's work into account, can we include this bug into the
> list of desirable to be fixed in 2.29?

It's late for 2.29 (sorry, it's partly my fault for not being decisive
enough about it) but please continue reviewing so that it lands first
thing in 2.30.  It also looks like something that can be safely
backported assuming that it does not affect translations, so you could
well do that for 2.29 or as far back as you like.

Siddhesh