From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-locales-return-6571-listarch-libc-locales=sources.redhat.com@sourceware.org>
Received: (qmail 113626 invoked by alias); 3 Jan 2019 13:41:24 -0000
Mailing-List: contact libc-locales-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-locales.sourceware.org>
List-Subscribe: <mailto:libc-locales-subscribe@sourceware.org>
List-Post: <mailto:libc-locales@sourceware.org>
List-Help: <mailto:libc-locales-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: libc-locales-owner@sourceware.org
Received: (qmail 113597 invoked by uid 89); 3 Jan 2019 13:41:23 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-1.9 required=5.0 tests=BAYES_00,RCVD_IN_DNSWL_NONE,SPF_PASS autolearn=ham version=3.3.2 spammy=box, strive
X-HELO: golden.birch.relay.mailchannels.net
X-Sender-Id: dreamhost|x-authsender|siddhesh@gotplt.org
X-MC-Relay: Neutral
X-MailChannels-SenderId: dreamhost|x-authsender|siddhesh@gotplt.org
X-MailChannels-Auth-Id: dreamhost
X-Wiry-Troubled: 7585fe5c36d8974a_1546522878134_1779076345
X-MC-Loop-Signature: 1546522878134:3995298276
X-MC-Ingress-Time: 1546522878133
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gotplt.org; h=subject:to
	:references:from:message-id:date:mime-version:in-reply-to
	:content-type:content-transfer-encoding; s=gotplt.org; bh=vKCkNZ
	TQ4KYvwsghr1b1CSE7LMk=; b=anWj3VVPywI5IsWzGlCrg6E0TevlR+3Vu1tojz
	4c0ne2qgg2PxYrRCvL4lLynGPHQ3AvYtupgbGLSK/HyJ/SFGc9xCbEYhY0Aind3Q
	zPKBsnf4IkwhUJjQhSO7RT1ZJZQUbD+7ing8dSM4eZ2qj29o9npJLK42yu28S28e
	qEX/s=
Subject: Re: Help needed reviewing Cyrillic -> ASCII transliteration [BZ
 #2872]
To: Egor Kobylkin <egor@kobylkin.com>,
 Rafal Luzynski <digitalfreak@lingonborough.com>,
 Carlos O'Donell <carlos@redhat.com>, libc-alpha@sourceware.org,
 libc-locales@sourceware.org
References: <abf0875c-a9e7-0867-4f2a-67265c36f091@kobylkin.com>
X-DH-BACKEND: pdx1-sub0-mail-a11
From: Siddhesh Poyarekar <siddhesh@gotplt.org>
Message-ID: <a706684f-d011-c2e7-021c-c1946dfa702c@gotplt.org>
Date: Thu, 03 Jan 2019 13:41:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101
 Thunderbird/60.3.1
MIME-Version: 1.0
In-Reply-To: <abf0875c-a9e7-0867-4f2a-67265c36f091@kobylkin.com>
Content-Type: text/plain; charset=utf-8; format=flowed
X-VR-OUT-STATUS: OK
X-VR-OUT-SCORE: -100
X-VR-OUT-SPAMCAUSE: gggruggvucftvghtrhhoucdtuddrgedtledrudekgdehiecutefuodetggdotefrodftvfcurfhrohhfihhlvgemucggtfgfnhhsuhgsshgtrhhisggvpdfftffgtefojffquffvnecuuegrihhlohhuthemuceftddtnecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenucfjughrpefuvfhfhffkffgfgggjtgfgsehtkeertddtfeejnecuhfhrohhmpefuihguughhvghshhcurfhohigrrhgvkhgrrhcuoehsihguughhvghshhesghhothhplhhtrdhorhhgqeenucfkphepgeelrddvgeekrdduleeirddvgeefnecurfgrrhgrmhepmhhouggvpehsmhhtphdphhgvlhhopegludelvddrudeikedruddrudeingdpihhnvghtpeegledrvdegkedrudeliedrvdegfedprhgvthhurhhnqdhprghthhepufhiugguhhgvshhhucfrohihrghrvghkrghruceoshhiugguhhgvshhhsehgohhtphhlthdrohhrgheqpdhmrghilhhfrhhomhepshhiugguhhgvshhhsehgohhtphhlthdrohhrghdpnhhrtghpthhtohepvghgohhrsehkohgshihlkhhinhdrtghomhenucevlhhushhtvghrufhiiigvpedt
Content-Transfer-Encoding: quoted-printable
X-SW-Source: 2019-q1/txt/msg00014.txt.bz2

On 03/01/19 4:52 PM, Egor Kobylkin wrote:
> Is there a specific way you measure the bloat of the C locale?
> Is it the size of the resulting libc.so.6 file we are concerned with?

I believe it's built into libc.so, so I suppose you'd have to look at
its file size.

> In terms of the source code we are just adding as many lines as there
> are letters (169 insertions for Cyrillic in this patch v12)

Yeah, it will likely not be much for a single locale, but it may add up
across locales.  I have no idea how much; it may well be insignificant.

> Just for clarification, the whole point (at least for me) for this patch
> is to have the transliteration when other methods are not available. Or
> when existing programs/systems can not make use of them. The most basic
> example: filenames in Cyrillic on a NAS that get converted to
> ????????.??? and get overwritten in the worst case. So the most value is
> when it works out of box with the C builtin. Other locales can actually
> implement their own variant and explicitly use it if they need one; some
> already have, others may be just fine with the builtin C.

That's a fair point but given the approximation, that specific use case
may still be flaky.

>>> Additionally we have a disagreement about how should we handle the
>>> case when a single original uppercase character transliterates
>>> into a digraph in ASCII.  Should both ASCII characters be
>>> uppercase (which is good for all uppercase strings and also good to
>>> emphasize that the original character was single rather than two
>>> separate characters which accidentally transliterate into two
>>> characters making a digraph) or should only the first ASCII
>>> character be uppercase (which is good for the titlecase words which
>>> is common in natural texts)? An example is "Ш" - should it be "SH"
>>> or "Sh"? Note that "Сх" may also produce "Sh" ("S" + "h" -> "Sh").
>>
>>
>> Is that important?
>
> As in the above example about the files, you would probably agree that
> it's better not to knowingly introduce a failure vector for such basic
> OS operations like working with files. The transliteration
> capitalization collisions have this negative potential. The users that
> need a different specific capitalization can still implement that in
> their locale.

OK, I can see the reason for reducing collisions but again, it remains
flaky.  We could, in the interest of moving forward, strive towards
making it less flaky but at the same time be aware that there may
eventually be collisions.
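
To make the collision concrete, here is a minimal sketch in Python.  The
two tables are made up for illustration (they are not the mappings from
the patch) and `transliterate` is a hypothetical helper, not a glibc
interface.  It only shows how the choice between "Sh" and "SH" moves the
collision around rather than removing it.

```python
# Hypothetical per-character tables, for illustration only.
# TITLE_DIGRAPH maps an uppercase letter to a titlecase digraph ("Sh").
TITLE_DIGRAPH = {"Ш": "Sh", "С": "S", "Х": "H", "х": "h"}
# UPPER_DIGRAPH is the same table except the digraph is all uppercase ("SH").
UPPER_DIGRAPH = {**TITLE_DIGRAPH, "Ш": "SH"}


def transliterate(text, table):
    """Replace each character via the table; pass unknown characters through."""
    return "".join(table.get(ch, ch) for ch in text)


# Titlecase digraph: "Ш" and "Сх" both become "Sh", so two distinct
# file names would map to the same ASCII name.
title_collides = (
    transliterate("Ш", TITLE_DIGRAPH) == transliterate("Сх", TITLE_DIGRAPH)
)

# Uppercase digraph: "Ш" and "Сх" now differ ("SH" vs "Sh") ...
upper_fixes_mixed_case = (
    transliterate("Ш", UPPER_DIGRAPH) != transliterate("Сх", UPPER_DIGRAPH)
)

# ... but "Ш" and the all-uppercase "СХ" still both become "SH".
upper_still_collides = (
    transliterate("Ш", UPPER_DIGRAPH) == transliterate("СХ", UPPER_DIGRAPH)
)

print(title_collides, upper_fixes_mixed_case, upper_still_collides)
```

Under these assumed tables either choice leaves some pair of inputs that
map to the same ASCII string, which is why I say it stays flaky.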

Siddhesh