From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: Help needed reviewing Cyrillic -> ASCII transliteration [BZ #2872]
To: Egor Kobylkin, Rafal Luzynski, Carlos O'Donell, libc-alpha@sourceware.org, libc-locales@sourceware.org
From: Siddhesh Poyarekar
Date: Thu, 03 Jan 2019 13:41:00 -0000
Content-Type: text/plain; charset=utf-8; format=flowed
X-SW-Source: 2019-q1/txt/msg00014.txt.bz2

On 03/01/19 4:52 PM, Egor Kobylkin wrote:
> Is there a specific way you measure the bloat of the C locale?
> Is it the size of the resulting libc.so.6 file we are concerned with?

I believe it's built into libc.so, so I suppose you'd have to look at
its file size.

> In terms of the source code we are just adding as many lines as there
> are letters (169 insertions for Cyrillic in this patch v12)

Yeah, it will likely not be much for a single locale, but it may add up
across locales.  I have no idea how much, it may well be insignificant.

> Just for clarification, the whole point (at least for me) for this patch
> is to have the transliteration when other methods are not available. Or
> when existing programs/systems can not make use of them. The most basic
> example: filenames in Cyrillic on a NAS that get converted to
> ????????.??? and get overwritten in the worst case. So the most value is
> when it works out of box with the C builtin. Other locales can actually
> implement their own variant and explicitly use it if they need one; some
> already have, others may be just fine with the builtin C.

That's a fair point but given the approximation, that specific use case
may still be flaky.

>>> Additionally we have a disagreement about how should we handle the
>>> case when a single original uppercase character transliterates
>>> into a digraph in ASCII.  Should both ASCII characters be
>>> uppercase (which is good for all uppercase strings and also good to
>>> emphasize that the original character was single rather than two
>>> separate characters which accidentally transliterate into two
>>> characters making a digraph) or should only the first ASCII
>>> character be uppercase (which is good for the titlecase words which
>>> is common in natural texts)? An example is "Ш" - should it be "SH"
>>> or "Sh"? Note that "Сх" may also produce "Sh" ("S" + "h" -> "Sh").
>>
>>
>> Is that important?
>
> As in the above example about the files, you would probably agree that
> it's better not to knowingly introduce a failure vector for such basic
> OS operations like working with files. The transliteration
> capitalization collisions have this negative potential. The users that
> need a different specific capitalization can still implement that in
> their locale.

OK, I can see reason for reducing collisions but again, it remains
flaky.  We could in the interest of moving forward, strive towards
making it less flaky but at the same time be aware that there may
eventually be collisions.

Siddhesh
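
To make the collision concern concrete, here is a rough sketch in Python. It is not the glibc implementation and not the mapping from the patch: the tables TITLE and UPPER, the helper names, and the choice of "Х" -> "H" are all assumptions made up for illustration. It shows that "Ш" and "Сх" collapse to the same name when only the first letter of the digraph is capitalized, that capitalizing the whole digraph separates that pair, and that the all-caps choice can still collide with "СХ" under the assumed table.

```python
# Hypothetical tables for illustration only; NOT the mapping in the patch.
TITLE = {"Ш": "Sh", "С": "S", "х": "h", "Х": "H"}  # digraph as "Sh"
UPPER = {"Ш": "SH", "С": "S", "х": "h", "Х": "H"}  # digraph as "SH"


def translit(text, table):
    """Map each character through table; unmapped non-ASCII becomes '?'."""
    out = []
    for ch in text:
        if ch in table:
            out.append(table[ch])
        elif ord(ch) < 128:
            out.append(ch)   # plain ASCII passes through unchanged
        else:
            out.append("?")  # no transliteration available
    return "".join(out)


def collisions(names, table):
    """Group names by transliterated form; keep groups with 2+ members."""
    groups = {}
    for name in names:
        groups.setdefault(translit(name, table), []).append(name)
    return {k: v for k, v in groups.items() if len(v) > 1}


# "Ш" vs "Сх": collide with the title-case digraph, not with the upper-case one.
print(collisions(["Ш.txt", "Сх.txt"], TITLE))
print(collisions(["Ш.txt", "Сх.txt"], UPPER))

# "Ш" vs "СХ": the upper-case digraph collides here instead.
print(collisions(["Ш.txt", "СХ.txt"], UPPER))

# With no table at all, same-length names become identical runs of '?'.
print(collisions(["Ш.txt", "Я.txt"], {}))
```

Under these assumed tables, neither capitalization rule removes collisions entirely; each only changes which inputs collide, which matches the point that the result can be made less flaky but not collision-free.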