From mboxrd@z Thu Jan 1 00:00:00 1970
Received: (qmail 94142 invoked by alias); 19 Apr 2019 22:24:26 -0000
Mailing-List: contact libc-locales-help@sourceware.org; run by ezmlm
Precedence: bulk
Sender: libc-locales-owner@sourceware.org
Received: (qmail 94122 invoked by uid 89); 19 Apr 2019 22:24:26 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: =?ISO-8859-1?Q?No, score=-4.5 required=5.0 tests=AWL,BAYES_00,BODY_8BITS,GARBLED_BODY,KAM_NUMSUBJECT,RCVD_IN_DNSWL_NONE autolearn=no version=3.3.1 spammy==d0=a1=d1, H*x:Mailer, H*UA:Mailer, Latin?=
X-HELO: shared-ano163.rev.nazwa.pl
X-Spam-Score: -0.4
Date: Fri, 19 Apr 2019 22:24:00 -0000
From: Rafal Luzynski
To: Marko Myllynen, Egor Kobylkin, libc-alpha@sourceware.org, libc-locales@sourceware.org, Carlos O'Donell
Cc: Siddhesh Poyarekar, Mike Fabian
Message-ID: <32577152.740058.1555712661835@poczta.nazwa.pl>
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com> <20180412224352.GB2911@altlinux.org> <2124833400.35614.1546698902753@poczta.nazwa.pl> <908ed415-cfe4-804c-f421-4351ef062edc@kobylkin.com> <6d076299-babd-406a-b1fe-87778f54bf36@kobylkin.com> <41aff10b-9cf1-638c-4fbc-8c4f4122f2e9@kobylkin.com>
Subject: Re: [PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872] ping for 2.30
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-SW-Source: 2019-q2/txt/msg00028.txt.bz2

Thank you, Siddhesh and Carlos, for your involvement in testing this
patch, and I apologize to Egor, Marko, and everyone else who needs this
patch pushed for my own poor involvement.

I'd like to reply to this email from Marko because it summarizes all
the issues. I also hope to explain the problems that got me stuck.

14.02.2019 17:48 Marko Myllynen wrote:
> [...]
> 1) Built-in C locale doesn't read/use any translit_* files and it can't
> have any fallback mechanisms and it only supports ASCII so using GOST
> 7.79 System B in locale/C-translit.h.in (as per patch v12) would seem to
> be the appropriate way to implement Cyrillic transliteration for the
> built-in C locale (it adds some 8KB to the binary).

This sounds like a good idea. Also, the C locale is probably a good way
to enforce plain ASCII transliteration without any fallback.

> 2) Other locales read/use translit_* files and with them fallbacks and
> non-ASCII are possible so it would seem preferable to first try ISO 9 /
> GOST 7.79 System A

OK, we agree here.

> and only if that fails then use GOST 7.79 System B
> (in which case the end result should match with the built-in C locale).

This is impossible because of the following case. System A
transliterates the Cyrillic "Х" to Latin "H", while System B
transliterates it to Latin "X". Transliteration as implemented in glibc
supports only a simple fallback algorithm: transliterate the letter "X"
to "YY", but if "YY" is not available then to "ZZ". It cannot support
the more complex algorithm we would need here: transliterate "X" to
"YY", but if "Q" cannot be transliterated to "RR" then transliterate
"X" to "ZZ". In our case we would like to transliterate "Х" to "X" if
"Ш" cannot be transliterated to "Š". The only thing we can implement is
a fallback transliteration which is similar to System B but not 100%
compatible with it.

This problem does not arise if we implement only System B in the C
locale: there we already know that "Š" is unavailable, so we always
have to transliterate "Х" to "X".

> For this the translit_cyrillic file should be added (as per patch v9 +
> changes mentioned in patches v10 and v12).
>
> 3) Individual locale files can then be updated to use translit_cyrillic
> as appropriate (see patch v9) and language/national specific conventions
> (e.g., SFS 4900 for fi_FI) can be applied on per-locale basis.

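To make the conflict around the Cyrillic "Х" (System A: "H", System B:
"X") concrete, here is a minimal Python sketch. It is only a simplified
model of a per-character fallback chain, not glibc code: the two tables
hold just the two letters discussed in this thread, and the function
names are hypothetical.

```python
# Hypothetical tables: only the two letters discussed in this thread.
SYSTEM_A = {"Х": "H", "Ш": "Š"}   # ISO 9 / GOST 7.79 System A
SYSTEM_B = {"Х": "X", "Ш": "Sh"}  # GOST 7.79 System B ("Sh" vs. "SH" is still open)


def encodable(text, charset):
    """Return True if text can be represented in the target charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def per_char_fallback(text, charset):
    """Try System A, then System B, independently for every character."""
    out = []
    for ch in text:
        if encodable(ch, charset):
            out.append(ch)
            continue
        candidates = [table[ch] for table in (SYSTEM_A, SYSTEM_B) if ch in table]
        # First candidate that fits the charset wins; "?" if none does.
        out.append(next((c for c in candidates if encodable(c, charset)), "?"))
    return "".join(out)


def pure_system_b(text):
    """What a locale that implements only System B would produce."""
    return "".join(SYSTEM_B.get(ch, ch) for ch in text)


print(per_char_fallback("ХШ", "ascii"))  # HSh
print(pure_system_b("ХШ"))               # XSh
```

With an ASCII target, "Ш" falls back to System B while "Х" still takes
its System A form, because the choice for one letter cannot depend on
what happened to another. The result therefore mixes both systems
instead of matching pure System B.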
Sometimes I wonder whether any locale other than one for a language
that uses the Cyrillic script would really want a Cyrillic
transliteration but, on the other hand, why not.

I'd also like to reiterate the other disagreements we have here:

1. How to handle upper/lower case in System B? Should we transliterate
   "Ш" to "SH" or "Sh"? Should we perhaps first implement a smart,
   context-based casing algorithm? I mean an algorithm that detects
   whether an uppercase letter is the first letter of an otherwise
   lowercase word, and so should be transliterated as "Sh", or appears
   in a fully uppercase word, and so should be transliterated as "SH".
   I think that uconv implements this algorithm.

2. How to handle ambiguous transliterations like
   "Схема" -> "Shema" vs. "Шема" -> "Shema"? "SHema"?

3. How to handle characters that are proper letters in Cyrillic and
   have an upper and a lower case, like the hard and soft signs, but
   are transliterated to punctuation characters (grave accent "`")?
   Should we transliterate upper and lower case to the same character,
   or should we mark them somehow? uconv adds the Unicode combining low
   line to the grave accent (so the output is "`̲") if the original
   Cyrillic character was uppercase. But this is unavailable if our
   target charset is ASCII.

Regarding the test cases I mentioned the other day: I discussed this
with Dmitry, and he convinced me that requiring them sets the bar too
high, so I agree we don't need to require them for now.

Regards,

Rafal
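
P.S. To make the context-based casing idea from point 1 concrete, here
is a minimal Python sketch. It is not what uconv or glibc does; the
table is a hypothetical, partial System B mapping, and the rule
(look at the neighbouring letter) is just one assumed way to decide
between "Sh" and "SH".

```python
# Hypothetical, partial System B table (uppercase keys), for illustration only.
SYSTEM_B = {"Ш": "SH", "Е": "E", "М": "M", "А": "A"}


def translit_word(word):
    """Transliterate one word, choosing "Sh" or "SH" from the context."""
    out = []
    for i, ch in enumerate(word):
        latin = SYSTEM_B.get(ch.upper())
        if latin is None:
            out.append(ch)  # leave characters outside the table untouched
        elif ch.islower():
            out.append(latin.lower())
        elif len(latin) > 1:
            # Uppercase letter that maps to a digraph: look at the next
            # letter, or at the previous one when this is the last letter.
            neighbour = word[i + 1] if i + 1 < len(word) else word[i - 1:i]
            out.append(latin.capitalize() if neighbour.islower() else latin)
        else:
            out.append(latin)
    return "".join(out)


print(translit_word("Шема"))  # Shema
print(translit_word("ШЕМА"))  # SHEMA
```

A single uppercase "Ш" with no neighbour comes out as "SH" here, which
is an arbitrary choice of this sketch and exactly the kind of case we
would still have to agree on.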