Subject: Re: [PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
From: Egor Kobylkin
To: Rafal Luzynski, libc-alpha@sourceware.org, libc-locales@sourceware.org, Marko Myllynen
Date: Sat, 17 Nov 2018 18:35:00 -0000

Hi Rafal,

thanks for putting the SH/Sh problem into a clear issue statement. I'm
totally with you that this is a good thing to discuss. It is orthogonal
to the tests, so let me focus on the SH/Sh and System A/B problems here.

Looks like we have three issues:

1. There is no explicit control over which transformation //TRANSLIT
   uses (System A or System B).
2. System B can produce collisions if capital letters are transcribed
   in CAP/low form.
3. Cyrillic 'Х'/'х' (ha) never transcribes to 'H'/'h' as it should per
   System B, because its equivalent 'X'/'x' from System A is always
   present and takes precedence.
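
To make issue 2 concrete, here is a minimal sketch. The two tables are
hypothetical excerpts written for illustration only (they are not the
tables from the patch), and the 'х' -> 'h' entry follows the System B
reading used in this thread.

```python
# Hypothetical excerpt of a System B style table; not taken from the patch.
CAP_CAP = {
    "Ш": "SH", "ш": "sh",
    "С": "S", "с": "s",
    "Х": "H", "х": "h",
    "е": "e", "м": "m", "а": "a",
}
# Same table, but the capital digraph is written in CAP/low form.
CAP_LOW = {**CAP_CAP, "Ш": "Sh"}


def transcribe(text, table):
    """Replace each character by its table entry; keep unknown ones as-is."""
    return "".join(table.get(ch, ch) for ch in text)


words = ["Шема", "Схема"]
cap_low = [transcribe(w, CAP_LOW) for w in words]
cap_cap = [transcribe(w, CAP_CAP) for w in words]

print(cap_low)  # both words collapse to the same string
print(cap_cap)  # the two words stay distinct
```

With CAP/low the two source words can no longer be told apart after
transcription; with CAP/CAP they can.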

As a solution, shouldn't we keep only System B, in a new file
transcribe_cyrillic, and put it in place as the explicit ASCII
transcription for the targeted locales (as opposed to transliteration)?
We would keep System A as translit_cyrillic but would not include it in
this patch. Once you have resolved the issue of having two conflicting
rule sets but only one key (//TRANSLIT), you could add System A back.
The SH/Sh question can be decided either way; it seems like an easy
change anyway.

Please see more discussion on your excellent points below:

On 16.11.18 23:17, Rafal Luzynski wrote:
> Egor, while at this I was thinking about your idea to transliterate
> letters like "Ш" (uppercase) to "SH" (always uppercase) in order to
> distinguish between "Шема" (-> "SHema") and "Схема" (-> "Shema" or
> "Sxema").

To clarify, this SH/Sh collision issue relates only to

    iconv -f UTF-8 -t ASCII//TRANSLIT

(i.e. System B transcription). But it's not only SH/Sh; the following
combinations are used to transcribe capital letters:

YO, DJ, YE, TSH, DH, ZH, CZ, CH, SH, SHH, YU, YA, FH, YH, GH, NG, TCZ

Arguably any of them, if not kept in CAP/CAP form, could collide with
the same CAP/low sequence coming from a different word. (There may be
grammar rules in the language that in fact prevent some of these
collisions, but we don't know for sure.)

With transcription we are basically stripping information from the
data, mapping it into a smaller character set. The idea of keeping the
combinations in CAP/CAP is to preserve as much information as possible.

> Also you include a rule to transliterate "Х" to "H" or "X" depending
> on which destination characters are available, which I told you
> already that will not work because both "H" and "X" are always
> available and therefore only the first rule will always be used.
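
The "first rule always wins" behaviour can be modelled with a small
sketch. This is a toy model of the rule selection as described in this
thread, not glibc's implementation. The candidate lists and the helper
names are hypothetical, and the order of the two 'х' candidates is an
assumption taken from issue 3 above.

```python
# Hypothetical candidate lists in file order; not copied from the patch.
CANDIDATES = {
    "ш": ["š", "sh"],  # first candidate needs a non-ASCII target charset
    "х": ["x", "h"],   # both candidates are plain ASCII
}


def representable(s, charset):
    """Return True if s can be encoded in the given charset."""
    try:
        s.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def translit(text, charset):
    """Pick, per character, the first candidate the target can represent."""
    out = []
    for ch in text:
        options = CANDIDATES.get(ch, [ch])
        # Fall back to "?" when no candidate fits the target charset.
        out.append(next((o for o in options if representable(o, charset)), "?"))
    return "".join(out)


print(translit("шх", "ascii"))        # 'ш' falls through to its second candidate
print(translit("шх", "iso-8859-15"))  # 'ш' can use its first candidate here
```

For 'ш' the target charset does select between the two candidates, but
for 'х' the second candidate is unreachable in either target, because
the first one is always representable.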

Just to have this here for reference, the idea was to have both rules
in one file, so that

    iconv -f UTF-8 -t ASCII//TRANSLIT

will produce an ASCII-compatible _transcription_ (System B), and

    iconv -f UTF-8 -t ISO-8859-15//TRANSLIT | iconv -f ISO-8859-15 -t UTF-8

will produce a Latin _transliteration_ as per ISO 9:1995 (System A).

So in fact we have two rules for each letter in the same file (System A
and System B), where System A takes precedence.

I have a question then: isn't this more of a hack than the right thing
to do? Shouldn't we have two explicit rules, one for transcription and
one for transliteration, that do not depend on the destination
character set?

> I still don't like the idea to
> put two uppercase letters in a beginning of a word in titlecase only
> to indicate that there was originally a single letter. What if we:
>
> * drop the rule of transliterating "Х" to "H" and transliterate
>   always to "X",

This would contradict ISO 9:1995 (System A). System A was added at
Marko's request (so I am putting him on To:). Just to be clear, I am
neutral on keeping it or dropping it.

> * transliterate uppercase "Ш" to "Sh" (so it will work fine for
>   titlecase words)?
>
> As a result the Latin letter "h" will only appear as part of a
> digraph and never as a transliteration of "Х" and therefore will
> never cause a conflict. Examples:
>
> * "Шема" -> "Shema",
> * "Схема" -> "Sxema".
>
> Will this solve the problem?

This particular rule with h/x would make sense on its own. But again -
it would contradict the standards. On the other hand, for my personal
needs I care less about standards than about current functionality and
the data loss caused by transcription being missing altogether, which
is what BZ #2872 is about.

Bests,
Egor