From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-locales-return-6500-listarch-libc-locales=sources.redhat.com@sourceware.org>
Received: (qmail 131018 invoked by alias); 17 Nov 2018 18:35:01 -0000
Mailing-List: contact libc-locales-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-locales.sourceware.org>
List-Subscribe: <mailto:libc-locales-subscribe@sourceware.org>
List-Post: <mailto:libc-locales@sourceware.org>
List-Help: <mailto:libc-locales-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: libc-locales-owner@sourceware.org
Received: (qmail 130978 invoked by uid 89); 17 Nov 2018 18:35:00 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=3.1 required=5.0 tests=BAYES_00,BODY_8BITS,GARBLED_BODY,KAM_LAZY_DOMAIN_SECURITY,RCVD_IN_DNSWL_NONE autolearn=no version=3.3.2 spammy=combinations, focus, 16.11.18, 161118
X-HELO: mout.kundenserver.de
Subject: Re: [PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ
 #2872]
To: Rafal Luzynski <digitalfreak@lingonborough.com>,
 libc-alpha@sourceware.org, libc-locales@sourceware.org,
 Marko Myllynen <myllynen@redhat.com>
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com>
 <20180412224352.GB2911@altlinux.org>
 <b82fe65b-b880-a2b5-c97d-2a6aae9c1165@kobylkin.com>
 <837001401.21346.1542406647888@poczta.nazwa.pl>
From: Egor Kobylkin <egor@kobylkin.com>
Openpgp: preference=signencrypt
Message-ID: <bef63562-09d1-3306-aae9-20002ccf4130@kobylkin.com>
Date: Sat, 17 Nov 2018 18:35:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101
 Thunderbird/60.2.1
MIME-Version: 1.0
In-Reply-To: <837001401.21346.1542406647888@poczta.nazwa.pl>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-SW-Source: 2018-q4/txt/msg00112.txt.bz2

Hi Rafal,
thanks for turning the SH/Sh problem into a clear issue statement. I'm
totally with you that it is worth discussing. It is orthogonal to the
tests, so let me focus on the SH/Sh and System A/B problems here.

It looks like we have three issues:
1. lack of explicit control over which transformation (System A or
System B) is used via //TRANSLIT
2. possible collisions in System B if CAP/low transcription is used
for capital letters
3. Cyrillic 'Х'/'х' (ha) never transcribes to 'H'/'h' as it should per
System B because its equivalent 'X'/'x' from System A is always present
and takes precedence.

As a solution, shouldn't we keep only System B, in a new file
transcribe_cyrillic, and put it in place as the explicit ASCII
transcription for the targeted locales (as opposed to transliteration)?

We would keep System A as translit_cyrillic but not include it in
this patch. Once you have resolved the issue of having two conflicting
rule sets but only one key (//TRANSLIT), you could add System A back.

The SH/Sh question can be decided either way; it seems like an easy
change anyway.

Please see more discussion on your excellent points below:

On 16.11.18 23:17, Rafal Luzynski wrote:

> Egor, while at this I was thinking about your idea to transliterate
> letters like "Ш" (uppercase) to "SH" (always uppercase) in order to
> distinguish between "Шема" (-> "SHema") and "Схема" (-> "Shema" or
> "Sxema").

To clarify, this SH/Sh collision issue relates only to iconv -f UTF-8 -t
ASCII//TRANSLIT (i.e. System B transcription).
But it's not only SH/Sh; the following combinations are used to
transcribe capital letters:

YO, DJ, YE, TSH, DH, ZH, CZ, CH, SH, SHH, YU, YA, FH, YH, GH, NG, TCZ

Arguably any of them (if not in that CAP/CAP form) could collide with
the CAP/low equivalent from a different word. (There may be language
grammar rules that in fact prevent some collisions, but we don't know
for sure.)

With transcription we are basically stripping information from the data,
mapping it into a smaller character set. The idea of keeping them in
CAP/CAP is to try to preserve as much information as possible.
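
To make the collision concrete, here is a minimal Python sketch. It is
not the patch and not glibc code: the table is a hypothetical subset
covering only the letters of the two example words, and 'х' -> 'h'
follows the System B description given in this thread.

```python
# Illustrative only: a tiny, hypothetical subset of a System B style table.
LOWER = {"с": "s", "х": "h", "ш": "sh", "е": "e", "м": "m", "а": "a"}

def transcribe(word, style):
    """Transcribe word; style is 'CAP/CAP' (Ш -> SH) or 'CAP/low' (Ш -> Sh)."""
    out = []
    for ch in word:
        latin = LOWER[ch.lower()]
        if ch.isupper():
            # CAP/CAP upper-cases the whole digraph, CAP/low only its first letter.
            latin = latin.upper() if style == "CAP/CAP" else latin.capitalize()
        out.append(latin)
    return "".join(out)

for style in ("CAP/low", "CAP/CAP"):
    print(style, transcribe("Шема", style), transcribe("Схема", style))
```

With CAP/low both words come out as "Shema"; with CAP/CAP the first one
becomes "SHema" and the two stay distinguishable.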


> Also you include a rule to transliterate "Х" to "H" or "X" depending
> on which destination characters are available, which I told you
> already that will not work because both "H" and "X" are always
> available and therefore only the first rule will always be used.

Just to have this here for reference: the idea was to have both rules in
one file, so that

    iconv -f UTF-8 -t ASCII//TRANSLIT

will produce an ASCII-compatible _transcription_ (System B), and

    iconv -f UTF-8 -t ISO-8859-15//TRANSLIT |
    iconv -f ISO-8859-15 -t UTF-8

will produce a Latin _transliteration_ as per ISO 9:1995 (System A).

So in fact we have two rules for each letter in the same file (System A
and System B), where System A takes precedence.
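
A simplified Python model of that selection, assuming the first
candidate that the target character set can encode is the one used. It
is a sketch of the behaviour discussed here, not glibc's
implementation; RULES and pick() are hypothetical names, and the
candidate order for "Х" is the one described in issue 3 above.

```python
# Simplified model: candidates are tried in order (System A first, then
# System B) and the first one representable in the target charset wins.
RULES = {
    "Ш": ["Š", "SH"],  # Š (U+0160) is in ISO-8859-15 but not in ASCII
    "Х": ["X", "H"],   # both candidates are plain ASCII
}

def pick(ch, charset):
    """Return the first candidate for ch that charset can encode."""
    for cand in RULES[ch]:
        try:
            cand.encode(charset)
            return cand
        except UnicodeEncodeError:
            continue
    return "?"

for charset in ("ascii", "iso8859-15"):
    print(charset, pick("Ш", charset), pick("Х", charset))
```

For "Ш" the target character set really does switch between the two
systems. For "Х" the first candidate is always encodable, so the second
one is never reached, whichever target is requested.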

I have a question then: isn't this more of a hack than the right thing
to do?

Shouldn't we have two explicit rules, one for transcription and one for
transliteration, that do not depend on the destination character set?


> I still don't like the idea to
> put two uppercase letters in a beginning of a word in titlecase only
> to indicate that there was originally a single letter.  What if we:
> 
> * drop the rule of transliterating "Х" to "H" and transliterate
> always to "X",
This would contradict ISO 9:1995 (System A).
System A was added at Marko's request (so I am adding him to To:). Just
to be clear, I am neutral on keeping it or dropping it.

> * transliterate uppercase "Ш" to "Sh" (so it will work fine for
> titlecase words)?
> 
> As a result the Latin letter "h" will only appear as part of a
> digraph and never as a transliteration of "Х" and therefore will
> never cause a conflict. Examples:
> 
> * "Шема" -> "Shema", * "Схема" -> "Sxema".
> 
> Will this solve the problem?
This particular rule with h/x would make sense on its own.
But again, it would contradict the standards.
On the other hand, for my personal needs I care less about standards
than about current functionality and about the data loss caused by
transcription being missing altogether, which is what BZ #2872 is about.

Bests,
Egor