From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-locales-return-6515-listarch-libc-locales=sources.redhat.com@sourceware.org>
Received: (qmail 13209 invoked by alias); 10 Dec 2018 21:20:42 -0000
Mailing-List: contact libc-locales-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-locales.sourceware.org>
List-Subscribe: <mailto:libc-locales-subscribe@sourceware.org>
List-Post: <mailto:libc-locales@sourceware.org>
List-Help: <mailto:libc-locales-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: libc-locales-owner@sourceware.org
Received: (qmail 13182 invoked by uid 89); 10 Dec 2018 21:20:41 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-1.9 required=5.0 tests=BAYES_00,RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.2 spammy=HX-Google-DKIM-Signature:reply-to, aside, article, theoretical
X-HELO: mail-wr1-f54.google.com
Return-Path: <myllynen@redhat.com>
Reply-To: Marko Myllynen <myllynen@redhat.com>
Subject: Re: [PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ
 #2872]
To: Rafal Luzynski <digitalfreak@lingonborough.com>,
 Egor Kobylkin <egor@kobylkin.com>, libc-alpha@sourceware.org,
 libc-locales@sourceware.org
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com>
 <20180412224352.GB2911@altlinux.org>
 <b82fe65b-b880-a2b5-c97d-2a6aae9c1165@kobylkin.com>
 <837001401.21346.1542406647888@poczta.nazwa.pl>
 <bef63562-09d1-3306-aae9-20002ccf4130@kobylkin.com>
 <5a247161-c498-ed50-ff4a-58f2ecf974f0@redhat.com>
 <1441622134.517912.1543702039942@poczta.nazwa.pl>
 <2f6fc82c-77ba-d331-ae5d-e2373e122a88@kobylkin.com>
 <1361059722.707244.1544231740358@poczta.nazwa.pl>
From: Marko Myllynen <myllynen@redhat.com>
Cc: Mike Fabian <mfabian@redhat.com>, Carlos O'Donell <carlos@redhat.com>
Message-ID: <d5cdbc81-8049-e1fa-56f6-047bd1d7eb28@redhat.com>
Date: Mon, 10 Dec 2018 21:20:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101
 Thunderbird/60.3.0
MIME-Version: 1.0
In-Reply-To: <1361059722.707244.1544231740358@poczta.nazwa.pl>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-SW-Source: 2018-q4/txt/msg00127.txt.bz2

Hi,

On 08/12/2018 03.15, Rafal Luzynski wrote:
> 17.11.2018 19:34 Egor Kobylkin <egor@kobylkin.com> wrote:
>>
>> The SH/Sh can be decided on either way - seems like an easy change any
>> way.
> 
> I'm in favor of "Sh" because it will work fine for titlecased words
> (where only the first letter is uppercase) but I'm aware it would be
> a problem for uppercased words.  Unfortunately, I think we are unable
> to satisfy both cases.

I think I'm in favor of "Sh" as well. Although it's not perfect, I'd
assume it's going to be correct in more cases than "SH".
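
To make the trade-off concrete, here is a minimal Python sketch. The
mini-table below is hypothetical and only covers the letters needed for
the example; it is not the table from the patch.

```python
# Hypothetical mini-table for illustration only (not the patch's table).
BASE = {
    "ш": "sh", "и": "i", "к": "k", "н": "n",
    "И": "I", "К": "K", "Н": "N",
}

def translit(text, upper_sh):
    """Transliterate text, mapping uppercase Ш to the given string."""
    table = {**BASE, "Ш": upper_sh}
    # Characters without a table entry are passed through unchanged.
    return "".join(table.get(ch, ch) for ch in text)

for rule in ("Sh", "SH"):
    print(rule, translit("Шишкин", rule), translit("ШИШКИН", rule))
```

With "Sh" the titlecased word comes out right and the uppercased one
gets mixed case; with "SH" it is the other way around. A context-free
per-character table cannot pick the right variant for both.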

>> System A was added on Marko's request (so setting him on TO:) I am
>> neutral on keeping it or dropping it, just to be clear.
> 
> I think I didn't see this Marko's request but I'm in favor of keeping
> System A, too.
> 
> Marko, it would be good to hear your opinion about System A vs. System B
> again.

I think System A is the better option, as it should be the same as ISO 9
and in some cases probably also produces more expected results than
System B (if the Wikipedia ISO 9 article is to be believed).

Wrt BZ #2872, I think it's good to keep it in mind, but IMHO we can also
deviate from it if needed. However, with System A + ASCII fallback
definitions the RFE should be satisfied as well, shouldn't it?

> 19.11.2018 20:35 Marko Myllynen <myllynen@redhat.com> wrote:
>> [...]
>> In any case once your patch lands I'm going to submit a follow-up patch
>> for fi_FI to make it compliant with the applicable national standard
>> (SFS 4900) which defines how to do Cyrillic transliteration /
>> transcription in the context of Finnish.
> 
> I totally agree.  As far as I can see, SFS 4900 is more similar to
> System A (ISO 9) rather than System B, that is, it transliterates to Latin
> characters with diacritics rather than plain ASCII.  Marko, what is your
> opinion about possible implementation of SFS 4900 in these cases:
> 
> * When the destination charset does not contain required Latin diacritic
>   characters (e.g., it is plain ASCII)?

This would be according to http://jkorpela.fi/iso9.html8 so, for example,
zh instead of ž and shtsh instead of štš.
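
As a rough sketch of that two-stage idea in Python: the first-stage
mappings (ж -> ž, ш -> š, щ -> štš) are my assumption based on the
examples above, and the names STAGE1, FALLBACK and to_charset are made up
for illustration. This is not how glibc locale sources express it.

```python
# Assumed first-stage mappings, taken from the examples above.
STAGE1 = {"ж": "ž", "ш": "š", "щ": "štš"}
# ASCII fallbacks for the diacritic letters, as in the examples above.
FALLBACK = {"ž": "zh", "š": "sh"}

def to_charset(text, encoding):
    """Transliterate, falling back when the charset lacks a character."""
    out = []
    for ch in text:
        for c in STAGE1.get(ch, ch):
            try:
                c.encode(encoding)      # representable in the target?
                out.append(c)
            except UnicodeEncodeError:
                # "?" marks characters with no fallback in this sketch.
                out.append(FALLBACK.get(c, "?"))
    return "".join(out)

print(to_charset("ж щ", "utf-8"))
print(to_charset("ж щ", "ascii"))
```

The diacritic form is kept whenever the destination charset can encode
it, and the ASCII digraphs are used only as a fallback.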

> * When the output is ambiguous, that means, when two different Cyrillic
>   strings produce the same Latin (or ASCII) output?

This is a good point and one I hadn't considered, but I'm not sure there
is anything we can do about it (at least without major work on the locale
system internals). Do you have any rough idea how frequently this
could happen or is this more a theoretical issue? (Sorry if I've missed
earlier comments about this, it's been a long thread.)
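
One way to get a rough number would be to run a word list through the
table and group the inputs by their output. A minimal Python sketch
follows; the table is hypothetical (in particular х -> h is only an
assumption chosen to provoke a collision) and the input list is a
placeholder for a real word list.

```python
from collections import defaultdict

# Hypothetical table: х -> h is assumed here only to show a collision.
TABLE = {"с": "s", "х": "h", "ш": "sh"}

def translit(word):
    """Map each character through TABLE, passing unknown ones through."""
    return "".join(TABLE.get(ch, ch) for ch in word)

def collisions(words):
    """Return {output: [inputs]} for outputs produced by 2+ inputs."""
    groups = defaultdict(set)
    for w in words:
        groups[translit(w)].add(w)
    return {out: sorted(src) for out, src in groups.items() if len(src) > 1}

print(collisions(["сх", "ш", "с"]))
```

Any non-empty result means the output cannot be mapped back to a unique
Cyrillic string for those inputs.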

>> The same with having both System A and System B.  Initially I went along
>> with the suggestion to include the system A but it is clear now that it
>> doesn't make fixing [BZ #2872] more straightforward. So I'd also propose
>> to set it aside for the moment and use the v10 without the system A.
>> That is the whole reason I have submitted it, to be superclear on that.
> 
> OK, I think that now I understand your reason to drop System A better.
> But still I'd like to rethink implementing System A somehow and drop
> (or rather: implement only partially) System B.

Yes, I also think System A (AKA ISO 9) would be the better choice, but I'll
leave the final decision to you two (and others who might weigh in).

Thanks,

-- 
Marko Myllynen