From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-locales-return-6418-listarch-libc-locales=sources.redhat.com@sourceware.org>
Received: (qmail 21301 invoked by alias); 9 Oct 2018 16:49:20 -0000
Mailing-List: contact libc-locales-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-locales.sourceware.org>
List-Subscribe: <mailto:libc-locales-subscribe@sourceware.org>
List-Post: <mailto:libc-locales@sourceware.org>
List-Help: <mailto:libc-locales-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: libc-locales-owner@sourceware.org
Received: (qmail 21274 invoked by uid 89); 9 Oct 2018 16:49:20 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: =?ISO-8859-1?Q?Yes, score=5.8 required=5.0 tests=AWL,BAYES_50,BODY_8BITS,GARBLED_BODY,KAM_LAZY_DOMAIN_SECURITY,KAM_NUMSUBJECT,RCVD_IN_DNSWL_NONE autolearn=no version=3.3.2 spammy==d0=be=d1, =d0=b8=d1, 8:=d0=b5, 8:=d0=be?=
X-HELO: mail-wm1-f66.google.com
Return-Path: <myllynen@redhat.com>
Reply-To: Marko Myllynen <myllynen@redhat.com>
Subject: Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ
 #2872] re-submission for 2.29
To: Egor Kobylkin <egor@kobylkin.com>,
 Rafal Luzynski <digitalfreak@lingonborough.com>,
 Keld Simonsen <keld@keldix.com>
Cc: libc-alpha@sourceware.org, libc-locales@sourceware.org,
 "Dmitry V. Levin" <ldv@altlinux.org>, Volodymyr Lisivka
 <vlisivka@gmail.com>, Carlos O'Donell <carlos@redhat.com>,
 Max Kutny <mkutny@gmail.com>, danilo@gnome.org
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com>
 <20180412224352.GB2911@altlinux.org>
 <16e785f3-2e9f-ceb2-698f-dc33c91a5d5e@kobylkin.com>
 <ac4c9b3e-aeae-30de-23ef-24d8f53d7bc4@kobylkin.com>
 <20181003091949.GA21486@rap.rap.dk>
 <21d872b2-613e-d1f5-26c0-baa4b5721df9@kobylkin.com>
 <1485772360.805333.1538731225156@poczta.nazwa.pl>
 <19e29568-e710-535f-4f90-98dbcec930ed@kobylkin.com>
 <1028447684.826961.1539036295224@poczta.nazwa.pl>
 <63fb4fae-a93b-7aff-13df-4452cbc8853f@redhat.com>
 <18f97c1f-3da2-809d-14bb-6e6d677b27eb@kobylkin.com>
From: Marko Myllynen <myllynen@redhat.com>
Message-ID: <8bfe3169-55c9-af90-91cb-fe0f3ecccfb6@redhat.com>
Date: Tue, 09 Oct 2018 16:49:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.9.1
MIME-Version: 1.0
In-Reply-To: <18f97c1f-3da2-809d-14bb-6e6d677b27eb@kobylkin.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-SW-Source: 2018-q4/txt/msg00030.txt.bz2

Hi,

To clarify, the page has a section explaining the differences between
transliteration and transcription and how the terminology is not
entirely unambiguous. It also explains that the national standard SFS
4900 overrides ISO 9, so ISO 9 can't be used as-is in a Finnish context.

Thanks,

On 2018-10-09 19:22, Egor Kobylkin wrote:
> In the hope of being helpful: what you describe below from
> https://fi.wikipedia.org/wiki/Siirtokirjoitus is called _transcription_,
> not transliteration.
> 
> Transliteration is what we have done with ISO 9 or GOST 7.79 System A
> and it could be the same for all languages indeed.
> 
> The transcription can be phonetic or serve other purposes and depends on
> the target language or use case. We have used the GOST 7.79 System B.
> 
> Egor
> 
> On 09.10.2018 18:10, Marko Myllynen wrote:
>> Hi,
>>
>> On 2018-10-09 01:04, Rafal Luzynski wrote:
>>>
>>> Particularly, I think that those rules will not be helpful at all for
>>> the languages which use neither Latin nor Cyrillic alphabet.
>>
>> This is certainly a very good point.
>>
>>> If you refer to languages other than Russian which also use the Cyrillic
>>> alphabet but need different transliteration rules than Russian for
>>> the same characters, then it is OK for me now.  I am afraid that the iconv
>>> algorithm does not handle such a case.  Of course, we should add this missing
>>> feature eventually but I do not volunteer to do it now.
>>
>> Yes, this would be needed for correct transliteration of different
>> languages, and this might be quite a bit of work. There's also the case
>> of transliteration and character sets; consider the transliteration
>> examples from https://fi.wikipedia.org/wiki/Siirtokirjoitus:
>>
>> Russian:        Борис Николаевич Ельцин
>> Int'l:          Boris Nikolaevič Elʹcin
>> Finnish:        Boris Nikolajevitš Jeltsin
>> French:         Boris Nikolaïevitch Ieltsine
>> Phonetic (IPA): [bɐˈrʲis nʲɪkɐˈlaɪvʲɪtɕ ˈjelʲtsɨn]
>>
>> For French you'll get the correct transliteration with iconv by using -t
>> ISO-8859-1//TRANSLIT, for Finnish with -t ISO-8859-15//TRANSLIT, but it's
>> not so obvious how to get the above kind of transliteration for ISO 9
>> international or especially for the phonetic case.
>>
>> One thing that might be helpful here could be something like:
>>
>> $ echo ж | LC_ALL=fi_FI.UTF-8 iconv -f UTF-8 -t UTF-8//TRANSLIT_FORCE
>> ž
>>
>> That is, force transliteration of each character (if defined) even if
>> it's part of the target character set. AFAICS this is not currently
>> possible.
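
To make the difference concrete, here is a rough Python sketch of the
two behaviours. It is only an illustration, not how iconv is implemented:
the `force` flag stands in for the hypothetical TRANSLIT_FORCE suffix,
the `translit` function and `RULES` table are made-up names, and the
single rule is taken from the example above.

```python
# Hypothetical one-entry rule table; the mapping comes from the example above.
RULES = {"\u0436": "\u017e"}  # CYRILLIC SMALL LETTER ZHE -> LATIN SMALL LETTER Z WITH CARON


def translit(text, charset, force=False):
    """Apply RULES to characters the target charset cannot represent.

    With force=True the rule is applied even when the character is
    representable, which is the behaviour proposed above. Characters
    without a rule are passed through unchanged in this sketch.
    """
    out = []
    for ch in text:
        try:
            ch.encode(charset)
            representable = True
        except UnicodeEncodeError:
            representable = False
        if ch in RULES and (force or not representable):
            out.append(RULES[ch])
        else:
            out.append(ch)
    return "".join(out)


print(translit("\u0436", "utf-8"))              # unchanged: UTF-8 can represent it
print(translit("\u0436", "utf-8", force=True))  # rule applied anyway
```

With a UTF-8 target the first call leaves the character alone, which
matches what //TRANSLIT does today; only the forced call yields the
Latin letter.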
>>
>>> But, while at this, is there anything that stops us from adding transliteration
>>> rules for additional Cyrillic characters not used in Russian but used in
>>> other languages?
>>
>> This would probably make sense.
>>
>> FWIW, for Finnish the diff for Russian to be applied in the locale on
>> top of the translit_cyrillic (ISO 9) rules would be something like the
>> one below. I still need to check whether rules for languages other than
>> Russian should be added as well (I hope to submit a proper patch
>> against fi_FI shortly after translit_cyrillic has landed):
>>
>> <U0446> "<U0074><U0073>"
>> <U0447> "<U0074><U0161>";"<U0074><U0073><U0068>"
>> <U0448> "<U0161>";"<U0073><U0068>"
>> <U0449> "<U0161><U0074><U0161>";"<U0073><U0068><U0074><U0073><U0068>"
>> <U044A> ""
>> <U044C> ""
>> <U044D> "<U0065>"
>> <U044E> "<U006A><U0075>"
>> <U044F> "<U006A><U0061>"
>> <U0451> "<U006A><U006F>"
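
As a rough illustration of how the ";"-separated alternatives above are
meant to work, here is a small Python sketch. It assumes the alternatives
are tried in order and the first one representable in the target
character set is used; the function names and the "?" fallback are my
own choices for the sketch, not glibc code.

```python
# Rules copied from the diff above; "\u0161" is LATIN SMALL LETTER S WITH CARON.
FI_RULES = {
    "\u0446": ["ts"],
    "\u0447": ["t\u0161", "tsh"],
    "\u0448": ["\u0161", "sh"],
    "\u0449": ["\u0161t\u0161", "shtsh"],
    "\u044a": [""],
    "\u044c": [""],
    "\u044d": ["e"],
    "\u044e": ["ju"],
    "\u044f": ["ja"],
    "\u0451": ["jo"],
}


def encodable(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def translit(text, charset, rules=FI_RULES):
    """Replace unrepresentable characters by the first usable alternative."""
    out = []
    for ch in text:
        if encodable(ch, charset):
            out.append(ch)
            continue
        for alt in rules.get(ch, []):
            if encodable(alt, charset):
                out.append(alt)
                break
        else:
            # No rule, or no alternative fits the target charset.
            out.append("?")
    return "".join(out)


print(translit("\u0449", "iso8859_15"))  # first alternative, using s with caron
print(translit("\u0449", "latin-1"))     # falls back to "shtsh"
```

ISO-8859-15 contains the s with caron, so the first alternative is picked
there, while for ISO-8859-1 or ASCII the plain "sh" forms are used.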
>>
>> Thanks,
>>
> 


-- 
Marko Myllynen