public inbox for libc-locales@sourceware.org
* [PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
@ 2019-06-05  6:47 Diego (Egor) Kobylkin
  2019-06-05 23:51 ` Rafal Luzynski
  0 siblings, 1 reply; 13+ messages in thread
From: Diego (Egor) Kobylkin @ 2019-06-05  6:47 UTC (permalink / raw)
  To: Marko Myllynen, Carlos O'Donell, libc-alpha, libc-locales,
	Siddhesh Poyarekar, Rafal Luzynski
  Cc: Mike Fabian


[-- Attachment #1.1: Type: text/plain, Size: 1246 bytes --]



ping

Egor Kobylkin

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, May 28, 2019 9:46 AM, Diego (Egor) Kobylkin <egor@kobylkin.com> wrote:

> Ping. Keeping up the good tradition of pings.
> Egor Kobylkin 
> 

> On Fri, May 10, 2019 at 14:19, Marko Myllynen <myllynen@redhat.com> wrote:
> 

> > Hi Carlos,
> > 

> > On 16/04/2019 22.06, Carlos O'Donell wrote:
> > > On 4/16/19 2:41 PM, Egor Kobylkin wrote:
> > >> It is exactly the reason we had 12 iterations on this patch - we
> > >> wanted to cover the most complete yet workable standard for the
> > >> table. What we reference in the bug memo is the actual accepted
> > >> standard. It is coalesced with the extended standard for further
> > >> outdated cyrillic letters.
> > >
> > > I agree, and this is what makes review complicated and time
> > > consuming. I'm relying on you as the expert, and my goal is only
> > > to spot check for any inconsistencies.
> > 

> > I know you've been very busy with everything else, but did you happen to
> > have any chance to check this further? Shall we still wait for your
> > results, or how would you suggest we proceed?
> > 

> > Thanks,
> > 

> > --
> > Marko Myllynen



* Re: [PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
  2019-06-05  6:47 [PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872] Diego (Egor) Kobylkin
@ 2019-06-05 23:51 ` Rafal Luzynski
  2019-06-06  9:42   ` Marko Myllynen
  0 siblings, 1 reply; 13+ messages in thread
From: Rafal Luzynski @ 2019-06-05 23:51 UTC (permalink / raw)
  To: Diego (Egor) Kobylkin, Marko Myllynen, Carlos O'Donell,
	libc-alpha, libc-locales, Siddhesh Poyarekar
  Cc: Mike Fabian

5.06.2019 08:47 "Diego (Egor) Kobylkin" <egor@kobylkin.com> wrote:
> 
> ping
> 
> Egor Kobylkin

I second these pings.  Marko, Carlos, Siddhesh, Mike, is there anything
else I can do here?

Since the questions may sound overwhelming, I'd like to focus on
a single issue:

How should we handle the upper/lower case when a single Cyrillic letter
is transliterated to a Latin digraph (trigraph, etc.)?

Possible answers (Cyrillic -> Latin Extended -> ASCII):

1. "Ш" -> "Š" -> "SH"

   e.g.: "Шема" -> "Šema" -> "SHema"
         "Схема" ----------> "Shema"

2. "Ш" -> "Š" -> "Sh"

   e.g.: "Шема" -> "Šema" -> "Shema"
         "Схема" ----------> "Shema"

Personally I don't like the answer 1. because "SHema" looks weird
to me.  Egor in turn does not like the answer 2. because the output
string becomes ambiguous.

Should we maybe have a smart algorithm which would select the title
case or the upper case of the output characters depending on the
context in the word?  Note that it would not resolve the problem of
the output text being ambiguous.
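
To make the two options concrete, here is a minimal Python sketch. It
assumes a tiny hypothetical excerpt of a character table (not the table
from the patch) and a per-character replacement with no look at the
neighbouring characters. The names BASE, transliterate and digraph_upper
are invented for this illustration.

```python
# Hypothetical excerpt of a Cyrillic -> ASCII table; NOT the table from the patch.
BASE = {
    "С": "S",
    "х": "h",  # ISO 9 style fallback assumed here: х -> h
    "е": "e",
    "м": "m",
    "а": "a",
}


def transliterate(text, digraph_upper):
    """Replace each character independently of its neighbours."""
    table = dict(BASE)
    # Option 1 maps capital Ш to "SH", option 2 maps it to "Sh".
    table["Ш"] = "SH" if digraph_upper else "Sh"
    # Characters missing from the table are passed through unchanged.
    return "".join(table.get(ch, ch) for ch in text)


for word in ("Шема", "Схема"):
    print(word, transliterate(word, True), transliterate(word, False))
# Шема SHema Shema
# Схема Shema Shema
```

Under option 2 both inputs end up as the same ASCII string, which is the
ambiguity described above.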

Regards,

Rafal


* Re: [PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
  2019-06-05 23:51 ` Rafal Luzynski
@ 2019-06-06  9:42   ` Marko Myllynen
  2019-06-06 21:31     ` Rafal Luzynski
  0 siblings, 1 reply; 13+ messages in thread
From: Marko Myllynen @ 2019-06-06  9:42 UTC (permalink / raw)
  To: Rafal Luzynski, Diego (Egor) Kobylkin, Carlos O'Donell,
	libc-alpha, libc-locales, Siddhesh Poyarekar
  Cc: Mike Fabian

Hi,

On 06/06/2019 02.49, Rafal Luzynski wrote:
> 5.06.2019 08:47 "Diego (Egor) Kobylkin" <egor@kobylkin.com> wrote:
>>
>> ping
> 
> I second these pings.  Marko, Carlos, Siddhesh, Mike, is there anything
> else I can do here?

My understanding of the overall situation here is that for 2.30 we try
to have Cyrillic->ASCII transliteration added into the built-in C locale
and after that we would discuss more about translit rules used by other
locales, and that this C locale patch is pending on Carlos to complete
his verification efforts.

Does the above sound correct to you?

> Since the questions may sound overwhelming, I'd like to focus on
> a single issue:

Yes, the subject becomes overwhelming when considering everything related
at the same time, so this sounds like a good approach. However, are there
other notable open questions left around the C locale rules in addition
to this? (Ignoring the more generic translit rules or other locales for
the time being.)

> How should we handle the upper/lower case when a single Cyrillic letter
> is transliterated to a Latin digraph (trigraph, etc.)?
> 
> Possible answers (Cyrillic -> Latin Extended -> ASCII):
> 
> 1. "Ш" -> "Š" -> "SH"
> 
>    e.g.: "Шема" -> "Šema" -> "SHema"
>          "Схема" ----------> "Shema"
> 
> 2. "Ш" -> "Š" -> "Sh"
> 
>    e.g.: "Шема" -> "Šema" -> "Shema"
>          "Схема" ----------> "Shema"
> 
> Personally I don't like the answer 1. because "SHema" looks weird
> to me.  Egor in turn does not like the answer 2. because the output
> string becomes ambiguous.
> 
> Should we maybe have a smart algorithm which would select the title
> case or the upper case of the output characters depending on the
> context in the word?  Note that it would not resolve the problem of
> the output text being ambiguous.

It seems clear that there is no single right or wrong answer; it is a matter
of preference, especially given the way this currently works. It might be an
improvement to output (for instance) SH instead of Sh if all the other
letters of a word are upper-case as well, but I am not sure what would help
make the result unambiguous.
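
A context-sensitive rule of this kind could be sketched in Python as
follows. This is only an illustration under stated assumptions: the table
is a hypothetical excerpt, the rule ("title-case a digraph only when a
lower-case letter follows") is one possible choice, and nothing here
reflects how glibc or any other implementation actually works.

```python
# Hypothetical excerpt, capital letters only; lower case is derived below.
UPPER = {"Ш": "SH", "С": "S", "Х": "H", "Е": "E", "М": "M", "А": "A"}


def smart_translit(text):
    out = []
    for i, ch in enumerate(text):
        if ch in UPPER:
            latin = UPPER[ch]
            nxt = text[i + 1] if i + 1 < len(text) else ""
            # Title-case a digraph only when a lower-case letter follows.
            if len(latin) > 1 and nxt.islower():
                latin = latin.capitalize()
            out.append(latin)
        elif ch.upper() in UPPER:
            # Lower-case input letter: lower-case the whole replacement.
            out.append(UPPER[ch.upper()].lower())
        else:
            out.append(ch)
    return "".join(out)


print(smart_translit("ШЕМА"))   # SHEMA
print(smart_translit("Шема"))   # Shema
print(smart_translit("Схема"))  # Shema
```

The last two lines show that such a rule improves the look of the output
but leaves the two different inputs indistinguishable.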

Thanks,

-- 
Marko Myllynen


* Re: [PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
  2019-06-06  9:42   ` Marko Myllynen
@ 2019-06-06 21:31     ` Rafal Luzynski
  2019-06-07  0:58       ` Carlos O'Donell
  0 siblings, 1 reply; 13+ messages in thread
From: Rafal Luzynski @ 2019-06-06 21:31 UTC (permalink / raw)
  To: Marko Myllynen, Diego (Egor) Kobylkin, Carlos O'Donell,
	libc-alpha, libc-locales, Siddhesh Poyarekar
  Cc: Mike Fabian

6.06.2019 11:42 Marko Myllynen <myllynen@redhat.com> wrote:
> 
> 
> Hi,
> 
> On 06/06/2019 02.49, Rafal Luzynski wrote:
> > 5.06.2019 08:47 "Diego (Egor) Kobylkin" <egor@kobylkin.com> wrote:
> >>
> >> ping
> > 
> > I second these pings.  Marko, Carlos, Siddhesh, Mike, is there anything
> > else I can do here?
> [...]
> My understanding of the overall situation here is that for 2.30 we try
> to have Cyrillic->ASCII transliteration added into the built-in C locale

Even if we have doubts and questions about transliteration of some
characters?  The question I asked applies to Cyrillic->ASCII.  On the other
hand, it does not apply to Cyrillic->Latin Extended (ISO 9) because it
strictly sticks to the "one-letter-to-one-letter" rule.

Maybe we should implement ISO 9 [1] without a fallback first because I am
pretty sure we are able to do it easily?

> and after that we would discuss more about translit rules used by other
> locales, and that this C locale patch is pending on Carlos to complete
> his verification efforts.
> 
> Does the above sound correct to you?

Not really, unless we agree that we want to push the transliteration
even if it is not perfect and does not work well for everyone, and that
we are going to fix it later.

> [...] however are there
> other notable open questions left around the C locale rules in addition
> to this? (Ignoring the more generic translit rules or other locales for
> the time being.)

Yes, for example letters like the soft and hard signs, which have uppercase
and lowercase variants and are transliterated to diacritical signs, but
I'd like to focus on this one first.

> > How should we handle the upper/lower case when a single Cyrillic letter
> > is transliterated to a Latin digraph (trigraph, etc.)?
> > 
> > Possible answers (Cyrillic -> Latin Extended -> ASCII):
> > 
> > 1. "Ш" -> "Š" -> "SH"
> > 
> >    e.g.: "Шема" -> "Šema" -> "SHema"
> >          "Схема" ----------> "Shema"
> > 
> > 2. "Ш" -> "Š" -> "Sh"
> > 
> >    e.g.: "Шема" -> "Šema" -> "Shema"
> >          "Схема" ----------> "Shema"
> > 
> > Personally I don't like the answer 1. because "SHema" looks weird
> > to me.  Egor in turn does not like the answer 2. because the output
> > string becomes ambiguous.
> > 
> > Should we maybe have a smart algorithm which would select the title
> > case or the upper case of the output characters depending on the
> > context in the word?  Note that it would not resolve the problem of
> > the output text being ambiguous.
> 
> It seems clear that there is no one right/wrong answer but it's a matter
> of preference, especially the way this currently works. It might be an
> improvement to output (for instance) SH instead of Sh if all the other
> letters of a word are upper-case as well but not sure what would help
> with the result being unambiguous.

I think you refer to the idea of implementing a smart algorithm which would
adapt the lower/upper case depending on the context but indeed it would
not resolve the problem of ambiguity.

So, the smart algorithm aside, what should be the preferred transliteration
rule?

Regards,

Rafal

[1] https://en.wikipedia.org/wiki/ISO_9


* Re: [PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
  2019-06-06 21:31     ` Rafal Luzynski
@ 2019-06-07  0:58       ` Carlos O'Donell
  2019-06-07  9:46         ` Diego (Egor) Kobylkin
  2019-06-07 10:51         ` Rafal Luzynski
  0 siblings, 2 replies; 13+ messages in thread
From: Carlos O'Donell @ 2019-06-07  0:58 UTC (permalink / raw)
  To: Rafal Luzynski, Marko Myllynen, Diego (Egor) Kobylkin,
	libc-alpha, libc-locales, Siddhesh Poyarekar
  Cc: Mike Fabian

On 6/6/19 5:31 PM, Rafal Luzynski wrote:
>>> Possible answers (Cyrillic -> Latin Extended -> ASCII):
>>>
>>> 1. "Ш" -> "Š" -> "SH"
>>>
>>>     e.g.: "Шема" -> "Šema" -> "SHema"
>>>           "Схема" ----------> "Shema"
>>>
>>> 2. "Ш" -> "Š" -> "Sh"
>>>
>>>     e.g.: "Шема" -> "Šema" -> "Shema"
>>>           "Схема" ----------> "Shema"
>>>
>>> Personally I don't like the answer 1. because "SHema" looks weird
>>> to me.  Egor in turn does not like the answer 2. because the output
>>> string becomes ambiguous.
>>>
>>> Should we maybe have a smart algorithm which would select the title
>>> case or the upper case of the output characters depending on the
>>> context in the word?  Note that it would not resolve the problem of
>>> the output text being ambiguous.
>>
>> It seems clear that there is no one right/wrong answer but it's a matter
>> of preference, especially the way this currently works. It might be an
>> improvement to output (for instance) SH instead of Sh if all the other
>> letters of a word are upper-case as well but not sure what would help
>> with the result being unambiguous.
> 
> I think you refer to the idea of implementing a smart algorithm which would
> adapt the lower/upper case depending on the context but indeed it would
> not resolve the problem of ambiguity.
> 
> So, the smart algorithm aside, what should be the preferred transliteration
> rule?

I have a weak preference for 1. However, I would change my preference if
someone showed me existing prior implementations that did 1 or 2.

-- 
Cheers,
Carlos.


* Re: [PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
  2019-06-07  0:58       ` Carlos O'Donell
@ 2019-06-07  9:46         ` Diego (Egor) Kobylkin
  2019-06-07 11:11           ` Rafal Luzynski
  2019-06-07 10:51         ` Rafal Luzynski
  1 sibling, 1 reply; 13+ messages in thread
From: Diego (Egor) Kobylkin @ 2019-06-07  9:46 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Rafal Luzynski, Marko Myllynen, libc-alpha, libc-locales,
	Siddhesh Poyarekar, Mike Fabian


[-- Attachment #1.1: Type: text/plain, Size: 4888 bytes --]

Hi Carlos et al. 


On Friday, June 7, 2019 2:57 AM, Carlos O'Donell <codonell@redhat.com> wrote:
> I have a weak preference for 1. However, I would change my preference if
> someone showed me existing prior implementations that did 1 or 2.

1. glibc already transliterates letters and ligatures capitalized in locale/C-translit.h.in:
"\x00c6"	"AE"	# <U00C6> LATIN CAPITAL LETTER AE
"\x0132"	"IJ"	# <U0132> LATIN CAPITAL LIGATURE IJ

2. I would just like to quote myself from 2018: 


"collisions due to "one symbol capitalization" would cause irreversible damage to data. 

For a library like glibc this seems to be a very relevant issue to consider..."

On 03.11.18 00:27, Egor Kobylkin wrote:
> On 02.11.18 23:22, Rafal Luzynski wrote:
>>> * Consistently transliterate single uppercase Cyrillic letters to
>>> sequences of all uppercase Latin letters in all languages
>>> (whenever a Cyrillic letter is transliterated to more than one
>>> Latin letter), for example "Ї" is now transliterated as "YI" rather
>>> than "Yi".
>> I think you have not yet explained whether this is required by any
>> existing standard (please provide links) or whether this is your
>> genuine idea to distinguish between the cases like "Ш" transliterated
>> to "Sh" and "Сх" also transliterated to "Sh".
> 

> I remember seeing this form of capitalization in actual
> transliterated texts a long time ago but can't find a formal description
> as of now. I just don't want to claim this to be my original idea.
> 

>>> The choice for YO, SH, YA, ZH etc. is to avoid naming collisions for
>>> example for "Сх" and "Ш" that would both transliterate to Sh:
>>> With SH:"Схема"->"Shema" but "Шема"->"SHema"
>>> With Sh:"Схема"->"Shema" and "Шема"->"Shema". Collision!
>>> This is important e.g. for renaming files, grouping as in using uniq, etc.
> As for the users - I am a user and I have demonstrated the use cases
> where the collisions due to "one symbol capitalization" would cause
> irreversible damage to data. For a library like glibc this seems like a
> relevant issue to consider.
> 

> The "two symbol capitalization" on the other hand would prevent
> collision and can be easily corrected in the userspace if needed
> with something like
> 

> foo="SHema"
> foo="${foo:0:1}$(tr '[:upper:]' '[:lower:]' <<<${foo:1})"
> echo "$foo"
> Shema
> 

> It looks like everyone really using transliteration for something
> sensitive has already done it in userspace since at least 2006, when
> this bug was first logged. So we won't break the official use cases
> where the capitalization should be done in a certain way. But we will
> indeed prevent new bugs due to collisions if we use "two symbol
> capitalization".
> 

> Happy to hear arguments to the contrary.


Bests,
Diego

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, June 7, 2019 2:57 AM, Carlos O'Donell <codonell@redhat.com> wrote:

> On 6/6/19 5:31 PM, Rafal Luzynski wrote:
> 

> > > > Possible answers (Cyrillic -> Latin Extended -> ASCII):
> > > > 

> > > > 1.  "Ш" -> "Š" -> "SH"
> > > >     e.g.: "Шема" -> "Šema" -> "SHema"
> > > >     "Схема" ----------> "Shema"
> > > >     

> > > > 2.  "Ш" -> "Š" -> "Sh"
> > > >     e.g.: "Шема" -> "Šema" -> "Shema"
> > > >     "Схема" ----------> "Shema"
> > > >     

> > > > 

> > > > Personally I don't like the answer 1. because "SHema" looks weird
> > > > to me. Egor in turn does not like the answer 2. because the output
> > > > string becomes ambiguous.
> > > > Should we maybe have a smart algorithm which would select the title
> > > > case or the upper case of the output characters depending on the
> > > > context in the word? Note that it would not resolve the problem of
> > > > the output text being ambiguous.
> > > 

> > > It seems clear that there is no one right/wrong answer but it's a matter
> > > of preference, especially the way this currently works. It might be an
> > > improvement to output (for instance) SH instead of Sh if all the other
> > > letters of a word are upper-case as well but not sure what would help
> > > with the result being unambiguous.
> > 

> > I think you refer to the idea of implementing a smart algorithm which would
> > adapt the lower/upper case depending on the context but indeed it would
> > not resolve the problem of ambiguity.
> > So, the smart algorithm aside, what should be the preferred transliteration
> > rule?
> 

> I have a weak preference for 1. However, I would change my preference if
> someone showed me existing prior implementations that did 1 or 2.
> 

> --
> 

> Cheers,
> Carlos.



* Re: [PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
  2019-06-07  0:58       ` Carlos O'Donell
  2019-06-07  9:46         ` Diego (Egor) Kobylkin
@ 2019-06-07 10:51         ` Rafal Luzynski
  2019-06-07 12:36           ` Carlos O'Donell
  1 sibling, 1 reply; 13+ messages in thread
From: Rafal Luzynski @ 2019-06-07 10:51 UTC (permalink / raw)
  To: Carlos O'Donell, Marko Myllynen, Diego (Egor) Kobylkin,
	libc-alpha, libc-locales, Siddhesh Poyarekar
  Cc: Mike Fabian

7.06.2019 02:57 Carlos O'Donell <codonell@redhat.com> wrote:
> 
> On 6/6/19 5:31 PM, Rafal Luzynski wrote:
> >>> Possible answers (Cyrillic -> Latin Extended -> ASCII):
> >>>
> >>> 1. "Ш" -> "Š" -> "SH"
> >>>
> >>>     e.g.: "Шема" -> "Šema" -> "SHema"
> >>>           "Схема" ----------> "Shema"
> >>>
> >>> 2. "Ш" -> "Š" -> "Sh"
> >>>
> >>>     e.g.: "Шема" -> "Šema" -> "Shema"
> >>>           "Схема" ----------> "Shema"
> >>>
> >>> Personally I don't like the answer 1. because "SHema" looks weird
> >>> to me.  Egor in turn does not like the answer 2. because the output
> >>> string becomes ambiguous.
> >>>
> >>> Should we maybe have a smart algorithm which would select the title
> >>> case or the upper case of the output characters depending on the
> >>> context in the word?  Note that it would not resolve the problem of
> >>> the output text being ambiguous.
> >>
> >> It seems clear that there is no one right/wrong answer but it's a
> >> matter
> >> of preference, especially the way this currently works. It might be an
> >> improvement to output (for instance) SH instead of Sh if all the other
> >> letters of a word are upper-case as well but not sure what would help
> >> with the result being unambiguous.
> > 
> > I think you refer to the idea of implementing a smart algorithm which
> > would
> > adapt the lower/upper case depending on the context but indeed it would
> > not resolve the problem of ambiguity.
> > 
> > So, the smart algorithm aside, what should be the preferred
> > transliteration
> > rule?
> 
> I have a weak preference for 1. However, I would change my preference if
> someone showed me existing prior implementations that did 1 or 2.

uconv implements a smart algorithm to adjust the upper/lower case:

==================================================================
$ echo "Схема" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
Skhema

$ echo "Шема" | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
Shema

$ echo "ШЕМА" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
SHEMA

$ echo "ШЕма" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
SHEma

$ echo "Ш Ема" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
SH Yema
==================================================================

Also for them it is easier because they decided that "Х" should be
transliterated to "KH" (I think this is the common thing when
transliterating to English) while ISO 9 says it should be transliterated
to "H" and GOST says it should be "X".  We can't implement this
fallback in glibc because the glibc algorithm is very simple.
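
The effect of the choice for "Х" can be sketched in a few lines of
Python. The three replacements (KH, H, X) are the ones named above; the
rest of the table, the fixed "Ш" -> "Sh" entry, and the names COMMON, KHA
and translit are assumptions made only for this illustration, not the
actual rule sets of uconv, ISO 9 or GOST.

```python
# Hypothetical excerpt shared by all three schemes in this sketch.
COMMON = {"Ш": "Sh", "С": "S", "е": "e", "м": "m", "а": "a"}

# Replacement for lower-case х per scheme, as named in the message above.
KHA = {"BGN": "kh", "ISO 9": "h", "GOST": "x"}


def translit(word, scheme):
    table = dict(COMMON)
    table["х"] = KHA[scheme]
    # Characters missing from the table are passed through unchanged.
    return "".join(table.get(ch, ch) for ch in word)


for scheme in KHA:
    print(scheme, translit("Шема", scheme), translit("Схема", scheme))
# BGN Shema Skhema
# ISO 9 Shema Shema
# GOST Shema Sxema
```

With "kh" or "x" the sequence "Сх" can no longer be confused with the
digraph produced for "Ш"; with "h" it can, which is the case discussed
in this thread.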

Regards,

Rafal


* Re: [PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
  2019-06-07  9:46         ` Diego (Egor) Kobylkin
@ 2019-06-07 11:11           ` Rafal Luzynski
  0 siblings, 0 replies; 13+ messages in thread
From: Rafal Luzynski @ 2019-06-07 11:11 UTC (permalink / raw)
  To: Diego (Egor) Kobylkin, Carlos O'Donell
  Cc: Marko Myllynen, libc-alpha, libc-locales, Siddhesh Poyarekar,
	Mike Fabian

7.06.2019 11:46 "Diego (Egor) Kobylkin" <egor@kobylkin.com> wrote:
> 
> Hi Carlos et al. 
> 
> On Friday, June 7, 2019 2:57 AM, Carlos O'Donell <codonell@redhat.com>
> wrote:
> > I have a weak preference for 1. However, I would change my preference if
> > someone showed me existing prior implementations that did 1 or 2.
> 
> 1. glibc already transliterates letters and ligatures capitalized in
> locale/C-translit.h.in:
> "\x00c6"	"AE"	# <U00C6> LATIN CAPITAL LETTER AE
> "\x0132"	"IJ"	# <U0132> LATIN CAPITAL LIGATURE IJ

I now lean towards thinking that this is wrong, because we don't have
a smart algorithm which would adjust the upper/lower case of
the transliterated letters.  I am not criticizing this particular
transliteration rule; any rule here would be wrong and incomplete
(e.g., "\x00c6" -> "Ae" could be good in some cases but also wrong
in many others).

As a real-life example (please correct me if I'm wrong), AFAIK
in German the umlaut letters like "Ö" are sometimes written
(transliterated) as "OE", but when they appear as the first letter
in a titlecased word they are transliterated as "Oe", not as "OE"
(e.g., "Österreich" -> "Oesterreich" but not "OEsterreich").

> 2. I would just like to quote myself from 2018: 
> 
> 
> "collisions due to "one symbol capitalization" would cause irreversible
> damage to data. 
> 
> For a library like glibc this seems to be a very relevant issue to
> consider..."
> [...]

Could you please elaborate why it is so important to ensure that
the output data is never ambiguous and what damage to data would
that cause?  OK, you mentioned the case of renaming files.  I believe
that a perfect non-collision algorithm is impossible.  A simple
example when it would never work is when you have two files in
the same directory: one with a name written in Cyrillic and another
one written in Latin using exactly the same name which is the output
of the transliteration algorithm.
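
The point can be illustrated with a short Python sketch that groups file
names by their transliterated form before any renaming is done. The
table is a hypothetical excerpt that uses the all-uppercase digraph
("Ш" -> "SH"); translit and find_collisions are names invented for this
example.

```python
from collections import defaultdict

# Hypothetical excerpt using the "SH" variant; NOT the table from the patch.
TABLE = {"Ш": "SH", "С": "S", "х": "h", "е": "e", "м": "m", "а": "a"}


def translit(name):
    # Characters missing from the table (here: all ASCII) pass through.
    return "".join(TABLE.get(ch, ch) for ch in name)


def find_collisions(names):
    """Return {target name: [source names]} for targets hit more than once."""
    groups = defaultdict(list)
    for name in names:
        groups[translit(name)].append(name)
    return {target: srcs for target, srcs in groups.items() if len(srcs) > 1}


print(find_collisions(["Шема.txt", "Схема.txt", "Shema.txt"]))
# {'Shema.txt': ['Схема.txt', 'Shema.txt']}
```

Even with the "SH" variant, the existing Latin-named file collides with
the transliterated Cyrillic one, so a check like this (or some other
handling in the calling program) is needed regardless of the table.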

Another question: why do you need to transliterate the file names
at all?  Wouldn't it work perfectly for you if they were not
transliterated at all?  My guess is that it might be useful when
using files on some older systems which do not support Unicode.

Maybe let's consider who (and why) should use any transliteration
at all. What comes to my mind is:

1. Countries (languages) which use two writing systems and want
   to have an automatic transliteration of the text. Examples:
   Serbian, Kazakh.
2. Countries (languages) which use non-Latin script but want to
   provide automatically some readable content for foreign visitors.
3. Backward compatibility with some older computer devices which
   are unable to handle Unicode.

Now we may think about what are the requirements of these target
groups and whether we can provide a solution which would work for
all of them.

Regards,

Rafal


* Re: [PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
  2019-06-07 10:51         ` Rafal Luzynski
@ 2019-06-07 12:36           ` Carlos O'Donell
  2019-06-07 13:00             ` Diego (Egor) Kobylkin
  2019-06-10 22:44             ` Rafal Luzynski
  0 siblings, 2 replies; 13+ messages in thread
From: Carlos O'Donell @ 2019-06-07 12:36 UTC (permalink / raw)
  To: Rafal Luzynski, Marko Myllynen, Diego (Egor) Kobylkin,
	libc-alpha, libc-locales, Siddhesh Poyarekar
  Cc: Mike Fabian

On 6/7/19 6:52 AM, Rafal Luzynski wrote:
> 7.06.2019 02:57 Carlos O'Donell <codonell@redhat.com> wrote:
>>
>> On 6/6/19 5:31 PM, Rafal Luzynski wrote:
>>>>> Possible answers (Cyrillic -> Latin Extended -> ASCII):
>>>>>
>>>>> 1. "Ш" -> "Š" -> "SH"
>>>>>
>>>>>      e.g.: "Шема" -> "Šema" -> "SHema"
>>>>>            "Схема" ----------> "Shema"
>>>>>
>>>>> 2. "Ш" -> "Š" -> "Sh"
>>>>>
>>>>>      e.g.: "Шема" -> "Šema" -> "Shema"
>>>>>            "Схема" ----------> "Shema"
>>>>>
>>>>> Personally I don't like the answer 1. because "SHema" looks weird
>>>>> to me.  Egor in turn does not like the answer 2. because the output
>>>>> string becomes ambiguous.
>>>>>
>>>>> Should we maybe have a smart algorithm which would select the title
>>>>> case or the upper case of the output characters depending on the
>>>>> context in the word?  Note that it would not resolve the problem of
>>>>> the output text being ambiguous.
>>>>
>>>> It seems clear that there is no one right/wrong answer but it's a
>>>> matter
>>>> of preference, especially the way this currently works. It might be an
>>>> improvement to output (for instance) SH instead of Sh if all the other
>>>> letters of a word are upper-case as well but not sure what would help
>>>> with the result being unambiguous.
>>>
>>> I think you refer to the idea of implementing a smart algorithm which
>>> would
>>> adapt the lower/upper case depending on the context but indeed it would
>>> not resolve the problem of ambiguity.
>>>
>>> So, the smart algorithm aside, what should be the preferred
>>> transliteration
>>> rule?
>>
>> I have a weak preference for 1. However, I would change my preference if
>> someone showed me existing prior implementations that did 1 or 2.
> 
> uconv implements a smart algorithm to adjust the upper/lower case:
> 
> ==================================================================
> $ echo "Схема" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
> Skhema
> 
> $ echo "Шема" | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
> Shema
> 
> $ echo "ШЕМА" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
> SHEMA
> 
> $ echo "ШЕма" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
> SHEma
> 
> $ echo "Ш Ема" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
> SH Yema
> ==================================================================
> 
> Also for them it is easier because they decided that "Х" should be
> transliterated to "KH" (I think this is the common thing when
> transliterating to English) while ISO 9 says it should be transliterated
> to "H" and GOST says it should be "X".  We can't implement this
> fallback in glibc because the glibc algorithm is very simple.

*Sigh*

I should have known you could find enough examples that contradict
each other :-)

I'd like to hear what Egor has to say about the data loss aspects.

-- 
Cheers,
Carlos.


* Re: [PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
  2019-06-07 12:36           ` Carlos O'Donell
@ 2019-06-07 13:00             ` Diego (Egor) Kobylkin
  2019-06-07 21:17               ` Carlos O'Donell
  2019-06-10 22:44             ` Rafal Luzynski
  1 sibling, 1 reply; 13+ messages in thread
From: Diego (Egor) Kobylkin @ 2019-06-07 13:00 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Rafal Luzynski, Marko Myllynen, libc-alpha, libc-locales,
	Siddhesh Poyarekar, Mike Fabian


[-- Attachment #1.1: Type: text/plain, Size: 5044 bytes --]

On Friday, June 7, 2019 2:35 PM, Carlos O'Donell <codonell@redhat.com> wrote:
> I'd like to hear what Egor has to say about the data loss aspects.

It's quite simple really: suppose you have a list of pages in a Wikipedia.

For example there are these two entries in Russian:
1.Шема https://ru.wikipedia.org/w/index.php?title=%D0%A8%D0%B5%D0%BC%D0%B0&redirect=no 

2.Схема https://ru.wikipedia.org/wiki/%D0%A1%D1%85%D0%B5%D0%BC%D0%B0 


So you want to scrape Wikipedia and write them out to files: Шема.txt and Схема.txt
But the target system doesn't support the Russian locale, so you must transliterate the filenames.
 

If "Ш"->"Sh" and "Сх"->"Sh", both of them will be written into the same file "Shema.txt". With no other special handling the first file will be overwritten and its data lost.

If "Ш"->"SH" and "Сх"->"Sh" - there will be two separate files 1. SHema.txt 2. Shema.txt . No data loss in this case. 


We can't exclude all data loss scenarios, but at least we shouldn't knowingly let the most basic ones happen just because of how SHema looks. Translit is mostly a technical field, at least in glibc, so aesthetics would be the last thing I would care about here.
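
For comparison, the "special handling" mentioned above could look like
the following Python sketch on the application side. It assumes the
program writing the files can be changed; write_new is a name invented
for this example. It relies on the exclusive-creation mode "x" of
Python's built-in open(), which fails instead of truncating an existing
file.

```python
import os
import tempfile


def write_new(directory, filename, text):
    """Create the file; return False instead of overwriting an existing one."""
    path = os.path.join(directory, filename)
    try:
        # Mode "x" raises FileExistsError if the path already exists.
        with open(path, "x", encoding="utf-8") as handle:
            handle.write(text)
    except FileExistsError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    first = write_new(tmp, "Shema.txt", "article on Шема")
    second = write_new(tmp, "Shema.txt", "article on Схема")
    print(first, second)  # True False
```

This only detects the clash; the caller still has to decide what to do
with the second article. Note also that on a case-insensitive file
system "SHema.txt" and "Shema.txt" would name the same file, so the
upper-case digraph alone does not help there.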


Anyway, I'm all for committing the patch one way or another and opening a new bug should anyone complain about Sh/SH. Until now we have had a hard time getting any input from any outsider on this issue. I guess I am de facto the only end user who has an opinion on this :-)

Bests,
Egor Kobylkin 


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, June 7, 2019 2:35 PM, Carlos O'Donell <codonell@redhat.com> wrote:

> On 6/7/19 6:52 AM, Rafal Luzynski wrote:
> 

> > 7.06.2019 02:57 Carlos O'Donell <codonell@redhat.com> wrote:
> > 

> > > On 6/6/19 5:31 PM, Rafal Luzynski wrote:
> > > 

> > > > > > Possible answers (Cyrillic -> Latin Extended -> ASCII):
> > > > > > 

> > > > > > 1.  "Ш" -> "Š" -> "SH"
> > > > > >     e.g.: "Шема" -> "Šema" -> "SHema"
> > > > > >     "Схема" ----------> "Shema"
> > > > > >     

> > > > > > 2.  "Ш" -> "Š" -> "Sh"
> > > > > >     e.g.: "Шема" -> "Šema" -> "Shema"
> > > > > >     "Схема" ----------> "Shema"
> > > > > >     

> > > > > > 

> > > > > > Personally I don't like the answer 1. because "SHema" looks weird
> > > > > > to me. Egor in turn does not like the answer 2. because the output
> > > > > > string becomes ambiguous.
> > > > > > Should we maybe have a smart algorithm which would select the title
> > > > > > case or the upper case of the output characters depending on the
> > > > > > context in the word? Note that it would not resolve the problem of
> > > > > > the output text being ambiguous.
> > > > > 

> > > > > It seems clear that there is no one right/wrong answer but it's a
> > > > > matter
> > > > > of preference, especially the way this currently works. It might be an
> > > > > improvement to output (for instance) SH instead of Sh if all the other
> > > > > letters of a word are upper-case as well but not sure what would help
> > > > > with the result being unambiguous.
> > > > 

> > > > I think you refer to the idea of implementing a smart algorithm which
> > > > would
> > > > adapt the lower/upper case depending on the context but indeed it would
> > > > not resolve the problem of ambiguity.
> > > > So, the smart algorithm aside, what should be the preferred
> > > > transliteration
> > > > rule?
> > > 

> > > I have a weak preference for 1. However, I would change my preference if
> > > someone showed me existing prior implementations that did 1 or 2.
> > 

> > uconv implements a smart algorithm to adjust the upper/lower case:
> > ==================================================================
> > $ echo "Схема" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
> > Skhema
> > $ echo "Шема" | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
> > Shema
> > $ echo "ШЕМА" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
> > SHEMA
> > $ echo "ШЕма" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
> > SHEma
> > 

> > $ echo "Ш Ема" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
> > SH Yema
> > 

> > ===================================================================
> > 

> > Also for them it is easier because they decided that "Х" should be
> > transliterated to "KH" (I think this is the common thing when
> > transliterating to English) while ISO 9 says it should be transliterated
> > to "H" and GOST says it should be "X". We can't implement this
> > fallback in glibc because the glibc algorithm is very simple.
> 

> Sigh
> 

> I should have known you could find enough examples that contradict
> each other :-)
> 

> I'd like to hear what Egor has to say about the data loss aspects.
> 

> -- 
> 

> Cheers,
> Carlos.


[-- Attachment #1.2: publickey - egor@kobylkin.com - 0x01FEB4E8.asc --]
[-- Type: application/pgp-keys, Size: 657 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 249 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
  2019-06-07 13:00             ` Diego (Egor) Kobylkin
@ 2019-06-07 21:17               ` Carlos O'Donell
  0 siblings, 0 replies; 13+ messages in thread
From: Carlos O'Donell @ 2019-06-07 21:17 UTC (permalink / raw)
  To: Diego (Egor) Kobylkin
  Cc: Rafal Luzynski, Marko Myllynen, libc-alpha, libc-locales,
	Siddhesh Poyarekar, Mike Fabian

On 6/7/19 8:59 AM, Diego (Egor) Kobylkin wrote:
> On Friday, June 7, 2019 2:35 PM, Carlos O'Donell
> <codonell@redhat.com> wrote:
>> I'd like to hear what Egor has to say about the data loss aspects.
> 
> It's quite simple really - suppose you have a list of pages in a
> wikipedia.
> 
> For example there are these two entries in Russian:
> 1. Шема
> https://ru.wikipedia.org/w/index.php?title=%D0%A8%D0%B5%D0%BC%D0%B0&redirect=no
>
> 2. Схема
> https://ru.wikipedia.org/wiki/%D0%A1%D1%85%D0%B5%D0%BC%D0%B0
> 
> 
> So you want to scrape wikipedia and write them out to files: Шема.txt and
> Схема.txt. But the target system doesn't support Russian locale and so
> you must transliterate the filenames.
> 
> 
> If "Ш"->"Sh" and "Сх"->"Sh", both of them will be written into the
> same file "Shema.txt". With no other special handling the first file
> will be overwritten and its data lost.
> 
> If "Ш"->"SH" and "Сх"->"Sh" - there will be two separate files 1.
> SHema.txt 2. Shema.txt . No data loss in this case.
  
Agreed.
  
> We can't exclude all data loss scenarios but at least shouldn't
> knowingly let the most basic ones happen just because of how SHema
> looks. Translit is mostly a technical field at least in glibc so the
> aesthetics would be the last thing I would care about here.
> 
> 
> Anyway I'm all for committing the patch this way or another and
> opening a new bug should anyone complain about Sh/SH. Until now we
> had a hard time getting any input from any outsider on this issue. I
> guess de-facto I am the only end-user that has an opinion on this
> :-)

I appreciate your input.

I expected this example; it's a classic problem with transliteration
that the conversion can result in non-unique representations.

I also think your point about "technical" is relevant here, nobody
really wants to read the transliterated results, they want to read
the original, and providing any hint about the original form has
value.

In glibc we don't have any framework for an intelligent conversion.
We would have to write specific code to handle this case and add
it into the translit code for special handling in this case.

I think we should today leave "Ш"->"SH" and "Сх"->"Sh", since it's
the most conservative position that avoids ambiguity, and then we
can discuss the aesthetics of this and the other impacts and solutions.
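
To make the trade-off concrete, here is a minimal Python sketch of a flat
per-character table. It is only an illustration, not the glibc locale data:
the tables cover just the letters of the two example words, and the names
`translit`, `AMBIGUOUS` and `CONSERVATIVE` are made up for this sketch.

```python
# Hypothetical, partial tables: only the letters of "Шема" and "Схема".
BASE = {"С": "S", "х": "h", "е": "e", "м": "m", "а": "a"}
AMBIGUOUS = {**BASE, "Ш": "Sh"}     # title-case digraph for Ш
CONSERVATIVE = {**BASE, "Ш": "SH"}  # upper-case digraph for Ш


def translit(text, table):
    """Map each character through a flat table; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


names = ["Шема", "Схема"]

# With "Ш"->"Sh" both names collapse to the same output string.
ambiguous = {translit(n, AMBIGUOUS) for n in names}

# With "Ш"->"SH" the two outputs stay distinct.
conservative = {translit(n, CONSERVATIVE) for n in names}

print(sorted(ambiguous))     # ['Shema']
print(sorted(conservative))  # ['SHema', 'Shema']
```

If the outputs were used as file names, the first table would leave one file
and the second would leave two, which is the data loss case described above.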

I appreciate Rafal's position, but I think being conservative here,
even if it's not as pretty as uconv, is a good guiding idea.

-- 
Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
  2019-06-07 12:36           ` Carlos O'Donell
  2019-06-07 13:00             ` Diego (Egor) Kobylkin
@ 2019-06-10 22:44             ` Rafal Luzynski
  2019-06-17  8:59               ` Diego (Egor) Kobylkin
  1 sibling, 1 reply; 13+ messages in thread
From: Rafal Luzynski @ 2019-06-10 22:44 UTC (permalink / raw)
  To: Carlos O'Donell, Marko Myllynen, Diego (Egor) Kobylkin,
	libc-alpha, libc-locales, Siddhesh Poyarekar
  Cc: Mike Fabian

7.06.2019 14:35 Carlos O'Donell <codonell@redhat.com> wrote:
> On 6/7/19 6:52 AM, Rafal Luzynski wrote:
> [...]
> > 
> > uconv implements a smart algorithm to adjust the upper/lower case:
> > 
> > ==================================================================
> > $ echo "Схема" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
> > Skhema
> > 
> > $ echo "Шема" | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
> > Shema
> > 
> > $ echo "ШЕМА" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
> > SHEMA
> > 
> > $ echo "ШЕма" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
> > SHEma
> > 
> > $ echo "Ш Ема" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
> > SH Yema
> > ==================================================================
> > 
> > Also for them it is easier because they decided that "Х" should be
> > transliterated to "KH" (I think this is the common thing when
> > transliterating to English) while ISO 9 says it should be transliterated
> > to "H" and GOST says it should be "X".  We can't implement this
> > fallback in glibc because the glibc algorithm is very simple.
> 
> *Sigh*
> 
> I should have known you could find enough examples that contradict
> each other :-)

Sorry about the confusion.  My aim was to demonstrate how uconv adjusts
upper/lower case depending on the context so "Ш" sometimes becomes "SH"
and sometimes "Sh".



7.06.2019 14:59 "Diego (Egor) Kobylkin" <egor@kobylkin.com> wrote:
> [...]
> It's quite simple really - suppose you have a list of pages in a
> wikipedia.
> 
> For example there are these two entries in Russian:
> 1.Шема
> https://ru.wikipedia.org/w/index.php?title=%D0%A8%D0%B5%D0%BC%D0%B0&redirect=no
> 
> 
> 2.Схема https://ru.wikipedia.org/wiki/%D0%A1%D1%85%D0%B5%D0%BC%D0%B0 
> 
> 
> So you want to scrape wikipedia and write them out to files: Шема.txt and
> Схема.txt
> But the target system doesn't support Russian locale and so you must
> transliterate the filenames.

While talking about the filesystem: I think the problem is not
that it does not support Russian locale but that it tries to
handle it and fails at this.  If the filesystem accepted any
byte string as a file name wouldn't it accept a byte string which
constructs correct Cyrillic characters in UTF-8, without any
transliteration?

> If "Ш"->"Sh" and "Сх"->"Sh", both of them will be written into the same
> file "Shema.txt". With no other special handling the first file will be
> overwritten and its data lost.
> 
> If "Ш"->"SH" and "Сх"->"Sh" - there will be two separate files 1.
> SHema.txt 2. Shema.txt . No data loss in this case. 
> [...]

The problem is exclusively in the limitation of glibc itself.
In fact no standard says that "Ш" should be transliterated as "Sh"
(or "SH") and "Х" as "H" (consequently, "Сх" as "Sh").  ISO 9
says that "Ш" should be "Š" and "Х" should be "H" (consequently,
"Сх" should be "Sh" but that would never be confused for "Ш").
GOST 7.79 says that "Ш" should be "SH" (or "Sh") and "Х" should
be "X" (consequently, "Сх" should be "Sx").  There is no confusion
in any case.  The problem is that we can't express all these rules
in the language of glibc transliterations; the rule:

    Х    "H";"X"

will not work because it would choose a transliteration of "X" only
if "H" was not available in the target charset (which never happens)
while we want it to choose "X" if "Š" is not available.
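
A small Python sketch of the limitation described above, under the
assumption that each letter's alternatives are tried in order and the first
one representable in the target charset wins.  The function name
`first_representable` and the `RULES` table are hypothetical; this is not
glibc code, and the alternatives listed for "Ш" are only an example.

```python
def first_representable(alternatives, codec):
    """Return the first alternative that can be encoded in `codec`, else None."""
    for candidate in alternatives:
        try:
            candidate.encode(codec)
        except UnicodeEncodeError:
            continue
        return candidate
    return None


# Hypothetical per-letter rules; each letter is resolved on its own.
RULES = {
    "Ш": ["Š", "SH"],
    "Х": ["H", "X"],
}

# ASCII target: "Š" is not representable, so "Ш" falls back to "SH" ...
print(first_representable(RULES["Ш"], "ascii"))      # SH
# ... but "H" is always representable, so "X" is never selected for "Х".
print(first_representable(RULES["Х"], "ascii"))      # H
# ISO 8859-2 target: "Š" is available, so no fallback is needed for "Ш".
print(first_representable(RULES["Ш"], "iso8859_2"))  # Š
```

The choice for "Х" never looks at what happened to "Ш", so a per-letter
fallback list cannot express "use X only when Š is unavailable".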



7.06.2019 23:17 Carlos O'Donell <codonell@redhat.com> wrote:
> [...]
> I also think your point about "technical" is relevant here, nobody
> really wants to read the transliterated results, they want to read
> the original, and providing any hint about the original form has
> value.

It looks like I totally misunderstood the purpose.  I always thought
the aim is to produce a transliteration system for real natural
language texts and to achieve the same output as it would be written
by a human writer.  Which I still think is possible, at least partially
and not necessarily in the current development cycle.  If you guys
want to have only a technical hint and want to relax the linguistic
rules then Egor's patches are mostly sufficient.

> In glibc we don't have any framework for an intelligent conversion.
> We would have to write specific code to handle this case and add
> it into the translit code for special handling in this case.

My suggestion was to add such an intelligent conversion.  The rule
should be simple: if a letter is followed by a lowercase it should
be a titlecase (Sh), otherwise it should be uppercase (SH).  But
this may break Egor's requirement to keep them always uppercase.
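
The rule could be sketched roughly as follows.  This is a Python
illustration only, with hypothetical names (`smart_translit`, `DIGRAPHS`,
`SINGLE`) and a table limited to the letters used in the examples from this
thread; it is not a proposal for the actual glibc code.

```python
# Hypothetical tables covering only the letters in the examples.
DIGRAPHS = {"Ш": "SH"}
SINGLE = {
    "С": "S", "Е": "E", "М": "M", "А": "A",
    "х": "h", "е": "e", "м": "m", "а": "a",
}


def smart_translit(text):
    """Title-case a digraph when the next character is lowercase, else keep it upper."""
    out = []
    for i, ch in enumerate(text):
        if ch in DIGRAPHS:
            nxt = text[i + 1] if i + 1 < len(text) else ""
            rep = DIGRAPHS[ch]
            # "".islower() and " ".islower() are both False, so a digraph at
            # the end of the text or before a space stays upper-case.
            out.append(rep.capitalize() if nxt.islower() else rep)
        else:
            out.append(SINGLE.get(ch, ch))
    return "".join(out)


print(smart_translit("Шема"))   # Shema
print(smart_translit("ШЕМА"))   # SHEMA
print(smart_translit("ШЕма"))   # SHEma
print(smart_translit("Схема"))  # Shema
```

Note that "Шема" and "Схема" still produce the same output here, so the
context rule improves the look of the result but does not remove the
ambiguity.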

> I think we should today leave "Ш"->"SH" and "Сх"->"Sh", since it's
> the most conservative position that avoids ambiguity, and then we
> can discuss the aesthetics of this and the other impacts and solutions.
> 
> I appreciate Rafal's position, but I think being conservative here,
> even if it's not as pretty as uconv, is a good guiding idea.

Just to summarize: if you want to apply the relaxed rules, more
technical than linguistic, then I am more willing to accept these
patches.

Regards,

Rafal

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
  2019-06-10 22:44             ` Rafal Luzynski
@ 2019-06-17  8:59               ` Diego (Egor) Kobylkin
  0 siblings, 0 replies; 13+ messages in thread
From: Diego (Egor) Kobylkin @ 2019-06-17  8:59 UTC (permalink / raw)
  To: Rafal Luzynski
  Cc: Carlos O'Donell, Marko Myllynen, libc-alpha, libc-locales,
	Siddhesh Poyarekar, Mike Fabian


[-- Attachment #1.1: Type: text/plain, Size: 2857 bytes --]


Carlos, 


we seem to have a consensus of all involved that the patch can be committed as is. 

Do you see it like this on your side as well or are there any more questions or suggestions?

Bests,
Egor

P.S. Just a clarification to Rafal's points below and thanks @Rafal for the intensive "peer review" so far!
It definitely looks to me like we finally don't have any more divergent points after all the issues discussed.


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, June 11, 2019 12:40 AM, Rafal Luzynski <digitalfreak@lingonborough.com> wrote:
...
> 7.06.2019 14:59 "Diego (Egor) Kobylkin" egor@kobylkin.com wrote:
> > But the target system doesn't support Russian locale and so you must
> > transliterate the filenames.
> 

> While talking about the filesystem: I think the problem is not
> that it does not support Russian locale but that it tries to
> handle it and fails at this. If the filesystem accepted any
> byte string as a file name wouldn't it accept a byte string which
> constructs correct Cyrillic characters in UTF-8, without any
> transliteration?

Just to clarify here - the need to transliterate is the essential part in this example, not the actual cause of that need. 

A lot of "things" don't support UTF-8 or Cyrillic - filesystems, some UNIX power tools, older network appliances, databases, key-value stores etc. We are talking about a situation where you are forced to transliterate to ASCII. So that requirement is a given. 


...
> 

> > In glibc we don't have any framework for an intelligent conversion.
> > We would have to write specific code to handle this case and add
> > it into the translit code for special handling in this case.
> 

> My suggestion was to add such an intelligent conversion. The rule
> should be simple: if a letter is followed by a lowercase it should
> be a titlecase (Sh), otherwise it should be uppercase (SH). But
> this may break Egor's requirement to keep them always uppercase.

Again for the record my "requirement" is to have a minimal patch committed sooner rather than later. It turned out to be surprisingly difficult to keep our focus even on a single flat mapping table that the ASCII transliteration really is. 


> 

> > I think we should today leave "Ш"->"SH" and "Сх"->"Sh", since it's
> > the most conservative position that avoids ambiguity, and then we
> > can discuss the aesthetics of this and the other impacts and solutions.
> > I appreciate Rafal's position, but I think being conservative here,
> > even if it's not as pretty as uconv, is a good guiding idea.
> 

> Just to summarize: if you want to apply the relaxed rules, more
> technical than linguistic, then I am more willing to accept these
> patches.

The great thing is that we seem to have a consensus now and can proceed.



[-- Attachment #1.2: publickey - egor@kobylkin.com - 0x01FEB4E8.asc --]
[-- Type: application/pgp-keys, Size: 657 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 249 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-06-17  8:59 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-05  6:47 [PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872] Diego (Egor) Kobylkin
2019-06-05 23:51 ` Rafal Luzynski
2019-06-06  9:42   ` Marko Myllynen
2019-06-06 21:31     ` Rafal Luzynski
2019-06-07  0:58       ` Carlos O'Donell
2019-06-07  9:46         ` Diego (Egor) Kobylkin
2019-06-07 11:11           ` Rafal Luzynski
2019-06-07 10:51         ` Rafal Luzynski
2019-06-07 12:36           ` Carlos O'Donell
2019-06-07 13:00             ` Diego (Egor) Kobylkin
2019-06-07 21:17               ` Carlos O'Donell
2019-06-10 22:44             ` Rafal Luzynski
2019-06-17  8:59               ` Diego (Egor) Kobylkin
