public inbox for libc-locales@sourceware.org
 help / color / mirror / Atom feed
From: Rafal Luzynski <digitalfreak@lingonborough.com>
To: Marko Myllynen <myllynen@redhat.com>,
	Egor Kobylkin <egor@kobylkin.com>,
	libc-alpha@sourceware.org, libc-locales@sourceware.org,
	Carlos O'Donell <carlos@redhat.com>
Cc: Siddhesh Poyarekar <siddhesh@gotplt.org>,
	Mike Fabian <mfabian@redhat.com>
Subject: Re: [PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872] ping for 2.30
Date: Fri, 19 Apr 2019 22:24:00 -0000	[thread overview]
Message-ID: <32577152.740058.1555712661835@poczta.nazwa.pl> (raw)
In-Reply-To: <cf4a6fd5-e3f4-ea19-4a05-1afb72a110f7@redhat.com>

Thank you Siddhesh and Carlos for your involvement in testing this
patch and I apologize Egor and Marko and everyone else who need this
patch to be pushed for my poor involvement.  I'd like to reply to
this email from Marko because it summarizes all issues.  Also I hope
I will explain the problems which made me stuck.

14.02.2019 17:48 Marko Myllynen <myllynen@redhat.com> wrote:
> [...]
> 1) Built-in C locale doesn't read/use any translit_* files and it can't
> have any fallback mechanisms and it only supports ASCII so using GOST
> 7.79 System B in locale/C-translit.h.in (as per patch v12) would seem to
> be the appropriate way to implement Cyrillic transliteration for the
> built-in C locale (it adds some 8KB to the binary).

This sounds like a good idea.

Also, C locale is probably a good way to enforce the plain ASCII
transliteration without any fallback.

> 2) Other locales read/use translit_* files and with them fallbacks and
> non-ASCII are possible so it would seem preferable to first try ISO 9 /
> GOST 7.79 System A

OK, we agree here.

> and only if that fails then use GOST 7.79 System B
> (in which case the end result should match with the built-in C locale).

This is impossible due to this case.  System A transliterates the Cyrillic
"Х" to Latin "H", system B transliterates it to Latin "X".  Transliteration
as implemented in glibc supports a simple fallback algorithm: transliterate
the letter "X" to "YY" but if it is not available then to "ZZ".  It can't
support the complex algorithm which we need here: transliterate "X" to "YY"
but if "Q" cannot be transliterated to "RR" then transliterate "X" to "ZZ".
In our case we would like to transliterate "Х" to "X" if "Ш" cannot be
transliterated to "Š".  The only thing we can implement is a fallback
transliteration which is similar to System B but not 100% compatible.

This is not the case if we are going to implement only System B in C locale
because we know already that "Š" is unavailable so we have to transliterate
"Х" to "X" always.

> For this the translit_cyrillic file should be added (as per patch v9 +
> changes mentioned in patches v10 and v12).
> 
> 3) Individual locale files can then be updated to use translit_cyrillic
> as appropriate (see patch v9) and language/national specific conventions
> (e.g., SFS 4900 for fi_FI) can be applied on per-locale basis.

Sometimes I wonder whether really any other locale than a language which
uses the Cyrillic script should want to have a Cyrillic transliteration
but on the other hand - why not.

Also I'd like to reiterate other disagreements which we have here:

1. How to handle upper/lower case in System B?  Should we transliterate
   "Ш" to "SH" or "Sh"?  Should we maybe implement a smart context based
   casing algorithm first?  I mean the algorithm which would detect if
   an uppercase letter appears as the first letter of otherwise lowercase
   word so should be transliterated as "Sh", or maybe it's in a context
   of a fully uppercase word so should be transliterated as "SH".
   I think that uconv implements this algorithm.
2. How to handle ambiguous transliterations like "Схема" -> "Shema"
   vs. "Шема" -> "Shema"? "SHema"?
3. How to handle the characters which are proper letters in Cyrillic
   and have an upper and lower case like a hard and soft sign but are
   transliterated to punctuation characters (grave accent "`")?
   Should we transliterate upper and lower case to the same character
   or should we mark them somehow?  uconv adds Unicode combining low
   line to the grave accent (so the output is "`̲") if the original
   Cyrillic character was uppercase.  But this is unavailable if
   our target charset is ASCII.

Regarding the test cases which I mentioned the other day I discussed
this with Dmitry and he convinced me that requiring the test cases is
the bar set too high so I agree we don't need to require them already.

Regards,

Rafal

  parent reply	other threads:[~2019-04-19 22:24 UTC|newest]

Thread overview: 107+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com>
     [not found] ` <20180412224352.GB2911@altlinux.org>
2018-07-17 19:34   ` SUBJECT: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] Egor Kobylkin
2018-07-17 19:41     ` Carlos O'Donell
2018-07-17 19:50       ` Egor Kobylkin
2018-07-17 19:59         ` Carlos O'Donell
2018-08-06 19:00   ` [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29 Egor Kobylkin
2018-10-03  8:28     ` Egor Kobylkin
2018-10-03  9:20       ` Keld Simonsen
2018-10-03  9:32         ` Egor Kobylkin
2018-10-05  8:44           ` Marko Myllynen
2018-10-05  9:20           ` Rafal Luzynski
2018-10-05 10:37             ` Egor Kobylkin
2018-10-08 22:05               ` Rafal Luzynski
2018-10-08 22:52                 ` Egor Kobylkin
2018-10-09 21:43                   ` Rafal Luzynski
2018-10-09 16:10                 ` Marko Myllynen
2018-10-09 16:22                   ` Egor Kobylkin
2018-10-09 16:49                     ` Marko Myllynen
2018-10-09 22:09                   ` Rafal Luzynski
2018-10-10 11:21                     ` Marko Myllynen
2018-10-11 10:10                   ` Marko Myllynen
     [not found]             ` <deacdf31-d0bb-a92d-1de3-934d6b4cb158@kobylkin.com>
2018-10-05 11:54               ` Marko Myllynen
2018-10-05 12:01                 ` Egor Kobylkin
2018-10-05 12:21                   ` Marko Myllynen
2018-10-05 20:47                     ` Egor Kobylkin
2018-10-08 12:41                       ` Marko Myllynen
2018-10-08 22:23                         ` Rafal Luzynski
2018-10-08 23:36                           ` Egor Kobylkin
2018-10-09 13:18                             ` Egor Kobylkin
2018-10-09 18:34                               ` Egor Kobylkin
2018-10-09 22:18                                 ` Rafal Luzynski
2018-10-09 22:40                                   ` Egor Kobylkin
2018-10-09 22:43                                     ` Egor Kobylkin
2018-10-10 11:23                                       ` Marko Myllynen
2018-10-10 12:20                                         ` Egor Kobylkin
2018-10-10 12:34                                           ` Marko Myllynen
2018-10-10 22:29   ` [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] v2 Egor Kobylkin
2018-10-11 10:00     ` Marko Myllynen
2018-10-11 11:05     ` Rafal Luzynski
2018-10-11 13:10       ` Marko Myllynen
2018-10-11 13:51       ` Volodymyr Lisivka
2018-10-11 14:59       ` Egor Kobylkin
2018-10-11 21:31         ` Egor Kobylkin
2018-10-11 15:05       ` Egor Kobylkin
2018-10-11 15:45   ` [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] v3 Egor Kobylkin
2018-10-11 21:33   ` [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] v4 Egor Kobylkin
2018-10-12 14:06   ` [PATCH v5] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] Egor Kobylkin
2018-10-13  1:01     ` Rafal Luzynski
2018-10-13 16:58       ` Egor Kobylkin
2018-10-15 11:05         ` Marko Myllynen
2018-10-15 11:55           ` Egor Kobylkin
2018-10-23 23:08         ` Rafal Luzynski
2018-10-17 14:17   ` [PATCH v6] " Egor Kobylkin
2018-11-01 22:52   ` [PATCH v7] " Egor Kobylkin
2018-11-02  0:01   ` [PATCH v8] " Egor Kobylkin
2018-11-02 22:22     ` Rafal Luzynski
2018-11-02 23:27       ` Egor Kobylkin
2018-11-14 21:25   ` [PATCH v9] " Egor Kobylkin
2018-11-16 22:17     ` Rafal Luzynski
2018-11-17 18:35       ` Egor Kobylkin
2018-11-19  7:14         ` Marko Myllynen
2018-11-19  9:22           ` Egor Kobylkin
2018-11-19 19:36             ` Marko Myllynen
2018-12-01 22:09           ` Rafal Luzynski
2018-12-01 22:53             ` Egor Kobylkin
2018-12-03 22:19             ` Egor Kobylkin
     [not found]               ` <1361059722.707244.1544231740358@poczta.nazwa.pl>
2018-12-10 21:20                 ` Marko Myllynen
2018-12-19 22:26                   ` Rafal Luzynski
2018-12-19 22:48                     ` Egor Kobylkin
2018-12-19 23:51                       ` Rafal Luzynski
2018-11-19 11:11   ` [PATCH v10] " Egor Kobylkin
2018-12-07 23:35     ` Rafal Luzynski
2018-12-08 21:51       ` Egor Kobylkin
2018-12-19 22:42         ` Rafal Luzynski
2018-12-19 23:02           ` Egor Kobylkin
2018-12-20  0:06             ` Rafal Luzynski
2018-12-08 22:28   ` [PATCH v11] Locales: Cyrillic -> ASCII transliteration " Egor Kobylkin
2018-12-19 23:16     ` Egor Kobylkin
2018-12-26 10:07       ` Siddhesh Poyarekar
2018-12-26 12:14         ` Egor Kobylkin
2018-12-27  1:31           ` Siddhesh Poyarekar
2018-12-27 11:31             ` Rafal Luzynski
2019-01-02 18:39   ` [PATCH v12] " Egor Kobylkin
2019-01-05 14:36     ` Rafal Luzynski
2019-01-05 21:13       ` Egor Kobylkin
2019-01-07 20:37         ` Marko Myllynen
2019-01-09  0:46           ` Egor Kobylkin
2019-01-09 20:03             ` Marko Myllynen
2019-02-04  7:14               ` [PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872] ping for 2.30 Egor Kobylkin
2019-02-14 16:48                 ` Marko Myllynen
2019-03-04 22:12                   ` Egor Kobylkin
2019-03-11 13:59                     ` PING " Egor Kobylkin
2019-03-14 19:49                       ` Egor Kobylkin
2019-04-19 22:24                   ` Rafal Luzynski [this message]
     [not found]                     ` <5ELixS9SQ0DW4mlvswp96ASpLobBabU9KQ6zOTH-Udrb34mABhcqiPERpBZfPWZ9F77s8XNmiLIAq9UWu0AjLFFdjOz_FZVU5_xF-SiQkrw=@kobylkin.com>
2019-04-27  2:51                       ` Siddhesh Poyarekar
2019-04-27  7:34                         ` Diego (Egor) Kobylkin
2019-04-09  1:04     ` [PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872] Carlos O'Donell
2019-03-19 10:39   ` ping " Egor Kobylkin
2019-03-28 16:20     ` [PING^4][PATCH " Marko Myllynen
2019-04-04 19:44     ` [PING^5][PATCH " Egor Kobylkin
2019-04-06  1:36       ` Siddhesh Poyarekar
2019-04-16  7:15     ` [PING^6][PATCH " Marko Myllynen
2019-04-16 13:17       ` Carlos O'Donell
2019-04-16 17:07         ` Egor Kobylkin
2019-04-16 17:58           ` Carlos O'Donell
2019-04-16 18:41             ` Egor Kobylkin
2019-04-16 19:06               ` Carlos O'Donell
2019-05-10 12:19                 ` Marko Myllynen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=32577152.740058.1555712661835@poczta.nazwa.pl \
    --to=digitalfreak@lingonborough.com \
    --cc=carlos@redhat.com \
    --cc=egor@kobylkin.com \
    --cc=libc-alpha@sourceware.org \
    --cc=libc-locales@sourceware.org \
    --cc=mfabian@redhat.com \
    --cc=myllynen@redhat.com \
    --cc=siddhesh@gotplt.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).