public inbox for libc-locales@sourceware.org
From: Egor Kobylkin <egor@kobylkin.com>
To: Siddhesh Poyarekar <siddhesh@gotplt.org>,
	Rafal Luzynski <digitalfreak@lingonborough.com>,
	Carlos O'Donell <carlos@redhat.com>,
	libc-alpha@sourceware.org, libc-locales@sourceware.org
Subject: Re: Help needed reviewing Cyrillic -> ASCII transliteration [BZ #2872]
Date: Thu, 03 Jan 2019 11:22:00 -0000
Message-ID: <abf0875c-a9e7-0867-4f2a-67265c36f091@kobylkin.com>

Hi,

I would appreciate it if you could all keep me in the To: field for
discussions of this patch, as I am not subscribed to the list. Please let
me know if there is another way to handle this.

> 
> From: Siddhesh Poyarekar <siddhesh at gotplt dot org>
> To: Rafal Luzynski <digitalfreak at lingonborough dot com>,
>     libc-alpha at sourceware dot org
> Date: Wed, 2 Jan 2019 23:35:13 +0530
> Subject: Re: Help needed reviewing Cyrillic -> ASCII transliteration [BZ #2872]
> References: <1042796605.674608.1545262199252@poczta.nazwa.pl>
> 
> I've tried to overcome my general lack of confidence in commenting on
> locale related issues to provide some opinions. I'd take those with
> an appropriate dose of salt since like I've said before, I have 
> little experience in this area.
> 
> 
> On 20/12/18 4:59 AM, Rafal Luzynski wrote:
> [SNIP]
>> * Is the C builtin locale the correct place to put this
>> transliteration? If yes, should we think about including the
>> support of other alphabets as well (like extended Latin -> plain
>> ASCII, Greek -> Latin, and so on) ever in future?
> 
> 
> Yes on both counts, although this could result in bloating of the C 
> locale. If we are to provide additional transliteration of this sort,
> we probably need to provide some way to trim it.

Is there a specific way you measure the bloat of the C locale?
Is it the size of the resulting libc.so.6 file that we are concerned about?

In terms of source code, we are adding only as many lines as there are
letters (169 insertions for Cyrillic in v12 of this patch).


> 
> 
>> * Should the Cyrillic transliteration work in every locale 
>> (possibly with few exceptions) or should we require that a locale 
>> actually using Cyrillic script must be used? (E.g., should it work 
>> when ru_RU is not installed? should it work if en_US is the only 
>> locale installed? Should it work when no locale is installed, even 
>> en_US?)
> 
> 
> Would it matter if it was in the C builtin locale?

To clarify, the whole point of this patch (at least for me) is to have
transliteration available when other methods are not, or when existing
programs and systems cannot make use of them. The most basic example:
Cyrillic filenames on a NAS get converted to ????????.??? and, in the
worst case, overwrite one another. So the patch is most valuable when it
works out of the box with the builtin C locale. Other locales can
implement their own variant and use it explicitly if they need one; some
already have, and others may be just fine with the builtin C.
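
To make the filename problem concrete, here is a minimal Python sketch.
It is only an illustration and does not use glibc or iconv. The file
names and the small TABLE are made up for this example and are not the
mapping from the patch.

```python
# Two hypothetical Cyrillic file names of equal length.
names = ["отчет.txt", "книга.txt"]


def lossy_ascii(name: str) -> str:
    """Replace every character ASCII cannot represent with '?'."""
    return name.encode("ascii", errors="replace").decode("ascii")


# Tiny illustrative table, NOT the table proposed in the patch.
TABLE = {
    "о": "o", "т": "t", "ч": "ch", "е": "e",
    "к": "k", "н": "n", "и": "i", "г": "g", "а": "a",
}


def transliterate(name: str) -> str:
    """Map each character through TABLE; leave unknown characters alone."""
    return "".join(TABLE.get(ch, ch) for ch in name)


lossy = [lossy_ascii(n) for n in names]
translit = [transliterate(n) for n in names]

print(lossy)     # both names collapse to the same string
print(translit)  # the names stay distinct
```

With '?' substitution both names become "?????.txt", so one file would
replace the other. With even a rough transliteration they remain
distinguishable.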

> 
>> * Is it required that transliteration produces unambiguous output 
>> which means that two different original strings never produce the 
>> same result? (As a consequence, the reverse transliteration could 
>> be possible).
> 
> 
> I don't think so. Transliterations are approximations in the end and
>  striving for such guarantees might be overreach.
> 
> 
>> Additionally we have a disagreement about how should we handle the
>>  case when a single original uppercase character transliterates
>> into a digraph in ASCII.  Should both ASCII characters be
>> uppercase (which is good for all uppercase strings and also good to
>> emphasize that the original character was single rather than two
>> separate characters which accidentally transliterate into two
>> characters making a digraph) or should only the first ASCII
>> character be uppercase (which is good for the titlecase words which
>> is common in natural texts)? An example is "Ш" - should it be "SH"
>> or "Sh"? Note that "Сх" may also produce "Sh" ("S" + "h" -> "Sh").
> 
> 
> Is that important?

As in the file example above, you would probably agree that it is better
not to knowingly introduce a failure vector for basic OS operations such
as working with files. Capitalization collisions in the transliteration
have this negative potential. Users who need a different, specific
capitalization can still implement it in their locale.
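
The collision quoted above ("Ш" versus "Сх") can be reproduced with a
short Python sketch. The two tables below are hypothetical and contain
only the letters needed for that one example; they are not taken from
the patch.

```python
# Letters shared by both hypothetical tables.
BASE = {"С": "S", "с": "s", "Х": "H", "х": "h", "ш": "sh"}

# Variant 1: only the first letter of the digraph is uppercase.
TITLECASE = {**BASE, "Ш": "Sh"}
# Variant 2: both letters of the digraph are uppercase.
UPPERCASE = {**BASE, "Ш": "SH"}


def transliterate(text: str, table: dict) -> str:
    """Map each character through the table; leave unknown ones alone."""
    return "".join(table.get(ch, ch) for ch in text)


def collisions(words: list, table: dict) -> dict:
    """Group the inputs by output and keep the outputs hit more than once."""
    seen = {}
    for word in words:
        seen.setdefault(transliterate(word, table), []).append(word)
    return {out: src for out, src in seen.items() if len(src) > 1}


words = ["Ш", "Сх"]  # the two strings named in the thread
print(collisions(words, TITLECASE))
print(collisions(words, UPPERCASE))
```

For these two inputs the titlecase table maps both to "Sh", while the
uppercase table keeps them apart ("SH" and "Sh"). The check covers only
the two strings from the thread and says nothing about other inputs.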

Bests,
Egor

P.S.
Just for your reference here is the current patch:
https://sourceware.org/ml/libc-alpha/2019-01/msg00040.html
and the entry in the sourceware wiki:
https://sourceware.org/glibc/wiki/Release/2.29#Desirable_this_release.3F
