public inbox for libc-locales@sourceware.org
From: Siddhesh Poyarekar <siddhesh@gotplt.org>
To: Egor Kobylkin <egor@kobylkin.com>,
	Rafal Luzynski <digitalfreak@lingonborough.com>,
	Carlos O'Donell <carlos@redhat.com>,
	libc-alpha@sourceware.org, libc-locales@sourceware.org
Subject: Re: Help needed reviewing Cyrillic -> ASCII transliteration [BZ #2872]
Date: Thu, 03 Jan 2019 13:41:00 -0000
Message-ID: <a706684f-d011-c2e7-021c-c1946dfa702c@gotplt.org>
In-Reply-To: <abf0875c-a9e7-0867-4f2a-67265c36f091@kobylkin.com>

On 03/01/19 4:52 PM, Egor Kobylkin wrote:
> Is there a specific way you measure the bloat of the C locale?
> Is it the size of the resulting libc.so.6 file we are concerned with?

I believe it's built into libc.so, so I suppose you'd have to look at 
its file size.
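
As a rough sketch of that comparison (in Python, purely for
illustration), one could compare the size of libc.so from a build
without the patch against one with it.  The two build paths below are
hypothetical placeholders, not paths from this thread:

```python
import os


def size_delta(before_path, after_path):
    """Return (bytes_before, bytes_after, difference) for two builds of a file."""
    before = os.path.getsize(before_path)
    after = os.path.getsize(after_path)
    return before, after, after - before


if __name__ == "__main__":
    # Hypothetical build directories; substitute your own.
    unpatched = "build-unpatched/libc.so"
    patched = "build-patched/libc.so"
    if os.path.exists(unpatched) and os.path.exists(patched):
        before, after, delta = size_delta(unpatched, patched)
        print(f"{before} -> {after} bytes ({delta:+d})")
```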

> In terms of the source code we are just adding as many lines as there
> are letters (169 insertions for Cyrillic in this patch v12)

Yeah, it will likely not be much for a single locale, but it may add up 
across locales.  I have no idea how much; it may well be insignificant.

> Just for clarification, the whole point (at least for me) for this patch
> is to have the transliteration when other methods are not available. Or
> when existing programs/systems cannot make use of them. The most basic
> example: filenames in Cyrillic on a NAS that get converted to
> ????????.??? and get overwritten in the worst case. So the most value is
> when it works out of the box with the C builtin. Other locales can actually
> implement their own variant and explicitly use it if they need one; some
> already have, others may be just fine with the builtin C.

That's a fair point, but given that the transliteration is only an 
approximation, that specific use case may still be flaky.
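
To make the failure mode in the quoted example concrete, here is a
minimal sketch in Python.  It assumes the lossy conversion behaves like
a plain ASCII encode that substitutes "?" for every character it cannot
represent; the actual NAS behaviour is not described in this thread, and
the two file names are made up for illustration:

```python
def lossy_ascii(name):
    # errors="replace" substitutes "?" for each character ASCII cannot represent.
    return name.encode("ascii", errors="replace").decode("ascii")


# Two distinct, made-up Cyrillic file names of the same length.
names = ["книга.txt", "мечта.txt"]
converted = [lossy_ascii(n) for n in names]
print(converted)  # both names collapse to "?????.txt"
```

Any two names of equal length collapse to the same string, which is how
one file can end up overwriting another.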

>>> Additionally we have a disagreement about how we should handle the
>>> case when a single original uppercase character transliterates
>>> into a digraph in ASCII.  Should both ASCII characters be
>>> uppercase (which is good for all uppercase strings and also good to
>>> emphasize that the original character was single rather than two
>>> separate characters which accidentally transliterate into two
>>> characters making a digraph) or should only the first ASCII
>>> character be uppercase (which is good for the titlecase words which
>>> is common in natural texts)? An example is "Ш" - should it be "SH"
>>> or "Sh"? Note that "Сх" may also produce "Sh" ("S" + "h" -> "Sh").
>>
>>
>> Is that important?
> 
> As in the above example about the files, you would probably agree that
> it's better not to knowingly introduce a failure vector for such basic
> OS operations as working with files. The transliteration
> capitalization collisions have this negative potential. Users who
> need a different specific capitalization can still implement that in
> their locale.

OK, I can see the reason for reducing collisions, but again, it remains 
flaky.  We could, in the interest of moving forward, strive towards 
making it less flaky while staying aware that there may eventually 
be collisions.
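
For reference, the capitalization collision discussed above can be
sketched in a few lines of Python.  The tables below are hypothetical
fragments chosen only to reproduce the "Ш" versus "Сх" example (the
entry for "а" is an assumption); they are not the mapping from the
patch:

```python
# Hypothetical table fragments for illustration; not the actual patch.
BASE = {"С": "S", "х": "h", "а": "a"}
UPPER_DIGRAPH = {**BASE, "Ш": "SH"}  # both letters of the digraph uppercase
TITLE_DIGRAPH = {**BASE, "Ш": "Sh"}  # only the first letter uppercase


def transliterate(text, table):
    # Characters missing from the table become "?".
    return "".join(table.get(ch, "?") for ch in text)


def collisions(names, table):
    """Group names by transliteration and keep groups with more than one member."""
    seen = {}
    for name in names:
        seen.setdefault(transliterate(name, table), []).append(name)
    return {out: group for out, group in seen.items() if len(group) > 1}


names = ["Ша", "Сха"]
print(collisions(names, TITLE_DIGRAPH))  # {'Sha': ['Ша', 'Сха']}
print(collisions(names, UPPER_DIGRAPH))  # {}
```

With "Sh" the two names collide; with "SH" they stay distinct ("SHa"
and "Sha"), at the cost of odd-looking titlecase words.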

Siddhesh
