Help needed reviewing Cyrillic -> ASCII transliteration [BZ #2872]

public inbox for libc-alpha@sourceware.org

* Help needed reviewing Cyrillic -> ASCII transliteration [BZ #2872]
@ 2018-12-19 23:45 Rafal Luzynski
  2019-01-02 18:05 ` Siddhesh Poyarekar
  0 siblings, 1 reply; 7+ messages in thread
From: Rafal Luzynski @ 2018-12-19 23:45 UTC (permalink / raw)
  To: libc-alpha

Hi,

Egor provided as many as 11 versions of the patch fixing bug 2872
(Transliteration Cyrillic -> ASCII fails [1]).  I still have trouble
deciding which version is the best because we have some disagreements
about how to implement this.  I need more feedback from more experienced
maintainers.  I am afraid that my reviews so far only made Egor do
more unnecessary work.

Here are some of my questions.  Please note that they don't require any
knowledge of Cyrillic script, so everybody is welcome to provide their
opinion.

* Should we take the title of the bug literally and provide the
  transliteration exclusively to plain ASCII or should we support the
  transliteration to extended Latin (with some diacritic characters,
  as per ISO 9 [2]) and support plain ASCII only as a fallback?
* Should we agree to Cyrillic -> extended Latin -> ASCII even if the
  ASCII fallback does not fully conform with any existing standard?
* Should we implement Cyrillic -> plain ASCII as per GOST System B [3]
  and skip extended Latin if it is impossible to handle both for standards
  technical reasons?
* Is the C builtin locale the correct place to put this transliteration?
  If yes, should we think about including the support of other alphabets
  as well (like extended Latin -> plain ASCII, Greek -> Latin, and so on)
  in the future?
* Should the Cyrillic transliteration work in every locale (possibly with
  a few exceptions) or should we require that a locale actually using
  Cyrillic script must be used? (E.g., should it work when ru_RU is not
  installed? Should it work if en_US is the only locale installed? Should
  it work when no locale is installed, not even en_US?)
* Is it required that transliteration produces unambiguous output which
  means that two different original strings never produce the same result?
  (As a consequence, the reverse transliteration could be possible).

Additionally, we disagree about how we should handle the case when a
single original uppercase character transliterates into a digraph in
ASCII.  Should both ASCII characters be uppercase (which is good for
all-uppercase strings and also emphasizes that the original was a single
character rather than two separate characters which accidentally
transliterate into a digraph), or should only the first ASCII character
be uppercase (which is good for titlecase words, which are common in
natural texts)?  An example is "Ш" - should it be "SH" or "Sh"?
Note that "Сх" may also produce "Sh" ("S" + "h" -> "Sh").
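The collision can be made concrete with a minimal Python sketch.  The
tables below are hypothetical and contain only the letters needed for the
example; they are not the mapping from Egor's patch.  Mapping "х" to "h"
is an assumption taken from the example above, and other schemes may use
a different letter.

```python
# Hypothetical mini-tables, just enough letters for the example.
# Mapping "х" -> "h" is an assumption taken from the example in the text.
BASE = {"С": "S", "с": "s", "Х": "H", "х": "h", "ш": "sh"}
TITLECASE = {**BASE, "Ш": "Sh"}  # only the first letter of the digraph is uppercase
UPPERCASE = {**BASE, "Ш": "SH"}  # both letters of the digraph are uppercase


def translit(text, table):
    """Replace each character via the table; pass unknown characters through."""
    return "".join(table.get(ch, ch) for ch in text)


def collide(a, b, table):
    """Return True if two different inputs produce the same output."""
    return a != b and translit(a, table) == translit(b, table)


if __name__ == "__main__":
    for name, table in (("titlecase", TITLECASE), ("uppercase", UPPERCASE)):
        for pair in (("Ш", "Сх"), ("Ш", "СХ")):
            print(name, pair, collide(*pair, table))
```

Under these assumptions the titlecase style collides for "Ш" versus "Сх",
while the all-uppercase style collides for "Ш" versus "СХ", so the choice
mostly decides which kind of input is affected.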

We are lucky that some of the existing glibc locales already handle
transliteration from Cyrillic to Latin, for example sr_RS and uk_UA.
Unfortunately, they follow their national standards rather than ISO or
GOST, so they cannot be copied directly to ru_RU or applied universally
to all locales.

Also, taking Egor's work into account, can we add this bug to the list
of bugs that are desirable to fix in 2.29?

Regards,

Rafal

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=2872
[2] https://en.wikipedia.org/wiki/ISO_9
[3] https://en.wikipedia.org/wiki/ISO_9#GOST_7.79_System_B

* Re: Help needed reviewing Cyrillic -> ASCII transliteration [BZ #2872]
  2018-12-19 23:45 Help needed reviewing Cyrillic -> ASCII transliteration [BZ #2872] Rafal Luzynski
@ 2019-01-02 18:05 ` Siddhesh Poyarekar
  2019-01-02 18:11   ` Carlos O'Donell
  0 siblings, 1 reply; 7+ messages in thread
From: Siddhesh Poyarekar @ 2019-01-02 18:05 UTC (permalink / raw)
  To: Rafal Luzynski, libc-alpha

I've tried to overcome my general lack of confidence in commenting on
locale-related issues to provide some opinions.  I'd take those with an
appropriate dose of salt since, as I've said before, I have little
experience in this area.

On 20/12/18 4:59 AM, Rafal Luzynski wrote:
> * Should we take the title of the bug literally and provide the
>    transliteration exclusively to plain ASCII or should we support the
>    transliteration to extended Latin (with some diacritic characters,
>    as per ISO 9 [2]) and support plain ASCII only as a fallback?

We currently have a patch with ASCII fallback implemented and I reckon 
implementing Latin fallback would just be additional work that could be 
done in a second phase if really necessary.  In that sense, I'd think 
ASCII is sufficient as a first pass.

> * Should we agree for Cyrillic -> extended Latin -> ASCII even if the
>    ASCII fallback does not fully conform with any existing standard?

I have no idea what standards govern this, so I have no opinion on it.

> * Should we implement Cyrillic -> plain ASCII as per GOST System B [3]
>    and skip extended Latin if it is impossible to handle both for standards
>    technical reasons?

Sounds reasonable.

> * Is the C builtin locale the correct place to put this transliteration?
>    If yes, should we think about including the support of other alphabets
>    as well (like extended Latin -> plain ASCII, Greek -> Latin, and so on)
>    ever in future?

Yes on both counts, although this could result in bloating of the C 
locale.  If we are to provide additional transliteration of this sort, 
we probably need to provide some way to trim it.

> * Should the Cyrillic transliteration work in every locale (possibly with
>    few exceptions) or should we require that a locale actually using
>    Cyrillic script must be used? (E.g., should it work when ru_RU is not
>    installed? should it work if en_US is the only locale installed? Should
>    it work when no locale is installed, even en_US?)

Would it matter if it was in the C builtin locale?

> * Is it required that transliteration produces unambiguous output which
>    means that two different original strings never produce the same result?
>    (As a consequence, the reverse transliteration could be possible).

I don't think so.  Transliterations are approximations in the end and 
striving for such guarantees might be overreach.

> Additionally we have a disagreement about how should we handle the case
> when a single original uppercase character transliterates into a digraph
> in ASCII.  Should both ASCII characters be uppercase (which is good for
> all uppercase strings and also good to emphasize that the original character
> was single rather than two separate characters which accidentally
> transliterate into two characters making a digraph) or should only the first
> ASCII character be uppercase (which is good for the titlecase words which is
> common in natural texts)?  An example is "Ш" - should it be "SH" or "Sh"?
> Note that "Сх" may also produce "Sh" ("S" + "h" -> "Sh").

Is that important?

> We are lucky that some of existing glibc locales already handle
> transliteration
> from Cyrillic to Latin, for example sr_RS and uk_UA.  Unfortunately, they
> follow their national standards rather than ISO or GOST so they cannot
> be copied directly to ru_RU or applied universally to all locales.
> 
> Also, taking Egor's work into account, can we include this bug into the
> list of desirable to be fixed in 2.29?

It's late for 2.29 (sorry, it's partly my fault for not being decisive 
enough about it) but please continue reviewing so that it lands first 
thing in 2.30.  It also looks like something that can be safely 
backported assuming that it does not affect translations, so you could 
well do that for 2.29 or as far back as you like.

Siddhesh

* Re: Help needed reviewing Cyrillic -> ASCII transliteration [BZ #2872]
  2019-01-02 18:05 ` Siddhesh Poyarekar
@ 2019-01-02 18:11   ` Carlos O'Donell
  0 siblings, 0 replies; 7+ messages in thread
From: Carlos O'Donell @ 2019-01-02 18:11 UTC (permalink / raw)
  To: Siddhesh Poyarekar, Rafal Luzynski, libc-alpha

On 1/2/19 1:05 PM, Siddhesh Poyarekar wrote:
>> Also, taking Egor's work into account, can we include this bug into
>> the list of desirable to be fixed in 2.29?
> 
> It's late for 2.29 (sorry, it's partly my fault for not being
> decisive enough about it) but please continue reviewing so that it
> lands first thing in 2.30.  It also looks like something that can be
> safely backported assuming that it does not affect translations, so
> you could well do that for 2.29 or as far back as you like.

This is also my fault. I keep meaning to review this series but the
technical review of the rwlock bugs has taken priority, and there is
still one more bug left in rwlock which we should fix in 2.29.

-- 
Cheers,
Carlos.

* Re: Help needed reviewing Cyrillic -> ASCII transliteration [BZ #2872]
  2019-01-04  0:27   ` Egor Kobylkin
@ 2019-01-04  4:05     ` Siddhesh Poyarekar
  0 siblings, 0 replies; 7+ messages in thread
From: Siddhesh Poyarekar @ 2019-01-04  4:05 UTC (permalink / raw)
  To: Egor Kobylkin, Rafal Luzynski, Carlos O'Donell, libc-alpha,
	libc-locales

On 04/01/19 5:57 AM, Egor Kobylkin wrote:
> On 03.01.19 14:41, Siddhesh Poyarekar wrote:
>> On 03/01/19 4:52 PM, Egor Kobylkin wrote:
>>> Is there a specific way you measure the bloat of the C locale?
>>> Is it the size of the resulting libc.so.6 file we are concerned with?
>>
>> I believe it's built into libc.so, so I suppose you'd have to look at 
>> its file size.
> 
> I have build the library two times w/wo patch to compare the .so sizes:
> 
> (ls -l build/glibc/libc.so)
> 
> size without Cyrillic translit
> 17172016 bytes
> 
> size with Cyrillic translit (patch applied)
> 17180208 bytes
> 
> the difference is 8192 bytes

Thanks, I think that difference is minimal and even if we add some 
amount of transliteration for all implemented locales the effect 
shouldn't go beyond a few MB, definitely not in the hundreds.

Siddhesh

* Re: Help needed reviewing Cyrillic -> ASCII transliteration [BZ #2872]
  2019-01-03 13:41 ` Siddhesh Poyarekar
@ 2019-01-04  0:27   ` Egor Kobylkin
  2019-01-04  4:05     ` Siddhesh Poyarekar
  0 siblings, 1 reply; 7+ messages in thread
From: Egor Kobylkin @ 2019-01-04  0:27 UTC (permalink / raw)
  To: Siddhesh Poyarekar, Rafal Luzynski, Carlos O'Donell,
	libc-alpha, libc-locales

On 03.01.19 14:41, Siddhesh Poyarekar wrote:
> On 03/01/19 4:52 PM, Egor Kobylkin wrote:
>> Is there a specific way you measure the bloat of the C locale?
>> Is it the size of the resulting libc.so.6 file we are concerned with?
> 
> I believe it's built into libc.so, so I suppose you'd have to look at 
> its file size.

I have built the library twice, with and without the patch, to compare
the .so sizes:

(ls -l build/glibc/libc.so)

size without Cyrillic translit
17172016 bytes

size with Cyrillic translit (patch applied)
17180208 bytes

the difference is 8192 bytes
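
The arithmetic behind that figure can be checked with a short Python
sketch.  The two sizes are the ones quoted above; the variable names are
only illustrative.

```python
# Sizes in bytes as reported by `ls -l build/glibc/libc.so` above.
SIZE_WITHOUT = 17172016  # libc.so without the Cyrillic translit patch
SIZE_WITH = 17180208     # libc.so with the patch applied

delta = SIZE_WITH - SIZE_WITHOUT   # absolute growth in bytes
relative = delta / SIZE_WITHOUT    # growth as a fraction of the original size

if __name__ == "__main__":
    print(f"{delta} bytes = {delta // 1024} KiB, {relative:.4%} of the original")
```

The growth is exactly 8 KiB, well under a tenth of a percent of the
library size.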


>> In terms of the source code we are just adding as many lines as there
>> are letters (169 insertions for Cyrillic in this patch v12)
> 
> Yeah, it will likely not be much for a single locale, but it may add up 
> across locales.  I have no idea how much, it may well be insignificant.
> 
From what I gathered, if the Cyrillic translit table goes into the C
locale it doesn't have to be included in other locales. So it is only
included once, with a ~8K size increase.

If another locale wants to have its own different translit table, that
would then go on top but count towards its own size budget.

Bests,
Egor

* Re: Help needed reviewing Cyrillic -> ASCII transliteration [BZ #2872]
  2019-01-03 11:22 Egor Kobylkin
@ 2019-01-03 13:41 ` Siddhesh Poyarekar
  2019-01-04  0:27   ` Egor Kobylkin
  0 siblings, 1 reply; 7+ messages in thread
From: Siddhesh Poyarekar @ 2019-01-03 13:41 UTC (permalink / raw)
  To: Egor Kobylkin, Rafal Luzynski, Carlos O'Donell, libc-alpha,
	libc-locales

On 03/01/19 4:52 PM, Egor Kobylkin wrote:
> Is there a specific way you measure the bloat of the C locale?
> Is it the size of the resulting libc.so.6 file we are concerned with?

I believe it's built into libc.so, so I suppose you'd have to look at 
its file size.

> In terms of the source code we are just adding as many lines as there
> are letters (169 insertions for Cyrillic in this patch v12)

Yeah, it will likely not be much for a single locale, but it may add up 
across locales.  I have no idea how much, it may well be insignificant.

> Just for clarification, the whole point (at least for me) for this patch
> is to have the transliteration when other methods are not available. Or
> when existing programs/systems can not make use of them. The most basic
> example: filenames in Cyrillic on a NAS that get converted to
> ????????.??? and get overwritten in the worst case. So the most value is
> when it works out of box with the C builtin. Other locales can actually
> implement their own variant and explicitly use it if they need one; some
> already have, others may be just fine with the builtin C.

That's a fair point but given the approximation, that specific use case 
may still be flaky.

>>> Additionally we have a disagreement about how should we handle the
>>>  case when a single original uppercase character transliterates
>>> into a digraph in ASCII.  Should both ASCII characters be
>>> uppercase (which is good for all uppercase strings and also good to
>>> emphasize that the original character was single rather than two
>>> separate characters which accidentally transliterate into two
>>> characters making a digraph) or should only the first ASCII
>>> character be uppercase (which is good for the titlecase words which
>>> is common in natural texts)? An example is "Ш" - should it be "SH"
>>> or "Sh"? Note that "Сх" may also produce "Sh" ("S" + "h" -> "Sh").
>>
>>
>> Is that important?
> 
> As in the above example about the files, you would probably agree that
> it's better not to knowingly introduce a failure vector for such basic
> OS operations like working with files. The transliteration
> capitalization collisions have this negative potential. The users that
> need a different specific capitalization can still implement that in
> their locale.

OK, I can see reason for reducing collisions but again, it remains 
flaky.  We could in the interest of moving forward, strive towards 
making it less flaky but at the same time be aware that there may 
eventually be collisions.

Siddhesh

* Re: Help needed reviewing Cyrillic -> ASCII transliteration [BZ #2872]
@ 2019-01-03 11:22 Egor Kobylkin
  2019-01-03 13:41 ` Siddhesh Poyarekar
  0 siblings, 1 reply; 7+ messages in thread
From: Egor Kobylkin @ 2019-01-03 11:22 UTC (permalink / raw)
  To: Siddhesh Poyarekar, Rafal Luzynski, Carlos O'Donell,
	libc-alpha, libc-locales

Hi,

I would appreciate it if you all could keep me in the To: field for
discussions of this patch, as I am not subscribed to the list. Please let
me know if there is another way around it.

>
> From: Siddhesh Poyarekar <siddhesh at gotplt dot org>
> To: Rafal Luzynski <digitalfreak at lingonborough dot com>,
>  libc-alpha at sourceware dot org
> Date: Wed, 2 Jan 2019 23:35:13 +0530
> Subject: Re: Help needed reviewing Cyrillic -> ASCII transliteration [BZ #2872]
> References: <1042796605.674608.1545262199252@poczta.nazwa.pl>
> 
> I've tried to overcome my general lack of confidence in commenting on
> locale related issues to provide some opinions. I'd take those with
> an appropriate dose of salt since like I've said before, I have 
> little experience in this area.
> 
> 
>> On 20/12/18 4:59 AM, Rafal Luzynski wrote:
>> [SNIP]
>> * Is the C builtin locale the correct place to put this
>> transliteration? If yes, should we think about including the
>> support of other alphabets as well (like extended Latin -> plain
>> ASCII, Greek -> Latin, and so on) ever in future?
> 
> 
> Yes on both counts, although this could result in bloating of the C 
> locale. If we are to provide additional transliteration of this sort,
> we probably need to provide some way to trim it.

Is there a specific way you measure the bloat of the C locale?
Is it the size of the resulting libc.so.6 file we are concerned with?

In terms of the source code we are just adding as many lines as there
are letters (169 insertions for Cyrillic in this patch v12)


> 
> 
>> * Should the Cyrillic transliteration work in every locale 
>> (possibly with few exceptions) or should we require that a locale 
>> actually using Cyrillic script must be used? (E.g., should it work 
>> when ru_RU is not installed? should it work if en_US is the only 
>> locale installed? Should it work when no locale is installed, even 
>> en_US?)
> 
> 
> Would it matter if it was in the C builtin locale?

Just for clarification, the whole point (at least for me) of this patch
is to have the transliteration when other methods are not available, or
when existing programs/systems cannot make use of them. The most basic
example: filenames in Cyrillic on a NAS that get converted to
????????.??? and, in the worst case, get overwritten. So the most value
is when it works out of the box with the C builtin. Other locales can
actually implement their own variant and explicitly use it if they need
one; some already have, others may be just fine with the builtin C.
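
The overwrite scenario can be sketched in a few lines of Python.  This
only imitates a lossy conversion with Python's own ASCII encoder and
made-up file names; it is an assumption about how such a conversion
behaves, not the code path any particular NAS or glibc actually uses.

```python
def lossy_ascii(name):
    """Encode to ASCII, replacing every unencodable character with '?'."""
    return name.encode("ascii", errors="replace").decode("ascii")


# Hypothetical file names, five Cyrillic letters each.
names = ["весна.txt", "книга.txt"]
converted = [lossy_ascii(n) for n in names]

if __name__ == "__main__":
    for original, result in zip(names, converted):
        print(original, "->", result)
    # Two distinct inputs, one output name: the second write would
    # overwrite the first.
    print("distinct outputs:", len(set(converted)))
```

With a transliteration table in place of the "?" replacement, the two
names would stay distinct as long as the table itself does not collide.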

> 
>> * Is it required that transliteration produces unambiguous output 
>> which means that two different original strings never produce the 
>> same result? (As a consequence, the reverse transliteration could 
>> be possible).
> 
> 
> I don't think so. Transliterations are approximations in the end and
>  striving for such guarantees might be overreach.
> 
> 
>> Additionally we have a disagreement about how should we handle the
>>  case when a single original uppercase character transliterates
>> into a digraph in ASCII.  Should both ASCII characters be
>> uppercase (which is good for all uppercase strings and also good to
>> emphasize that the original character was single rather than two
>> separate characters which accidentally transliterate into two
>> characters making a digraph) or should only the first ASCII
>> character be uppercase (which is good for the titlecase words which
>> is common in natural texts)? An example is "Ш" - should it be "SH"
>> or "Sh"? Note that "Сх" may also produce "Sh" ("S" + "h" -> "Sh").
> 
> 
> Is that important?

As in the above example about the files, you would probably agree that
it's better not to knowingly introduce a failure vector for such basic
OS operations as working with files. Capitalization collisions in the
transliteration have this negative potential. Users who need a
different specific capitalization can still implement that in their
locale.

Bests,
Egor

P.S.
Just for your reference here is the current patch:
https://sourceware.org/ml/libc-alpha/2019-01/msg00040.html
and the entry in the sourceware wiki:
https://sourceware.org/glibc/wiki/Release/2.29#Desirable_this_release.3F

end of thread, other threads:[~2019-01-04  4:05 UTC | newest]

Thread overview: 7+ messages
2018-12-19 23:45 Help needed reviewing Cyrillic -> ASCII transliteration [BZ #2872] Rafal Luzynski
2019-01-02 18:05 ` Siddhesh Poyarekar
2019-01-02 18:11   ` Carlos O'Donell
2019-01-03 11:22 Egor Kobylkin
2019-01-03 13:41 ` Siddhesh Poyarekar
2019-01-04  0:27   ` Egor Kobylkin
2019-01-04  4:05     ` Siddhesh Poyarekar
