From: Volodymyr Lisivka <vlisivka@gmail.com>
To: digitalfreak@lingonborough.com
Cc: Egor Kobylkin <egor@kobylkin.com>,
libc-alpha@sourceware.org, libc-locales@sourceware.org,
mfabian@redhat.com, myllynen@redhat.com, ldv@altlinux.org,
Max Kutny <mkutny@gmail.com>,
danilo@gnome.org
Subject: Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] v2
Date: Thu, 11 Oct 2018 13:51:00 -0000 [thread overview]
Message-ID: <CAKjGFBUjTOtO95KO7BSZa5cRyvxGmMdrdGZzBU-pUjDUrKsfBg@mail.gmail.com> (raw)
In-Reply-To: <180516689.458569.1539255868196@poczta.nazwa.pl>
чт, 11 жовт. 2018 о 14:05 Rafal Luzynski <digitalfreak@lingonborough.com> пише:
>
> Thank you, Egor. I am looking at your patch and although I have
> not yet finished, here are some remarks:
>
> First of all, I think that such a large patch should also include
> the tests. Please see how automatic tests are performed in locale
> data and write your own.
>
> 11.10.2018 00:29 Egor Kobylkin <egor@kobylkin.com> wrote:
> > [...]
> > From this patch I have excluded locales that already mention cyrillic or
> > have a transliteration table for it:
> > az_AZ
> > iso14651_t1_common
> > ky_KG
> > mn_MN
> > sr_RS
> > tg_TJ
> > tk_TM
> > tt_RU
> > uk_UA
> > uz_UZ
> > uz_UZ@cyrillic
> > [...]
>
> I think that eventually we would like to include your translit_cyrillic
> also in these locales because I assume that your rules should work good
> for them as well, also should include more characters than the individual
> language contributors took into account.
It's very good idea. Transliteration in Ukrainian locale predates this
work for about decade. It well tested. I also have automatic test
cases, which I can adapt to current standard. Let's drop Russian
transliteration rules and replace them with Ukrainian transliteration
rules. I assume that Ukrainian rules should work good for them as
well.
Ukrainian language is the oldest and most developed language in Slavic
family - last king of all Slavs named Madzhak/Muzhik (Brave), leader
of Volyniana union, was lived in Western Ukraine in Volyn` region.
After Madzhak capturing of Madzhak, kingdom was split into multiple
western parts and eastern part, where 9 Slavic tribes were united by
Rus` tribe, which abandoned their city, now known as Old Russa,
because of epidemic. IMHO, it's will be fair to use rules of the
oldest Slavic union.
> Similarly to Mike's work on
> collation: a common rules were created and all locales include them adding
> their own language specific modifications.
It's good idea too. In our own locale we prefer that words in our
language will be at top of a sorted list. Currently, in Ukrainian
locale it works as intended, but Russian locale has inverted order.
IMHO, Russian locale should use Ukrainian rules.
$ echo 'один два three four'| tr ' ' '\n' | LANG=uk_UA.utf8 sort
два
один
four
three
$ echo 'один два three four'| tr ' ' '\n' | LANG=ru_RU.utf8 sort
four
three
два
один
>
> > [...]
> > COMMIT MESSAGE:
> > [...]
> > I am excluding these locales from this proposed patch. I have written
> > directly to locale maintainer emails listed in the files. Volodymyr
> > Lisivka <vlisivka@gmail.com>, Max Kutny <mkutny@gmail.com> (uk_UA),
> > Данило Шеган <danilo@gnome.org> (sr_YU, sr_CS) have confirmed the
>
> I am not sure if we want Cyrillic text in the commit message. Shouldn't
> it be, uhm, tranlisterated? :-)
>
> "sr_CS" - I guess you meant "sr_RS".
>
> "sr_YU" has been dropped, do we want to mention it?
>
> > [...]
> > [BZ #2872]
> > * localedata/locales/translit_cyrillic: add ISO 9.1995, GOST 7.79
>
> Please start "Add" with an uppercase. BTW, shouldn't it be "New file"
> instead?
>
> > System A transliteration System B transcription table from Cyrillic to
> > Latin/ASCII.
> > * localedata/locales/C: add include "translit_cyrillic";"" to LC_CTYPE
> > translit section.
>
> Same, "Add" here.
>
> > * localedata/locales/aa_DJ: Likewise.
>
> Good (here and everywhere below).
>
> > [...]
> > diff -uNr a/localedata/locales/translit_cyrillic
> > b/localedata/locales/translit_cyrillic
> > --- a/localedata/locales/translit_cyrillic 1970-01-01 00:00:00.000000000
> > +0000
> > +++ b/localedata/locales/translit_cyrillic 2018-10-09 19:02:54.000000000
> > +0000
> > @@ -0,0 +1,383 @@
> > +escape_char /
> > +comment_char %
> > +
> > +% This file is part of the GNU C Library and contains locale data.
> > +% The Free Software Foundation does not claim any copyright interest
> > +% in the locale data contained in this file. The foregoing does not
> > +% affect the license of the GNU C Library as a whole. It does not
> > +% exempt you from the conditions of the license if your use would
> > +% otherwise be governed by that license.
> > +
> > +% Transliterations of cyrillic letters to latin and/or ascii symbols.
>
> "cyrillic" -> "Cyrillic"; "latin" -> "Latin"; "ascii" -> "ASCII".
>
> > +% Inspired by ISO 9.1995 / GOST 7.79-2000.
> > +% Covers Unicode Range https://www.unicode.org/charts/PDF/U0400.pdf
> > +% i.e [U4001-U4F9, U2019] but only the letters covered by ISO 9.1995
>
> Typos:
>
> "i.e" -> "i.e.," (somebody please fix me if I'm wrong here)
> "U4001" - I guess you meant "U0401"
> "U4F9" -> "U04F9". I think that "U4F9" is not definitely bad but
> let's be consistent.
>
> Also I can see some gaps in the range. Are you going to fill them
> or maybe for now just mention that they exist?
>
> > +% It implements the GOST_7.79 System A (Latin Script) as a first
> > +% option and System B Cyrillic (ASCII) as a second option. Check
> > +% https://en.wikipedia.org/wiki/ISO_9 for reference.
> > +% The System B is extended from GOST_7.79-Russian using open sources
> > +% of the transliteration mappings and the "h/`" diacritics logic.
>
> What is "h/`" diacritics logic?
>
> > +
> > +% Usage examples:
> > +% iconv -f UTF-8 -t ISO-8859-15//TRANSLIT \
> > +% | iconv -f ISO-8859-15 -t UTF-8 # System A
> > +% iconv -f UTF-8 -t ASCII//TRANSLIT # System B.
> > +
> > +% Contributions welcome for the rest of Cyrillic script in Unicode
>
> Sure, I'm not going to stop you from pushing these changes just because
> there are missing characters. I will consider adding them later.
>
> > +% https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode.
> > +% Bugfix for https://sourceware.org/bugzilla/show_bug.cgi?id=2872.
> > +% Generated from UnicodeData.txt with
> > +% https://sourceware.org/bugzilla/attachment.cgi?id=11301.
>
> 1. Is the file really generated with a script and not modified later?
> If yes then maybe you should contribute the script instead? In that case,
> you should also not post this file to libc-locale, maintainers and
> developers should be able to regenerate it.
> 2. The link leads to a LibreOffice spreadsheet.
>
> > +LC_CTYPE
> > +
> > +translit_start
> > +
>
> <U0400> is missing here. Are you going to leave it for now?
>
> > +% CYRILLIC CAPITAL LETTER IO
> > +<U0401> <U00CB>;"<U0059><U004F>"
> > [...]
> > +% CYRILLIC CAPITAL LETTER KJE
> > +<U040C> <U1E30>;"<U004B><U0060>"
>
> <U040D> is missing here. Can we add it already?
>
> > +% CYRILLIC CAPITAL LETTER SHORT U
> > +<U040E> <U016C>;"<U0055><U0060>"
> > [...]
> > +% CYRILLIC CAPITAL LETTER U
> > +<U0423> <U0055>
> > +% CYRILLIC UNDEFINED
> > +<U0423><U0301> <U00DA>;"<U0055><U0060>"
>
> This still makes me wonder.
>
> Does it work at all?
> What if we remove this rule, won't it be transliterated as
> <U0423> => "U", <U0301> - left unchanged, so "U" + <U0301>"
> will eventually produce "Ú"?
> Why is it called "UNDEFINED"?
> Do we need similar rules for other characters?
>
> > [...]
> > +% CYRILLIC SMALL LETTER U
> > +<U0443> <U0075>
> > +% CYRILLIC UNDEFINED
> > +<U0443><U0301> <U00FA>;"<U0075><U0060>"
>
> Same here.
>
> > [...]
> > +% CYRILLIC SMALL LETTER YA
> > +<U044F> <U00E2>;"<U0079><U0061>"
>
> Again <U0450> missing (because it is lowercase variant of <U0400>).
>
> > +% CYRILLIC SMALL LETTER IO
> > +<U0451> <U00EB>;"<U0079><U006F>"
> > [...]
> > +% CYRILLIC SMALL LETTER KJE
> > +<U045C> <U1E31>;"<U006B><U0060>"
>
> <U045D> missing (same reason as <U040D>).
>
> > +% CYRILLIC SMALL LETTER SHORT U
> > +<U045E> <U016D>;"<U0075><U0060>"
> > +% CYRILLIC SMALL LETTER DZHE
> > +<U045F> "<U0064><U0302>";"<U0064><U0068>"
>
> More letters missing here. Is this because they are historic so we
> don't want to include them now? Well, but "YUS" is also historic.
> (Please, do not remove YUS for consistency).
>
> > +% CYRILLIC CAPITAL LETTER BIG YUS
> > +<U046A> <U01CD>;"<U004F><U0060>"
> > +% CYRILLIC SMALL LETTER BIG YUS
> > +<U046B> <U01CE>;"<U006F><U0060>"
> > [...]
>
> I will continue but, again, I don't give any ETA so other reviewers
> are welcome here.
>
> Regards,
>
> Rafal
next prev parent reply other threads:[~2018-10-11 13:51 UTC|newest]
Thread overview: 107+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com>
[not found] ` <20180412224352.GB2911@altlinux.org>
2018-07-17 19:34 ` SUBJECT: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] Egor Kobylkin
2018-07-17 19:41 ` Carlos O'Donell
2018-07-17 19:50 ` Egor Kobylkin
2018-07-17 19:59 ` Carlos O'Donell
2018-08-06 19:00 ` [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29 Egor Kobylkin
2018-10-03 8:28 ` Egor Kobylkin
2018-10-03 9:20 ` Keld Simonsen
2018-10-03 9:32 ` Egor Kobylkin
2018-10-05 8:44 ` Marko Myllynen
2018-10-05 9:20 ` Rafal Luzynski
2018-10-05 10:37 ` Egor Kobylkin
2018-10-08 22:05 ` Rafal Luzynski
2018-10-08 22:52 ` Egor Kobylkin
2018-10-09 21:43 ` Rafal Luzynski
2018-10-09 16:10 ` Marko Myllynen
2018-10-09 16:22 ` Egor Kobylkin
2018-10-09 16:49 ` Marko Myllynen
2018-10-09 22:09 ` Rafal Luzynski
2018-10-10 11:21 ` Marko Myllynen
2018-10-11 10:10 ` Marko Myllynen
[not found] ` <deacdf31-d0bb-a92d-1de3-934d6b4cb158@kobylkin.com>
2018-10-05 11:54 ` Marko Myllynen
2018-10-05 12:01 ` Egor Kobylkin
2018-10-05 12:21 ` Marko Myllynen
2018-10-05 20:47 ` Egor Kobylkin
2018-10-08 12:41 ` Marko Myllynen
2018-10-08 22:23 ` Rafal Luzynski
2018-10-08 23:36 ` Egor Kobylkin
2018-10-09 13:18 ` Egor Kobylkin
2018-10-09 18:34 ` Egor Kobylkin
2018-10-09 22:18 ` Rafal Luzynski
2018-10-09 22:40 ` Egor Kobylkin
2018-10-09 22:43 ` Egor Kobylkin
2018-10-10 11:23 ` Marko Myllynen
2018-10-10 12:20 ` Egor Kobylkin
2018-10-10 12:34 ` Marko Myllynen
2018-10-10 22:29 ` [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] v2 Egor Kobylkin
2018-10-11 10:00 ` Marko Myllynen
2018-10-11 11:05 ` Rafal Luzynski
2018-10-11 13:10 ` Marko Myllynen
2018-10-11 13:51 ` Volodymyr Lisivka [this message]
2018-10-11 14:59 ` Egor Kobylkin
2018-10-11 21:31 ` Egor Kobylkin
2018-10-11 15:05 ` Egor Kobylkin
2018-10-11 15:45 ` [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] v3 Egor Kobylkin
2018-10-11 21:33 ` [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] v4 Egor Kobylkin
2018-10-12 14:06 ` [PATCH v5] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] Egor Kobylkin
2018-10-13 1:01 ` Rafal Luzynski
2018-10-13 16:58 ` Egor Kobylkin
2018-10-15 11:05 ` Marko Myllynen
2018-10-15 11:55 ` Egor Kobylkin
2018-10-23 23:08 ` Rafal Luzynski
2018-10-17 14:17 ` [PATCH v6] " Egor Kobylkin
2018-11-01 22:52 ` [PATCH v7] " Egor Kobylkin
2018-11-02 0:01 ` [PATCH v8] " Egor Kobylkin
2018-11-02 22:22 ` Rafal Luzynski
2018-11-02 23:27 ` Egor Kobylkin
2018-11-14 21:25 ` [PATCH v9] " Egor Kobylkin
2018-11-16 22:17 ` Rafal Luzynski
2018-11-17 18:35 ` Egor Kobylkin
2018-11-19 7:14 ` Marko Myllynen
2018-11-19 9:22 ` Egor Kobylkin
2018-11-19 19:36 ` Marko Myllynen
2018-12-01 22:09 ` Rafal Luzynski
2018-12-01 22:53 ` Egor Kobylkin
2018-12-03 22:19 ` Egor Kobylkin
[not found] ` <1361059722.707244.1544231740358@poczta.nazwa.pl>
2018-12-10 21:20 ` Marko Myllynen
2018-12-19 22:26 ` Rafal Luzynski
2018-12-19 22:48 ` Egor Kobylkin
2018-12-19 23:51 ` Rafal Luzynski
2018-11-19 11:11 ` [PATCH v10] " Egor Kobylkin
2018-12-07 23:35 ` Rafal Luzynski
2018-12-08 21:51 ` Egor Kobylkin
2018-12-19 22:42 ` Rafal Luzynski
2018-12-19 23:02 ` Egor Kobylkin
2018-12-20 0:06 ` Rafal Luzynski
2018-12-08 22:28 ` [PATCH v11] Locales: Cyrillic -> ASCII transliteration " Egor Kobylkin
2018-12-19 23:16 ` Egor Kobylkin
2018-12-26 10:07 ` Siddhesh Poyarekar
2018-12-26 12:14 ` Egor Kobylkin
2018-12-27 1:31 ` Siddhesh Poyarekar
2018-12-27 11:31 ` Rafal Luzynski
2019-01-02 18:39 ` [PATCH v12] " Egor Kobylkin
2019-01-05 14:36 ` Rafal Luzynski
2019-01-05 21:13 ` Egor Kobylkin
2019-01-07 20:37 ` Marko Myllynen
2019-01-09 0:46 ` Egor Kobylkin
2019-01-09 20:03 ` Marko Myllynen
2019-02-04 7:14 ` [PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872] ping for 2.30 Egor Kobylkin
2019-02-14 16:48 ` Marko Myllynen
2019-03-04 22:12 ` Egor Kobylkin
2019-03-11 13:59 ` PING " Egor Kobylkin
2019-03-14 19:49 ` Egor Kobylkin
2019-04-19 22:24 ` Rafal Luzynski
[not found] ` <5ELixS9SQ0DW4mlvswp96ASpLobBabU9KQ6zOTH-Udrb34mABhcqiPERpBZfPWZ9F77s8XNmiLIAq9UWu0AjLFFdjOz_FZVU5_xF-SiQkrw=@kobylkin.com>
2019-04-27 2:51 ` Siddhesh Poyarekar
2019-04-27 7:34 ` Diego (Egor) Kobylkin
2019-04-09 1:04 ` [PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872] Carlos O'Donell
2019-03-19 10:39 ` ping " Egor Kobylkin
2019-03-28 16:20 ` [PING^4][PATCH " Marko Myllynen
2019-04-04 19:44 ` [PING^5][PATCH " Egor Kobylkin
2019-04-06 1:36 ` Siddhesh Poyarekar
2019-04-16 7:15 ` [PING^6][PATCH " Marko Myllynen
2019-04-16 13:17 ` Carlos O'Donell
2019-04-16 17:07 ` Egor Kobylkin
2019-04-16 17:58 ` Carlos O'Donell
2019-04-16 18:41 ` Egor Kobylkin
2019-04-16 19:06 ` Carlos O'Donell
2019-05-10 12:19 ` Marko Myllynen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAKjGFBUjTOtO95KO7BSZa5cRyvxGmMdrdGZzBU-pUjDUrKsfBg@mail.gmail.com \
--to=vlisivka@gmail.com \
--cc=danilo@gnome.org \
--cc=digitalfreak@lingonborough.com \
--cc=egor@kobylkin.com \
--cc=ldv@altlinux.org \
--cc=libc-alpha@sourceware.org \
--cc=libc-locales@sourceware.org \
--cc=mfabian@redhat.com \
--cc=mkutny@gmail.com \
--cc=myllynen@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).