Date: Thu, 11 Oct 2018 11:05:00 -0000
From: Rafal Luzynski
To: Egor Kobylkin, libc-alpha@sourceware.org, libc-locales@sourceware.org, mfabian@redhat.com, Marko Myllynen
Cc: "Dmitry V. Levin", Volodymyr Lisivka, Max Kutny, danilo@gnome.org
Message-ID: <180516689.458569.1539255868196@poczta.nazwa.pl>
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com> <20180412224352.GB2911@altlinux.org>
Subject: Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] v2

Thank you, Egor.  I am looking at your patch and although I have not
yet finished, here are some remarks:

First of all, I think that such a large patch should also include the
tests.  Please see how automatic tests are performed in locale data
and write your own.

11.10.2018 00:29 Egor Kobylkin wrote:
> [...]
> From this patch I have excluded locales that already mention cyrillic or
> have a transliteration table for it:
> az_AZ
> iso14651_t1_common
> ky_KG
> mn_MN
> sr_RS
> tg_TJ
> tk_TM
> tt_RU
> uk_UA
> uz_UZ
> uz_UZ@cyrillic
> [...]

I think that eventually we would like to include your translit_cyrillic
in these locales as well, because I assume that your rules should work
well for them and should cover more characters than the individual
language contributors took into account.  This is similar to Mike's
work on collation: common rules were created, and all locales include
them and add their own language-specific modifications.

> [...]
> COMMIT MESSAGE:
> [...]
> I am excluding these locales from this proposed patch. I have written
> directly to locale maintainer emails listed in the files. Volodymyr
> Lisivka , Max Kutny (uk_UA),
> Данило Шеган (sr_YU, sr_CS) have confirmed the

I am not sure if we want Cyrillic text in the commit message.
Shouldn't it be, uhm, transliterated? :-)

"sr_CS" - I guess you meant "sr_RS".  "sr_YU" has been dropped, do we
want to mention it?

> [...]
> [BZ #2872]
> * localedata/locales/translit_cyrillic: add ISO 9.1995, GOST 7.79

Please start "Add" with an uppercase letter.  BTW, shouldn't it be
"New file" instead?

> System A transliteration System B transcription table from Cyrillic to
> Latin/ASCII.
> * localedata/locales/C: add include "translit_cyrillic";"" to LC_CTYPE
> translit section.

Same, "Add" here.

> * localedata/locales/aa_DJ: Likewise.

Good (here and everywhere below).

> [...]
> diff -uNr a/localedata/locales/translit_cyrillic
> b/localedata/locales/translit_cyrillic
> --- a/localedata/locales/translit_cyrillic 1970-01-01 00:00:00.000000000
> +0000
> +++ b/localedata/locales/translit_cyrillic 2018-10-09 19:02:54.000000000
> +0000
> @@ -0,0 +1,383 @@
> +escape_char /
> +comment_char %
> +
> +% This file is part of the GNU C Library and contains locale data.
> +% The Free Software Foundation does not claim any copyright interest
> +% in the locale data contained in this file. The foregoing does not
> +% affect the license of the GNU C Library as a whole.
It does not
> +% exempt you from the conditions of the license if your use would
> +% otherwise be governed by that license.
> +
> +% Transliterations of cyrillic letters to latin and/or ascii symbols.

"cyrillic" -> "Cyrillic"; "latin" -> "Latin"; "ascii" -> "ASCII".

> +% Inspired by ISO 9.1995 / GOST 7.79-2000.
> +% Covers Unicode Range https://www.unicode.org/charts/PDF/U0400.pdf
> +% i.e [U4001-U4F9, U2019] but only the letters covered by ISO 9.1995

Typos:

"i.e" -> "i.e.," (somebody please correct me if I'm wrong here)
"U4001" - I guess you meant "U0401"
"U4F9" -> "U04F9".  I think that "U4F9" is not necessarily wrong but
let's be consistent.

Also, I can see some gaps in the range.  Are you going to fill them,
or maybe for now just mention that they exist?

> +% It implements the GOST_7.79 System A (Latin Script) as a first
> +% option and System B Cyrillic (ASCII) as a second option. Check
> +% https://en.wikipedia.org/wiki/ISO_9 for reference.
> +% The System B is extended from GOST_7.79-Russian using open sources
> +% of the transliteration mappings and the "h/`" diacritics logic.

What is the "h/`" diacritics logic?

> +
> +% Usage examples:
> +% iconv -f UTF-8 -t ISO-8859-15//TRANSLIT \
> +% | iconv -f ISO-8859-15 -t UTF-8 # System A
> +% iconv -f UTF-8 -t ASCII//TRANSLIT # System B.
> +
> +% Contributions welcome for the rest of Cyrillic script in Unicode

Sure, I'm not going to stop you from pushing these changes just
because there are missing characters.  I will consider adding them
later.

> +% https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode.
> +% Bugfix for https://sourceware.org/bugzilla/show_bug.cgi?id=2872.
> +% Generated from UnicodeData.txt with
> +% https://sourceware.org/bugzilla/attachment.cgi?id=11301.

1. Is the file really generated with a script and not modified later?
   If so, maybe you should contribute the script instead?
   In that case, you should also not post this file to libc-locales;
   maintainers and developers should be able to regenerate it.

2. The link leads to a LibreOffice spreadsheet.

> +LC_CTYPE
> +
> +translit_start
> +

is missing here.  Are you going to leave it for now?

> +% CYRILLIC CAPITAL LETTER IO
> + ;""
> [...]
> +% CYRILLIC CAPITAL LETTER KJE
> + ;""

is missing here.  Can we add it already?

> +% CYRILLIC CAPITAL LETTER SHORT U
> + ;""
> [...]
> +% CYRILLIC CAPITAL LETTER U
> +
> +% CYRILLIC UNDEFINED
> + ;""

This still makes me wonder.  Does it work at all?  What if we remove
this rule, won't it be transliterated as => "U", - left unchanged,
so "U" + " will eventually produce "Ú"?  Why is it called "UNDEFINED"?
Do we need similar rules for other characters?

> [...]
> +% CYRILLIC SMALL LETTER U
> +
> +% CYRILLIC UNDEFINED
> + ;""

Same here.

> [...]
> +% CYRILLIC SMALL LETTER YA
> + ;""

Again missing (because it is the lowercase variant of ).

> +% CYRILLIC SMALL LETTER IO
> + ;""
> [...]
> +% CYRILLIC SMALL LETTER KJE
> + ;""

missing (same reason as ).

> +% CYRILLIC SMALL LETTER SHORT U
> + ;""
> +% CYRILLIC SMALL LETTER DZHE
> + "";""

More letters are missing here.  Is this because they are historic, so
we don't want to include them now?  Well, but "YUS" is also historic.
(Please do not remove YUS for the sake of consistency.)

> +% CYRILLIC CAPITAL LETTER BIG YUS
> + ;""
> +% CYRILLIC SMALL LETTER BIG YUS
> + ;""
> [...]

I will continue but, again, I can't give any ETA, so other reviewers
are welcome here.

Regards,

Rafal
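
A note on the "CYRILLIC UNDEFINED" question above: the reasoning that
"U" followed by a combining accent ends up as "Ú" can be checked at the
Unicode level with a small sketch.  It assumes that the combining mark
in those rules is U+0301 COMBINING ACUTE ACCENT (the code points did not
survive in the archived message), and it uses Python's unicodedata
module only as an illustration.  It says nothing about what iconv or
the locale's translit tables actually do.

```python
import unicodedata

# Assumption: the combining mark in the "CYRILLIC UNDEFINED" rules is
# U+0301 COMBINING ACUTE ACCENT; the archived message lost the code points.
COMBINING_ACUTE = "\u0301"


def compose(base: str, mark: str = COMBINING_ACUTE) -> str:
    """Return the NFC (canonically composed) form of base followed by mark."""
    return unicodedata.normalize("NFC", base + mark)


if __name__ == "__main__":
    for base in ("U", "u"):
        composed = compose(base)
        # For these inputs NFC yields a single precomposed code point.
        print(f"{base!r} + U+0301 -> U+{ord(composed):04X} "
              f"{unicodedata.name(composed)}")
```

If the assumption holds, a base letter transliterated to "U" with the
combining accent passed through unchanged is canonically equivalent to
the precomposed U+00DA, which is the outcome the question describes.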