From: Egor Kobylkin
Subject: [PING^5][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
To: Marko Myllynen, libc-alpha@sourceware.org, libc-locales@sourceware.org, Carlos O'Donell, Siddhesh Poyarekar, Rafal Luzynski
Cc: Mike Fabian
Date: Thu, 04 Apr 2019 19:44:00 -0000
Message-ID: <0457d25c-b4d6-c010-d50a-cbfbf9b2af1c@kobylkin.com>

Ping?

On 19/03/2019 12.39, Egor Kobylkin wrote:
> Changelog v12:
> * Adjusted to the new comment style suddenly appearing in the target
> file locale/C-translit.h.in (the original file changed on the master
> branch from /* style to # style since v11)
> * Fixed a typo for CYRILLIC SMALL LETTER SHHA to be mapped to
> "sh`" instead of erroneous "SH`" in v11
>
> Changelog v11:
> * Re-targeted the patch against locale/C-translit.h.in as the proper
> file for the ASCII translit table.
> * Correspondingly, the patch now only contains the additional
> Cyrillic-ASCII strings in the format of the locale/C-translit.h.in
> table. The 'include "translit_cyrillic";""' directives are not
> necessary in the locale files, which are now all left intact.
> * Also the file translit_cyrillic is no longer needed and is omitted.
> * Edited the email below and the commit message.
>
> Changelog v10:
> * Removed ISO 9:1995 GOST 7.79-2000 System A (transliteration to Latin
> with diacritics) as conflicting with System B within glibc mechanics and
> not solving BZ #2872
> * Edited the email below, the commit message, and the comment in
> translit_cyrillic to reflect the System A removal
> * Removed and (Cyrillic U with acute,
> using composition) as composing is not covered by current glibc
> conversion mechanics
>
> Changelog v9:
> * Fixed formatting (trailing spaces etc.)
> * Put commit summary in the patch file, now it is generated completely
> by git format-patch
>
> Changelog v8:
> * Re-added translit_cyrillic, which was missing in patch v7 (due to a
> missing "git add" in the script).
>
> Changelog v7:
> * Generated against git://sourceware.org/git/glibc.git master with git
> format-patch.
> * The 'include "translit_cyrillic";""' now immediately follows the last
> 'include "translit_XXX";""' string (was inserted just before
> translit_end previously.)
> * Only the locales already having 'include .*translit.*;""' are patched
> (see the list for manual exclusions below, full list of included locales
> at the end of the email in the commit section.)
> * Excluded az_AZ completely to avoid circular reference from tr_TR via
> “copy "tr_TR"”.
>
> Changelog v6:
> * Locales removed from the patch: C and sd_PK.
> * Added locales: az_AZ and ky_KG.
> * Consistently transliterate single uppercase Cyrillic letters
>   to sequences of all uppercase Latin letters in all languages (whenever
>   a Cyrillic letter is transliterated to more than one Latin letter),
>   for example "Ї" is now transliterated as "YI" rather than "Yi".
>
> Dear locale maintainers,
>
> this patch fixes glibc bug 2872 "Transliteration Cyrillic -> ASCII fails"
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=2872 [1]
>
> by adding the Cyrillic transliteration rows to locale/C-translit.h.in.
>
> The patch is attached.
>
>
> Current bug effect:
>
> The glibc wiki explicitly lists this use case as the test example and
> currently it fails on Cyrillic texts [1] [8] [9]:
>
> iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt |grep CYRILLIC
>
> CYRILLIC ????? ??? ???? ?????? ??????????? ?????, ?? ????? ?? ???.
>
> - it produces a string of question marks and spaces.
>
> This is what it should produce, and does produce once the patch is
> applied:
>
> CYRILLIC S``esh` eshhyo e`tix myagkix franczuzskix bulok, da vy'pej zhe chayu.
>
>
> The root problem and the fix:
>
> The root problem is the missing transliteration table that I am
> supplying here.
>
>
> COMMIT MESSAGE:
> This translit_cyrillic table enables conversion (e.g. with iconv) from a
> UTF-8 encoded text based on the Cyrillic alphabet to an ASCII//TRANSLIT
> text.
>
> Example: iconv -f UTF-8 -t ASCII//TRANSLIT will produce an ASCII
> compatible transcription.
>
> While a UTF-8 encoded Cyrillic text requires Cyrillic fonts, the result
> of a transliteration/transcription has only Latin/ASCII codes but can
> still be read by a native speaker. Among other things it is useful for
> processing Cyrillic texts and filenames by programs or on systems
> that are not specifically prepared to work with Cyrillic, don't have
> corresponding fonts installed or can't handle UTF-8.
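
The expected output quoted above can be reproduced with a small
table-driven sketch. This is only an illustration under stated
assumptions, not the glibc implementation and not the contents of
locale/C-translit.h.in: the table SYSTEM_B and the function translit are
hypothetical names, the table covers only the 33 Russian letters, "ц" is
always mapped to "cz" (which matches the quoted output), and the sample
sentence is the well-known Russian pangram, assumed here to be the
CYRILLIC line of translit-test-input.txt.

```python
# Hypothetical, partial GOST 7.79-2000 System B table (Russian letters only).
# Illustration only; this is not the table shipped in the glibc patch.
SYSTEM_B = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "yo",
    "ж": "zh", "з": "z", "и": "i", "й": "j", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "x", "ц": "cz", "ч": "ch", "ш": "sh", "щ": "shh",
    "ъ": "``", "ы": "y'", "ь": "`", "э": "e`", "ю": "yu", "я": "ya",
}


def translit(text: str) -> str:
    """Transliterate Russian Cyrillic to ASCII, one character at a time."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            # ASCII passes through unchanged.
            out.append(ch)
            continue
        mapped = SYSTEM_B.get(ch.lower())
        if mapped is None:
            # No table entry: emit '?', like the pre-patch iconv output.
            out.append("?")
        elif ch.isupper():
            # An uppercase letter maps to an all-uppercase sequence
            # ("SH", not "Sh"), following the v6 changelog entry.
            out.append(mapped.upper())
        else:
            out.append(mapped)
    return "".join(out)


# Assumed sample input (standard Russian pangram).
sample = "Съешь ещё этих мягких французских булок, да выпей же чаю."
print(translit(sample))
# S``esh` eshhyo e`tix myagkix franczuzskix bulok, da vy'pej zhe chayu.
```

A character with no table entry becomes "?", which is the symptom
described in the bug: with an empty table every Cyrillic letter falls
into that branch.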
>
> The patch content (mapping) is based on the ISO 9:1995 standard [10] and
> its derivative GOST 7.79-2000 System B, official source: Federal Agency
> on Technical Regulating and Metrology of the Russian Federation [2].
> Technically an independent but mostly identical source [3] was used and
> prepared in a spreadsheet [6].
>
> The transliteration of Cyrillic to ASCII according to GOST 7.79-2000
> System B represents what is actually called transcription (preserving
> phonemes), while System A is the transliteration (preserving graphemes).
> There is no meaningful way to preserve graphemes when converting Cyrillic
> to ASCII, and thus System B is chosen [11]. To be clear, System A has
> nothing to do with this bug, even though it is a transliteration.
>
> Those interested in implementing System A for transliteration of
> Cyrillic to Latin with diacritics as a new feature are welcome to use the
> spreadsheet in [6] as a starting point.
>
> Links:
>
> [1] This bug entry https://sourceware.org/bugzilla/show_bug.cgi?id=2872
> [2] GOST 7.79-2000 official source
> http://protect.gost.ru/document.aspx?control=7&id=130715 (only
> available in a low-quality GIF format)
> [3] http://transliteration.ru/gost-7-79-2000/ and
> http://www.yfermer.ru/specifications/285821.html
> [4] Wikipedia article on Cyrillic transliteration with Latin alphabet
> https://ru.wikipedia.org/wiki/%D0%A2%D1%80%D0%B0%D0%BD%D1%81%D0%BB%D0%B8%D1%82%D0%B5%D1%80%D0%B0%D1%86%D0%B8%D1%8F_%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D0%B0%D0%BB%D1%84%D0%B0%D0%B2%D0%B8%D1%82%D0%B0_%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B8%D1%86%D0%B5%D0%B9
> [5] http://man7.org/linux/man-pages/man5/locale.5.html
> [6] Spreadsheet for generating translit_cyrillic
> https://sourceware.org/bugzilla/attachment.cgi?bugid=2872&action=viewall&hide_obsolete=1
> [8] https://sourceware.org/glibc/wiki/Locales#Testing_Locales
> [9] translit-test-input.txt
> https://sourceware.org/bugzilla/attachment.cgi?id=11304
> [10] https://en.wikipedia.org/wiki/ISO_9#GOST_7.79_System_B
> [11]
> https://scriptsource.org/cms/scripts/page.php?item_id=entry_detail&uid=gslmka8xq3
>
>
> Best regards,
> Egor Kobylkin
>

-- 
Marko Myllynen