Changelog v10:
* Removed ISO 9:1995 / GOST 7.79-2000 System A (transliteration to Latin with diacritics), as it conflicts with System B within glibc mechanics and does not solve BZ #2872.
* Edited the email below, the commit message, and the comment in translit_cyrillic to reflect the removal of System A.
* Removed Cyrillic U with acute (using composition), as composing is not covered by the current glibc conversion mechanics.

Changelog v9:
* Fixed formatting (trailing spaces etc.).
* Put the commit summary in the patch file; it is now generated completely by git format-patch.

Changelog v8:
* Re-added translit_cyrillic, which was missing from patch v7 (due to a missing "git add" in the script).

Changelog v7:
* Generated against git://sourceware.org/git/glibc.git master with git format-patch.
* The 'include "translit_cyrillic";""' line now immediately follows the last 'include "translit_XXX";""' line (previously it was inserted just before translit_end).
* Only the locales that already have an 'include .*translit.*;""' line are patched (see the list of manual exclusions below; the full list of included locales is at the end of the email in the commit section).
* Excluded az_AZ completely to avoid a circular reference from tr_TR via “copy "tr_TR"”.

Changelog v6:
* Locales removed from the patch: C and sd_PK.
* Locales added: az_AZ and ky_KG.
* Single uppercase Cyrillic letters are now consistently transliterated to sequences of all-uppercase Latin letters in all languages (whenever a Cyrillic letter is transliterated to more than one Latin letter). For example, "Ї" is now transliterated as "YI" rather than "Yi".

Dear locale maintainers,

to fix glibc bug 2872 "Transliteration Cyrillic -> ASCII fails" https://sourceware.org/bugzilla/show_bug.cgi?id=2872 [1], please add the Cyrillic transliteration table file translit_cyrillic [7] to localedata/locales/ and include it in all your locales going forward. The patch is attached.
From this patch I have excluded the locales that already mention Cyrillic or have a transliteration table for it:

    mn_MN sr_RS tg_TJ tk_TM tt_RU uk_UA uz_UZ uz_UZ@cyrillic

Their maintainers are requested to make an explicit decision on whether and how to include this patch.

Current bug effect:

The glibc wiki explicitly lists this use case as its test example https://sourceware.org/glibc/wiki/Locales#Testing_Locales :

    LC_ALL=$LOCALE.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt

Currently it fails on Cyrillic text in most locales, including ru_RU [1] [8] [9]:

    LC_ALL=ru_RU.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt | grep CYRILLIC
    CYRILLIC ????? ??? ???? ?????? ??????????? ?????, ?? ????? ?? ???.

It produces a string of question marks and spaces. This is what it should produce, and what it does produce once the patch is applied:

    CYRILLIC S``esh` eshhyo e`tix myagkix franczuzskix bulok, da vy'pej zhe chayu.

The root problem and the fix:

The root problem is the missing transliteration table, which I am supplying here. Furthermore, the table has to be referenced (included) in the active locale at compilation time for iconv to use it.

COMMIT MESSAGE:

This translit_cyrillic table enables conversion (e.g. with iconv) of UTF-8 encoded text written in a Cyrillic alphabet to ASCII//TRANSLIT text.

Example: iconv -f UTF-8 -t ASCII//TRANSLIT will produce an ASCII-compatible transcription.

While UTF-8 encoded Cyrillic text requires Cyrillic fonts, the result of a transliteration/transcription contains only Latin/ASCII codes but can still be read by a native speaker. Among other things, this is useful for processing Cyrillic texts and filenames with programs or on systems that are not specifically prepared to work with Cyrillic, don't have the corresponding fonts installed, or can't handle UTF-8.

The transliteration table itself is in the file translit_cyrillic [7].
Its content (mapping) is based on the ISO 9:1995 standard [10] and its derivative GOST 7.79-2000 System B, whose official source is the Federal Agency on Technical Regulating and Metrology of the Russian Federation [2]. In practice, an independent but mostly identical source [3] was used and prepared in a spreadsheet [6].

The conversion of Cyrillic to ASCII according to GOST 7.79-2000 System B is, strictly speaking, a transcription (preserving phonemes), while System A is a transliteration (preserving graphemes). There is no meaningful way to preserve graphemes when converting Cyrillic to ASCII, and thus System B is chosen [11].

The documentation suggests that a transliteration table is included by adding the line *include "translit_cyrillic";""* to the translit_start section of LC_CTYPE http://man7.org/linux/man-pages/man5/locale.5.html [5]

In practice, all locales that already have an 'include .*translit.*;""' line were identified and included in this patch.

The Cyrillic transliteration of e.g. Russian text may already have worked to some extent for the mn_MN, sr_RS, tk_TM, uz_UZ, and uk_UA locales, which have their transliteration tables included inline. I am excluding these locales from this proposed patch. I have written directly to the locale maintainer email addresses listed in the files. Volodymyr Lisivka, Max Kutny (uk_UA), and Данило Шеган (sr_RS) have confirmed the exclusion.
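The selection rule above (patch only locales that already have an 'include .*translit.*;""' line) can be approximated with a few lines of shell. This is only a sketch, not the script that was actually used to generate the patch: the function name list_candidates is hypothetical, the directory is assumed to be localedata/locales from a glibc checkout, and skipping the translit_* table files themselves is my assumption. The manual exclusions listed in this email would still have to be removed from the result by hand.

```sh
#!/bin/sh
# Sketch: list locale source files that already include some translit_*
# table but do not yet mention translit_cyrillic.
# Usage (assumed layout): list_candidates localedata/locales
list_candidates() {
    dir=$1
    for f in "$dir"/*; do
        # Ignore directories and an unmatched glob.
        [ -f "$f" ] || continue
        name=$(basename "$f")
        # Assumption: the translit_* files are tables, not locales; skip them.
        case "$name" in
            translit_*) continue ;;
        esac
        # Same pattern as quoted in the email: include .*translit.*;""
        if grep -q 'include .*translit.*;""' "$f" &&
           ! grep -q 'translit_cyrillic' "$f"; then
            echo "$name"
        fi
    done
}
```

Each printed name is a locale file where an 'include "translit_cyrillic";""' line could be added after the last existing translit include.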
Links:
[1] This bug entry https://sourceware.org/bugzilla/show_bug.cgi?id=2872
[2] GOST 7.79-2000 official source http://protect.gost.ru/document.aspx?control=7&id=130715 (only available in low-quality GIF format)
[3] http://transliteration.ru/gost-7-79-2000/ and http://www.yfermer.ru/specifications/285821.html
[4] Wikipedia article on Cyrillic transliteration with the Latin alphabet https://ru.wikipedia.org/wiki/%D0%A2%D1%80%D0%B0%D0%BD%D1%81%D0%BB%D0%B8%D1%82%D0%B5%D1%80%D0%B0%D1%86%D0%B8%D1%8F_%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D0%B0%D0%BB%D1%84%D0%B0%D0%B2%D0%B8%D1%82%D0%B0_%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B8%D1%86%D0%B5%D0%B9
[5] http://man7.org/linux/man-pages/man5/locale.5.html
[6] Spreadsheet for generating translit_cyrillic https://sourceware.org/bugzilla/attachment.cgi?bugid=2872&action=viewall&hide_obsolete=1
[7] translit_cyrillic https://sourceware.org/bugzilla/attachment.cgi?bugid=2872&action=viewall&hide_obsolete=1
[8] https://sourceware.org/glibc/wiki/Locales#Testing_Locales
[9] translit-test-input.txt https://sourceware.org/bugzilla/attachment.cgi?id=11304
[10] https://en.wikipedia.org/wiki/ISO_9#ISO_9:1995,_or_GOST_7.79_System_A
[11] https://scriptsource.org/cms/scripts/page.php?item_id=entry_detail&uid=gslmka8xq3

Best regards,
Egor Kobylkin