Subject: Re: [PING^6][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
To: Carlos O'Donell, Marko Myllynen, libc-alpha@sourceware.org,
    libc-locales@sourceware.org, Carlos O'Donell, Siddhesh Poyarekar,
    Rafal Luzynski
Cc: Mike Fabian
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com>
    <20180412224352.GB2911@altlinux.org>
    <7cdd817a-4a47-201a-8eeb-87db324104b3@kobylkin.com>
    <8923a5a0-65c8-4784-6d7d-f3571933dcb5@redhat.com>
    <4ebfdba5-41c1-3465-0b01-9152d6417350@redhat.com>
    <5aa900a3-b6ce-66c9-d2b5-fcc71e764154@kobylkin.com>
From: Egor Kobylkin
Message-ID: <7d272d07-87ca-12fd-0f1b-00cbd93ea43d@kobylkin.com>
Date: Tue, 16 Apr 2019 18:41:00 -0000
X-SW-Source: 2019-q2/txt/msg00025.txt.bz2

On 16.04.19 19:58, Carlos O'Donell wrote:
> On 4/16/19 1:06 PM, Egor Kobylkin wrote:
>> Just FYI, this what I was testing: ./testrun.sh /usr/bin/iconv -f
>> UTF-8 -t ASCII//TRANSLIT <<<
>> "ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуу́фхцчшщъыьэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍ
>> ҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’"
>>
>> And this is the expected result ("" added by myself):
>> "YODJG`YEZ`IYIJL`N`TSHK`U`DHABVGDEZHZIJKLMNOPRSTUU?FXCZCHSHSHHA`Y``E`YUYAabvgdezhzijklmnoprstuu?fxczchshshh``y``e`yuyayodjg`yez`iyijl`n`tshk`u`dhO`o`FHfhYHyhE`e`
>> G`g`GHghGHghZH`zh`K`k`K`k`N`n`NGngP`p`O`o`C`C`T`t`UuH`h`TCZtczSH`sh`CH`ch`CH`ch`iZH`zh`CH`ch`A`a`A`a`E`e`A`a`ZH`zh`Z`z`Z`z`I`i`O`o`O`o`U`u`U`u`CH`ch`Y`y`'"
>>
>
> Thanks.
>
> I was using CyrTranslit (python translater) to review other work done in
> this area, but it wasn't very fruitful.
>
> $ python3
> Python 3.7.3 (default, Mar 27 2019, 13:36:35)
> [GCC 9.0.1 20190227 (Red Hat 9.0.1-0.8)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import cyrtranslit
> >>> cyrtranslit.supported()
> dict_keys(['sr', 'me', 'mk', 'ru'])
> >>> cyrtranslit.to_latin("ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуу́фхцчшщъыьэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’")
> 'ЁĐЃЄЅІЇJLjNjĆЌЎDžABVGDEŽZIЙKLMNOPRSTUÚFHCČŠЩЪЫЬЭЮЯabvgdežziйklmnoprstuúfhcčšщъыьэюяёđѓєѕіїjljnjćќўdžѪѫѲѳѴѵҌҍҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’'
> >>>
>
> Which doesn't give a good transliteration.

I guess the reason is that it uses the first key from your list, 'sr',
which stands for Serbian. Serbian doesn't have the characters that were
left untransliterated ("Щ", for example).

> But the table is better:
> https://github.com/opendatakosovo/cyrillic-transliteration/blob/master/cyrtranslit/mapping.py#L138-L155
>
> Ё -> YO.
>
> Which is a good cross-check for me.

Yet the closest one from that codebase should be this:
https://github.com/opendatakosovo/cyrillic-transliteration/blob/master/cyrtranslit/mapping.py#L88

This is exactly why we had 12 iterations on this patch: we wanted the
table to cover the most complete yet workable standard. What we
reference in the bug memo is the actual accepted standard, combined
with the extended standard to cover further, outdated Cyrillic letters.

Bests,
Egor Kobylkin
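
For readers comparing the two outputs above, here is a minimal sketch of what a table-driven Cyrillic -> ASCII transliteration does. It is an illustration only: the names TABLE and translit are hypothetical, the table is a small subset read off the expected iconv output quoted in this thread, and it is neither the glibc locale data nor the cyrtranslit mapping. The '?' fallback for unmapped characters is an assumption that mirrors the '?' shown for the combining accent in the expected result.

```python
# Toy subset of a Cyrillic -> ASCII table, read off the expected output
# quoted in this thread. Hypothetical illustration, NOT the glibc table.
TABLE = {
    "Ё": "YO", "Ж": "ZH", "Х": "X", "Ц": "CZ", "Ч": "CH",
    "Ш": "SH", "Щ": "SHH", "Э": "E`", "Ю": "YU", "Я": "YA",
    "ё": "yo", "ж": "zh", "х": "x", "ц": "cz", "ч": "ch",
    "ш": "sh", "щ": "shh", "э": "e`", "ю": "yu", "я": "ya",
}


def translit(text: str) -> str:
    """Map each character through TABLE; ASCII passes through unchanged."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            # Already ASCII, keep as-is.
            out.append(ch)
        else:
            # Assumption: anything missing from the table becomes '?'.
            out.append(TABLE.get(ch, "?"))
    return "".join(out)


if __name__ == "__main__":
    print(translit("Ёж Щ"))
```

A table keyed on single code points like this cannot treat a base letter plus a combining accent as one unit, which is consistent with the "U?" seen for the accented У in the expected result.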