From: "maiku.fabian at gmail dot com"
To: libc-locales@sourceware.org
Subject: [Bug localedata/21547] Tibetan script collation broken (Dzongkha and Tibetan)
Date: Mon, 15 Jan 2018 14:35:00 -0000
X-Bugzilla-Product: glibc
X-Bugzilla-Component: localedata
X-Bugzilla-Version: 2.24
X-Bugzilla-Status: UNCONFIRMED

https://sourceware.org/bugzilla/show_bug.cgi?id=21547

--- Comment #7 from Mike FABIAN ---
(In reply to Elie Roux from comment #6)
> Hello Fabian,
>
> Thanks a lot for your thorough review, that's appreciated!
>
> I have to say I don't really understand the second part, why would line 30
> causes གཉ to be sorted after གཉྫ ? can you elaborate a little bit?

I am not sure why this happens either. But it seems to happen.
I tested like this:

First the input test file to be sorted, made very short to contain
only the strings in question:

mfabian@taka:/local/mfabian/src/glibc (locales *$%)
$ cat localedata/dz_BT.UTF-8.in.mini
གཉ
གཉྫ
mfabian@taka:/local/mfabian/src/glibc (locales *$%)

And I use a very short rule file, at first containing only &གཉ<གཉྫ
with the other rule commented out:

mfabian@taka:/local/mfabian/src/glibc (locales *$%)
$ cat rules-mini.txt
&གཉ<གཉྫ
#&ཉ<<ྋྙ<གཉ<མཉ<རྙ=ཪྙ<སྙ<བརྙ=བཪྙ<བསྙ
mfabian@taka:/local/mfabian/src/glibc (locales *$%)

Now I sort using my small test program ~/bin/icu-collation-test.py
(I’ll attach it in the next comment):

mfabian@taka:/local/mfabian/src/glibc (locales *$%)
$ ~/bin/icu-collation-test.py -r rules-mini.txt -i localedata/dz_BT.UTF-8.in.mini -o /tmp/dz_BT.UTF-8.out
mfabian@taka:/local/mfabian/src/glibc (locales *$%)

And check the result:

mfabian@taka:/local/mfabian/src/glibc (locales *$%)
$ diff -u /local/mfabian/src/glibc/localedata/dz_BT.UTF-8.in.mini /tmp/dz_BT.UTF-8.out
mfabian@taka:/local/mfabian/src/glibc (locales *$%)

No difference between input and output, i.e. གཉ is still before
གཉྫ in dz_BT.UTF-8.out.
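The test program itself is not included in this comment. The steps
above (read a rule file, sort the input with an ICU tailoring, diff
the result against the input) could be sketched roughly as follows.
This is only an illustration, not the attached icu-collation-test.py:
it assumes the third-party PyICU package (module "icu") is installed,
and the function names sort_lines, order_diff and run_test are made up
for this sketch.

```python
import difflib
from pathlib import Path


def sort_lines(lines, rules, make_collator=None):
    """Sort lines with a collator built from ICU tailoring rules.

    make_collator is a hypothetical hook for this sketch; by default the
    collator comes from PyICU's RuleBasedCollator (assumed installed).
    """
    if make_collator is None:
        import icu  # PyICU, third-party
        make_collator = icu.RuleBasedCollator
    collator = make_collator(rules)
    # getSortKey returns a byte string that sorts in collation order.
    return sorted(lines, key=collator.getSortKey)


def order_diff(before, after, before_name="input", after_name="output"):
    """Return unified-diff lines; an empty list means the order is unchanged."""
    return list(
        difflib.unified_diff(
            before, after, fromfile=before_name, tofile=after_name, lineterm=""
        )
    )


def run_test(rules_path, input_path):
    """Read rules and input, sort the input, and diff sorted against original."""
    rules = Path(rules_path).read_text(encoding="utf-8")
    lines = Path(input_path).read_text(encoding="utf-8").splitlines()
    return order_diff(lines, sort_lines(lines, rules), input_path, "sorted")
```

Calling run_test("rules-mini.txt", "localedata/dz_BT.UTF-8.in.mini")
and printing the returned lines plays the role of the "diff -u" step:
no lines means the sorted order equals the input order.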
Now I remove the comment in front of the second line in rules-mini.txt:

mfabian@taka:/local/mfabian/src/glibc (locales *$%)
$ cat rules-mini.txt
&གཉ<གཉྫ
&ཉ<<ྋྙ<གཉ<མཉ<རྙ=ཪྙ<སྙ<བརྙ=བཪྙ<བསྙ
mfabian@taka:/local/mfabian/src/glibc (locales *$%)

And sort again:

mfabian@taka:/local/mfabian/src/glibc (locales *$%)
$ ~/bin/icu-collation-test.py -r rules-mini.txt -i localedata/dz_BT.UTF-8.in.mini -o /tmp/dz_BT.UTF-8.out
mfabian@taka:/local/mfabian/src/glibc (locales *$%)

Checking the result:

mfabian@taka:/local/mfabian/src/glibc (locales *$%)
$ diff -u /local/mfabian/src/glibc/localedata/dz_BT.UTF-8.in.mini /tmp/dz_BT.UTF-8.out
--- /local/mfabian/src/glibc/localedata/dz_BT.UTF-8.in.mini 2018-01-15 15:21:59.357477414 +0100
+++ /tmp/dz_BT.UTF-8.out 2018-01-15 15:26:12.266632745 +0100
@@ -1,2 +1,2 @@
-གཉ
 གཉྫ
+གཉ
mfabian@taka:/local/mfabian/src/glibc (locales *$%)
$

Now the order is reversed: གཉ comes after གཉྫ.

The same happened to me while I was implementing the rules for glibc
and testing the sorting with glibc. I found this very confusing and
thought I might have made a mistake implementing the rules in the
glibc way. But then I tested with the above small Python3 program
using icu and found that it behaves the same way.
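The two runs can also be compared without any files by asking a
collator directly how it orders the two strings under each rule set.
The following is a minimal sketch under the same assumption as before
(PyICU installed, exposing RuleBasedCollator with a compare method);
pair_order and compare_rule_sets are hypothetical helper names, and
the strings are written as escapes so the code points are unambiguous.

```python
# The two test strings from dz_BT.UTF-8.in.mini, as code point escapes.
GA_NYA = "\u0f42\u0f49"            # U+0F42 U+0F49
GA_NYA_DZA = "\u0f42\u0f49\u0fab"  # U+0F42 U+0F49 U+0FAB

# First rule in rules-mini.txt.
RULE_1 = "&\u0f42\u0f49<\u0f42\u0f49\u0fab"
# Second rule in rules-mini.txt (the one that was commented out at first).
RULE_2 = (
    "&\u0f49<<\u0f8b\u0f99<\u0f42\u0f49<\u0f58\u0f49"
    "<\u0f62\u0f99=\u0f6a\u0f99<\u0f66\u0f99"
    "<\u0f56\u0f62\u0f99=\u0f56\u0f6a\u0f99<\u0f56\u0f66\u0f99"
)


def pair_order(rules, first, second, make_collator=None):
    """Report whether first sorts before, after, or equal to second."""
    if make_collator is None:
        import icu  # PyICU, third-party (assumed installed)
        make_collator = icu.RuleBasedCollator
    result = make_collator(rules).compare(first, second)
    if result < 0:
        return "before"
    if result > 0:
        return "after"
    return "equal"


def compare_rule_sets(first, second, make_collator=None):
    """Order of the pair with only the first rule, then with both rules."""
    return {
        "first rule only": pair_order(RULE_1, first, second, make_collator),
        "both rules": pair_order(RULE_1 + "\n" + RULE_2, first, second, make_collator),
    }
```

Printing compare_rule_sets(GA_NYA, GA_NYA_DZA) should show whether a
given ICU version reproduces the reversal described above; the result
is deliberately not stated here, since it depends on the ICU build.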