From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 6645 invoked by alias); 6 Nov 2014 11:00:11 -0000
Mailing-List: contact glibc-bugs-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id:
List-Subscribe:
List-Post:
List-Help: ,
Sender: glibc-bugs-owner@sourceware.org
Received: (qmail 6498 invoked by uid 48); 6 Nov 2014 11:00:06 -0000
From: "maiku.fabian at gmail dot com"
To: glibc-bugs@sourceware.org
Subject: [Bug localedata/14094] Update locale data to Unicode 7.0.0
Date: Thu, 06 Nov 2014 11:00:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: glibc
X-Bugzilla-Component: localedata
X-Bugzilla-Version: 2.21
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: maiku.fabian at gmail dot com
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: pravin.d.s at gmail dot com
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: security-
X-Bugzilla-Changed-Fields:
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2014-11/txt/msg00020.txt.bz2

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #19 from Mike FABIAN ---
I extended Pravin’s ctype-compatibility.py script to produce more
human readable output and added many extra tests.

Joseph Myers> * Ensure the character type data in
Joseph Myers> localedata/charmaps/i18n can be properly reproduced from
Joseph Myers> Unicode 5.0 data using gen-unicode-ctype.c, adapting
Joseph Myers> gen-unicode-ctype.c as needed to replicate any changes
Joseph Myers> that may have been made not using that program.

When using gen-unicode-ctype.c with UnicodeData.txt-5.0.0 to generate
LC_CTYPE, the generated file lacks many characters which apparently
have been manually added to glibc’s i18n file:

alpha: Missing 1238 characters of old ctype in new ctype
blank: Missing 0 characters of old ctype in new ctype
cntrl: Missing 0 characters of old ctype in new ctype
combining: Missing 124 characters of old ctype in new ctype
combining_level3: Missing 49 characters of old ctype in new ctype
digit: Missing 0 characters of old ctype in new ctype
graph: Missing 1571 characters of old ctype in new ctype
lower: Missing 115 characters of old ctype in new ctype
print: Missing 1571 characters of old ctype in new ctype
punct: Missing 335 characters of old ctype in new ctype
space: Missing 0 characters of old ctype in new ctype
tolower: Missing 19 characters of old ctype in new ctype
totitle: Missing 8 characters of old ctype in new ctype
toupper: Missing 18 characters of old ctype in new ctype
upper: Missing 100 characters of old ctype in new ctype
xdigit: Missing 0 characters of old ctype in new ctype

I.e. reproducing the localedata/charmaps/i18n character type data from
Unicode 5.0 data using gen-unicode-ctype.c does not work well, because
glibc’s i18n file apparently has already been edited manually a lot to
include newer Unicode data.

Apparently quite a few mistakes have been made by manually editing the
i18n file. For example, ctype-compatibility.py also reports the
following for the old i18n file:

error: 0xa67f ꙿ punct True: 0xa67f CYRILLIC PAYEROK. Not in Unicode 5.0.0. In Unicode 7.0.0. General category Lm (Letter modifier). DerivedCoreProperties.txt says it is “Alphabetic”. Apparently added manually to punct by mistake in glibc’s old LC_CTYPE.

error: 0xa67f ꙿ alpha False: 0xa67f CYRILLIC PAYEROK. Not in Unicode 5.0.0. In Unicode 7.0.0. General category Lm (Letter modifier).
DerivedCoreProperties.txt says it is “Alphabetic”. Apparently added manually to punct by mistake in glibc’s old LC_CTYPE.

Another example:

error: 0x9f4 ৴ alpha True:
    “09F4;BENGALI CURRENCY NUMERATOR ONE;No;0;L;;;;1/16;N;;;;;”
    “09F5;BENGALI CURRENCY NUMERATOR TWO;No;0;L;;;;1/8;N;;;;;”
    “09F6;BENGALI CURRENCY NUMERATOR THREE;No;0;L;;;;3/16;N;;;;;”
    “09F7;BENGALI CURRENCY NUMERATOR FOUR;No;0;L;;;;1/4;N;;;;;”
    “09F8;BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR;No;0;L;;;;3/4;N;;;;;”
    “09F9;BENGALI CURRENCY DENOMINATOR SIXTEEN;No;0;L;;;;16;N;;;;;”
    “09FA;BENGALI ISSHAR;So;0;L;;;;;N;;;;;”
    According to DerivedCoreProperties.txt (7.0.0) these are *not* “Alphabetic”.

So this has been mistakenly added to “alpha” in the old i18n file of
glibc (but gen-unicode-ctype.c correctly puts it into “punct”, i.e.
this seems to be another mistake made by manual editing).

Some of the errors reported by ctype-compatibility.py, such as

error: 0x250 ɐ lower False: Should be lower in Unicode 7.0.0 (was not lower in Unicode 5.0.0).

would be fixed by using gen-unicode-ctype.c with Unicode 7.0.0 input.

There are many more problems like this in the old i18n file; my tests
found 133 errors in total:

------------------------------------------------------------
Old file = /local/mfabian/src/glibc/localedata/locales/i18n
Number of errors in old file = 133
------------------------------------------------------------

I’ll attach the full report.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
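
For readers who want to see the shape of the comparison described in comment #19, here is a minimal sketch. It is not the code of ctype-compatibility.py; the function names missing_report and general_category are hypothetical, and the sketch assumes each ctype class has already been parsed into a set of integer code points. It shows how per-class "Missing N characters" lines can be derived with a set difference, and how the general category (third field) can be read from a UnicodeData.txt line like the ones quoted above.

```python
# Illustrative sketch only; this is NOT the code of ctype-compatibility.py.
# The names missing_report and general_category are hypothetical.


def missing_report(old_ctype, new_ctype):
    """Return one summary line per class of old_ctype, sorted by class name.

    old_ctype and new_ctype map a class name such as "alpha" to a set of
    integer code points (an assumption about how the data was parsed).
    """
    lines = []
    for cls in sorted(old_ctype):
        # Code points present in the old class but absent from the new one.
        missing = old_ctype[cls] - new_ctype.get(cls, set())
        lines.append(
            f"{cls}: Missing {len(missing)} characters of old ctype in new ctype"
        )
    return lines


def general_category(unicode_data_line):
    """Return (code point, general category) from one UnicodeData.txt line."""
    fields = unicode_data_line.split(";")
    return int(fields[0], 16), fields[2]


if __name__ == "__main__":
    # Toy data chosen for illustration, not taken from any real i18n file.
    old = {"alpha": {0x41, 0x9F4}, "punct": {0x21}}
    new = {"alpha": {0x41}, "punct": {0x21}}
    for line in missing_report(old, new):
        print(line)
    print(general_category(
        "09F4;BENGALI CURRENCY NUMERATOR ONE;No;0;L;;;;1/16;N;;;;;"))
```

A set difference only reports characters that disappeared from a class. Deciding whether each difference is a mistake in the old file or in the generated one still needs a check against the Unicode data files, as the error lines in the comment do.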
>>From glibc-bugs-return-26529-listarch-glibc-bugs=sources.redhat.com@sourceware.org Thu Nov 06 11:02:11 2014
Return-Path:
Delivered-To: listarch-glibc-bugs@sources.redhat.com
Received: (qmail 8397 invoked by alias); 6 Nov 2014 11:02:10 -0000
Mailing-List: contact glibc-bugs-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id:
List-Subscribe:
List-Post:
List-Help: ,
Sender: glibc-bugs-owner@sourceware.org
Delivered-To: mailing list glibc-bugs@sourceware.org
Received: (qmail 8327 invoked by uid 48); 6 Nov 2014 11:02:07 -0000
From: "maiku.fabian at gmail dot com"
To: glibc-bugs@sourceware.org
Subject: [Bug localedata/14094] Update locale data to Unicode 7.0.0
Date: Thu, 06 Nov 2014 11:02:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: glibc
X-Bugzilla-Component: localedata
X-Bugzilla-Version: 2.21
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: maiku.fabian at gmail dot com
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: pravin.d.s at gmail dot com
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: security-
X-Bugzilla-Changed-Fields: attachments.created
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2014-11/txt/msg00021.txt.bz2
Content-length: 505

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #20 from Mike FABIAN ---
Created attachment 7907
  --> https://sourceware.org/bugzilla/attachment.cgi?id=7907&action=edit
unicode-5.0.0-report-full-output

Full report from ctype-compatibility.py when comparing the old i18n
file in glibc with the file generated by gen-unicode-ctype.c using
UnicodeData.txt from Unicode 5.0.0.

-- 
You are receiving this mail because:
You are on the CC list for the bug.