https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #19 from Mike FABIAN ---
I extended Pravin’s ctype-compatibility.py script to produce more
human-readable output and added many extra tests.

Joseph Myers> * Ensure the character type data in
Joseph Myers>   localedata/charmaps/i18n can be properly reproduced from
Joseph Myers>   Unicode 5.0 data using gen-unicode-ctype.c, adapting
Joseph Myers>   gen-unicode-ctype.c as needed to replicate any changes
Joseph Myers>   that may have been made not using that program.

When using gen-unicode-ctype.c with UnicodeData.txt-5.0.0 to generate
LC_CTYPE, the generated file lacks many characters which apparently
have been manually added to glibc’s i18n file:

    alpha: Missing 1238 characters of old ctype in new ctype
    blank: Missing 0 characters of old ctype in new ctype
    cntrl: Missing 0 characters of old ctype in new ctype
    combining: Missing 124 characters of old ctype in new ctype
    combining_level3: Missing 49 characters of old ctype in new ctype
    digit: Missing 0 characters of old ctype in new ctype
    graph: Missing 1571 characters of old ctype in new ctype
    lower: Missing 115 characters of old ctype in new ctype
    print: Missing 1571 characters of old ctype in new ctype
    punct: Missing 335 characters of old ctype in new ctype
    space: Missing 0 characters of old ctype in new ctype
    tolower: Missing 19 characters of old ctype in new ctype
    totitle: Missing 8 characters of old ctype in new ctype
    toupper: Missing 18 characters of old ctype in new ctype
    upper: Missing 100 characters of old ctype in new ctype
    xdigit: Missing 0 characters of old ctype in new ctype

I.e. reproducing the localedata/charmaps/i18n character type data from
Unicode 5.0 data using gen-unicode-ctype.c does not work well, because
glibc’s i18n file has apparently already been edited manually a lot to
include newer Unicode data.

Apparently quite a few mistakes have been made while manually editing
the i18n file.
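The per-class "Missing N characters" counts above amount to a set
difference per character class. A minimal sketch of that comparison,
assuming each LC_CTYPE has already been parsed into a dict mapping class
name to a set of code points. The function name, the dict layout and the
toy data are illustrative assumptions, not the actual code of
ctype-compatibility.py:

```python
def missing_report(old_ctype, new_ctype):
    """Return one summary line per class of the old ctype, sorted by name.

    Both arguments are assumed to map a class name (e.g. "alpha") to a
    set of integer code points. This layout is hypothetical.
    """
    lines = []
    for cls in sorted(old_ctype):
        # Code points present in the old class but absent from the new one.
        missing = old_ctype[cls] - new_ctype.get(cls, set())
        lines.append(
            f"{cls}: Missing {len(missing)} characters of old ctype in new ctype"
        )
    return lines


# Made-up toy data, only to exercise the function.
old = {"alpha": {0x41, 0x42, 0x9F4}, "punct": {0x21, 0xA67F}}
new = {"alpha": {0x41, 0x42}, "punct": {0x21, 0x9F4}}

for line in missing_report(old, new):
    print(line)
```

Swapping the two arguments would give the opposite direction, i.e.
characters that are in the new ctype but not in the old one.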
For example, the report from ctype-compatibility.py also produces, for
the old i18n file:

    error: 0xa67f ꙿ punct True: 0xa67f CYRILLIC PAYEROK. Not in
    Unicode 5.0.0. In Unicode 7.0.0. General category Lm (Letter
    modifier). DerivedCoreProperties.txt says it is “Alphabetic”.
    Apparently added manually to punct by mistake in glibc’s old
    LC_CTYPE.

    error: 0xa67f ꙿ alpha False: 0xa67f CYRILLIC PAYEROK. Not in
    Unicode 5.0.0. In Unicode 7.0.0. General category Lm (Letter
    modifier). DerivedCoreProperties.txt says it is “Alphabetic”.
    Apparently added manually to punct by mistake in glibc’s old
    LC_CTYPE.

Another example:

    error: 0x9f4 ৴ alpha True:
    “09F4;BENGALI CURRENCY NUMERATOR ONE;No;0;L;;;;1/16;N;;;;;”
    “09F5;BENGALI CURRENCY NUMERATOR TWO;No;0;L;;;;1/8;N;;;;;”
    “09F6;BENGALI CURRENCY NUMERATOR THREE;No;0;L;;;;3/16;N;;;;;”
    “09F7;BENGALI CURRENCY NUMERATOR FOUR;No;0;L;;;;1/4;N;;;;;”
    “09F8;BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR;No;0;L;;;;3/4;N;;;;;”
    “09F9;BENGALI CURRENCY DENOMINATOR SIXTEEN;No;0;L;;;;16;N;;;;;”
    “09FA;BENGALI ISSHAR;So;0;L;;;;;N;;;;;”
    According to DerivedCoreProperties.txt (7.0.0) these are *not*
    “Alphabetic”.

So these have been mistakenly added to “alpha” in the old i18n file of
glibc (but gen-unicode-ctype.c correctly puts them into “punct”, i.e.
this seems to be another mistake made by manual editing).

Some of the errors reported by ctype-compatibility.py, such as

    error: 0x250 ɐ lower False: Should be lower in Unicode 7.0.0 (was
    not lower in Unicode 5.0.0).

would be fixed by using gen-unicode-ctype.c with Unicode 7.0.0 input.

There are many more problems like this in the old i18n file; my tests
found 133 errors in total:

    ------------------------------------------------------------
    Old file = /local/mfabian/src/glibc/localedata/locales/i18n
    Number of errors in old file = 133
    ------------------------------------------------------------

I’ll attach the full report.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
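The general categories quoted in the error messages above can be
cross-checked against the Unicode database bundled with Python. This is
only a rough sketch: the `describe` helper is a made-up name, the
`unicodedata` module reflects whatever Unicode version the running Python
ships (not necessarily 7.0.0), and it exposes the general category only,
not the “Alphabetic” property from DerivedCoreProperties.txt.

```python
import unicodedata


def describe(cp):
    """Return (hex code point, character name, general category)."""
    ch = chr(cp)
    return (
        f"0x{cp:x}",
        unicodedata.name(ch, "<unnamed>"),  # fallback if the name is unknown
        unicodedata.category(ch),
    )


# Code points taken from the error messages in the report.
for cp in (0xA67F, 0x09F4, 0x09FA, 0x0250):
    print(describe(cp))
```

With a reasonably recent Python this should show Lm for 0xa67f, No for
0x9f4, So for 0x9fa and Ll for 0x250, matching the UnicodeData.txt
entries quoted in the report.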
From: "maiku.fabian at gmail dot com"
To: glibc-bugs@sourceware.org
Subject: [Bug localedata/14094] Update locale data to Unicode 7.0.0
Date: Thu, 06 Nov 2014 11:02:00 -0000

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #20 from Mike FABIAN ---
Created attachment 7907
  --> https://sourceware.org/bugzilla/attachment.cgi?id=7907&action=edit
unicode-5.0.0-report-full-output

Full report from ctype-compatibility.py when comparing the old i18n
file in glibc with the file generated by gen-unicode-ctype.c using
UnicodeData.txt from Unicode 5.0.0.