https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #21 from Mike FABIAN ---
Now, when using gen-unicode-ctype.c with UnicodeData.txt-7.0.0 to generate LC_CTYPE, the generated file is missing far fewer characters compared to the old i18n file in glibc:

alpha: Missing 246 characters of old ctype in new ctype
blank: Missing 1 characters of old ctype in new ctype
cntrl: Missing 0 characters of old ctype in new ctype
combining: Missing 3 characters of old ctype in new ctype
combining_level3: Missing 5 characters of old ctype in new ctype
digit: Missing 0 characters of old ctype in new ctype
graph: Missing 0 characters of old ctype in new ctype
lower: Missing 20 characters of old ctype in new ctype
print: Missing 0 characters of old ctype in new ctype
punct: Missing 16 characters of old ctype in new ctype
space: Missing 1 characters of old ctype in new ctype
tolower: Missing 0 characters of old ctype in new ctype
totitle: Missing 0 characters of old ctype in new ctype
toupper: Missing 0 characters of old ctype in new ctype
upper: Missing 0 characters of old ctype in new ctype
xdigit: Missing 0 characters of old ctype in new ctype

For example, gen-unicode-ctype.c does not put U+0901 into the “alpha” class although it should be there according to DerivedCoreProperties.txt:

error: 0x901 ँ alpha False: These have general category “Mn” i.e. these are combining characters (both in UnicodeData.txt 5.0.0 and 7.0.0):
“0901;DEVANAGARI SIGN CANDRABINDU;Mn;0;NSM;;;;;N;;;;;”,
“0902;DEVANAGARI SIGN ANUSVARA;Mn;0;NSM;;;;;N;;;;;”,
“0903;DEVANAGARI SIGN VISARGA;Mc;0;L;;;;;N;;;;;”.
According to DerivedCoreProperties.txt (7.0.0) these are “Alphabetic”.

Apparently this has been edited manually (correctly) in the old i18n file of glibc. So this would be fixed in the automatic generation when using DerivedCoreProperties.txt for “alpha”.
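As a rough illustration of why taking “alpha” from DerivedCoreProperties.txt would catch these characters, here is a minimal Python sketch. It is not the logic of gen-unicode-ctype.c or ctype-compatibility.py. The UnicodeData.txt records are the ones quoted above; the DerivedCoreProperties.txt excerpt only imitates that file's line format and is an assumption, and the helper names and the letters-only rule are hypothetical.

```python
# Sketch only: not the logic of gen-unicode-ctype.c or ctype-compatibility.py.

# UnicodeData.txt records as quoted in the comment above.
UNICODE_DATA = """\
0901;DEVANAGARI SIGN CANDRABINDU;Mn;0;NSM;;;;;N;;;;;
0902;DEVANAGARI SIGN ANUSVARA;Mn;0;NSM;;;;;N;;;;;
0903;DEVANAGARI SIGN VISARGA;Mc;0;L;;;;;N;;;;;
"""

# Illustrative excerpt in the "range ; property # comment" line format of
# DerivedCoreProperties.txt. Assumed for the example, not copied from 7.0.0.
DERIVED_CORE_PROPERTIES = """\
0901..0902    ; Alphabetic # Mn
0903          ; Alphabetic # Mc
"""


def general_categories(text):
    """Map code point -> general category (third field of each record)."""
    cats = {}
    for line in text.splitlines():
        fields = line.split(";")
        cats[int(fields[0], 16)] = fields[2]
    return cats


def code_points_with_property(text, prop):
    """Collect all code points listed with the given derived property."""
    result = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue
        rng, name = [part.strip() for part in line.split(";")]
        if name != prop:
            continue
        first, _, last = rng.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        result.update(range(start, end + 1))
    return result


def is_alpha_by_category(category):
    """Hypothetical simplified rule: only letter categories count as alpha."""
    return category.startswith("L")


categories = general_categories(UNICODE_DATA)
alphabetic = code_points_with_property(DERIVED_CORE_PROPERTIES, "Alphabetic")

for cp in sorted(categories):
    print(f"U+{cp:04X} {categories[cp]} "
          f"by_category={is_alpha_by_category(categories[cp])} "
          f"by_property={cp in alphabetic}")
```

With this toy input the category-based rule rejects all three characters, while the property lookup accepts them.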
But some of the above seem to be errors in the old i18n file of glibc, for example:

error: 0x1090 ႐ punct True: MYANMAR SHAN DIGIT ZERO - MYANMAR SHAN DIGIT NINE. These are digits, but because ISO C 99 forbids to put them into digit they should go into alpha.

This is in “punct” in the old i18n file, but gen-unicode-ctype.c would put it into “alpha”, which seems better for such digits according to the comments in gen-unicode-ctype.c.

I went through all these “Missing” characters individually, looked them up in UnicodeData.txt and DerivedCoreProperties.txt, checked how they should be classified, and added test cases for them to the ctype-compatibility.py script.

I’ll attach the full report after using gen-unicode-ctype.c with UnicodeData.txt-7.0.0 to generate LC_CTYPE.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

From: "maiku.fabian at gmail dot com"
To: glibc-bugs@sourceware.org
Subject: [Bug localedata/14094] Update locale data to Unicode 7.0.0
Date: Thu, 06 Nov 2014 11:06:00 -0000
https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #22 from Mike FABIAN ---
Created attachment 7908
  --> https://sourceware.org/bugzilla/attachment.cgi?id=7908&action=edit
unicode-7.0.0-report-full-output

Full report from ctype-compatibility.py when comparing the old i18n file in glibc with the file generated by gen-unicode-ctype.c using UnicodeData.txt from Unicode 7.0.0.
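The per-class “Missing N characters of old ctype in new ctype” counts in such a report amount to a set difference per character class. The following Python sketch shows that idea only; it is not taken from ctype-compatibility.py, the function name is hypothetical, and the class contents are toy data rather than the real i18n tables.

```python
# Sketch only: toy data, not the real glibc i18n tables, and not the
# actual code of ctype-compatibility.py.

def missing_by_class(old, new):
    """For each class in `old`, list code points absent from `new`."""
    return {
        cls: sorted(chars - new.get(cls, set()))
        for cls, chars in old.items()
    }


# Hypothetical class membership: U+1090 moves from "punct" to "alpha",
# and U+0901/U+0902 drop out of "alpha".
old_ctype = {
    "alpha": {0x0041, 0x0901, 0x0902},
    "punct": {0x0021, 0x1090},
}
new_ctype = {
    "alpha": {0x0041, 0x1090},
    "punct": {0x0021},
}

report = missing_by_class(old_ctype, new_ctype)
for cls in sorted(report):
    print(f"{cls}: Missing {len(report[cls])} characters "
          f"of old ctype in new ctype")
```

A one-directional difference like this only reports characters that disappeared from a class. Characters newly added to a class would need the reverse comparison.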