public inbox for glibc-bugs@sourceware.org
From: "maiku.fabian at gmail dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug localedata/14094] Update locale data to Unicode 7.0.0
Date: Thu, 06 Nov 2014 11:00:00 -0000
Message-ID: <bug-14094-131-WnMYOwTgy4@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-14094-131@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #19 from Mike FABIAN <maiku.fabian at gmail dot com> ---
I extended Pravin’s ctype-compatibility.py script to produce more
human-readable output and added many extra tests.

Joseph Myers> * Ensure the character type data in
Joseph Myers>   localedata/locales/i18n can be properly reproduced from
Joseph Myers>   Unicode 5.0 data using gen-unicode-ctype.c, adapting
Joseph Myers>   gen-unicode-ctype.c as needed to replicate any changes
Joseph Myers>   that may have been made not using that program.

When using gen-unicode-ctype.c with UnicodeData.txt-5.0.0
to generate LC_CTYPE, the generated file lacks
many characters which apparently have been manually added
to glibc’s i18n file:

alpha: Missing 1238 characters of old ctype in new ctype 
blank: Missing 0 characters of old ctype in new ctype 
cntrl: Missing 0 characters of old ctype in new ctype 
combining: Missing 124 characters of old ctype in new ctype 
combining_level3: Missing 49 characters of old ctype in new ctype 
digit: Missing 0 characters of old ctype in new ctype 
graph: Missing 1571 characters of old ctype in new ctype 
lower: Missing 115 characters of old ctype in new ctype 
print: Missing 1571 characters of old ctype in new ctype 
punct: Missing 335 characters of old ctype in new ctype 
space: Missing 0 characters of old ctype in new ctype 
tolower: Missing 19 characters of old ctype in new ctype 
totitle: Missing 8 characters of old ctype in new ctype 
toupper: Missing 18 characters of old ctype in new ctype 
upper: Missing 100 characters of old ctype in new ctype 
xdigit: Missing 0 characters of old ctype in new ctype 

I.e. reproducing the localedata/locales/i18n character type data
from Unicode 5.0 data using gen-unicode-ctype.c does not work
well because glibc’s i18n file apparently has already been heavily
edited by hand to include newer Unicode data.
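
The per-class "Missing N characters" counts above amount to a set
difference between the old and the new class membership. A minimal
sketch of that idea follows; it is not the code of
ctype-compatibility.py, and the data layout (a dict mapping class name
to a set of code points), the function name missing_per_class and the
toy data are all assumptions made for illustration:

```python
# Hypothetical sketch: class names follow the report above, but the
# dict-of-sets layout and the toy data are assumptions, not glibc data.

def missing_per_class(old_ctype, new_ctype):
    """Return, per class, the sorted code points in old but not in new."""
    report = {}
    for cls in sorted(old_ctype):
        old_chars = old_ctype[cls]
        new_chars = new_ctype.get(cls, set())
        report[cls] = sorted(old_chars - new_chars)
    return report

# Toy data only; not taken from the real i18n file.
old = {"alpha": {0x41, 0x61, 0x100}, "digit": {0x30, 0x31}}
new = {"alpha": {0x41, 0x61}, "digit": {0x30, 0x31}}

for cls, chars in missing_per_class(old, new).items():
    print(f"{cls}: Missing {len(chars)} characters of old ctype in new ctype")
```

Keeping the missing code points themselves, rather than only the
counts, makes it possible to look each one up afterwards.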

Apparently quite a few mistakes have been made by manually editing
the i18n file. For example, the report from ctype-compatibility.py
also produces for the old i18n file:

error: 0xa67f ꙿ punct True: 0xa67f CYRILLIC PAYEROK. Not in Unicode 5.0.0.
            In Unicode 7.0.0. General category Lm (Letter
            modifier). DerivedCoreProperties.txt says it is
            “Alphabetic”. Apparently added manually to punct by mistake in
            glibc’s old LC_CTYPE.
error: 0xa67f ꙿ alpha False: 0xa67f CYRILLIC PAYEROK. Not in Unicode 5.0.0.
            In Unicode 7.0.0. General category Lm (Letter
            modifier). DerivedCoreProperties.txt says it is
            “Alphabetic”. Apparently added manually to punct by mistake in
            glibc’s old LC_CTYPE.
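
The name and general category quoted in these messages can be
cross-checked with Python's standard unicodedata module. This is only
a rough check under two assumptions: the module reflects whatever
Unicode version the interpreter bundles (not necessarily 7.0.0), and
the general category is not the same thing as the “Alphabetic”
property from DerivedCoreProperties.txt. The helper name describe is
made up for this sketch:

```python
import unicodedata

def describe(code_point):
    """Return (name, general category) from the interpreter's Unicode data."""
    ch = chr(code_point)
    return unicodedata.name(ch), unicodedata.category(ch)

name, category = describe(0xA67F)
print(f"U+A67F {name}, general category {category}")
# The bundled Unicode version differs between Python releases.
print("Unicode database version:", unicodedata.unidata_version)
```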

Another example:

error: 0x9f4 ৴ alpha True: 
            “09F4;BENGALI CURRENCY NUMERATOR ONE;No;0;L;;;;1/16;N;;;;;”
            “09F5;BENGALI CURRENCY NUMERATOR TWO;No;0;L;;;;1/8;N;;;;;”
            “09F6;BENGALI CURRENCY NUMERATOR THREE;No;0;L;;;;3/16;N;;;;;”
            “09F7;BENGALI CURRENCY NUMERATOR FOUR;No;0;L;;;;1/4;N;;;;;”
            “09F8;BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR;No;0;L;;;;3/4;N;;;;;”
            “09F9;BENGALI CURRENCY DENOMINATOR SIXTEEN;No;0;L;;;;16;N;;;;;”
            “09FA;BENGALI ISSHAR;So;0;L;;;;;N;;;;;”
            According to DerivedCoreProperties.txt (7.0.0) these are *not*
            “Alphabetic”.

So these have been mistakenly added to “alpha” in the old i18n file
of glibc (whereas gen-unicode-ctype.c correctly puts them into “punct”),
i.e. this seems to be another mistake made by manual editing.
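
The UnicodeData.txt records quoted above are semicolon-separated, with
the code point in the first field, the name in the second and the
general category in the third. A small sketch of reading them follows.
The helper names are made up for this example, and testing for a
category starting with "L" is only a rough proxy: the “Alphabetic”
property in DerivedCoreProperties.txt also covers some non-letter
categories.

```python
# The records below are copied from the error message quoted above.
RECORDS = """\
09F4;BENGALI CURRENCY NUMERATOR ONE;No;0;L;;;;1/16;N;;;;;
09F5;BENGALI CURRENCY NUMERATOR TWO;No;0;L;;;;1/8;N;;;;;
09F6;BENGALI CURRENCY NUMERATOR THREE;No;0;L;;;;3/16;N;;;;;
09F7;BENGALI CURRENCY NUMERATOR FOUR;No;0;L;;;;1/4;N;;;;;
09F8;BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR;No;0;L;;;;3/4;N;;;;;
09F9;BENGALI CURRENCY DENOMINATOR SIXTEEN;No;0;L;;;;16;N;;;;;
09FA;BENGALI ISSHAR;So;0;L;;;;;N;;;;;
"""

def parse_unicode_data(text):
    """Map code point -> (name, general category) for each record."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        table[int(fields[0], 16)] = (fields[1], fields[2])
    return table

def is_letter_category(category):
    # Rough proxy only: general categories starting with "L" are letters.
    return category.startswith("L")

table = parse_unicode_data(RECORDS)
for cp, (name, cat) in sorted(table.items()):
    print(f"U+{cp:04X} {cat} letter={is_letter_category(cat)} {name}")
```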

Some of the errors reported by ctype-compatibility.py, such as

error: 0x250 ɐ lower False: Should be lower in Unicode 7.0.0 (was not lower in
            Unicode 5.0.0).

would be fixed by using gen-unicode-ctype.c with Unicode 7.0.0 input.

There are many more problems like this in the old i18n file;
my tests found 133 errors in total:

------------------------------------------------------------
Old file = /local/mfabian/src/glibc/localedata/locales/i18n
Number of errors in old file = 133
------------------------------------------------------------

I’ll attach the full report.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
From: "maiku.fabian at gmail dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug localedata/14094] Update locale data to Unicode 7.0.0
Date: Thu, 06 Nov 2014 11:02:00 -0000

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #20 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Created attachment 7907
  --> https://sourceware.org/bugzilla/attachment.cgi?id=7907&action=edit
unicode-5.0.0-report-full-output

Full report from ctype-compatibility.py when comparing the old i18n
file in glibc with the file generated by gen-unicode-ctype.c using
UnicodeData.txt from Unicode 5.0.0.



Thread overview: 37+ messages
2012-05-10 20:28 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
2012-05-11  3:26 ` [Bug localedata/14094] " bugdal at aerifal dot cx
2013-11-26 17:05 ` myllynen at redhat dot com
2014-02-18  9:24 ` pravin.d.s at gmail dot com
2014-05-21 11:11 ` allan at archlinux dot org
2014-05-23  7:54 ` [Bug localedata/14094] Update locale data to Unicode 6.3 pravin.d.s at gmail dot com
2014-05-23 12:02 ` joseph at codesourcery dot com
2014-05-23 13:20 ` pravin.d.s at gmail dot com
2014-06-10  9:38 ` pravin.d.s at gmail dot com
2014-06-10 14:38 ` carlos at redhat dot com
2014-06-11  3:49 ` pravin.d.s at gmail dot com
2014-06-19 10:28 ` pravin.d.s at gmail dot com
2014-06-21 19:10 ` [Bug localedata/14094] Update locale data to Unicode 7.0.0 pravin.d.s at gmail dot com
2014-06-25 11:02 ` fweimer at redhat dot com
2014-06-25 12:24 ` pravin.d.s at gmail dot com
2014-06-25 13:47 ` carlos at redhat dot com
2014-07-04  9:13 ` pravin.d.s at gmail dot com
2014-07-17 10:41 ` pravin.d.s at gmail dot com
2014-07-22 12:18 ` pravin.d.s at gmail dot com
2014-09-05  1:07 ` carlos at redhat dot com
2014-09-29  7:13 ` maiku.fabian at gmail dot com
2014-09-29  7:17 ` pravin.d.s at gmail dot com
2014-11-06 11:00 ` maiku.fabian at gmail dot com [this message]
2014-11-06 11:03 ` maiku.fabian at gmail dot com
2014-11-06 11:45 ` maiku.fabian at gmail dot com
2014-11-12 10:13 ` pravin.d.s at gmail dot com
2014-11-12 10:18 ` pravin.d.s at gmail dot com
2014-11-14  7:15 ` maiku.fabian at gmail dot com
2014-11-14  7:34 ` maiku.fabian at gmail dot com
2014-11-24 11:20 ` maiku.fabian at gmail dot com
2014-12-01 10:14 ` maiku.fabian at gmail dot com
2014-12-03 12:27 ` maiku.fabian at gmail dot com
2014-12-03 12:27 ` maiku.fabian at gmail dot com
2014-12-04 10:33 ` maiku.fabian at gmail dot com
2015-02-20 22:36 ` cvs-commit at gcc dot gnu.org
2015-02-21  0:06 ` aoliva at sourceware dot org
2015-02-21 20:24 ` aoliva at sourceware dot org
