From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <glibc-bugs-return-26528-listarch-glibc-bugs=sources.redhat.com@sourceware.org>
Received: (qmail 6645 invoked by alias); 6 Nov 2014 11:00:11 -0000
Mailing-List: contact glibc-bugs-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <glibc-bugs.sourceware.org>
List-Subscribe: <mailto:glibc-bugs-subscribe@sourceware.org>
List-Post: <mailto:glibc-bugs@sourceware.org>
List-Help: <mailto:glibc-bugs-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: glibc-bugs-owner@sourceware.org
Received: (qmail 6498 invoked by uid 48); 6 Nov 2014 11:00:06 -0000
From: "maiku.fabian at gmail dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug localedata/14094] Update locale data to Unicode 7.0.0
Date: Thu, 06 Nov 2014 11:00:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: glibc
X-Bugzilla-Component: localedata
X-Bugzilla-Version: 2.21
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: maiku.fabian at gmail dot com
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: pravin.d.s at gmail dot com
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: security-
X-Bugzilla-Changed-Fields:
Message-ID: <bug-14094-131-WnMYOwTgy4@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-14094-131@http.sourceware.org/bugzilla/>
References: <bug-14094-131@http.sourceware.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2014-11/txt/msg00020.txt.bz2

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #19 from Mike FABIAN <maiku.fabian at gmail dot com> ---
I extended Pravin’s ctype-compatibility.py script to produce more
human-readable output and added many extra tests.

Joseph Myers> * Ensure the character type data in
Joseph Myers>   localedata/charmaps/i18n can be properly reproduced from
Joseph Myers>   Unicode 5.0 data using gen-unicode-ctype.c, adapting
Joseph Myers>   gen-unicode-ctype.c as needed to replicate any changes
Joseph Myers>   that may have been made not using that program.

When using gen-unicode-ctype.c with UnicodeData.txt-5.0.0
to generate LC_CTYPE, the generated file lacks
many characters which apparently have been manually added
to glibc’s i18n file:

alpha: Missing 1238 characters of old ctype in new ctype
blank: Missing 0 characters of old ctype in new ctype
cntrl: Missing 0 characters of old ctype in new ctype
combining: Missing 124 characters of old ctype in new ctype
combining_level3: Missing 49 characters of old ctype in new ctype
digit: Missing 0 characters of old ctype in new ctype
graph: Missing 1571 characters of old ctype in new ctype
lower: Missing 115 characters of old ctype in new ctype
print: Missing 1571 characters of old ctype in new ctype
punct: Missing 335 characters of old ctype in new ctype
space: Missing 0 characters of old ctype in new ctype
tolower: Missing 19 characters of old ctype in new ctype
totitle: Missing 8 characters of old ctype in new ctype
toupper: Missing 18 characters of old ctype in new ctype
upper: Missing 100 characters of old ctype in new ctype
xdigit: Missing 0 characters of old ctype in new ctype
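
A per-class count like the one above can be obtained with plain set
differences. The following is only a minimal sketch, not the code in
ctype-compatibility.py: the function name missing_per_class and the toy
dictionaries are hypothetical, and each class is assumed to be available
as a set of integer code points.

```python
def missing_per_class(old_ctype, new_ctype):
    """Count, per class, the code points in old_ctype that new_ctype lacks."""
    report = {}
    for name in sorted(old_ctype):
        # A class that is absent from new_ctype counts as an empty set.
        missing = set(old_ctype[name]) - set(new_ctype.get(name, ()))
        report[name] = len(missing)
    return report


# Hypothetical toy data, not taken from the real i18n file.
old = {"alpha": {0x61, 0x62, 0x63}, "digit": {0x30, 0x31}}
new = {"alpha": {0x61, 0x62}, "digit": {0x30, 0x31}}

for name, count in missing_per_class(old, new).items():
    print(f"{name}: Missing {count} characters of old ctype in new ctype")
```

Keeping the classes as sets makes the reverse check (characters present
only in the new file) the same call with the arguments swapped.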

I.e. reproducing the localedata/charmaps/i18n character type data
from Unicode 5.0 data using gen-unicode-ctype.c does not work
well, because glibc’s i18n file apparently has already been edited
manually many times to include newer Unicode data.

Apparently quite a few mistakes have been made while manually editing
the i18n file. For example, the report from ctype-compatibility.py
also contains, for the old i18n file:

error: 0xa67f ꙿ punct True: 0xa67f CYRILLIC PAYEROK. Not in Unicode 5.0.0. In
            Unicode 7.0.0. General category Lm (Letter
            modifier). DerivedCoreProperties.txt says it is
            “Alphabetic”. Apparently added manually to punct by mistake in
            glibc’s old LC_CTYPE.
error: 0xa67f ꙿ alpha False: 0xa67f CYRILLIC PAYEROK. Not in Unicode 5.0.0. In
            Unicode 7.0.0. General category Lm (Letter
            modifier). DerivedCoreProperties.txt says it is
            “Alphabetic”. Apparently added manually to punct by mistake in
            glibc’s old LC_CTYPE.
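
The “Alphabetic” lookup referred to in these messages can be sketched as
follows. This is an illustration under assumptions, not the code from
ctype-compatibility.py: load_property is a hypothetical helper, and the
two sample lines are written in the DerivedCoreProperties.txt line format
(code point or range, semicolon, property name, optional “#” comment)
rather than read from the real file.

```python
import re

# "XXXX" or "XXXX..YYYY", then ";", then the property name.
_LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def load_property(lines, prop="Alphabetic"):
    """Return the set of code points that carry the property `prop`."""
    code_points = set()
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        match = _LINE.match(line)
        if match and match.group(3) == prop:
            start = int(match.group(1), 16)
            end = int(match.group(2) or match.group(1), 16)
            code_points.update(range(start, end + 1))
    return code_points


# Sample input in the file's format (assumption: illustrative lines only).
sample = [
    "0041..005A    ; Alphabetic # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "A67F          ; Alphabetic # Lm       CYRILLIC PAYEROK",
]
alphabetic = load_property(sample)
print(0xA67F in alphabetic)
```

With the real file one would pass the open file object instead of the
sample list; a character that is in this set but listed under punct in
the old LC_CTYPE is the kind of inconsistency reported above.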

Another example:

error: 0x9f4 ৴ alpha True:
            “09F4;BENGALI CURRENCY NUMERATOR ONE;No;0;L;;;;1/16;N;;;;;”
            “09F5;BENGALI CURRENCY NUMERATOR TWO;No;0;L;;;;1/8;N;;;;;”
            “09F6;BENGALI CURRENCY NUMERATOR THREE;No;0;L;;;;3/16;N;;;;;”
            “09F7;BENGALI CURRENCY NUMERATOR FOUR;No;0;L;;;;1/4;N;;;;;”
            “09F8;BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR;No;0;L;;;;3/4;N;;;;;”
            “09F9;BENGALI CURRENCY DENOMINATOR SIXTEEN;No;0;L;;;;16;N;;;;;”
            “09FA;BENGALI ISSHAR;So;0;L;;;;;N;;;;;”
            According to DerivedCoreProperties.txt (7.0.0) these are *not*
            “Alphabetic”.

So this has been mistakenly added to “alpha” in the old i18n file
of glibc (but gen-unicode-ctype.c correctly puts it into “punct”,
i.e. this seems to be another mistake made by manual editing).
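
The general category that makes these entries suspicious as “alpha” can
be read directly from the UnicodeData.txt records quoted above (third
semicolon-separated field). The sketch below is a simplified first-pass
check only: parse_record is a hypothetical helper, and treating only
categories that start with “L” as letters is an assumption that is much
cruder than what gen-unicode-ctype.c or DerivedCoreProperties.txt do.

```python
def parse_record(line):
    """Split one UnicodeData.txt record into (code point, name, category)."""
    fields = line.split(";")
    return int(fields[0], 16), fields[1], fields[2]


# Two of the records quoted in the report above.
records = [
    "09F4;BENGALI CURRENCY NUMERATOR ONE;No;0;L;;;;1/16;N;;;;;",
    "09FA;BENGALI ISSHAR;So;0;L;;;;;N;;;;;",
]

for line in records:
    code_point, name, category = parse_record(line)
    # Assumption: only categories starting with "L" are treated as letters.
    is_letter = category.startswith("L")
    print(f"0x{code_point:x} {name}: category {category}, letter={is_letter}")
```

Both records have a non-letter category (No and So), which is consistent
with the report that they should not be in “alpha”.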

Some of the errors reported by ctype-compatibility.py, such as

error: 0x250 ɐ lower False: Should be lower in Unicode 7.0.0 (was not lower in
            Unicode 5.0.0).

would be fixed by using gen-unicode-ctype.c with Unicode 7.0.0 input.

There are many more problems like this in the old i18n file;
my tests found 133 errors in total:

------------------------------------------------------------
Old file = /local/mfabian/src/glibc/localedata/locales/i18n
Number of errors in old file = 133
------------------------------------------------------------

I’ll attach the full report.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
>>From glibc-bugs-return-26529-listarch-glibc-bugs=sources.redhat.com@sourceware.org Thu Nov 06 11:02:11 2014
Return-Path: <glibc-bugs-return-26529-listarch-glibc-bugs=sources.redhat.com@sourceware.org>
Delivered-To: listarch-glibc-bugs@sources.redhat.com
Received: (qmail 8397 invoked by alias); 6 Nov 2014 11:02:10 -0000
Mailing-List: contact glibc-bugs-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <glibc-bugs.sourceware.org>
List-Subscribe: <mailto:glibc-bugs-subscribe@sourceware.org>
List-Post: <mailto:glibc-bugs@sourceware.org>
List-Help: <mailto:glibc-bugs-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: glibc-bugs-owner@sourceware.org
Delivered-To: mailing list glibc-bugs@sourceware.org
Received: (qmail 8327 invoked by uid 48); 6 Nov 2014 11:02:07 -0000
From: "maiku.fabian at gmail dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug localedata/14094] Update locale data to Unicode 7.0.0
Date: Thu, 06 Nov 2014 11:02:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: glibc
X-Bugzilla-Component: localedata
X-Bugzilla-Version: 2.21
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: maiku.fabian at gmail dot com
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: pravin.d.s at gmail dot com
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: security-
X-Bugzilla-Changed-Fields: attachments.created
Message-ID: <bug-14094-131-k9LxqlwtOp@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-14094-131@http.sourceware.org/bugzilla/>
References: <bug-14094-131@http.sourceware.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2014-11/txt/msg00021.txt.bz2
Content-length: 505

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #20 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Created attachment 7907
  --> https://sourceware.org/bugzilla/attachment.cgi?id=7907&action=edit
unicode-5.0.0-report-full-output

Full report from ctype-compatibility.py when comparing the old i18n
file in glibc with the file generated by gen-unicode-ctype.c using
UnicodeData.txt from Unicode 5.0.0.
