From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <glibc-bugs-return-26532-listarch-glibc-bugs=sources.redhat.com@sourceware.org>
Received: (qmail 26835 invoked by alias); 6 Nov 2014 11:45:38 -0000
Mailing-List: contact glibc-bugs-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <glibc-bugs.sourceware.org>
List-Subscribe: <mailto:glibc-bugs-subscribe@sourceware.org>
List-Post: <mailto:glibc-bugs@sourceware.org>
List-Help: <mailto:glibc-bugs-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: glibc-bugs-owner@sourceware.org
Received: (qmail 26752 invoked by uid 48); 6 Nov 2014 11:45:32 -0000
From: "maiku.fabian at gmail dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug localedata/14094] Update locale data to Unicode 7.0.0
Date: Thu, 06 Nov 2014 11:45:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: glibc
X-Bugzilla-Component: localedata
X-Bugzilla-Version: 2.21
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: maiku.fabian at gmail dot com
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: pravin.d.s at gmail dot com
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: security-
X-Bugzilla-Changed-Fields:
Message-ID: <bug-14094-131-5W0eiTpcw0@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-14094-131@http.sourceware.org/bugzilla/>
References: <bug-14094-131@http.sourceware.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2014-11/txt/msg00024.txt.bz2

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #23 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Now Pravin’s approach in the patch attached to comment #15
is to comment out the generation of “upper”, “lower”,
and “alpha” in gen-unicode-ctype.c and to add another
script, gen-unicode-ctype-dcp.py, which adds these classes.

But this is a bit problematic.

1) It does not put digits like

   alpha: Missing: ٠ 0x660 ARABIC-INDIC DIGIT ZERO

into “alpha”, which gen-unicode-ctype.c would have done.
gen-unicode-ctype.c contains the comment

      /* Consider all the non-ASCII digits as alphabetic.
         ISO C 99 forbids us to have them in category "digit",
         but we want iswalnum to return true on them.  */

which sounds reasonable.

2) It does not put characters like

    lower: Missing: ǅ 0x1c5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON

into “lower”. This is actually title case, not lower case,
but glibc has only the classes “lower” and “upper”, not “title”,
although it does have the mappings “toupper”, “tolower”, and “totitle”.

gen-unicode-ctype.c puts characters which change when “toupper”
is applied into “lower” and characters which change when “tolower”
is applied into “upper”. Therefore, gen-unicode-ctype.c
puts title case characters like ǅ 0x1c5 into *both* “upper” *and*
“lower”, which seems reasonable if glibc has no “title”.

3) It does not put some characters like

    upper: Missing: ᾈ 0x1f88 GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI

into “upper”. Surprisingly,

“U+1F88 ᾈ GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI”
is *not* listed as “Uppercase” in
http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt .

Yet U+1F88 seems to be uppercase according to
http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
because it has a tolower mapping to U+1F80:

    1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;Ll;0;L;1F00 0345;;;;N;;;1F88;;1F88
    1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI;Lt;0;L;1F08 0345;;;;N;;;;1F80;

So this might be a bug in DerivedCoreProperties.txt.

Generating “upper” and “lower” the way gen-unicode-ctype.c does,
i.e. using only UnicodeData.txt and checking whether characters
change when mapped to upper or lower case, does not produce this
error. I think the approach gen-unicode-ctype.c uses for “upper”
and “lower” is fine; it is not necessary to use
DerivedCoreProperties.txt for this.
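
A minimal Python sketch of the mapping-based classification described in 2) and 3). This is not the actual gen-unicode-ctype.c logic, only an illustration: the function names are made up, and it assumes the standard 15-field layout of UnicodeData.txt with the simple uppercase and lowercase mappings in fields 12 and 13. The sample data are the two records quoted above.

```python
# The two UnicodeData.txt records quoted above.
SAMPLE_UNICODE_DATA = """\
1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;Ll;0;L;1F00 0345;;;;N;;;1F88;;1F88
1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI;Lt;0;L;1F08 0345;;;;N;;;;1F80;
"""


def parse_case_mappings(text):
    """Return {code point: (to_upper, to_lower)} for each record in text.

    A character without an explicit mapping maps to itself.
    """
    mappings = {}
    for line in text.splitlines():
        fields = line.split(";")
        if len(fields) != 15:
            continue  # skip anything that is not a complete record
        code = int(fields[0], 16)
        to_upper = int(fields[12], 16) if fields[12] else code
        to_lower = int(fields[13], 16) if fields[13] else code
        mappings[code] = (to_upper, to_lower)
    return mappings


def is_upper(code, mappings):
    """A character is put into "upper" if tolower changes it."""
    return mappings[code][1] != code


def is_lower(code, mappings):
    """A character is put into "lower" if toupper changes it."""
    return mappings[code][0] != code


mappings = parse_case_mappings(SAMPLE_UNICODE_DATA)
for code in sorted(mappings):
    print(f"U+{code:04X} upper={is_upper(code, mappings)} "
          f"lower={is_lower(code, mappings)}")
```

With this rule U+1F88 ends up in “upper” without consulting DerivedCoreProperties.txt, and a character that has both an upper and a lower mapping would end up in both classes.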

4) *Many* characters end up being in “alpha” *and* “punct”.

For example:

    error: ⷶ 0x2df6 is alpha and punct

gen-unicode-ctype.c has the comment:

      /* alpha restriction: "No character specified for the keywords cntrl,
         digit, punct or space shall be specified."  */

This restriction is violated because the second script,
gen-unicode-ctype-dcp.py, used in Pravin’s 2-pass approach does not
check whether gen-unicode-ctype.c has already put a character into
“punct” before putting it into “alpha”.

The character “ⷶ U+2DF6 COMBINING CYRILLIC LETTER A” is “Alphabetic”
according to DerivedCoreProperties.txt:

    2DE0..2DFF    ; Alphabetic # Mn  [32] COMBINING CYRILLIC LETTER BE..COMBINING CYRILLIC LETTER IOTIFIED BIG YUS

So Pravin’s script rightly puts it into “alpha”.
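
The conflict in 4) can be illustrated with a short Python sketch. It is not taken from gen-unicode-ctype-dcp.py; the function name and the regular expression are assumptions, and the “punct” set is a stand-in for whatever the first pass produced. It parses the DerivedCoreProperties.txt line quoted above and reports the overlap instead of silently adding to “alpha”.

```python
import re

# The DerivedCoreProperties.txt line quoted above.
SAMPLE_DCP = (
    "2DE0..2DFF    ; Alphabetic # Mn  [32] COMBINING CYRILLIC LETTER "
    "BE..COMBINING CYRILLIC LETTER IOTIFIED BIG YUS\n"
)

# "XXXX ; Property" or "XXXX..YYYY ; Property", ignoring the trailing comment.
LINE_RE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def parse_property(text, wanted):
    """Return the set of code points that have the property `wanted`."""
    chars = set()
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match or match.group(3) != wanted:
            continue
        start = int(match.group(1), 16)
        end = int(match.group(2), 16) if match.group(2) else start
        chars.update(range(start, end + 1))
    return chars


alphabetic = parse_property(SAMPLE_DCP, "Alphabetic")

# Stand-in for the first pass: assume it already put U+2DF6 into "punct".
punct = {0x2DF6}

conflicts = alphabetic & punct
for code in sorted(conflicts):
    print(f"error: {chr(code)} {code:#x} is alpha and punct")
```

How such a conflict should be resolved is a separate decision; the point is only that the check needs both classes in one place.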

But looking at this, it does not seem a good idea to have two independent
programs generating the file in two independent passes.

Verifications like the ones gen-unicode-ctype.c does:

      /* toupper restriction: "Only characters specified for the keywords
         lower and upper shall be specified.  */
      ...
      /* tolower restriction: "Only characters specified for the keywords
         lower and upper shall be specified.  */
      ...
      /* alpha restriction: "Characters classified as either upper or lower
         shall automatically belong to this class.  */
      ...
      /* alpha restriction: "No character specified for the keywords cntrl,
         digit, punct or space shall be specified."  */
      ...
      /* space restriction: "No character specified for the keywords upper,
         lower, alpha, digit, graph or xdigit shall be specified."
         upper, lower, alpha already checked above.  */
      ...
      /* cntrl restriction: "No character specified for the keywords upper,
         lower, alpha, digit, punct, graph, print or xdigit shall be
         specified."  upper, lower, alpha already checked above.  */
      ...

can be done much more easily when using a single program.
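
As a rough sketch of what such verification could look like in a single Python program (this is not a port of the C code; the function name, the data layout and the message wording are assumptions): the classes are sets of code points, and the toupper and tolower mappings contain only the characters that actually change.

```python
def check_restrictions(classes, toupper, tolower):
    """Return a list of error strings; an empty list means all checks passed.

    classes maps a class name to a set of code points. toupper and tolower
    map only those code points whose mapping differs from the character.
    """
    def cls(name):
        return classes.get(name, set())

    errors = []
    cased = cls("upper") | cls("lower")

    # toupper/tolower restriction: only upper or lower characters are mapped.
    for mapping_name, mapping in (("toupper", toupper), ("tolower", tolower)):
        for code in sorted(set(mapping) - cased):
            errors.append(f"{code:#x} has a {mapping_name} mapping "
                          "but is neither upper nor lower")

    # alpha restriction: upper and lower automatically belong to alpha.
    for code in sorted(cased - cls("alpha")):
        errors.append(f"{code:#x} is upper or lower but not alpha")

    # Exclusion restrictions, as listed in the comments quoted above.
    exclusions = (
        ("alpha", ("cntrl", "digit", "punct", "space")),
        ("space", ("upper", "lower", "alpha", "digit", "graph", "xdigit")),
        ("cntrl", ("upper", "lower", "alpha", "digit", "punct", "graph",
                   "print", "xdigit")),
    )
    seen = set()
    for name, forbidden in exclusions:
        for other in forbidden:
            for code in sorted(cls(name) & cls(other)):
                key = (code, frozenset((name, other)))
                if key not in seen:  # report each conflicting pair once
                    seen.add(key)
                    errors.append(f"{code:#x} is {name} and {other}")
    return errors


# Tiny made-up data set reproducing the alpha/punct conflict from 4).
classes = {
    "upper": {0x41},
    "lower": {0x61},
    "alpha": {0x41, 0x61, 0x2DF6},
    "digit": {0x30},
    "punct": {0x21, 0x2DF6},
    "space": {0x20},
    "cntrl": {0x0A},
}
errors = check_restrictions(classes, toupper={0x61: 0x41}, tolower={0x41: 0x61})
print(errors)
```

Because all classes are available in one place, every restriction is a simple set operation, which is what a 2-pass approach with two independent programs cannot do easily.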

-- 
You are receiving this mail because:
You are on the CC list for the bug.
>>From glibc-bugs-return-26533-listarch-glibc-bugs=sources.redhat.com@sourceware.org Thu Nov 06 11:56:09 2014
Return-Path: <glibc-bugs-return-26533-listarch-glibc-bugs=sources.redhat.com@sourceware.org>
Delivered-To: listarch-glibc-bugs@sources.redhat.com
Received: (qmail 3091 invoked by alias); 6 Nov 2014 11:56:09 -0000
Mailing-List: contact glibc-bugs-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <glibc-bugs.sourceware.org>
List-Subscribe: <mailto:glibc-bugs-subscribe@sourceware.org>
List-Post: <mailto:glibc-bugs@sourceware.org>
List-Help: <mailto:glibc-bugs-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: glibc-bugs-owner@sourceware.org
Delivered-To: mailing list glibc-bugs@sourceware.org
Received: (qmail 2807 invoked by uid 48); 6 Nov 2014 11:56:03 -0000
From: "maiku.fabian at gmail dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug localedata/14094] Update locale data to Unicode 7.0.0
Date: Thu, 06 Nov 2014 11:56:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: glibc
X-Bugzilla-Component: localedata
X-Bugzilla-Version: 2.21
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: maiku.fabian at gmail dot com
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: pravin.d.s at gmail dot com
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: security-
X-Bugzilla-Changed-Fields:
Message-ID: <bug-14094-131-wx3Fq7dU00@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-14094-131@http.sourceware.org/bugzilla/>
References: <bug-14094-131@http.sourceware.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2014-11/txt/msg00025.txt.bz2
Content-length: 967

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #24 from Mike FABIAN <maiku.fabian at gmail dot com> ---
So I think we should do either:

1) improve gen-unicode-ctype.c and make it use
   DerivedCoreProperties.txt for “alpha”

or:

2) rewrite gen-unicode-ctype.c in Python:
   first a rewrite which produces *exactly* the same
   output as gen-unicode-ctype.c, then add code
   to use DerivedCoreProperties.txt for “alpha”

No matter whether extending the C-Program or writing a Python program,
it should be a single program to be able to verify the restrictions
mentioned easily.

It would of course be nice to make the program read in the old i18n
file, replace the character classes, and write out a new file which
keeps the rest of the original file, so that no manual copy & paste of
the generated character classes is necessary.
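
A minimal sketch of such an in-place replacement in Python. The assumed layout is a guess, not a description of the real i18n file: a class definition starts with a line consisting of the class name followed by " /", every continuation line ends with "/", and the last line of the definition does not. The function name is hypothetical.

```python
def replace_class(text, name, new_body_lines):
    """Replace the definition of character class `name` in locale source text.

    Everything outside the definition (comments, other classes, other
    categories) is copied unchanged.
    """
    lines = text.split("\n")
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        starts_class = (line.split()[:1] == [name]
                        and line.rstrip().endswith("/"))
        if starts_class:
            # Skip the header and all continuation lines (they end in "/") ...
            while i < len(lines) and lines[i].rstrip().endswith("/"):
                i += 1
            # ... and the final line of the old definition.
            i += 1
            out.append(f"{name} /")
            out.extend(new_body_lines)
        else:
            out.append(line)
            i += 1
    return "\n".join(out)


# Made-up miniature input in the assumed layout.
old_text = (
    "LC_CTYPE\n"
    "% comment kept\n"
    "upper /\n"
    "   <U0041>..<U0043>;/\n"
    "   <U00C0>\n"
    "lower /\n"
    "   <U0061>\n"
    "END LC_CTYPE\n"
)
new_text = replace_class(old_text, "upper", ["   <U0041>..<U005A>"])
print(new_text)
```

Matching on the first word of the line keeps “upper” from accidentally matching “toupper”.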

>>From glibc-bugs-return-26534-listarch-glibc-bugs=sources.redhat.com@sourceware.org Thu Nov 06 12:00:30 2014
Return-Path: <glibc-bugs-return-26534-listarch-glibc-bugs=sources.redhat.com@sourceware.org>
Delivered-To: listarch-glibc-bugs@sources.redhat.com
Received: (qmail 7412 invoked by alias); 6 Nov 2014 12:00:29 -0000
Mailing-List: contact glibc-bugs-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <glibc-bugs.sourceware.org>
List-Subscribe: <mailto:glibc-bugs-subscribe@sourceware.org>
List-Post: <mailto:glibc-bugs@sourceware.org>
List-Help: <mailto:glibc-bugs-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: glibc-bugs-owner@sourceware.org
Delivered-To: mailing list glibc-bugs@sourceware.org
Received: (qmail 7338 invoked by uid 48); 6 Nov 2014 12:00:26 -0000
From: "maiku.fabian at gmail dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug localedata/14094] Update locale data to Unicode 7.0.0
Date: Thu, 06 Nov 2014 12:00:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: glibc
X-Bugzilla-Component: localedata
X-Bugzilla-Version: 2.21
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: maiku.fabian at gmail dot com
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: pravin.d.s at gmail dot com
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: security-
X-Bugzilla-Changed-Fields:
Message-ID: <bug-14094-131-U4qhW3ij2p@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-14094-131@http.sourceware.org/bugzilla/>
References: <bug-14094-131@http.sourceware.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2014-11/txt/msg00026.txt.bz2
Content-length: 539

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #25 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Mike FABIAN from comment #24)

> No matter whether extending the C-Program or writing a Python program,
> it should be a single program to be able to verify the restrictions
> mentioned easily.

And as a second pass, after the single program has generated the character
class data, use ctype-compatibility.py as a "test suite".
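
To illustrate the kind of check meant here (this is not code from ctype-compatibility.py; the function name is made up, and the message format only imitates the "Missing:" lines quoted in comment #23), a compatibility pass could compare the old and the newly generated class data like this:

```python
import unicodedata


def report_missing(class_name, old, new):
    """List characters that were in the old class but are not in the new one."""
    return [
        f"{class_name}: Missing: {chr(code)} {code:#x} "
        f"{unicodedata.name(chr(code), 'UNKNOWN')}"
        for code in sorted(old - new)
    ]


# Made-up old and new "alpha" sets; the new one lacks U+0660.
old_alpha = {0x41, 0x660}
new_alpha = {0x41}
report = report_missing("alpha", old_alpha, new_alpha)
for line in report:
    print(line)
```

An empty report for every class would mean the new generator is at least backwards compatible with the old data.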
