public inbox for glibc-bugs@sourceware.org
From: "maiku.fabian at gmail dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug localedata/14094] Update locale data to Unicode 7.0.0
Date: Thu, 06 Nov 2014 11:45:00 -0000
Message-ID: <bug-14094-131-5W0eiTpcw0@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-14094-131@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #23 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Pravin’s approach in the patch attached to comment #15
is to comment out the generation of “upper”, “lower”
and “alpha” in gen-unicode-ctype.c and to add a second
script, gen-unicode-ctype-dcp.py, which generates these classes.

But this is problematic in several ways.

1) it does not put digits like

   alpha: Missing: ٠ 0x660 ARABIC-INDIC DIGIT ZERO

into “alpha”, which gen-unicode-ctype.c would have done.
gen-unicode-ctype.c contains the comment

      /* Consider all the non-ASCII digits as alphabetic.
         ISO C 99 forbids us to have them in category "digit",
         but we want iswalnum to return true on them.  */

which sounds reasonable.
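
The rule in that comment can be sketched in Python as follows. This is only an illustration, not code from either script: the helper name counts_as_alpha is made up here, the "any letter category" test is a simplification of what a real generator would do, and the result depends on the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def counts_as_alpha(ch):
    """Hypothetical helper: decide whether ch would go into "alpha".

    Non-ASCII decimal digits (general category Nd) are treated as
    alphabetic so that iswalnum() returns true for them, while the
    ASCII digits 0-9 stay out of "alpha" because they are in "digit".
    """
    category = unicodedata.category(ch)
    if category == "Nd":
        return not ("0" <= ch <= "9")
    # Simplification for this sketch: every letter category is alphabetic.
    return category.startswith("L")


for ch in ("\u0660", "0", "a"):
    print("U+%04X alpha=%s" % (ord(ch), counts_as_alpha(ch)))
```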

2) it does not put characters like

    lower: Missing: Dž 0x1c5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON

into “lower”. This character is actually title case, not lower case,
but glibc has only the classes “lower” and “upper”, not “title”,
although it does have the mappings “toupper”, “tolower”, and “totitle”.

gen-unicode-ctype.c puts characters which change when “toupper”
is applied into “lower” and characters which change when “tolower”
is applied into “upper”. Therefore, gen-unicode-ctype.c
puts title case characters like Dž 0x1c5 into *both* “upper” *and*
“lower”, which seems reasonable given that glibc has no “title” class.
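
A minimal Python sketch of this classification rule, using a hand-written mapping table for the three characters U+01C4, U+01C5 and U+01C6 only. The table and the function names are assumptions made for this illustration; a real generator would read the mappings from UnicodeData.txt.

```python
# Simple case mappings for the DŽ / Dž / dž family, written out by hand
# for this sketch (not read from UnicodeData.txt).
TOUPPER = {0x01C5: 0x01C4, 0x01C6: 0x01C4}
TOLOWER = {0x01C4: 0x01C6, 0x01C5: 0x01C6}


def to_upper(cp):
    return TOUPPER.get(cp, cp)


def to_lower(cp):
    return TOLOWER.get(cp, cp)


def is_lower(cp):
    # "lower": the character changes when toupper is applied.
    return to_upper(cp) != cp


def is_upper(cp):
    # "upper": the character changes when tolower is applied.
    return to_lower(cp) != cp


for cp in (0x01C4, 0x01C5, 0x01C6):
    print("U+%04X upper=%s lower=%s" % (cp, is_upper(cp), is_lower(cp)))
```

The title case character U+01C5 changes under both mappings, so it lands in both classes.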

3) it does not put some characters like:

    upper: Missing: ᾈ 0x1f88 GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI

into “upper”. Surprisingly,

“U+1F88 ᾈ GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI”
is *not* listed as “Uppercase” in
http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt .

U+1F88 does seem to be uppercase, however, according to
http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
because it has a tolower mapping to U+1F80:

    1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;Ll;0;L;1F00 0345;;;;N;;;1F88;;1F88
    1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI;Lt;0;L;1F08 0345;;;;N;;;;1F80;

So this might be a bug in DerivedCoreProperties.txt.

Generating “upper” and “lower” the way gen-unicode-ctype.c does,
i.e. using only UnicodeData.txt and checking whether characters
change when they are mapped to upper or to lower case, does not
produce this error. I think the approach gen-unicode-ctype.c uses
for “upper” and “lower” is fine; it is not necessary to use
DerivedCoreProperties.txt for this.
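
As a rough sketch of that approach, the following Python reads the two UnicodeData.txt records quoted above and derives “upper” and “lower” purely from the simple case mappings (fields 12 and 13, counting from 0). The function names are invented for this example and the input is only the two-line excerpt, not the full file. Note that the quoted record gives U+1F88 the general category Lt.

```python
UNICODE_DATA_EXCERPT = """\
1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;Ll;0;L;1F00 0345;;;;N;;;1F88;;1F88
1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI;Lt;0;L;1F08 0345;;;;N;;;;1F80;
"""


def parse_case_mappings(text):
    """Return (toupper, tolower) dicts built from UnicodeData.txt records."""
    toupper, tolower = {}, {}
    for line in text.splitlines():
        fields = line.split(";")
        if len(fields) != 15:
            continue  # not a complete UnicodeData.txt record
        cp = int(fields[0], 16)
        if fields[12]:  # simple uppercase mapping
            toupper[cp] = int(fields[12], 16)
        if fields[13]:  # simple lowercase mapping
            tolower[cp] = int(fields[13], 16)
    return toupper, tolower


def derive_upper_lower(text):
    toupper, tolower = parse_case_mappings(text)
    # "upper": changes under tolower; "lower": changes under toupper.
    upper = {cp for cp, mapped in tolower.items() if mapped != cp}
    lower = {cp for cp, mapped in toupper.items() if mapped != cp}
    return upper, lower


upper, lower = derive_upper_lower(UNICODE_DATA_EXCERPT)
print("upper:", sorted(hex(cp) for cp in upper))
print("lower:", sorted(hex(cp) for cp in lower))
```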

4) *many* characters end up being in “alpha” *and* “punct”

For example:

    error: ⷶ 0x2df6 is alpha and punct

gen-unicode-ctype.c has the comment:

      /* alpha restriction: "No character specified for the keywords cntrl,
         digit, punct or space shall be specified."  */

This restriction is violated because the second script,
gen-unicode-ctype-dcp.py, used in Pravin’s 2-pass approach does not
check whether gen-unicode-ctype.c has already put a character into
“punct” before putting it into “alpha”.

The character “ⷶ U+2df6 COMBINING CYRILLIC LETTER A” is “Alphabetic”
according to DerivedCoreProperties.txt:

    2DE0..2DFF    ; Alphabetic # Mn  [32] COMBINING CYRILLIC LETTER BE..COMBINING CYRILLIC LETTER IOTIFIED BIG YUS

So Pravin’s script rightly puts it into “alpha”.

But looking at this, it does not seem to be a good idea to have two
independent programs generating the file in 2 independent passes.

Verifications like the ones gen-unicode-ctype.c performs:

      /* toupper restriction: "Only characters specified for the keywords
         lower and upper shall be specified.  */
      ...
      /* tolower restriction: "Only characters specified for the keywords
         lower and upper shall be specified.  */
      ...
      /* alpha restriction: "Characters classified as either upper or lower
         shall automatically belong to this class.  */
      ...
      /* alpha restriction: "No character specified for the keywords cntrl,
         digit, punct or space shall be specified."  */
      ...
      /* space restriction: "No character specified for the keywords upper,
         lower, alpha, digit, graph or xdigit shall be specified."
         upper, lower, alpha already checked above.  */
      ...
      /* cntrl restriction: "No character specified for the keywords upper,
         lower, alpha, digit, punct, graph, print or xdigit shall be
         specified."  upper, lower, alpha already checked above.  */
      ...

can be done much more easily in a single program.
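
To illustrate why a single program makes this easy, here is a small Python sketch that checks some of the restrictions quoted above against a dictionary of character classes held in memory. The data layout (class name mapped to a set of code points), the table DISJOINT and the function name check_restrictions are assumptions for this example, and only the alpha, space and cntrl restrictions are covered.

```python
# For each class, the classes it must not share any character with,
# taken from the "alpha", "space" and "cntrl" restrictions quoted above.
DISJOINT = [
    ("alpha", ("cntrl", "digit", "punct", "space")),
    ("space", ("upper", "lower", "alpha", "digit", "graph", "xdigit")),
    ("cntrl", ("upper", "lower", "alpha", "digit", "punct", "graph",
               "print", "xdigit")),
]


def check_restrictions(classes):
    """Return a list of error strings; classes maps name -> set of ints."""
    errors = []
    for name, forbidden in DISJOINT:
        members = classes.get(name, set())
        for other in forbidden:
            for cp in sorted(members & classes.get(other, set())):
                errors.append("0x%x is %s and %s" % (cp, name, other))
    # "Characters classified as either upper or lower shall
    # automatically belong to this class [alpha]."
    alpha = classes.get("alpha", set())
    for case in ("upper", "lower"):
        for cp in sorted(classes.get(case, set()) - alpha):
            errors.append("0x%x is %s but not alpha" % (cp, case))
    return errors


example = {
    "alpha": {0x61, 0x2DF6},
    "lower": {0x61},
    "punct": {0x21, 0x2DF6},
}
for message in check_restrictions(example):
    print("error:", message)
```

With all classes available in one place, a conflict such as U+2DF6 being in both “alpha” and “punct” is reported before anything is written out.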

-- 
You are receiving this mail because:
You are on the CC list for the bug.
Date: Thu, 06 Nov 2014 11:56:00 -0000

--- Comment #24 from Mike FABIAN <maiku.fabian at gmail dot com> ---
So I think we should do either:

1) improve gen-unicode-ctype.c and make it use
   DerivedCoreProperties.txt for “alpha”

or:

2) rewrite gen-unicode-ctype.c in Python:
   first a rewrite which produces *exactly* the same
   output as gen-unicode-ctype.c, then add code
   to use DerivedCoreProperties.txt for “alpha”

Whether we extend the C program or write a Python program,
it should be a single program, so that the restrictions
mentioned above can be verified easily.

It would of course be nice if the program read in the old i18n
file, replaced the character classes, and wrote out a new file which
keeps the rest of the original file, so that no manual copy and paste of
the generated character classes is necessary.
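
One possible shape for such an in-place replacement, as a Python sketch. It assumes that a class block in the i18n file starts with a line consisting of the class name followed by “/” and continues for as long as lines end in “/”; comments inside a block and other details of the real file are not handled. The function name replace_class is made up for this example.

```python
def replace_class(text, name, new_lines):
    """Replace the block of character class `name` and keep everything else.

    Assumed block layout: a header line "<name> /", then continuation
    lines ending in "/", then one final line without a trailing "/".
    """
    lines = text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        if lines[i].split() == [name, "/"]:
            # Skip the old header and all continuation lines ...
            while i < len(lines) and lines[i].rstrip().endswith("/"):
                i += 1
            # ... and the final line of the old block.
            i += 1
            out.append(name + " /")
            out.extend(new_lines)
        else:
            out.append(lines[i])
            i += 1
    return "\n".join(out) + "\n"


OLD = """\
LC_CTYPE
upper /
   <U0041>..<U005A>;/
   <U00C0>..<U00D6>
lower /
   <U0061>..<U007A>
END LC_CTYPE
"""

print(replace_class(OLD, "upper", ["   <U0041>..<U005A>"]), end="")
```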

Date: Thu, 06 Nov 2014 12:00:00 -0000
--- Comment #25 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Mike FABIAN from comment #24)

> No matter whether extending the C-Program or writing a Python program,
> it should be a single program to be able to verify the restrictions
> mentioned easily.

And as a second pass, after the single program has generated the
character class data, ctype-compatibility.py can be used as a "test-suite".


