public inbox for libc-locales@sourceware.org
* [Bug localedata/14094] New: Update locale data to Unicode 6.1
@ 2012-05-10 21:06 jsm28 at gcc dot gnu.org
  2012-05-11  7:28 ` [Bug localedata/14094] " bugdal at aerifal dot cx
                   ` (47 more replies)
  0 siblings, 48 replies; 49+ messages in thread
From: jsm28 at gcc dot gnu.org @ 2012-05-10 21:06 UTC (permalink / raw)
  To: libc-locales

http://sourceware.org/bugzilla/show_bug.cgi?id=14094

             Bug #: 14094
           Summary: Update locale data to Unicode 6.1
           Product: glibc
           Version: 2.15
            Status: NEW
          Severity: normal
          Priority: P2
         Component: localedata
        AssignedTo: unassigned@sourceware.org
        ReportedBy: jsm28@gcc.gnu.org
                CC: libc-locales@sources.redhat.com
    Classification: Unclassified


The Unicode locale data - character map and LC_CTYPE information - should be
updated to Unicode 6.1 (the character map is currently based on 6.0, and
LC_CTYPE is currently based on 5.0).  This should be done with proper
automation, and wiki documentation should be added describing how to do
future updates.  I identified the following tasks at
<http://sourceware.org/ml/libc-alpha/2012-05/msg00590.html>:

* Ensure the character type data in localedata/charmaps/i18n can be
  properly reproduced from Unicode 5.0 data using gen-unicode-ctype.c,
  adapting gen-unicode-ctype.c as needed to replicate any changes that
  may have been made not using that program.

* Update the character type data to Unicode 6.1, removing any local
  hacks from gen-unicode-ctype.c that are no longer needed.
  (10646:2012, corresponding to Unicode 6.1, appears to be in
  publication stage so should be out very soon.)

* Ensure the character data in localedata/charmaps/UTF-8 can be
  reproduced in some automated fashion from Unicode 6.0, locating any
  previously used automation for this or creating some new automation
  if any previous automation can't be found.

* Update the character data to Unicode 6.1, removing any local hacks
  in the automation from the previous step.

* Document thoroughly on the wiki how the automation works and how to
  do updates to new Unicode versions.

-- 
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.


* [Bug localedata/14094] Update locale data to Unicode 6.1
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
@ 2012-05-11  7:28 ` bugdal at aerifal dot cx
  2013-11-26 17:07 ` myllynen at redhat dot com
                   ` (46 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: bugdal at aerifal dot cx @ 2012-05-11  7:28 UTC (permalink / raw)
  To: libc-locales

http://sourceware.org/bugzilla/show_bug.cgi?id=14094

Rich Felker <bugdal at aerifal dot cx> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bugdal at aerifal dot cx

--- Comment #1 from Rich Felker <bugdal at aerifal dot cx> 2012-05-11 03:25:47 UTC ---
One of the major "local hacks" can be fixed, fixing many other problems at the
same time, by switching to using the Unicode "Alphabetic" property (from
DerivedCoreProperties.txt) instead of just categories L* for class alpha. Right
now there are many languages whose letters are considered non-alphabetic by
glibc because they're in category Mn or Mc or even Cf. There are "local hacks"
to fix this for maybe one or two languages, but using the right Unicode
property would fix it for all languages.
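
A minimal sketch of the mismatch described above, assuming Python's bundled
unicodedata module as a stand-in for the Unicode data files (it is not glibc
code, and the choice of sample characters is only illustrative). Two
Devanagari vowel signs are used; a test based on categories L* alone rejects
both:

```python
import unicodedata

# Two Devanagari vowel signs, chosen only as illustrative samples.
# Their general categories are Mc and Mn, not L*.
SAMPLES = ["\u093e", "\u0941"]


def is_letter_category(ch):
    """True when the general category of ch starts with 'L'."""
    return unicodedata.category(ch).startswith("L")


for ch in SAMPLES:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"category={unicodedata.category(ch)}, "
          f"L*={is_letter_category(ch)}")
```

A class alpha built from the "Alphabetic" property would instead take its
membership from DerivedCoreProperties.txt rather than from the category field.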


* [Bug localedata/14094] Update locale data to Unicode 6.1
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
  2012-05-11  7:28 ` [Bug localedata/14094] " bugdal at aerifal dot cx
@ 2013-11-26 17:07 ` myllynen at redhat dot com
  2014-02-18 10:12 ` pravin.d.s at gmail dot com
                   ` (45 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: myllynen at redhat dot com @ 2013-11-26 17:07 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

Marko Myllynen <myllynen at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |myllynen at redhat dot com


* [Bug localedata/14094] Update locale data to Unicode 6.1
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
  2012-05-11  7:28 ` [Bug localedata/14094] " bugdal at aerifal dot cx
  2013-11-26 17:07 ` myllynen at redhat dot com
@ 2014-02-18 10:12 ` pravin.d.s at gmail dot com
  2014-05-21 12:52 ` allan at archlinux dot org
                   ` (44 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: pravin.d.s at gmail dot com @ 2014-02-18 10:12 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

Pravin S <pravin.d.s at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pravin.d.s at gmail dot com


* [Bug localedata/14094] Update locale data to Unicode 6.1
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2014-02-18 10:12 ` pravin.d.s at gmail dot com
@ 2014-05-21 12:52 ` allan at archlinux dot org
  2014-05-21 12:52 ` johannes at kyriasis dot com
                   ` (43 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: allan at archlinux dot org @ 2014-05-21 12:52 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

Allan McRae <allan at archlinux dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |allan at archlinux dot org


* [Bug localedata/14094] Update locale data to Unicode 6.1
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2014-05-21 12:52 ` allan at archlinux dot org
@ 2014-05-21 12:52 ` johannes at kyriasis dot com
  2014-05-23  7:56 ` [Bug localedata/14094] Update locale data to Unicode 6.3 pravin.d.s at gmail dot com
                   ` (42 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: johannes at kyriasis dot com @ 2014-05-21 12:52 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

Johannes Löthberg <johannes at kyriasis dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |johannes at kyriasis dot com

--- Comment #2 from Johannes Löthberg <johannes at kyriasis dot com> ---
*** Bug 16969 has been marked as a duplicate of this bug. ***


* [Bug localedata/14094] Update locale data to Unicode 6.3
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2014-05-21 12:52 ` johannes at kyriasis dot com
@ 2014-05-23  7:56 ` pravin.d.s at gmail dot com
  2014-05-23 12:11 ` joseph at codesourcery dot com
                   ` (41 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: pravin.d.s at gmail dot com @ 2014-05-23  7:56 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

Pravin S <pravin.d.s at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Update locale data to       |Update locale data to
                   |Unicode 6.1                 |Unicode 6.3

--- Comment #3 from Pravin S <pravin.d.s at gmail dot com> ---
Rather than Unicode 6.1, it should be Unicode 6.3.

The two files mentioned in the bug are:
1. i18n (LC_CTYPE), which used to be generated by gen-unicode-ctype.c.
2. UTF-8 (it looks like a conversion from Unicode to UTF-8); I will find out.

Are there any other files also involved in upgrading glibc localedata to
Unicode 6.1?


* [Bug localedata/14094] Update locale data to Unicode 6.3
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2014-05-23  7:56 ` [Bug localedata/14094] Update locale data to Unicode 6.3 pravin.d.s at gmail dot com
@ 2014-05-23 12:11 ` joseph at codesourcery dot com
  2014-05-23 13:55 ` pravin.d.s at gmail dot com
                   ` (40 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: joseph at codesourcery dot com @ 2014-05-23 12:11 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #4 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---
Once the data is updated (maybe once just the character map is updated), 
__STDC_ISO_10646__ should be updated in include/stdc-predef.h to reflect 
the publication date of the edition or amendment to ISO 10646 
corresponding to the version of Unicode in use.

I advise keeping each of the tasks I listed as a separate patch, as it's 
important to be confident we aren't losing desired local changes in the 
course of the update (which means the existing files need to be reproduced 
exactly by some automation before the update is done).

Bug 16061 relates to transliteration data, some of which came from 
Unicode, and bug 14095 to collation data.  The same principles apply to 
those - reproduce the existing files, understanding any local changes in 
the process, then update to a newer Unicode version - but they are likely 
to involve much more work in understanding the existing state then 
updating while preserving any desired local changes.
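
The "reproduce the existing files exactly" step can be sketched as a plain
text comparison between the checked-in file and the regenerated one. This is
only an illustration: the function name and the two sample charmap lines are
hypothetical, and a real check would read the actual files from the tree.

```python
import difflib


def regeneration_diff(existing_text, regenerated_text):
    """Return unified-diff lines; an empty list means an exact reproduction."""
    return list(difflib.unified_diff(
        existing_text.splitlines(),
        regenerated_text.splitlines(),
        fromfile="existing", tofile="regenerated", lineterm=""))


# Hypothetical two-line excerpt standing in for a checked-in data file.
existing = ("<U0041> /x41 LATIN CAPITAL LETTER A\n"
            "<U0042> /x42 LATIN CAPITAL LETTER B\n")
regenerated = existing.replace("/x42", "/x43")  # simulate a lost local change

for line in regeneration_diff(existing, regenerated):
    print(line)
```

An empty diff before the update gives a baseline, so any difference after the
update can be attributed to the new Unicode data rather than to the tooling.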


* [Bug localedata/14094] Update locale data to Unicode 6.3
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2014-05-23 12:11 ` joseph at codesourcery dot com
@ 2014-05-23 13:55 ` pravin.d.s at gmail dot com
  2014-06-10  9:43 ` pravin.d.s at gmail dot com
                   ` (39 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: pravin.d.s at gmail dot com @ 2014-05-23 13:55 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #5 from Pravin S <pravin.d.s at gmail dot com> ---
Yes, backward compatibility is a must.
I will write a small script to check that we are not changing the existing
maps, so we can be confident before committing.


* [Bug localedata/14094] Update locale data to Unicode 6.3
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2014-05-23 13:55 ` pravin.d.s at gmail dot com
@ 2014-06-10  9:43 ` pravin.d.s at gmail dot com
  2014-06-10 14:40 ` carlos at redhat dot com
                   ` (38 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: pravin.d.s at gmail dot com @ 2014-06-10  9:43 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

Pravin S <pravin.d.s at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at sourceware dot org   |pravin.d.s at gmail dot com

--- Comment #6 from Pravin S <pravin.d.s at gmail dot com> ---
I have written a script for checking the backward compatibility of the new
LC_CTYPE with the old LC_CTYPE.

The script is available at https://github.com/pravins/glibc-i18n

The important thing for us at present is the report generated by the script,
i.e.

https://raw.githubusercontent.com/pravins/glibc-i18n/master/Report

While doing this I also found that the existing i18n file includes
<U0D70>..<U0D75> twice:

% MALAYALAM/
   <U0D66>..<U0D75>;<U0D70>..<U0D75>;/

Let me know if anything is missing.

As the next step, I will check the characters missing between LC_CTYPE 5.0.0
and LC_CTYPE 6.3.0 and confirm whether these are intentional changes in
Unicode or something we are missing.

I will be ready with a patch for updating LC_CTYPE next time.
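
The duplicated MALAYALAM range above is the kind of thing a small check can
catch. A minimal sketch, assuming class entries are written as `<Uxxxx>` or
`<Uxxxx>..<Uxxxx>` separated by semicolons as in the excerpt; the function
names are hypothetical and this is not the script from the repository:

```python
import re

# Matches "<U0D66>" or "<U0D66>..<U0D75>" (4 to 8 hex digits assumed).
RANGE_RE = re.compile(r"<U([0-9A-Fa-f]{4,8})>(?:\.\.<U([0-9A-Fa-f]{4,8})>)?")


def parse_ranges(line):
    """Return (start, end) code point pairs found in one i18n line."""
    ranges = []
    for match in RANGE_RE.finditer(line):
        start = int(match.group(1), 16)
        end = int(match.group(2), 16) if match.group(2) else start
        ranges.append((start, end))
    return ranges


def find_overlaps(ranges):
    """Return the sub-ranges that are covered more than once."""
    overlaps = []
    max_end = None
    for start, end in sorted(ranges):
        if max_end is not None and start <= max_end:
            overlaps.append((start, min(end, max_end)))
        max_end = end if max_end is None else max(max_end, end)
    return overlaps


line = "   <U0D66>..<U0D75>;<U0D70>..<U0D75>;/"
for start, end in find_overlaps(parse_ranges(line)):
    print(f"listed twice: <U{start:04X}>..<U{end:04X}>")
```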


* [Bug localedata/14094] Update locale data to Unicode 6.3
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2014-06-10  9:43 ` pravin.d.s at gmail dot com
@ 2014-06-10 14:40 ` carlos at redhat dot com
  2014-06-11  4:25 ` pravin.d.s at gmail dot com
                   ` (37 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: carlos at redhat dot com @ 2014-06-10 14:40 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |carlos at redhat dot com

--- Comment #7 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Pravin S from comment #6)
> I have written a script for checking the backward compatibility of the new
> LC_CTYPE with the old LC_CTYPE.
> 
> The script is available at https://github.com/pravins/glibc-i18n
> 
> The important thing for us at present is the report generated by the
> script, i.e.
> 
> https://raw.githubusercontent.com/pravins/glibc-i18n/master/Report
> 
> While doing this I also found that the existing i18n file includes
> <U0D70>..<U0D75> twice:
> 
> % MALAYALAM/
>    <U0D66>..<U0D75>;<U0D70>..<U0D75>;/
> 
> Let me know if anything is missing.
> 
> As the next step, I will check the characters missing between LC_CTYPE
> 5.0.0 and LC_CTYPE 6.3.0 and confirm whether these are intentional changes
> in Unicode or something we are missing.
> 
> I will be ready with a patch for updating LC_CTYPE next time.

Thanks Pravin! I think the missing step is to get these scripts checked into
glibc's scripts/ directory so that we have them in a central location with some
internal comments showing how to run them. This way we can re-run them at
later stages to verify what's missing and stay in sync (say the release manager
runs them before a release).

Eventually we want a documented process here:
https://sourceware.org/glibc/wiki/Regeneration

Even if it's just "Run this script. Fix all warnings by hand" it would be a
good start.


* [Bug localedata/14094] Update locale data to Unicode 6.3
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2014-06-10 14:40 ` carlos at redhat dot com
@ 2014-06-11  4:25 ` pravin.d.s at gmail dot com
  2014-06-19 11:28 ` pravin.d.s at gmail dot com
                   ` (36 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: pravin.d.s at gmail dot com @ 2014-06-11  4:25 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #8 from Pravin S <pravin.d.s at gmail dot com> ---
Agree with you, will do it.


* [Bug localedata/14094] Update locale data to Unicode 6.3
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2014-06-11  4:25 ` pravin.d.s at gmail dot com
@ 2014-06-19 11:28 ` pravin.d.s at gmail dot com
  2014-06-21 19:15 ` [Bug localedata/14094] Update locale data to Unicode 7.0.0 pravin.d.s at gmail dot com
                   ` (35 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: pravin.d.s at gmail dot com @ 2014-06-19 11:28 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #9 from Pravin S <pravin.d.s at gmail dot com> ---
(In reply to Rich Felker from comment #1)
> One of the major "local hacks" can be fixed, fixing many other problems at
> the same time, by switching to using the Unicode "Alphabetic" property (from
> DerivedCoreProperties.txt) instead of just categories L* for class alpha.
> Right now there are many languages whose letters are considered
> non-alphabetic by glibc because they're in category Mn or Mc or even Cf.
> There are "local hacks" to fix this for maybe one or two languages, but
> using the right Unicode property would fix it for all languages.

I was almost done, but while updating this I found that around 248
characters had been added to the ALPHA group of the present i18n CTYPE after
gen-unicode-ctype.c processing (Unicode 5.1:
https://github.com/pravins/glibc-i18n/blob/master/unicode5-1/Report), and I am
facing the same issue while upgrading it to Unicode 6.3 (246 characters:
https://github.com/pravins/glibc-i18n/blob/master/Report).

While reading http://www.unicode.org/reports/tr44/#Property_List_Table I found
the following statement:

"Implementations should simply use the derived properties, and should not try
to rederive them from lists of simple properties and collections of rules,
because of the chances for error and divergence when doing so."  

I agree with Rich. We should take what is available from
DerivedCoreProperties.txt rather than processing the raw UnicodeData.txt. I am
writing a script to process the groups from DerivedCoreProperties.txt.
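
Reading a property out of DerivedCoreProperties.txt can be sketched roughly
as follows. The function name is hypothetical and the embedded lines are a
short excerpt in the file's "range ; property # comment" layout; this is not
the script mentioned above:

```python
def parse_property(lines, wanted):
    """Collect (first, last) code point ranges carrying the wanted property."""
    ranges = []
    for raw in lines:
        data = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != wanted:
            continue
        first, _, last = fields[0].partition("..")
        ranges.append((int(first, 16), int(last or first, 16)))
    return ranges


# Excerpt-style sample input; a real run would read the full data file.
sample = """\
# Derived Property: Alphabetic
0041..005A    ; Alphabetic # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Alphabetic # Lo       FEMININE ORDINAL INDICATOR
0041..005A    ; Uppercase # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
""".splitlines()

for first, last in parse_property(sample, "Alphabetic"):
    print(f"{first:04X}..{last:04X}")
```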


* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2014-06-19 11:28 ` pravin.d.s at gmail dot com
@ 2014-06-21 19:15 ` pravin.d.s at gmail dot com
  2014-06-25 12:08 ` fweimer at redhat dot com
                   ` (34 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: pravin.d.s at gmail dot com @ 2014-06-21 19:15 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

Pravin S <pravin.d.s at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Update locale data to       |Update locale data to
                   |Unicode 6.3                 |Unicode 7.0.0

--- Comment #10 from Pravin S <pravin.d.s at gmail dot com> ---
I am working with the latest Unicode standard, so I updated the bug summary.


* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2014-06-21 19:15 ` [Bug localedata/14094] Update locale data to Unicode 7.0.0 pravin.d.s at gmail dot com
@ 2014-06-25 12:08 ` fweimer at redhat dot com
  2014-06-25 12:43 ` pravin.d.s at gmail dot com
                   ` (33 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: fweimer at redhat dot com @ 2014-06-25 12:08 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|                            |security-


* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (13 preceding siblings ...)
  2014-06-25 12:08 ` fweimer at redhat dot com
@ 2014-06-25 12:43 ` pravin.d.s at gmail dot com
  2014-06-25 13:54 ` carlos at redhat dot com
                   ` (32 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: pravin.d.s at gmail dot com @ 2014-06-25 12:43 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #11 from Pravin S <pravin.d.s at gmail dot com> ---
(In reply to Joseph Myers from comment #0)
> 
> * Ensure the character data in localedata/charmaps/UTF-8 can be
>   reproduced in some automated fashion from Unicode 6.0, locating any
>   previously used automation for this or creating some new automation
>   if any previous automation can't be found.

  I was not able to find any previous automation for this either.

  I can simply pass all Unicode code points through a Python Unicode-to-UTF-8
conversion and format the result as required by the UTF-8 file.

  Any hints on how to do this?
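
One possible shape for that conversion, as a rough sketch only: the helper
names are hypothetical, the column widths are guesses, the 8-digit `<U...>`
form for code points above U+FFFF is an assumption about the charmap
convention, and character names come from Python's unicodedata module rather
than from a specific UnicodeData.txt version.

```python
import unicodedata


def ucs_symbol(cp):
    """Charmap symbol; 8 hex digits above U+FFFF is an assumed convention."""
    return f"<U{cp:04X}>" if cp <= 0xFFFF else f"<U{cp:08X}>"


def utf8_escapes(cp):
    """UTF-8 bytes of a code point written as /xNN escapes.

    Surrogates (U+D800..U+DFFF) cannot be encoded this way and are not handled.
    """
    return "".join(f"/x{byte:02x}" for byte in chr(cp).encode("utf-8"))


def charmap_line(cp):
    """One charmap-style line: symbol, byte sequence, character name."""
    name = unicodedata.name(chr(cp), "<unnamed>")
    return f"{ucs_symbol(cp):<12} {utf8_escapes(cp):<20} {name}"


for cp in (0x41, 0xE9, 0x20AC):
    print(charmap_line(cp))
```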


* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (14 preceding siblings ...)
  2014-06-25 12:43 ` pravin.d.s at gmail dot com
@ 2014-06-25 13:54 ` carlos at redhat dot com
  2014-07-04 10:51 ` pravin.d.s at gmail dot com
                   ` (31 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: carlos at redhat dot com @ 2014-06-25 13:54 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #12 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Pravin S from comment #11)
> (In reply to Joseph Myers from comment #0)
> > 
> > * Ensure the character data in localedata/charmaps/UTF-8 can be
> >   reproduced in some automated fashion from Unicode 6.0, locating any
> >   previously used automation for this or creating some new automation
> >   if any previous automation can't be found.
> 
>   I was not able to find any previous automation for this either.
> 
>   I can simply pass all Unicode code points through a Python
> Unicode-to-UTF-8 conversion and format the result as required by the
> UTF-8 file.
> 
>   Any hints on how to do this?

Not really, this is why this problem requires "work" ;-)


* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (15 preceding siblings ...)
  2014-06-25 13:54 ` carlos at redhat dot com
@ 2014-07-04 10:51 ` pravin.d.s at gmail dot com
  2014-07-17 12:44 ` pravin.d.s at gmail dot com
                   ` (30 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: pravin.d.s at gmail dot com @ 2014-07-04 10:51 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #13 from Pravin S <pravin.d.s at gmail dot com> ---
Created attachment 7679
  --> https://sourceware.org/bugzilla/attachment.cgi?id=7679&action=edit
Patch to update UTF-8 CHARMAP to unicode 7.0

 I have worked on updating the UTF-8 file to Unicode 7.0. The following are
the important points to note before reviewing this patch.

  1. The present patch is only for CHARMAP; a patch for updating WIDTH will be
available soon.
  2. utf8-gen.py: New script to generate the UTF-8 file.
  3. The patch was created ignoring whitespace changes (-w).
  4.
   ''' Where the UnicodeData.txt file gives characters as a range
    Example:
    3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
    4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;

    the UTF-8 file lists the range in blocks between the First and Last
    Unicode characters, each block ending 0x3F after its first code point.
    Example:
    <U3400>..<U343F>     /xe3/x90/x80         <CJK Ideograph Extension A>
    .
    .
    <U4D80>..<U4DB5>     /xe4/xb6/x80         <CJK Ideograph Extension A>

    Note: No idea why the Hangul syllables (AC00 to D7A3) were not expanded
    in the Unicode 5.0 UTF-8 file. We are staying consistent and expanding
    Hangul as well.
    '''

    5. Names differ from UnicodeData.txt in some cases.
    ''' Some characters have <control> as a name, so the "Unicode 1.0
     Name" is used instead.
     Characters U+0080, U+0081, U+0084 and U+0099 have "<control>" as a
     name and no "Unicode 1.0 Name" (10th field) in UnicodeData.txt either.
     We can write code to take their alternate names from NameAliases.txt '''
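
The range handling in point 4 can be sketched as below. This is not the code
from utf8-gen.py: the function names are hypothetical, and the sketch assumes
the First code point is aligned to a 0x40 boundary, as it is for CJK
Extension A.

```python
def utf8_escapes(cp):
    """UTF-8 bytes of a code point written as /xNN escapes."""
    return "".join(f"/x{byte:02x}" for byte in chr(cp).encode("utf-8"))


def split_range(first, last, name, step=0x40):
    """Split first..last into blocks of `step` code points, one line each.

    Each line carries the UTF-8 bytes of the block's first code point.
    """
    lines = []
    for start in range(first, last + 1, step):
        end = min(start + step - 1, last)
        lines.append(f"<U{start:04X}>..<U{end:04X}>     "
                     f"{utf8_escapes(start)}         <{name}>")
    return lines


blocks = split_range(0x3400, 0x4DB5, "CJK Ideograph Extension A")
print(blocks[0])
print(blocks[-1])
```

The first and last lines printed correspond to the two example lines quoted
above; the final block is shorter because the Last code point (4DB5) falls
inside it.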


* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (16 preceding siblings ...)
  2014-07-04 10:51 ` pravin.d.s at gmail dot com
@ 2014-07-17 12:44 ` pravin.d.s at gmail dot com
  2014-07-22 13:03 ` pravin.d.s at gmail dot com
                   ` (29 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: pravin.d.s at gmail dot com @ 2014-07-17 12:44 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

Pravin S <pravin.d.s at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #7679|0                           |1
        is obsolete|                            |

--- Comment #14 from Pravin S <pravin.d.s at gmail dot com> ---
Created attachment 7715
  --> https://sourceware.org/bugzilla/attachment.cgi?id=7715&action=edit
Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0

All work on the UTF-8 file is done.
Added two scripts and a report:
1. utf8-gen.py: generates the UTF-8 file.
2. utf8-compatibility.py: checks the backward compatibility of the newly
generated UTF-8 file.
3. The backward compatibility report for the new UTF-8 file is available at
https://raw.githubusercontent.com/pravins/glibc-i18n/master/report-utf8

Submitting to libc-alpha; please help with a quick review and push to git.


* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (17 preceding siblings ...)
  2014-07-17 12:44 ` pravin.d.s at gmail dot com
@ 2014-07-22 13:03 ` pravin.d.s at gmail dot com
  2014-09-05  1:08 ` carlos at redhat dot com
                   ` (28 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: pravin.d.s at gmail dot com @ 2014-07-22 13:03 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #15 from Pravin S <pravin.d.s at gmail dot com> ---
Created attachment 7720
  --> https://sourceware.org/bugzilla/attachment.cgi?id=7720&action=edit
Patch to update UTF-8 i18n file (CTYPE) to unicode 7.0

The patch does the following:
* locales/i18n: Updated to Unicode 7.0.0.

* scripts/gen-unicode-ctype.c: Disabled the upper, lower, alpha and outdigit
classes.

* scripts/ctype-gen.sh: Shell script to generate LC_CTYPE for a new Unicode
version.

* scripts/gen-unicode-ctype-dcp.py: New script for generating the locales/i18n
upper, lower and alpha ctype classes from DerivedCoreProperties.txt.

* scripts/ctype-compatibility.py: Script for testing the backward
compatibility of LC_CTYPE in locales/i18n.

The backward compatibility report is available at
https://raw.githubusercontent.com/pravins/glibc-i18n/master/unicode7-0/ctype-compatibility5_1-to-7_0
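
Turning a set of code points into class entries of the kind used in
locales/i18n might look roughly like this. It is a sketch with hypothetical
function names, not gen-unicode-ctype-dcp.py; the entry syntax follows the
MALAYALAM excerpt earlier in the thread, and the 8-digit form above U+FFFF is
an assumption.

```python
def ucs_symbol(cp):
    """Symbol for one code point; 8 hex digits above U+FFFF is assumed."""
    return f"<U{cp:04X}>" if cp <= 0xFFFF else f"<U{cp:08X}>"


def to_ranges(code_points):
    """Collapse code points into sorted, non-overlapping (first, last) runs."""
    runs = []
    for cp in sorted(set(code_points)):
        if runs and cp == runs[-1][1] + 1:
            runs[-1][1] = cp           # extend the current run
        else:
            runs.append([cp, cp])      # start a new run
    return [(first, last) for first, last in runs]


def format_class(ranges):
    """Join ranges as '<Ua>..<Ub>' or '<Ua>' entries separated by ';'."""
    return ";".join(
        ucs_symbol(first) if first == last
        else f"{ucs_symbol(first)}..{ucs_symbol(last)}"
        for first, last in ranges)


print(format_class(to_ranges([0x0D66, 0x0D67, 0x0D68, 0x0D70, 0x0D67])))
```

Building the runs from a set means a code point can never be listed twice,
which avoids the duplicated-range problem by construction.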


* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (18 preceding siblings ...)
  2014-07-22 13:03 ` pravin.d.s at gmail dot com
@ 2014-09-05  1:08 ` carlos at redhat dot com
  2014-09-29  7:29 ` maiku.fabian at gmail dot com
                   ` (27 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: carlos at redhat dot com @ 2014-09-05  1:08 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|2.15                        |2.21

--- Comment #16 from Carlos O'Donell <carlos at redhat dot com> ---
Pravin,

Is any part of your work ready for 2.21 when it opens?


* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (19 preceding siblings ...)
  2014-09-05  1:08 ` carlos at redhat dot com
@ 2014-09-29  7:29 ` maiku.fabian at gmail dot com
  2014-09-29  7:30 ` pravin.d.s at gmail dot com
                   ` (26 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: maiku.fabian at gmail dot com @ 2014-09-29  7:29 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |maiku.fabian at gmail dot com

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (20 preceding siblings ...)
  2014-09-29  7:29 ` maiku.fabian at gmail dot com
@ 2014-09-29  7:30 ` pravin.d.s at gmail dot com
  2014-10-14  8:08 ` maiku.fabian at gmail dot com
                   ` (25 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: pravin.d.s at gmail dot com @ 2014-09-29  7:30 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #17 from Pravin S <pravin.d.s at gmail dot com> ---
I am still waiting for someone to review these patches.
The best way to do that would be:
1. Build glibc with the patches applied.
2. Test the WIDTH and CTYPE functions (do they return proper values?). One
could run the same tests against the existing glibc and compare the results.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (21 preceding siblings ...)
  2014-09-29  7:30 ` pravin.d.s at gmail dot com
@ 2014-10-14  8:08 ` maiku.fabian at gmail dot com
  2014-11-06 11:03 ` maiku.fabian at gmail dot com
                   ` (24 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: maiku.fabian at gmail dot com @ 2014-10-14  8:08 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #18 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Pravin S from comment #14)
> Created attachment 7715 [details]
> Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0
> 
> Done with all work with UTF-8 file. 
> Added two script:
> 1. utf8-gen.py to generate UTF-8 file
> 2. utf8-compatibility.py : to check backward compatibility of newly
> generated UTF-8 file
> 3. Report of new UTF-8 file backward compatibility is available AT
> https://raw.githubusercontent.com/pravins/glibc-i18n/master/report-utf8
> 
> Submitting to glibc-alpha, please help to quick review and push to git.

I checked the scripts Pravin used and the resulting UTF-8 file.

I found only one minor problem:

In some cases, both UnicodeData.txt and EastAsianWidth.txt have information
about width. For example, EastAsianWidth.txt has:

    302A..302D;W     # Mn     [4] IDEOGRAPHIC LEVEL TONE MARK..IDEOGRAPHIC ENTERING TONE MARK

which gives us width 2 for these 4 characters (because of “W”) but
UnicodeData.txt has:

    302A;IDEOGRAPHIC LEVEL TONE MARK;Mn;218;NSM;;;;;N;;;;;
    302B;IDEOGRAPHIC RISING TONE MARK;Mn;228;NSM;;;;;N;;;;;
    302C;IDEOGRAPHIC DEPARTING TONE MARK;Mn;232;NSM;;;;;N;;;;;
    302D;IDEOGRAPHIC ENTERING TONE MARK;Mn;222;NSM;;;;;N;;;;;

which would give width 0 (because of “NSM”).

I changed Pravin’s script a bit to prefer the information from
EastAsianWidth.txt in case of conflicts.

Pravin has already merged my change into his git repository.
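
A minimal sketch of that precedence rule, using only the sample lines quoted
above. The function names are hypothetical and are not taken from utf8-gen.py,
and treating "F" (fullwidth) the same as "W" is an assumption; the message
itself only mentions "W".

```python
# Sample lines copied from the message above.
EAST_ASIAN_WIDTH = [
    "302A..302D;W     # Mn     [4] IDEOGRAPHIC LEVEL TONE MARK..IDEOGRAPHIC ENTERING TONE MARK",
]
UNICODE_DATA = [
    "302A;IDEOGRAPHIC LEVEL TONE MARK;Mn;218;NSM;;;;;N;;;;;",
    "302B;IDEOGRAPHIC RISING TONE MARK;Mn;228;NSM;;;;;N;;;;;",
    "302C;IDEOGRAPHIC DEPARTING TONE MARK;Mn;232;NSM;;;;;N;;;;;",
    "302D;IDEOGRAPHIC ENTERING TONE MARK;Mn;222;NSM;;;;;N;;;;;",
]


def widths_from_east_asian_width(line):
    """Return {code_point: 2} for a wide range in one EastAsianWidth.txt line."""
    fields = line.split("#")[0].strip().split(";")
    # Assumption: "F" (fullwidth) is handled like "W" (wide).
    if len(fields) != 2 or fields[1].strip() not in ("W", "F"):
        return {}
    bounds = fields[0].strip().split("..")
    start, end = int(bounds[0], 16), int(bounds[-1], 16)
    return {code_point: 2 for code_point in range(start, end + 1)}


def widths_from_unicode_data(line):
    """Return {code_point: 0} if the bidi class (field 4) is NSM."""
    fields = line.split(";")
    if fields[4] == "NSM":
        return {int(fields[0], 16): 0}
    return {}


def merge_widths(unicode_data_lines, east_asian_width_lines):
    """Combine both sources; EastAsianWidth.txt wins on conflicts."""
    widths = {}
    for line in unicode_data_lines:
        widths.update(widths_from_unicode_data(line))
    # Applied last, so these entries overwrite the width 0 set above.
    for line in east_asian_width_lines:
        widths.update(widths_from_east_asian_width(line))
    return widths


WIDTHS = merge_widths(UNICODE_DATA, EAST_ASIAN_WIDTH)
for code_point in sorted(WIDTHS):
    print(f"U+{code_point:04X} width {WIDTHS[code_point]}")
```

Applying the EastAsianWidth.txt entries last is what implements the
preference; reversing the two loops would give width 0 instead.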

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (23 preceding siblings ...)
  2014-11-06 11:03 ` maiku.fabian at gmail dot com
@ 2014-11-06 11:03 ` maiku.fabian at gmail dot com
  2014-11-06 11:05 ` maiku.fabian at gmail dot com
                   ` (22 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: maiku.fabian at gmail dot com @ 2014-11-06 11:03 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #19 from Mike FABIAN <maiku.fabian at gmail dot com> ---
I extended Pravin’s ctype-compatibility.py script to produce more
human-readable output and added many extra tests.

Joseph Myers> * Ensure the character type data in
Joseph Myers>   localedata/charmaps/i18n can be properly reproduced from
Joseph Myers>   Unicode 5.0 data using gen-unicode-ctype.c, adapting
Joseph Myers>   gen-unicode-ctype.c as needed to replicate any changes
Joseph Myers>   that may have been made not using that program.

When using gen-unicode-ctype.c with UnicodeData.txt-5.0.0
to generate LC_CTYPE, the generated file lacks
many characters which apparently have been manually added
to glibc’s i18n file:

alpha: Missing 1238 characters of old ctype in new ctype 
blank: Missing 0 characters of old ctype in new ctype 
cntrl: Missing 0 characters of old ctype in new ctype 
combining: Missing 124 characters of old ctype in new ctype 
combining_level3: Missing 49 characters of old ctype in new ctype 
digit: Missing 0 characters of old ctype in new ctype 
graph: Missing 1571 characters of old ctype in new ctype 
lower: Missing 115 characters of old ctype in new ctype 
print: Missing 1571 characters of old ctype in new ctype 
punct: Missing 335 characters of old ctype in new ctype 
space: Missing 0 characters of old ctype in new ctype 
tolower: Missing 19 characters of old ctype in new ctype 
totitle: Missing 8 characters of old ctype in new ctype 
toupper: Missing 18 characters of old ctype in new ctype 
upper: Missing 100 characters of old ctype in new ctype 
xdigit: Missing 0 characters of old ctype in new ctype 

I.e. reproducing the localedata/charmaps/i18n character type data
from Unicode 5.0 data using gen-unicode-ctype.c does not work
well because glibc’s i18n file apparently has been edited
manually a lot already to include newer Unicode data.
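
The "Missing N characters" lines above amount to a per-class set difference
between the old and the newly generated data. A minimal sketch of that
comparison follows; the function name and the tiny sample sets are made up
for illustration and are not taken from ctype-compatibility.py or from the
real i18n file.

```python
def missing_per_class(old_ctype, new_ctype):
    """For each class, list code points present in old but absent in new."""
    return {
        name: sorted(old_ctype[name] - new_ctype.get(name, set()))
        for name in sorted(old_ctype)
    }


# Hypothetical toy data: class name -> set of code points.
OLD = {"alpha": {0x41, 0x0901, 0x0902}, "punct": {0x21, 0xA67F}}
NEW = {"alpha": {0x41, 0xA67F}, "punct": {0x21}}

for name, missing in missing_per_class(OLD, NEW).items():
    print(f"{name}: Missing {len(missing)} characters of old ctype in new ctype")
```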

Apparently quite a few mistakes have been made by manually editing
the i18n file. For example, the report from ctype-compatibility.py
also shows the following for the old i18n file:

error: 0xa67f ꙿ punct True: 0xa67f CYRILLIC PAYEROK. Not in Unicode 5.0.0.
            In Unicode 7.0.0. General category Lm (Letter
            modifier). DerivedCoreProperties.txt says it is
            “Alphabetic”. Apparently added manually to punct by mistake in
            glibc’s old LC_CTYPE.
error: 0xa67f ꙿ alpha False: 0xa67f CYRILLIC PAYEROK. Not in Unicode 5.0.0.
            In Unicode 7.0.0. General category Lm (Letter
            modifier). DerivedCoreProperties.txt says it is
            “Alphabetic”. Apparently added manually to punct by mistake in
            glibc’s old LC_CTYPE.

Another example:

error: 0x9f4 ৴ alpha True: 
            “09F4;BENGALI CURRENCY NUMERATOR ONE;No;0;L;;;;1/16;N;;;;;”
            “09F5;BENGALI CURRENCY NUMERATOR TWO;No;0;L;;;;1/8;N;;;;;”
            “09F6;BENGALI CURRENCY NUMERATOR THREE;No;0;L;;;;3/16;N;;;;;”
            “09F7;BENGALI CURRENCY NUMERATOR FOUR;No;0;L;;;;1/4;N;;;;;”
            “09F8;BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR;No;0;L;;;;3/4;N;;;;;”
            “09F9;BENGALI CURRENCY DENOMINATOR SIXTEEN;No;0;L;;;;16;N;;;;;”
            “09FA;BENGALI ISSHAR;So;0;L;;;;;N;;;;;”
            According to DerivedCoreProperties.txt (7.0.0) these are *not*
            “Alphabetic”.

So this has been mistakenly added to “alpha” in the old i18n file
of glibc (but gen-unicode-ctype.c correctly puts it into “punct”,
i.e. this seems to be another mistake made by manual editing).

Some of the errors reported by ctype-compatibility.py

error: 0x250 ɐ lower False: Should be lower in Unicode 7.0.0 (was not lower in
            Unicode 5.0.0).

would be fixed by using gen-unicode-ctype.c with Unicode 7.0.0 input.

There are many more problems like this in the old i18n file,
my tests found 133 errors total:

------------------------------------------------------------
Old file = /local/mfabian/src/glibc/localedata/locales/i18n
Number of errors in old file = 133
------------------------------------------------------------

I’ll attach the full report.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (22 preceding siblings ...)
  2014-10-14  8:08 ` maiku.fabian at gmail dot com
@ 2014-11-06 11:03 ` maiku.fabian at gmail dot com
  2014-11-06 11:03 ` maiku.fabian at gmail dot com
                   ` (23 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: maiku.fabian at gmail dot com @ 2014-11-06 11:03 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #20 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Created attachment 7907
  --> https://sourceware.org/bugzilla/attachment.cgi?id=7907&action=edit
unicode-5.0.0-report-full-output

Full report from ctype-compatibility.py when comparing the old i18n
file in glibc with the file generated by gen-unicode-ctype.c using
UnicodeData.txt from Unicode 5.0.0.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (24 preceding siblings ...)
  2014-11-06 11:03 ` maiku.fabian at gmail dot com
@ 2014-11-06 11:05 ` maiku.fabian at gmail dot com
  2014-11-06 11:22 ` maiku.fabian at gmail dot com
                   ` (21 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: maiku.fabian at gmail dot com @ 2014-11-06 11:05 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #21 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Now when using gen-unicode-ctype.c with UnicodeData.txt-7.0.0
to generate LC_CTYPE, the generated file lacks far fewer
characters compared to the old i18n file in glibc:

alpha: Missing 246 characters of old ctype in new ctype 
blank: Missing 1 characters of old ctype in new ctype 
cntrl: Missing 0 characters of old ctype in new ctype 
combining: Missing 3 characters of old ctype in new ctype 
combining_level3: Missing 5 characters of old ctype in new ctype 
digit: Missing 0 characters of old ctype in new ctype 
graph: Missing 0 characters of old ctype in new ctype 
lower: Missing 20 characters of old ctype in new ctype 
print: Missing 0 characters of old ctype in new ctype 
punct: Missing 16 characters of old ctype in new ctype 
space: Missing 1 characters of old ctype in new ctype 
tolower: Missing 0 characters of old ctype in new ctype 
totitle: Missing 0 characters of old ctype in new ctype 
toupper: Missing 0 characters of old ctype in new ctype 
upper: Missing 0 characters of old ctype in new ctype 
xdigit: Missing 0 characters of old ctype in new ctype

For example, gen-unicode-ctype.c does not put U+0901 into
the “alpha” class although it should be there
according to DerivedCoreProperties.txt:

error: 0x901 ँ alpha False: These have general category “Mn” i.e. these are
            combining characters (both in UnicodeData.txt 5.0.0 and 7.0.0):
            “0901;DEVANAGARI SIGN CANDRABINDU;Mn;0;NSM;;;;;N;;;;;”,
            “0902;DEVANAGARI SIGN ANUSVARA;Mn;0;NSM;;;;;N;;;;;”,
            “0903;DEVANAGARI SIGN VISARGA;Mc;0;L;;;;;N;;;;;”.
            According to DerivedCoreProperties.txt (7.0.0) these are
            “Alphabetic”.  

Apparently this has been edited manually (correctly) in the old i18n file
of glibc.

So this would be fixed in the automatic generation
when using DerivedCoreProperties.txt for “alpha”.

But some of the above seem to be errors in the old i18n file
of glibc, for example:

error: 0x1090 ႐ punct True: MYANMAR SHAN DIGIT ZERO - MYANMAR SHAN DIGIT NINE.
            These are digits, but because ISO C 99 forbids to
            put them into digit they should go into alpha.

This is in “punct” in the old i18n file but gen-unicode-ctype.c
would put it into “alpha” which seems better for such digits
according to the comments in gen-unicode-ctype.c.

I went through all these “Missing” characters individually,
looked them up in UnicodeData.txt and DerivedCoreProperties.txt,
checked how they should be classified, and added test cases
for them to the ctype-compatibility.py script.

I’ll attach the full report after using gen-unicode-ctype.c with
UnicodeData.txt-7.0.0 to generate LC_CTYPE.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (25 preceding siblings ...)
  2014-11-06 11:05 ` maiku.fabian at gmail dot com
@ 2014-11-06 11:22 ` maiku.fabian at gmail dot com
  2014-11-06 11:56 ` maiku.fabian at gmail dot com
                   ` (20 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: maiku.fabian at gmail dot com @ 2014-11-06 11:22 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #22 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Created attachment 7908
  --> https://sourceware.org/bugzilla/attachment.cgi?id=7908&action=edit
unicode-7.0.0-report-full-output

Full report from ctype-compatibility.py when comparing the old i18n
file in glibc with the file generated by gen-unicode-ctype.c using
UnicodeData.txt from Unicode 7.0.0.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (26 preceding siblings ...)
  2014-11-06 11:22 ` maiku.fabian at gmail dot com
@ 2014-11-06 11:56 ` maiku.fabian at gmail dot com
  2014-11-06 11:59 ` maiku.fabian at gmail dot com
                   ` (19 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: maiku.fabian at gmail dot com @ 2014-11-06 11:56 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #23 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Now Pravin’s approach in the patch attached to comment #15
is to comment out the generation of “upper”, “lower”
and “alpha” in gen-unicode-ctype.c and add another
script, gen-unicode-ctype-dcp.py, which adds these.

But this is a bit problematic.

1) it does not put digits like

   alpha: Missing: ٠ 0x660 ARABIC-INDIC DIGIT ZERO

into “alpha”, which  gen-unicode-ctype.c would have done.
gen-unicode-ctype.c contains the comment

          /* Consider all the non-ASCII digits as alphabetic.
         ISO C 99 forbids us to have them in category "digit",
         but we want iswalnum to return true on them.  */

which sounds reasonable.

2) it does not put characters like

    lower: Missing: Dž 0x1c5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON

into “lower”. This is actually title case, not lower case,
but glibc has only “lower” and “upper”, not “title”,
although it does have “toupper”, “tolower”, and “totitle”.

gen-unicode-ctype.c puts characters which change when “toupper”
is applied into “lower” and characters which change when “tolower”
is applied into “upper”. Therefore, gen-unicode-ctype.c
puts title case characters like Dž 0x1c5 into *both*, “upper” *and*
“lower”. Which seems reasonable if glibc has no “title”.

3) it does not put some characters like:

    upper: Missing: ᾈ 0x1f88 GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI

into “upper”. Surprisingly,

“U+1F88 ᾈ GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI”
is *not* listed as “Uppercase” in
http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt .

Although U+1F88 seems to be Uppercase according to
http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
because it has a tolower mapping to U+1F80:

    1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;Ll;0;L;1F00 0345;;;;N;;;1F88;;1F88
    1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI;Lt;0;L;1F08 0345;;;;N;;;;1F80;

So this might be a bug in DerivedCoreProperties.txt.

Generating “upper” and “lower” the way gen-unicode-ctype.c does,
i.e. just using UnicodeData.txt and checking whether characters
change when mapping them to upper or to lower, does not produce this
error. I think the approach gen-unicode-ctype.c uses for “upper”
and “lower” is fine, it is not necessary to use DerivedCoreProperties.txt
for this.
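
A small sketch of the mapping-based rule described in points 2) and 3). It
is not the code of gen-unicode-ctype.c, which may contain exceptions not
shown here. The U+1F80/U+1F88 mappings are read off the UnicodeData.txt
lines quoted above; the U+01C4..U+01C6 mappings are filled in from memory
of UnicodeData.txt and should be treated as an assumption.

```python
# code point -> (simple uppercase mapping, simple lowercase mapping);
# None means UnicodeData.txt has no mapping in that field.
CASE_MAPPINGS = {
    0x01C4: (None, 0x01C6),    # DZ WITH CARON (assumed values)
    0x01C5: (0x01C4, 0x01C6),  # Dz WITH CARON, title case (assumed values)
    0x01C6: (0x01C4, None),    # dz WITH CARON (assumed values)
    0x1F80: (0x1F88, None),    # from the lines quoted above
    0x1F88: (None, 0x1F80),    # from the lines quoted above
}


def to_upper(code_point):
    upper, _lower = CASE_MAPPINGS.get(code_point, (None, None))
    return code_point if upper is None else upper


def to_lower(code_point):
    _upper, lower = CASE_MAPPINGS.get(code_point, (None, None))
    return code_point if lower is None else lower


def is_lower(code_point):
    # "lower" means the character changes when toupper is applied.
    return to_upper(code_point) != code_point


def is_upper(code_point):
    # "upper" means the character changes when tolower is applied.
    return to_lower(code_point) != code_point


for cp in sorted(CASE_MAPPINGS):
    print(f"U+{cp:04X} upper={is_upper(cp)} lower={is_lower(cp)}")
```

Under this rule U+01C5 lands in both classes because it has both mappings,
while U+1F88 lands in “upper” only, since it has a tolower mapping but no
toupper mapping.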

4) *many* characters end up being in “alpha” *and* “punct”

For example:

    error: ⷶ 0x2df6 is alpha and punct

gen-unicode-ctype.c has the comment:

      /* alpha restriction: "No character specified for the keywords cntrl,
     digit, punct or space shall be specified."  */

This restriction is violated because the second script
gen-unicode-ctype-dcp.py used in Pravin’s 2-pass approach does not
check whether gen-unicode-ctype.c has already put a character into
“punct” before putting it into “alpha”.

The character  “ⷶ U+2df6 COMBINING CYRILLIC LETTER A” is “Alphabetic”
according to DerivedCoreProperties.txt:

    2DE0..2DFF    ; Alphabetic # Mn  [32] COMBINING CYRILLIC LETTER BE..COMBINING CYRILLIC LETTER IOTIFIED BIG YUS

So Pravin’s script rightly puts it into “alpha”.

But looking at this, it does not seem to be a good idea to have two
independent programs generating the file in two independent passes.

Verifications like gen-unicode-ctype.c does:

      /* toupper restriction: "Only characters specified for the keywords
     lower and upper shall be specified.  */
      ...  
      /* tolower restriction: "Only characters specified for the keywords
     lower and upper shall be specified.  */
      ...
      /* alpha restriction: "Characters classified as either upper or lower
     shall automatically belong to this class.  */
      ...
      /* alpha restriction: "No character specified for the keywords cntrl,
     digit, punct or space shall be specified."  */
      ...
      /* space restriction: "No character specified for the keywords upper,
     lower, alpha, digit, graph or xdigit shall be specified."
     upper, lower, alpha already checked above.  */
      ...
      /* cntrl restriction: "No character specified for the keywords upper,
     lower, alpha, digit, punct, graph, print or xdigit shall be
     specified."  upper, lower, alpha already checked above.  */
      ...

can be done much easier when using a single program.
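
As an illustration of why a single program helps, the sketch below checks two
of the restrictions quoted above once all classes are available in one place.
It is a hypothetical helper, not code from gen-unicode-ctype.c, and the sample
class contents are made up apart from U+2DF6.

```python
def check_restrictions(classes):
    """Return error strings for violated alpha restrictions.

    `classes` maps a class name to a set of code points.
    """
    errors = []
    alpha = classes.get("alpha", set())
    # "Characters classified as either upper or lower shall automatically
    # belong to this class."
    for name in ("upper", "lower"):
        for cp in sorted(classes.get(name, set()) - alpha):
            errors.append(f"error: {chr(cp)} 0x{cp:x} is {name} but not alpha")
    # "No character specified for the keywords cntrl, digit, punct or space
    # shall be specified."
    for name in ("cntrl", "digit", "punct", "space"):
        for cp in sorted(alpha & classes.get(name, set())):
            errors.append(f"error: {chr(cp)} 0x{cp:x} is alpha and {name}")
    return errors


# Toy data: U+2DF6 ended up in both alpha and punct, as in point 4).
CLASSES = {
    "alpha": {0x61, 0x2DF6},
    "lower": {0x61},
    "punct": {0x21, 0x2DF6},
}
ERRORS = check_restrictions(CLASSES)
for error in ERRORS:
    print(error)
```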

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (27 preceding siblings ...)
  2014-11-06 11:56 ` maiku.fabian at gmail dot com
@ 2014-11-06 11:59 ` maiku.fabian at gmail dot com
  2014-11-06 12:09 ` maiku.fabian at gmail dot com
                   ` (18 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: maiku.fabian at gmail dot com @ 2014-11-06 11:59 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #24 from Mike FABIAN <maiku.fabian at gmail dot com> ---
So I think we should do either:

1) improve gen-unicode-ctype.c and make it use
   DerivedCoreProperties.txt for “alpha”

or:

2) rewrite gen-unicode-ctype.c in Python.
   First a rewrite which produces *exactly* the same
   output as gen-unicode-ctype.c, then add code
   to use DerivedCoreProperties.txt for “alpha”

No matter whether extending the C-Program or writing a Python program,
it should be a single program to be able to verify the restrictions
mentioned easily.

It would be nice of course to make the program read in the old i18n
file and replace the characters classes and write out a new file which
keeps the rest of the original file so that no manual copy&paste of
the generated character classes is necessary.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (28 preceding siblings ...)
  2014-11-06 11:59 ` maiku.fabian at gmail dot com
@ 2014-11-06 12:09 ` maiku.fabian at gmail dot com
  2014-11-12 10:15 ` pravin.d.s at gmail dot com
                   ` (17 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: maiku.fabian at gmail dot com @ 2014-11-06 12:09 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #25 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Mike FABIAN from comment #24)

> No matter whether extending the C-Program or writing a Python program,
> it should be a single program to be able to verify the restrictions
> mentioned easily.

And as a 2nd pass, after the single program to generate the character
class data, use ctype-compatibility.py as a "test-suite".

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (29 preceding siblings ...)
  2014-11-06 12:09 ` maiku.fabian at gmail dot com
@ 2014-11-12 10:15 ` pravin.d.s at gmail dot com
  2014-11-12 10:25 ` pravin.d.s at gmail dot com
                   ` (16 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: pravin.d.s at gmail dot com @ 2014-11-12 10:15 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

Pravin S <pravin.d.s at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |17588

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (30 preceding siblings ...)
  2014-11-12 10:15 ` pravin.d.s at gmail dot com
@ 2014-11-12 10:25 ` pravin.d.s at gmail dot com
  2014-11-14 15:10 ` maiku.fabian at gmail dot com
                   ` (15 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: pravin.d.s at gmail dot com @ 2014-11-12 10:25 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #26 from Pravin S <pravin.d.s at gmail dot com> ---
(In reply to Mike FABIAN from comment #18)
> (In reply to Pravin S from comment #14)
> > Created attachment 7715 [details]
> > Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0
> > 
> > Done with all work with UTF-8 file. 
> > Added two script:
> > 1. utf8-gen.py to generate UTF-8 file
> > 2. utf8-compatibility.py : to check backward compatibility of newly
> > generated UTF-8 file
> > 3. Report of new UTF-8 file backward compatibility is available AT
> > https://raw.githubusercontent.com/pravins/glibc-i18n/master/report-utf8
> > 
> > Submitting to glibc-alpha, please help to quick review and push to git.
> 
> I checked the scripts Pravin used and the resulting UTF-8 file.
> 
> I found only one minor problem:
> 
> In some cases, both UnicodeData.txt and EastAsianWidth.txt have information
> about width. For example, EastAsianWidth.txt has:
>     
>     302A..302D;W     # Mn     [4] IDEOGRAPHIC LEVEL TONE MARK..IDEOGRAPHIC
> ENTERING TONE MARK
>     
> which gives us width 2 for these 4 characters (because of “W”) but
> UnicodeData.txt has:
>     
>     302A;IDEOGRAPHIC LEVEL TONE MARK;Mn;218;NSM;;;;;N;;;;;
>     302B;IDEOGRAPHIC RISING TONE MARK;Mn;228;NSM;;;;;N;;;;;
>     302C;IDEOGRAPHIC DEPARTING TONE MARK;Mn;232;NSM;;;;;N;;;;;
>     302D;IDEOGRAPHIC ENTERING TONE MARK;Mn;222;NSM;;;;;N;;;;;
>     
> which would give width 0 (because of “NSM”).
> 
> I changed Pravin’s script a bit to prefer the information from
> EastAsianWidth.txt in case of conflicts.
> 
> Pravin has already merged my change into his git repository.

Thanks, Mike, for the review. This bug is presently tracking two changes, one
to the i18n file and the other to the UTF-8 file. Both changes are significant,
so for better tracking I created a new bug,
https://sourceware.org/bugzilla/show_bug.cgi?id=17588, for the UTF-8 file. I will
submit the respective patches there.

The i18n ctype work is still pending.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (34 preceding siblings ...)
  2014-11-14 15:10 ` maiku.fabian at gmail dot com
@ 2014-11-14 15:10 ` maiku.fabian at gmail dot com
  2014-11-14 15:11 ` maiku.fabian at gmail dot com
                   ` (11 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: maiku.fabian at gmail dot com @ 2014-11-14 15:10 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #28 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Created attachment 7932
  --> https://sourceware.org/bugzilla/attachment.cgi?id=7932&action=edit
gen-unicode-ctype.py

Improved version of gen-unicode-ctype.py which also parses
DerivedCoreProperties.txt and uses it (partly) for is_alpha(),
is_lower(), and is_upper().

"partly" because of 1):

            # Consider all the non-ASCII digits as alphabetic.
            # ISO C 99 forbids us to have them in category “digit”,
            # but we want iswalnum to return true on them.

These digits are not “Alphabetic” in DerivedCoreProperties.txt
but it seems to make sense to treat them as alpha according
to this comment by Bruno.

and 2):
    title case characters are treated as both upper *and* lower.
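
A rough sketch of the "partly" rule for alpha, assuming a hypothetical
parse_property() helper; this is not the code of the attached
gen-unicode-ctype.py. The DerivedCoreProperties.txt sample line is the one
quoted in comment #23; the general categories in the small table are filled
in from memory of UnicodeData.txt and are an assumption.

```python
DERIVED_CORE_PROPERTIES = [
    "2DE0..2DFF    ; Alphabetic # Mn  [32] COMBINING CYRILLIC LETTER BE..COMBINING CYRILLIC LETTER IOTIFIED BIG YUS",
]

# code point -> general category (assumed values, for illustration only).
GENERAL_CATEGORY = {
    0x0030: "Nd",  # DIGIT ZERO
    0x0660: "Nd",  # ARABIC-INDIC DIGIT ZERO
    0x249C: "So",  # PARENTHESIZED LATIN SMALL LETTER A
    0x2DF6: "Mn",  # COMBINING CYRILLIC LETTER A
}


def parse_property(lines, wanted):
    """Collect all code points that carry the property `wanted`."""
    result = set()
    for line in lines:
        data = line.split("#")[0].strip()
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != wanted:
            continue
        bounds = fields[0].split("..")
        start, end = int(bounds[0], 16), int(bounds[-1], 16)
        result.update(range(start, end + 1))
    return result


ALPHABETIC = parse_property(DERIVED_CORE_PROPERTIES, "Alphabetic")


def is_alpha(code_point):
    if code_point in ALPHABETIC:
        return True
    # Consider all the non-ASCII digits as alphabetic, as in the comment
    # quoted above, even though they are not "Alphabetic" in
    # DerivedCoreProperties.txt.
    return GENERAL_CATEGORY.get(code_point) == "Nd" and code_point > 0x7F


for cp in sorted(GENERAL_CATEGORY):
    print(f"U+{cp:04X} alpha={is_alpha(cp)}")
```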

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (33 preceding siblings ...)
  2014-11-14 15:10 ` maiku.fabian at gmail dot com
@ 2014-11-14 15:10 ` maiku.fabian at gmail dot com
  2014-11-14 15:10 ` maiku.fabian at gmail dot com
                   ` (12 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: maiku.fabian at gmail dot com @ 2014-11-14 15:10 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #29 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Created attachment 7933
  --> https://sourceware.org/bugzilla/attachment.cgi?id=7933&action=edit
report-gen-unicode-ctype.py-DerivedCoreProperties-7.0.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (32 preceding siblings ...)
  2014-11-14 15:10 ` maiku.fabian at gmail dot com
@ 2014-11-14 15:10 ` maiku.fabian at gmail dot com
  2014-11-14 15:10 ` maiku.fabian at gmail dot com
                   ` (13 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: maiku.fabian at gmail dot com @ 2014-11-14 15:10 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #30 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Mike FABIAN from comment #29)
> Created attachment 7933 [details]
> report-gen-unicode-ctype.py-DerivedCoreProperties-7.0.0

From this report:

alpha: Missing: ⒜ 0x249c PARENTHESIZED LATIN SMALL LETTER A
...

These are *not* “Alphabetic” in DerivedCoreProperties.txt, therefore
it is correct to remove them.

978 characters have been removed from “punct” which are now in “alpha”
because of DerivedCoreProperties.txt.

Number of errors in new file = 11:

These are only errors like:

error: 0xe2f ฯ alpha True: FIXME: Theppitak Karoonboonyanan
<thep@links.nectec.or.th> says
            <U0E2F>, <U0E46> should belong to punct. DerivedCoreProperties.txt
            says it is alpha.
...
error: 0xe4e ๎ alpha False: FIXME: gen-unicode-ctype.c: Theppitak
Karoonboonyanan
            <thep@links.nectec.or.th> says <U0E47>..<U0E4E> are
            is_alpha. DerivedCoreProperties does *not*.

I wrote mail to Theppitak Karoonboonyanan <thep@links.nectec.or.th>
and Bruno. The mail to thep@links.nectec.or.th bounced and I did not
get an answer from Bruno.

I think it is better to trust DerivedCoreProperties.txt here, so I don’t
think these are errors.

So I think my updated gen-unicode-ctype.py produces the character
classes correctly (as far as possible with the limitations caused by
glibc and ISO C 99).

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (31 preceding siblings ...)
  2014-11-12 10:25 ` pravin.d.s at gmail dot com
@ 2014-11-14 15:10 ` maiku.fabian at gmail dot com
  2014-11-14 15:10 ` maiku.fabian at gmail dot com
                   ` (14 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: maiku.fabian at gmail dot com @ 2014-11-14 15:10 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #27 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Created attachment 7931
  --> https://sourceware.org/bugzilla/attachment.cgi?id=7931&action=edit
gen-unicode-ctype.py

Python rewrite of Bruno Haible’s gen-unicode-ctype.c.

This version produces *exactly* the same output as the C program:

    $ gcc -o gen-unicode-ctype gen-unicode-ctype.c
    $ ./gen-unicode-ctype UnicodeData.txt 7.0.0
    $ ./gen-unicode-ctype.py -u UnicodeData.txt -o unicode-new --unicode_version 7.0.0
    $ diff -u unicode unicode-new
    $

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (35 preceding siblings ...)
  2014-11-14 15:10 ` maiku.fabian at gmail dot com
@ 2014-11-14 15:11 ` maiku.fabian at gmail dot com
  2014-11-24 11:28 ` maiku.fabian at gmail dot com
                   ` (10 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: maiku.fabian at gmail dot com @ 2014-11-14 15:11 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #31 from Mike FABIAN <maiku.fabian at gmail dot com> ---
I should probably update gen-unicode-ctype.py again so that it reads
the original glibc “i18n” file and writes out a new one with the
character classes replaced. That would avoid having to cut and paste
them manually.

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (36 preceding siblings ...)
  2014-11-14 15:11 ` maiku.fabian at gmail dot com
@ 2014-11-24 11:28 ` maiku.fabian at gmail dot com
  2014-12-01 10:38 ` maiku.fabian at gmail dot com
                   ` (9 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: maiku.fabian at gmail dot com @ 2014-11-24 11:28 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #32 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Mike FABIAN from comment #23)

> 3) it does not put some characters like:
> 
>     upper: Missing: ᾈ 0x1f88 GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
> 
> into “upper”. Surprisingly,
> 
> “U+1F88 ᾈ GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI”
> is *not* listed as “Uppercase” in
> http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt .
> 
> Although U+1F88 seems to be Uppercase according to
> http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
> because it has a tolower mapping to U+1F80:
> 
>     1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;Ll;0;L;1F00 0345;;;;N;;;1F88;;1F88
>     1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI;Lt;0;L;1F08 0345;;;;N;;;;1F80;
> 
> So this might be a bug in DerivedCoreProperties.txt.

It is not a bug in DerivedCoreProperties.txt. I asked on the Unicode
mailing list:

http://www.unicode.org/mail-arch/unicode-ml/y2014-m11/0010.html

So these characters really are title case (general category Lt), which
is why they are not listed as “Uppercase”.

Because of the restrictions of ISO C 99, these title-case characters
should therefore be in both the “upper” and the “lower” character class
in LC_CTYPE (my gen-unicode-ctype.py from comment #28 does this).
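
A minimal sketch of that consequence, not the code from
gen-unicode-ctype.py: it uses Python's standard unicodedata module
(whose Unicode version depends on the Python release and need not be
7.0.0) to collect the characters with general category Lt and puts each
of them into both classes. The helper name titlecase_code_points and
the two sets are made up for illustration.

```python
import unicodedata


def titlecase_code_points(limit=0x110000):
    """Return all code points below limit with general category Lt."""
    return [cp for cp in range(limit)
            if unicodedata.category(chr(cp)) == "Lt"]


# Hypothetical class sets; the real script fills them from the UCD files.
upper = set()
lower = set()
for cp in titlecase_code_points():
    # As described above, a title-case character joins both classes.
    upper.add(cp)
    lower.add(cp)

print("U+1F88 in upper:", 0x1F88 in upper)
print("U+1F88 in lower:", 0x1F88 in lower)
```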

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (37 preceding siblings ...)
  2014-11-24 11:28 ` maiku.fabian at gmail dot com
@ 2014-12-01 10:38 ` maiku.fabian at gmail dot com
  2014-12-03 10:01 ` maiku.fabian at gmail dot com
                   ` (8 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: maiku.fabian at gmail dot com @ 2014-12-01 10:38 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #7931|0                           |1
        is obsolete|                            |
   Attachment #7932|0                           |1
        is obsolete|                            |

--- Comment #33 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Created attachment 7979
  --> https://sourceware.org/bugzilla/attachment.cgi?id=7979&action=edit
gen-unicode-ctype.py

New version of gen-unicode-ctype.py, which can read the head and tail
of the original i18n file.  To avoid having to cut and paste the
generated LC_CTYPE character classes into the new glibc i18n file,
the script now reads the old file as well. It copies everything from
the old file to the newly generated file except the LC_CTYPE character
class data, which is generated from the given UnicodeData.txt and
DerivedCoreProperties.txt.
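
The head/tail idea can be sketched roughly as follows. This is only an
illustration, not the attached script: the function names and the two
marker strings ("LC_CTYPE" as the end of the head, "translit_start" as
the start of the tail) are assumptions chosen for the example.

```python
def split_head_tail(old_text, start_marker="LC_CTYPE",
                    end_marker="translit_start"):
    """Split old_text into (head, tail).

    head runs up to and including the first line starting with
    start_marker; tail runs from the first later line starting with
    end_marker to the end. Both markers are assumptions.
    """
    head, tail = [], []
    lines = iter(old_text.splitlines(keepends=True))
    for line in lines:
        head.append(line)
        if line.startswith(start_marker):
            break
    for line in lines:
        if line.startswith(end_marker):
            tail.append(line)
            break
    tail.extend(lines)  # everything after the end marker
    return "".join(head), "".join(tail)


def regenerate(old_text, new_classes):
    """Keep head and tail of the old file, replace what lies between."""
    head, tail = split_head_tail(old_text)
    return head + new_classes + tail


old = ("% comment\nLC_CTYPE\nupper old\nlower old\n"
       "translit_start\ninclude\nEND LC_CTYPE\n")
print(regenerate(old, "upper new\nlower new\n"), end="")
```

Everything between the two markers is dropped and replaced by the
freshly generated class data, so nothing has to be pasted by hand.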

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (38 preceding siblings ...)
  2014-12-01 10:38 ` maiku.fabian at gmail dot com
@ 2014-12-03 10:01 ` maiku.fabian at gmail dot com
  2014-12-03 12:47 ` maiku.fabian at gmail dot com
                   ` (7 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: maiku.fabian at gmail dot com @ 2014-12-03 10:01 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #34 from Mike FABIAN <maiku.fabian at gmail dot com> ---
When I generate a new glibc/localedata/locales/i18n file
using gen-unicode-ctype.py from comment#33 and build
glibc with that and then run the tests with “make check”, I get
one failure:

    FAIL: localedata/tst-ctype

Looking into why it fails, I find this in ./localedata/tst-ctype.out:

    Locale-specific tests for `lower'
      islower('ª' = '\xaa') is true
      islower('º' = '\xba') is true
    Locale-specific tests for `lower'
    ...
    2 errors for `de_DE.ISO-8859-1' locale

The new “lower” character class generated by gen-unicode-ctype.py
contains U+00AA ª FEMININE ORDINAL INDICATOR and U+00BA º MASCULINE
ORDINAL INDICATOR.

The test tst-ctype run by “make check” wants them *not* to be lower case.

DerivedCoreProperties.txt lists both as lower case though:

    00AA          ; Lowercase # Lo       FEMININE ORDINAL INDICATOR
    00BA          ; Lowercase # Lo       MASCULINE ORDINAL INDICATOR

That’s why gen-unicode-ctype.py adds them to the “lower” character
class: it adds every character marked as “Lowercase” in
DerivedCoreProperties.txt to the character class “lower”.
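
A minimal sketch of that selection step, assuming the usual
"code point or first..last range ; property" layout of
DerivedCoreProperties.txt. The sample text holds the two lines quoted
above plus two range lines written in the same layout for illustration;
the function name and regular expression are not taken from
gen-unicode-ctype.py.

```python
import re

SAMPLE = """\
0061..007A    ; Lowercase # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00AA          ; Lowercase # Lo       FEMININE ORDINAL INDICATOR
00BA          ; Lowercase # Lo       MASCULINE ORDINAL INDICATOR
0041..005A    ; Uppercase # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
"""

# One code point or a first..last range, then the property name.
LINE_RE = re.compile(
    r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def code_points_with_property(text, prop):
    """Return the set of code points that text assigns to prop."""
    result = set()
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match or match.group(3) != prop:
            continue  # comment, blank line, or another property
        first = int(match.group(1), 16)
        last = int(match.group(2) or match.group(1), 16)
        result.update(range(first, last + 1))
    return result


lower = code_points_with_property(SAMPLE, "Lowercase")
print(0xAA in lower, 0xBA in lower)
```

With this rule the ordinal indicators land in “lower” although their
general category is Lo, which is exactly what the test trips over.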

I wonder what needs to be done here.

Is the test in glibc wrong?

If so, it could be fixed by a patch like this:

$ git show | iconv -f iso-8859-1 -t utf-8
commit 25c913674386011a44b6270579a894b2e8200d25
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Wed Dec 3 10:05:42 2014 +0100

    Fix test case localedata/tst-ctype-de_DE.ISO-8859-1.in

    DerivedCoreProperties.txt from Unicode 7.0.0 lists
    the characters U+00AA (ª) and U+00BA (º) as lower case:

    00AA          ; Lowercase # Lo       FEMININE ORDINAL INDICATOR
    00BA          ; Lowercase # Lo       MASCULINE ORDINAL INDICATOR

diff --git a/localedata/tst-ctype-de_DE.ISO-8859-1.in b/localedata/tst-ctype-de_DE.ISO-8859-1.in
index f71d76c..e124a52 100644
--- a/localedata/tst-ctype-de_DE.ISO-8859-1.in
+++ b/localedata/tst-ctype-de_DE.ISO-8859-1.in
@@ -1,5 +1,5 @@
 lower    ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
-        000000000000000000000100000000000000000000000000
+        000000000010000000000100001000000000000000000000
 lower   ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
         000000000000000111111111111111111111111011111111
 upper    ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
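
To read the changed line in the patch: each digit in the second row is
the expected islower() result for the character above it. The sketch
below pairs the two rows, assuming the character row starts at 0xA0
(no-break space) and ends at 0xCF; the helper name flagged is made up
for illustration.

```python
# Characters 0xA0..0xCF of ISO-8859-1 (assumed range of the first row).
chars = bytes(range(0xA0, 0xD0)).decode("iso-8859-1")

# Expected-result rows from the patch, split into groups of eight
# digits only for readability.
old_bits = ("00000000" "00000000" "00000100"
            "00000000" "00000000" "00000000")
new_bits = ("00000000" "00100000" "00000100"
            "00100000" "00000000" "00000000")


def flagged(chars, bits):
    """Return the characters whose expected result is '1'."""
    return [c for c, b in zip(chars, bits) if b == "1"]


print("old:", flagged(chars, old_bits))
print("new:", flagged(chars, new_bits))
```

Under that assumption the old row expects only µ to be lower case, and
the new row additionally expects ª and º.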

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (40 preceding siblings ...)
  2014-12-03 12:47 ` maiku.fabian at gmail dot com
@ 2014-12-03 12:47 ` maiku.fabian at gmail dot com
  2014-12-04 10:35 ` maiku.fabian at gmail dot com
                   ` (5 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: maiku.fabian at gmail dot com @ 2014-12-03 12:47 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #35 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Created attachment 7988
  --> https://sourceware.org/bugzilla/attachment.cgi?id=7988&action=edit
0001-Update-LC_CTYPE-character-class-data-to-Unicode-7.0..patch

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (39 preceding siblings ...)
  2014-12-03 10:01 ` maiku.fabian at gmail dot com
@ 2014-12-03 12:47 ` maiku.fabian at gmail dot com
  2014-12-03 12:47 ` maiku.fabian at gmail dot com
                   ` (6 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: maiku.fabian at gmail dot com @ 2014-12-03 12:47 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #36 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Created attachment 7989
  --> https://sourceware.org/bugzilla/attachment.cgi?id=7989&action=edit
0002-Fix-test-case-localedata-tst-ctype-de_DE.ISO-8859-1..patch

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (41 preceding siblings ...)
  2014-12-03 12:47 ` maiku.fabian at gmail dot com
@ 2014-12-04 10:35 ` maiku.fabian at gmail dot com
  2015-02-21  1:04 ` cvs-commit at gcc dot gnu.org
                   ` (4 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: maiku.fabian at gmail dot com @ 2014-12-04 10:35 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #37 from Mike FABIAN <maiku.fabian at gmail dot com> ---
*** Bug 14010 has been marked as a duplicate of this bug. ***

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (42 preceding siblings ...)
  2014-12-04 10:35 ` maiku.fabian at gmail dot com
@ 2015-02-21  1:04 ` cvs-commit at gcc dot gnu.org
  2015-02-21  1:05 ` aoliva at sourceware dot org
                   ` (3 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2015-02-21  1:04 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #38 from cvs-commit at gcc dot gnu.org <cvs-commit at gcc dot gnu.org> ---
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  4a4839c94a4c93ffc0d5b95c69a08b02a57007f2 (commit)
      from  e4a399dc3dbb3228eb39af230ad11bc42a018c93 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a4839c94a4c93ffc0d5b95c69a08b02a57007f2

commit 4a4839c94a4c93ffc0d5b95c69a08b02a57007f2
Author: Alexandre Oliva <aoliva@redhat.com>
Date:   Fri Feb 20 20:14:59 2015 -0200

    Unicode 7.0.0 update; added generator scripts.

    for  localedata/ChangeLog

        [BZ #17588]
        [BZ #13064]
        [BZ #14094]
        [BZ #17998]
        * unicode-gen/Makefile: New.
        * unicode-gen/unicode-license.txt: New, from Unicode.
        * unicode-gen/UnicodeData.txt: New, from Unicode.
        * unicode-gen/DerivedCoreProperties.txt: New, from Unicode.
        * unicode-gen/EastAsianWidth.txt: New, from Unicode.
        * unicode-gen/gen_unicode_ctype.py: New generator, from Mike
        FABIAN <mfabian@redhat.com>.
        * unicode-gen/ctype_compatibility.py: New verifier, from
        Pravin Satpute <psatpute@redhat.com> and Mike FABIAN.
        * unicode-gen/ctype_compatibility_test_cases.py: New verifier
        module, from Mike FABIAN.
        * unicode-gen/utf8_gen.py: New generator, from Pravin Satpute
        and Mike FABIAN.
        * unicode-gen/utf8_compatibility.py: New verifier, from Pravin
        Satpute and Mike FABIAN.
        * charmaps/UTF-8: Update.
        * locales/i18n: Update.
        * gen-unicode-ctype.c: Remove.
        * tst-ctype-de_DE.ISO-8859-1.in: Adjust, islower now returns
        true for ordinal indicators.

-----------------------------------------------------------------------

Summary of changes:
 NEWS                                               |   11 +-
 localedata/ChangeLog                               |   27 +
 localedata/charmaps/UTF-8                          |11946 ++++++---
 localedata/gen-unicode-ctype.c                     |  784 -
 localedata/locales/i18n                            | 2652 +-
 localedata/tst-ctype-de_DE.ISO-8859-1.in           |    2 +-
 localedata/unicode-gen/DerivedCoreProperties.txt   |10794 ++++++++
 localedata/unicode-gen/EastAsianWidth.txt          | 2121 ++
 localedata/unicode-gen/Makefile                    |   99 +
 localedata/unicode-gen/UnicodeData.txt             |27268 ++++++++++++++++++++
 localedata/unicode-gen/ctype_compatibility.py      |  546 +
 .../unicode-gen/ctype_compatibility_test_cases.py  |  951 +
 localedata/unicode-gen/gen_unicode_ctype.py        |  751 +
 localedata/unicode-gen/unicode-license.txt         |   50 +
 localedata/unicode-gen/utf8_compatibility.py       |  399 +
 localedata/unicode-gen/utf8_gen.py                 |  286 +
 16 files changed, 53305 insertions(+), 5382 deletions(-)
 delete mode 100644 localedata/gen-unicode-ctype.c
 create mode 100644 localedata/unicode-gen/DerivedCoreProperties.txt
 create mode 100644 localedata/unicode-gen/EastAsianWidth.txt
 create mode 100644 localedata/unicode-gen/Makefile
 create mode 100644 localedata/unicode-gen/UnicodeData.txt
 create mode 100755 localedata/unicode-gen/ctype_compatibility.py
 create mode 100644 localedata/unicode-gen/ctype_compatibility_test_cases.py
 create mode 100755 localedata/unicode-gen/gen_unicode_ctype.py
 create mode 100644 localedata/unicode-gen/unicode-license.txt
 create mode 100755 localedata/unicode-gen/utf8_compatibility.py
 create mode 100755 localedata/unicode-gen/utf8_gen.py

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (43 preceding siblings ...)
  2015-02-21  1:04 ` cvs-commit at gcc dot gnu.org
@ 2015-02-21  1:05 ` aoliva at sourceware dot org
  2015-02-21 20:50 ` aoliva at sourceware dot org
                   ` (2 subsequent siblings)
  47 siblings, 0 replies; 49+ messages in thread
From: aoliva at sourceware dot org @ 2015-02-21  1:05 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094
Bug 14094 depends on bug 17588, which changed state.

Bug 17588 Summary: Update UTF-8 charmap and width to Unicode 7.0.0
https://sourceware.org/bugzilla/show_bug.cgi?id=17588

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (44 preceding siblings ...)
  2015-02-21  1:05 ` aoliva at sourceware dot org
@ 2015-02-21 20:50 ` aoliva at sourceware dot org
  2016-03-22 11:19 ` egmont at gmail dot com
  2016-03-22 18:33 ` vapier at gentoo dot org
  47 siblings, 0 replies; 49+ messages in thread
From: aoliva at sourceware dot org @ 2015-02-21 20:50 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

Alexandre Oliva <aoliva at sourceware dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
                 CC|                            |aoliva at sourceware dot org
         Resolution|---                         |FIXED

--- Comment #39 from Alexandre Oliva <aoliva at sourceware dot org> ---
Fixed

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (45 preceding siblings ...)
  2015-02-21 20:50 ` aoliva at sourceware dot org
@ 2016-03-22 11:19 ` egmont at gmail dot com
  2016-03-22 18:33 ` vapier at gentoo dot org
  47 siblings, 0 replies; 49+ messages in thread
From: egmont at gmail dot com @ 2016-03-22 11:19 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

Egmont Koblinger <egmont at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |egmont at gmail dot com

--- Comment #40 from Egmont Koblinger <egmont at gmail dot com> ---
Please see bug 19852 for a followup issue.

* [Bug localedata/14094] Update locale data to Unicode 7.0.0
  2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
                   ` (46 preceding siblings ...)
  2016-03-22 11:19 ` egmont at gmail dot com
@ 2016-03-22 18:33 ` vapier at gentoo dot org
  47 siblings, 0 replies; 49+ messages in thread
From: vapier at gentoo dot org @ 2016-03-22 18:33 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

Mike Frysinger <vapier at gentoo dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://sourceware.org/bugz
                   |                            |illa/show_bug.cgi?id=19852

Thread overview: 49+ messages
2012-05-10 21:06 [Bug localedata/14094] New: Update locale data to Unicode 6.1 jsm28 at gcc dot gnu.org
2012-05-11  7:28 ` [Bug localedata/14094] " bugdal at aerifal dot cx
2013-11-26 17:07 ` myllynen at redhat dot com
2014-02-18 10:12 ` pravin.d.s at gmail dot com
2014-05-21 12:52 ` allan at archlinux dot org
2014-05-21 12:52 ` johannes at kyriasis dot com
2014-05-23  7:56 ` [Bug localedata/14094] Update locale data to Unicode 6.3 pravin.d.s at gmail dot com
2014-05-23 12:11 ` joseph at codesourcery dot com
2014-05-23 13:55 ` pravin.d.s at gmail dot com
2014-06-10  9:43 ` pravin.d.s at gmail dot com
2014-06-10 14:40 ` carlos at redhat dot com
2014-06-11  4:25 ` pravin.d.s at gmail dot com
2014-06-19 11:28 ` pravin.d.s at gmail dot com
2014-06-21 19:15 ` [Bug localedata/14094] Update locale data to Unicode 7.0.0 pravin.d.s at gmail dot com
2014-06-25 12:08 ` fweimer at redhat dot com
2014-06-25 12:43 ` pravin.d.s at gmail dot com
2014-06-25 13:54 ` carlos at redhat dot com
2014-07-04 10:51 ` pravin.d.s at gmail dot com
2014-07-17 12:44 ` pravin.d.s at gmail dot com
2014-07-22 13:03 ` pravin.d.s at gmail dot com
2014-09-05  1:08 ` carlos at redhat dot com
2014-09-29  7:29 ` maiku.fabian at gmail dot com
2014-09-29  7:30 ` pravin.d.s at gmail dot com
2014-10-14  8:08 ` maiku.fabian at gmail dot com
2014-11-06 11:03 ` maiku.fabian at gmail dot com
2014-11-06 11:03 ` maiku.fabian at gmail dot com
2014-11-06 11:05 ` maiku.fabian at gmail dot com
2014-11-06 11:22 ` maiku.fabian at gmail dot com
2014-11-06 11:56 ` maiku.fabian at gmail dot com
2014-11-06 11:59 ` maiku.fabian at gmail dot com
2014-11-06 12:09 ` maiku.fabian at gmail dot com
2014-11-12 10:15 ` pravin.d.s at gmail dot com
2014-11-12 10:25 ` pravin.d.s at gmail dot com
2014-11-14 15:10 ` maiku.fabian at gmail dot com
2014-11-14 15:10 ` maiku.fabian at gmail dot com
2014-11-14 15:10 ` maiku.fabian at gmail dot com
2014-11-14 15:10 ` maiku.fabian at gmail dot com
2014-11-14 15:11 ` maiku.fabian at gmail dot com
2014-11-24 11:28 ` maiku.fabian at gmail dot com
2014-12-01 10:38 ` maiku.fabian at gmail dot com
2014-12-03 10:01 ` maiku.fabian at gmail dot com
2014-12-03 12:47 ` maiku.fabian at gmail dot com
2014-12-03 12:47 ` maiku.fabian at gmail dot com
2014-12-04 10:35 ` maiku.fabian at gmail dot com
2015-02-21  1:04 ` cvs-commit at gcc dot gnu.org
2015-02-21  1:05 ` aoliva at sourceware dot org
2015-02-21 20:50 ` aoliva at sourceware dot org
2016-03-22 11:19 ` egmont at gmail dot com
2016-03-22 18:33 ` vapier at gentoo dot org