public inbox for glibc-bugs@sourceware.org
From: "maiku.fabian at gmail dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug localedata/17588] Update UTF-8 charmap and width to Unicode 7.0.0
Date: Wed, 03 Dec 2014 07:17:00 -0000
Message-ID: <bug-17588-131-DkqnHQQZQ1@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-17588-131@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=17588

--- Comment #9 from Mike FABIAN <maiku.fabian at gmail dot com> ---
I built glibc with the patch from comment#8.

It produces some FAILs in “make check”:

    FAIL: localedata/cs_CZ.UTF-8/LC_CTYPE
    ... similar FAILs ...

Shortly after starting “make check” one sees:

    ./charmaps/UTF-8:42734: unknown character `U00009FCD'
    ... similar messages ...

All the above problems are caused by ranges of reserved code points
which are listed in EastAsianWidth.txt like this:

    9FCD..9FFF;W     # Cn    [51] <reserved-9FCD>..<reserved-9FFF>

and these code points are not in UnicodeData.txt.

Therefore, they are not generated into the CHARMAP section
of glibc’s UTF-8 file, and the above problems occur if they
are generated into the WIDTH section of glibc’s UTF-8 file.
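
To illustrate the mismatch, here is a minimal sketch that reports which wide or fullwidth ranges mention code points missing from a given set of defined code points. The function names and the `defined` set are hypothetical and are not part of utf8-gen.py; building `defined` from UnicodeData.txt is left out.

```python
import re

def parse_range(field):
    """Parse 'XXXX' or 'XXXX..YYYY' into an inclusive (start, end) pair."""
    start, _, end = field.partition('..')
    return int(start, 16), int(end or start, 16)

def undefined_width_ranges(eaw_lines, defined):
    """Return W/F ranges that contain a code point not in `defined`.

    `eaw_lines` are lines in EastAsianWidth.txt format; `defined` is a
    set of integers (assumed to come from UnicodeData.txt).
    """
    problems = []
    for line in eaw_lines:
        m = re.match(r'^([0-9A-F.]+);([WF])\b', line)
        if not m:
            continue  # comment, blank line, or a width other than W/F
        start, end = parse_range(m.group(1))
        if any(cp not in defined for cp in range(start, end + 1)):
            problems.append((start, end))
    return problems

if __name__ == "__main__":
    sample = ["9FCD..9FFF;W     # Cn    [51] <reserved-9FCD>..<reserved-9FFF>"]
    for start, end in undefined_width_ranges(sample, defined=set()):
        print("undefined range: U+%04X..U+%04X" % (start, end))
```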

This can be fixed by not generating reserved code points into
the WIDTH section, i.e. by ignoring the reserved code points
mentioned in EastAsianWidth.txt. Patch for utf8-gen.py:

diff --git a/utf8-gen.py b/utf8-gen.py
index 57875b6..20b68bb 100755
--- a/utf8-gen.py
+++ b/utf8-gen.py
@@ -218,6 +218,8 @@ if __name__ == "__main__":
         write_comments(outfile, 1)
         elines = []
         for line in easta_file.readlines():
+                if re.match(r'.*<reserved-.+>\.\.<reserved-.+>.*', line):
+                        continue
                 if re.match(r'^[^;]*;[WF]', line):
                         elines.append(line.strip())
         process_width(outfile, flines, elines)
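
The filtering step in the patch above can also be tried in isolation. The following is only a sketch: `select_width_lines` is a hypothetical helper, not a function in utf8-gen.py, and like the patch it only skips reserved *ranges*, not single reserved code points.

```python
import re

# Same patterns as in the patch, precompiled for reuse.
RESERVED_RANGE = re.compile(r'<reserved-.+>\.\.<reserved-.+>')
WIDE_OR_FULL = re.compile(r'^[^;]*;[WF]')

def select_width_lines(lines):
    """Keep W/F lines from EastAsianWidth.txt, skipping reserved ranges."""
    selected = []
    for line in lines:
        if RESERVED_RANGE.search(line):
            continue  # reserved code points have no CHARMAP entry
        if WIDE_OR_FULL.match(line):
            selected.append(line.strip())
    return selected

if __name__ == "__main__":
    sample = [
        "9FCD..9FFF;W     # Cn    [51] <reserved-9FCD>..<reserved-9FFF>\n",
        "3000;F           # Zs         IDEOGRAPHIC SPACE\n",
    ]
    for kept in select_width_lines(sample):
        print(kept)
```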

-- 
You are receiving this mail because:
You are on the CC list for the bug.
From: "maiku.fabian at gmail dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug localedata/14094] Update locale data to Unicode 7.0.0
Date: Wed, 03 Dec 2014 09:59:00 -0000
Message-ID: <bug-14094-131-MOycuzHapv@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-14094-131@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #34 from Mike FABIAN <maiku.fabian at gmail dot com> ---
When I generate a new glibc/localedata/locales/i18n file
using gen-unicode-ctype.py from comment#33 and build
glibc with that and then run the tests with “make check”, I get
one failure:

    FAIL: localedata/tst-ctype

Looking into why it fails, I find in ./localedata/tst-ctype.out:

    Locale-specific tests for `lower'
      islower('ª' = '\xaa') is true
      islower('º' = '\xba') is true
    Locale-specific tests for `lower'
    ...
    2 errors for `de_DE.ISO-8859-1' locale

The new “lower” character class generated by gen-unicode-ctype.py
contains U+00AA ª FEMININE ORDINAL INDICATOR and U+00BA º MASCULINE
ORDINAL INDICATOR.

The test tst-ctype run by “make check” wants them *not* to be lower case.

DerivedCoreProperties.txt lists both as lower case though:

    00AA          ; Lowercase # Lo       FEMININE ORDINAL INDICATOR
    00BA          ; Lowercase # Lo       MASCULINE ORDINAL INDICATOR

That’s why gen-unicode-ctype.py adds them to the “lower” character
class: it adds all characters marked as “Lowercase” in
DerivedCoreProperties.txt to the character class “lower”.
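
A minimal sketch of that kind of extraction, assuming the line layout shown above (code point or range, a semicolon, the property name, then a comment). `code_points_with_property` is a hypothetical name and is not taken from gen-unicode-ctype.py.

```python
import re

# Matches "XXXX ; Prop" and "XXXX..YYYY ; Prop" at the start of a line.
ENTRY = re.compile(r'^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)')

def code_points_with_property(lines, prop):
    """Collect all code points that `lines` assign the property `prop`."""
    found = set()
    for line in lines:
        m = ENTRY.match(line)
        if m and m.group(3) == prop:
            start = int(m.group(1), 16)
            end = int(m.group(2) or m.group(1), 16)
            found.update(range(start, end + 1))
    return found

if __name__ == "__main__":
    sample = [
        "00AA          ; Lowercase # Lo       FEMININE ORDINAL INDICATOR",
        "00BA          ; Lowercase # Lo       MASCULINE ORDINAL INDICATOR",
    ]
    lower = code_points_with_property(sample, "Lowercase")
    print(sorted("U+%04X" % cp for cp in lower))
```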

I wonder what needs to be done here.

Is the test in glibc wrong?

If so, it could be fixed by a patch like this:

$ git show | iconv -f iso-8859-1 -t utf-8
commit 25c913674386011a44b6270579a894b2e8200d25
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Wed Dec 3 10:05:42 2014 +0100

    Fix test case localedata/tst-ctype-de_DE.ISO-8859-1.in

    DerivedCoreProperties.txt from Unicode 7.0.0 lists
    the characters U+00AA (ª) and U+00BA (º) as lower case:

    00AA          ; Lowercase # Lo       FEMININE ORDINAL INDICATOR
    00BA          ; Lowercase # Lo       MASCULINE ORDINAL INDICATOR

diff --git a/localedata/tst-ctype-de_DE.ISO-8859-1.in b/localedata/tst-ctype-de_DE.ISO-8859-1.in
index f71d76c..e124a52 100644
--- a/localedata/tst-ctype-de_DE.ISO-8859-1.in
+++ b/localedata/tst-ctype-de_DE.ISO-8859-1.in
@@ -1,5 +1,5 @@
 lower    ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
-        000000000000000000000100000000000000000000000000
+        000000000010000000000100001000000000000000000000
 lower   ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
         000000000000000111111111111111111111111011111111
 upper    ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
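
For readers unfamiliar with the test input, the 0/1 rows can be reproduced with a short sketch. It rests on assumptions inferred only from the diff above: each digit belongs to the character directly above it, the first row covers 0xA0 through 0xCF, and the single 1 in the old row is U+00B5 MICRO SIGN. `class_bitmap` is a hypothetical helper.

```python
def class_bitmap(first, last, members):
    """Return one '0'/'1' digit per code point in first..last (inclusive)."""
    return ''.join('1' if cp in members else '0'
                   for cp in range(first, last + 1))

if __name__ == "__main__":
    old_lower = {0xB5}                      # assumed: only MICRO SIGN is set
    new_lower = old_lower | {0xAA, 0xBA}    # add the two ordinal indicators
    print(class_bitmap(0xA0, 0xCF, old_lower))
    print(class_bitmap(0xA0, 0xCF, new_lower))
```

If the assumed layout is right, the two printed rows should correspond to the removed and added lines of the hunk.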

From: "pravin.d.s at gmail dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug localedata/17588] Update UTF-8 charmap and width to Unicode 7.0.0
Date: Wed, 03 Dec 2014 11:49:00 -0000
Message-ID: <bug-17588-131-DUHIElb1jH@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-17588-131@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=17588

Pravin S <pravin.d.s at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #7980|0                           |1
        is obsolete|                            |

--- Comment #10 from Pravin S <pravin.d.s at gmail dot com> ---
Created attachment 7987
  --> https://sourceware.org/bugzilla/attachment.cgi?id=7987&action=edit
Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0

2014-12-01  Pravin Satpute  <psatpute@redhat.com>

        [BZ #17588 #13064]
        * charmaps/UTF-8: Updated UTF-8 CHARMAP and WIDTH to Unicode 7.0.0.

        * localedata/utf8-gen.py: New script for generating UTF-8 CHARMAP from
        latest UnicodeData.txt.

        * localedata/utf8-compatibility.py: New script for testing backward
        compatibility of newly generated UTF-8 file.
       Reviewed and improved by Mike FABIAN <mfabian@redhat.com>

------------------------------------------------------------------------------

Yes, I was also able to reproduce the same issues while building glibc with
the patch. This patch fixes those issues.



Thread overview: 13+ messages
2014-11-12 10:11 [Bug localedata/17588] New: " pravin.d.s at gmail dot com
2014-11-12 11:19 ` [Bug localedata/17588] " pravin.d.s at gmail dot com
2014-11-12 11:22 ` pravin.d.s at gmail dot com
2014-11-21  6:27 ` pravin.d.s at gmail dot com
2014-11-21  7:35 ` maiku.fabian at gmail dot com
2014-11-21 16:49 ` maiku.fabian at gmail dot com
2014-11-24 16:34 ` pravin.d.s at gmail dot com
2014-12-01 11:49 ` maiku.fabian at gmail dot com
2014-12-01 11:54 ` pravin.d.s at gmail dot com
2014-12-03  7:17 ` maiku.fabian at gmail dot com [this message]
2014-12-12 11:31 ` pravin.d.s at gmail dot com
2015-02-20 22:36 ` cvs-commit at gcc dot gnu.org
2015-02-21  0:06 ` aoliva at sourceware dot org
