From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 26835 invoked by alias); 6 Nov 2014 11:45:38 -0000
Mailing-List: contact glibc-bugs-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id:
List-Subscribe:
List-Post:
List-Help: ,
Sender: glibc-bugs-owner@sourceware.org
Received: (qmail 26752 invoked by uid 48); 6 Nov 2014 11:45:32 -0000
From: "maiku.fabian at gmail dot com"
To: glibc-bugs@sourceware.org
Subject: [Bug localedata/14094] Update locale data to Unicode 7.0.0
Date: Thu, 06 Nov 2014 11:45:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: glibc
X-Bugzilla-Component: localedata
X-Bugzilla-Version: 2.21
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: maiku.fabian at gmail dot com
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: pravin.d.s at gmail dot com
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: security-
X-Bugzilla-Changed-Fields:
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2014-11/txt/msg00024.txt.bz2

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #23 from Mike FABIAN ---
Now Pravin’s approach in the patch attached to comment#15 is to
comment out the generation of “upper”, “lower” and “alpha” from
gen-unicode-ctype.c and add another script gen-unicode-ctype-dcp.py
which adds these.

But this is a bit problematic.

1) it does not put digits like

    alpha: Missing: ٠ 0x660 ARABIC-INDIC DIGIT ZERO

into “alpha”, which gen-unicode-ctype.c would have done.
gen-unicode-ctype.c contains the comment

      /* Consider all the non-ASCII digits as alphabetic.
         ISO C 99 forbids us to have them in category "digit",
         but we want iswalnum to return true on them.
      */

which sounds reasonable.

2) it does not put characters like

    lower: Missing: ǅ 0x1c5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON

into lower. This is actually title case, not lower case, but glibc
has only “lower” and “upper”, not “title”, although it does have
“toupper”, “tolower”, and “totitle”.

gen-unicode-ctype.c puts characters which change when “toupper” is
applied into “lower” and characters which change when “tolower” is
applied into “upper”. Therefore, gen-unicode-ctype.c puts title case
characters like ǅ 0x1c5 into *both* “upper” *and* “lower”, which
seems reasonable if glibc has no “title”.

3) it does not put some characters like

    upper: Missing: ᾈ 0x1f88 GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI

into “upper”.

Surprisingly, “U+1F88 ᾈ GREEK CAPITAL LETTER ALPHA WITH PSILI AND
PROSGEGRAMMENI” is *not* listed as “Uppercase” in

http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt .

Although U+1F88 seems to be Uppercase according to

http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt

because it has a tolower mapping to U+1F80:

    1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;Ll;0;L;1F00 0345;;;;N;;;1F88;;1F88
    1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI;Lt;0;L;1F08 0345;;;;N;;;;1F80;

So this might be a bug in DerivedCoreProperties.txt.

Generating “upper” and “lower” the way gen-unicode-ctype.c does,
i.e. just using UnicodeData.txt and checking whether characters
change when mapping them to upper or to lower, does not produce
this error.
I think the approach gen-unicode-ctype.c uses for “upper” and “lower”
is fine; it is not necessary to use DerivedCoreProperties.txt for this.

4) *many* characters end up being in “alpha” *and* “punct”

For example:

    error: ⷶ 0x2df6 is alpha and punct

gen-unicode-ctype.c has the comment:

      /* alpha restriction: "No character specified for the keywords cntrl,
         digit, punct or space shall be specified." */

This restriction is violated because the second script
gen-unicode-ctype-dcp.py used in Pravin’s 2-pass approach does not
check whether gen-unicode-ctype.c has already put a character into
“punct” before putting it into “alpha”.

The character “ⷶ U+2df6 COMBINING CYRILLIC LETTER A” is “Alphabetic”
according to DerivedCoreProperties.txt:

    2DE0..2DFF ; Alphabetic # Mn [32] COMBINING CYRILLIC LETTER BE..COMBINING CYRILLIC LETTER IOTIFIED BIG YUS

So Pravin’s script does rightly put it into “alpha”.

But looking at this, it does not seem a good idea to have two
independent programs generating the file in 2 independent passes.
Verifications like the ones gen-unicode-ctype.c does:

      /* toupper restriction: "Only characters specified for the keywords
         lower and upper shall be specified. */
      ...
      /* tolower restriction: "Only characters specified for the keywords
         lower and upper shall be specified. */
      ...
      /* alpha restriction: "Characters classified as either upper or lower
         shall automatically belong to this class. */
      ...
      /* alpha restriction: "No character specified for the keywords cntrl,
         digit, punct or space shall be specified." */
      ...
      /* space restriction: "No character specified for the keywords upper,
         lower, alpha, digit, graph or xdigit shall be specified."
         upper, lower, alpha already checked above. */
      ...
      /* cntrl restriction: "No character specified for the keywords upper,
         lower, alpha, digit, punct, graph, print or xdigit shall be
         specified." upper, lower, alpha already checked above. */
      ...

can be done much more easily when using a single program.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

From: "maiku.fabian at gmail dot com"
To: glibc-bugs@sourceware.org
Subject: [Bug localedata/14094] Update locale data to Unicode 7.0.0
Date: Thu, 06 Nov 2014 11:56:00 -0000

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #24 from Mike FABIAN ---
So I think we should do either:

1) improve gen-unicode-ctype.c and make it use
   DerivedCoreProperties.txt for “alpha”

or:

2) rewrite gen-unicode-ctype.c in Python

   First a rewrite which produces *exactly* the same output as
   gen-unicode-ctype.c, then add code to use
   DerivedCoreProperties.txt for “alpha”

No matter whether extending the C-Program or writing a Python program,
it should be a single program to be able to verify the restrictions
mentioned easily.

It would be nice of course to make the program read in the old i18n
file, replace the character classes, and write out a new file which
keeps the rest of the original file, so that no manual copy&paste of
the generated character classes is necessary.

From: "maiku.fabian at gmail dot com"
To: glibc-bugs@sourceware.org
Subject: [Bug localedata/14094] Update locale data to Unicode 7.0.0
Date: Thu, 06 Nov 2014 12:00:00 -0000

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #25 from Mike FABIAN ---
(In reply to Mike FABIAN from comment #24)
> No matter whether extending the C-Program or writing a Python program,
> it should be a single program to be able to verify the restrictions
> mentioned easily.

And as a 2nd pass, after the single program to generate the character
class data, use ctype-compatibility.py as a "test-suite".
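
A minimal sketch of what the single-program idea from comments #23 to
#25 could look like in Python. It is not the code of
gen-unicode-ctype.c or of any script in the bug; all function names
and the error message format are made up for illustration. It derives
“upper” and “lower” from the case mapping fields of
UnicodeData.txt-style records, as described in comment #23, and then
checks two of the quoted restrictions in the same program. The two
Greek records are the ones quoted in comment #23; the U+01C5 record
is hand-abbreviated (only name, category and case mappings filled
in), and the class contents passed to the check are a toy assumption.

```python
def parse_case_mappings(lines):
    """Return {code point: (toupper, tolower)} from UnicodeData.txt-style records."""
    mappings = {}
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) != 15:
            continue  # skip blank or malformed lines
        code = int(fields[0], 16)
        # Fields 12 and 13 hold the simple uppercase and lowercase mappings;
        # an empty field means the character maps to itself.
        to_upper = int(fields[12], 16) if fields[12] else code
        to_lower = int(fields[13], 16) if fields[13] else code
        mappings[code] = (to_upper, to_lower)
    return mappings


def derive_upper_lower(mappings):
    """Characters changed by tolower go to upper; changed by toupper go to lower."""
    upper = {c for c, (_, lo) in mappings.items() if lo != c}
    lower = {c for c, (up, _) in mappings.items() if up != c}
    return upper, lower


def check_restrictions(classes):
    """Check two of the restrictions quoted in comment #23 (hypothetical helper)."""
    errors = []
    alpha = classes.get("alpha", set())
    # alpha restriction: no character from cntrl, digit, punct or space.
    for other in ("cntrl", "digit", "punct", "space"):
        for code in sorted(alpha & classes.get(other, set())):
            errors.append(f"{code:#x} is alpha and {other}")
    # alpha restriction: upper and lower automatically belong to alpha.
    for name in ("upper", "lower"):
        for code in sorted(classes.get(name, set()) - alpha):
            errors.append(f"{code:#x} is {name} but not alpha")
    return errors


SAMPLE = [
    # Abbreviated record for U+01C5 (assumption: only the fields used here).
    "01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;;;;;N;;;01C4;01C6;01C5",
    # The two records quoted in comment #23.
    "1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;Ll;0;L;1F00 0345;;;;N;;;1F88;;1F88",
    "1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI;Lt;0;L;1F08 0345;;;;N;;;;1F80;",
]

mappings = parse_case_mappings(SAMPLE)
upper, lower = derive_upper_lower(mappings)

# Toy class data: U+2DF6 is deliberately placed in both alpha and punct
# to reproduce the kind of conflict reported in comment #23.
classes = {
    "upper": upper,
    "lower": lower,
    "alpha": upper | lower | {0x2DF6},
    "punct": {0x2DF6},
}

for message in check_restrictions(classes):
    print("error:", message)
```

With this toy data U+01C5 lands in both “upper” and “lower”, U+1F88
lands in “upper”, and the alpha/punct conflict for U+2DF6 is reported
before anything is written out. Because generation and verification
share the same in-memory sets, the remaining restrictions (space,
cntrl, toupper, tolower) could be added as further set comparisons in
check_restrictions.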