From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 5429 invoked by alias); 8 Feb 2016 15:39:36 -0000
Mailing-List: contact libc-locales-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id:
List-Subscribe:
List-Post:
List-Help: ,
Sender: libc-locales-owner@sourceware.org
Received: (qmail 87124 invoked by uid 48); 8 Feb 2016 14:47:02 -0000
From: "carlos at redhat dot com"
To: libc-locales@sourceware.org
Subject: [Bug localedata/19575] Status of GB18030 tables
Date: Mon, 08 Feb 2016 15:39:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: glibc
X-Bugzilla-Component: localedata
X-Bugzilla-Version: 2.24
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: carlos at redhat dot com
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at sourceware dot org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: security-
X-Bugzilla-Changed-Fields: cc
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2016-q1/txt/msg00080.txt.bz2

https://sourceware.org/bugzilla/show_bug.cgi?id=19575

Carlos O'Donell changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |carlos at redhat dot com

--- Comment #6 from Carlos O'Donell ---
(In reply to Andreas Schwab from comment #5)
> ICU says that these characters are not roundtrip mappings.
>
> \xA6\xD9 |1
> \xA6\xDA |1
> \xA6\xDB |1
> \xA6\xDC |1
> \xA6\xDD |1
> \xA6\xDE |1
> \xA6\xDF |1
> \xA6\xEC |1
> \xA6\xED |1
> \xA6\xF3 |1
>
> Most likely the characters did not exist yet in Unicode in 2012.

They are not roundtrip, I assume, because those are PUA code points.

The GB 18030-2005 standard still uses some PUA code points for some ideograms.
The glibc non-PUA code point usage (which differs from the published
standard) is correct for GB 18030-2005 compliance. In Unicode 4.1 or newer,
the PUA code points can be replaced with their non-PUA equivalents. It is
highly recommended that the Unicode 4.1 code points be used by anyone mapping
GB 18030-2005 to UTF-8; this is best practice as documented in "CJKV
Processing" by Dr. Ken Lunde.

The ICU implementation complies with the old GB 18030-2000 standard, and does
not use the newer Unicode 4.1 equivalent code points. My opinion is that this
is simply a bug in ICU and Emacs; both should be updated.

I feel like we need to install some explanatory patch like this in glibc:

diff --git a/localedata/charmaps/GB18030 b/localedata/charmaps/GB18030
index 863a123..c48276e 100644
--- a/localedata/charmaps/GB18030
+++ b/localedata/charmaps/GB18030
@@ -57234,6 +57234,12 @@ CHARMAP
  /xa6/xbe
  /xa6/xbf
  /xa6/xc0
+% The newest GB 18030-2005 standard still uses some private use area
+% code points. Any implementation which has Unicode 4.1 or newer
+% support should not use these PUA code points, and instead should
+% map these entries to their equivalent non-PUA code points which
+% in this case map from to . This recommendation is
+% based on "CJKV Processing" by Dr. Ken Lunde.
 % /xa6/xd9
 % /xa6/xda
 % /xa6/xdb
@@ -62997,6 +63003,10 @@ CHARMAP
  /x84/x31/x82/x33 VARIATION SELECTOR-14
  /x84/x31/x82/x34 VARIATION SELECTOR-15
  /x84/x31/x82/x35 VARIATION SELECTOR-16
+% The code points from to are a adjustment
+% of the GB 18030-2005 standard to account for the fact that
+% with Unicode 4.1 support we can now correctly represent those
+% entries, which in the standard, used PUA code points.
  /xa6/xd9 PRESENTATION FORM FOR VERTICAL COMMA
  /xa6/xdb PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA
  /xa6/xda PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP

I've reached out to Dr. Ken Lunde to clarify if this is correct.
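
For anyone who wants to see the recommendation in concrete terms, here is a rough Python sketch of "prefer the non-PUA code point" for the three entries visible in the second charmap hunk. The byte sequences and character names come from that hunk. The names `NON_PUA_OVERRIDES`, `is_private_use` and `decode_pair` are made up for illustration, and falling back to Python's built-in `gb18030` codec for every other sequence is an assumption of this sketch, not something glibc, ICU or Emacs does.

```python
import unicodedata

# Hypothetical override table: two-byte GB18030 sequences mapped to the
# non-PUA characters named in the charmap hunk. Names are resolved through
# the Unicode database rather than hard-coded as code point numbers.
NON_PUA_OVERRIDES = {
    b"\xa6\xd9": unicodedata.lookup(
        "PRESENTATION FORM FOR VERTICAL COMMA"),
    b"\xa6\xdb": unicodedata.lookup(
        "PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA"),
    b"\xa6\xda": unicodedata.lookup(
        "PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP"),
}


def is_private_use(ch: str) -> bool:
    """Return True if ch has Unicode general category Co (private use)."""
    return unicodedata.category(ch) == "Co"


def decode_pair(pair: bytes) -> str:
    """Decode one two-byte GB18030 sequence, preferring the override table.

    Sequences without an override fall back to Python's gb18030 codec
    (an assumption of this sketch).
    """
    if pair in NON_PUA_OVERRIDES:
        return NON_PUA_OVERRIDES[pair]
    return pair.decode("gb18030")


if __name__ == "__main__":
    for seq in (b"\xa6\xd9", b"\xa6\xda", b"\xa6\xdb"):
        ch = decode_pair(seq)
        print(seq.hex(), f"U+{ord(ch):04X}", unicodedata.name(ch),
              "PUA" if is_private_use(ch) else "non-PUA")
```

The table is keyed on the byte sequence rather than on a PUA code point so that the sketch only relies on values that appear in the patch itself.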
-- 
You are receiving this mail because:
You are on the CC list for the bug.