public inbox for libc-locales@sourceware.org
* [Bug localedata/20865] New: iconv: cp950 does not contain EUDC/PUA mappings
@ 2016-11-25  4:49 arthur200126 at gmail dot com
  2016-11-29  8:15 ` [Bug localedata/20865] charmaps: " fweimer at redhat dot com
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: arthur200126 at gmail dot com @ 2016-11-25  4:49 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=20865

            Bug ID: 20865
           Summary: iconv: cp950 does not contain EUDC/PUA mappings
           Product: glibc
           Version: unspecified
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: localedata
          Assignee: unassigned at sourceware dot org
          Reporter: arthur200126 at gmail dot com
                CC: libc-locales at sourceware dot org
  Target Milestone: ---

Microsoft's cp950 mapping contains sequential mappings from Big5's End-User-
Defined Characters (EUDC) areas to the Unicode Private Use Area (PUA). Such
mappings are used by a number of Big5 extensions, including HKSCS, which uses
these PUA code points when a character is not yet available in the target UCS
version.

The following sessions come from GNU bash running in a UTF-8 console. $''
denotes bash's ANSI C-style quoting, where \xhh generates the raw byte with hex
value hh and \uhhhh generates the encoding of U+hhhh under the current locale.

Currently glibc's cp950 implementation does not contain these mappings:

# iconv (Ubuntu GLIBC 2.23-0ubuntu4) 2.23
ubuntu$ iconv -f cp950 -t utf-32le <<< $'\x81\x40' | hexdump -C
iconv: illegal input sequence at position 0
ubuntu$ iconv -t cp950 -f utf-8 <<< $'\ueeb8' | hexdump -C
iconv: illegal input sequence at position 0

The desired behavior for decoding can be seen in libiconv:

# iconv (GNU libiconv 1.14)
cygwin$ iconv -f cp950 -t utf-32le <<< $'\x81\x40' | hexdump -C
00000000  b8 ee 00 00 0a 00 00 00                           |........|
00000008
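
As a cross-check on the dump above, the eight bytes can be decoded by hand. The
short Python sketch below is not part of the original report; it assumes only
the byte values shown in the hexdump. The trailing 0a 00 00 00 is the newline
that the bash here-string appends to the input.

```python
# Bytes copied from the libiconv hexdump above (UTF-32LE output).
dump = bytes.fromhex("b8 ee 00 00 0a 00 00 00")

# Decode as little-endian UTF-32 and list the resulting code points.
decoded = dump.decode("utf-32-le")
code_points = [f"U+{ord(ch):04X}" for ch in decoded]
print(code_points)  # ['U+EEB8', 'U+000A']

# The same code point encoded as UTF-8, i.e. the bytes that $'\ueeb8'
# expands to when bash runs in a UTF-8 locale.
utf8_bytes = "\ueeb8".encode("utf-8")
print(utf8_bytes.hex(" "))  # ee ba b8
```

So 0x8140 decodes to U+EEB8, the same code point that the encoding tests feed
to iconv as $'\ueeb8'.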

Note that libiconv does not perform the reverse (encoding) mapping:

cygwin$ iconv -t cp950 -f utf-8 <<< $'\ueeb8' | hexdump -C
iconv: illegal input sequence at position 0

libiconv's mapping:
http://git.savannah.gnu.org/cgit/libiconv.git/tree/lib/cp950.h#n72
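
A rough sketch of the arithmetic behind such a sequential mapping, written in
Python for illustration only. The three lead-byte ranges and their PUA base
values reflect one reading of the comments in libiconv's cp950.h and should be
treated as assumptions to verify against that file; only the 0x8140 -> U+EEB8
pair is confirmed by the session above. The function name cp950_eudc_to_pua is
hypothetical, and any other private-area ranges in that file are omitted.

```python
from typing import Optional

# (first lead byte, last lead byte, PUA code point of the first cell).
# ASSUMPTION: values as read from libiconv's cp950.h comments; verify there.
EUDC_RANGES = [
    (0x81, 0x8D, 0xEEB8),
    (0x8E, 0xA0, 0xE311),
    (0xFA, 0xFE, 0xE000),
]

# A Big5 row has 157 cells: trail bytes 0x40..0x7E (63) and 0xA1..0xFE (94).
CELLS_PER_ROW = 157


def cp950_eudc_to_pua(lead: int, trail: int) -> Optional[int]:
    """Return the PUA code point for a two-byte EUDC code, or None."""
    if 0x40 <= trail <= 0x7E:
        col = trail - 0x40
    elif 0xA1 <= trail <= 0xFE:
        col = trail - 0xA1 + 63
    else:
        return None  # not a valid Big5 trail byte
    for first, last, base in EUDC_RANGES:
        if first <= lead <= last:
            # Cells are numbered row by row, starting from the range's base.
            return base + (lead - first) * CELLS_PER_ROW + col
    return None  # lead byte is outside the assumed EUDC ranges


print(f"U+{cp950_eudc_to_pua(0x81, 0x40):04X}")  # U+EEB8
```

Under these assumptions the three ranges tile U+E000..U+F6B0 without gaps, which
is what "sequential" means here.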

-- 
You are receiving this mail because:
You are on the CC list for the bug.

* [Bug localedata/20865] charmaps: cp950 does not contain EUDC/PUA mappings
From: fweimer at redhat dot com @ 2016-11-29  8:15 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=20865

Mike Frysinger <vapier at gentoo dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|iconv: cp950 does not       |charmaps: cp950 does not
                   |contain EUDC/PUA mappings   |contain EUDC/PUA mappings

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|                            |security-

--- Comment #1 from Mike Frysinger <vapier at gentoo dot org> ---
we currently alias CP950 to BIG5.  i don't think we want to add all these
mappings to BIG5 since it explicitly carves out those spaces as "reserved":
  https://en.wikipedia.org/wiki/Big5#A_more_detailed_look_at_the_organization

which means we need to copy BIG5 to CP950 and add the MS extensions.  then drop
the alias of CP950->BIG5.

* [Bug localedata/20865] charmaps: cp950 does not contain EUDC/PUA mappings
From: arthur200126 at gmail dot com @ 2016-11-30  1:03 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=20865

--- Comment #2 from Mingye Wang <arthur200126 at gmail dot com> ---
> I don't think we want to add all these mappings to BIG5 since it
> explicitly carves out those spaces as "reserved":
Considering how GB 18030 maps its user-defined areas to the PUA, mapping the
"reserved for user-defined characters" ranges to the PUA may be acceptable. The
"reserved, not for user-defined" part is not mapped to the PUA in cp950 and
big5-2003.


> which means we need to copy BIG5 to CP950 and add the MS extensions
I went back to libiconv's charts and read a few lines up to see all of MS's
modifications & extensions; now it does sound messy enough for a split. (0xA1F2
and 0xA1F3 are entertainingly weird.)

* [Bug localedata/20865] charmaps: cp950 does not contain EUDC/PUA mappings
From: arthur200126 at gmail dot com @ 2016-11-30 21:25 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=20865

--- Comment #3 from Mingye Wang <arthur200126 at gmail dot com> ---
> now it does sound messy enough for a split.

Hmm, glibc is already using the "MICSFT/WINDOWS" cp950 definition for BIG5's
charmap. So it's not that bad...

* [Bug localedata/20865] charmaps: cp950 does not contain EUDC/PUA mappings
From: maiku.fabian at gmail dot com @ 2017-10-21  8:27 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=20865

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |maiku.fabian at gmail dot com

* [Bug localedata/20865] charmaps: cp950 needs Windows EUDC/PUA mappings
From: arthur200126 at gmail dot com @ 2017-10-28 18:22 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=20865

Mingye Wang <arthur200126 at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|charmaps: cp950 does not    |charmaps: cp950 needs
                   |contain EUDC/PUA mappings   |Windows EUDC/PUA mappings
