public inbox for libc-locales@sourceware.org
* [Bug localedata/13063] Can not 'sort -u' all Chinese characters in CJK UNIFIED IDEOGRAPH EXTENSION A/B/C/D
  2011-08-06 19:28 [Bug localedata/13063] New: Can not 'sort -u' all Chinese characters in CJK UNIFIED IDEOGRAPH EXTENSION A/B/C/D an.euroford at gmail dot com
@ 2011-08-06 19:28 ` an.euroford at gmail dot com
  2011-08-07 20:46 ` [Bug localedata/13063] 'sort -u' will erase some Chinese characters an.euroford at gmail dot com
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: an.euroford at gmail dot com @ 2011-08-06 19:28 UTC (permalink / raw)
  To: libc-locales

http://sourceware.org/bugzilla/show_bug.cgi?id=13063

--- Comment #1 from An Yang <an.euroford at gmail dot com> 2011-08-06 17:24:33 UTC ---
Created attachment 5880
  --> http://sourceware.org/bugzilla/attachment.cgi?id=5880
example characters in CJK extension A.

-- 
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.


* [Bug localedata/13063] New: Can not 'sort -u' all Chinese characters in CJK UNIFIED IDEOGRAPH EXTENSION A/B/C/D
@ 2011-08-06 19:28 an.euroford at gmail dot com
  2011-08-06 19:28 ` [Bug localedata/13063] " an.euroford at gmail dot com
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: an.euroford at gmail dot com @ 2011-08-06 19:28 UTC (permalink / raw)
  To: libc-locales

http://sourceware.org/bugzilla/show_bug.cgi?id=13063

           Summary: Can not 'sort -u' all Chinese characters in CJK
                    UNIFIED IDEOGRAPH EXTENSION A/B/C/D
           Product: glibc
           Version: unspecified
            Status: NEW
          Severity: critical
          Priority: P2
         Component: localedata
        AssignedTo: libc-locales@sources.redhat.com
        ReportedBy: an.euroford@gmail.com


Hi,

According to glibc/localedata/locales/zh_CN and iso14651_t1_pinyin (or
iso14651_t1), glibc only supports Unicode 3.0.

The current version of Unicode is 6.0. It extends CJK UNIFIED IDEOGRAPH with
Extensions A/B/C/D, and Extension A is included in GB18030:2005 (the Chinese
national charset standard).

So glibc should at least be able to sort all Chinese characters in CJK UNIFIED
IDEOGRAPH and EXTENSION A (U+3400-U+4DBF).

The visible effect is with sort -u.
If you run sort -u examples_CJK_extensionA.txt (see attachment), you
will get only one Chinese character, "㑗".
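
A toy model of why this can happen, not glibc's or coreutils' actual code:
sort -u keeps one line from each run of lines that compare equal under the
locale's collation, so if every character missing from the collation table
ends up with the same (empty) weight, all such lines collapse into one. The
WEIGHTS table and the collation_key and sort_unique helpers below are
hypothetical stand-ins for illustration only.

```python
# Toy model: characters absent from the collation table contribute no weight,
# so any two strings made only of such characters compare equal.
from itertools import groupby

# Hypothetical, tiny "collation table"; real tables hold thousands of entries.
WEIGHTS = {"a": 1, "b": 2, "c": 3}


def collation_key(line):
    """Return the list of weights for the characters that have an entry."""
    return [WEIGHTS[ch] for ch in line if ch in WEIGHTS]


def sort_unique(lines):
    """Mimic `sort -u`: sort by key, keep the first line of each equal run."""
    ordered = sorted(lines, key=collation_key)  # sorted() is stable
    return [next(group) for _, group in groupby(ordered, key=collation_key)]


# Three Extension A code points with no table entry, plus three lines that have one.
lines = ["\u3400", "\u3401", "\u3402", "a", "b", "c"]
result = sort_unique(lines)
print(result)
```

In this model only one of the three Extension A lines survives, while a, b
and c are all kept, which matches the shape of the reported symptom.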


Regards,
An Yang


* [Bug localedata/13063] 'sort -u' will erase some Chinese characters
  2011-08-06 19:28 [Bug localedata/13063] New: Can not 'sort -u' all Chinese characters in CJK UNIFIED IDEOGRAPH EXTENSION A/B/C/D an.euroford at gmail dot com
  2011-08-06 19:28 ` [Bug localedata/13063] " an.euroford at gmail dot com
@ 2011-08-07 20:46 ` an.euroford at gmail dot com
  2011-08-07 20:47 ` an.euroford at gmail dot com
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: an.euroford at gmail dot com @ 2011-08-07 20:46 UTC (permalink / raw)
  To: libc-locales

http://sourceware.org/bugzilla/show_bug.cgi?id=13063

--- Comment #2 from An Yang <an.euroford at gmail dot com> 2011-08-07 17:42:44 UTC ---
I'm not sure whether this bug has any relationship with charmaps; it may or may not.
But the value of LC_COLLATE in zh_CN is:

% ISO 14651 collation sequence
LC_COLLATE
copy "iso14651_t1_pinyin"
END LC_COLLATE

I'm sure something is wrong in this table.

None of the erased Chinese characters has a record in iso14651_t1_pinyin, but
they are all included in CJK Unified Ideographs/ExtA/B/C/D.


* [Bug localedata/13063] 'sort -u' will erase some Chinese characters
  2011-08-06 19:28 [Bug localedata/13063] New: Can not 'sort -u' all Chinese characters in CJK UNIFIED IDEOGRAPH EXTENSION A/B/C/D an.euroford at gmail dot com
  2011-08-06 19:28 ` [Bug localedata/13063] " an.euroford at gmail dot com
  2011-08-07 20:46 ` [Bug localedata/13063] 'sort -u' will erase some Chinese characters an.euroford at gmail dot com
@ 2011-08-07 20:47 ` an.euroford at gmail dot com
  2011-08-08 16:56 ` an.euroford at gmail dot com
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: an.euroford at gmail dot com @ 2011-08-07 20:47 UTC (permalink / raw)
  To: libc-locales

http://sourceware.org/bugzilla/show_bug.cgi?id=13063

An Yang <an.euroford at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Can not 'sort -u' all       |'sort -u' will erase some
                   |Chinese characters in CJK   |Chinese characters
                   |UNIFIED IDEOGRAPH EXTENSION |
                   |A/B/C/D                     |


* [Bug localedata/13063] 'sort -u' will erase some Chinese characters
  2011-08-06 19:28 [Bug localedata/13063] New: Can not 'sort -u' all Chinese characters in CJK UNIFIED IDEOGRAPH EXTENSION A/B/C/D an.euroford at gmail dot com
                   ` (2 preceding siblings ...)
  2011-08-07 20:47 ` an.euroford at gmail dot com
@ 2011-08-08 16:56 ` an.euroford at gmail dot com
  2014-05-07  8:20 ` bluebat at member dot fsf.org
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: an.euroford at gmail dot com @ 2011-08-08 16:56 UTC (permalink / raw)
  To: libc-locales

http://sourceware.org/bugzilla/show_bug.cgi?id=13063

--- Comment #3 from An Yang <an.euroford at gmail dot com> 2011-08-08 16:54:28 UTC ---
There are 25496 Chinese characters in iso14651_t1_pinyin, most of them
distributed over CJK Unified Ideographs and CJK Unified Ideographs Extension A.

But there are 27552 Chinese characters in CJK Unified Ideographs and Extension
A, so more than 2000 Chinese characters without pinyin are missing.

So my suggestion is simply to add the missing characters at the end of
iso14651_t1_pinyin, in Unicode order.
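
Going by the figures above, the gap is 27552 - 25496 = 2056 characters. The
suggestion could be sketched roughly as follows. This is an illustration
only: the `covered` set stands in for the code points already listed in
iso14651_t1_pinyin (parsing that file is not shown), the ranges are the
nominal block boundaries rather than the exact assigned ranges, and the
output is just `<UXXXX>` symbol names in code point order, not complete
collation entries. The helper name missing_symbols is made up for this
example.

```python
# Nominal block ranges (inclusive); some code points at the ends are unassigned.
CJK_EXT_A = (0x3400, 0x4DBF)
CJK_UNIFIED = (0x4E00, 0x9FFF)


def missing_symbols(covered, ranges=(CJK_EXT_A, CJK_UNIFIED)):
    """Return <UXXXX> names for code points in `ranges` that are absent
    from `covered`, in ascending code point order."""
    missing = []
    for first, last in sorted(ranges):
        for cp in range(first, last + 1):
            if cp not in covered:
                missing.append("<U%04X>" % cp)
    return missing


# Hypothetical input: pretend the table covers all but three code points.
covered = set(range(0x3400, 0x4DC0)) | set(range(0x4E00, 0xA000))
covered -= {0x3400, 0x3457, 0x4E00}
print(missing_symbols(covered))
```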

Could you give me any feedback?


* [Bug localedata/13063] 'sort -u' will erase some Chinese characters
  2011-08-06 19:28 [Bug localedata/13063] New: Can not 'sort -u' all Chinese characters in CJK UNIFIED IDEOGRAPH EXTENSION A/B/C/D an.euroford at gmail dot com
                   ` (3 preceding siblings ...)
  2011-08-08 16:56 ` an.euroford at gmail dot com
@ 2014-05-07  8:20 ` bluebat at member dot fsf.org
  2014-06-13 15:12 ` fweimer at redhat dot com
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: bluebat at member dot fsf.org @ 2014-05-07  8:20 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=13063

趙惟倫 <bluebat at member dot fsf.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bluebat at member dot fsf.org

--- Comment #4 from 趙惟倫 <bluebat at member dot fsf.org> ---
Confirmed, as BZ#15616 reports.
BZ#16905 is another approach, but it is untested.


* [Bug localedata/13063] 'sort -u' will erase some Chinese characters
  2011-08-06 19:28 [Bug localedata/13063] New: Can not 'sort -u' all Chinese characters in CJK UNIFIED IDEOGRAPH EXTENSION A/B/C/D an.euroford at gmail dot com
                   ` (4 preceding siblings ...)
  2014-05-07  8:20 ` bluebat at member dot fsf.org
@ 2014-06-13 15:12 ` fweimer at redhat dot com
  2014-11-19  4:14 ` bluebat at member dot fsf.org
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: fweimer at redhat dot com @ 2014-06-13 15:12 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=13063

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|                            |security-


* [Bug localedata/13063] 'sort -u' will erase some Chinese characters
  2011-08-06 19:28 [Bug localedata/13063] New: Can not 'sort -u' all Chinese characters in CJK UNIFIED IDEOGRAPH EXTENSION A/B/C/D an.euroford at gmail dot com
                   ` (5 preceding siblings ...)
  2014-06-13 15:12 ` fweimer at redhat dot com
@ 2014-11-19  4:14 ` bluebat at member dot fsf.org
  2017-01-22 23:58 ` arthur200126 at gmail dot com
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: bluebat at member dot fsf.org @ 2014-11-19  4:14 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=13063

--- Comment #5 from Wei-Lun Chao <bluebat at member dot fsf.org> ---
Tested with the patch from bug 17563, and it passes.


* [Bug localedata/13063] 'sort -u' will erase some Chinese characters
  2011-08-06 19:28 [Bug localedata/13063] New: Can not 'sort -u' all Chinese characters in CJK UNIFIED IDEOGRAPH EXTENSION A/B/C/D an.euroford at gmail dot com
                   ` (6 preceding siblings ...)
  2014-11-19  4:14 ` bluebat at member dot fsf.org
@ 2017-01-22 23:58 ` arthur200126 at gmail dot com
  2017-07-19 16:16 ` maiku.fabian at gmail dot com
  2017-07-20  8:02 ` maiku.fabian at gmail dot com
  9 siblings, 0 replies; 11+ messages in thread
From: arthur200126 at gmail dot com @ 2017-01-22 23:58 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=13063

Mingye Wang <arthur200126 at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |arthur200126 at gmail dot com

--- Comment #6 from Mingye Wang <arthur200126 at gmail dot com> ---
This bug is seen not only with ExtA characters, but also with simple
punctuation and/or kana.

$ printf '%s\n' , 。 : ¥ あ か ア カ a b c , . : $ | LC_COLLATE=zh_CN.UTF-8 sort -u
,
:
.
$
,
a
b
c

(uniq does the same thing.)

It seems that glibc is just eating away anything not on that list. (What kind
of equivalence assumption is that?)


* [Bug localedata/13063] 'sort -u' will erase some Chinese characters
  2011-08-06 19:28 [Bug localedata/13063] New: Can not 'sort -u' all Chinese characters in CJK UNIFIED IDEOGRAPH EXTENSION A/B/C/D an.euroford at gmail dot com
                   ` (7 preceding siblings ...)
  2017-01-22 23:58 ` arthur200126 at gmail dot com
@ 2017-07-19 16:16 ` maiku.fabian at gmail dot com
  2017-07-20  8:02 ` maiku.fabian at gmail dot com
  9 siblings, 0 replies; 11+ messages in thread
From: maiku.fabian at gmail dot com @ 2017-07-19 16:16 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=13063

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |maiku.fabian at gmail dot com


* [Bug localedata/13063] 'sort -u' will erase some Chinese characters
  2011-08-06 19:28 [Bug localedata/13063] New: Can not 'sort -u' all Chinese characters in CJK UNIFIED IDEOGRAPH EXTENSION A/B/C/D an.euroford at gmail dot com
                   ` (8 preceding siblings ...)
  2017-07-19 16:16 ` maiku.fabian at gmail dot com
@ 2017-07-20  8:02 ` maiku.fabian at gmail dot com
  9 siblings, 0 replies; 11+ messages in thread
From: maiku.fabian at gmail dot com @ 2017-07-20  8:02 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=13063

--- Comment #7 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Mingye Wang from comment #6)
> This bug is not only seen with extA characters, but also seen with simple
> punctuations and/or kanas. 
> 
> $ printf '%s\n' , 。 : ¥ あ か ア カ a b c , . : $ | LC_COLLATE=zh_CN.UTF-8 sort
> -u
> ,
> :
> .
> $
> ,
> a
> b
> c
> 
> (uniq does the same thing.)
> 
> It seems that glibc is just eating away anything not on that list. (What
> kind of equivalence assumption is that?)

This is caused by the collation symbol UNDEFINED not working correctly,
see:

https://sourceware.org/bugzilla/show_bug.cgi?id=18978
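
One way to check whether a given system is affected is to ask the C library
which strings collate as equal. The sketch below uses Python's
locale.strcoll, which defers to the platform's collation routine; it assumes
a zh_CN.UTF-8 locale is installed and returns None otherwise. The helper name
equal_pairs is made up for this example, and the result depends entirely on
the installed glibc and locale data, so no expected output is shown.

```python
import itertools
import locale


def equal_pairs(strings, locale_name="zh_CN.UTF-8"):
    """Return pairs of distinct strings that strcoll reports as equal,
    or None if the locale cannot be selected on this system."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return [(a, b)
                for a, b in itertools.combinations(strings, 2)
                if a != b and locale.strcoll(a, b) == 0]
    finally:
        # Always restore the collation locale that was active before the probe.
        locale.setlocale(locale.LC_COLLATE, saved)


# Ideographic full stop, yen sign, two hiragana, and three ASCII letters.
samples = ["\u3002", "\u00a5", "\u3042", "\u304b", "a", "b", "c"]
pairs = equal_pairs(samples)
print(pairs)
```

Any pair reported here is a pair of lines that sort -u would be expected to
merge under that locale.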


Thread overview: 11+ messages
2011-08-06 19:28 [Bug localedata/13063] New: Can not 'sort -u' all Chinese characters in CJK UNIFIED IDEOGRAPH EXTENSION A/B/C/D an.euroford at gmail dot com
2011-08-06 19:28 ` [Bug localedata/13063] " an.euroford at gmail dot com
2011-08-07 20:46 ` [Bug localedata/13063] 'sort -u' will erase some Chinese characters an.euroford at gmail dot com
2011-08-07 20:47 ` an.euroford at gmail dot com
2011-08-08 16:56 ` an.euroford at gmail dot com
2014-05-07  8:20 ` bluebat at member dot fsf.org
2014-06-13 15:12 ` fweimer at redhat dot com
2014-11-19  4:14 ` bluebat at member dot fsf.org
2017-01-22 23:58 ` arthur200126 at gmail dot com
2017-07-19 16:16 ` maiku.fabian at gmail dot com
2017-07-20  8:02 ` maiku.fabian at gmail dot com
