public inbox for libc-locales@sourceware.org
* [Bug localedata/14038] New: strcoll sorting order
@ 2012-05-01  4:39 ndrwrdck at gmail dot com
  2012-05-01  4:40 ` [Bug localedata/14038] " ndrwrdck at gmail dot com
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: ndrwrdck at gmail dot com @ 2012-05-01  4:39 UTC (permalink / raw)
  To: libc-locales

http://sourceware.org/bugzilla/show_bug.cgi?id=14038

             Bug #: 14038
           Summary: strcoll sorting order
           Product: glibc
           Version: 2.13
            Status: NEW
          Severity: normal
          Priority: P2
         Component: localedata
        AssignedTo: unassigned@sourceware.org
        ReportedBy: ndrwrdck@gmail.com
                CC: libc-locales@sources.redhat.com
    Classification: Unclassified


(Not sure whether this is an implementation or a documentation issue.)

In UTF-8 locales, the result of some string comparisons depends on the length
of the strings. I'm not sure whether it is supposed to work that way (if so,
the docs should reference the standard that defines these rules) or whether it
is just a bug.

For example, if strcoll is used as a comparison function, these strings will be
sorted as follows:

あ
a
あa
aa
あaa
aaa
あaaa

I'd expect the following order to be correct:

あ
あa
あaa
あaaa
a
aa
aaa
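
For reference, a minimal sketch of the kind of sort described above, using
strcoll as the qsort comparison function. The helper names compare_collated
and sort_in_locale are hypothetical and not taken from the report, and the
resulting order depends on the locale data installed on the system.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: orders two C strings with strcoll, i.e. by the
   collation rules of the current LC_COLLATE locale. */
int compare_collated(const void *lhs, const void *rhs)
{
    const char *a = *(const char *const *)lhs;
    const char *b = *(const char *const *)rhs;
    return strcoll(a, b);
}

/* Hypothetical helper: select a collation locale, then sort the array in
   place. Returns 0 on success, -1 if the locale is not available. */
int sort_in_locale(const char **items, size_t count, const char *locale_name)
{
    assert(items != NULL);
    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return -1;
    qsort(items, count, sizeof items[0], compare_collated);
    return 0;
}
```

Calling sort_in_locale(items, count, "en_US.UTF-8") on the seven strings above
would be one way to try to reproduce the report, assuming that locale is
installed.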

-- 
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.

* [Bug localedata/14038] strcoll sorting order
  2012-05-01  4:39 [Bug localedata/14038] New: strcoll sorting order ndrwrdck at gmail dot com
@ 2012-05-01  4:40 ` ndrwrdck at gmail dot com
  2012-05-01  7:43 ` schwab@linux-m68k.org
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: ndrwrdck at gmail dot com @ 2012-05-01  4:40 UTC (permalink / raw)
  To: libc-locales

http://sourceware.org/bugzilla/show_bug.cgi?id=14038

--- Comment #1 from Andrzej <ndrwrdck at gmail dot com> 2012-05-01 03:58:14 UTC ---
Just to clarify, I ran into this issue(?) when we tried to optimize sorting in
our application.

Our assumption was that, when the first characters of two strings differ,
comparing just those characters is as good as comparing the whole strings;
that is, if 'あ' < 'a' then 'あaaa' < 'aa'. This assumption fails with the
current design of strcoll.
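
A locale-safe way to speed up such a sort, sketched here as an alternative to
the first-character shortcut, is to transform each string once with strxfrm
and compare the resulting keys with strcmp. The helper name make_sort_key is
hypothetical; strxfrm itself is standard C and is defined so that strcmp on
two transformed keys agrees in sign with strcoll on the original strings.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'd sort key for `text` under the
   current LC_COLLATE locale, or NULL on allocation failure. The caller
   frees the key. */
char *make_sort_key(const char *text)
{
    assert(text != NULL);
    /* With a zero-sized buffer, strxfrm only reports the key length. */
    size_t needed = strxfrm(NULL, text, 0) + 1;
    char *key = malloc(needed);
    if (key != NULL)
        strxfrm(key, text, needed);
    return key;
}
```

The keys are only comparable with each other while the locale that produced
them stays selected, so setlocale should be called before any key is built.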

* [Bug localedata/14038] strcoll sorting order
  2012-05-01  4:39 [Bug localedata/14038] New: strcoll sorting order ndrwrdck at gmail dot com
  2012-05-01  4:40 ` [Bug localedata/14038] " ndrwrdck at gmail dot com
@ 2012-05-01  7:43 ` schwab@linux-m68k.org
  2012-05-01  7:43 ` schwab@linux-m68k.org
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: schwab@linux-m68k.org @ 2012-05-01  7:43 UTC (permalink / raw)
  To: libc-locales

http://sourceware.org/bugzilla/show_bug.cgi?id=14038

--- Comment #2 from Andreas Schwab <schwab@linux-m68k.org> 2012-05-01 06:48:47 UTC ---
This is a bad assumption in any case because the sorting algorithm may ignore
some characters in the first pass.

* [Bug localedata/14038] strcoll sorting order
  2012-05-01  4:39 [Bug localedata/14038] New: strcoll sorting order ndrwrdck at gmail dot com
  2012-05-01  4:40 ` [Bug localedata/14038] " ndrwrdck at gmail dot com
  2012-05-01  7:43 ` schwab@linux-m68k.org
@ 2012-05-01  7:43 ` schwab@linux-m68k.org
  2012-05-01  9:50 ` pasky at ucw dot cz
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: schwab@linux-m68k.org @ 2012-05-01  7:43 UTC (permalink / raw)
  To: libc-locales

http://sourceware.org/bugzilla/show_bug.cgi?id=14038

--- Comment #3 from Andreas Schwab <schwab@linux-m68k.org> 2012-05-01 06:54:12 UTC ---
The common sorting weights from iso14651_t1_common have no entries for Japanese
characters, so they are ignored in the first pass.  The ja_JP locale sorts them
after the Latin characters.
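
The effect can be illustrated with a toy model. This is not glibc's
implementation: it assumes, purely for illustration, that every non-ASCII byte
has no first-level weight and that a tie in the first pass is broken by a
plain strcmp. The names primary_key and toy_collate are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy first pass: copy only ASCII bytes and drop everything else, as a
   stand-in for "characters without a first-level weight are ignored". */
char *primary_key(const char *text)
{
    char *key = malloc(strlen(text) + 1);
    assert(key != NULL);
    char *out = key;
    for (const unsigned char *p = (const unsigned char *)text; *p != '\0'; ++p)
        if (*p < 0x80)
            *out++ = (char)*p;
    *out = '\0';
    return key;
}

/* Toy comparison: the first pass compares the reduced keys; only if they
   tie does a later pass (here simply strcmp on the raw bytes) decide. */
int toy_collate(const char *a, const char *b)
{
    char *key_a = primary_key(a);
    char *key_b = primary_key(b);
    int result = strcmp(key_a, key_b);
    free(key_a);
    free(key_b);
    return result != 0 ? result : strcmp(a, b);
}
```

Under these assumptions 'あaaa' is compared as 'aaa' in the first pass and
therefore sorts after 'aa', which is why the number of 'a' characters, and not
the leading 'あ', decides the order in the report.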

* [Bug localedata/14038] strcoll sorting order
  2012-05-01  4:39 [Bug localedata/14038] New: strcoll sorting order ndrwrdck at gmail dot com
                   ` (2 preceding siblings ...)
  2012-05-01  7:43 ` schwab@linux-m68k.org
@ 2012-05-01  9:50 ` pasky at ucw dot cz
  2012-05-01 11:32 ` ndrwrdck at gmail dot com
  2014-06-25 12:09 ` fweimer at redhat dot com
  5 siblings, 0 replies; 7+ messages in thread
From: pasky at ucw dot cz @ 2012-05-01  9:50 UTC (permalink / raw)
  To: libc-locales

http://sourceware.org/bugzilla/show_bug.cgi?id=14038

Petr Baudis <pasky at ucw dot cz> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |pasky at ucw dot cz
         Resolution|                            |INVALID

--- Comment #4 from Petr Baudis <pasky at ucw dot cz> 2012-05-01 08:58:15 UTC ---
Marking as INVALID, thanks to Andreas for taking care to explain. Indeed, the
sorting is locale-dependent and may ignore various (usually the unknown)
characters. Set LC_COLLATE to POSIX if you want "programmer-friendly" sorting
order. Andrzej, feel free to reopen if you have more questions.
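
A minimal sketch of that suggestion, with a hypothetical helper name
compare_posix. It assumes that changing LC_COLLATE for the whole process is
acceptable; in the POSIX locale strcoll orders strings by byte value, as
strcmp does.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: compares two strings with strcoll after switching
   LC_COLLATE to the POSIX locale, where collation follows byte values. */
int compare_posix(const char *a, const char *b)
{
    const char *selected = setlocale(LC_COLLATE, "POSIX");
    assert(selected != NULL); /* the POSIX locale is always available */
    return strcoll(a, b);
}
```

Note that byte order places UTF-8 encoded 'あ' after 'a', so the result is not
the order requested in the original report, but strings whose first characters
differ are then ordered by those characters alone.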

* [Bug localedata/14038] strcoll sorting order
  2012-05-01  4:39 [Bug localedata/14038] New: strcoll sorting order ndrwrdck at gmail dot com
                   ` (3 preceding siblings ...)
  2012-05-01  9:50 ` pasky at ucw dot cz
@ 2012-05-01 11:32 ` ndrwrdck at gmail dot com
  2014-06-25 12:09 ` fweimer at redhat dot com
  5 siblings, 0 replies; 7+ messages in thread
From: ndrwrdck at gmail dot com @ 2012-05-01 11:32 UTC (permalink / raw)
  To: libc-locales

http://sourceware.org/bugzilla/show_bug.cgi?id=14038

--- Comment #5 from Andrzej <ndrwrdck at gmail dot com> 2012-05-01 10:44:21 UTC ---
Just wanted to ask whether there is any plan to add Japanese definitions to the
iso14651_t1_common file. The current behavior doesn't seem particularly
useful.

Also, the documentation issue is still valid: for a nontrivial function like
this, there should be at least some hints about where to find the comparison
rules or which standards it complies with.

(I'm satisfied with your explanation, so I won't reopen the bug. Please feel
free to reopen/reassign it if you think the above issues need to be addressed.)

* [Bug localedata/14038] strcoll sorting order
  2012-05-01  4:39 [Bug localedata/14038] New: strcoll sorting order ndrwrdck at gmail dot com
                   ` (4 preceding siblings ...)
  2012-05-01 11:32 ` ndrwrdck at gmail dot com
@ 2014-06-25 12:09 ` fweimer at redhat dot com
  5 siblings, 0 replies; 7+ messages in thread
From: fweimer at redhat dot com @ 2014-06-25 12:09 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=14038

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|                            |security-
