[Bug locale/18927] New: Different strings should never collate as equal

public inbox for glibc-bugs@sourceware.org

* [Bug locale/18927] New: Different strings should never collate as equal
@ 2015-09-06 22:21 egmont at gmail dot com
  2015-09-07 12:17 ` [Bug locale/18927] " fweimer at redhat dot com
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: egmont at gmail dot com @ 2015-09-06 22:21 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=18927

            Bug ID: 18927
           Summary: Different strings should never collate as equal
           Product: glibc
           Version: 2.21
            Status: NEW
          Severity: normal
          Priority: P2
         Component: locale
          Assignee: unassigned at sourceware dot org
          Reporter: egmont at gmail dot com
  Target Milestone: ---

Bug 13547 manually fixed a case where two distinct strings collated as equal.
Bug 16527 is another, currently unresolved case. Probably there are other, yet
undiscovered cases as well, and new ones might appear in the future.

This causes confusion with programs such as sort (the order is undefined, might
vary from run to run), or uniq (different lines being reported as equal).

I think there should be safeguard code so that no locale definition can
ever result in this happening.

One possible approach I can imagine: Change the current strxfrm() magic to
produce an output that's restricted to bytes in the 2-255 range. Then append a
0x01 byte followed by the original string's literal copy.
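
A minimal sketch of this key layout, written in Python rather than inside
glibc. The function toy_base_key() is a hypothetical stand-in for the existing
strxfrm() output; it only reproduces the single "ssz"/"szsz" tie and is not the
real hu_HU collation.

```python
SEPARATOR = b"\x01"  # must sort below every byte the base key can contain


def toy_base_key(text: str) -> bytes:
    """Hypothetical stand-in for strxfrm(): maps "ssz" and "szsz" to one key."""
    return text.replace("ssz", "szsz").encode("utf-8")


def tie_broken_key(text: str) -> bytes:
    """Base key, then SEPARATOR, then a literal copy of the original string."""
    base = toy_base_key(text)
    if any(byte < 2 for byte in base):
        raise ValueError("base key must stay within the 2-255 byte range")
    return base + SEPARATOR + text.encode("utf-8")
```

Because the separator is smaller than any byte of the base key, a key that is a
prefix of another key still sorts first, so the order given by the base key is
preserved and the literal copy only decides ties.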

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug locale/18927] Different strings should never collate as equal
  2015-09-06 22:21 [Bug locale/18927] New: Different strings should never collate as equal egmont at gmail dot com
@ 2015-09-07 12:17 ` fweimer at redhat dot com
  2015-09-09  7:21 ` fweimer at redhat dot com
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: fweimer at redhat dot com @ 2015-09-07 12:17 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=18927

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fweimer at redhat dot com



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug locale/18927] Different strings should never collate as equal
  2015-09-06 22:21 [Bug locale/18927] New: Different strings should never collate as equal egmont at gmail dot com
  2015-09-07 12:17 ` [Bug locale/18927] " fweimer at redhat dot com
@ 2015-09-09  7:21 ` fweimer at redhat dot com
  2015-09-09  8:13 ` egmont at gmail dot com
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: fweimer at redhat dot com @ 2015-09-09  7:21 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=18927

--- Comment #1 from Florian Weimer <fweimer at redhat dot com> ---
What does strcoll do in the cases you mention?  Do the strings compare as
equal?  Then we cannot have strxfrm produce different strings for them.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug locale/18927] Different strings should never collate as equal
  2015-09-06 22:21 [Bug locale/18927] New: Different strings should never collate as equal egmont at gmail dot com
  2015-09-07 12:17 ` [Bug locale/18927] " fweimer at redhat dot com
  2015-09-09  7:21 ` fweimer at redhat dot com
@ 2015-09-09  8:13 ` egmont at gmail dot com
  2015-09-09  8:15 ` egmont at gmail dot com
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: egmont at gmail dot com @ 2015-09-09  8:13 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=18927

--- Comment #2 from Egmont Koblinger <egmont at gmail dot com> ---
Sorry, I wasn't explicit on this. I meant to modify both strcoll() and
strxfrm(), in accordance with each other. That is, strcoll() never to return 0
on strings that differ, and strxfrm() never to produce the same output for
different inputs.
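
A rough sketch of the comparison side, assuming a byte-wise comparison as the
tie-breaker. The names deterministic_compare() and toy_compare() are made up
for illustration; toy_compare() only imitates the one tie discussed here, and a
real caller could pass Python's locale.strcoll as base_compare instead.

```python
def toy_compare(a: str, b: str) -> int:
    """Hypothetical stand-in for strcoll() that ties "ssz" with "szsz"."""
    key_a = a.replace("ssz", "szsz")
    key_b = b.replace("ssz", "szsz")
    return (key_a > key_b) - (key_a < key_b)


def deterministic_compare(a: str, b: str, base_compare=toy_compare) -> int:
    """Return the base result, falling back to byte order only on a tie."""
    result = base_compare(a, b)
    if result != 0:
        return result
    raw_a, raw_b = a.encode("utf-8"), b.encode("utf-8")
    return (raw_a > raw_b) - (raw_a < raw_b)
```

With this wrapper the result is 0 only when both inputs are byte-identical,
while every non-tied pair keeps the order the base comparison gave it.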

The issue was originally discovered with the "uniq" utility which omitted
certain lines it shouldn't have omitted. E.g. in Hungarian for the input lines
"ssz" and "szsz" it only produced one output line, causing quite a headache to
the guy who discovered it.

The manual of uniq explicitly states that it honours LC_COLLATE. On the other
hand, this utility shouldn't care about sorting, it should only care about
equality of strings. This implicitly suggests that uniq's authors assume that
different strings collating as equal is a valid case and they deliberately wish
to drop these variants from the output. Otherwise they could've just gone with
strcmp(). (By the way, I haven't checked its implementation and can't tell if it
uses strcoll() or strxfrm(), but it shouldn't matter.)

On the other hand, I believe this approach is error-prone, and guaranteeing
different collation for unequal strings would result in a more robust locale
system, with fewer unexpected user-facing behaviors.

Maybe we should loop in the coreutils folks to hear their opinion, although
"uniq" is probably not the only tool affected by this.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug locale/18927] Different strings should never collate as equal
  2015-09-06 22:21 [Bug locale/18927] New: Different strings should never collate as equal egmont at gmail dot com
                   ` (2 preceding siblings ...)
  2015-09-09  8:13 ` egmont at gmail dot com
@ 2015-09-09  8:15 ` egmont at gmail dot com
  2015-09-09  8:37 ` egmont at gmail dot com
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: egmont at gmail dot com @ 2015-09-09  8:15 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=18927

--- Comment #3 from Egmont Koblinger <egmont at gmail dot com> ---
<off>

Could someone please guide me what's the proper English usage? Collate as
equal; collate to equal; collate equally; something else? Thanks!

</off>



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug locale/18927] Different strings should never collate as equal
  2015-09-06 22:21 [Bug locale/18927] New: Different strings should never collate as equal egmont at gmail dot com
                   ` (3 preceding siblings ...)
  2015-09-09  8:15 ` egmont at gmail dot com
@ 2015-09-09  8:37 ` egmont at gmail dot com
  2015-09-09 10:21 ` joseph at codesourcery dot com
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: egmont at gmail dot com @ 2015-09-09  8:37 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=18927

--- Comment #4 from Egmont Koblinger <egmont at gmail dot com> ---
Going further with the example that strcoll("ssz", "szsz") == 0 (as it used to
be in hu_HU):

It's possible that "sort" prints multiple occurrences of these strings in mixed
order. E.g. it might legally output this:

szsz
ssz
szsz
ssz
ssz
szsz
(etc.)

depending on chance or on whether the sort is stable. In turn, when this is
piped to "uniq", it might make some sense to print the first entry only
(although it's still random whether that'll be "ssz" or "szsz"). Piping it to
"LC_ALL=C uniq" instead wouldn't make much more sense either: its output (even
the number of lines) would also be random.

Trying to think of the big picture rather than the details, in my opinion this
unexpected behavior should ideally be stopped at the very core, that is, in the
locale implementation of strcoll() and strxfrm(), so that "sort" cannot produce
the output shown above. Then there's no subsequent problem with "uniq"
whatsoever.

Otherwise, the only reliable way to print each line exactly once would be
"LC_ALL=C sort | LC_ALL=C uniq" (or "LC_ALL=C sort -u" for short), and if you
wanted to have the output sorted according to your current locale, you'd have
to issue "LC_ALL=C sort | LC_ALL=C uniq | sort" (or "LC_ALL=C sort -u | sort"
for short).

Currently if you issue the obvious locale-dependent "sort | uniq" or "sort -u",
you could never be sure whether the underlying locale's definition might cause
unexpected (i.e. faulty - as seen by the user) behavior.
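
To make the pipelines above concrete, here is a small model of them in Python.
It is an illustration only, not how coreutils implements sort or uniq:
model_sort() and model_uniq() are invented names, and toy_collation_key() is a
hypothetical key under which "ssz" and "szsz" tie.

```python
from itertools import groupby


def toy_collation_key(line: str) -> str:
    """Hypothetical collation key under which "ssz" and "szsz" tie."""
    return line.replace("ssz", "szsz")


def byte_key(line: str) -> bytes:
    """Key corresponding to plain byte order, as in the C locale."""
    return line.encode("utf-8")


def model_sort(lines, key):
    """Stable sort by key; tied lines keep their input order."""
    return sorted(lines, key=key)


def model_uniq(lines, key):
    """Keep the first line of every run of adjacent lines with equal keys."""
    return [next(run) for _, run in groupby(lines, key=key)]


lines = ["szsz", "ssz", "szsz", "ssz", "ssz", "szsz"]
locale_sorted = model_sort(lines, toy_collation_key)
print(model_uniq(locale_sorted, toy_collation_key))        # models "sort | uniq"
print(model_uniq(locale_sorted, byte_key))                 # models "sort | LC_ALL=C uniq"
print(model_uniq(model_sort(lines, byte_key), byte_key))   # models "LC_ALL=C sort | LC_ALL=C uniq"
```

In this model the first pipeline keeps one line whose identity depends on the
input order, the second keeps five lines, and only the third reliably keeps
exactly the two distinct lines.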


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug locale/18927] Different strings should never collate as equal
  2015-09-06 22:21 [Bug locale/18927] New: Different strings should never collate as equal egmont at gmail dot com
                   ` (4 preceding siblings ...)
  2015-09-09  8:37 ` egmont at gmail dot com
@ 2015-09-09 10:21 ` joseph at codesourcery dot com
  2015-09-09 11:10 ` egmont at gmail dot com
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: joseph at codesourcery dot com @ 2015-09-09 10:21 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=18927

--- Comment #5 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---
The POSIX issues are:

http://austingroupbugs.net/view.php?id=938
http://austingroupbugs.net/view.php?id=948
http://austingroupbugs.net/view.php?id=963

The stated direction is that implementation-provided locales without an 
'@' modifier should provide a total ordering of all strings without 
different strings collating as equal.  That indicates to me that for glibc 
this is primarily a data issue - we need to arrange for such a total 
ordering to be present in all installed locales, and if possible have a 
way to check this (e.g. a localedef option to give an error if the locale 
doesn't provide a total ordering, used by localedata/install-locales, or 
some other fairly foolproof way of making sure all glibc-provided locales 
have this property).  But the code would still allow locales to be defined 
that do not provide a total ordering - glibc just wouldn't ship with such 
locale definitions.
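
As a very rough illustration of what such a check could look for, the sketch
below groups a finite sample of strings by collation key and reports the
groups that contain more than one distinct string. This is not a localedef
feature and it proves nothing about strings outside the sample;
find_collation_ties() and toy_collation_key() are hypothetical names.

```python
from collections import defaultdict


def toy_collation_key(text: str) -> str:
    """Hypothetical collation key under which "ssz" and "szsz" tie."""
    return text.replace("ssz", "szsz")


def find_collation_ties(samples, key):
    """Return sorted groups of distinct sample strings that share one key."""
    groups = defaultdict(set)
    for text in samples:
        groups[key(text)].add(text)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)
```

For a real locale the key argument could be locale.strxfrm after a suitable
setlocale() call, assuming the locale in question is installed.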


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug locale/18927] Different strings should never collate as equal
  2015-09-06 22:21 [Bug locale/18927] New: Different strings should never collate as equal egmont at gmail dot com
                   ` (5 preceding siblings ...)
  2015-09-09 10:21 ` joseph at codesourcery dot com
@ 2015-09-09 11:10 ` egmont at gmail dot com
  2015-09-09 11:41 ` fweimer at redhat dot com
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: egmont at gmail dot com @ 2015-09-09 11:10 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=18927

--- Comment #6 from Egmont Koblinger <egmont at gmail dot com> ---
Joseph, thanks a lot for your research and the pointers!

Tests in "make tests" checking for the data is indeed another possible
approach.

It'd have the advantage that it points locale maintainers to actual ambiguous
cases and lets them explicitly decide on the ordering, rather than this
remaining unnoticed and caught silently by the safeguard code.

It'd have some disadvantages, on the other hand:
- Probably several current locales would fail this test; who would fix them?
- It would probably not catch bugs introduced by custom downstream patches to
locales (unless of course downstream packager runs "make tests").
- Sounds orders of magnitude harder for me to implement than my proposal;
actually I have no idea at this moment how to approach this problem. (E.g. for
hu_HU I _hope_ that I came up with a total ordering, but I'm not entirely
certain.) It sounds like a kind of automated mathematical proof for a rather
complex question.

That being said, I'd be fine with either approach.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug locale/18927] Different strings should never collate as equal
  2015-09-06 22:21 [Bug locale/18927] New: Different strings should never collate as equal egmont at gmail dot com
                   ` (6 preceding siblings ...)
  2015-09-09 11:10 ` egmont at gmail dot com
@ 2015-09-09 11:41 ` fweimer at redhat dot com
  2015-09-09 13:58 ` joseph at codesourcery dot com
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: fweimer at redhat dot com @ 2015-09-09 11:41 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=18927

--- Comment #7 from Florian Weimer <fweimer at redhat dot com> ---
(In reply to joseph@codesourcery.com from comment #5)
> The stated direction is that implementation-provided locales without an 
> '@' modifier should provide a total ordering of all strings without 
> different strings collating as equal.

I find it extremely surprising that strcoll is not supposed to perform some
form of normalization in UTF-8 and similar locales.  Is this really the
intent?

Is there a reason not to use the Unicode Collation Algorithm?

  <http://www.unicode.org/reports/tr10/>



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug locale/18927] Different strings should never collate as equal
  2015-09-06 22:21 [Bug locale/18927] New: Different strings should never collate as equal egmont at gmail dot com
                   ` (7 preceding siblings ...)
  2015-09-09 11:41 ` fweimer at redhat dot com
@ 2015-09-09 13:58 ` joseph at codesourcery dot com
  2015-09-09 15:28 ` egmont at gmail dot com
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: joseph at codesourcery dot com @ 2015-09-09 13:58 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=18927

--- Comment #8 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---
On Wed, 9 Sep 2015, fweimer at redhat dot com wrote:

> I find it extremely surprising that strcoll is not supposed to perform some
> form of normalization in UTF-8 and similar locales.  Is this really the
> intent?

The intent is that, to avoid various surprising effects discussed in those 
issues (and the previous discussions on the Austin Group mailing list), 
byte-distinct strings do not collate the same (although if they normalize 
the same, I'd expect them to collate together relative to all other 
strings - differences in normalization being of the lowest precedence in 
collation).
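
A small sketch of that ordering, using Python's unicodedata module. It sorts by
the NFD form in plain codepoint order, which is only a placeholder for a real
locale's collation, and uses the raw UTF-8 bytes as the lowest-precedence
tie-breaker; normalized_then_raw_key() is a name invented for this example.

```python
import unicodedata


def normalized_then_raw_key(text: str):
    """Sort by the NFD form first; raw bytes only separate equivalent strings."""
    return (unicodedata.normalize("NFD", text), text.encode("utf-8"))


composed = "\u00e9"      # e with acute accent as one codepoint
decomposed = "e\u0301"   # "e" followed by a combining acute accent
words = [decomposed, "f", composed, "e"]
ordered = sorted(words, key=normalized_then_raw_key)
```

The two spellings of the accented letter end up next to each other in ordered,
between "e" and "f", yet they never compare as equal because their bytes
differ.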

> Is there a reason not to use the Unicode Collation Algorithm?
> 
>   <http://www.unicode.org/reports/tr10/>

Well, our collation data is based on ISO 14651, which is meant to be 
equivalent, but updating it requires understanding just how the existing 
files relate to an old version of ISO 14651 and which local changes are or 
are not still relevant when updating to a newer version.  See bug 14095.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug locale/18927] Different strings should never collate as equal
  2015-09-06 22:21 [Bug locale/18927] New: Different strings should never collate as equal egmont at gmail dot com
                   ` (8 preceding siblings ...)
  2015-09-09 13:58 ` joseph at codesourcery dot com
@ 2015-09-09 15:28 ` egmont at gmail dot com
  2015-09-09 19:23 ` egmont at gmail dot com
  2023-05-31 16:57 ` carenas at gmail dot com
  11 siblings, 0 replies; 13+ messages in thread
From: egmont at gmail dot com @ 2015-09-09 15:28 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=18927

--- Comment #9 from Egmont Koblinger <egmont at gmail dot com> ---
(In reply to joseph@codesourcery.com from comment #8)
> On Wed, 9 Sep 2015, fweimer at redhat dot com wrote:

> The intent is that, to avoid various surprising effects discussed in those 
> issues (and the previous discussions on the Austin Group mailing list), 
> byte-distinct strings do not collate the same (although if they normalize 
> the same, I'd expect them to collate together relative to all other 
> strings - differences in normalization being of the lowest precedence in 
> collation).

I'd love to see it, this is what this bugreport is about :)

tr10's A.3.2 shows the wrappers that turn non-deterministic coll/xfrm methods
into deterministic ones - pretty much what I outlined here, although they
forget to mention that SEPARATOR needs to sort before any possible byte within
old_sort_key. The wrapper around strcoll() is more obvious.

The fact that they talk about these wrappers as possible external methods,
rather than having to be built into the collation implementation, makes me
uncertain whether my request is in line with the standard. (I have yet to read
the whole document.)

Btw, a side question: What happens, and what should happen, if the input to
strcoll() or strxfrm() is not valid UTF-8?


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug locale/18927] Different strings should never collate as equal
  2015-09-06 22:21 [Bug locale/18927] New: Different strings should never collate as equal egmont at gmail dot com
                   ` (9 preceding siblings ...)
  2015-09-09 15:28 ` egmont at gmail dot com
@ 2015-09-09 19:23 ` egmont at gmail dot com
  2023-05-31 16:57 ` carenas at gmail dot com
  11 siblings, 0 replies; 13+ messages in thread
From: egmont at gmail dot com @ 2015-09-09 19:23 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=18927

--- Comment #10 from Egmont Koblinger <egmont at gmail dot com> ---
The 0x01 byte, bytes of invalid UTF-8, and bytes of unrecognized Unicode
codepoints (e.g. U+AC00) all get converted to the exact same token; that is,
any two of "가" (U+AC00), "각" (U+AC01), "\x01\x01\x01" (^A^A^A),
"\x80\x80\x80" (invalid), "\xd0\xfe\xff" (invalid) etc. collate the same.


From: egmont at gmail dot com @ 2015-09-09 19:46 UTC
  To: glibc-bugs
Subject: [Bug locale/18927] Different strings should never collate as equal

https://sourceware.org/bugzilla/show_bug.cgi?id=18927

--- Comment #11 from Egmont Koblinger <egmont at gmail dot com> ---
(In reply to joseph@codesourcery.com from comment #8)

> [...] (although if they normalize
> the same, I'd expect them to collate together relative to all other
> strings - differences in normalization being of the lowest precedence in
> collation).

All the currently available UTF-8 unit tests fail if converted to NFD. It would
be nice to have what you described, but apparently there's no generic solution
for that, and addressing them individually in locale definitions is probably a
no-go. Filed low-priority bug 18943 for that.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug locale/18927] Different strings should never collate as equal
  2015-09-06 22:21 [Bug locale/18927] New: Different strings should never collate as equal egmont at gmail dot com
                   ` (10 preceding siblings ...)
  2015-09-09 19:23 ` egmont at gmail dot com
@ 2023-05-31 16:57 ` carenas at gmail dot com
  11 siblings, 0 replies; 13+ messages in thread
From: carenas at gmail dot com @ 2023-05-31 16:57 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=18927

Carlo Marcelo Arenas Belón <carenas at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |carenas at gmail dot com

--- Comment #17 from Carlo Marcelo Arenas Belón <carenas at gmail dot com> ---
Could this be fixed as of 2.37? I'm not sure exactly which release, but Debian 12
with 2.36 also seems to be unaffected.

at least, I don't seem to find different characters/strings collating equally
from the samples described.


^ permalink raw reply	[flat|nested] 13+ messages in thread
