public inbox for glibc-bugs@sourceware.org
* [Bug localedata/13547] New: Different strings collate as equal in Hungarian
@ 2012-01-03 0:18 egmont at gmail dot com
2012-01-03 0:28 ` [Bug localedata/13547] " egmont at gmail dot com
` (3 more replies)
0 siblings, 4 replies; 5+ messages in thread
From: egmont at gmail dot com @ 2012-01-03 0:18 UTC (permalink / raw)
To: glibc-bugs
http://sourceware.org/bugzilla/show_bug.cgi?id=13547
Bug #: 13547
Summary: Different strings collate as equal in Hungarian
Product: glibc
Version: 2.14
Status: NEW
Severity: normal
Priority: P2
Component: localedata
AssignedTo: libc-locales@sources.redhat.com
ReportedBy: egmont@gmail.com
Classification: Unclassified
Created attachment 6139
--> http://sourceware.org/bugzilla/attachment.cgi?id=6139
collate fix for Hungarian
Please apply the attached patch to the Hungarian locale definition.
With the current definition, certain distinct strings collate as equal, e.g.
strcoll("ccs", "cscs") returns zero. This confuses programs such as sort (the
order is undefined and might vary from run to run) and uniq (different lines
are reported as equal).
The patch addresses this problem and makes such strings collate as different,
without modifying the actual sorting order of valid Hungarian words.
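The reported behaviour can be probed from Python, whose locale.strcoll delegates to the C library's collation. This is a minimal sketch and not part of the original report: the helper name collates_equal is hypothetical, the locale name hu_HU.UTF-8 is an assumption about how the Hungarian locale is installed, and the helper returns None when that locale is unavailable.

```python
import locale


def collates_equal(a, b, locale_name="hu_HU.UTF-8"):
    """Return True if a and b collate as equal under locale_name.

    Returns None when the locale is not installed on this system.
    The locale name is an assumption; adjust it to the local setup.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return locale.strcoll(a, b) == 0
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore the old setting


if __name__ == "__main__":
    print(collates_equal("ccs", "cscs"))
```

On a system affected by this bug the call would be expected to print True; with a locale definition that distinguishes the two strings it should print False.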
The problem in more detail:
Hungarian has compound letters, much like "sh" in English; "cs" is one example.
Whenever such a letter is pronounced long, it is written in the shorthand form
"ccs" (only the first letter is doubled) rather than "cscs".
Currently "ccs" is tokenized as <cs><cs>, which is correct, but "cscs" (not
used in valid Hungarian words, though it might occur in text files anyway) is
also tokenized as <cs><cs>, hence the two collate as equal.
The solution is to tokenize "ccs" as <c_or_cs><cs>, and to order the tokens as
<a> <b> <c> <c_or_cs> <cs> <d> ...
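The effect of the two tokenizations can be illustrated with a toy model. This is only a sketch under stated assumptions: the greedy longest-match tokenizer, the tables OLD and NEW, and the names tokenize and collation_key are hypothetical and do not reproduce glibc's collation code; only the token names and their order come from the description above.

```python
# Token order taken from the description: <a> <b> <c> <c_or_cs> <cs> <d>.
ORDER = ["a", "b", "c", "c_or_cs", "cs", "d"]

# Hypothetical tables mapping character sequences to collation tokens.
OLD = {"ccs": ["cs", "cs"], "cs": ["cs"], "c": ["c"]}
NEW = {"ccs": ["c_or_cs", "cs"], "cs": ["cs"], "c": ["c"]}


def tokenize(word, elements):
    """Split word into tokens, always preferring the longest known sequence."""
    tokens, i = [], 0
    while i < len(word):
        for seq in sorted(elements, key=len, reverse=True):
            if word.startswith(seq, i):
                tokens.extend(elements[seq])
                i += len(seq)
                break
        else:
            tokens.append(word[i])  # unlisted character is its own token
            i += 1
    return tokens


def collation_key(word, elements):
    """Map each token to its rank; only tokens listed in ORDER are supported."""
    return [ORDER.index(t) for t in tokenize(word, elements)]


if __name__ == "__main__":
    for name, table in (("old", OLD), ("new", NEW)):
        print(name, collation_key("ccs", table), collation_key("cscs", table))
```

In this model the old table gives both strings the same key, while the new table gives "ccs" a smaller key than "cscs".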
The problem was originally discovered at http://hup.hu/node/110267 (forum in
Hungarian).
--
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
* [Bug localedata/13547] Different strings collate as equal in Hungarian
2012-01-03 0:18 [Bug localedata/13547] New: Different strings collate as equal in Hungarian egmont at gmail dot com
@ 2012-01-03 0:28 ` egmont at gmail dot com
2012-01-07 16:05 ` drepper.fsp at gmail dot com
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: egmont at gmail dot com @ 2012-01-03 0:28 UTC (permalink / raw)
To: glibc-bugs
http://sourceware.org/bugzilla/show_bug.cgi?id=13547
Egmont Koblinger <egmont at gmail dot com> changed:
            What|Removed                     |Added
----------------------------------------------------------------------------
Attachment #6139|0                           |1
     is obsolete|                            |
--- Comment #1 from Egmont Koblinger <egmont at gmail dot com> 2012-01-03 00:28:36 UTC ---
Created attachment 6140
--> http://sourceware.org/bugzilla/attachment.cgi?id=6140
collate fix for Hungarian
* [Bug localedata/13547] Different strings collate as equal in Hungarian
2012-01-03 0:18 [Bug localedata/13547] New: Different strings collate as equal in Hungarian egmont at gmail dot com
2012-01-03 0:28 ` [Bug localedata/13547] " egmont at gmail dot com
@ 2012-01-07 16:05 ` drepper.fsp at gmail dot com
2014-06-27 11:20 ` fweimer at redhat dot com
2015-09-08 8:35 ` egmont at gmail dot com
3 siblings, 0 replies; 5+ messages in thread
From: drepper.fsp at gmail dot com @ 2012-01-07 16:05 UTC (permalink / raw)
To: glibc-bugs
http://sourceware.org/bugzilla/show_bug.cgi?id=13547
Ulrich Drepper <drepper.fsp at gmail dot com> changed:
      What|Removed                     |Added
----------------------------------------------------------------------------
    Status|NEW                         |RESOLVED
        CC|                            |drepper.fsp at gmail dot com
Resolution|                            |FIXED
--- Comment #2 from Ulrich Drepper <drepper.fsp at gmail dot com> 2012-01-07 16:05:07 UTC ---
I added the patch.
* [Bug localedata/13547] Different strings collate as equal in Hungarian
2012-01-03 0:18 [Bug localedata/13547] New: Different strings collate as equal in Hungarian egmont at gmail dot com
2012-01-03 0:28 ` [Bug localedata/13547] " egmont at gmail dot com
2012-01-07 16:05 ` drepper.fsp at gmail dot com
@ 2014-06-27 11:20 ` fweimer at redhat dot com
2015-09-08 8:35 ` egmont at gmail dot com
3 siblings, 0 replies; 5+ messages in thread
From: fweimer at redhat dot com @ 2014-06-27 11:20 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=13547
Florian Weimer <fweimer at redhat dot com> changed:
 What|Removed                     |Added
----------------------------------------------------------------------------
Flags|                            |security-
* [Bug localedata/13547] Different strings collate as equal in Hungarian
2012-01-03 0:18 [Bug localedata/13547] New: Different strings collate as equal in Hungarian egmont at gmail dot com
` (2 preceding siblings ...)
2014-06-27 11:20 ` fweimer at redhat dot com
@ 2015-09-08 8:35 ` egmont at gmail dot com
3 siblings, 0 replies; 5+ messages in thread
From: egmont at gmail dot com @ 2015-09-08 8:35 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=13547
--- Comment #3 from Egmont Koblinger <egmont at gmail dot com> ---
Please note that the patch applied here was incorrect. It fixed a corner case
but broke a more general one.
By tokenizing "ssz" as <s_or_sz><sz> rather than <sz><sz>, and ordering the
tokens as <s> < <s_or_sz> < <sz>, the corner case where the only difference
between the two words is "ssz" vs. "szsz" is fixed.
However, the sorting of e.g. "kasza" <k><a><sz><a> vs. "kassza"
<k><a><s_or_sz><sz><a> is now broken. The correct ordering would be "kasza" <
"kassza" (since the latter is actually <k><a><sz><sz><a>), but with the current
solution they're ordered backwards (due to <s_or_sz> preceding <sz>).
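The regression can be reproduced by comparing the token sequences quoted above as lists of ranks. This is a toy sketch rather than glibc's implementation: the token order is reduced to the tokens that occur in the example, and the names RANK and key are hypothetical.

```python
# Reduced token order for this example: <a> < <k> < <s> < <s_or_sz> < <sz>.
ORDER = ["a", "k", "s", "s_or_sz", "sz"]
RANK = {token: i for i, token in enumerate(ORDER)}


def key(tokens):
    """Turn a token sequence into a list of ranks for comparison."""
    return [RANK[t] for t in tokens]


kasza = ["k", "a", "sz", "a"]
kassza_patched = ["k", "a", "s_or_sz", "sz", "a"]  # tokenization after the patch
kassza_original = ["k", "a", "sz", "sz", "a"]      # tokenization before the patch

if __name__ == "__main__":
    # Before the patch: "kasza" sorts first, which is the correct order.
    print(key(kasza) < key(kassza_original))
    # After the patch: "kassza" sorts first, which is backwards.
    print(key(kassza_patched) < key(kasza))
```

Both comparisons evaluate to True: the sequences differ at the third token, where <s_or_sz> ranks below <sz>.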
The solution is to tokenize both "ssz" and "szsz" as <sz><sz> (as we did
before), but to apply something weaker on top of them, along the lines of a
"fake accent" (SINGLE-OR-COMPOUND vs. COMPOUND), that can distinguish them at a
later stage of the comparison.
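One way to picture the "fake accent" idea is a two-level key, where the token sequence is compared first and the marks are consulted only on a tie. This is a sketch under assumptions, not the fix from bug 18934: the numeric values of the marks, their relative order, and the names sort_key and plain are all hypothetical.

```python
PRIMARY = {"a": 0, "k": 1, "s": 2, "sz": 3}

# Assumed second-level marks; their relative order is a guess for illustration.
PLAIN, SINGLE_OR_COMPOUND, COMPOUND = 0, 1, 2


def plain(*names):
    """Helper: simple letters carry the neutral mark."""
    return [(name, PLAIN) for name in names]


def sort_key(tokens):
    """Compare all primary ranks first, then the marks as a tie-breaker."""
    primary = [PRIMARY[name] for name, _ in tokens]
    secondary = [mark for _, mark in tokens]
    return (primary, secondary)


ssz = [("sz", SINGLE_OR_COMPOUND), ("sz", COMPOUND)]
szsz = [("sz", COMPOUND), ("sz", COMPOUND)]
kasza = plain("k", "a") + [("sz", COMPOUND)] + plain("a")
kassza = plain("k", "a") + ssz + plain("a")

if __name__ == "__main__":
    print(sort_key(ssz)[0] == sort_key(szsz)[0])  # same primary tokens
    print(sort_key(ssz) != sort_key(szsz))        # still distinguishable
    print(sort_key(kasza) < sort_key(kassza))     # word order preserved
```

Because the marks are only consulted when the primary sequences are identical, "kasza" vs. "kassza" is decided by the tokens alone, while "ssz" and "szsz" no longer compare as equal.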
Let's leave this bug closed. A fix is available in bug 18934.