public inbox for glibc-bugs@sourceware.org
* [Bug localedata/13547] New: Different strings collate as equal in Hungarian
@ 2012-01-03  0:18 egmont at gmail dot com
  2012-01-03  0:28 ` [Bug localedata/13547] " egmont at gmail dot com
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: egmont at gmail dot com @ 2012-01-03  0:18 UTC (permalink / raw)
  To: glibc-bugs

http://sourceware.org/bugzilla/show_bug.cgi?id=13547

             Bug #: 13547
           Summary: Different strings collate as equal in Hungarian
           Product: glibc
           Version: 2.14
            Status: NEW
          Severity: normal
          Priority: P2
         Component: localedata
        AssignedTo: libc-locales@sources.redhat.com
        ReportedBy: egmont@gmail.com
    Classification: Unclassified


Created attachment 6139
  --> http://sourceware.org/bugzilla/attachment.cgi?id=6139
collate fix for Hungarian

Please apply the attached patch to the Hungarian locale definition.

With the current definition, certain distinct strings collate as equal; e.g.
strcoll("ccs", "cscs") returns zero. This confuses programs such as sort (the
relative order of such lines is undefined and may vary from run to run) and
uniq (distinct lines are reported as equal).

The given patch addresses this problem and makes them collate as different,
without modifying the actual sorting order of valid Hungarian words.

The problem in more detail:

Hungarian has compound letters such as "cs", much like "sh" in English.
Whenever such a letter is pronounced long, we write it using the shorthand
notation "ccs" (only the first character is doubled) rather than "cscs".

Currently "ccs" is tokenized as <cs><cs>, which is correct, but "cscs" (not
used in valid Hungarian words, though it may still occur in text files) is also
tokenized as <cs><cs>, hence the two collate as equal.
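
The collision described above can be reproduced with a small toy model. This is
only a sketch: it is not glibc's collation code, it models just the "cs"
compound letter on a plain a-z alphabet, and the names tokenize_old and
sort_key_old are made up for illustration.

```python
# Toy model of the old tokenization; NOT glibc's real collation algorithm.
# Assumption: only the "cs" compound letter is modelled, on a plain a-z alphabet.
ALPHABET = list("abc") + ["cs"] + list("defghijklmnopqrstuvwxyz")
RANK = {token: i for i, token in enumerate(ALPHABET)}


def tokenize_old(word):
    """Split a word into collation tokens the way the report describes."""
    tokens, i = [], 0
    while i < len(word):
        if word.startswith("ccs", i):    # shorthand for a long "cs"
            tokens += ["cs", "cs"]
            i += 3
        elif word.startswith("cs", i):   # plain compound letter
            tokens.append("cs")
            i += 2
        else:                            # ordinary single letter
            tokens.append(word[i])
            i += 1
    return tokens


def sort_key_old(word):
    """Map tokens to ranks; equal keys model 'collates as equal'."""
    return [RANK[t] for t in tokenize_old(word)]


print(tokenize_old("ccs"), tokenize_old("cscs"))
print(sort_key_old("ccs") == sort_key_old("cscs"))
```

Both words produce the token list ['cs', 'cs'], so their keys compare equal,
which is the toy equivalent of strcoll() returning zero.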

The solution is to tokenize "ccs" as <c_or_cs><cs>, and to order the tokens as
<a> <b> <c> <c_or_cs> <cs> <d> ...
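
As a rough sketch of this proposal (again a toy model, not the actual locale
definition; the function names are hypothetical and only "cs" is modelled), the
extra <c_or_cs> token gives the two spellings different keys:

```python
# Toy model of the proposed tokenization; token names follow the report.
# Assumption: plain a-z alphabet with <c_or_cs> and <cs> inserted after <c>.
ALPHABET = list("abc") + ["c_or_cs", "cs"] + list("defghijklmnopqrstuvwxyz")
RANK = {token: i for i, token in enumerate(ALPHABET)}


def tokenize_patched(word):
    """Tokenize "ccs" as <c_or_cs><cs> and "cs" as <cs>."""
    tokens, i = [], 0
    while i < len(word):
        if word.startswith("ccs", i):    # shorthand for a long "cs"
            tokens += ["c_or_cs", "cs"]
            i += 3
        elif word.startswith("cs", i):   # plain compound letter
            tokens.append("cs")
            i += 2
        else:                            # ordinary single letter
            tokens.append(word[i])
            i += 1
    return tokens


def sort_key_patched(word):
    """Map tokens to ranks in the reordered alphabet."""
    return [RANK[t] for t in tokenize_patched(word)]


print(tokenize_patched("ccs"), tokenize_patched("cscs"))
print(sort_key_patched("ccs") != sort_key_patched("cscs"))
```

In this model "ccs" sorts before "cscs" because <c_or_cs> precedes <cs>.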

The problem was originally discovered at http://hup.hu/node/110267 (forum in
Hungarian).

-- 
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.



* [Bug localedata/13547] Different strings collate as equal in Hungarian
@ 2012-01-03  0:28 ` egmont at gmail dot com
From: egmont at gmail dot com @ 2012-01-03  0:28 UTC (permalink / raw)
  To: glibc-bugs

http://sourceware.org/bugzilla/show_bug.cgi?id=13547

Egmont Koblinger <egmont at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #6139|0                           |1
        is obsolete|                            |

--- Comment #1 from Egmont Koblinger <egmont at gmail dot com> 2012-01-03 00:28:36 UTC ---
Created attachment 6140
  --> http://sourceware.org/bugzilla/attachment.cgi?id=6140
collate fix for Hungarian


* [Bug localedata/13547] Different strings collate as equal in Hungarian
@ 2012-01-07 16:05 ` drepper.fsp at gmail dot com
From: drepper.fsp at gmail dot com @ 2012-01-07 16:05 UTC (permalink / raw)
  To: glibc-bugs

http://sourceware.org/bugzilla/show_bug.cgi?id=13547

Ulrich Drepper <drepper.fsp at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |drepper.fsp at gmail dot
                   |                            |com
         Resolution|                            |FIXED

--- Comment #2 from Ulrich Drepper <drepper.fsp at gmail dot com> 2012-01-07 16:05:07 UTC ---
I added the patch.


* [Bug localedata/13547] Different strings collate as equal in Hungarian
@ 2014-06-27 11:20 ` fweimer at redhat dot com
From: fweimer at redhat dot com @ 2014-06-27 11:20 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=13547

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|                            |security-


* [Bug localedata/13547] Different strings collate as equal in Hungarian
@ 2015-09-08  8:35 ` egmont at gmail dot com
From: egmont at gmail dot com @ 2015-09-08  8:35 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=13547

--- Comment #3 from Egmont Koblinger <egmont at gmail dot com> ---
Please note that the patch applied here was incorrect. It fixed a corner case
but broke a more general one.

By tokenizing "ssz" as <s_or_sz><sz> rather than <sz><sz>, and ordering the
tokens as <s> < <s_or_sz> < <sz>, the patch fixed the corner case where the only
difference between two words is "ssz" vs. "szsz".

However, it broke the sorting of e.g. "kasza" <k><a><sz><a> vs. "kassza"
<k><a><s_or_sz><sz><a>. The correct order is "kasza" < "kassza" (since the
latter is really <k><a><sz><sz><a>), but with the current solution they are
ordered the other way round (because <s_or_sz> precedes <sz>).
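
The regression can be illustrated with a toy model that compares the two
tokenizations side by side. This is a sketch under simplifying assumptions
(only "sz" is modelled, on a plain a-z alphabet; the function names and the
long_first parameter are hypothetical), not glibc's implementation.

```python
# Toy model comparing the tokenization before and after the patch.
# Assumption: plain a-z alphabet with <s_or_sz> and <sz> inserted after <s>.
ALPHABET = list("abcdefghijklmnopqrs") + ["s_or_sz", "sz"] + list("tuvwxyz")
RANK = {token: i for i, token in enumerate(ALPHABET)}


def tokenize(word, long_first):
    """Tokenize a word; "ssz" becomes [long_first, "sz"]."""
    tokens, i = [], 0
    while i < len(word):
        if word.startswith("ssz", i):    # shorthand for a long "sz"
            tokens += [long_first, "sz"]
            i += 3
        elif word.startswith("sz", i):   # plain compound letter
            tokens.append("sz")
            i += 2
        else:                            # ordinary single letter
            tokens.append(word[i])
            i += 1
    return tokens


def sort_key(word, long_first):
    """Map tokens to ranks for the chosen tokenization."""
    return [RANK[t] for t in tokenize(word, long_first)]


words = ["kassza", "kasza"]
# Before the patch: "ssz" -> <sz><sz>
before = sorted(words, key=lambda w: sort_key(w, "sz"))
# After the patch: "ssz" -> <s_or_sz><sz>
after = sorted(words, key=lambda w: sort_key(w, "s_or_sz"))
print(before)
print(after)
```

In this model the old tokenization yields "kasza" before "kassza", while the
patched one reverses them, because the comparison is decided at the third
token, where <s_or_sz> ranks below <sz>.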

The solution is to tokenize both "ssz" and "szsz" as <sz><sz> (as we did
before), and to apply something weaker on top of them, along the lines of a
"fake accent" (SINGLE-OR-COMPOUND vs. COMPOUND), that can distinguish the two
at a later comparison level.
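
One way to picture the "fake accent" idea is a two-level key: the primary level
holds the letter tokens, and a secondary level is consulted only when the
primary levels are equal. The sketch below is a toy model, not the fix from bug
18934; the numeric weights, their relative order, and the name collation_key
are assumptions made for illustration.

```python
# Toy two-level key: primary = letter tokens, secondary = "fake accent".
# Assumption: plain a-z alphabet with <sz> inserted after <s>.
ALPHABET = list("abcdefghijklmnopqrs") + ["sz"] + list("tuvwxyz")
RANK = {token: i for i, token in enumerate(ALPHABET)}

# Hypothetical secondary weights; their relative order is an assumption.
PLAIN, SINGLE_OR_COMPOUND, COMPOUND = 0, 1, 2


def collation_key(word):
    """Return (primary, secondary); tuples compare primary first."""
    primary, secondary = [], []
    i = 0
    while i < len(word):
        if word.startswith("ssz", i):    # long "sz": still <sz><sz>
            primary += [RANK["sz"], RANK["sz"]]
            secondary += [SINGLE_OR_COMPOUND, COMPOUND]
            i += 3
        elif word.startswith("sz", i):   # plain compound letter
            primary.append(RANK["sz"])
            secondary.append(COMPOUND)
            i += 2
        else:                            # ordinary single letter
            primary.append(RANK[word[i]])
            secondary.append(PLAIN)
            i += 1
    return (primary, secondary)


print(sorted(["kassza", "kasza"], key=collation_key))
print(collation_key("ssz") != collation_key("szsz"))
```

Here "kasza" sorts before "kassza" on the primary level alone, while "ssz" and
"szsz" share a primary key and differ only in the secondary weights.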

Let's leave this bug closed. A fix is available in bug 18934.

