public inbox for glibc-bugs@sourceware.org
* [Bug localedata/13547] New: Different strings collate as equal in Hungarian
@ 2012-01-03  0:18 egmont at gmail dot com
From: egmont at gmail dot com @ 2012-01-03  0:18 UTC
  To: glibc-bugs

http://sourceware.org/bugzilla/show_bug.cgi?id=13547

             Bug #: 13547
           Summary: Different strings collate as equal in Hungarian
           Product: glibc
           Version: 2.14
            Status: NEW
          Severity: normal
          Priority: P2
         Component: localedata
        AssignedTo: libc-locales@sources.redhat.com
        ReportedBy: egmont@gmail.com
    Classification: Unclassified


Created attachment 6139
  --> http://sourceware.org/bugzilla/attachment.cgi?id=6139
collate fix for Hungarian

Please apply the attached patch to the Hungarian locale definition.

With the current definition, certain distinct strings collate as equal; for
example, strcoll("ccs", "cscs") returns zero. This confuses programs such as
sort (the relative order of such lines is unspecified and may vary from run to
run) and uniq (different lines are reported as equal).
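A quick way to check this on a given system is sketched below. It is only an
illustration: collates_equal is a hypothetical helper name, the locale name
hu_HU.UTF-8 is an assumption about what is installed, and the result depends
on the locale data shipped with the local glibc. Python's locale.strcoll
calls the C library's strcoll.

```python
import locale


def collates_equal(a, b, locale_name="hu_HU.UTF-8"):
    """Return True if strcoll(a, b) == 0 under locale_name.

    Returns None when the locale is not available on this system.
    The helper name and the default locale name are assumptions.
    """
    # Calling setlocale without a new value only queries the current setting.
    saved = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return locale.strcoll(a, b) == 0
    finally:
        # Restore the previous collation locale.
        locale.setlocale(locale.LC_COLLATE, saved)


if __name__ == "__main__":
    # True would reproduce the report; None means the locale is missing.
    print(collates_equal("ccs", "cscs"))
```

In the "C" locale the same call compares the strings byte by byte, so the two
strings are reported as different there.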

The attached patch addresses this problem by making such strings collate as
different, without changing the sort order of valid Hungarian words.

The problem in more detail:

Hungarian has compound letters, similar to "sh" in English; "cs" is one
example. Whenever such a letter is pronounced long, we write it using a
shorthand "ccs" notation (only the first character is doubled), rather than
"cscs".

Currently "ccs" is tokenized as <cs><cs>, which is correct, but "cscs" (not
used in valid Hungarian words, but it might occur in text files anyway) is also
tokenized as <cs><cs>, hence the two collate as equal.

The solution is to tokenize "ccs" as <c_or_cs><cs>, and reorder the tokens like
<a> <b> <c> <c_or_cs> <cs> <d> ...
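The tokenization described above can be sketched as a toy model. This is not
glibc's collation algorithm and not the attached patch; the rule tables, the
function names and the single-level ranking are assumptions made for
illustration, and only the token names and their order come from this report.

```python
# Toy model only: single-level comparison over a tiny alphabet.
# Current behaviour: "ccs" and "cs" both expand to <cs> tokens.
OLD_RULES = {"ccs": ["<cs>", "<cs>"], "cs": ["<cs>"]}
# Proposed behaviour: "ccs" gets a distinct first token.
NEW_RULES = {"ccs": ["<c_or_cs>", "<cs>"], "cs": ["<cs>"]}

# Token order quoted in the report (the alphabet is cut off after <d>).
ORDER = ["<a>", "<b>", "<c>", "<c_or_cs>", "<cs>", "<d>"]
RANK = {token: index for index, token in enumerate(ORDER)}


def tokenize(text, rules):
    """Split text into collation tokens, trying the longest rule first."""
    tokens = []
    i = 0
    while i < len(text):
        for seq in sorted(rules, key=len, reverse=True):
            if text.startswith(seq, i):
                tokens.extend(rules[seq])
                i += len(seq)
                break
        else:
            # No multi-character rule matched: one token per character.
            tokens.append("<" + text[i] + ">")
            i += 1
    return tokens


def sort_key(text, rules):
    """Map text to a list of ranks; only tokens listed in ORDER are supported."""
    return [RANK[token] for token in tokenize(text, rules)]


if __name__ == "__main__":
    for name, rules in (("old", OLD_RULES), ("new", NEW_RULES)):
        print(name, tokenize("ccs", rules), tokenize("cscs", rules))
```

Under OLD_RULES both strings produce the same token sequence, and under
NEW_RULES they differ in the first token. The model compares on one level
only, so it says nothing about how the real locale definition weights
<c_or_cs> relative to other words.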

The problem was originally discovered at http://hup.hu/node/110267 (forum in
Hungarian).

Thread overview: 5+ messages
2012-01-03  0:18 [Bug localedata/13547] New: Different strings collate as equal in Hungarian egmont at gmail dot com
2012-01-03  0:28 ` [Bug localedata/13547] " egmont at gmail dot com
2012-01-07 16:05 ` drepper.fsp at gmail dot com
2014-06-27 11:20 ` fweimer at redhat dot com
2015-09-08  8:35 ` egmont at gmail dot com