public inbox for libc-locales@sourceware.org
* [Bug localedata/13547] Different strings collate as equal in Hungarian
From: egmont at gmail dot com @ 2012-01-03  1:11 UTC (permalink / raw)
  To: libc-locales

http://sourceware.org/bugzilla/show_bug.cgi?id=13547

Egmont Koblinger <egmont at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #6139|0                           |1
        is obsolete|                            |

--- Comment #1 from Egmont Koblinger <egmont at gmail dot com> 2012-01-03 00:28:36 UTC ---
Created attachment 6140
  --> http://sourceware.org/bugzilla/attachment.cgi?id=6140
collate fix for Hungarian

-- 
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.


* [Bug localedata/13547] New: Different strings collate as equal in Hungarian
From: egmont at gmail dot com @ 2012-01-03  1:11 UTC (permalink / raw)
  To: libc-locales

http://sourceware.org/bugzilla/show_bug.cgi?id=13547

             Bug #: 13547
           Summary: Different strings collate as equal in Hungarian
           Product: glibc
           Version: 2.14
            Status: NEW
          Severity: normal
          Priority: P2
         Component: localedata
        AssignedTo: libc-locales@sources.redhat.com
        ReportedBy: egmont@gmail.com
    Classification: Unclassified


Created attachment 6139
  --> http://sourceware.org/bugzilla/attachment.cgi?id=6139
collate fix for Hungarian

Please apply the attached patch to the Hungarian locale definition.

Using the current definition, certain strings collate as equal, e.g.
strcoll("ccs", "cscs") returns zero. This causes confusion with programs such
as sort (the order is undefined, might vary from run to run), or uniq
(different lines being reported as equal).
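The symptom can be checked with a few lines of script. This is a minimal sketch in Python (whose locale.strcoll wraps the C function); the helper name collates_equal is hypothetical, and the answer for hu_HU.UTF-8 depends on whether that locale is installed and on the glibc version, so no result is claimed for it here.

```python
import locale

def collates_equal(a, b, locale_name):
    """Return True if strcoll(a, b) == 0 under locale_name,
    False if the strings differ, or None if the locale is unavailable."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None  # locale not installed on this system
    try:
        return locale.strcoll(a, b) == 0
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore previous setting

# With an affected Hungarian locale this is expected to report equality;
# in the "C" locale the comparison is a plain byte comparison.
print("hu_HU.UTF-8:", collates_equal("ccs", "cscs", "hu_HU.UTF-8"))
print("C:", collates_equal("ccs", "cscs", "C"))
```

A None result only means the locale could not be selected, not that the bug is absent.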

The given patch addresses this problem and makes them collate as different,
without modifying the actual sorting order of valid Hungarian words.

The problem in more detail:

Hungarian has compound letters (digraphs) such as "cs", comparable to "sh" in
English. When such a letter is pronounced long, it is written in a shorthand
"ccs" notation (only the first letter is doubled) rather than as "cscs".

Currently "ccs" is tokenized as <cs><cs>, which is correct, but "cscs" (not
used in valid Hungarian words, but it might occur in text files anyway) is also
tokenized as <cs><cs>, hence they collate as equal.

The solution is to tokenize "ccs" as <c_or_cs><cs>, and reorder the tokens like
<a> <b> <c> <c_or_cs> <cs> <d> ...
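The tokenization described above can be illustrated with a toy model. This is only a sketch in Python, not the glibc locale definition: the token names come from the report, while the function names, the `patched` switch and the numeric weights are assumptions made for illustration, and only a handful of letters are covered.

```python
# Toy collation order; the weights are invented, only the relative
# order <c> < <c_or_cs> < <cs> < <d> follows the report.
ORDER = ["a", "b", "c", "c_or_cs", "cs", "d", "s"]
WEIGHT = {tok: i for i, tok in enumerate(ORDER)}

def tokenize(word, patched):
    """Split a word into collation tokens, longest match first."""
    tokens, i = [], 0
    while i < len(word):
        if word.startswith("ccs", i):
            # Shorthand for a long "cs": old rule gives <cs><cs>,
            # the proposed rule gives <c_or_cs><cs>.
            tokens += ["c_or_cs" if patched else "cs", "cs"]
            i += 3
        elif word.startswith("cs", i):
            tokens.append("cs")
            i += 2
        else:
            tokens.append(word[i])
            i += 1
    return tokens

def key(word, patched):
    """Sort key: the list of token weights."""
    return [WEIGHT[t] for t in tokenize(word, patched)]

# Old rule: identical keys, so the two strings compare equal.
print(key("ccs", patched=False) == key("cscs", patched=False))
# Proposed rule: the keys differ, so the strings are distinguishable.
print(key("ccs", patched=True) != key("cscs", patched=True))
```

Under these assumptions both comparisons evaluate to true: the old rule cannot tell the strings apart, the proposed one can.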

The problem was originally discovered at http://hup.hu/node/110267 (forum in
Hungarian).


* [Bug localedata/13547] Different strings collate as equal in Hungarian
From: drepper.fsp at gmail dot com @ 2012-01-07 17:08 UTC (permalink / raw)
  To: libc-locales

http://sourceware.org/bugzilla/show_bug.cgi?id=13547

Ulrich Drepper <drepper.fsp at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |drepper.fsp at gmail dot
                   |                            |com
         Resolution|                            |FIXED

--- Comment #2 from Ulrich Drepper <drepper.fsp at gmail dot com> 2012-01-07 16:05:07 UTC ---
I added the patch.


* [Bug localedata/13547] Different strings collate as equal in Hungarian
From: fweimer at redhat dot com @ 2014-06-27 11:21 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=13547

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|                            |security-


* [Bug localedata/13547] Different strings collate as equal in Hungarian
From: egmont at gmail dot com @ 2015-09-08  8:39 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=13547

--- Comment #3 from Egmont Koblinger <egmont at gmail dot com> ---
Please note that the patch applied here was incorrect. It fixed a corner case
but broke a more general one.

By tokenizing "ssz" as <s_or_sz><sz> rather than <sz><sz>, and ordering the
tokens as <s> < <s_or_sz> < <sz>, the corner case when the only difference in
the two words is "ssz" vs. "szsz" is fixed.

However, sorting of e.g. "kasza" <k><a><sz><a> vs. "kassza"
<k><a><s_or_sz><sz><a> became broken. The correct ordering would be "kasza" <
"kassza" (since it's actually <k><a><sz><sz><a>), but with the current solution
they're ordered backwards (due to <s_or_sz> preceding <sz>).
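The regression can be reproduced in a toy model of the single-level scheme. This is a sketch in Python under stated assumptions: the token names are from the comment above, but the weights and function names are invented and only the letters needed for the two example words are covered.

```python
# Toy order following the comment: <s> < <s_or_sz> < <sz>.
ORDER = ["a", "k", "s", "s_or_sz", "sz"]
WEIGHT = {tok: i for i, tok in enumerate(ORDER)}

def tokenize(word):
    """Tokenize with the rule from the first patch: "ssz" -> <s_or_sz><sz>."""
    tokens, i = [], 0
    while i < len(word):
        if word.startswith("ssz", i):
            tokens += ["s_or_sz", "sz"]
            i += 3
        elif word.startswith("sz", i):
            tokens.append("sz")
            i += 2
        else:
            tokens.append(word[i])
            i += 1
    return tokens

def key(word):
    """Single-level sort key: the list of token weights."""
    return [WEIGHT[t] for t in tokenize(word)]

print(tokenize("kasza"))   # <k><a><sz><a>
print(tokenize("kassza"))  # <k><a><s_or_sz><sz><a>
# The third token decides: <s_or_sz> sorts before <sz>, so "kassza"
# ends up in front of "kasza", which is the wrong way round.
print(sorted(["kasza", "kassza"], key=key))
```

In this model the sorted list comes out as "kassza", "kasza", matching the backwards order described above.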

The solution is to tokenize both "ssz" and "szsz" as <sz><sz> (as we did
before), but to apply something weaker on top of them, along the lines of a
"fake accent" (SINGLE-OR-COMPOUND vs. COMPOUND), that can distinguish them at a
later comparison level.
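The "fake accent" idea can be sketched as a two-level sort key. This is an illustrative Python model, not the fix from bug 18934: the mark names follow the comment above, while the function names, the weights and the relative order of the two marks are assumptions, and only the letters of the example words are covered.

```python
# Primary order for the few letters used in the examples.
ORDER = ["a", "k", "s", "sz"]
PRIMARY = {tok: i for i, tok in enumerate(ORDER)}

# Secondary marks; which one sorts first is an arbitrary choice here.
COMPOUND = 0            # also used as the neutral mark for plain letters
SINGLE_OR_COMPOUND = 1  # first half of the "ssz" shorthand

def tokenize(word):
    """Return (token, mark) pairs; "ssz" and "szsz" share primary tokens."""
    pairs, i = [], 0
    while i < len(word):
        if word.startswith("ssz", i):
            pairs += [("sz", SINGLE_OR_COMPOUND), ("sz", COMPOUND)]
            i += 3
        elif word.startswith("sz", i):
            pairs.append(("sz", COMPOUND))
            i += 2
        else:
            pairs.append((word[i], COMPOUND))
            i += 1
    return pairs

def key(word):
    """Two-level key: primary weights first, marks only as a tie-breaker."""
    pairs = tokenize(word)
    primary = tuple(PRIMARY[tok] for tok, _ in pairs)
    secondary = tuple(mark for _, mark in pairs)
    return (primary, secondary)

# Primary level alone orders real words correctly ...
print(sorted(["kassza", "kasza"], key=key))
# ... and the secondary level keeps "ssz" and "szsz" from comparing equal.
print(key("ssz") != key("szsz"))
```

Because the marks are only consulted when the primary weights tie, they cannot reorder valid words the way the <s_or_sz> token did.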

Let's leave this bug closed. A fix is available in bug 18934.


* [Bug localedata/13547] Different strings collate as equal in Hungarian
From: vapier at gentoo dot org @ 2016-04-21  5:36 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=13547

Mike Frysinger <vapier at gentoo dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://sourceware.org/bugz
                   |                            |illa/show_bug.cgi?id=18934


* [Bug localedata/13547] Different strings collate as equal in Hungarian
From: cvs-commit at gcc dot gnu.org @ 2017-03-28 21:34 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=13547

--- Comment #4 from cvs-commit at gcc dot gnu.org <cvs-commit at gcc dot gnu.org> ---
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  ea1898dded26316e2e73adfb409224e864ffaa8b (commit)
      from  78c05814320cdc3377347f8e5fdbaa7cf5abf5b5 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ea1898dded26316e2e73adfb409224e864ffaa8b

commit ea1898dded26316e2e73adfb409224e864ffaa8b
Author: Egmont Koblinger <egmont@gmail.com>
Date:   Wed Mar 22 21:27:30 2017 -0400

    localedata: hu_HU: fix multiple sorting bugs (bug 18934)

    Fix the incorrect sorting order of a digraph and its geminated variant,
    regression introduced by a faulty fix to bug 13547 in commit
    b008d4c85619a753e441d7f473ba8af0db400bd6.

    Fix two inconsistencies in sorting unusual capitalization of digraphs
    (bug #18587).

    Enable DIACRIT_FORWARD to work around bug #17750.

    Sort foreign accents after the Hungarian ones.

    Add extensive unittests containing all the examples from The Rules of
    Hungarian Orthography and many more, including explanatory comments.

-----------------------------------------------------------------------

Summary of changes:
 NEWS                     |    4 +
 localedata/ChangeLog     |    7 +
 localedata/Makefile      |    4 +-
 localedata/hu_HU.in      |  560 ++++++++++++++++++++++++++++++++++++++++++++++
 localedata/locales/hu_HU |  286 ++++++++++++------------
 5 files changed, 716 insertions(+), 145 deletions(-)
 create mode 100644 localedata/hu_HU.in


Thread overview: 7+ messages
2012-01-03  1:11 [Bug localedata/13547] New: Different strings collate as equal in Hungarian egmont at gmail dot com
2012-01-03  1:11 ` [Bug localedata/13547] " egmont at gmail dot com
2012-01-07 17:08 ` drepper.fsp at gmail dot com
2014-06-27 11:21 ` fweimer at redhat dot com
2015-09-08  8:39 ` egmont at gmail dot com
2016-04-21  5:36 ` vapier at gentoo dot org
2017-03-28 21:34 ` cvs-commit at gcc dot gnu.org
