public inbox for glibc-bugs@sourceware.org
* [Bug locale/18978] New: The collation symbol “UNDEFINED” does not work as specified in the standard
@ 2015-09-17 11:56 maiku.fabian at gmail dot com
  2021-11-02  2:55 ` [Bug locale/18978] " carlos at redhat dot com
  0 siblings, 1 reply; 2+ messages in thread
From: maiku.fabian at gmail dot com @ 2015-09-17 11:56 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=18978

            Bug ID: 18978
           Summary: The collation symbol “UNDEFINED” does not work as
                    specified in the standard
           Product: glibc
           Version: 2.22
            Status: NEW
          Severity: normal
          Priority: P2
         Component: locale
          Assignee: unassigned at sourceware dot org
          Reporter: maiku.fabian at gmail dot com
  Target Milestone: ---

http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html

says: 

opengroup> Collation Order
opengroup> 
opengroup> [...]
opengroup> 
opengroup> The symbol UNDEFINED shall be interpreted as including all
opengroup> coded character set values not specified explicitly or via
opengroup> the ellipsis symbol. Such characters shall be inserted in
opengroup> the character collation order at the point indicated by the
opengroup> symbol, and in ascending order according to their coded
opengroup> character set values. If no UNDEFINED symbol is specified,
opengroup> and the current coded character set contains characters not
opengroup> specified in this section, the utility shall issue a
opengroup> warning message and place such characters at the end of the
opengroup> character collation order.
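
Read literally, the rule quoted above can be modelled in a few lines. This is only a minimal sketch of the specified behaviour, not of glibc's implementation: it handles single characters and one collation level, and the names `UNDEFINED` and `collation_key` are hypothetical helpers introduced here for illustration.

```python
# Hypothetical stand-in for the UNDEFINED keyword of an LC_COLLATE order.
UNDEFINED = object()

def collation_key(order):
    """Build a sort key for single characters from an explicit order list.

    Characters missing from `order` are placed at the UNDEFINED marker,
    in ascending code point order. Without a marker they go to the end,
    as the standard prescribes (the warning is not modelled here).
    """
    marker = order.index(UNDEFINED) if UNDEFINED in order else len(order)
    rank = {ch: i for i, ch in enumerate(order) if ch is not UNDEFINED}

    def key(ch):
        if ch in rank:
            return (rank[ch], 0)      # explicitly specified character
        return (marker, ord(ch))      # unspecified: sorts at the marker
    return key

# UNDEFINED in the middle: the unspecified "B" should follow "C".
middle = ["@", "D", "A", "C", UNDEFINED, "E", "F"]
print(sorted("FBEDCA@", key=collation_key(middle)))

# UNDEFINED at the end: an unspecified character should sort last.
end = ["A", "z", "ä", UNDEFINED]
print(sorted(["⅞", "ä", "A", "z"], key=collation_key(end)))
```

Under this model an unspecified character can never sort before the explicitly listed characters that precede the marker.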

Unfortunately, glibc does not behave this way.

For example:

The Japanese locale source file /usr/share/i18n/locales/ja_JP
has this in the LC_COLLATE section:

    mfabian@ari:/usr/share/i18n/locales
    $ grep -A 8 ^LC_COLLATE ja_JP
    LC_COLLATE
    order_start forward
    %
    % C0
    %
    <U0000>
    <U0001>
    <U0002>
    <U0003>
    mfabian@ari:/usr/share/i18n/locales
    $ grep -B 8 '^END LC_COLLATE' ja_JP
    <U9F97>
    <U9F9E>
    <U9FA1>
    <U9FA2>
    <U9FA3>
    <U9FA5>
    UNDEFINED
    order_end
    END LC_COLLATE
    mfabian@ari:/usr/share/i18n/locales
    $

I.e. it includes the “UNDEFINED” collation symbol at the end.

Now if I choose a character which is *not* specified in
the LC_COLLATE section, either explicitly or via the ellipsis,
for example:

    ⅞ U+215E VULGAR FRACTION SEVEN EIGHTHS

and check how it sorts, I find:

mfabian@ari:~/testdir
$ LANG=ja_JP.UTF-8 ls
⅞ A  B  C  D  O  U  Z  a  b  c  d  o  u  z  Þ  æ  đ  ı  ß  İ  ä  ö  ü
mfabian@ari:~/testdir
$

I.e. it sorts at the beginning, not at the end. (The other non-ASCII
characters in this example *are* explicitly specified in the sort
order, which is why they appear after “z”, exactly as specified.)
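
The same check can be made without `ls`. The following is a minimal sketch, assuming that Python's `locale.strxfrm` is backed by the same C library collation data that `ls` consults; the helper name `sorted_in_locale` is hypothetical, and the function returns `None` when the requested locale is not installed.

```python
import locale

def sorted_in_locale(names, locale_name):
    """Sort `names` by the LC_COLLATE rules of `locale_name`.

    Returns None if the locale is not available on this system.
    """
    saved = locale.setlocale(locale.LC_COLLATE)   # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None                               # locale not installed
    try:
        return sorted(names, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # always restore

names = ["⅞", "A", "z", "ä"]
result = sorted_in_locale(names, "ja_JP.UTF-8")
print(result if result is not None else "ja_JP.UTF-8 is not installed")
```

If “⅞” comes first in the printed list, the behaviour described in this report is reproduced; per the standard it should come last, because UNDEFINED is the final entry in the ja_JP order.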

To test this further, I created my own variant of

/usr/share/i18n/locales/POSIX

by removing the entry for “B” (<U0042>) and adding UNDEFINED:

LC_COLLATE
# This is the POSIX Locale definition for the LC_COLLATE category.
# The order is the same as in the ASCII code set.
order_start forward
<U0000>
<U0001>

normal stuff here

modified part follows:

<U0040>         <- @
<U0044>         <- D (moved here to make sure I am really using my modified locale)
<U0041>         <- A
<U0043>         <- C 
UNDEFINED       <- B is *not* specified any more! Therefore it should go here!
<U0045>         <- E
<U0046>         <- F

more normal stuff here

<U007E>
<U007F>
order_end
#
END LC_COLLATE

And when testing this (I installed this modified POSIX locale
using localedef under the name "POSIXMIKE"):

mfabian@ari:~/testdir
$ LANG=POSIXMIKE ls
B  ??  ??  ??  ??  ??  ??  ??  ??  ??  ???  D  A  C  O  U  Z  a  b  c  d  o  u  z
mfabian@ari:~/testdir
$

So the now unspecified “B” is sorted at the beginning and *not*
after “C” where the “UNDEFINED” collation symbol is.
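
The mismatch can also be stated mechanically. The sketch below is only an illustration: `placement` is a hypothetical helper, and the listing is restricted to the ASCII file names shown above (the entries displayed as “??” are left out). Because “E” and “F” are not among the test files, “O” is used as the first listed name that should follow the marker.

```python
def placement(listing, name, before, after):
    """Classify where `name` landed relative to the entries that should
    surround the UNDEFINED marker in `listing`."""
    i = listing.index(name)
    if listing.index(before) < i < listing.index(after):
        return "at UNDEFINED"
    return "at start" if i == 0 else "elsewhere"

# ASCII names as printed by `LANG=POSIXMIKE ls` in the report.
observed = "B D A C O U Z a b c d o u z".split()
# Order the standard would give, with "B" inserted at the marker after "C".
expected = "D A C B O U Z a b c d o u z".split()

print(placement(observed, "B", before="C", after="O"))   # at start
print(placement(expected, "B", before="C", after="O"))   # at UNDEFINED
```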

-- 
You are receiving this mail because:
You are on the CC list for the bug.
From: "maiku.fabian at gmail dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug locale/18978] The collation symbol “UNDEFINED” does not work as specified in the standard
Date: Thu, 17 Sep 2015 12:01:00 -0000

https://sourceware.org/bugzilla/show_bug.cgi?id=18978

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |carlos at redhat dot com,
                   |                            |myllynen at redhat dot com




* [Bug locale/18978] The collation symbol “UNDEFINED” does not work as specified in the standard
  2015-09-17 11:56 [Bug locale/18978] New: The collation symbol “UNDEFINED” does not work as specified in the standard maiku.fabian at gmail dot com
@ 2021-11-02  2:55 ` carlos at redhat dot com
  0 siblings, 0 replies; 2+ messages in thread
From: carlos at redhat dot com @ 2021-11-02  2:55 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=18978

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://sourceware.org/bugzilla/show_bug.cgi?id=28526


