public inbox for glibc-bugs@sourceware.org
* [Bug locale/18978] New: The collation symbol “UNDEFINED” does not work as specified in the standard
@ 2015-09-17 11:56 maiku.fabian at gmail dot com
2021-11-02 2:55 ` [Bug locale/18978] " carlos at redhat dot com
0 siblings, 1 reply; 2+ messages in thread
From: maiku.fabian at gmail dot com @ 2015-09-17 11:56 UTC
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=18978
Bug ID: 18978
Summary: The collation symbol “UNDEFINED” does not work as
specified in the standard
Product: glibc
Version: 2.22
Status: NEW
Severity: normal
Priority: P2
Component: locale
Assignee: unassigned at sourceware dot org
Reporter: maiku.fabian at gmail dot com
Target Milestone: ---
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html
says:
opengroup> Collation Order
opengroup>
opengroup> [...]
opengroup>
opengroup> The symbol UNDEFINED shall be interpreted as including all
opengroup> coded character set values not specified explicitly or via
opengroup> the ellipsis symbol. Such characters shall be inserted in
opengroup> the character collation order at the point indicated by the
opengroup> symbol, and in ascending order according to their coded
opengroup> character set values. If no UNDEFINED symbol is specified,
opengroup> and the current coded character set contains characters not
opengroup> specified in this section, the utility shall issue a
opengroup> warning message and place such characters at the end of the
opengroup> character collation order.
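A minimal sketch of the rule quoted above, as I read it. This is only a model of the specified behaviour, not how glibc or localedef implement it. The function name posix_collation_order and its parameters are made up for illustration, and the warning the standard asks for when no UNDEFINED symbol is present is left out.

```python
def posix_collation_order(explicit, charset, marker="UNDEFINED"):
    """Expand an explicit collation list into a full order over `charset`.

    `explicit` is a list of single characters plus at most one `marker`
    entry. Characters of `charset` that are not listed are inserted at the
    marker, in ascending code point order. If there is no marker, they are
    appended at the end of the order.
    """
    listed = {c for c in explicit if c != marker}
    # Unlisted characters, sorted by their coded character set value.
    unlisted = sorted(c for c in set(charset) if c not in listed)

    order = []
    seen_marker = False
    for entry in explicit:
        if entry == marker:
            order.extend(unlisted)
            seen_marker = True
        else:
            order.append(entry)
    if not seen_marker:
        order.extend(unlisted)
    return order


# "b" and "d" are not listed, so they land where UNDEFINED is.
print(posix_collation_order(["a", "UNDEFINED", "c"], "abcd"))
# ['a', 'b', 'd', 'c']

# No UNDEFINED at all: unlisted characters go to the end.
print(posix_collation_order(["c", "a"], "abcd"))
# ['c', 'a', 'b', 'd']
```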
Unfortunately, it does not work like that in glibc.
For example:
The Japanese locale source file /usr/share/i18n/locales/ja_JP
has this in the LC_COLLATE section:
mfabian@ari:/usr/share/i18n/locales
$ grep -A 8 ^LC_COLLATE ja_JP
LC_COLLATE
order_start forward
%
% C0
%
<U0000>
<U0001>
<U0002>
<U0003>
mfabian@ari:/usr/share/i18n/locales
$ grep -B 8 '^END LC_COLLATE' ja_JP
<U9F97>
<U9F9E>
<U9FA1>
<U9FA2>
<U9FA3>
<U9FA5>
UNDEFINED
order_end
END LC_COLLATE
mfabian@ari:/usr/share/i18n/locales
$
I.e. it includes the “UNDEFINED” collation symbol at the end.
Now if I choose a character which is *not* specified in
the LC_COLLATE section, either explicitly or via the ellipsis,
for example:
⅞ U+215E VULGAR FRACTION SEVEN EIGHTHS
and check how it sorts, I find:
mfabian@ari:~/testdir
$ LANG=ja_JP.UTF-8 ls
⅞ A B C D O U Z a b c d o u z Þ æ đ ı ß İ ä ö ü
mfabian@ari:~/testdir
$
I.e. it sorts at the beginning, not at the end. (The other non-ASCII
characters in that sort example *are* explicitly specified
in the sort order, which is why they appear after “z”, as
specified.)
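The same check can probably be done without ls. The following is a sketch using Python's locale module, whose strxfrm wraps the C library's string transformation. The helper name sort_in_locale is hypothetical, and the result depends on the locales installed on the system, so no expected output is given for ja_JP.UTF-8.

```python
import locale


def sort_in_locale(names, locale_name):
    """Sort `names` using the LC_COLLATE rules of `locale_name`.

    Returns None if that locale is not available on this system.
    """
    # Calling setlocale with only a category queries the current setting.
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return sorted(names, key=locale.strxfrm)
    finally:
        # Restore whatever collation locale was active before.
        locale.setlocale(locale.LC_COLLATE, previous)


if __name__ == "__main__":
    # A few of the file names from the test directory above.
    names = ["⅞", "A", "B", "z", "ä"]
    print(sort_in_locale(names, "ja_JP.UTF-8"))
```

If “⅞” comes out first in the printed list, that matches the ls result above.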
To test this further, I created my own variant of
/usr/share/i18n/locales/POSIX
by removing the entry for “B” from the collation order and placing
UNDEFINED after “C”:
LC_COLLATE
# This is the POSIX Locale definition for the LC_COLLATE category.
# The order is the same as in the ASCII code set.
order_start forward
<U0000>
<U0001>
normal stuff here
modified part follows:
<U0040> <- @
<U0044> <- D (moved here to make sure I am really using my modified
locale)
<U0041> <- A
<U0043> <- C
UNDEFINED <- B is *not* specified any more! Therefore it should go here!
<U0045> <- E
<U0046> <- F
more normal stuff here
<U007E>
<U007F>
order_end
#
END LC_COLLATE
And when testing this (I installed this modified POSIX locale
using localedef under the name "POSIXMIKE"):
mfabian@ari:~/testdir
$ LANG=POSIXMIKE ls
B ?? ?? ?? ?? ?? ?? ?? ?? ?? ??? D A C O U Z a b c d o u
z
mfabian@ari:~/testdir
$
So the now unspecified “B” is sorted at the beginning and *not*
after “C” where the “UNDEFINED” collation symbol is.
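For comparison, here is a sketch of the order the standard would seem to require for the modified part of the locale source shown above. It only covers the lines from <U0040> to <U0046>; the parts elided as “normal stuff” are not modelled, and the names EXCERPT and expected_order are made up for this example.

```python
import re

# The modified part of the LC_COLLATE section, without the annotations.
EXCERPT = """\
<U0040>
<U0044>
<U0041>
<U0043>
UNDEFINED
<U0045>
<U0046>
"""


def expected_order(source, charset):
    """Return the collation order the quoted rule asks for over `charset`."""
    entries = []
    for line in source.splitlines():
        line = line.strip()
        match = re.fullmatch(r"<U([0-9A-Fa-f]{4,8})>", line)
        if match:
            entries.append(chr(int(match.group(1), 16)))
        elif line == "UNDEFINED":
            entries.append(None)  # placeholder for all unlisted characters

    listed = {e for e in entries if e is not None}
    unlisted = sorted(set(charset) - listed)  # ascending code point order

    order = []
    for entry in entries:
        if entry is None:
            order.extend(unlisted)
        else:
            order.append(entry)
    if None not in entries:
        order.extend(unlisted)  # no UNDEFINED: unlisted characters go last
    return order


print("".join(expected_order(EXCERPT, "@ABCDEF")))  # prints @DACBEF
```

In this model “B” ends up between “C” and “E”, not at the beginning as in the ls output above.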
--
You are receiving this mail because:
You are on the CC list for the bug.
From: "maiku.fabian at gmail dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug locale/18978] The collation symbol “UNDEFINED” does not work as specified in the standard
Date: Thu, 17 Sep 2015 12:01:00 -0000
https://sourceware.org/bugzilla/show_bug.cgi?id=18978
Mike FABIAN <maiku.fabian at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |carlos at redhat dot com,
| |myllynen at redhat dot com
* [Bug locale/18978] The collation symbol “UNDEFINED” does not work as specified in the standard
2015-09-17 11:56 [Bug locale/18978] New: The collation symbol “UNDEFINED” does not work as specified in the standard maiku.fabian at gmail dot com
@ 2021-11-02 2:55 ` carlos at redhat dot com
0 siblings, 0 replies; 2+ messages in thread
From: carlos at redhat dot com @ 2021-11-02 2:55 UTC
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=18978
Carlos O'Donell <carlos at redhat dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
See Also| |https://sourceware.org/bugzilla/show_bug.cgi?id=28526