* [Bug localedata/25036] New: Update collation order for Swedish
From: sebras at gmail dot com @ 2019-09-25 22:38 UTC
To: libc-locales
https://sourceware.org/bugzilla/show_bug.cgi?id=25036
Bug ID: 25036
Summary: Update collation order for Swedish
Product: glibc
Version: 2.32
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: localedata
Assignee: unassigned at sourceware dot org
Reporter: sebras at gmail dot com
CC: libc-locales at sourceware dot org
Target Milestone: ---
[TL;DR This is about an issue with/proposed patch for Swedish localedata]
Today I tried to compile llpp git HEAD (3371788 of [1]), but one of its
dependencies, ocaml-4.09 [2], failed to link properly due to undefined
references; see below.
[1] https://github.com/moosotc/llpp.git
[2] https://caml.inria.fr/pub/distrib/ocaml-4.09/ocaml-4.09.0.tar.xz
....
gcc -O2 -fno-strict-aliasing -fwrapv -Wall -fno-tree-vrp -g
-D_FILE_OFFSET_BITS=64 -D_REENTRANT -DCAML_NAME_SPACE
-DOCAML_STDLIB_DIR='"/usr/local/lib/ocaml"' -Wl,-E -o ocamlrun prims.o
libcamlrun.a -lm -ldl -lpthread
/usr/bin/ld: prims.o:(.data.rel+0x0): undefined reference to `caml_'
/usr/bin/ld: prims.o:(.data.rel+0x1d0): undefined reference to `caml_bs'
/usr/bin/ld: prims.o:(.data.rel+0x278): undefined reference to `caml_convert_ra'
....
In ocaml-4.09/runtime there is a shell script named gen_primitives.sh which as
input takes a set of .c-files implementing the ocaml primitives, and as output
generates a list of symbol names of all the primitives. This list is then used
to generate a .c-file named prims.c, referencing each of the primitives.
prims.c is later compiled and linked with a library compiled from the original
set of .c-files. If the list generated by gen_primitives.sh contains spurious
symbols you will get the link error I encountered above.
gen_primitives.sh uses this sed command to collect the symbols from a .c-file:
sed -n -e "s/^CAMLprim value \([a-z0-9_][a-z0-9_]*\).*/\1/p" <.c-file name here>
When the script processes ints.c it encounters (among others) these two lines,
which result in the symbols caml_bs and caml_int_compare being listed as
present:
CAMLprim value caml_bswap16(value v)
CAMLprim value caml_int_compare(value v1, value v2)
But that is surprising, since the expected result would be caml_bswap16 and
caml_int_compare! I managed to reproduce this independently of the script and
the input files by running:
echo -e 'CAMLprim value caml_bswap16(value v)\nCAMLprim value caml_int_compare(value v1, value v2)' \
| sed -n -e "s/^CAMLprim value \([a-z0-9_][a-z0-9_]*\).*/\1/p"
Cutting down the text and regexes used results in a smaller example:
echo -e 'bsw\nint' | sed -n -e "s/^\([a-z0-9_][a-z0-9_]*\).*/\1/p"
which gives the outputs "bs" and "int". Minimizing the test case further:
echo -e 'w\ni' | sed -n -e "s/^\([a-z0-9_][a-z0-9_]*\).*/\1/p"
yields something surprising: only "i" is printed! Next I tested with the
entire alphabet (example cut down for brevity):
echo -e 's\nt\nu\nv\nw\nx\ny\nz' | sed -n -e 's/^[a-z]$/&/p'
The problem only appeared for "w" -- why? At this point the author of llpp
suggested that it might be due to locale settings. My only locale settings are:
LANG=en_US.UTF-8
LC_COLLATE=sv_SE.UTF-8
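To make the interaction between these two settings explicit, here is a minimal sketch of the usual POSIX precedence for the collation category: LC_ALL wins if set, then LC_COLLATE, then LANG. The function name effective_collate is hypothetical, and the final fallback to "C" is an assumption (POSIX leaves the default implementation-defined).

```shell
# effective_collate LC_ALL LC_COLLATE LANG
# Prints the locale name that governs collation, following the POSIX
# precedence: LC_ALL first, then LC_COLLATE, then LANG.
# An empty value is treated as unset.
# Assumption: fall back to "C" when all three are empty.
effective_collate() {
    printf '%s\n' "${1:-${2:-${3:-C}}}"
}

# Report the value for the current environment.
effective_collate "${LC_ALL-}" "${LC_COLLATE-}" "${LANG-}"

# The settings from this report: collation follows sv_SE.UTF-8
# even though LANG is en_US.UTF-8.
effective_collate "" "sv_SE.UTF-8" "en_US.UTF-8"
```

So with LANG=en_US.UTF-8 and LC_COLLATE=sv_SE.UTF-8, anything that consults the collation category sees the Swedish locale.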
Despite me using a predominantly English (US) desktop I do prefer if e.g. music
files whose names contain Swedish characters are sorted according to Swedish
alphabetical order. The same goes for sort and other pieces of software.
The same command with a different collation setting yielded the expected
result:
echo -e 's\nt\nu\nv\nw\nx\ny\nz' | LC_COLLATE=sv_SE.UTF-8 sed -n -e 's/^[a-z]$/&/p'
echo -e 's\nt\nu\nv\nw\nx\ny\nz' | LC_COLLATE= sed -n -e 's/^[a-z]$/&/p'
The same is true also for the prior command:
echo -e 'w\ni' | LC_COLLATE= sed -n -e "s/^\([a-z0-9_][a-z0-9_]*\).*/\1/p"
which now prints "w" and "i". Unsetting LC_COLLATE allowed me to compile ocaml,
and thus llpp, successfully, which resolved my original issue.
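A build script can also protect itself against this class of problem by pinning the locale for the one command that uses a range expression. The following is only a sketch of that idea, not what gen_primitives.sh actually does; the function name extract_prims is hypothetical. LC_ALL=C is used because it overrides LC_COLLATE, and in the C locale [a-z] covers exactly the 26 ASCII lowercase letters.

```shell
# Hypothetical helper: read C source on stdin and print the names of the
# primitives declared with "CAMLprim value". The sed expression is the one
# quoted above; only the LC_ALL=C prefix is new.
extract_prims() {
    LC_ALL=C sed -n -e 's/^CAMLprim value \([a-z0-9_][a-z0-9_]*\).*/\1/p'
}

# Feed it the two lines from ints.c that were mangled before.
printf '%s\n' \
    'CAMLprim value caml_bswap16(value v)' \
    'CAMLprim value caml_int_compare(value v1, value v2)' \
    | extract_prims
```

With the locale pinned, the two names come out whole (caml_bswap16 and caml_int_compare) regardless of the caller's LC_COLLATE.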
This is not the end of the story though, and does not explain why I'm filing a
bug report with glibc.
I wanted to understand why LC_COLLATE influences the regexes in sed, and why
specifically for "w". I dug through the sed documentation [3]. I can (barely)
accept that LC_COLLATE influences regular expressions this way.
[3] http://git.savannah.gnu.org/cgit/sed.git/tree/doc/sed.texi#n5669
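To see which letters a range expression misses under a given locale, each letter can be tested on its own. This is a small sketch; the function name letters_outside_range is hypothetical, and the result depends on the installed locale data and on sed using the glibc regex implementation.

```shell
# letters_outside_range LOCALE
# Prints every ASCII lowercase letter that the bracket expression [a-z]
# does NOT match when sed runs with LC_ALL set to LOCALE.
# Caveat (assumption about glibc behaviour): if LOCALE is not installed,
# setlocale fails and the program stays in the "C" locale, so nothing
# is reported.
letters_outside_range() {
    for c in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
        if ! printf '%s\n' "$c" | LC_ALL="$1" sed -n -e '/^[a-z]$/p' | grep -q .; then
            printf '%s\n' "$c"
        fi
    done
}

# In the C locale the range covers all 26 letters, so this prints nothing.
letters_outside_range C
```

Running letters_outside_range sv_SE.UTF-8 on an affected system would be expected to print "w", matching the behaviour described above.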
The problem for me, being a native Swedish speaker, is that w and v are
considered separate characters [4] and that Swedish alphabetical order sorts w
after v [5].
[4] https://en.wikipedia.org/wiki/Swedish_alphabet#Uncommon_letters
[5]
https://en.wikipedia.org/wiki/Alphabetical_order#Language-specific_conventions
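The difference between the two conventions is easy to observe with sort. The sketch below uses a hypothetical helper, sort_under, and three made-up words. Under an order that treats w as a variant of v, "wb" would be expected to land between "va" and "vc"; under an order where w is a separate letter after v, it sorts last.

```shell
# sort_under LOCALE WORD...
# Sorts the given words, one per line, with LC_ALL set to LOCALE.
sort_under() {
    loc=$1
    shift
    printf '%s\n' "$@" | LC_ALL="$loc" sort
}

# In the C locale w is a distinct letter after v, so "wb" comes last.
sort_under C va wb vc
```

Comparing this with sort_under sv_SE.UTF-8 va wb vc shows which convention the installed Swedish locale data follows.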
That suggested that my localedata must have problems, so I read the contents of
/usr/share/i18n/locales/sv_SE. I found one section discussing whether v and w
are distinct characters [6] and another mentioning alphabetical sorting order
[7].
[6]
https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/locales/sv_SE;h=aa28c23776408e593890883ebb4c8d70b971fe15;hb=HEAD#l106
[7]
https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/locales/sv_SE;h=aa28c23776408e593890883ebb4c8d70b971fe15;hb=HEAD#l63
The part mentioning v and w being the same character [6] was added to glibc's
localedata in a commit in 2000 [8], before the reform (according to [5]) in
2006. The part referring to the standard collation order [7] was added in a
commit in 2017 [9].
[8]
https://sourceware.org/git/?p=glibc.git;a=commit;h=cf365734e59f708b1610815196ee2303174308cf
[9]
https://sourceware.org/git/?p=glibc.git;a=commit;h=159738548130d5ac4fe6178977e940ed5f8cfdc4
It seems even the latter commit [9] failed to take notice of the Unicode
organization's CLDR-project [10] adding a draft of the collation order reform
back in 2005 [11][12], officially published in CLDR 1.4.0 [13]. Two years later
they also changed the default collation order [14][15], officially published in
CLDR 1.5.0 [16].
[10] https://github.com/unicode-org/cldr
[11]
https://github.com/unicode-org/cldr/commit/6daf16a4fb81ffbfc91de3dd12fbdd270ca89cfd
[12]
https://github.com/unicode-org/cldr/commit/d1354eaaa147aacf9fc2e7e30c563ed5be7d8227
[13] http://cldr.unicode.org/index/downloads/cldr-1-4
[14]
https://github.com/unicode-org/cldr/commit/aaa134f4f74db91b781be1c8b9d9e9f1fd123c0e
[15]
https://github.com/unicode-org/cldr/commit/ffe1610149231d8feb5e4d9ba4bb0857f224869d
[16] http://cldr.unicode.org/index/downloads/cldr-1-5
I'd like glibc's localedata to follow the reformed collation order, mainly
because of the language reform, but of course also because this would have
prevented me from encountering the original build issues in llpp. :) The
attached patch tries to address this. If there are any issues with it, or if
something is unclear, please let me know.
--
You are receiving this mail because:
You are on the CC list for the bug.
* [Bug localedata/25036] Update collation order for Swedish
From: sebras at gmail dot com @ 2019-09-25 22:40 UTC
To: libc-locales
https://sourceware.org/bugzilla/show_bug.cgi?id=25036
--- Comment #1 from sebras at gmail dot com ---
Created attachment 12004
--> https://sourceware.org/bugzilla/attachment.cgi?id=12004&action=edit
Proposed patch updating collation order.
* [Bug localedata/25036] Update collation order for Swedish
From: sebras at gmail dot com @ 2019-09-26 21:33 UTC
To: libc-locales
https://sourceware.org/bugzilla/show_bug.cgi?id=25036
--- Comment #2 from sebras at gmail dot com ---
It came to my attention that I should point out that sed 4.7-1 is configured
with --without-included-regex in my distribution, and so relies on the glibc
regex implementation as far as I can tell. If your distribution does not supply
this flag when compiling sed you will be unable to reproduce the issues I saw.
* [Bug localedata/25036] Update collation order for Swedish
From: sebras at gmail dot com @ 2019-09-28 1:26 UTC
To: libc-locales
https://sourceware.org/bugzilla/show_bug.cgi?id=25036
sebras at gmail dot com changed:
             What             |Removed |Added
----------------------------------------------------------------------------
Attachment #12004 is obsolete |0       |1
--- Comment #3 from sebras at gmail dot com ---
Created attachment 12008
--> https://sourceware.org/bugzilla/attachment.cgi?id=12008&action=edit
Proposed patch updating collation order.
This version of my proposed patch also addresses the tests, making 'make check'
PASS localedata/sort-test.
* [Bug localedata/25036] Update collation order for Swedish
From: fweimer at redhat dot com @ 2019-10-04 21:47 UTC
To: libc-locales
https://sourceware.org/bugzilla/show_bug.cgi?id=25036
Florian Weimer <fweimer at redhat dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |fweimer at redhat dot com
Flags| |security-
* [Bug localedata/25036] Update collation order for Swedish
From: tfn50 at hotmail dot com @ 2021-01-07 17:58 UTC
To: libc-locales
https://sourceware.org/bugzilla/show_bug.cgi?id=25036
tfn50 at hotmail dot com changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |tfn50 at hotmail dot com
--- Comment #4 from tfn50 at hotmail dot com ---
I just had to create an account here to add my comment to this. The ordering in
which W does not count as a proper letter is no longer the recommended one for
Swedish as used in Sweden. Svenska Akademiens ordlista (SAOL) has since 2006
considered W to be its own proper letter, sorted after V (just as it is in, for
instance, English).
For reference, here is an old DN.se news article, written in Swedish of course
(now only available through the web archive):
https://web.archive.org/web/20071001091738/http://www.dn.se/DNet/jsp/polopoly.jsp?d=1058&a=538744&previousRenderType=3
Could we please get the locale data updated?
* [Bug localedata/25036] Update collation order for Swedish
From: carlos at redhat dot com @ 2021-02-02 21:36 UTC
To: libc-locales
https://sourceware.org/bugzilla/show_bug.cgi?id=25036
Carlos O'Donell <carlos at redhat dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |carlos at redhat dot com
--- Comment #5 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to tfn50 from comment #4)
> I just had to create an account here to add my comment to this. The ordering
> in which W does not count as a proper letter is no longer the recommended one
> for Swedish as used in Sweden. Svenska Akademiens ordlista (SAOL) has since
> 2006 considered W to be its own proper letter, sorted after V (just as it is
> in, for instance, English).
>
> For reference, here is an old DN.se news article, written in Swedish of
> course (now only available through the web archive):
> https://web.archive.org/web/20071001091738/http://www.dn.se/DNet/jsp/polopoly.jsp?d=1058&a=538744&previousRenderType=3
>
> Could we please get the locale data updated?
Thanks for reaching out. Yes, we've noticed this issue, and I think the
proposed patch from 2019 is the right way forward. We'll try to get this
fixed for glibc 2.34.
* [Bug localedata/25036] Update collation order for Swedish
From: sebras at gmail dot com @ 2021-02-03 0:10 UTC
To: libc-locales
https://sourceware.org/bugzilla/show_bug.cgi?id=25036
--- Comment #6 from sebras at gmail dot com ---
(In reply to Carlos O'Donell from comment #5)
> I think the proposed patch from 2019 is the right way forward.
> We'll try to get this fixed for glibc 2.34.
To be honest I had forgotten about this bug by now, but I'm happy to hear
that it is moving forward. :)
* [Bug localedata/25036] Update collation order for Swedish
From: carlos at redhat dot com @ 2021-04-06 16:55 UTC
To: libc-locales
https://sourceware.org/bugzilla/show_bug.cgi?id=25036
Carlos O'Donell <carlos at redhat dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|--- |2.34
Resolution|--- |FIXED
Status|UNCONFIRMED |RESOLVED
--- Comment #7 from Carlos O'Donell <carlos at redhat dot com> ---
Fixed for glibc 2.34.