From: "sebras at gmail dot com"
To: libc-locales@sourceware.org
Subject: [Bug localedata/25036] New: Update collation order for Swedish
Date: Wed, 25 Sep 2019 22:38:00 -0000

https://sourceware.org/bugzilla/show_bug.cgi?id=25036

            Bug ID: 25036
           Summary: Update collation order for Swedish
           Product: glibc
           Version: 2.32
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: localedata
          Assignee: unassigned at sourceware dot org
          Reporter: sebras at gmail dot com
                CC: libc-locales at sourceware dot org
  Target Milestone: ---

[TL;DR This is about an issue with/proposed patch for Swedish localedata]

Today I tried to compile llpp git HEAD (3371788 of [1]), but one of its
dependencies, ocaml-4.09 [2], failed to link properly due to undefined
references, see below.
[1] https://github.com/moosotc/llpp.git
[2] https://caml.inria.fr/pub/distrib/ocaml-4.09/ocaml-4.09.0.tar.xz

....
gcc -O2 -fno-strict-aliasing -fwrapv -Wall -fno-tree-vrp -g -D_FILE_OFFSET_BITS=64 -D_REENTRANT -DCAML_NAME_SPACE -DOCAML_STDLIB_DIR='"/usr/local/lib/ocaml"' -Wl,-E -o ocamlrun prims.o libcamlrun.a -lm -ldl -lpthread
/usr/bin/ld: prims.o:(.data.rel+0x0): undefined reference to `caml_'
/usr/bin/ld: prims.o:(.data.rel+0x1d0): undefined reference to `caml_bs'
/usr/bin/ld: prims.o:(.data.rel+0x278): undefined reference to `caml_convert_ra'
....

In ocaml-4.09/runtime there is a shell script named gen_primitives.sh
which takes as input a set of .c-files implementing the ocaml primitives,
and generates as output a list of the symbol names of all the primitives.
This list is then used to generate a .c-file named prims.c, referencing
each of the primitives. prims.c is later compiled and linked with a
library compiled from the original set of .c-files. If the list generated
by gen_primitives.sh contains spurious symbols you will get the link error
I encountered above.

gen_primitives.sh uses this sed command to collect the symbols from a
.c-file:

sed -n -e "s/^CAMLprim value \([a-z0-9_][a-z0-9_]*\).*/\1/p" <.c-file name here>

When the script processes ints.c it encounters (among others) these two
lines, which result in the symbols caml_bs and caml_int_compare being
listed as present:

CAMLprim value caml_bswap16(value v)
CAMLprim value caml_int_compare(value v1, value v2)

But that is surprising, since the expected result would be caml_bswap16
and caml_int_compare!
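For reference, the expected extraction can be reproduced with a sketch
like the one below. It assumes that pinning the locale to C for the sed
call is acceptable, so that [a-z] is a plain byte range. The helper name
extract_prims is hypothetical and is not part of gen_primitives.sh, and
this is not necessarily how ocaml upstream handles it.

```shell
#!/bin/sh
# Sketch: same sed expression as gen_primitives.sh, but with the locale
# pinned to C so that the range [a-z] does not depend on LC_COLLATE.
# extract_prims is a made-up helper name used only for illustration.
extract_prims() {
    # With no file arguments, sed reads from standard input.
    LC_ALL=C sed -n -e 's/^CAMLprim value \([a-z0-9_][a-z0-9_]*\).*/\1/p' "$@"
}

# Feed the two lines from ints.c quoted above.
printf '%s\n' \
    'CAMLprim value caml_bswap16(value v)' \
    'CAMLprim value caml_int_compare(value v1, value v2)' \
    | extract_prims
```

With the locale pinned this way, the two full names caml_bswap16 and
caml_int_compare are printed. LC_ALL takes precedence over LC_COLLATE, so
the user's collation preference stays untouched for everything else.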
I managed to reproduce this independently of the script and the input
files by running:

echo -e 'CAMLprim value caml_bswap16(value v)\nCAMLprim value caml_int_compare(value v1, value v2)' \
| sed -n -e "s/^CAMLprim value \([a-z0-9_][a-z0-9_]*\).*/\1/p"

Cutting down the text and regexes used results in a smaller example:

echo -e 'bsw\nint' | sed -n -e "s/^\([a-z0-9_][a-z0-9_]*\).*/\1/p"

which gives the outputs "bs" and "int". Minimizing the test case further:

echo -e 'w\ni' | sed -n -e "s/^\([a-z0-9_][a-z0-9_]*\).*/\1/p"

yields something surprising: only "i" is printed! Next I tested with the
entire alphabet (example cut down for brevity):

echo -e 's\nt\nu\nv\nw\nx\ny\nz' | sed -n -e 's/^[a-z]$/&/p'

The problem only appeared for "w" -- why? At this point the author of llpp
suggested that it might be due to locale settings. My only locale settings
are:

LANG=en_US.UTF-8
LC_COLLATE=sv_SE.UTF-8

Although I use a predominantly English (US) desktop, I prefer that e.g.
music files whose names contain Swedish characters are sorted according to
Swedish alphabetical order. The same goes for sort and other pieces of
software.

The same command with a different collation setting yielded the expected
result:

echo -e 's\nt\nu\nv\nw\nx\ny\nz' | LC_COLLATE=sv_SE.UTF-8 sed -n -e 's/^[a-z]$/&/p'
echo -e 's\nt\nu\nv\nw\nx\ny\nz' | LC_COLLATE= sed -n -e 's/^[a-z]$/&/p'

The same is true also for the prior command:

echo -e 'w\ni' | LC_COLLATE= sed -n -e "s/^\([a-z0-9_][a-z0-9_]*\).*/\1/p"

which now prints "w" and "i". Unsetting LC_COLLATE allowed me to compile
ocaml and thus llpp successfully, which resolved my original issue.

This is not the end of the story though, and does not explain why I'm
filing a bug report with glibc. I wanted to understand why LC_COLLATE
influences the regexes in sed, and why specifically for "w". I dug through
the sed documentation [3].
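
To look at the range behaviour in isolation, something like the following
sketch could be used. It prints the lowercase ASCII letters that the
bracket expression [a-z] matches under a given LC_COLLATE value. The
helper name range_matches is made up for illustration, and the sketch
assumes the named locale is installed; if it is not, the result will most
likely just reflect the C locale.

```shell
#!/bin/sh
# Sketch: show which of the letters a..z are matched by the range [a-z]
# when sed runs under the collation locale given as $1.
# range_matches is a hypothetical helper name, not an existing tool.
range_matches() {
    # LC_ALL is emptied for the sed call so that it cannot override the
    # LC_COLLATE value under test.
    printf '%s\n' a b c d e f g h i j k l m n o p q r s t u v w x y z \
        | LC_ALL= LC_COLLATE="$1" sed -n -e '/^[a-z]$/p' \
        | tr -d '\n'
    echo
}

# Compare the byte-wise C locale with the Swedish one.
for loc in C sv_SE.UTF-8; do
    printf '%s: ' "$loc"
    range_matches "$loc"
done
```

Under C all 26 letters are printed. Going by the behaviour described
above, a system with the affected Swedish localedata would be expected to
omit "w" on the sv_SE.UTF-8 line, but the exact output depends on the
installed localedata and is therefore not listed here.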
I can (barely) accept that LC_COLLATE influences regular expressions this
way.

[3] http://git.savannah.gnu.org/cgit/sed.git/tree/doc/sed.texi#n5669

The problem for me, being a native Swedish speaker, is that w and v are
considered separate characters [4] and that Swedish alphabetical order
sorts w after v [5].

[4] https://en.wikipedia.org/wiki/Swedish_alphabet#Uncommon_letters
[5] https://en.wikipedia.org/wiki/Alphabetical_order#Language-specific_conventions

That suggested that my localedata must have problems, so I read the
contents of /usr/share/i18n/locales/sv_SE. I found one section discussing
whether v and w are distinct characters [6] and another mentioning
alphabetical sorting order [7].

[6] https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/locales/sv_SE;h=aa28c23776408e593890883ebb4c8d70b971fe15;hb=HEAD#l106
[7] https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/locales/sv_SE;h=aa28c23776408e593890883ebb4c8d70b971fe15;hb=HEAD#l63

The part mentioning v and w being the same character [6] was added to
glibc's localedata in a commit in 2000 [8], before the reform (according
to [5]) in 2006. The part referring to the standard collation order [7]
was added in a commit in 2017 [9].

[8] https://sourceware.org/git/?p=glibc.git;a=commit;h=cf365734e59f708b1610815196ee2303174308cf
[9] https://sourceware.org/git/?p=glibc.git;a=commit;h=159738548130d5ac4fe6178977e940ed5f8cfdc4

It seems even the latter commit [9] failed to take notice of the Unicode
organization's CLDR-project [10] adding a draft of the collation order
reform back in 2005 [11][12], officially published in CLDR 1.4.0 [13]. Two
years later they also changed the default collation order [14][15],
officially published in CLDR 1.5.0 [16].
[10] https://github.com/unicode-org/cldr
[11] https://github.com/unicode-org/cldr/commit/6daf16a4fb81ffbfc91de3dd12fbdd270ca89cfd
[12] https://github.com/unicode-org/cldr/commit/d1354eaaa147aacf9fc2e7e30c563ed5be7d8227
[13] http://cldr.unicode.org/index/downloads/cldr-1-4
[14] https://github.com/unicode-org/cldr/commit/aaa134f4f74db91b781be1c8b9d9e9f1fd123c0e
[15] https://github.com/unicode-org/cldr/commit/ffe1610149231d8feb5e4d9ba4bb0857f224869d
[16] http://cldr.unicode.org/index/downloads/cldr-1-5

I'd like glibc's localedata to refer to the reformed collation order,
mainly because of the language reform, but of course also because this
would have prevented me from encountering the original build issues in
llpp. :)

In the attached patch I try to address this. If there are any issues with
it or something is unclear, please let me know.

-- 
You are receiving this mail because:
You are on the CC list for the bug.