public inbox for glibc-bugs@sourceware.org
* [Bug localedata/368] New: localedef fails with complex LC_COLLATE rules
From: pablo at mandrakesoft dot com @ 2004-09-05 20:49 UTC (permalink / raw)
To: glibc-bugs
I ran into what seems to be a limit on the number of LC_COLLATE collating-elements.
I was trying to build a dz_BT locale (Dzongkha language, Bhutan);
the sorting rules are quite special: for example, next to the <ka> entry are words starting with a prefix attached to the ka radical, e.g. <da>-<ka>, <ba>-<ka>, etc.; these come just after words starting with <ka>, not with words starting with <da>, <ba>, etc.
In other words, the base collating elements are the 30 base letters, plus 103 prefix-radical collating elements.
Now, it is even more complex than that; some letter sequences are prefix-radical or not depending on what follows them; e.g. <da>-<ga> is a prefix if followed by <ga>, <nga>, <da>, ... but not otherwise.
That is, collating elements must be defined that consist of the prefix element and the next character, which are then sorted as a digraph; e.g.:
collating-element <rad-ga-d-ga> from "<U0F51><U0F42><U0F42>"
...
<rad-ga-d-ga> "<TIB-GA-R_D><TIB-GA>";....
That means there are a lot of collating-elements to define; 303 in total.
But if I use more than 265, the locale doesn't compile (localedef just runs forever, taking 90% of CPU resources and doing nothing); if I comment some of them out so that no more than 265 are in use, it compiles nicely.
I attach the preliminary dz_BT locale I was working on; some entries are commented out with %%%% (four percent signs) so that the file can compile, but for the rules to be complete, all the lines commented out with "%%%%" should be enabled as well.
--
Summary: localedef fails with complex LC_COLLATE rules
Product: glibc
Version: unspecified
Status: NEW
Severity: normal
Priority: P2
Component: localedata
AssignedTo: pere at hungry dot com
ReportedBy: pablo at mandrakesoft dot com
CC: glibc-bugs at sources dot redhat dot com
http://sources.redhat.com/bugzilla/show_bug.cgi?id=368
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
* [Bug localedata/368] localedef fails with complex LC_COLLATE rules
From: pablo at mandrakesoft dot com @ 2004-09-05 20:50 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From pablo at mandrakesoft dot com 2004-09-05 20:50 -------
Created an attachment (id=187)
--> (http://sources.redhat.com/bugzilla/attachment.cgi?id=187&action=view)
sample dz_BT locale (with several lines commented out with "%%%%" that should
be enabled)
* [Bug localedata/368] localedef fails with complex LC_COLLATE rules
From: barbier at linuxfr dot org @ 2005-01-02 23:26 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From barbier at linuxfr dot org 2005-01-02 23:26 -------
Created an attachment (id=332)
--> (http://sources.redhat.com/bugzilla/attachment.cgi?id=332&action=view)
allow more than 256 collating-element definitions
I could not find why elem_size has to be less than 257, and thus dropped
this constraint. Then elem_size had to be computed more accurately in
order to prevent allocation of large unused data.
But your dz_BT file still did not compile because the secondary hashing
function seems to do a poor job: iter was null and there is an endless
loop. A better secondary hashing function is to add 1 to the current
one, but the functions which read collation data would need to be fixed
too. Instead, I chose to enlarge the table when such a loop is
encountered.
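
To illustrate the failure mode described above, here is a minimal sketch of
an open-addressing table whose secondary hash can come out as zero. It is
NOT the ld-collate.c code: the function names, the table layout, and the
step formula hash % (size - 2) are all assumptions made for illustration.
A step of zero means the probe never leaves the occupied slot, which is one
way to get an endless loop; the sketch detects that case and retries with a
larger table, in the spirit of the workaround described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical probe: returns the index of a free slot, or -1 if the
   probe sequence cannot make progress (zero step) or cycles back to
   its starting slot without finding a free one. */
static long probe_free_slot(const int *used, size_t size, uint32_t hash)
{
    assert(size > 2);                  /* size - 2 is used as a modulus */
    size_t start = hash % size;
    size_t step = hash % (size - 2);   /* assumed secondary hash; may be 0 */
    size_t idx = start;

    while (used[idx]) {
        if (step == 0)
            return -1;                 /* would stay on the same slot forever */
        idx = (idx + step) % size;
        if (idx == start)
            return -1;                 /* completed a full cycle */
    }
    return (long)idx;
}

/* Try to place every hash in a fresh table of `size` slots.
   Returns 1 on success, 0 if any probe fails. */
static int try_fill(const uint32_t *hashes, size_t n, size_t size)
{
    int *used = calloc(size, sizeof *used);
    if (used == NULL)
        return 0;

    int ok = 1;
    for (size_t i = 0; i < n && ok; i++) {
        long slot = probe_free_slot(used, size, hashes[i]);
        if (slot < 0)
            ok = 0;
        else
            used[slot] = 1;
    }
    free(used);
    return ok;
}

/* Enlarge the table one slot at a time until all hashes fit.
   Returns the first size that works, or 0 if none up to max_size does. */
static size_t grow_until_fits(const uint32_t *hashes, size_t n,
                              size_t size, size_t max_size)
{
    for (; size <= max_size; size++)
        if (try_fill(hashes, n, size))
            return size;
    return 0;
}
```

With hashes 5 and 15 and a table of 5 slots, the second element collides at
slot 0 and gets a step of 0, so the table is grown to 6 slots, where both
fit. Adding 1 to the step would avoid the zero case without growing the
table, but, as noted above, every reader of the table would then have to use
the same formula.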
* [Bug localedata/368] localedef fails with complex LC_COLLATE rules
From: barbier at linuxfr dot org @ 2005-01-17 21:38 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From barbier at linuxfr dot org 2005-01-17 21:38 -------
As this patch only changes the multi-byte sequence, we can check
whether wide-char and multi-byte collations give the same results,
in which case this patch is certainly right.
I created a file containing sequences of 2 Tibetan characters:
$ for i in `seq 0x0F00 0x0FCF`; do
    for j in `seq 0x0F00 0x0FCF`; do
      printf "0: %08x %08x 0000000a " $i $j | xxd -r -g4
    done
  done | iconv -f ucs4 -t utf8 > input_file
Then ran
$ LC_ALL=en_US.UTF-8 ./tst-wcscoll < input_file > out.wc-en_US
$ LC_ALL=en_US.UTF-8 ./tst-strcoll < input_file > out.mb-en_US
$ cmp out.wc-en_US out.mb-en_US
$
So the results are identical. But to show that this patch allows
more than 256 collating elements, we need to check with more complex
LC_COLLATE sections. I took Pablo's locale file, s/^%%%%</</ to have
more than 256 collating elements, and re-ran this test:
$ export LOCPATH=`mktemp -d /tmp/test.XXXXXX`
$ localedef.patched -i dz_BT -f UTF-8 $LOCPATH/dz_BT
$ LC_ALL=dz_BT ./tst-wcscoll < input_file > out.wc-dz_BT
$ LC_ALL=dz_BT ./tst-strcoll < input_file > out.mb-dz_BT
$ cmp out.wc-dz_BT out.mb-dz_BT
$
Looks good.
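
The idea behind the comparison above can be sketched in C as follows. This
is not the attached tst-strcoll or tst-wcscoll source, which is not shown in
this thread; the helper names and the 64-element buffers are assumptions
chosen for illustration. The function checks, for one pair of multi-byte
strings, that strcoll and wcscoll order them the same way in the current
locale.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Reduce a comparison result to -1, 0 or 1. */
static int sign(int v)
{
    return (v > 0) - (v < 0);
}

/* Hypothetical helper: returns 1 if strcoll() and wcscoll() agree on the
   relative order of a and b in the current locale, 0 if they disagree,
   and -1 if a string cannot be converted to wide characters. */
static int collation_agrees(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    wchar_t wa[64], wb[64];
    size_t na = mbstowcs(wa, a, 63);   /* at most 63 wide chars written */
    size_t nb = mbstowcs(wb, b, 63);
    if (na == (size_t)-1 || nb == (size_t)-1)
        return -1;
    wa[na] = L'\0';                    /* terminate in case input was truncated */
    wb[nb] = L'\0';

    return sign(strcoll(a, b)) == sign(wcscoll(wa, wb));
}
```

A caller would first run setlocale(LC_ALL, "") so that LC_ALL and LOCPATH
from the environment take effect, then call collation_agrees on each pair
of lines from the input file.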
Note that tst-strcoll is much slower than tst-wcscoll, which seems
quite logical since the primary key is the first UTF-8 byte and does
not change in the range 0x0F00-0x0FCF.
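
The claim about the first UTF-8 byte can be checked with a short sketch.
The function names are hypothetical, and the code only shows that the lead
byte is constant over the tested range; it does not measure or explain the
speed difference itself.

```c
#include <assert.h>
#include <stdint.h>

/* First byte of the UTF-8 encoding of code point cp (cp <= 0x10FFFF). */
static unsigned char utf8_lead_byte(uint32_t cp)
{
    if (cp < 0x80)
        return (unsigned char)cp;                   /* 1-byte sequence */
    if (cp < 0x800)
        return (unsigned char)(0xC0 | (cp >> 6));   /* 2-byte sequence */
    if (cp < 0x10000)
        return (unsigned char)(0xE0 | (cp >> 12));  /* 3-byte sequence */
    return (unsigned char)(0xF0 | (cp >> 18));      /* 4-byte sequence */
}

/* Returns 1 if every code point in [lo, hi] has the same lead byte. */
static int same_lead_byte(uint32_t lo, uint32_t hi)
{
    assert(lo <= hi);
    unsigned char first = utf8_lead_byte(lo);
    for (uint32_t cp = lo; cp <= hi; cp++)
        if (utf8_lead_byte(cp) != first)
            return 0;
    return 1;
}
```

For the range used in the test, same_lead_byte(0x0F00, 0x0FCF) returns 1,
and the shared lead byte is 0xE0.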
* [Bug localedata/368] localedef fails with complex LC_COLLATE rules
From: barbier at linuxfr dot org @ 2005-01-17 22:33 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From barbier at linuxfr dot org 2005-01-17 22:33 -------
Created an attachment (id=372)
--> (http://sources.redhat.com/bugzilla/attachment.cgi?id=372&action=view)
C source file for the tst-strcoll program
This program can only process files composed of lines of 2 UTF-8
characters; some modifications are needed to accept arbitrary input.
* [Bug localedata/368] localedef fails with complex LC_COLLATE rules
From: barbier at linuxfr dot org @ 2005-01-17 22:33 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From barbier at linuxfr dot org 2005-01-17 22:33 -------
Created an attachment (id=373)
--> (http://sources.redhat.com/bugzilla/attachment.cgi?id=373&action=view)
C source file for the tst-wcscoll program
This program can only process files composed of lines of 2 UTF-8
characters; some modifications are needed to accept arbitrary input.
* [Bug localedata/368] localedef fails with complex LC_COLLATE rules
From: cfynn at gmx dot net @ 2005-08-03 10:54 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From cfynn at gmx dot net 2005-08-03 10:54 -------
localedef *still* handles only 256 collating-element definitions.
Culturally correct collation (standard dictionary order) of languages like
Dzongkha (dz_BT) and Tibetan (bo_CN) *requires* over 350 elements in LC_COLLATE.
* [Bug localedata/368] localedef fails with complex LC_COLLATE rules
From: cfynn at gmx dot net @ 2005-08-03 11:37 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From cfynn at gmx dot net 2005-08-03 11:37 -------
Created an attachment (id=567)
--> (http://sources.redhat.com/bugzilla/attachment.cgi?id=567&action=view)
dz_BT Collation - generated automatically from CLDR
* [Bug localedata/368] localedef fails with complex LC_COLLATE rules
From: drepper at redhat dot com @ 2005-10-14 21:11 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From drepper at redhat dot com 2005-10-14 21:11 -------
*** Bug 307 has been marked as a duplicate of this bug. ***
--
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |cfynn at gmx dot net
* [Bug localedata/368] localedef fails with complex LC_COLLATE rules
From: drepper at redhat dot com @ 2005-10-14 22:56 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From drepper at redhat dot com 2005-10-14 22:56 -------
The ld-collate patch is wrong. I fixed it myself.
I checked in the first locale. The second one is completely useless. If there
are bugs in the file in CVS, file a new bug and justify the change.
As for the test programs: they work just fine the way they are.
* [Bug localedata/368] localedef fails with complex LC_COLLATE rules
From: drepper at redhat dot com @ 2005-10-14 22:57 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From drepper at redhat dot com 2005-10-14 22:57 -------
--
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |drepper at redhat dot com
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED