public inbox for glibc-bugs@sourceware.org
* [Bug libc/31658] New: request for Non-Ignorable Variable Weighting Unicode collation
@ 2024-04-19 17:13 pertusus at free dot fr
From: pertusus at free dot fr @ 2024-04-19 17:13 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=31658

            Bug ID: 31658
           Summary: request for Non-Ignorable Variable Weighting Unicode
                    collation
           Product: glibc
           Version: unspecified
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P2
         Component: libc
          Assignee: unassigned at sourceware dot org
          Reporter: pertusus at free dot fr
                CC: drepper.fsp at gmail dot com
  Target Milestone: ---

For index sorting, the Non-Ignorable Variable Weighting option for Unicode
collation (http://www.unicode.org/reports/tr10/#Variable_Weighting) looks
better, as it makes spaces and punctuation marks sort before letters. In tests
with strxfrm_l, this is not the ordering obtained. It would be nice to have a
way to select Non-Ignorable Variable Weighting, and even better to be able to
select a specific option among the four Variable Weighting options proposed in
tr10.
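
To make the kind of test mentioned above concrete, here is a minimal sketch of
comparing two strings through strxfrm_l sort keys in an explicitly named
locale. The helper names make_sort_key and compare_in_locale are hypothetical
and are not taken from the Texinfo sources; only newlocale, strxfrm_l and
freelocale are real POSIX calls.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for newlocale, strxfrm_l, freelocale */
#endif
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: build the strxfrm_l sort key of s in loc.
   Returns a malloc'ed string, or NULL on allocation failure. */
static char *make_sort_key(const char *s, locale_t loc)
{
    /* With a size of 0, strxfrm_l only reports the needed length
       (not counting the terminating NUL). */
    size_t len = strxfrm_l(NULL, s, 0, loc);
    char *key = malloc(len + 1);
    if (key != NULL)
        strxfrm_l(key, s, len + 1, loc);
    return key;
}

/* Hypothetical helper: compare a and b under the LC_COLLATE rules of
   locale_name. Stores <0, 0 or >0 in *result and returns 0 on success;
   returns -1 if the locale is unavailable or memory runs out. */
static int compare_in_locale(const char *locale_name, const char *a,
                             const char *b, int *result)
{
    locale_t loc = newlocale(LC_COLLATE_MASK, locale_name, (locale_t)0);
    if (loc == (locale_t)0)
        return -1;

    char *ka = make_sort_key(a, loc);
    char *kb = make_sort_key(b, loc);
    int status = -1;
    if (ka != NULL && kb != NULL) {
        /* Sort keys are compared bytewise, like strcoll_l would order them. */
        *result = strcmp(ka, kb);
        status = 0;
    }
    free(ka);
    free(kb);
    freelocale(loc);
    return status;
}
```

Calling compare_in_locale with a UTF-8 locale name (for example "en_US.UTF-8",
assuming it is installed) on "a b" and "ab" is one way to check whether the
space takes part in the comparison at the first level. With Non-Ignorable
Variable Weighting, "a b" would be expected to sort before "ab".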

Some more information on the context of the request, which does not add
information on the bug itself. This request comes from the GNU Texinfo
texi2any program, in which indices are sorted for some output formats (Info,
HTML). We currently use the Unicode::Collate Perl module, with the 'variable'
option set to 'Non-Ignorable' (we also have the option to sort according to
Unicode codepoints). In the development version, we can also use
Unicode::Collate::Locale for linguistic tailoring. Also in the development
version, we translated some Perl code to C, including the index sorting code.
In the default case, we call Perl code from C to get the sort keys. There is a
development-only option which uses strxfrm_l with a locale specified on the
command line, to test the result obtained with strxfrm_l (in general we use
the language of the document or an explicitly specified locale, not the
current locale). The collation obtained with strxfrm_l is good for sorting
words and allows for linguistic tailoring, but without Non-Ignorable Variable
Weighting it gives a result that is not really usable, hence we stick to
calling Perl for now. We translate to C for performance reasons, to be faster
than Perl-only code.
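
As an illustration of the sort-key approach described above, the following
sketch sorts an array of index entry texts by strxfrm_l keys computed once per
entry in a named locale, independently of the current locale. It is only an
assumption about how such code could look, not the texi2any implementation;
struct index_entry, compare_entries and sort_index_texts are hypothetical
names.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for newlocale, strxfrm_l, freelocale */
#endif
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical index entry: the entry text plus its cached sort key. */
struct index_entry {
    const char *text;
    char *key;
};

/* qsort callback: order entries by bytewise comparison of their keys. */
static int compare_entries(const void *pa, const void *pb)
{
    const struct index_entry *a = pa;
    const struct index_entry *b = pb;
    return strcmp(a->key, b->key);
}

/* Sort texts[0..n-1] in place following the LC_COLLATE rules of
   locale_name. Returns 0 on success; returns -1 and leaves texts
   unchanged if the locale is unavailable or memory runs out. */
static int sort_index_texts(const char **texts, size_t n,
                            const char *locale_name)
{
    locale_t loc = newlocale(LC_COLLATE_MASK, locale_name, (locale_t)0);
    if (loc == (locale_t)0)
        return -1;

    /* calloc zeroes the array, so unfilled keys are NULL and safe to free. */
    struct index_entry *entries = calloc(n ? n : 1, sizeof *entries);
    int status = entries != NULL ? 0 : -1;

    for (size_t i = 0; status == 0 && i < n; i++) {
        size_t len = strxfrm_l(NULL, texts[i], 0, loc);
        entries[i].text = texts[i];
        entries[i].key = malloc(len + 1);
        if (entries[i].key == NULL)
            status = -1;
        else
            strxfrm_l(entries[i].key, texts[i], len + 1, loc);
    }

    if (status == 0) {
        qsort(entries, n, sizeof *entries, compare_entries);
        for (size_t i = 0; i < n; i++)
            texts[i] = entries[i].text;
    }

    if (entries != NULL)
        for (size_t i = 0; i < n; i++)
            free(entries[i].key);
    free(entries);
    freelocale(loc);
    return status;
}
```

Computing each key once and comparing keys with strcmp avoids repeating the
collation work on every comparison, which is the usual reason to prefer
strxfrm_l over strcoll_l when sorting many entries.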

The main thread on this topic in the bug-texinfo mailing list (probably not
very interesting) is at:
https://lists.gnu.org/archive/html/bug-texinfo/2024-01/msg00053.html

-- 
You are receiving this mail because:
You are on the CC list for the bug.
