public inbox for libc-alpha@sourceware.org
* [PATCH 0/2] Initial C.UTF-8 support.
@ 2020-06-29  4:07 Carlos O'Donell
  2020-06-29  4:08 ` [PATCH 1/2] LC_COLLATE: Fix handling of last character in ellipsis, (Bug 22668) Carlos O'Donell
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Carlos O'Donell @ 2020-06-29  4:07 UTC
  To: libc-alpha

The initial C.UTF-8 support is something I've been working on for a
long time now, and I can see why nobody has tried to get this working
well in the past.

There was at least one ellipsis bug (bug 22668) which needed fixing first
and was immediately obvious when you go through the code. It is
surprising that we don't use more ellipses, but like all things, if
it doesn't work then the authors just change the source file format
to work around the issue.  In this case I don't want to work around
the issue because it makes the C locale source file ungainly (requires
listing all collation elements one-by-one).

Next I had to learn that the POSIX collation ellipsis rules don't
work for UTF-8 and that we have no error reporting for such corruption.
Rather than have to break up the UTF-8 charmap into 64-symbol chunks
to avoid the POSIX rules breaking the UTF-8 output, I just added special
processing for UTF-8 input. That is to say, if the charmap is called
"UTF-8" then special code for generating the multi-byte sequences is
used. This special code can generate the output quickly and efficiently,
and compiling the locale is very fast. I will in the future add a special
warning pass here to cross check between gconv and the data in the charmap
since such a cross-check would have revealed the problem. Then when we
do our own testing we can run such cross checks.
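
To illustrate why a byte-increment ellipsis rule and UTF-8 disagree, here
is a minimal Python sketch. It is not the code from the patch. The
function names are hypothetical, and naive_ellipsis() only models the
POSIX rule under the assumption that it amounts to adding the offset to
the encoded value of the first symbol in the range.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 by hand."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def naive_ellipsis(first: bytes, offset: int) -> bytes:
    """Assumed model of a byte-increment ellipsis: add `offset` to the
    encoding of the first symbol, treated as a big-endian integer."""
    value = int.from_bytes(first, "big") + offset
    return value.to_bytes(len(first), "big")


def is_valid_utf8(data: bytes) -> bool:
    """Report whether `data` decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Starting from U+3400 (/xe3/x90/x80), offsets 0 through 63 stay inside the
continuation-byte range 0x80..0xBF and match utf8_encode(). Offset 64
yields /xe3/x90/xc0, which is not valid UTF-8; the correct encoding of
U+3440 is /xe3/x91/x80. That is the reason the charmap would otherwise
need to be split into 64-symbol chunks.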

In the end we succeed in implementing C.UTF-8, but it's 28MiB in size and
there is no easy way to short-circuit the weights table. What we need is
to remove the collation weights and instead just use strcmp internally,
since UTF-8 was designed so that byte-wise comparison of two strings
gives the same order as comparing their code points.
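
The property that makes strcmp sufficient can be checked with a short
Python sketch. This only demonstrates the ordering property; it is not
how glibc implements collation. The key function names are hypothetical,
and Python's bytes comparison stands in for strcmp (both compare unsigned
bytes left to right, assuming no embedded NUL).

```python
def codepoint_key(s: str) -> list:
    """Sort key: the sequence of Unicode code points in the string."""
    return [ord(c) for c in s]


def strcmp_key(s: str) -> bytes:
    """Sort key: the raw UTF-8 bytes, compared as unsigned values
    the way strcmp would compare them."""
    return s.encode("utf-8")


def same_order(strings) -> bool:
    """True if sorting by code point and sorting by UTF-8 bytes agree."""
    return sorted(strings, key=codepoint_key) == sorted(strings, key=strcmp_key)
```

For example, same_order(["zebra", "éclair", "apple", "日本", "😀", "Zulu"])
is true: both keys place the strings in the same sequence, so a table of
per-code-point weights adds no information for a strict code point
collation.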

I would still like to commit C.UTF-8, without adding it to SUPPORTED,
and without adding it to test-input. I want to do this to make incremental
progress and allow other developers the opportunity to work on some of
the changes.

The first patch in the series fixes the ellipsis range handling.

The second patch implements C.UTF-8.

The next steps after inclusion would be:
- Enable sort-test for C.UTF-8 by parallelizing the testing to hide the
  ~5-7 minutes of testing required for C.UTF-8 full code point sorting.
- Add code to remove C.UTF-8 collation weights and just use strcmp instead.
- Enable C.UTF-8 in SUPPORTED.
- Add warning pass to collation support to look for corrupt output.
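
As a rough idea of what a cross-check between the charmap and a reference
converter could look like, here is a Python sketch. It is an assumption
about the shape of such a pass, not the planned implementation:
check_charmap_lines() is a hypothetical name, Python's own UTF-8 codec
stands in for gconv, and only single-symbol entries of the form
"<UXXXX> /xHH/xHH..." are handled (ellipsis ranges are skipped).

```python
import re

# Matches a single-symbol charmap entry such as "<U0041>  /x41  NAME".
# Range entries ("<U3400>..<U343F> ...") do not match and are skipped.
ENTRY = re.compile(r"^<U([0-9A-Fa-f]{4,8})>\s+((?:/x[0-9A-Fa-f]{2})+)")


def check_charmap_lines(lines):
    """Return a list of (code_point, found_bytes, expected_bytes) for
    every single-symbol entry whose bytes differ from the reference
    UTF-8 encoding. expected_bytes is None when the code point has no
    valid UTF-8 encoding (surrogates, values above U+10FFFF)."""
    problems = []
    for line in lines:
        m = ENTRY.match(line)
        if m is None:
            continue  # comments, headers, ranges: not handled here
        cp = int(m.group(1), 16)
        found = bytes(int(h, 16) for h in m.group(2).split("/x")[1:])
        try:
            expected = chr(cp).encode("utf-8")
        except (ValueError, UnicodeEncodeError):
            expected = None
        if found != expected:
            problems.append((cp, found, expected))
    return problems
```

Feeding it an entry like "<U3440> /xe3/x90/xc0" would report a mismatch
against the expected /xe3/x91/x80, which is the kind of corruption a
warning pass is meant to catch.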

-- 
Cheers,
Carlos.




Thread overview: 9+ messages
2020-06-29  4:07 [PATCH 0/2] Initial C.UTF-8 support Carlos O'Donell
2020-06-29  4:08 ` [PATCH 1/2] LC_COLLATE: Fix handling of last character in ellipsis, (Bug 22668) Carlos O'Donell
2020-06-29  8:13   ` Florian Weimer
2020-06-29 19:42     ` Carlos O'Donell
2020-06-29  4:22 ` [PATCH 2/2] Add new C.UTF-8 locale (Bug 17318) Carlos O'Donell
2020-06-29  7:54   ` Andreas Schwab
2020-06-29  9:42   ` Florian Weimer
2020-06-29 19:47     ` Carlos O'Donell
2020-06-29 22:50 ` [PATCH 0/2] Initial C.UTF-8 support Joseph Myers
