public inbox for libc-alpha@sourceware.org
From: Carlos O'Donell <carlos@redhat.com>
To: libc-alpha <libc-alpha@sourceware.org>
Subject: [PATCH 0/2] Initial C.UTF-8 support.
Date: Mon, 29 Jun 2020 00:07:06 -0400
Message-ID: <75d21bd8-2698-2e25-969c-4e086c90abd9@redhat.com>

The initial C.UTF-8 support is something I've been working on for a
long time now, and I can see why nobody has tried to get this working
well in the past.

There was at least one ellipsis bug (bug 22668) that needed fixing first,
and it was immediately obvious once you go through the code. It is
surprising that we don't use ellipses more, but as with all things, if
a feature doesn't work then authors just change the source file format
to work around the issue.  In this case I don't want to work around
the issue because that makes the C locale source file ungainly (it
requires listing all collation elements one by one).

Next I had to learn that the POSIX collation ellipsis rules don't
work for UTF-8 and that we have no error reporting for such corruption.
Rather than have to break up the UTF-8 charmap into 64-symbol chunks
to avoid the POSIX rules breaking the UTF-8 output, I just added special
processing for UTF-8 input. That is to say, if the charmap is called
"UTF-8" then special code for generating the multi-byte sequences is
used. This special code can generate the output quickly and efficiently,
and compiling the locale is very fast. I will in the future add a special
warning pass here to cross check between gconv and the data in the charmap
since such a cross-check would have revealed the problem. Then when we
do our own testing we can run such cross checks.
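
To illustrate the 64-symbol problem, here is a minimal Python sketch
(not the code from the patch). It assumes that a byte-wise range
expansion simply increments the final byte of the encoding; the helper
names utf8_encode and naive_increment are hypothetical. A UTF-8
continuation byte carries 6 bits of payload (0x80..0xBF), so the final
byte wraps every 64 code points and the preceding byte must change
with it.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def naive_increment(start: bytes, steps: int) -> bytes:
    """Add `steps` to the final byte only (assumed byte-wise expansion)."""
    return start[:-1] + bytes([start[-1] + steps])


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+00BF is the last code point sharing the lead byte 0xC2.
last_in_chunk = utf8_encode(0xBF)             # C2 BF
correct_next = utf8_encode(0xC0)              # C3 80
naive_next = naive_increment(last_in_chunk, 1)  # C2 C0
print(correct_next.hex(), naive_next.hex(), is_valid_utf8(naive_next))
```

Computing each sequence from the code point, as utf8_encode does,
avoids the wrap-around entirely, which is the general idea behind
special-casing the "UTF-8" charmap.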

In the end we succeed in implementing C.UTF-8, but it's 28MiB in size and
there is no easy way to short-circuit the weights table. What we need is
to remove the collation weights and instead just use strcmp internally,
since UTF-8 was designed so that byte-wise comparison of encoded strings
yields code point order.
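
The property that makes strcmp sufficient can be checked with a small
Python sketch. This is only an illustration, not glibc code: the sample
strings are arbitrary, and Python's bytes comparison stands in for
strcmp, which also compares bytes as unsigned values.

```python
# Arbitrary sample mixing 1-, 2-, 3- and 4-byte UTF-8 sequences.
words = ["z", "\u00e9", "a", "\u20ac", "\U0001F600", "ab", ""]

# Order by sequences of code points (what C.UTF-8 collation should give).
by_code_point = sorted(words, key=lambda s: [ord(c) for c in s])

# Order by the raw UTF-8 bytes, compared as unsigned values.
by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

print(by_code_point == by_bytes)
```

Because the two orderings agree, a weights table that maps every code
point to its own value adds size without changing the result.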

I would still like to commit C.UTF-8, without adding it to SUPPORTED,
and without adding it to test-input. I want to do this to make incremental
progress and allow other developers the opportunity to work on some of
the changes.

The first patch in the series fixes the ellipsis range handling.

The second patch implements C.UTF-8.

The next steps after inclusion would be:
- Enable sort-test for C.UTF-8 by parallelizing the testing to hide the
  ~5-7 minutes of testing required for C.UTF-8 full code point sorting.
- Add code to remove C.UTF-8 collation weights and just use strcmp instead.
- Enable C.UTF-8 in SUPPORTED.
- Add warning pass to collation support to look for corrupt output.

-- 
Cheers,
Carlos.



Thread overview: 9+ messages
2020-06-29  4:07 Carlos O'Donell [this message]
2020-06-29  4:08 ` [PATCH 1/2] LC_COLLATE: Fix handling of last character in ellipsis, (Bug 22668) Carlos O'Donell
2020-06-29  8:13   ` Florian Weimer
2020-06-29 19:42     ` Carlos O'Donell
2020-06-29  4:22 ` [PATCH 2/2] Add new C.UTF-8 locale (Bug 17318) Carlos O'Donell
2020-06-29  7:54   ` Andreas Schwab
2020-06-29  9:42   ` Florian Weimer
2020-06-29 19:47     ` Carlos O'Donell
2020-06-29 22:50 ` [PATCH 0/2] Initial C.UTF-8 support Joseph Myers