From: Carlos O'Donell <carlos@redhat.com>
To: Florian Weimer <fweimer@redhat.com>,
Carlos O'Donell via Libc-alpha <libc-alpha@sourceware.org>
Subject: Re: [PATCH 2/2] Add new C.UTF-8 locale (Bug 17318)
Date: Mon, 29 Jun 2020 15:47:02 -0400
Message-ID: <010a6804-eabd-d818-4200-a4b63fdf7f19@redhat.com>
In-Reply-To: <877dvq1a3u.fsf@oldenburg2.str.redhat.com>
On 6/29/20 5:42 AM, Florian Weimer wrote:
> * Carlos O'Donell via Libc-alpha:
>
>> diff --git a/locale/programs/charmap.c b/locale/programs/charmap.c
>> index c23e50944f..d89d788a9b 100644
>> --- a/locale/programs/charmap.c
>> +++ b/locale/programs/charmap.c
>> @@ -49,7 +49,7 @@ static void new_width (struct linereader *cmfile, struct charmap_t *result,
>
>> @@ -285,6 +285,27 @@ parse_charmap (struct linereader *cmfile, int verbose, int be_quiet)
>> enum token_t ellipsis = 0;
>> int step = 1;
>>
>> + /* POSIX explicitly requires that ellipsis processing do the
>> + following: "Bytes shall be treated as unsigned octets, and carry
>> + shall be propagated between the bytes as necessary to represent the
>> + range." It then goes on to say that such a declaration should
>> + never be specified because it creates NULL bytes. Therefore we
>> + error on this condition (see charmap_new_char). However this still
>> + leaves a problem for encodings which use less than the full 8-bits,
>> + like UTF-8, and in such encodings you can use an ellipsis to
>> + silently and accidentally create invalid ranges. In UTF-8 you have
>> + only the first 6-bits of the first byte and if your ellipsis covers
>> + a code point range larger than this 64 code point block the output
>> + is going to be an invalid non-UTF-8 multi-byte sequence. Thus for
>> + UTF-8 we add a speical ellipsis handling loop that can increment
>> + UTF-8 multi-byte output effectively and for UTF-8 we allow larger
>> + ellipsis ranges without error. There may still be other encodings
>> + for which the ellipsis will still generate invalid multi-byte
>> + output, but not for UTF-8. The only alternative would be to call
>> + gconv for each Unicode code point in the loop to convert it to the
>> + appropriate multi-byte output, but that would be slow. */
>
> Typo: speical
>
>
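To make the failure mode in the quoted comment concrete, here is a
minimal sketch of the POSIX-style byte increment with carry. The names
increment_bytes and is_continuation are hypothetical and are not taken
from the patch; this only illustrates why stepping the bytes of a UTF-8
sequence past a 64 code point block stops producing valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration, not code from the patch.  POSIX-style
   ellipsis step: treat BYTES as unsigned octets and add one,
   propagating carry from the last byte toward the first.  */
static void
increment_bytes (unsigned char *bytes, size_t n)
{
  assert (bytes != NULL);
  while (n-- > 0)
    if (++bytes[n] != 0)
      break;			/* No carry out of this byte, so stop.  */
}

/* Return 1 if BYTE has the UTF-8 continuation form 10xxxxxx.  */
static int
is_continuation (unsigned char byte)
{
  return (byte & 0xC0) == 0x80;
}
```

Starting from C2 BF (U+00BF, the last code point of its block), one call
to increment_bytes yields C2 C0. The trailing byte no longer has the
10xxxxxx form, whereas the correct encoding of U+00C0 is C3 80.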
>> @@ -1039,11 +1134,52 @@ hexadecimal range format should use only capital characters"));
>> for (cnt = from_nr; cnt <= to_nr; cnt += step)
>> {
>> char *name_end;
>> + unsigned char ubytes[4] = { '\0', '\0', '\0', '\0' };
>> obstack_printf (ob, decimal_ellipsis ? "%.*s%0*d" : "%.*s%0*X",
>> prefix_len, from, len1 - prefix_len, cnt);
>> obstack_1grow (ob, '\0');
>> name_end = obstack_finish (ob);
>>
>> + /* Either we have a UTF-8 charmap, and we compute the bytes (see comment
>> + above), or we have a non-UTF-8 charmap and we follow POSIX rules as
>> + further below for incrementing the bytes in an ellipsis. */
>> + if (is_utf8)
>> + {
>> + int nubytes;
>> +
>> + /* Direclty convert the code point to the UTF-8 encoded bytes. */
>> + nubytes = output_utf8_bytes (cnt, 4, ubytes);
>
> Typo: Direclty
>
> There are some overlong lines here, please fix.
>
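For reference, converting a code point directly to UTF-8 bytes can be
sketched as below. This is not the patch's output_utf8_bytes, whose body
is not shown in this message; encode_utf8 is a hypothetical name with a
different signature, and rejecting surrogates is an assumption of this
sketch rather than something the patch is known to do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the patch's output_utf8_bytes.  Encode code
   point CP into BUF, which must have room for 4 bytes.  Return the
   number of bytes written, or 0 if CP is a surrogate or lies above
   U+10FFFF (an assumption of this sketch).  */
static int
encode_utf8 (uint32_t cp, unsigned char *buf)
{
  assert (buf != NULL);
  if (cp < 0x80)
    {
      buf[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      buf[0] = (unsigned char) (0xC0 | (cp >> 6));
      buf[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      if (cp >= 0xD800 && cp <= 0xDFFF)
	return 0;		/* Surrogates are not encodable.  */
      buf[0] = (unsigned char) (0xE0 | (cp >> 12));
      buf[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      buf[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp <= 0x10FFFF)
    {
      buf[0] = (unsigned char) (0xF0 | (cp >> 18));
      buf[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      buf[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      buf[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;
}
```

Because each byte is computed from the code point, every step of an
ellipsis range yields a well-formed sequence regardless of how many
64 code point blocks the range spans.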
>> diff --git a/localedata/C.UTF-8.in b/localedata/C.UTF-8.in
>> new file mode 100644
>> index 0000000000..70ab2bbac7
>> --- /dev/null
>> +++ b/localedata/C.UTF-8.in
>> @@ -0,0 +1,852388 @@
>
> I do not think it's a good idea to check in this file. It's large and
> it's dormant during regular builds.
I accept that. Until we enable C.UTF-8 more broadly we won't be using it.

My worry here is that as soon as we enable this in Debian and Fedora
we'll start getting a working C.UTF-8 that consumes 28MiB installed.

Should we limit collation to ASCII only for C.UTF-8 until we've fixed
the collation table size?

* Submit a C.UTF-8.in with just ASCII in LC_COLLATE.
* Add C.UTF-8 to SUPPORTED.
* Test C.UTF-8.

--
Cheers,
Carlos.
Thread overview: 9+ messages
2020-06-29 4:07 [PATCH 0/2] Initial C.UTF-8 support Carlos O'Donell
2020-06-29 4:08 ` [PATCH 1/2] LC_COLLATE: Fix handling of last character in ellipsis, (Bug 22668) Carlos O'Donell
2020-06-29 8:13 ` Florian Weimer
2020-06-29 19:42 ` Carlos O'Donell
2020-06-29 4:22 ` [PATCH 2/2] Add new C.UTF-8 locale (Bug 17318) Carlos O'Donell
2020-06-29 7:54 ` Andreas Schwab
2020-06-29 9:42 ` Florian Weimer
2020-06-29 19:47 ` Carlos O'Donell [this message]
2020-06-29 22:50 ` [PATCH 0/2] Initial C.UTF-8 support Joseph Myers