Re: [PATCH v3] Add new C.UTF-8 locale (Bug 17318)

public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed

From: Florian Weimer <fweimer@redhat.com>
To: Carlos O'Donell via Libc-alpha <libc-alpha@sourceware.org>
Subject: Re: [PATCH v3] Add new C.UTF-8 locale (Bug 17318)
Date: Thu, 11 Mar 2021 20:05:17 +0100	[thread overview]
Message-ID: <87v99xcxzm.fsf@oldenburg.str.redhat.com> (raw)
In-Reply-To: <20210219032748.564216-1-carlos@redhat.com> (Carlos O'Donell via Libc-alpha's message of "Thu, 18 Feb 2021 22:27:48 -0500")

* Carlos O'Donell via Libc-alpha:

> diff --git a/locale/programs/charmap.c b/locale/programs/charmap.c
> index 3d51e702dc..77085cff72 100644
> --- a/locale/programs/charmap.c
> +++ b/locale/programs/charmap.c
> @@ -49,7 +49,7 @@ static void new_width (struct linereader *cmfile, struct charmap_t *result,

> +  /* POSIX explicitly requires that ellipsis processing do the
> +     following: "Bytes shall be treated as unsigned octets, and carry
> +     shall be propagated between the bytes as necessary to represent the
> +     range."  It then goes on to say that such a declaration should
> +     never be specified because it creates NULL bytes.  Therefore we

NUL or null, I think.

> +     error on this condition (see charmap_new_char).  However this still
> +     leaves a problem for encodings which use less than the full 8-bits,
> +     like UTF-8, and in such encodings you can use an ellipsis to
> +     silently and accidentally create invalid ranges.  In UTF-8 you have
> +     only the first 6-bits of the first byte and if your ellipsis covers

UTF-8 is variable length even in the leader byte, so “only the first
6-bits of the first byte” seems wrong.

> +/* This function takes the Unicode code point CP and encodes it into
> +   a UTF-8 byte stream that must be NBYTES long and is stored into
> +   the unsigned character array at BYTES.
> +
> +   If CP requires more than NBYTES to be encoded then we return an
> +   error of -1.
> +
> +   If CP is not within any of the valid Unicode code point ranges
> +   then we return an error of -2.
> +
> +   Otherwise we return the number of bytes encoded.  */
> +static int
> +output_utf8_bytes (unsigned int cp, size_t nbytes, unsigned char *bytes)
> +{
> +  /* We need at least 1 byte.  */
> +  if (nbytes < 1)
> +    return -1;
> +
> +  /* One byte range.  */
> +  if (cp >= 0x0 && cp <= 0x7f)
> +    {
> +      bytes[0] = cp & 0x7f;
> +      return 1;
> +    }

0x7f is superfluous and confusing here, as discussed before.

> diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
> index 8cce47cd97..c70d359744 100644
> --- a/localedata/charmaps/UTF-8
> +++ b/localedata/charmaps/UTF-8
> @@ -895,12 +895,14 @@ CHARMAP

> +<UD800>..<UDB7F> /xed/xa0/x80 <Non Private Use High Surrogate>
> +<UDB80>..<UDBFF> /xed/xae/x80 <Private Use High Surrogate>
> +<UDC00>..<UDFFF> /xed/xb0/x80 <Low Surrogate>
> +<UE000>..<UF8FF> /xee/x80/x80 <Private Use>

Technically this isn't right.  We don't want mappings for those
characters because it might introduce in other locale files that use
those characters.  But may be just need to be careful.

I'm surprised that this doesn't lead to testsuite failures because it's
inconsistent with the gconv converters.  Maybe we don't use this
anywhere?

The other invalid-ish Unicode codepoints (U+FFFE, U+FFFF) are actually
valid UTF-8 and handled by gconv, so including them seems okay.

> diff --git a/localedata/locales/C b/localedata/locales/C
> new file mode 100644
> index 0000000000..418e7c90a5
> --- /dev/null
> +++ b/localedata/locales/C
> @@ -0,0 +1,192 @@

> +% One rule, sort forward, for all code points to give code point
> +% order sorting for Unicode.
> +LC_COLLATE
> +order_start forward
> +<U00000000>
> +..
> +<U0000007F>
> +<U00000080>
> +..
> +<U000007FF>
> +<U00000800>
> +..
> +<U0000FFFF>
> +<U00010000>
> +..
> +<U0010FFFF>
> +UNDEFINED
> +order_end
> +END LC_COLLATE

Why are multiple ranges required here?

> diff --git a/localedata/locales/i18n_ctype b/localedata/locales/i18n_ctype
> index c63e0790fc..c92bb95148 100644
> --- a/localedata/locales/i18n_ctype
> +++ b/localedata/locales/i18n_ctype
> @@ -26,7 +26,7 @@ fax       ""
>  language  ""
>  territory "Earth"
>  revision  "13.0.0"
> -date      "2020-06-25"
> +date      "2021-02-17"
>  category  "i18n:2012";LC_CTYPE
>  END LC_IDENTIFICATION

Those date changes seem spurious.  Is this no-op file regeneration
really needed?

> diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
> index 899840923a..42fc5efcb9 100755
> --- a/localedata/unicode-gen/utf8_gen.py
> +++ b/localedata/unicode-gen/utf8_gen.py

>  def convert_to_hex(code_point):
>      '''Converts a code point to a hexadecimal UTF-8 representation
> +    like /x**/x**/x** without using any python library functions.
> +    This avoids problems with the encode function, including an
> +    inability to output the surrogate code points.

You can use chr(code_point).encode('UTF-8', 'surrogatepass') and the
Python encoder.

I reviewed the other changes and spot-checked the generated charmap.
Those parts look okay.

The question is whether we actually need a UTF-8 charmap.  If not, we
can teach charmap.c to generate the UTF-8 data on the fly in
charmap_find_value and charmap_find_symbol.  But I consider this part of
the data representation changes we discussed earlier.

Thanks,
Florian

next prev parent reply	other threads:[~2021-03-11 19:05 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-19  3:27 Carlos O'Donell
2021-02-19  3:34 ` Carlos O'Donell
2021-02-19  8:47 ` Andreas Schwab
2021-03-11 19:05 ` Florian Weimer [this message]
2021-03-11 19:19   ` Florian Weimer
2021-03-11 22:55     ` Carlos O'Donell
2021-04-28 11:54   ` Carlos O'Donell
2021-04-28 17:36     ` Joseph Myers
2021-04-29 11:43       ` Carlos O'Donell
2021-04-29 11:57       ` Carlos O'Donell
  -- strict thread matches above, loose matches on Subject: below --
2021-02-19  3:26 Carlos O'Donell
2021-02-19  3:22 Carlos O'Donell

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87v99xcxzm.fsf@oldenburg.str.redhat.com \
    --to=fweimer@redhat.com \
    --cc=libc-alpha@sourceware.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).