[PATCH 0/2] Initial C.UTF-8 support.

* [PATCH 0/2] Initial C.UTF-8 support.
@ 2020-06-29  4:07 Carlos O'Donell
  2020-06-29  4:08 ` [PATCH 1/2] LC_COLLATE: Fix handling of last character in ellipsis, (Bug 22668) Carlos O'Donell
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Carlos O'Donell @ 2020-06-29  4:07 UTC
  To: libc-alpha

The initial C.UTF-8 support is something I've been working on for a
long time now, and I can see why nobody has tried to get this working
well in the past.

There was at least one ellipsis bug (bug 22668) which needed fixing first
and was immediately obvious when you go through the code. It is
surprising that we don't use more ellipses, but as with all things, if
it doesn't work then the authors just change the source file format
to work around the issue.  In this case I don't want to work around
the issue because it makes the C locale source file ungainly (requires
listing all collation elements one-by-one).

Next I had to learn that the POSIX collation ellipsis rules don't
work for UTF-8 and that we have no error reporting for such corruption.
Rather than have to break up the UTF-8 charmap into 64-symbol chunks
to avoid the POSIX rules breaking the UTF-8 output, I just added special
processing for UTF-8 input. That is to say, if the charmap is called
"UTF-8" then special code for generating the multi-byte sequences is
used. This special code can generate the output quickly and efficiently,
and compiling the locale is very fast. I will in the future add a special
warning pass here to cross check between gconv and the data in the charmap
since such a cross-check would have revealed the problem. Then when we
do our own testing we can run such cross checks.
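
To make the 64-symbol limit concrete, here is a small Python sketch. It is
not code from the patch: posix_ellipsis_next and utf8_bytes are hypothetical
names used only for illustration, and the sketch assumes the POSIX rule
amounts to adding one to the byte sequence with carry between octets. It
compares that against encoding each code point directly, which is roughly
what the special UTF-8 path is meant to do.

```python
def posix_ellipsis_next(seq: bytes) -> bytes:
    """Add one to a byte sequence, treating bytes as unsigned octets
    and propagating the carry leftwards (assumed reading of POSIX)."""
    out = bytearray(seq)
    for i in reversed(range(len(out))):
        if out[i] != 0xFF:
            out[i] += 1
            return bytes(out)
        out[i] = 0x00  # overflow: wrap this octet and carry on
    raise OverflowError("carry out of the first byte")


def utf8_bytes(cp: int) -> bytes:
    """Encode one code point directly as UTF-8, surrogates included
    (hypothetical stand-in for the direct conversion in the patch)."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")


# <U00BF> is C2 BF.  One POSIX-style step gives C2 C0, which is not
# valid UTF-8; the correct encoding of <U00C0> is C3 80.
print(posix_ellipsis_next(b"\xc2\xbf"))  # b'\xc2\xc0'
print(utf8_bytes(0xC0))                  # b'\xc3\x80'
```

Within one 64 code point block the two approaches agree, because only the
low six bits of the last byte change. The first step across a block boundary
is where the byte-increment rule silently produces an invalid sequence.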

In the end we succeed in implementing C.UTF-8, but it's 28MiB in size and
there is no easy way to short-circuit the weights table. What we need is
to remove the collation weights and instead just use strcmp internally
since UTF-8 was designed this way.
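
The property being relied on here is that comparing UTF-8 strings byte by
byte gives the same order as comparing them code point by code point. A
minimal Python sketch of that claim follows; the helper names are made up
for illustration, and bytes comparison stands in for strcmp under the
assumption that the strings contain no embedded NUL.

```python
def strcmp_like(a: bytes, b: bytes) -> int:
    """Return <0, 0 or >0 like C strcmp, comparing unsigned bytes.
    Assumes neither argument contains an embedded NUL byte."""
    return (a > b) - (a < b)


def code_point_key(s: str) -> list:
    """Sort key that orders strings by their code point sequence."""
    return [ord(c) for c in s]


def same_order(strings) -> bool:
    """True if byte-wise UTF-8 order matches code point order."""
    by_code_point = sorted(strings, key=code_point_key)
    by_bytes = sorted(strings, key=lambda s: s.encode("utf-8"))
    return by_code_point == by_bytes


# Samples chosen at the boundaries between 1-, 2-, 3- and 4-byte forms.
SAMPLES = ["a", "ab", "z", "\u007f", "\u0080", "\u07ff", "\u0800",
           "a\u0800", "\uffff", "\U00010000", "\U0010ffff"]

print(same_order(SAMPLES))
```

If this holds for every pair of strings, a per-code-point weights table adds
no information for C.UTF-8 and a plain byte comparison could replace it.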

I would still like to commit C.UTF-8, without adding it to SUPPORTED,
and without adding it to test-input. I want to do this to make incremental
progress and allow other developers the opportunity to work on some of
the changes.

The first patch in the series fixes the ellipsis range handling.

The second patch implements C.UTF-8.

The next steps after inclusion would be:
- Enable sort-test for C.UTF-8 by parallelizing the testing to hide the
  ~5-7 minutes of testing required for C.UTF-8 full code point sorting.
- Add code to remove C.UTF-8 collation weights and just use strcmp instead.
- Enable C.UTF-8 in SUPPORTED.
- Add warning pass to collation support to look for corrupt output.

-- 
Cheers,
Carlos.


* [PATCH 1/2] LC_COLLATE: Fix handling of last character in ellipsis, (Bug 22668)
  2020-06-29  4:07 [PATCH 0/2] Initial C.UTF-8 support Carlos O'Donell
@ 2020-06-29  4:08 ` Carlos O'Donell
  2020-06-29  8:13   ` Florian Weimer
  2020-06-29  4:22 ` [PATCH 2/2] Add new C.UTF-8 locale (Bug 17318) Carlos O'Donell
  2020-06-29 22:50 ` [PATCH 0/2] Initial C.UTF-8 support Joseph Myers
  2 siblings, 1 reply; 9+ messages in thread
From: Carlos O'Donell @ 2020-06-29  4:08 UTC
  To: libc-alpha, Hanataka Shinya

During ellipsis processing the collation cursor was not correctly
moved to the end of the ellipsis after processing.  This meant that
the cursor was left, usually, at the second to last entry.
Subsequent operations end up unlinking the ellipsis end entry or
just leaving it in the list dangling from the end.  This kind of
dangling is immediately visible in C.UTF-8 with the following
sorting from strcoll:
<U0010FFFF>
<U0000FFFF>
<U000007FF>
<U0000007F>
With the cursor correctly adjusted the end entry is correctly given
the right location and thus the right weight.

No regressions on x86_64 and i686.

Co-authored-by: Carlos O'Donell <carlos@redhat.com>
---
 locale/programs/ld-collate.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/locale/programs/ld-collate.c b/locale/programs/ld-collate.c
index feb1a11258..a8ba2f07f0 100644
--- a/locale/programs/ld-collate.c
+++ b/locale/programs/ld-collate.c
@@ -1483,6 +1483,9 @@ order for `%.*s' already defined at %s:%Zu"),
 	    }
 	}
     }
+  /* Move the cursor to the last entry in the ellipsis.
+     Subsequent operations need to start from the last entry.  */
+  collate->cursor = endp;
 }

-- 
2.26.2


* Re: [PATCH 1/2] LC_COLLATE: Fix handling of last character in ellipsis, (Bug 22668)
  2020-06-29  4:08 ` [PATCH 1/2] LC_COLLATE: Fix handling of last character in ellipsis, (Bug 22668) Carlos O'Donell
@ 2020-06-29  8:13   ` Florian Weimer
  2020-06-29 19:42     ` Carlos O'Donell
  0 siblings, 1 reply; 9+ messages in thread
From: Florian Weimer @ 2020-06-29  8:13 UTC
  To: Carlos O'Donell via Libc-alpha; +Cc: Hanataka Shinya, Carlos O'Donell

* Carlos O'Donell via Libc-alpha:

> During ellipsis processing the collation cursor was not correctly
> moved to the end of the ellipsis after processing.  This meant that
> the cursor was left, usually, at the second to last entry.
> Subsequent operations end up unlinking the ellipsis end entry or
> just leaving it in the list dangling from the end.  This kind of
> dangling is immediately visible in C.UTF-8 with the following
> sorting from strcoll:
> <U0010FFFF>
> <U0000FFFF>
> <U000007FF>
> <U0000007F>
> With the cursor correctly adjusted the end entry is correctly given
> the right location and thus the right weight.
>
> No regressions on x86_64 and i686.
>
> Co-authored-by: Carlos O'Donell <carlos@redhat.com>
> ---
>  locale/programs/ld-collate.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/locale/programs/ld-collate.c b/locale/programs/ld-collate.c
> index feb1a11258..a8ba2f07f0 100644
> --- a/locale/programs/ld-collate.c
> +++ b/locale/programs/ld-collate.c
> @@ -1483,6 +1483,9 @@ order for `%.*s' already defined at %s:%Zu"),
>  	    }
>  	}
>      }
> +  /* Move the cursor to the last entry in the ellipsis.
> +     Subsequent operations need to start from the last entry.  */
> +  collate->cursor = endp;
>  }

Can't endp be NULL at this point?

Besides that, why does the change make a difference at all?  There
already is an assignment to collate->cursor in both “Enqueue the new
element” cases.

Thanks,
Florian



* Re: [PATCH 1/2] LC_COLLATE: Fix handling of last character in ellipsis, (Bug 22668)
  2020-06-29  8:13   ` Florian Weimer
@ 2020-06-29 19:42     ` Carlos O'Donell
  0 siblings, 0 replies; 9+ messages in thread
From: Carlos O'Donell @ 2020-06-29 19:42 UTC
  To: Florian Weimer, Carlos O'Donell via Libc-alpha; +Cc: Hanataka Shinya

On 6/29/20 4:13 AM, Florian Weimer wrote:
> * Carlos O'Donell via Libc-alpha:
> 
>> During ellipsis processing the collation cursor was not correctly
>> moved to the end of the ellipsis after processing.  This meant that
>> the cursor was left, usually, at the second to last entry.
>> Subsequent operations end up unlinking the ellipsis end entry or
>> just leaving it in the list dangling from the end.  This kind of
>> dangling is immediately visible in C.UTF-8 with the following
>> sorting from strcoll:
>> <U0010FFFF>
>> <U0000FFFF>
>> <U000007FF>
>> <U0000007F>
>> With the cursor correctly adjusted the end entry is correctly given
>> the right location and thus the right weight.
>>
>> No regressions on x86_64 and i686.
>>
>> Co-authored-by: Carlos O'Donell <carlos@redhat.com>
>> ---
>>  locale/programs/ld-collate.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/locale/programs/ld-collate.c b/locale/programs/ld-collate.c
>> index feb1a11258..a8ba2f07f0 100644
>> --- a/locale/programs/ld-collate.c
>> +++ b/locale/programs/ld-collate.c
>> @@ -1483,6 +1483,9 @@ order for `%.*s' already defined at %s:%Zu"),
>>  	    }
>>  	}
>>      }
>> +  /* Move the cursor to the last entry in the ellipsis.
>> +     Subsequent operations need to start from the last entry.  */
>> +  collate->cursor = endp;
>>  }
> 
> Can't endp be NULL at this point?

Yes, but it's an error condition.

If it's NULL and symstr != NULL, then we are in an error condition
and will trigger the "symbolic range ellipsis must not be directly
followed by `order_end'" error, and we will set the cursor to NULL.

It is common to set the cursor to NULL after an error to prevent any
subsequent unlink_element() from working after the error. In practice
it looks like we'd crash if you called unlink_element() with the cursor
set to NULL, e.g. a NULL dereference in unlink_element(). You shouldn't
be doing that anyway; you should be shutting down.

I could clarify the error condition in a comment?

> Besides that, why does the change make a difference at all?  There
> already is an assignment to collate->cursor in both “Enqueue the new
> element” cases.

When we get called we have the following:

                                [cursor]
... element_t <-> element_t <-> element_t
                  "<U0000>"     "..."

The cursor points into a doubly-linked list of collation elements.
We enter with the cursor pointing at the ellipsis pseudo-entry. Next we
unlink the ellipsis, and the cursor is back at <U0000>.

                  [cursor]
... element_t <-> element_t
                  "<U0000>"

The value of symstr is the ending symbol of the ellipsis so it's <U007F>.
We add that onto the list:

                                [cursor] 
... element_t <-> element_t <-> element_t
                  "<U0000>"     "<U007F>"

However, then we set the cursor on purpose because we want to insert
symbols between the start and end elements:

  /* Reset the cursor.  */
  collate->cursor = startp;

                  [cursor] 
... element_t <-> element_t <-> element_t
                  "<U0000>"     "<U007F>"
                  startp        endp

Then we proceed to add elements *after* the cursor. We iterate from
U0001 to U007E, adding entries.

      /* Enqueue the new element.  */
      elem->last = collate->cursor;
      elem->next = collate->cursor->next;
      elem->last->next = elem;
      if (elem->next != NULL)
        elem->next->last = elem;
      collate->cursor = elem;

This code inserts the new entry after the cursor, but before the
real end of the ellipsis:

                                [cursor] 
... element_t <-> element_t <-> element_t <-> element_t
                  "<U0000>"     "<U0001>"     "<U007F>"
                  startp                      endp

At the end of the function we have:

                  [cursor] 
... element_t <-> element_t <-> element_t
                  "<U007E>"     "<U007F>"
                                endp

The cursor should be pointing at endp, the last element in the
doubly-linked list; otherwise, when we come back to the caller, we
will start inserting the next line after <U007E>.

The stack at this point looks like this:
#0  handle_ellipsis (ldfile=<optimized out>, ldfile@entry=0x5555555a2e80, symstr=<optimized out>, 
    symstr@entry=0x7fffffffd080 "U0000007F", symlen=<optimized out>, symlen@entry=9, ellipsis=<optimized out>, 
    ellipsis@entry=tok_ellipsis2, charmap=<optimized out>, charmap@entry=0x5555555a2fd0, repertoire=<optimized out>, 
    repertoire@entry=0x0, result=<optimized out>) at programs/ld-collate.c:1488
#1  0x000055555557d424 in collate_read (ldfile=<optimized out>, result=0x7fffffffd170, charmap=0x5555555a2fd0, repertoire_name=0x0, 
    ignore_content=0) at programs/ld-collate.c:3670
#2  0x000055555558436d in locfile_read (result=0x7fffffffd170, charmap=0x5555555a2fd0) at programs/locfile.c:180
#3  0x000055555555da3f in main (argc=<optimized out>, argv=0x7fffffffd3f8) at programs/localedef.c:263

We return to collate_read, break, and read more data:
      /* Prepare for the next round.  */
      now = lr_token (ldfile, charmap, result, NULL, verbose);
      nowtok = now->tok;

We read the *next* ellipsis start symbol, remember that.
We read the *next* ellipsis "..." (tok_ellipsis2), remember that.
We read the *next* ellipsis end symbol, and this triggers handle_ellipsis.
All this time insert_value()->insert_weights() is using collate->cursor to insert
values immediately after <U007E>, and the <U007F> (end of ellipsis) is still at
the end of the doubly-linked list.

When I finish parsing C.UTF-8 I have a list that ends like this:

<U0000> <-> ... <-> <U10FFFF> <-> <UFFFF> <-> <U07FF> <-> <U007F>

Each tok_ellipsis2 fails to move the cursor to the end of its range, and
so leaves the trailing end of that ellipsis at the tail of the
doubly-linked list, where it gets assigned a weight in that order.
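
The walk-through above can be reproduced with a toy model. This is a
simplified Python sketch, not the ld-collate.c code: the Elem and Collate
classes, the build helper and the move_cursor_to_end flag are hypothetical,
and four tiny ranges stand in for the four C.UTF-8 ellipses. Only the cursor
handling described above is modelled.

```python
class Elem:
    """One collation element in a doubly-linked list."""
    def __init__(self, name):
        self.name = name
        self.last = None
        self.next = None


class Collate:
    def __init__(self):
        self.head = None
        self.cursor = None

    def insert_after_cursor(self, name):
        """Enqueue a new element after the cursor and move the cursor."""
        elem = Elem(name)
        if self.cursor is None:
            self.head = elem
        else:
            elem.last = self.cursor
            elem.next = self.cursor.next
            elem.last.next = elem
            if elem.next is not None:
                elem.next.last = elem
        self.cursor = elem
        return elem

    def unlink_cursor(self):
        """Remove the element under the cursor; cursor moves back one.
        Assumes the cursor is not at the head of the list."""
        elem = self.cursor
        elem.last.next = elem.next
        if elem.next is not None:
            elem.next.last = elem.last
        self.cursor = elem.last

    def names(self):
        out, elem = [], self.head
        while elem is not None:
            out.append(elem.name)
            elem = elem.next
        return out


def handle_ellipsis(collate, start, end, move_cursor_to_end):
    collate.unlink_cursor()                 # drop "...", back at start
    startp = collate.cursor
    endp = collate.insert_after_cursor(end)
    collate.cursor = startp                 # "Reset the cursor."
    for cp in range(start + 1, end):        # fill in start+1 .. end-1
        collate.insert_after_cursor(cp)
    if move_cursor_to_end:                  # the proposed fix
        collate.cursor = endp


def build(ranges, move_cursor_to_end):
    collate = Collate()
    for start, end in ranges:
        collate.insert_after_cursor(start)
        collate.insert_after_cursor("...")
        handle_ellipsis(collate, start, end, move_cursor_to_end)
    return collate.names()


RANGES = [(0, 3), (4, 7), (8, 11), (12, 15)]
print(build(RANGES, move_cursor_to_end=False))
print(build(RANGES, move_cursor_to_end=True))
```

Without the final cursor assignment the list ends in 15, 11, 7, 3, which is
the same shape as the <U10FFFF>, <UFFFF>, <U07FF>, <U007F> tail shown above.
With it, the elements come out in ascending order.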

Does that clarify the fix?

Shall I add more comments about the cursor handling?

-- 
Cheers,
Carlos.


* [PATCH 2/2] Add new C.UTF-8 locale (Bug 17318)
  2020-06-29  4:07 [PATCH 0/2] Initial C.UTF-8 support Carlos O'Donell
  2020-06-29  4:08 ` [PATCH 1/2] LC_COLLATE: Fix handling of last character in ellipsis, (Bug 22668) Carlos O'Donell
@ 2020-06-29  4:22 ` Carlos O'Donell
  2020-06-29  7:54   ` Andreas Schwab
  2020-06-29  9:42   ` Florian Weimer
  2020-06-29 22:50 ` [PATCH 0/2] Initial C.UTF-8 support Joseph Myers
  2 siblings, 2 replies; 9+ messages in thread
From: Carlos O'Donell @ 2020-06-29  4:22 UTC
  To: libc-alpha

[-- Attachment #1: Type: text/plain, Size: 2086 bytes --]

Patch is an xz compressed attachment because otherwise it is ~15MiB
of test data that contains almost all valid UTF-8 characters.

8< --- 8< --- 8<
We add a new C.UTF-8 locale.  This locale is not built into glibc,
but is provided as a distinct locale.  The locale provides full
support for UTF-8 and this includes full code point sorting via
collation.  Unfortunately given the present implementation in glibc
this results in 28MiB of LC_COLLATE data for all possible Unicode
code points.  Future improvements may reduce this size. Such
improvements likely require a shortcut for the collation data that
relies on C.UTF-8 single-byte sorting being equivalent to strcmp.

The new locale is NOT added to SUPPORTED.  Test data for almost all
code points (minus those not supported by collate-test) is provided
in C.UTF-8.in, and this verifies full code point sorting is working.
The next step is to reduce LC_COLLATE to a manageable size before we
enable the locale in SUPPORTED. Currently the C.UTF-8 testing can
add ~5-7 minutes to the locale testing (collate-test, and xfrm-test
twice) so we don't enable this either until we can parallelize the
sort-test test. Testing sort-test with C.UTF-8 passes cleanly.

No regressions on x86_64 or i686.
---
 locale/programs/charmap.c              |    170 +-
 localedata/C.UTF-8.in                  | 852388 ++++++++++++++++++++++
 localedata/charmaps/UTF-8              |   4396 +-
 localedata/locales/C                   |    192 +
 localedata/locales/i18n_ctype          |      2 +-
 localedata/locales/tr_TR               |      2 +-
 localedata/locales/translit_circle     |      2 +-
 localedata/locales/translit_cjk_compat |      2 +-
 localedata/locales/translit_combining  |      2 +-
 localedata/locales/translit_compat     |      2 +-
 localedata/locales/translit_font       |      2 +-
 localedata/locales/translit_fraction   |      2 +-
 localedata/unicode-gen/utf8_gen.py     |    174 +-
 13 files changed, 853557 insertions(+), 3779 deletions(-)
 create mode 100644 localedata/C.UTF-8.in
 create mode 100644 localedata/locales/C

[-- Attachment #2: 0002-Add-new-C.UTF-8-locale-Bug-17318.patch.xz --]
[-- Type: application/x-xz, Size: 815252 bytes --]


* Re: [PATCH 2/2] Add new C.UTF-8 locale (Bug 17318)
  2020-06-29  4:22 ` [PATCH 2/2] Add new C.UTF-8 locale (Bug 17318) Carlos O'Donell
@ 2020-06-29  7:54   ` Andreas Schwab
  2020-06-29  9:42   ` Florian Weimer
  1 sibling, 0 replies; 9+ messages in thread
From: Andreas Schwab @ 2020-06-29  7:54 UTC
  To: Carlos O'Donell via Libc-alpha

On Jun 29 2020, Carlos O'Donell via Libc-alpha wrote:

> @@ -125,67 +146,122 @@ def process_charmap(flines, outfile):
>  
>      <U0010>     /x10 DATA LINK ESCAPE
>      <U3400>..<U343F>     /xe3/x90/x80 <CJK Ideograph Extension A>
> -    %<UD800>     /xed/xa0/x80 <Non Private Use High Surrogate, First>
> -    %<UDB7F>     /xed/xad/xbf <Non Private Use High Surrogate, Last>
> +    <UD800>     /xed/xa0/x80 <Non Private Use High Surrogate, First>
> +    <UDB7F>     /xed/xad/xbf <Non Private Use High Surrogate, Last>
>      <U0010FFC0>..<U0010FFFD>     /xf4/x8f/xbf/x80 <Plane 16 Private Use>
>  
> +    Note that old glibc UTF-8 charmap left the surrogates commented out.
> +    We keep the surrogate entries because we want to be able to sort the
> +    invalid values into a consistent location.
> +
>      '''
>      fields_start = []
> +    fields_end = []
>      for line in flines:
>          fields = line.split(";")
> -         # Some characters have “<control>” as their name. We try to
> -         # use the “Unicode 1.0 Name” (10th field in
> -         # UnicodeData.txt) for them.
> -         #
> -         # The Characters U+0080, U+0081, U+0084 and U+0099 have
> -         # “<control>” as their name but do not even have aa
> -         # ”Unicode 1.0 Name”. We could write code to take their
> -         # alternate names from NameAliases.txt.
> +        # Some characters have “<control>” as their name. We try to
> +        # use the “Unicode 1.0 Name” (10th field in
> +        # UnicodeData.txt) for them.
> +        #
> +        # The Characters U+0080, U+0081, U+0084 and U+0099 have
> +        # “<control>” as their name but do not even have aa

s/aa/a/

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."


* Re: [PATCH 2/2] Add new C.UTF-8 locale (Bug 17318)
  2020-06-29  4:22 ` [PATCH 2/2] Add new C.UTF-8 locale (Bug 17318) Carlos O'Donell
  2020-06-29  7:54   ` Andreas Schwab
@ 2020-06-29  9:42   ` Florian Weimer
  2020-06-29 19:47     ` Carlos O'Donell
  1 sibling, 1 reply; 9+ messages in thread
From: Florian Weimer @ 2020-06-29  9:42 UTC
  To: Carlos O'Donell via Libc-alpha

* Carlos O'Donell via Libc-alpha:

> diff --git a/locale/programs/charmap.c b/locale/programs/charmap.c
> index c23e50944f..d89d788a9b 100644
> --- a/locale/programs/charmap.c
> +++ b/locale/programs/charmap.c
> @@ -49,7 +49,7 @@ static void new_width (struct linereader *cmfile, struct charmap_t *result,

> @@ -285,6 +285,27 @@ parse_charmap (struct linereader *cmfile, int verbose, int be_quiet)
>    enum token_t ellipsis = 0;
>    int step = 1;
>  
> +  /* POSIX explicitly requires that ellipsis processing do the
> +     following: "Bytes shall be treated as unsigned octets, and carry
> +     shall be propagated between the bytes as necessary to represent the
> +     range."  It then goes on to say that such a declaration should
> +     never be specified because it creates NULL bytes.  Therefore we
> +     error on this condition (see charmap_new_char).  However this still
> +     leaves a problem for encodings which use less than the full 8-bits,
> +     like UTF-8, and in such encodings you can use an ellipsis to
> +     silently and accidentally create invalid ranges.  In UTF-8 you have
> +     only the first 6-bits of the first byte and if your ellipsis covers
> +     a code point range larger than this 64 code point block the output
> +     is going to be an invalid non-UTF-8 multi-byte sequence.  Thus for
> +     UTF-8 we add a speical ellipsis handling loop that can increment
> +     UTF-8 multi-byte output effectively and for UTF-8 we allow larger
> +     ellipsis ranges without error.  There may still be other encodings
> +     for which the ellipsis will still generate invalid multi-byte
> +     output, but not for UTF-8.  The only alternative would be to call
> +     gconv for each Unicode code point in the loop to convert it to the
> +     appropriate multi-byte output, but that would be slow.  */

Typo: speical


> @@ -1039,11 +1134,52 @@ hexadecimal range format should use only capital characters"));
>    for (cnt = from_nr; cnt <= to_nr; cnt += step)
>      {
>        char *name_end;
> +      unsigned char ubytes[4] = { '\0', '\0', '\0', '\0' };
>        obstack_printf (ob, decimal_ellipsis ? "%.*s%0*d" : "%.*s%0*X",
>  		      prefix_len, from, len1 - prefix_len, cnt);
>        obstack_1grow (ob, '\0');
>        name_end = obstack_finish (ob);
>  
> +      /* Either we have a UTF-8 charmap, and we compute the bytes (see comment
> +	 above), or we have a non-UTF-8 charmap and we follow POSIX rules as
> +	 further below for incrementing the bytes in an ellipsis.  */
> +      if (is_utf8)
> +	{
> +	  int nubytes;
> +
> +	  /* Direclty convert the code point to the UTF-8 encoded bytes.  */
> +	  nubytes = output_utf8_bytes (cnt, 4, ubytes);

Typo: Direclty

There are some overlong lines here, please fix.

> diff --git a/localedata/C.UTF-8.in b/localedata/C.UTF-8.in
> new file mode 100644
> index 0000000000..70ab2bbac7
> --- /dev/null
> +++ b/localedata/C.UTF-8.in
> @@ -0,0 +1,852388 @@

I do not think it's a good idea to check in this file.  It's large and
it's dormant during regular builds.

Thanks,
Florian



* Re: [PATCH 2/2] Add new C.UTF-8 locale (Bug 17318)
  2020-06-29  9:42   ` Florian Weimer
@ 2020-06-29 19:47     ` Carlos O'Donell
  0 siblings, 0 replies; 9+ messages in thread
From: Carlos O'Donell @ 2020-06-29 19:47 UTC
  To: Florian Weimer, Carlos O'Donell via Libc-alpha

On 6/29/20 5:42 AM, Florian Weimer wrote:
> * Carlos O'Donell via Libc-alpha:
> 
>> diff --git a/locale/programs/charmap.c b/locale/programs/charmap.c
>> index c23e50944f..d89d788a9b 100644
>> --- a/locale/programs/charmap.c
>> +++ b/locale/programs/charmap.c
>> @@ -49,7 +49,7 @@ static void new_width (struct linereader *cmfile, struct charmap_t *result,
> 
>> @@ -285,6 +285,27 @@ parse_charmap (struct linereader *cmfile, int verbose, int be_quiet)
>>    enum token_t ellipsis = 0;
>>    int step = 1;
>>  
>> +  /* POSIX explicitly requires that ellipsis processing do the
>> +     following: "Bytes shall be treated as unsigned octets, and carry
>> +     shall be propagated between the bytes as necessary to represent the
>> +     range."  It then goes on to say that such a declaration should
>> +     never be specified because it creates NULL bytes.  Therefore we
>> +     error on this condition (see charmap_new_char).  However this still
>> +     leaves a problem for encodings which use less than the full 8-bits,
>> +     like UTF-8, and in such encodings you can use an ellipsis to
>> +     silently and accidentally create invalid ranges.  In UTF-8 you have
>> +     only the first 6-bits of the first byte and if your ellipsis covers
>> +     a code point range larger than this 64 code point block the output
>> +     is going to be an invalid non-UTF-8 multi-byte sequence.  Thus for
>> +     UTF-8 we add a speical ellipsis handling loop that can increment
>> +     UTF-8 multi-byte output effectively and for UTF-8 we allow larger
>> +     ellipsis ranges without error.  There may still be other encodings
>> +     for which the ellipsis will still generate invalid multi-byte
>> +     output, but not for UTF-8.  The only alternative would be to call
>> +     gconv for each Unicode code point in the loop to convert it to the
>> +     appropriate multi-byte output, but that would be slow.  */
> 
> Typo: speical
> 
> 
>> @@ -1039,11 +1134,52 @@ hexadecimal range format should use only capital characters"));
>>    for (cnt = from_nr; cnt <= to_nr; cnt += step)
>>      {
>>        char *name_end;
>> +      unsigned char ubytes[4] = { '\0', '\0', '\0', '\0' };
>>        obstack_printf (ob, decimal_ellipsis ? "%.*s%0*d" : "%.*s%0*X",
>>  		      prefix_len, from, len1 - prefix_len, cnt);
>>        obstack_1grow (ob, '\0');
>>        name_end = obstack_finish (ob);
>>  
>> +      /* Either we have a UTF-8 charmap, and we compute the bytes (see comment
>> +	 above), or we have a non-UTF-8 charmap and we follow POSIX rules as
>> +	 further below for incrementing the bytes in an ellipsis.  */
>> +      if (is_utf8)
>> +	{
>> +	  int nubytes;
>> +
>> +	  /* Direclty convert the code point to the UTF-8 encoded bytes.  */
>> +	  nubytes = output_utf8_bytes (cnt, 4, ubytes);
> 
> Typo: Direclty
> 
> There are some overlong lines here, please fix.
> 
>> diff --git a/localedata/C.UTF-8.in b/localedata/C.UTF-8.in
>> new file mode 100644
>> index 0000000000..70ab2bbac7
>> --- /dev/null
>> +++ b/localedata/C.UTF-8.in
>> @@ -0,0 +1,852388 @@
> 
> I do not think it's a good idea to check in this file.  It's large and
> it's dormant during regular builds.

I accept that. Until we enable C.UTF-8 more broadly we won't be using it.

My worry here is that as soon as we enable this in Debian and Fedora
we'll start getting a working C.UTF-8 that consumes 28MiB installed.

Should we limit collation to ASCII only for C.UTF-8 until we've fixed
the collation table size?

* Submit a C.UTF-8.in with just ASCII in LC_COLLATE.
* Add C.UTF-8 to SUPPORTED.
* Test C.UTF-8.

-- 
Cheers,
Carlos.



* Re: [PATCH 0/2] Initial C.UTF-8 support.
  2020-06-29  4:07 [PATCH 0/2] Initial C.UTF-8 support Carlos O'Donell
  2020-06-29  4:08 ` [PATCH 1/2] LC_COLLATE: Fix handling of last character in ellipsis, (Bug 22668) Carlos O'Donell
  2020-06-29  4:22 ` [PATCH 2/2] Add new C.UTF-8 locale (Bug 17318) Carlos O'Donell
@ 2020-06-29 22:50 ` Joseph Myers
  2 siblings, 0 replies; 9+ messages in thread
From: Joseph Myers @ 2020-06-29 22:50 UTC
  To: Carlos O'Donell; +Cc: libc-alpha

On Mon, 29 Jun 2020, Carlos O'Donell via Libc-alpha wrote:

> - Add code to remove C.UTF-8 collation weights and just use strcmp instead.

And presumably strcpy / wcscpy for strxfrm / wcsxfrm.

-- 
Joseph S. Myers
joseph@codesourcery.com
