C.UTF-8 review

public inbox for libc-alpha@sourceware.org

* C.UTF-8 review
@ 2021-07-14 11:57 Florian Weimer
  2021-07-14 23:06 ` Paul Eggert
  2021-07-21  3:11 ` Carlos O'Donell
  0 siblings, 2 replies; 6+ messages in thread
From: Florian Weimer @ 2021-07-14 11:57 UTC
  To: Carlos O'Donell; +Cc: libc-alpha

Carlos,

I reviewed your changes on the codonell/c-utf8 branch and have some
comments.

I believe the replacement character should be U+FFFD REPLACEMENT CHARACTER,
not U+003F QUESTION MARK:

+% Include the neutral transliterations.  The builtin C and
+% POSIX locales have +1600 transliterations that are built into
+% the locales, and these are a superset of those.
+translit_start
+include "translit_neutral";""
+default_missing <U003F>
+translit_end

I guess this means that C and C.UTF-8 LC_CTYPE diverge.

The strcmp_collation keyword is reasonably explicit.  I would appreciate
it if we could produce an error when it is used along with other collation
directives.  Right now, those are silently ignored.

There is a trailing newline in localedata/locales/C.

But overall it's very nice.  We should have shipped this many years ago.

Thanks,
Florian


* Re: C.UTF-8 review
  2021-07-14 11:57 C.UTF-8 review Florian Weimer
@ 2021-07-14 23:06 ` Paul Eggert
  2021-07-15  8:56   ` Florian Weimer
  2021-07-21  3:11 ` Carlos O'Donell
  1 sibling, 1 reply; 6+ messages in thread
From: Paul Eggert @ 2021-07-14 23:06 UTC
  To: Florian Weimer, Carlos O'Donell; +Cc: libc-alpha

Dumb question. Will C.UTF-8 have the same worst-case strcoll performance 
that en_US.UTF-8 does? I'm asking because I wonder whether we can 
recommend C.UTF-8 as a workaround for the strcoll performance bug, in 
cases where plain C is not appropriate.

https://sourceware.org/bugzilla/show_bug.cgi?id=18441

https://debbugs.gnu.org/cgi/bugreport.cgi?bug=49340


* Re: C.UTF-8 review
  2021-07-14 23:06 ` Paul Eggert
@ 2021-07-15  8:56   ` Florian Weimer
  2021-07-19 15:33     ` Carlos O'Donell
  0 siblings, 1 reply; 6+ messages in thread
From: Florian Weimer @ 2021-07-15  8:56 UTC
  To: Paul Eggert; +Cc: Carlos O'Donell, libc-alpha

* Paul Eggert:

> Dumb question. Will C.UTF-8 have the same worst-case strcoll
> performance that en_US.UTF-8 does? I'm asking because I wonder whether
> we can recommend C.UTF-8 as a workaround for the strcoll performance
> bug, in cases where plain C is not appropriate.
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=18441
>
> https://debbugs.gnu.org/cgi/bugreport.cgi?bug=49340

I'm not sure if that advice is correct …

With the new C.UTF-8 implementation, strcoll automatically switches to
strcmp, so it will be as fast as it can be.
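
As a rough illustration of the point above, the following sketch checks
whether strcoll and strcmp agree on the ordering of two strings under a
given locale.  The helper name collation_matches_strcmp is hypothetical,
and the check only compares results; it does not show which code path
glibc takes internally.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Reduce a comparison result to -1, 0, or 1 so that results from
   strcoll and strcmp can be compared directly.  */
static int
sign (int v)
{
  return (v > 0) - (v < 0);
}

/* Hypothetical helper: return 1 if strcoll and strcmp order A and B
   the same way under LOCALE_NAME, 0 if they disagree, and -1 if the
   locale cannot be selected (for example, because it is not
   installed).  Note that this changes the process-wide LC_COLLATE
   setting.  */
int
collation_matches_strcmp (const char *locale_name, const char *a,
                          const char *b)
{
  assert (locale_name != NULL && a != NULL && b != NULL);
  if (setlocale (LC_COLLATE, locale_name) == NULL)
    return -1;
  return sign (strcoll (a, b)) == sign (strcmp (a, b));
}
```

Calling collation_matches_strcmp ("C.UTF-8", "Zebra", "apple") is
expected to return 1 where the locale is installed, and -1 where it is
not.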

Thanks,
Florian



* Re: C.UTF-8 review
  2021-07-15  8:56   ` Florian Weimer
@ 2021-07-19 15:33     ` Carlos O'Donell
  2021-07-19 15:37       ` Florian Weimer
  0 siblings, 1 reply; 6+ messages in thread
From: Carlos O'Donell @ 2021-07-19 15:33 UTC
  To: Florian Weimer, Paul Eggert; +Cc: libc-alpha

On 7/15/21 4:56 AM, Florian Weimer wrote:
> * Paul Eggert:
> 
>> Dumb question. Will C.UTF-8 have the same worst-case strcoll
>> performance that en_US.UTF-8 does? I'm asking because I wonder whether
>> we can recommend C.UTF-8 as a workaround for the strcoll performance
>> bug, in cases where plain C is not appropriate.
>>
>> https://sourceware.org/bugzilla/show_bug.cgi?id=18441
>>
>> https://debbugs.gnu.org/cgi/bugreport.cgi?bug=49340
> 
> I'm not sure if that advice is correct …

Could you expand on this a bit more please?

> With the new C.UTF-8 implementation, strcoll automatically switches to
> strcmp, so it will be as fast as it can be.

Agreed.

-- 
Cheers,
Carlos.



* Re: C.UTF-8 review
  2021-07-19 15:33     ` Carlos O'Donell
@ 2021-07-19 15:37       ` Florian Weimer
  0 siblings, 0 replies; 6+ messages in thread
From: Florian Weimer @ 2021-07-19 15:37 UTC
  To: Carlos O'Donell; +Cc: Paul Eggert, libc-alpha

* Carlos O'Donell:

> On 7/15/21 4:56 AM, Florian Weimer wrote:
>> * Paul Eggert:
>> 
>>> Dumb question. Will C.UTF-8 have the same worst-case strcoll
>>> performance that en_US.UTF-8 does? I'm asking because I wonder whether
>>> we can recommend C.UTF-8 as a workaround for the strcoll performance
>>> bug, in cases where plain C is not appropriate.
>>>
>>> https://sourceware.org/bugzilla/show_bug.cgi?id=18441
>>>
>>> https://debbugs.gnu.org/cgi/bugreport.cgi?bug=49340
>> 
>> I'm not sure if that advice is correct …
>
> Could you expand on this a bit more please?

Using C.UTF-8 with collation rules is unlikely to provide the intended
speed benefit.

Thanks,
Florian



* Re: C.UTF-8 review
  2021-07-14 11:57 C.UTF-8 review Florian Weimer
  2021-07-14 23:06 ` Paul Eggert
@ 2021-07-21  3:11 ` Carlos O'Donell
  1 sibling, 0 replies; 6+ messages in thread
From: Carlos O'Donell @ 2021-07-21  3:11 UTC
  To: Florian Weimer; +Cc: libc-alpha

On 7/14/21 7:57 AM, Florian Weimer wrote:
> Carlos,
> 
> I reviewed your changes on the codonell/c-utf8 branch and have some
> comments.
> 
> I believe the replacement character should be U+FFFD REPLACEMENT CHARACTER,
> not U+003F QUESTION MARK:

Unfortunately we can't use U+FFFD.

During the final transliteration (__gconv_transliterate()) we use
_NL_CTYPE_TRANSLIT_DEFAULT_MISSING to replace characters that have no
transliteration.

When you convert from UTF-8 to ASCII, even with TRANSLIT,IGNORE, the
framework attempts to use U+FFFD for the ASCII output, and that
immediately fails with a conversion error.

So "default_missing" must be valid in all possible output encodings, which
means U+FFFD is not usable (today), and is likely why U+003F is used.

To fix this we would have to attempt a final transliteration of default_missing
within __gconv_transliterate, but I feel like that's a follow-on change we
could do in the future.
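
To make the constraint concrete, here is a minimal sketch of a UTF-8 to
ASCII conversion with transliteration requested through iconv.  The
helper name utf8_to_ascii_translit is hypothetical, and the
"ASCII//TRANSLIT" target string assumes glibc's iconv.  What is emitted
for a character with no transliteration depends on the active LC_CTYPE
locale, so no particular output is claimed for that case.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert the UTF-8 string IN to ASCII with
   transliteration enabled, writing a NUL-terminated result to OUT
   (capacity OUTSIZE bytes).  Return the number of bytes written,
   excluding the terminator, or -1 if the converter cannot be opened
   or the conversion fails.  */
long
utf8_to_ascii_translit (const char *in, char *out, size_t outsize)
{
  assert (in != NULL && out != NULL && outsize > 0);

  iconv_t cd = iconv_open ("ASCII//TRANSLIT", "UTF-8");
  if (cd == (iconv_t) -1)
    return -1;

  char *inp = (char *) in;      /* iconv does not modify the input.  */
  size_t inleft = strlen (in);
  char *outp = out;
  size_t outleft = outsize - 1; /* Reserve space for the terminator.  */

  size_t rc = iconv (cd, &inp, &inleft, &outp, &outleft);
  if (rc != (size_t) -1)
    /* Flush any pending state held by the converter.  */
    rc = iconv (cd, NULL, NULL, &outp, &outleft);
  iconv_close (cd);

  if (rc == (size_t) -1)
    return -1;
  *outp = '\0';
  return (long) (outp - out);
}
```

Any replacement the converter emits has to be representable in the
target charset, which is the property U+003F has in ASCII and U+FFFD
does not.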

> +% Include the neutral transliterations.  The builtin C and
> +% POSIX locales have +1600 transliterations that are built into
> +% the locales, and these are a superset of those.
> +translit_start
> +include "translit_neutral";""
> +default_missing <U003F>
> +translit_end
> 
> I guess this means that C and C.UTF-8 LC_CTYPE diverge.

Yes, in locale/C-ctype.c we have very minimal data.

For C.UTF-8 the LC_CTYPE data is much broader.  I think this meets
developers' expectations and is consistent with what downstreams
have learned from using C.UTF-8.

> The strcmp_collation keyword is reasonably explicit.  I would appreciate
> it if we could produce an error when it is used along with other collation
> directives.  Right now, those are silently ignored.
> 
> There is a trailing newline in localedata/locales/C.

Fixed.

> But overall it's very nice.  We should have shipped this many years ago.

Agreed. All my fault here.

I'll update my branch tomorrow with a new test for C.UTF-8; I found a
defect in the current branch.

-- 
Cheers,
Carlos.

