* C.UTF-8 review
@ 2021-07-14 11:57 Florian Weimer
2021-07-14 23:06 ` Paul Eggert
2021-07-21 3:11 ` Carlos O'Donell
0 siblings, 2 replies; 6+ messages in thread
From: Florian Weimer @ 2021-07-14 11:57 UTC (permalink / raw)
To: Carlos O'Donell; +Cc: libc-alpha
Carlos,
I reviewed your changes on the codonell/c-utf8 branch and have some
comments.
I believe the replacement character should be U+FFFD REPLACEMENT CHARACTER,
not U+003F QUESTION MARK:
+% Include the neutral transliterations. The builtin C and
+% POSIX locales have +1600 transliterations that are built into
+% the locales, and these are a superset of those.
+translit_start
+include "translit_neutral";""
+default_missing <U003F>
+translit_end
I guess this means that C and C.UTF-8 LC_CTYPE diverge.
The strcmp_collation keyword is reasonably explicit. I would appreciate
if we could produce an error if it is used along with other collation
directives. Right now, those are silently ignored.
There is a trailing newline in localedata/locales/C.
But overall it's very nice. We should have shipped this many years ago.
Thanks,
Florian
* Re: C.UTF-8 review
2021-07-14 11:57 C.UTF-8 review Florian Weimer
@ 2021-07-14 23:06 ` Paul Eggert
2021-07-15 8:56 ` Florian Weimer
2021-07-21 3:11 ` Carlos O'Donell
1 sibling, 1 reply; 6+ messages in thread
From: Paul Eggert @ 2021-07-14 23:06 UTC (permalink / raw)
To: Florian Weimer, Carlos O'Donell; +Cc: libc-alpha
Dumb question. Will C.UTF-8 have the same worst-case strcoll performance
that en_US.UTF-8 does? I'm asking because I wonder whether we can
recommend C.UTF-8 as a workaround for the strcoll performance bug, in
cases where plain C is not appropriate.
https://sourceware.org/bugzilla/show_bug.cgi?id=18441
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=49340
* Re: C.UTF-8 review
2021-07-14 23:06 ` Paul Eggert
@ 2021-07-15 8:56 ` Florian Weimer
2021-07-19 15:33 ` Carlos O'Donell
0 siblings, 1 reply; 6+ messages in thread
From: Florian Weimer @ 2021-07-15 8:56 UTC (permalink / raw)
To: Paul Eggert; +Cc: Carlos O'Donell, libc-alpha
* Paul Eggert:
> Dumb question. Will C.UTF-8 have the same worst-case strcoll
> performance that en_US.UTF-8 does? I'm asking because I wonder whether
> we can recommend C.UTF-8 as a workaround for the strcoll performance
> bug, in cases where plain C is not appropriate.
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=18441
>
> https://debbugs.gnu.org/cgi/bugreport.cgi?bug=49340
I'm not sure if that advice is correct …
With the new C.UTF-8 implementation, strcoll automatically switches to
strcmp, so it will be as fast as it can be.
Thanks,
Florian
* Re: C.UTF-8 review
2021-07-15 8:56 ` Florian Weimer
@ 2021-07-19 15:33 ` Carlos O'Donell
2021-07-19 15:37 ` Florian Weimer
0 siblings, 1 reply; 6+ messages in thread
From: Carlos O'Donell @ 2021-07-19 15:33 UTC (permalink / raw)
To: Florian Weimer, Paul Eggert; +Cc: libc-alpha
On 7/15/21 4:56 AM, Florian Weimer wrote:
> * Paul Eggert:
>
>> Dumb question. Will C.UTF-8 have the same worst-case strcoll
>> performance that en_US.UTF-8 does? I'm asking because I wonder whether
>> we can recommend C.UTF-8 as a workaround for the strcoll performance
>> bug, in cases where plain C is not appropriate.
>>
>> https://sourceware.org/bugzilla/show_bug.cgi?id=18441
>>
>> https://debbugs.gnu.org/cgi/bugreport.cgi?bug=49340
>
> I'm not sure if that advice is correct …
Could you expand on this a bit more please?
> With the new C.UTF-8 implementation, strcoll automatically switches to
> strcmp, so it will be as fast as it can be.
Agreed.
--
Cheers,
Carlos.
* Re: C.UTF-8 review
2021-07-19 15:33 ` Carlos O'Donell
@ 2021-07-19 15:37 ` Florian Weimer
0 siblings, 0 replies; 6+ messages in thread
From: Florian Weimer @ 2021-07-19 15:37 UTC (permalink / raw)
To: Carlos O'Donell; +Cc: Paul Eggert, libc-alpha
* Carlos O'Donell:
> On 7/15/21 4:56 AM, Florian Weimer wrote:
>> * Paul Eggert:
>>
>>> Dumb question. Will C.UTF-8 have the same worst-case strcoll
>>> performance that en_US.UTF-8 does? I'm asking because I wonder whether
>>> we can recommend C.UTF-8 as a workaround for the strcoll performance
>>> bug, in cases where plain C is not appropriate.
>>>
>>> https://sourceware.org/bugzilla/show_bug.cgi?id=18441
>>>
>>> https://debbugs.gnu.org/cgi/bugreport.cgi?bug=49340
>>
>> I'm not sure if that advice is correct …
>
> Could you expand on this a bit more please?
Using C.UTF-8 with collation rules is unlikely to provide the intended
speed benefit.
Thanks,
Florian
* Re: C.UTF-8 review
2021-07-14 11:57 C.UTF-8 review Florian Weimer
2021-07-14 23:06 ` Paul Eggert
@ 2021-07-21 3:11 ` Carlos O'Donell
1 sibling, 0 replies; 6+ messages in thread
From: Carlos O'Donell @ 2021-07-21 3:11 UTC (permalink / raw)
To: Florian Weimer; +Cc: libc-alpha
On 7/14/21 7:57 AM, Florian Weimer wrote:
> Carlos,
>
> I reviewed your changes on the codonell/c-utf8 branch and have some
> comments.
>
> I believe the replacement character should be U+FFFD REPLACEMENT CHARACTER,
> not U+003F QUESTION MARK:
Unfortunately we can't use U+FFFD.
During the final transliteration (__gconv_transliterate()) we use
_NL_CTYPE_TRANSLIT_DEFAULT_MISSING to replace characters that have no
transliteration.
If default_missing were U+FFFD, then when you convert from UTF-8 to ASCII,
even with TRANSLIT,IGNORE, the framework would attempt to use U+FFFD for the
ASCII output, and that immediately fails with a conversion error.
So "default_missing" must be valid in all possible output encodings, which
means U+FFFD is not usable (today), and is likely why U+003F is being used.
To fix this we would have to attempt a final transliteration of default_missing
within __gconv_transliterate, but I feel like that's a follow-on change we
could do in the future.
> +% Include the neutral transliterations. The builtin C and
> +% POSIX locales have +1600 transliterations that are built into
> +% the locales, and these are a superset of those.
> +translit_start
> +include "translit_neutral";""
> +default_missing <U003F>
> +translit_end
>
> I guess this means that C and C.UTF-8 LC_CTYPE diverge.
Yes, in locale/C-ctype.c we have very minimal data.
For C.UTF-8 the LC_CTYPE is much broader. I think this meets developers'
expectations and harmonizes with what downstreams have learned from
using C.UTF-8.
> The strcmp_collation keyword is reasonably explicit. I would appreciate
> if we could produce an error if it is used along with other collation
> directives. Right now, those are silently ignored.
>
> There is a trailing newline in localedata/locales/C.
Fixed.
> But overall it's very nice. We should have shipped this many years ago.
Agreed. All my fault here.
I'll update my branch tomorrow with a new test for C.UTF-8; I found a
defect in the current branch.
--
Cheers,
Carlos.