* C.UTF-8 review
@ 2021-07-14 11:57 Florian Weimer
2021-07-14 23:06 ` Paul Eggert
2021-07-21 3:11 ` Carlos O'Donell
0 siblings, 2 replies; 6+ messages in thread
From: Florian Weimer @ 2021-07-14 11:57 UTC (permalink / raw)
To: Carlos O'Donell; +Cc: libc-alpha
Carlos,
I reviewed your changes on the codonell/c-utf8 branch and have some
comments.
I believe the replacement character should be U+FFFD REPLACEMENT CHARACTER,
not U+003F QUESTION MARK:
+% Include the neutral transliterations. The builtin C and
+% POSIX locales have +1600 transliterations that are built into
+% the locales, and these are a superset of those.
+translit_start
+include "translit_neutral";""
+default_missing <U003F>
+translit_end
I guess this means that C and C.UTF-8 LC_CTYPE diverge.
The strcmp_collation keyword is reasonably explicit. I would appreciate
if we could produce an error if it is used along with other collation
directives. Right now, those are silently ignored.
There is a trailing newline in localedata/locales/C.
But overall it's very nice. We should have shipped this many years ago.
Thanks,
Florian
* Re: C.UTF-8 review
2021-07-14 11:57 C.UTF-8 review Florian Weimer
@ 2021-07-14 23:06 ` Paul Eggert
2021-07-15 8:56 ` Florian Weimer
2021-07-21 3:11 ` Carlos O'Donell
1 sibling, 1 reply; 6+ messages in thread
From: Paul Eggert @ 2021-07-14 23:06 UTC (permalink / raw)
To: Florian Weimer, Carlos O'Donell; +Cc: libc-alpha
Dumb question. Will C.UTF-8 have the same worst-case strcoll performance
that en_US.UTF-8 does? I'm asking because I wonder whether we can
recommend C.UTF-8 as a workaround for the strcoll performance bug, in
cases where plain C is not appropriate.
https://sourceware.org/bugzilla/show_bug.cgi?id=18441
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=49340
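The strcmp_collation keyword mentioned in the review suggests that strcoll in C.UTF-8 would order strings the same way strcmp does. A minimal sketch for checking that on a given system follows; the helper name collation_matches_strcmp is hypothetical, and whether a locale named "C.UTF-8" is installed is an assumption about the system under test.

```c
#include <locale.h>
#include <string.h>

/* Reduce a comparison result to -1, 0, or 1.  */
static int
sign (int x)
{
  return (x > 0) - (x < 0);
}

/* Hypothetical helper, not part of glibc.  Returns 1 if strcoll and
   strcmp order A and B the same way under LOCALE_NAME, 0 if they
   differ, and -1 if setlocale rejects LOCALE_NAME.  On success,
   LC_COLLATE is left set to LOCALE_NAME.  */
int
collation_matches_strcmp (const char *locale_name,
                          const char *a, const char *b)
{
  if (setlocale (LC_COLLATE, locale_name) == NULL)
    return -1;
  return sign (strcoll (a, b)) == sign (strcmp (a, b));
}
```

Calling collation_matches_strcmp ("C.UTF-8", a, b) over a set of sample strings only shows that the orderings agree for those samples. It says nothing about worst-case performance, which would need separate measurement.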
* Re: C.UTF-8 review
2021-07-14 11:57 C.UTF-8 review Florian Weimer
2021-07-14 23:06 ` Paul Eggert
@ 2021-07-21 3:11 ` Carlos O'Donell
1 sibling, 0 replies; 6+ messages in thread
From: Carlos O'Donell @ 2021-07-21 3:11 UTC (permalink / raw)
To: Florian Weimer; +Cc: libc-alpha
On 7/14/21 7:57 AM, Florian Weimer wrote:
> Carlos,
>
> I reviewed your changes on the codonell/c-utf8 branch and have some
> comments.
>
> I believe the replacement character should be U+FFFD REPLACEMENT CHARACTER,
> not U+003F QUESTION MARK:
Unfortunately we can't use U+FFFD.
During the final transliteration (__gconv_transliterate()) we use
_NL_CTYPE_TRANSLIT_DEFAULT_MISSING to replace characters that have no
transliteration.
With U+FFFD as default_missing, converting from UTF-8 to ASCII fails: even
with TRANSLIT,IGNORE, the framework attempts to emit U+FFFD in the ASCII
output, and that immediately fails with a conversion error.
So "default_missing" must be valid in all possible output encodings, which
means U+FFFD is not usable (today), and likely why U+003F is being used.
To fix this we would have to attempt a final transliteration of default_missing
within __gconv_transliterate, but I feel like that's a follow-on change we
could do in the future.
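For readers who want to observe the behaviour described above, here is a minimal sketch of a UTF-8 to ASCII conversion with transliteration requested through iconv. The helper name utf8_to_ascii_translit is hypothetical, and the "ASCII//TRANSLIT" target string assumes an iconv implementation that accepts the //TRANSLIT suffix, as glibc's does.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of glibc.  Converts the UTF-8 string
   IN to ASCII with transliteration requested, storing a
   NUL-terminated result in OUT (capacity OUTSIZE bytes).  Returns 0
   on success, or -1 with errno set if the conversion fails.  */
int
utf8_to_ascii_translit (const char *in, char *out, size_t outsize)
{
  if (outsize == 0)
    {
      errno = E2BIG;
      return -1;
    }

  iconv_t cd = iconv_open ("ASCII//TRANSLIT", "UTF-8");
  if (cd == (iconv_t) -1)
    return -1;

  char *inp = (char *) in;          /* iconv does not modify the input.  */
  size_t inleft = strlen (in);
  char *outp = out;
  size_t outleft = outsize - 1;     /* Reserve room for the terminator.  */

  size_t rc = iconv (cd, &inp, &inleft, &outp, &outleft);
  if (rc != (size_t) -1)
    /* Flush any pending state; a no-op for a stateless target.  */
    rc = iconv (cd, NULL, NULL, &outp, &outleft);

  int saved_errno = errno;
  iconv_close (cd);
  if (rc == (size_t) -1)
    {
      errno = saved_errno;
      return -1;
    }

  *outp = '\0';
  return 0;
}
```

Passing a string that contains a character with no transliteration in the active locale is where default_missing comes into play: the replacement written to the output has to be representable in the target encoding, which is the constraint that rules out U+FFFD for an ASCII target.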
> +% Include the neutral transliterations. The builtin C and
> +% POSIX locales have +1600 transliterations that are built into
> +% the locales, and these are a superset of those.
> +translit_start
> +include "translit_neutral";""
> +default_missing <U003F>
> +translit_end
>
> I guess this means that C and C.UTF-8 LC_CTYPE diverge.
Yes, in locale/C-ctype.c we have very minimal data.
For C.UTF-8 the LC_CTYPE is much broader. I think this meets developers'
expectations and harmonizes with what downstreams have learned from
using C.UTF-8.
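A small sketch of how the divergence could be observed from application code: classify the same wide character under two locale names and compare the answers. The helper name is_alpha_in_locale is hypothetical, and the availability of a locale named "C.UTF-8" is an assumption about the system.

```c
#include <locale.h>
#include <wctype.h>

/* Hypothetical helper, not part of glibc.  Returns 1 if WC is
   alphabetic under LOCALE_NAME, 0 if it is not, and -1 if setlocale
   rejects LOCALE_NAME.  On success, LC_CTYPE is left set to
   LOCALE_NAME.  */
int
is_alpha_in_locale (const char *locale_name, wint_t wc)
{
  if (setlocale (LC_CTYPE, locale_name) == NULL)
    return -1;
  return iswalpha (wc) != 0;
}
```

Comparing is_alpha_in_locale ("C", 0x00E9) with is_alpha_in_locale ("C.UTF-8", 0x00E9) would be one way to see the broader data. With the minimal data in locale/C-ctype.c the two results may well differ, but the exact answers depend on the libc and the installed locales.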
> The strcmp_collation keyword is reasonably explicit. I would appreciate
> if we could produce an error if it is used along with other collation
> directives. Right now, those are silently ignored.
>
> There is a trailing newline in localedata/locales/C.
Fixed.
> But overall it's very nice. We should have shipped this many years ago.
Agreed. All my fault here.
I'll update my branch tomorrow with a new test for C.UTF-8; I found a
defect in the current branch.
--
Cheers,
Carlos.
Thread overview: 6+ messages
2021-07-14 11:57 C.UTF-8 review Florian Weimer
2021-07-14 23:06 ` Paul Eggert
2021-07-15 8:56 ` Florian Weimer
2021-07-19 15:33 ` Carlos O'Donell
2021-07-19 15:37 ` Florian Weimer
2021-07-21 3:11 ` Carlos O'Donell