Hi!

On Fri, Dec 02, 2022 at 06:36:26PM +0100, Florian Weimer wrote:
> * наб:
> > On Thu, Nov 10, 2022 at 09:10:57AM +0100, Florian Weimer wrote:
> >> Raised on the musl list here:
> >>   Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale
> >>   <https://www.openwall.com/lists/musl/2022/11/10/1>
> >
> > That thread seems to've been exhausted (at least I don't see anything
> > fresh in the archive) ‒ should I just resend with the comments for v7
> > applied, or do you have a mapping range you'd rather see given those
> > givens?
> 
> I still can't make up my mind.  I think the options are:
> 
> * Some sort of custom encoding (like you posted).
For which there's Prior Art in both other libcs and implementations
of similar mechanisms in unrelated software, making it just about what
users expect, and the lowest-energy conversion to what POSIX
has mandated for close to 8 years now.
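
For concreteness, a minimal sketch of that kind of mapping (not the patch
itself): ASCII maps to itself, and every high byte gets its own reserved
wchar_t, so the round trip is lossless. HIGH_BYTE_BASE is a placeholder
picked for illustration (it puts 0x80..0xFF at 0xDF80..0xDFFF), and
byte_to_wc(), wc_to_byte() and check_roundtrip() are made-up names.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption, for illustration only: high byte b lands at BASE + b. */
#define HIGH_BYTE_BASE 0xDF00

/* Every byte converts; nothing is ever rejected on the way in. */
static wchar_t byte_to_wc(unsigned char b)
{
	return b < 0x80 ? (wchar_t)b : (wchar_t)(HIGH_BYTE_BASE + b);
}

/* Returns the original byte, or -1 for anything outside the mapping. */
static int wc_to_byte(wchar_t wc)
{
	long c = (long)wc;

	if (c >= 0 && c < 0x80)
		return (int)c;
	if (c >= HIGH_BYTE_BASE + 0x80 && c <= HIGH_BYTE_BASE + 0xFF)
		return (int)(c - HIGH_BYTE_BASE);
	return -1;
}

/* All 256 byte values must survive byte -> wchar_t -> byte unchanged. */
static void check_roundtrip(void)
{
	for (int b = 0; b < 256; b++)
		assert(wc_to_byte(byte_to_wc((unsigned char)b)) == b);
}
```

Input is never mangled, and a wchar_t from outside the reserved range
simply fails to convert back, which is about all I'd want from it.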

> * Latin-1
Sorry, what? The Latin-1 that's so poorly defined W3C requires,
per spec, that Latin-1 charset specs be ignored? The Latin-1 they got
wrong so badly they made two subsequent standards, only one of which is
compatible? The Latin-1 that has some random subset of Germanic and
maybe like French if you squint and that's apparently fine? The iron
curtain has fallen, for better or for worse, since the 80s. If that's
the "solution", then leave "C" 7-bit. I'm gonna assume that's a joke.
(Also, people would then try to "use" it, and then (a) you've lost,
 but (b) the collation sequence is gonna be wrong always because
 it no longer represents any given language
 (though apparently it's "fine" if they collate in any random order,
  so it's legal per spec to make it just Spanish, I think;
  this is somehow even worse, and I may be misunderstanding
  POSIX 7.3.2.6 because it'd mean that other parts of the standard
  that use and recommend "LC_ALL=C utility ..." to process bytes,
  like for sort, are also wrong?).)

> * UTF-8 with surrogate escape encoding (and encouraging POSIX to change again)
Well it's not gonna ‒ at least I don't think it is ‒ given that I don't
think it /changed/ anything actually? Issue 7-2008 7.2 says
> The tables in Locale Definition describe the characteristics and
> behavior of the POSIX locale for data consisting entirely of
> characters from the portable character set and the control character
> set. For other characters, the behavior is unspecified.

And TC2 just specified what had been unspecified behaviour, I think?
Implementations had freedom to do whatever, including UTF-8, until 2016.
Naturally, as we're seeing now, not one has exercised that freedom.
If glibc /did/ do POSIX=C=C.UTF-8 before then,
then maybe we'd see a different result, but it hadn't, so we didn't.

> What argues in favor of the last point is that many, many people are
> using C.UTF-8 nowadays.
Great! They can continue to use C.UTF-8. They have had to opt in to
their preferred encoding like everyone else, and they will continue.
0 changes observed here.

> And effectively disabling wide/multibyte
> conversion until you call setlocale does not seem particularly useful.
"Mangling input data until explicitly disabled" is worse than
"input data is data, and you can make it characters".
Don't take me for not-a-UTF-8-maximalist, but, y'know,
it will never, unfortunately, be all I see,
and being able to completely opt out of additional input processing
would be nice; we're kinda close now with the current hard-7-bit ASCII,
and making it, essentially, I Can't Believe It's Not Just Bytes!,
per pt. 1, would eliminate even more headaches IME.
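
A rough illustration, under stated assumptions: looks_like_utf8() is a
hand-rolled structural check standing in for a real UTF-8 decoder (it does
not reject overlong forms or surrogates), bytes_before() is a made-up
wrapper around memcmp(), and the sample data is "наб" in KOI8-R and UTF-8,
plus an arbitrary second KOI8-R string to compare against.

```c
#include <stddef.h>
#include <string.h>

static const unsigned char koi8r_nab[] = { 0xCE, 0xC1, 0xC2 };
static const unsigned char koi8r_nob[] = { 0xCE, 0xCF, 0xC2 };
static const unsigned char utf8_nab[]  = { 0xD0, 0xBD, 0xD0, 0xB0, 0xD0, 0xB1 };

/* Structural check only: lead byte, then the right number of 10xxxxxx
 * continuation bytes. Returns 1 if the input has that shape, else 0. */
static int looks_like_utf8(const unsigned char *s, size_t n)
{
	size_t i = 0;

	while (i < n) {
		unsigned char c = s[i];
		size_t need;

		if (c < 0x80)
			need = 0;
		else if (c >= 0xC2 && c <= 0xDF)
			need = 1;
		else if (c >= 0xE0 && c <= 0xEF)
			need = 2;
		else if (c >= 0xF0 && c <= 0xF4)
			need = 3;
		else
			return 0;	/* stray continuation or invalid lead */

		if (need > n - i - 1)
			return 0;	/* sequence truncated */
		for (size_t k = 1; k <= need; k++)
			if ((s[i + k] & 0xC0) != 0x80)
				return 0;
		i += need + 1;
	}
	return 1;
}

/* Byte-wise ordering: defined for any input, whatever its encoding. */
static int bytes_before(const unsigned char *a, const unsigned char *b, size_t n)
{
	return memcmp(a, b, n) < 0;
}
```

The KOI8-R bytes fail the UTF-8 check at the second byte, so a UTF-8
locale has to reject or rewrite them; bytes_before() orders them without
caring what they were meant to be.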

Putting proverbial KOI-8 or [your grandma's favourite encoding] through
the UTF-8 grinder is much worse than just degrading to strcmp(),
but at this point I think I'm rambling, and have spilled enough ink;
your call to make, at the end of the day.

наб