public inbox for libc-alpha@sourceware.org
From: наб <nabijaczleweli@nabijaczleweli.xyz>
To: Florian Weimer <fweimer@redhat.com>
Cc: libc-alpha@sourceware.org, Victor Stinner <vstinner@redhat.com>
Subject: Re: [PATCH v6 2/2] POSIX locale covers every byte [BZ# 29511]
Date: Fri, 2 Dec 2022 19:42:00 +0100
Message-ID: <20221202184200.gfcnjwcfnc75wqqi@tarta.nabijaczleweli.xyz>
In-Reply-To: <87mt85pv1h.fsf@oldenburg.str.redhat.com>

Hi!

On Fri, Dec 02, 2022 at 06:36:26PM +0100, Florian Weimer wrote:
> * наб:
> > On Thu, Nov 10, 2022 at 09:10:57AM +0100, Florian Weimer wrote:
> >> Raised on the musl list here:
> >>   Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale
> >>   <https://www.openwall.com/lists/musl/2022/11/10/1>
> >
> > That thread seems to've been exhausted (at least I don't see anything
> > fresh in the archive) ‒ should I just resend with the comments for v7
> > applied, or do you have a mapping range you'd rather see given those
> > givens?
> 
> I still can't make up my mind.  I think the options are:
> 
> * Some sort of custom encoding (like you posted).
For which there's Prior Art in both other libcs and implementations
of similar mechanisms in unrelated software, making it just about what
users expect, and the lowest-energy conversion to what POSIX
has mandated for close to 8 years now.

> * Latin-1
Sorry, what? The Latin-1 that's so poorly defined that W3C requires,
per spec, that Latin-1 charset specs be ignored? The Latin-1 they got
wrong so badly they made two subsequent standards, only one of which is
compatible? The Latin-1 that has some random subset of Germanic and
maybe, like, French if you squint, and that's apparently fine? The Iron
Curtain has been down, for better or for worse, since the 80s. If that's
the "solution", then leave "C" 7-bit. I'm gonna assume that's a joke.
(Also, people would then try to "use" it, and then (a) you've lost,
 but (b) the collation sequence is gonna be wrong always because
 it no longer represents any given language
 (though apparently it's "fine" if they collate in any random order,
  so it's legal per spec to make it just Spanish, I think;
  this is somehow even worse, and I may be misunderstanding
  POSIX 7.3.2.6 because it'd mean that other parts of the standard
  that use and recommend "LC_ALL=C utility ..." to process bytes,
  like for sort, are also wrong?).)

> * UTF-8 with surrogate escape encoding (and encouraging POSIX to change again)
Well it's not gonna ‒ at least I don't think it is ‒ given that I don't
think it /changed/ anything actually? Issue 7-2008 7.2 says
> The tables in Locale Definition describe the characteristics and
> behavior of the POSIX locale for data consisting entirely of
> characters from the portable character set and the control character
> set. For other characters, the behavior is unspecified.

And TC2 just specified what had been unspecified behaviour, I think?
Implementations had freedom to do whatever, including UTF-8, until 2016.
Naturally, as we're seeing now, not one has exercised that freedom.
If glibc /had/ done POSIX=C=C.UTF-8 before then,
maybe we'd see a different result, but it hadn't, so we don't.
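
For reference, the "surrogate escape" idea in the quoted option can be
seen in Python's "surrogateescape" error handler (PEP 383). This is only
an illustration of the general technique, not of anything glibc or POSIX
specifies: valid UTF-8 decodes as usual, and each byte that is not valid
UTF-8 becomes a lone surrogate in U+DC80..U+DCFF, so the original bytes
can be recovered. The sample input is made up.

```python
# Python's "surrogateescape" error handler (PEP 383): bytes that are not
# valid UTF-8 decode to lone surrogates U+DC80..U+DCFF and encode back
# to the original bytes unchanged.

raw = b"caf\xe9 \xff"  # made-up input: Latin-1-style bytes, not valid UTF-8

text = raw.decode("utf-8", errors="surrogateescape")

# Collect the code points that came from undecodable bytes.
escaped = [hex(ord(ch)) for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

# Encoding with the same handler restores the original bytes.
round_trip = text.encode("utf-8", errors="surrogateescape")

print(escaped)            # ['0xdce9', '0xdcff']
print(round_trip == raw)  # True
```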

> What argues in favor of the last point is that many, many people are
> using C.UTF-8 nowadays.
Great! They can continue to use C.UTF-8. They have had to opt in to
their preferred encoding like everyone else, and they will continue.
0 changes observed here.

> And effectively disabling wide/multibyte
> conversion until you call setlocale does not seem particularly useful.
"Mangling input data until explicitly disabled" is worse than
"input data is data, and you can make it characters".
Don't take me for not-a-UTF-8-maximalist, but, y'know,
it will never, unfortunately, be all I see,
and being able to completely opt out of additional input processing
would be nice; we're kinda close now with the current hard-7-bit ASCII,
and making it, essentially, I Can't Believe It's Not Just Bytes!,
per pt. 1, would eliminate even more headaches IME.
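
A minimal sketch of what "every byte is a character" could look like,
assuming a one-to-one mapping: ASCII bytes map to themselves and each
high byte maps to its own code in some reserved range. The names and
HYPOTHETICAL_BASE below are made up for illustration; the range
actually proposed in the patch is not stated in this message.

```python
# Hypothetical total mapping from bytes to wide-character codes.
# HYPOTHETICAL_BASE is an assumption for illustration only; it is not
# the value used by glibc, POSIX, or any particular libc.
HYPOTHETICAL_BASE = 0xDC00


def byte_to_wide(b: int) -> int:
    """Map one byte (0..255) to a wide-character code; never fails."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte")
    return b if b < 0x80 else HYPOTHETICAL_BASE + b


def wide_to_byte(wc: int) -> int:
    """Inverse mapping; rejects codes outside the mapped ranges."""
    if 0 <= wc < 0x80:
        return wc
    if HYPOTHETICAL_BASE + 0x80 <= wc <= HYPOTHETICAL_BASE + 0xFF:
        return wc - HYPOTHETICAL_BASE
    raise ValueError("code does not correspond to a byte")


all_bytes = bytes(range(256))
wide = [byte_to_wide(b) for b in all_bytes]
back = bytes(wide_to_byte(wc) for wc in wide)

print(len(set(wide)) == 256)  # every byte gets a distinct code
print(back == all_bytes)      # and the conversion is lossless
```

Under such a mapping no input can fail to convert, which is the
"input data is data" property argued for above.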

Putting proverbial KOI-8 or [your grandma's favourite encoding] through
the UTF-8 grinder is much worse than just degrading to strcmp(),
but at this point I think I'm rambling, and have spilled enough ink;
your call to make, at the end of the day.
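
To make the "UTF-8 grinder" point concrete, here is a small Python
sketch (Python's codecs stand in for a libc here, which is an
assumption of the example, and the sample words are made up). KOI8-R
text is not valid UTF-8, so a strict decode fails and a lenient one
destroys the data, while plain byte-wise comparison, the moral
equivalent of strcmp(), keeps working on the untouched bytes.

```python
# KOI8-R encoded text pushed through a UTF-8 decoder, versus leaving
# the bytes alone and comparing them byte-wise.

koi8 = "привет".encode("koi8_r")

# Strict UTF-8 decoding rejects the input outright.
try:
    koi8.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding substitutes U+FFFD, so the original bytes are lost.
mangled = koi8.decode("utf-8", errors="replace").encode("utf-8")

# Byte-wise ordering needs no decoding at all (unsigned, like strcmp()).
a = "да".encode("koi8_r")
b = "нет".encode("koi8_r")
byte_order = sorted([b, a])

print(strict_ok)               # False
print(mangled == koi8)         # False
print(byte_order == [a, b])    # True
```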

наб
