From: Florian Weimer <fweimer@redhat.com>
To: наб <nabijaczleweli@nabijaczleweli.xyz>
Cc: libc-alpha@sourceware.org, Victor Stinner <vstinner@redhat.com>
Subject: Re: [PATCH v9] POSIX locale covers every byte [BZ# 29511]
Date: Wed, 26 Apr 2023 23:27:23 +0200
Message-ID: <871qk6wczo.fsf@oldenburg.str.redhat.com>
In-Reply-To: <es43pxh5nu2eqshlx2ujtpl77afmqtef5s3jacdsgzrcd7l6m6@pgrmyzdxsjel>

* наб:
>> I've thought about this some more, and I don't think this is the
>> direction we should be going in.
>>
>> * Add a UTF-8SE charset to glibc: it's UTF-8 with surrogate encoding (in
>> the Python style). It should have the property that it can encode
>> every byte string as a string of wchar_t characters, and convert the
>> result back. It's not entirely trivial because we need to handle
>> partial UTF-8 sequences at the end of the buffer carefully. There
>> might be some warts regarding EILSEQ handling lurking there. Like the
>> Python approach, it is somewhat imperfect because it's not preserving
>> identity under string concatenation, i.e. f(x) || f(y) is not always
>> equal to f(x || y), but that's just unavoidable.
>>
>> * Switch the charset for the default C locale to UTF-8SE. This matches
>> the POSIX requirement that every byte can be encoded.
> The main point of LC_CTYPE=POSIX as specified is that it allows you to
> process paths (which are sequences of bytes, not characters) in a sane
> way ‒ part of that is that collation needs to be correct, so maybe, as a
> smoke test, "[a, b, c] < [a, b, c+1] for all a,b,c".
>
> >>> b'\xc4\xbf'.decode('UTF-8', errors='surrogateescape')
> 'Ŀ'
> >>> b'\xc4\xc0'.decode('UTF-8', errors='surrogateescape')
> '\udcc4\udcc0'
> >>>
> >>> [*map(ord, b'\xc4\xbf'.decode('UTF-8', errors='surrogateescape'))]
> [319]
> >>> [*map(ord, b'\xc4\xc0'.decode('UTF-8', errors='surrogateescape'))]
> [56516, 56512]
> which, I mean, sure, maybe that's sensible (I wouldn't say so), but
> >>> b'\xef\xbf\xbf'.decode('UTF-8', errors='surrogateescape')
> '\uffff'
> >>> b'\xef\xbf\xc0'.decode('UTF-8', errors='surrogateescape')
> '\udcef\udcbf\udcc0'
> >>>
> >>> [*map(ord, b'\xef\xbf\xbf'.decode('UTF-8', errors='surrogateescape'))]
> [65535]
> >>> [*map(ord, b'\xef\xbf\xc0'.decode('UTF-8', errors='surrogateescape'))]
> [56559, 56511, 56512]
>
> Which means you can't process arbitrary data (pathnames) in a way that
> makes sense. In my opinion this would be /worse/ than the current
> behaviour, behaving erratically in the presence of Some Data instead of
> simply not supporting it.

Sorry for letting this linger for so long on my side, too.

Regarding the above, I'm not sure I find this convincing. Isn't that just
business as usual with collation?
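
For reference, the ordering concern in the quoted text can be checked
mechanically. This is only a sketch using Python's surrogateescape
handler as a stand-in for the proposed UTF-8SE charset; code_points is a
hypothetical helper name, and glibc's eventual behaviour could differ.

```python
def code_points(raw: bytes) -> list:
    """Decode with surrogateescape and return the resulting code points."""
    return [ord(ch) for ch in raw.decode("utf-8", errors="surrogateescape")]


a = b"\xef\xbf\xbf"  # valid UTF-8, decodes to U+FFFF
b = b"\xef\xbf\xc0"  # invalid, each byte is escaped to U+DCxx

# As byte strings, a sorts before b (they differ only in the last byte).
bytes_ordered = a < b

# After decoding, the order flips: U+FFFF compares greater than U+DCEF.
wide_ordered = code_points(a) < code_points(b)

print(bytes_ordered, wide_ordered)
```

Whether that flip is acceptable is exactly the point under discussion;
the sketch only shows that byte order and decoded order can disagree.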
However, after thinking about this some more, my idea (just use a
liberal UTF-8 variant) does not work given the APIs we have, in the
sense that code that works in C.UTF-8 today will stop working under this
hypothetical new locale.
For example, for mbrlen (S, N, PS), we have this requirement:

    If the first N bytes possibly form a valid multibyte character but
    the character is incomplete, the return value is ‘(size_t) -2’.
    Otherwise the multibyte character sequence is invalid and the
    return value is ‘(size_t) -1’.

If every byte sequence is valid, then mbrlen can never return
(size_t) -2. It would have to produce surrogate encoding instead.
But this means that detection of valid but incomplete UTF-8 sequences
(say at buffer boundaries) is no longer possible. And that can't be
good because we would produce unexpected wide characters around
buffer boundaries.
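
The buffer-boundary problem can be sketched in Python. The assumption
here is that Python's incremental decoder, which emits nothing for a
trailing incomplete sequence, is loosely analogous to mbrlen returning
(size_t) -2, and that decoding each buffer on its own models a charset
in which every byte sequence is already valid. Neither is glibc code.

```python
import codecs

data = b"\xc4\xbf"  # UTF-8 for U+013F, split across two buffers below
chunks = [data[:1], data[1:]]

# Stateful decoding: the lone lead byte is held back as "incomplete",
# so nothing is emitted until the continuation byte arrives.
decoder = codecs.getincrementaldecoder("utf-8")(errors="surrogateescape")
stateful = [decoder.decode(chunk) for chunk in chunks]

# "Every byte sequence is valid": each buffer is decoded independently,
# so the lead byte has to be turned into a wide character immediately.
eager = [chunk.decode("utf-8", errors="surrogateescape") for chunk in chunks]

whole = data.decode("utf-8", errors="surrogateescape")

print(stateful == ["", whole])   # the stateful path reassembles U+013F
print("".join(eager) == whole)   # the eager path yields two escapes instead
```

The eager path is also an instance of f(x) || f(y) differing from
f(x || y).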
I think this leaves us with a straight byte encoding, so either
ISO-8859-1 for simplicity (and with the cultural bias it brings), or the
musl-style shifted upper half encoding that your patch implements.
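
A straight byte encoding avoids both problems because it is stateless.
The sketch below assumes the shifted range is U+DF80..U+DFFF, as I
recall it from musl's C locale; this message does not state the range
the patch uses, and to_wide/from_wide are hypothetical helper names.

```python
def to_wide(raw: bytes) -> list:
    """Map each byte to one code point: ASCII unchanged, 0x80..0xFF shifted
    to the assumed range U+DF80..U+DFFF."""
    return [b if b < 0x80 else 0xDF00 + b for b in raw]


def from_wide(wide) -> bytes:
    """Inverse of to_wide for values produced by it."""
    return bytes(w if w < 0x80 else w - 0xDF00 for w in wide)


all_bytes = bytes(range(256))

# Every byte round-trips.
round_trips = from_wide(to_wide(all_bytes)) == all_bytes

# Byte order is preserved, including the pair from the quoted example.
order_kept = to_wide(b"\xef\xbf\xbf") < to_wide(b"\xef\xbf\xc0")

# Conversion commutes with concatenation, so buffer boundaries are harmless.
concat_ok = to_wide(b"\xc4") + to_wide(b"\xbf") == to_wide(b"\xc4\xbf")

print(round_trips, order_kept, concat_ok)
```

With ISO-8859-1 the mapping would simply be the identity on 0..255,
with the same three properties.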
In the end, enabling UTF-8 (or some variant) by default is probably not
that important because it directly impacts mostly the wide character
interfaces. Those are not widely used for a variety of reasons (one
probably being that our implementation is so incredibly slow).
Thanks,
Florian