Re: [PATCH v9] POSIX locale covers every byte [BZ# 29511]

public inbox for libc-alpha@sourceware.org

From: Florian Weimer <fweimer@redhat.com>
To: наб <nabijaczleweli@nabijaczleweli.xyz>
Cc: libc-alpha@sourceware.org,  Victor Stinner <vstinner@redhat.com>
Subject: Re: [PATCH v9] POSIX locale covers every byte [BZ# 29511]
Date: Wed, 26 Apr 2023 23:27:23 +0200
Message-ID: <871qk6wczo.fsf@oldenburg.str.redhat.com>
In-Reply-To: <es43pxh5nu2eqshlx2ujtpl77afmqtef5s3jacdsgzrcd7l6m6@pgrmyzdxsjel>

* наб:

>> I've thought about this some more, and I don't think this is the
>> direction we should be going in.
>> 
>> * Add a UTF-8SE charset to glibc: it's UTF-8 with surrogate encoding (in
>>   the Python style).  It should have the property that it can encode
>>   every byte string as a string of wchar_t characters, and convert the
>>   result back.  It's not entirely trivial because we need to handle
>>   partial UTF-8 sequences at the end of the buffer carefully.  There
>>   might be some warts regarding EILSEQ handling lurking there.  Like the
>>   Python approach, it is somewhat imperfect because it's not preserving
>>   identity under string concatenation, i.e. f(x) || f(y) is not always
>>   equal to f(x || y), but that's just unavoidable.
>> 
>> * Switch the charset for the default C locale to UTF-8SE.  This matches
>>   the POSIX requirement that every byte can be encoded.

> The main point of LC_CTYPE=POSIX as specified is that it allows you to
> process paths (which are sequences of bytes, not characters) in a sane
> way ‒ part of that is that collation needs to be correct, so maybe, as a
> smoke test, "[a, b, c] < [a, b, c+1] for all a,b,c".
>
>   >>> b'\xc4\xbf'.decode('UTF-8', errors='surrogateescape')
>   'Ŀ'
>   >>> b'\xc4\xc0'.decode('UTF-8', errors='surrogateescape')
>   '\udcc4\udcc0'
>   >>>
>   >>> [*map(ord, b'\xc4\xbf'.decode('UTF-8', errors='surrogateescape'))]
>   [319]
>   >>> [*map(ord, b'\xc4\xc0'.decode('UTF-8', errors='surrogateescape'))]
>   [56516, 56512]
> which, I mean, sure, maybe that's sensible (I wouldn't say so), but
>   >>> b'\xef\xbf\xbf'.decode('UTF-8', errors='surrogateescape')
>   '\uffff'
>   >>> b'\xef\xbf\xc0'.decode('UTF-8', errors='surrogateescape')
>   '\udcef\udcbf\udcc0'
>   >>>
>   >>> [*map(ord, b'\xef\xbf\xbf'.decode('UTF-8', errors='surrogateescape'))]
>   [65535]
>   >>> [*map(ord, b'\xef\xbf\xc0'.decode('UTF-8', errors='surrogateescape'))]
>   [56559, 56511, 56512]
>
> Which means you can't process arbitrary data (pathnames) in a way that
> makes sense. In my opinion this would be /worse/ than the current
> behaviour, behaving erratically in the presence of Some Data instead of
> simply not supporting it.

Sorry for letting this linger for so long from my side, too.

Regarding the above, I'm not sure I find this convincing.  That's just
business as usual with collation?
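
For reference, the ordering property from the quoted smoke test can be
checked mechanically.  This is only a sketch that reuses the byte strings
from the quoted Python session; the helper name decode_points is made up
for illustration:

```python
def decode_points(raw: bytes) -> list[int]:
    """Code points produced by Python's UTF-8 + surrogateescape decoding."""
    return [ord(ch) for ch in raw.decode("utf-8", errors="surrogateescape")]

lo = b"\xef\xbf\xbf"   # valid UTF-8, decodes to U+FFFF
hi = b"\xef\xbf\xc0"   # invalid UTF-8, decodes to three escaped bytes

# As byte strings, lo sorts before hi.
byte_order_ok = lo < hi

# As code point sequences the order flips: [65535] vs [56559, 56511, 56512].
code_point_order_ok = decode_points(lo) < decode_points(hi)

print(byte_order_ok, code_point_order_ok)
```

The first comparison holds and the second does not, which is the
inversion the quoted example points at.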

However, after thinking about this some more, my idea (just use a
liberal UTF-8 variant) does not work given the APIs we have, in the
sense that code that works in C.UTF-8 today will stop working under this
hypothetical new locale.

For example, for mbrlen (S, N, PS), we have this requirement:

     If the first N bytes possibly form a valid multibyte character but
     the character is incomplete, the return value is ‘(size_t) -2’.
     Otherwise the multibyte character sequence is invalid and the
     return value is ‘(size_t) -1’.

If every byte sequence is valid, then mbrlen can never return
(size_t) -2.  It would have to produce surrogate encoding instead.
But this means that detection of valid but incomplete UTF-8 sequences
(say at buffer boundaries) is no longer possible.  And that can't be
good because we would produce unexpected wide characters around
buffer boundaries.
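
The buffer boundary effect can be sketched in Python, as an analogy
only: this is not glibc code, and Python's incremental decoder merely
stands in for the "incomplete" state that mbrlen reports with
(size_t) -2.  A two-byte character is split across two buffers and
decoded once without and once with that state:

```python
import codecs

data = "Ŀ".encode("utf-8")            # b'\xc4\xbf', one character
first, second = data[:1], data[1:]    # split at a buffer boundary

# No "incomplete" state: each buffer must be decoded on its own, so the
# lone lead byte and the lone continuation byte are both escaped.
stateless = (first.decode("utf-8", errors="surrogateescape")
             + second.decode("utf-8", errors="surrogateescape"))

# With an "incomplete" state: the lead byte is held back until the
# continuation byte arrives, and one character comes out.
decoder = codecs.getincrementaldecoder("utf-8")(errors="surrogateescape")
held = decoder.decode(first)                      # '' (nothing emitted yet)
stateful = held + decoder.decode(second, final=True)

print([hex(ord(c)) for c in stateless])   # ['0xdcc4', '0xdcbf']
print([hex(ord(c)) for c in stateful])    # ['0x13f']
```

Both results encode back to the same two bytes, so the round trip
survives, but the wide characters differ depending on where the buffer
happened to end.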

I think this leaves us with a straight byte encoding, so either
ISO-8859-1 for simplicity (and with the cultural bias it brings), or the
musl-style shifted upper half encoding that your patch implements.
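
A rough sketch of the two byte encodings, in Python for brevity.  The
names HIGH_BASE, to_wide and from_wide are made up for illustration, and
the offset is an assumption modelled on musl's convention of placing
bytes 0x80..0xFF at U+DF80..U+DFFF; the exact range used by the patch is
not quoted in this message:

```python
HIGH_BASE = 0xDF00  # assumption: byte 0x80 lands on U+DF80, 0xFF on U+DFFF

def to_wide(raw: bytes) -> list[int]:
    """Map each byte to one wide character; ASCII stays as-is."""
    return [b if b < 0x80 else HIGH_BASE + b for b in raw]

def from_wide(wide: list[int]) -> bytes:
    """Inverse of to_wide; assumes the input was produced by to_wide."""
    return bytes(w if w < 0x80 else w - HIGH_BASE for w in wide)

all_bytes = bytes(range(256))

# ISO-8859-1: every byte becomes the code point with the same value.
latin1_points = [ord(ch) for ch in all_bytes.decode("latin-1")]

# Shifted upper half: still one wide character per byte, but the upper
# half is moved out of the range of ordinary characters.
shifted_points = to_wide(all_bytes)

x, y = b"\xc4", b"\xbf"
print(to_wide(x) + to_wide(y) == to_wide(x + y))   # True
print(from_wide(shifted_points) == all_bytes)      # True
```

Because either mapping is strictly per byte, there is no incomplete
state to track, conversion commutes with concatenation, and byte order
carries over to wide character order.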

In the end, enabling UTF-8 (or some variant) by default is probably not
that important because it directly impacts mostly the wide character
interfaces.  Those are not widely used for a variety of reasons (one
probably being that our implementation is so incredibly slow).

Thanks,
Florian
