From mboxrd@z Thu Jan  1 00:00:00 1970
From: Florian Weimer <fweimer@redhat.com>
To: наб <nabijaczleweli@nabijaczleweli.xyz>
Cc: libc-alpha@sourceware.org,  Victor Stinner <vstinner@redhat.com>
Subject: Re: [PATCH v9] POSIX locale covers every byte [BZ# 29511]
References: <20230109151747.j3b7ls2kumcxa4px@tarta.nabijaczleweli.xyz>
	<20230207141645.fox6f5w6fn524bch@tarta.nabijaczleweli.xyz>
	<87lel1d3e1.fsf@oldenburg.str.redhat.com>
	<es43pxh5nu2eqshlx2ujtpl77afmqtef5s3jacdsgzrcd7l6m6@pgrmyzdxsjel>
Date: Wed, 26 Apr 2023 23:27:23 +0200
Message-ID: <871qk6wczo.fsf@oldenburg.str.redhat.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
List-Id: <libc-alpha.sourceware.org>

* наб:

>> I've thought about this some more, and I don't think this is the
>> direction we should be going in.
>>
>> * Add a UTF-8SE charset to glibc: it's UTF-8 with surrogate encoding (in
>>   the Python style).  It should have the property that it can encode
>>   every byte string as a string of wchar_t characters, and convert the
>>   result back.  It's not entirely trivial because we need to handle
>>   partial UTF-8 sequences at the end of the buffer carefully.  There
>>   might be some warts regarding EILSEQ handling lurking there.  Like the
>>   Python approach, it is somewhat imperfect because it's not preserving
>>   identity under string concatenation, i.e. f(x) || f(y) is not always
>>   equal to f(x || y), but that's just unavoidable.
>>
>> * Switch the charset for the default C locale to UTF-8SE.  This matches
>>   the POSIX requirement that every byte can be encoded.

> The main point of LC_CTYPE=POSIX as specified is that it allows you to
> process paths (which are sequences of bytes, not characters) in a sane
> way ‒ part of that is that collation needs to be correct, so maybe, as a
> smoke test, "[a, b, c] < [a, b, c+1] for all a,b,c".
>
>   >>> b'\xc4\xbf'.decode('UTF-8', errors='surrogateescape')
>   'Ŀ'
>   >>> b'\xc4\xc0'.decode('UTF-8', errors='surrogateescape')
>   '\udcc4\udcc0'
>   >>>
>   >>> [*map(ord, b'\xc4\xbf'.decode('UTF-8', errors='surrogateescape'))]
>   [319]
>   >>> [*map(ord, b'\xc4\xc0'.decode('UTF-8', errors='surrogateescape'))]
>   [56516, 56512]
> which, I mean, sure, maybe that's sensible (I wouldn't say so), but
>   >>> b'\xef\xbf\xbf'.decode('UTF-8', errors='surrogateescape')
>   '\uffff'
>   >>> b'\xef\xbf\xc0'.decode('UTF-8', errors='surrogateescape')
>   '\udcef\udcbf\udcc0'
>   >>>
>   >>> [*map(ord, b'\xef\xbf\xbf'.decode('UTF-8', errors='surrogateescape'))]
>   [65535]
>   >>> [*map(ord, b'\xef\xbf\xc0'.decode('UTF-8', errors='surrogateescape'))]
>   [56559, 56511, 56512]
>
> Which means you can't process arbitrary data (pathnames) in a way that
> makes sense. In my opinion this would be /worse/ than the current
> behaviour, behaving erratically in the presence of Some Data instead of
> simply not supporting it.

Sorry for letting this linger for so long from my side, too.

Regarding the above, I'm not sure I find this convincing.  That's just
business as usual with collation?

However, after thinking about this some more, my idea (just use a
liberal UTF-8 variant) does not work given the APIs we have, in the
sense that code that works in C.UTF-8 today will stop working under this
hypothetical new locale.

For example, for mbrlen (S, N, PS), we have this requirement:

     If the first N bytes possibly form a valid multibyte character but
     the character is incomplete, the return value is ‘(size_t) -2’.
     Otherwise the multibyte character sequence is invalid and the
     return value is ‘(size_t) -1’.

If every byte sequence is valid, then mbrlen can never return
(size_t) -2.  It would have to produce surrogate encoding instead.
But this means that detection of valid but incomplete UTF-8 sequences
(say at buffer boundaries) is no longer possible.  And that can't be
good because we would produce unexpected wide characters around
buffer boundaries.
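
To make the buffer-boundary problem concrete, here is a rough sketch in
Python rather than C.  It is only an analogy: Python's strict incremental
UTF-8 decoder stands in for an mbrlen that can still report "incomplete",
and per-buffer surrogateescape decoding stands in for a hypothetical
locale in which every byte sequence is valid.  The variable names are
made up for illustration and nothing here is glibc code.

```python
import codecs

data = "\u00e9".encode("utf-8")       # b'\xc3\xa9', one two-byte character
first, second = data[:1], data[1:]    # the buffer boundary falls mid-character

# Strict incremental decoder: the lone lead byte is held back as
# "incomplete" until the continuation byte arrives.
strict = codecs.getincrementaldecoder("utf-8")(errors="strict")
strict_chunks = [strict.decode(first), strict.decode(second, final=True)]

# Every byte string decodes on its own: each buffer is converted
# immediately, so both halves turn into escape surrogates.
eager_chunks = [chunk.decode("utf-8", errors="surrogateescape")
                for chunk in (first, second)]

# Decoding the unsplit input for comparison.
whole = data.decode("utf-8", errors="surrogateescape")

print(ascii(strict_chunks))               # ['', '\xe9']
print(ascii(eager_chunks))                # ['\udcc3', '\udca9']
print("".join(eager_chunks) == whole)     # False
```

The last line is the f(x) || f(y) != f(x || y) case from the quoted
proposal: the same bytes produce different wide characters depending on
where the buffer happened to end.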

I think this leaves us with a straight byte encoding, so either
ISO-8859-1 for simplicity (and with the cultural bias it brings), or the
musl-style shifted upper half encoding that your patch implements.
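
For illustration, a minimal sketch of what a shifted upper half byte
encoding could look like.  It assumes the high bytes 0x80..0xFF are moved
to U+DF80..U+DFFF, which is my understanding of the musl approach; the
offset and the helper names byte_to_wchar and wchar_to_byte are
assumptions for this example and are not taken from the patch.

```python
HIGH_OFFSET = 0xDF00  # assumption: bytes 0x80..0xFF map to U+DF80..U+DFFF


def byte_to_wchar(b: int) -> int:
    """Map one byte to a code point: ASCII is unchanged, the upper half is shifted."""
    return b if b < 0x80 else HIGH_OFFSET + b


def wchar_to_byte(wc: int) -> int:
    """Inverse mapping; anything outside the two ranges is not representable."""
    if 0 <= wc < 0x80:
        return wc
    if 0xDF80 <= wc <= 0xDFFF:
        return wc - HIGH_OFFSET
    raise ValueError("code point not representable")  # analogue of EILSEQ


codes = [byte_to_wchar(b) for b in range(256)]

# Every byte converts statelessly and converts back to itself.
round_trip_ok = all(wchar_to_byte(c) == b for b, c in enumerate(codes))

# Byte order is preserved, so comparing code points agrees with comparing bytes.
order_preserved = codes == sorted(codes)

print(round_trip_ok, order_preserved)   # True True
```

Because each byte maps to exactly one wide character, there is no
"incomplete" state at buffer boundaries, and the "[a, b, c] < [a, b, c+1]"
smoke test from the quoted message holds by construction.  ISO-8859-1
would have the same two properties with an offset of zero.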

In the end, enabling UTF-8 (or some variant) by default is probably not
that important because it directly impacts mostly the wide character
interfaces.  Those are not widely used for a variety of reasons (one
probably being that our implementation is so incredibly slow).

Thanks,
Florian