From: Florian Weimer
To: наб
Cc: libc-alpha@sourceware.org, Victor Stinner
Subject: Re: [PATCH v9] POSIX locale covers every byte [BZ# 29511]
References: <20230109151747.j3b7ls2kumcxa4px@tarta.nabijaczleweli.xyz>
 <20230207141645.fox6f5w6fn524bch@tarta.nabijaczleweli.xyz>
 <87lel1d3e1.fsf@oldenburg.str.redhat.com>
Date: Wed, 26 Apr 2023 23:27:23 +0200
Message-ID: <871qk6wczo.fsf@oldenburg.str.redhat.com>

* наб:

>> I've thought about this some more, and I don't think this is the
>> direction we should be going in.
>>
>> * Add a UTF-8SE charset to glibc: it's UTF-8 with surrogate encoding (in
>>   the Python style). It should have the property that it can encode
>>   every byte string as a string of wchar_t characters, and convert the
>>   result back. It's not entirely trivial because we need to handle
>>   partial UTF-8 sequences at the end of the buffer carefully. There
>>   might be some warts regarding EILSEQ handling lurking there. Like the
>>   Python approach, it is somewhat imperfect because it's not preserving
>>   identity under string concatenation, i.e. f(x) || f(y) is not always
>>   equal to f(x || y), but that's just unavoidable.
>>
>> * Switch the charset for the default C locale to UTF-8SE. This matches
>>   the POSIX requirement that every byte can be encoded.

> The main point of LC_CTYPE=POSIX as specified is that it allows you to
> process paths (which are sequences of bytes, not characters) in a sane
> way ‒ part of that is that collation needs to be correct, so maybe, as a
> smoke test, "[a, b, c] < [a, b, c+1] for all a,b,c".
>
> >>> b'\xc4\xbf'.decode('UTF-8', errors='surrogateescape')
> 'Ŀ'
> >>> b'\xc4\xc0'.decode('UTF-8', errors='surrogateescape')
> '\udcc4\udcc0'
> >>>
> >>> [*map(ord, b'\xc4\xbf'.decode('UTF-8', errors='surrogateescape'))]
> [319]
> >>> [*map(ord, b'\xc4\xc0'.decode('UTF-8', errors='surrogateescape'))]
> [56516, 56512]
> which, I mean, sure, maybe that's sensible (I wouldn't say so), but
> >>> b'\xef\xbf\xbf'.decode('UTF-8', errors='surrogateescape')
> '\uffff'
> >>> b'\xef\xbf\xc0'.decode('UTF-8', errors='surrogateescape')
> '\udcef\udcbf\udcc0'
> >>>
> >>> [*map(ord, b'\xef\xbf\xbf'.decode('UTF-8', errors='surrogateescape'))]
> [65535]
> >>> [*map(ord, b'\xef\xbf\xc0'.decode('UTF-8', errors='surrogateescape'))]
> [56559, 56511, 56512]
>
> Which means you can't process arbitrary data (pathnames) in a way that
> makes sense. In my opinion this would be /worse/ than the current
> behaviour, behaving erratically in the presence of Some Data instead of
> simply not supporting it.

Sorry for letting this linger for so long from my side, too.

Regarding the above, I'm not sure I find this convincing. That's just
business as usual with collation?

However, after thinking about this some more, my idea (just use a
liberal UTF-8 variant) does not work given the APIs we have, in the
sense that code that works in C.UTF-8 today will stop working under this
hypothetical new locale. For example, for mbrlen (S, N, PS), we have
this requirement:

     If the first N bytes possibly form a valid multibyte character but
     the character is incomplete, the return value is ‘(size_t) -2’.
     Otherwise the multibyte character sequence is invalid and the
     return value is ‘(size_t) -1’.

If every byte sequence is valid, then mbrlen can never return
(size_t) -2. It would have to produce surrogate encoding instead. But
this means that detection of valid but incomplete UTF-8 sequences (say
at buffer boundaries) is no longer possible. And that can't be good
because we would produce unexpected wide characters around buffer
boundaries.

I think this leaves us with a straight byte encoding, so either
ISO-8859-1 for simplicity (and with the cultural bias it brings), or the
musl-style shifted upper half encoding that your patch implements.

In the end, enabling UTF-8 (or some variant) by default is probably not
that important because it directly impacts mostly the wide character
interfaces. Those are not widely used for a variety of reasons (one
probably being that our implementation is so incredibly slow).

Thanks,
Florian
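
For illustration only, here is a minimal Python sketch of the two problems discussed in this message: the buffer-boundary issue with a surrogate-escaping UTF-8 variant, and the collation smoke test from the quoted text. It is not glibc code. The helper names decode_se and decode_bytewise are hypothetical, and the U+DF80..U+DFFF target range for high bytes is an assumption about what the "shifted upper half" encoding looks like. Python's strict incremental decoder stands in for mbrlen reporting an incomplete sequence.

```python
import codecs


def decode_se(data: bytes) -> str:
    """Stand-in for the hypothetical UTF-8SE charset: invalid bytes
    become lone surrogates instead of raising an error."""
    return data.decode("utf-8", errors="surrogateescape")


def decode_bytewise(data: bytes) -> str:
    """Straight byte encoding: ASCII is unchanged, bytes 0x80..0xFF are
    shifted to U+DF80..U+DFFF (assumed range, see the note above)."""
    return "".join(chr(b) if b < 0x80 else chr(0xDF00 + b) for b in data)


whole = "\u00e9".encode("utf-8")   # b'\xc3\xa9', one valid two-byte character
head, tail = whole[:1], whole[1:]  # a buffer boundary splits the character

# With surrogate escaping every byte string decodes, so the two halves
# turn into two escaped bytes rather than the original character.
joined = decode_se(head) + decode_se(tail)
assert decode_se(whole) == "\u00e9"
assert joined == "\udcc3\udca9"
assert joined != decode_se(whole)  # f(x) || f(y) != f(x || y)

# A strict incremental decoder can still say "incomplete, wait for more",
# which is the role of the (size_t) -2 return value of mbrlen.
inc = codecs.getincrementaldecoder("utf-8")()
assert inc.decode(head) == ""      # nothing produced yet
assert inc.decode(tail) == "\u00e9"

# The per-byte mapping is context-free, so concatenation is preserved.
assert decode_bytewise(head) + decode_bytewise(tail) == decode_bytewise(whole)

# Collation smoke test from the quoted example: byte order should survive.
a, b = b"\xef\xbf\xbf", b"\xef\xbf\xc0"
assert a < b
assert not (decode_se(a) < decode_se(b))        # order is inverted
assert decode_bytewise(a) < decode_bytewise(b)  # order is preserved
```

The per-byte mapping keeps both properties because each byte is translated independently and monotonically. The surrogate-escaping decoder loses them because the result for one byte depends on its neighbours.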