From: Florian Weimer
To: наб
Cc: libc-alpha@sourceware.org, Victor Stinner
Subject: Re: [PATCH v9] POSIX locale covers every byte [BZ# 29511]
Date: Mon, 13 Feb 2023 15:52:06 +0100
Message-ID: <87lel1d3e1.fsf@oldenburg.str.redhat.com>
In-Reply-To: <20230207141645.fox6f5w6fn524bch@tarta.nabijaczleweli.xyz>
	("наб"'s message of "Tue, 7 Feb 2023 15:16:45 +0100")
References: <20230109151747.j3b7ls2kumcxa4px@tarta.nabijaczleweli.xyz>
	<20230207141645.fox6f5w6fn524bch@tarta.nabijaczleweli.xyz>

* наб:

> This largely duplicates the ASCII code with the error path changed
>
> There are two user-facing changes:
>   * nl_langinfo(CODESET) is "POSIX" instead of "ANSI_X3.4-1968"
>   * mbrtowc() and friends return b if b <= 0x7F else +b
>
> Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively,
>   (a) is 1-byte, stateless, and contains 256 characters
>   (b) they collate in byte order
>   (c) the first 128 characters are equivalent to ASCII (like previous)
> cf. https://www.austingroupbugs.net/view.php?id=663 for a summary of
> changes to the standard;
> in short, this means that mbrtowc() must never fail and must return
>   b if b <= 0x7F else ab+c for all bytes b
>   where c is some constant >=0x80
>     and a is a positive integer constant
>
> By strategically picking c= we land at the tail-end of the
> Unicode Low Surrogate Area at DC00-DFFF, described as
> > Isolated surrogate code points have no interpretation;
> > consequently, no character code charts or names lists
> > are provided for this range.
> and match musl

I've thought about this some more, and I don't think this is the
direction we should be going in.

* Add a UTF-8SE charset to glibc: it's UTF-8 with surrogate encoding
  (in the Python style).  It should have the property that it can encode
  every byte string as a string of wchar_t characters, and convert the
  result back.  It's not entirely trivial because we need to handle
  partial UTF-8 sequences at the end of the buffer carefully.  There
  might be some warts regarding EILSEQ handling lurking there.  Like the
  Python approach, it is somewhat imperfect because it's not preserving
  identity under string concatenation, i.e. f(x) || f(y) is not always
  equal to f(x || y), but that's just unavoidable.

* Switch the charset for the default C locale to UTF-8SE.  This matches
  the POSIX requirement that every byte can be encoded.

* Work with POSIX to drop the requirement that the C locale needs to be
  a single-byte locale.

* (Optional, somewhat unrelated.)  Add a generic mechanism so that UTF-8
  locales can be used as UTF-8SE without recompilation.

Thanks,
Florian
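
For illustration only: the concatenation caveat and the partial-sequence
problem from the first bullet can be reproduced with Python's existing
"surrogateescape" error handler, which is presumably what "in the Python
style" refers to.  This is a minimal sketch, not a proposal for the glibc
implementation; the names f and f_inverse are made up for the example and
stand in for the byte-string-to-wide-string conversion and its reverse.

```python
import codecs


def f(data: bytes) -> str:
    # Hypothetical stand-in for the UTF-8SE decode step: valid UTF-8 is
    # decoded normally, any other byte 0x80..0xFF becomes U+DC80..U+DCFF.
    return data.decode("utf-8", errors="surrogateescape")


def f_inverse(text: str) -> bytes:
    # Reverse direction: escaped surrogates turn back into the raw bytes.
    return text.encode("utf-8", errors="surrogateescape")


# Every byte string can be converted and converted back.
every_byte = bytes(range(256))
round_trip_ok = f_inverse(f(every_byte)) == every_byte

# x and y are the two halves of the UTF-8 encoding of U+00E9.
x, y = b"\xc3", b"\xa9"
separate = f(x) + f(y)  # two escaped surrogates
joined = f(x + y)       # one real character

print(round_trip_ok)    # True
print(ascii(separate))  # '\udcc3\udca9'
print(ascii(joined))    # '\xe9'

# Partial sequence at the end of a buffer: a streaming decoder has to hold
# the lead byte back until it knows whether a continuation byte follows.
decoder = codecs.getincrementaldecoder("utf-8")(errors="surrogateescape")
first = decoder.decode(x)   # nothing emitted yet, 0xC3 is buffered
second = decoder.decode(y)  # now the complete character comes out

# If the input really ends after the lead byte, it is escaped instead.
final_decoder = codecs.getincrementaldecoder("utf-8")(errors="surrogateescape")
truncated = final_decoder.decode(x, final=True)
```

Here f(x) || f(y) and f(x || y) differ as wide strings, yet both convert
back to the same two bytes, so the round trip from bytes still holds; only
the wide-string side is ambiguous.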