From mboxrd@z Thu Jan  1 00:00:00 1970
From: Florian Weimer <fweimer@redhat.com>
To: наб <nabijaczleweli@nabijaczleweli.xyz>
Cc: libc-alpha@sourceware.org,  Victor Stinner <vstinner@redhat.com>
Subject: Re: [PATCH v9] POSIX locale covers every byte [BZ# 29511]
References: <20230109151747.j3b7ls2kumcxa4px@tarta.nabijaczleweli.xyz>
	<20230207141645.fox6f5w6fn524bch@tarta.nabijaczleweli.xyz>
Date: Mon, 13 Feb 2023 15:52:06 +0100
In-Reply-To: <20230207141645.fox6f5w6fn524bch@tarta.nabijaczleweli.xyz>
	("наб"'s message of "Tue, 7 Feb 2023 15:16:45 +0100")
Message-ID: <87lel1d3e1.fsf@oldenburg.str.redhat.com>
List-Id: <libc-alpha.sourceware.org>

* наб:

> This largely duplicates the ASCII code with the error path changed
>
> There are two user-facing changes:
>   * nl_langinfo(CODESET) is "POSIX" instead of "ANSI_X3.4-1968"
>   * mbrtowc() and friends return b if b <= 0x7F else <UDF00>+b
>
> Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively,
>   (a) is 1-byte, stateless, and contains 256 characters
>   (b) they collate in byte order
>   (c) the first 128 characters are equivalent to ASCII (like previous)
> cf. https://www.austingroupbugs.net/view.php?id=663 for a summary of
> changes to the standard;
> in short, this means that mbrtowc() must never fail and must return
>   b if b <= 0x7F else ab+c for all bytes b
>   where c is some constant >=0x80
>     and a is a positive integer constant
>
> By strategically picking c=<UDF00> we land at the tail-end of the
> Unicode Low Surrogate Area at DC00-DFFF, described as
>   > Isolated surrogate code points have no interpretation;
>   > consequently, no character code charts or names lists
>   > are provided for this range.
> and match musl
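
For reference, the byte-to-wide-character mapping described in the quoted
patch can be modelled roughly as below.  This is only an illustrative
Python sketch, not glibc code: the names byte_to_wchar, wchar_to_byte and
HIGH_BASE are hypothetical, and the only facts taken from the patch
description are "b if b <= 0x7F else <UDF00>+b" and byte-order collation.

```python
# Hypothetical model of the mapping in the patch description; not glibc code.
HIGH_BASE = 0xDF00  # the constant c = <UDF00> from the quoted text


def byte_to_wchar(b: int) -> int:
    """Map one byte to a wide-character value: ASCII unchanged, else 0xDF00 + b."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte")
    return b if b <= 0x7F else HIGH_BASE + b


def wchar_to_byte(wc: int) -> int:
    """Inverse mapping; rejects values outside the 256-character repertoire."""
    if 0 <= wc <= 0x7F:
        return wc
    if HIGH_BASE + 0x80 <= wc <= HIGH_BASE + 0xFF:
        return wc - HIGH_BASE
    raise ValueError("not representable in this model")


# Every byte converts and round-trips, so conversion never has to fail.
assert all(wchar_to_byte(byte_to_wchar(b)) == b for b in range(256))

# High bytes land in U+DF80..U+DFFF, the tail of the low-surrogate area.
assert byte_to_wchar(0x80) == 0xDF80
assert byte_to_wchar(0xFF) == 0xDFFF

# The mapping is strictly increasing, so wide values sort in byte order.
mapped = [byte_to_wchar(b) for b in range(256)]
assert mapped == sorted(mapped)
```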

I've thought about this some more, and I don't think this is the
direction we should be going in.  I'd suggest the following instead:

* Add a UTF-8SE charset to glibc: it's UTF-8 with surrogate encoding (in
  the Python style).  It should have the property that every byte
  string can be converted to a string of wchar_t characters and back
  again without loss.  Implementing this is not entirely trivial because
  partial UTF-8 sequences at the end of the buffer need careful
  handling, and there might be some warts regarding EILSEQ handling
  lurking there.  Like the Python approach, it is somewhat imperfect
  because converting two strings separately and concatenating the
  results can differ from converting their concatenation, i.e.
  f(x) || f(y) is not always equal to f(x || y), but that's just
  unavoidable.

* Switch the charset for the default C locale to UTF-8SE.  This matches
  the POSIX requirement that every byte can be encoded.

* Work with POSIX to drop the requirement that the C locale be a
  single-byte locale.

* (Optional, somewhat unrelated.) Add a generic mechanism so that UTF-8
  locales can be used as UTF-8SE without recompilation.
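
To make the round-trip and concatenation points concrete, here is a small
sketch that uses Python's existing "surrogateescape" error handler as a
stand-in for the proposed UTF-8SE charset.  UTF-8SE does not exist in
glibc, so this only illustrates the Python behaviour the proposal is
modelled on; the names f and f_inv are illustrative.

```python
import codecs


def f(data: bytes) -> str:
    """Bytes to text; bytes that are not valid UTF-8 become U+DC80..U+DCFF."""
    return data.decode("utf-8", errors="surrogateescape")


def f_inv(text: str) -> bytes:
    """Text back to bytes; escaped surrogates turn back into the raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# Every byte string converts and converts back, valid UTF-8 or not.
raw = b"caf\xc3\xa9 \xff\xfe"
assert f(raw) == "caf\u00e9 \udcff\udcfe"
assert f_inv(f(raw)) == raw

# Conversion is not preserved under concatenation: the two halves of a
# UTF-8 sequence are escaped separately, but decoded together when joined.
x, y = b"\xc3", b"\xa9"
assert f(x) + f(y) == "\udcc3\udca9"
assert f(x + y) == "\u00e9"
assert f_inv(f(x) + f(y)) == x + y  # the bytes themselves still round-trip

# Partial sequence at the end of a buffer: an incremental decoder has to
# hold the lead byte back until it sees more input or end of input.
decoder = codecs.getincrementaldecoder("utf-8")(errors="surrogateescape")
assert decoder.decode(b"\xc3") == ""             # lead byte is buffered
assert decoder.decode(b"\xa9") == "\u00e9"       # sequence completed
assert decoder.decode(b"\xc3", final=True) == "\udcc3"  # truncated at EOF
```

The last three assertions correspond to the buffer-boundary case: whether a
trailing lead byte is an error or the start of a valid character is only
known once the next byte, or end of input, arrives.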

Thanks,
Florian