From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=K2CS=AR=nabijaczleweli.xyz=nabijaczleweli@sourceware.org>
Received: from tarta.nabijaczleweli.xyz (unknown [139.28.40.42])
	by sourceware.org (Postfix) with ESMTP id 789C53858C53
	for <libc-alpha@sourceware.org>; Wed, 26 Apr 2023 18:54:15 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 789C53858C53
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=nabijaczleweli.xyz
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=nabijaczleweli.xyz
Received: from tarta.nabijaczleweli.xyz (unknown [192.168.1.250])
	by tarta.nabijaczleweli.xyz (Postfix) with ESMTPSA id 8E8616C38;
	Wed, 26 Apr 2023 20:54:13 +0200 (CEST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=nabijaczleweli.xyz;
	s=202211; t=1682535253;
	bh=4Yo+Hg6mjxvzBs08jAwrYW86Ych2vrgS7FAM0TRGFwk=;
	h=Date:From:To:Cc:Subject:References:In-Reply-To:From;
	b=rGMH297KP2YLrxW9qux4Hc3dVHiU49c5LKeC26uT2C5L+uwKUVgQwDvSb5nEsKkTm
	 DKvoq4wmvhMPmpH/c/XTMhhZBdmHDD+fiA1c1dROFXwQrijqXok/Ex2Balxom45fjq
	 y/EJHz7jmjD6aOwtmTWWcM30YcuUTB1o471ZhOwpiLNqw6iTChJaOsdTJqPGtbyZUe
	 81l2u/NsMlPXa7OX9liHuTWFSn2GyIKudYOB6RhHXJJqDbIrMTdt2DptMqBOi/cbtD
	 xdxPPN8ymYQPeYDZoLIQywSjcOumXPeTfPmAole9Ot5LV17CMBhA/EuN1DR3p4V2hy
	 +a6w/61XkZXEA==
Date: Wed, 26 Apr 2023 20:54:12 +0200
From: =?utf-8?B?0L3QsNCx?= <nabijaczleweli@nabijaczleweli.xyz>
To: Florian Weimer <fweimer@redhat.com>
Cc: libc-alpha@sourceware.org, Victor Stinner <vstinner@redhat.com>
Subject: Re: [PATCH v9] POSIX locale covers every byte [BZ# 29511]
Message-ID: <es43pxh5nu2eqshlx2ujtpl77afmqtef5s3jacdsgzrcd7l6m6@pgrmyzdxsjel>
References: <20230109151747.j3b7ls2kumcxa4px@tarta.nabijaczleweli.xyz>
 <20230207141645.fox6f5w6fn524bch@tarta.nabijaczleweli.xyz>
 <87lel1d3e1.fsf@oldenburg.str.redhat.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha512;
	protocol="application/pgp-signature"; boundary="pweol33dldl42332"
Content-Disposition: inline
In-Reply-To: <87lel1d3e1.fsf@oldenburg.str.redhat.com>
User-Agent: NeoMutt/20230407
X-Spam-Status: No, score=-3.6 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,KAM_INFOUSMEBIZ,RDNS_DYNAMIC,SPF_HELO_PASS,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <libc-alpha.sourceware.org>


--pweol33dldl42332
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Hi! Long time, apologies.

On Mon, Feb 13, 2023 at 03:52:06PM +0100, Florian Weimer wrote:
> > This largely duplicates the ASCII code with the error path changed
> >
> > There are two user-facing changes:
> >   * nl_langinfo(CODESET) is "POSIX" instead of "ANSI_X3.4-1968"
> >   * mbrtowc() and friends return b if b <=3D 0x7F else <UDF00>+b
> >
> > Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively,
> >   (a) is 1-byte, stateless, and contains 256 characters
> >   (b) they collate in byte order
> >   (c) the first 128 characters are equivalent to ASCII (like previous)
> > cf. https://www.austingroupbugs.net/view.php?id=3D663 for a summary of
> > changes to the standard;
> > in short, this means that mbrtowc() must never fail and must return
> >   b if b <=3D 0x7F else ab+c for all bytes b
> >   where c is some constant >=3D0x80
> >     and a is a positive integer constant
> >
> > By strategically picking c=3D<UDF00> we land at the tail-end of the
> > Unicode Low Surrogate Area at DC00-DFFF, described as
> >   > Isolated surrogate code points have no interpretation;
> >   > consequently, no character code charts or names lists
> >   > are provided for this range.
> > and match musl
>=20
> I've thought about this some more, and I don't think this is the
> direction we should be going in.
>=20
> * Add a UTF-8SE charset to glibc: it's UTF-8 with surrogate encoding (in
>   the Python style).  It should have the property that it can encode
>   every byte string as a string of wchar_t characters, and convert the
>   result back.  It's not entirely trivial because we need to handle
>   partial UTF-8 sequences at the end of the buffer carefully.  There
>   might be some warts regarding EILSEQ handling lurking there.  Like the
>   Python approach, it is somewhat imperfect because it's not preserving
>   identity under string concatenation, i.e. f(x) || f(y) is not always
>   equal to f(x || y), but that's just unavoidable.
>=20
> * Switch the charset for the default C locale to UTF-8SE.  This matches
>   the POSIX requirement that every byte can be encoded.
The main point of LC_CTYPE=3DPOSIX as specified is that it allows you to
process paths (which are sequences of bytes, not characters) in a sane
way =E2=80=92 part of that is that collation needs to be correct, so maybe,=
 as a
smoke test, "[a, b, c] < [a, b, c+1] for all a,b,c".

  >>> b'\xc4\xbf'.decode('UTF-8', errors=3D'surrogateescape')
  '=C4=BF'
  >>> b'\xc4\xc0'.decode('UTF-8', errors=3D'surrogateescape')
  '\udcc4\udcc0'
  >>>
  >>> [*map(ord, b'\xc4\xbf'.decode('UTF-8', errors=3D'surrogateescape'))]
  [319]
  >>> [*map(ord, b'\xc4\xc0'.decode('UTF-8', errors=3D'surrogateescape'))]
  [56516, 56512]
which, I mean, sure, maybe that's sensible (I wouldn't say so), but
  >>> b'\xef\xbf\xbf'.decode('UTF-8', errors=3D'surrogateescape')
  '\uffff'
  >>> b'\xef\xbf\xc0'.decode('UTF-8', errors=3D'surrogateescape')
  '\udcef\udcbf\udcc0'
  >>>
  >>> [*map(ord, b'\xef\xbf\xbf'.decode('UTF-8', errors=3D'surrogateescape'=
))]
  [65535]
  >>> [*map(ord, b'\xef\xbf\xc0'.decode('UTF-8', errors=3D'surrogateescape'=
))]
  [56559, 56511, 56512]

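Here the bytewise-smaller ef bf bf decodes to U+FFFF, which sorts after
U+DCEF, the first code point of the bytewise-larger ef bf c0, so the
order flips. Which means you can't process arbitrary data (pathnames) in
a way that makes sense. In my opinion this would be /worse/ than the
current behaviour, behaving erratically in the presence of Some Data
instead of simply not supporting it.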
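
A minimal sketch of that smoke test, for anyone who wants to poke at it.
posix_decode() is a hypothetical stand-in for the per-byte mapping from
this patch, and utf8se_decode() uses Python's surrogateescape handler as
a stand-in for the proposed UTF-8SE; neither is glibc code.

```python
def posix_decode(data: bytes) -> list:
    """Per-byte mapping from the patch: b if b <= 0x7F else U+DF00 + b."""
    return [b if b <= 0x7F else 0xDF00 + b for b in data]


def utf8se_decode(data: bytes) -> list:
    """Assumed UTF-8SE behaviour, borrowed from Python's surrogateescape."""
    text = data.decode("utf-8", errors="surrogateescape")
    return [ord(ch) for ch in text]


def preserves_order(decode, lo: bytes, hi: bytes) -> bool:
    """True if two byte strings keep their relative order once decoded."""
    assert lo < hi
    return decode(lo) < decode(hi)


# The two pairs from the examples above.
pairs = [
    (b"\xc4\xbf", b"\xc4\xc0"),
    (b"\xef\xbf\xbf", b"\xef\xbf\xc0"),
]
for lo, hi in pairs:
    print(lo, hi,
          "posix:", preserves_order(posix_decode, lo, hi),
          "utf8se:", preserves_order(utf8se_decode, lo, hi))
```

The per-byte mapping is strictly increasing in the byte value, so it
keeps lexicographic order for sequences of any length; the escaping
decoder only keeps it when both inputs happen to decode the same way.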

> * Work with POSIX to drop the requirement that the C locale needs to be
>   a single-byte locale.
That's not going to happen because it's the /only/ way to process paths.
Indeed, XBD 8.2 puts it nicely:
  Users may use the following environment variables to announce specific
  localization requirements to applications.
As a user, I want to be able to announce "each byte is a character,
in natural ordering". This is what LC_CTYPE=3DC lets me do. I hope
you'll agree this is a good feature to support.

POSIX also explicitly says (XBD 8.2):
5499  1. If the LC_ALL environment variable is defined and is not null, the=
 value of LC_ALL shall
5500     be used.
5501  2. If the LC_* environment variable (LC_COLLATE, LC_CTYPE, LC_MESSAGE=
S,
5502     LC_MONETARY, LC_NUMERIC, LC_TIME) is defined and is not null, the =
value of the
5503     environment variable shall be used to initialize the category that=
 corresponds to the
5504     environment variable.
5505  3. If the LANG environment variable is defined and is not null, the v=
alue of the LANG
5506     environment variable shall be used.
5507  4. If the LANG environment variable is not set or is set to the empty=
 string, the
5508     implementation-defined default locale shall be used.
and XBD 7.2:
3643  All implementations shall define a locale as the default locale, to b=
e invoked when no
3644  environment variables are set, or set to the empty string. This defau=
lt locale can be the POSIX
3645  locale or any other implementation-defined locale. Some implementatio=
ns may provide facilities
3646  for local installation administrators to set the default locale, cust=
omizing it for each location.
3647  POSIX.1-202x does not require such a facility.
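
The precedence quoted from XBD 8.2 can be sketched roughly as below.
resolve_category() and the default name passed to it are hypothetical,
for illustration only; this is not how glibc spells any of it.

```python
def resolve_category(env: dict, category: str, default: str) -> str:
    """Pick the locale name for one category, per XBD 8.2 steps 1-4."""
    # 1. LC_ALL, 2. the category's own LC_* variable, 3. LANG:
    # the first one that is defined and not null wins.
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    # 4. Nothing usable is set: the implementation-defined default.
    return default


# Hypothetical name for the default proposed further down.
DEFAULT = "POSIX-but-UTF-8SE"

print(resolve_category({}, "LC_CTYPE", DEFAULT))
print(resolve_category({"LANG": "pl_PL.UTF-8"}, "LC_CTYPE", DEFAULT))
print(resolve_category({"LANG": "pl_PL.UTF-8", "LC_CTYPE": "C"},
                       "LC_CTYPE", DEFAULT))
```

Only the last step is ours to choose, which is the part XBD 7.2 leaves
implementation-defined.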


To that end, how's about:
  * invent UTF-8SE encoding as you say
  * invent POSIX   encoding like in this patch
    (but move the area to match UTF-8SE probably, it's a good precedent)
  * hook up the POSIX locale to the POSIX encoding, as in this patch
  * change the implementation-defined default locale to POSIX-but-UTF-8SE
  * (maybe) change the default locale on entry to main() to POSIX-but-UTF-8=
SE

POSIX requires that LC_ALL=3DPOSIX is the default on entry to main().
That said, I wouldn't mind violating /that/, since anything we do with it
is backwards-compatible. Maybe it makes sense to do that for programs that
don't call setlocale() at all, and they'll behave better when used
internationally. Or not.

Logically, this translates to:
  * if the user has their native locale selected, use that
  * if the user has explicitly selected the bytewise locale, use that
  * if the user hasn't configured their locales at all,
    assume they want UTF-8 but degrade sensibly
  * (maybe) if the program hasn't been written with locales in mind,
            assume the user will be using it with UTF-8 input but
            degrade sensibly

I think this leaves the wolf full and the sheep alive =E2=80=92 the default
behaviour is UTF-8(ish), and can be overridden to full UTF-8 or bytes,
per the user's requirements.

Existing users will thus gain the ability to:
  * process data that's UTF-8 but skip over/retain
    illegal/otherwise-encoded bytes losslessly
    (this makes the sample above a killer feature instead of nonsensical,
     so long as it's an encoding in its own right)
  * correctly process arbitrarily-encoded data as bytes
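
A small sketch of what the first point buys, again assuming UTF-8SE
would behave like Python's surrogateescape handler (an assumption, the
encoding doesn't exist yet). roundtrip() and valid_chars_only() are
hypothetical helper names.

```python
def roundtrip(data: bytes) -> bytes:
    """Decode with escapes, then encode back; should be lossless."""
    text = data.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape")


def valid_chars_only(data: bytes) -> str:
    """Skip escaped bytes, keeping only what decoded as real UTF-8."""
    text = data.decode("utf-8", errors="surrogateescape")
    # Escaped bytes 0x80..0xFF show up as U+DC80..U+DCFF.
    return "".join(ch for ch in text if not 0xDC80 <= ord(ch) <= 0xDCFF)


# Valid UTF-8 with two stray non-UTF-8 bytes in the middle.
sample = b"caf\xc3\xa9 \xff\xfe.txt"

assert roundtrip(sample) == sample  # retained losslessly
print(valid_chars_only(sample))     # or skipped over
```

Retaining and skipping are both one-liners once the escapes live in
their own range, which is the whole appeal.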

Thoughts?
=D0=BD=D0=B0=D0=B1

--pweol33dldl42332
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEfWlHToQCjFzAxEFjvP0LAY0mWPEFAmRJc1EACgkQvP0LAY0m
WPEwjxAAtC6SqRLseK1u1OGthVGHC84TPop+dsovQRJPNLAtIEbuSd9e0B12XK39
o/dBkLevpLTrieq9NP4n1W4pVA3PsIsUXRQ2DRjO+4EYfLzY/6hovx/zkmHO6JAb
phnhGsY5bwqtlH6pnBf1X71zgqQ5pDo3Ha3+MCJj2eqQlo5CQYSzsVlcOcimugGR
5UPNpofo0QwrtXITjY/8Rz0OYbw96TFPHjwpKHNDsM2KyEq5DbuG3sq9EMvFpxi6
SkXCP+P1erAGSr6ivmH8ijQRRDT17wgbKMNSI9zwQ6uCJcrPkMPKeMBD8dXwzbgP
ZKVf9yQzUgS4iIEn08pJTILt3EVT+/X8aPYgpCBuYpxhT36Cid+00JzGrni8+NiS
W5/viDyXvynZTOM+0+d/FuCsNdSvnOuyuxPLvD+250UY5LkR5HSZnthSZnU5XbTY
oXX12WrrPYwOFHbwu5uSKYhHJ9Bug/tKHT9QsaK5tEQEb3vIxUWB5p3Z0s0nCqzC
zOW/V1/N4JcKYyLIqX32RvwFBfxoWjgK9PzCzwonOx/QVclkBTbXBzH1NMfusRIC
SuhzGzvCXFejBHK1od33XmksyhsKvpx4fdAd0LASRbkSGFohAS6oDHI0lGQeDA3n
s/4NWrJJUWMqfMxmhJxx/uGJ0B/ZS8ToaicWq1TLFXRl/nZSzkA=
=Fg0p
-----END PGP SIGNATURE-----

--pweol33dldl42332--