public inbox for libc-alpha@sourceware.org
From: Florian Weimer <fweimer@redhat.com>
To: наб <nabijaczleweli@nabijaczleweli.xyz>
Cc: libc-alpha@sourceware.org,  Victor Stinner <vstinner@redhat.com>
Subject: Re: [PATCH v9] POSIX locale covers every byte [BZ# 29511]
Date: Mon, 13 Feb 2023 15:52:06 +0100
Message-ID: <87lel1d3e1.fsf@oldenburg.str.redhat.com>
In-Reply-To: <20230207141645.fox6f5w6fn524bch@tarta.nabijaczleweli.xyz> ("наб"'s message of "Tue, 7 Feb 2023 15:16:45 +0100")

* наб:

> This largely duplicates the ASCII code with the error path changed
>
> There are two user-facing changes:
>   * nl_langinfo(CODESET) is "POSIX" instead of "ANSI_X3.4-1968"
>   * mbrtowc() and friends return b if b <= 0x7F else <UDF00>+b
>
> Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively,
>   (a) is 1-byte, stateless, and contains 256 characters
>   (b) they collate in byte order
>   (c) the first 128 characters are equivalent to ASCII (like previous)
> cf. https://www.austingroupbugs.net/view.php?id=663 for a summary of
> changes to the standard;
> in short, this means that mbrtowc() must never fail and must return
>   b if b <= 0x7F else ab+c for all bytes b
>   where c is some constant >=0x80
>     and a is a positive integer constant
>
> By strategically picking c=<UDF00> we land at the tail-end of the
> Unicode Low Surrogate Area at DC00-DFFF, described as
>   > Isolated surrogate code points have no interpretation;
>   > consequently, no character code charts or names lists
>   > are provided for this range.
> and match musl
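
For reference, the mapping described in the quoted summary can be
sketched as follows.  This is only an illustration of the arithmetic
(b if b <= 0x7F else <UDF00>+b), not the patch's code; the function
names are hypothetical, and Python is used purely for brevity.

```python
# Sketch of the byte <-> wide-character mapping from the quoted summary.
# Names are hypothetical; this is not glibc code.
BASE = 0xDF00  # the constant the summary writes as <UDF00>


def posix_byte_to_wc(b: int) -> int:
    """Map one byte to a code point: ASCII unchanged, high bytes to BASE + b."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte")
    return b if b <= 0x7F else BASE + b


def posix_wc_to_byte(wc: int) -> int:
    """Inverse mapping; rejects code points outside the 256-value image."""
    if 0 <= wc <= 0x7F:
        return wc
    if BASE + 0x80 <= wc <= BASE + 0xFF:
        return wc - BASE
    raise ValueError("code point has no single-byte encoding")


codes = [posix_byte_to_wc(b) for b in range(256)]
assert len(set(codes)) == 256                      # (a) 256 distinct characters
assert codes == sorted(codes)                      # (b) byte order is preserved
assert codes[:0x80] == list(range(0x80))           # (c) ASCII is unchanged
assert all(0xDC00 <= c <= 0xDFFF for c in codes[0x80:])  # low surrogates only
assert [posix_wc_to_byte(c) for c in codes] == list(range(256))
```

The high bytes land in U+DF80..U+DFFF, i.e. at the tail end of the Low
Surrogate Area, as the summary says.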

I've thought about this some more, and I don't think this is the
direction we should be going in.  Instead, I'd suggest the following:

* Add a UTF-8SE charset to glibc: it's UTF-8 with surrogate encoding (in
  the Python style).  It should have the property that it can encode
  every byte string as a string of wchar_t characters, and convert the
  result back.  It's not entirely trivial because we need to handle
  partial UTF-8 sequences at the end of the buffer carefully.  There
  might be some warts regarding EILSEQ handling lurking there.  Like the
  Python approach, it is somewhat imperfect because the conversion does
  not commute with string concatenation, i.e. f(x) || f(y) is not always
  equal to f(x || y), but that's unavoidable.

* Switch the charset for the default C locale to UTF-8SE.  This matches
  the POSIX requirement that every byte can be encoded.

* Work with POSIX to drop the requirement that the C locale needs to be
  a single-byte locale.

* (Optional, somewhat unrelated.) Add a generic mechanism so that UTF-8
  locales can be used as UTF-8SE without recompilation.
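
To make the properties above concrete, here is a minimal sketch that
uses Python's existing "surrogateescape" error handler as a stand-in
for the proposed UTF-8SE.  No such glibc charset exists yet, so f and g
below are hypothetical helpers, and Python escapes byte b as U+DC00+b
(U+DC80..U+DCFF), which is a different range from the <UDF00>+b in the
patch.

```python
import codecs


def f(data: bytes) -> str:
    """Decode bytes the Python way: invalid bytes become lone low surrogates."""
    return data.decode("utf-8", errors="surrogateescape")


def g(text: str) -> bytes:
    """Encode back, turning escaped surrogates into the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# Every byte string can be decoded and converted back unchanged.
blob = bytes(range(256))
assert g(f(blob)) == blob

# An undecodable byte b is represented as U+DC00 + b.
assert f(b"\xff") == "\udcff"

# Decoding does not commute with concatenation: splitting a valid
# two-byte sequence yields two escapes instead of one character.
x, y = b"\xc3", b"\xa9"            # together: UTF-8 for U+00E9
assert f(x + y) == "\u00e9"
assert f(x) + f(y) == "\udcc3\udca9"
assert f(x) + f(y) != f(x + y)
assert g(f(x) + f(y)) == x + y     # the bytes still round-trip

# A streaming decoder has to hold back a partial sequence at the end of
# a buffer instead of escaping it right away.
dec = codecs.getincrementaldecoder("utf-8")(errors="surrogateescape")
assert dec.decode(b"\xc3") == ""                   # incomplete, buffered
assert dec.decode(b"\xa9", final=True) == "\u00e9"
```

The last part is the buffer-boundary case mentioned in the first item:
a one-shot conversion of a truncated buffer and an incremental
conversion of the same bytes can legitimately give different wide
strings.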

Thanks,
Florian

