public inbox for libc-alpha@sourceware.org
From: Florian Weimer <fweimer@redhat.com>
To: наб <nabijaczleweli@nabijaczleweli.xyz>
Cc: libc-alpha@sourceware.org,  Victor Stinner <vstinner@redhat.com>
Subject: Re: [PATCH v6 2/2] POSIX locale covers every byte [BZ# 29511]
Date: Thu, 10 Nov 2022 09:10:57 +0100
Message-ID: <87wn8344by.fsf@oldenburg.str.redhat.com>
In-Reply-To: <874jv8dxat.fsf@oldenburg.str.redhat.com> (Florian Weimer's message of "Wed, 09 Nov 2022 15:20:26 +0100")

* Florian Weimer:

> * наб:
>
>> This is a logistically trivial patch,
>> largely duplicating the extant ASCII code with the error path changed
>
> I wouldn't say it's trivial in the commit message. 8-)
>
>> There are two user-facing changes:
>>   * nl_langinfo(CODESET) is "POSIX" instead of "ANSI_X3.4-1968"
>>   * mbrtowc() and friends return b if b <= 0x7F else <UDF00>+b
>>
>> Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively,
>>   (a) is 1-byte, stateless, and contains 256 characters
>>   (b) they collate in byte order
>>   (c) the first 128 characters are equivalent to ASCII (like previous)
>> cf. https://www.austingroupbugs.net/view.php?id=663 for a summary of
>> changes to the standard;
>> in short, this means that mbrtowc() must never fail and must return
>>   b if b <= 0x7F else ab+c for all bytes b
>>   where c is some constant >=0x80
>>     and a is a positive integer constant
>>
>> By strategically picking c=<UDF00> we land at the tail-end of the
>> Unicode Low Surrogate Area at DC00-DFFF, described as
>>   > Isolated surrogate code points have no interpretation;
>>   > consequently, no character code charts or names lists
>>   > are provided for this range.
>> and match musl
>
> Sadly this doesn't match Python and PEP 540:
>
> >>> b'\x80'.decode('UTF-8', errors='surrogateescape')
> '\udc80'
>
> I believe the implementation translates this to 0xDF80 instead.
>
> Not sure what is more important here, musl compatibility or Python
> compatibility.  Cc:ing Victor in case he has comments.  I should probably
> ask on the musl list as well about how this divergence came to pass.

Raised on the musl list here:

  Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale
  <https://www.openwall.com/lists/musl/2022/11/10/1>
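
To make the divergence concrete, here is a minimal sketch comparing the
two mappings for a single byte.  It assumes the patch's mapping is
exactly "b if b <= 0x7F else 0xDF00 + b", as the quoted commit message
states; the function names are hypothetical and are not part of glibc,
musl, or Python.

```python
def posix_locale_wchar(b: int, base: int = 0xDF00) -> int:
    """Byte-to-wide-character mapping as described in the quoted commit
    message (assumption: base is <UDF00>, the value chosen to match musl)."""
    if not 0 <= b <= 0xFF:
        raise ValueError("expected a single byte value")
    return b if b <= 0x7F else base + b


def python_surrogateescape(b: int) -> int:
    """Code point Python produces for a lone byte decoded as UTF-8 with
    the surrogateescape error handler (the handler PEP 540 relies on)."""
    return ord(bytes([b]).decode("utf-8", errors="surrogateescape"))


if __name__ == "__main__":
    b = 0x80
    print(hex(posix_locale_wchar(b)))      # 0xdf80
    print(hex(python_surrogateescape(b)))  # 0xdc80
```

Under these assumptions both mappings agree on ASCII bytes and differ by
a constant 0x300 for every byte from 0x80 through 0xFF.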

> This change definitely needs a NEWS entry.

(With this I meant the change overall, not the encoding.)

Thanks,
Florian