From: Jonathan Wakely <>
To: Dimitrij Mijoski <>
Subject: Re: [PATCH v3] libstdc++: Fix handling of surrogate CP in codecvt [PR108976]
Date: Fri, 29 Sep 2023 16:02:37 +0100
Message-ID: <>
In-Reply-To: <AS1P192MB1620C0581FE17DC9EA0FA6ACACC1A@AS1P192MB1620.EURP192.PROD.OUTLOOK.COM>

On Thu, 28 Sept 2023 at 20:39, Dimitrij Mijoski via Libstdc++
<> wrote:
> This patch fixes the handling of surrogate code points in all standard
> facets for transcoding Unicode that are based on std::codecvt. Surrogate
> code points should always be treated as an error. On the other hand,
> surrogate code units can only appear in UTF-16 and only when they come
> in a proper pair.
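
To make the distinction between surrogate code points and surrogate
code units concrete, here is a minimal sketch. The helper names are
hypothetical and are not the libstdc++ internals; the sketch only
illustrates the rule quoted above: a code point in U+D800..U+DFFF is
always invalid, while UTF-16 code units in that range are accepted only
as a high unit followed by a low unit.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers for illustration only; not the libstdc++ internals.

// Surrogate code points occupy U+D800..U+DFFF and are never valid scalar values.
constexpr bool is_surrogate(char32_t c) { return c >= 0xD800 && c <= 0xDFFF; }

// A code point is encodable only if it is in range and not a surrogate.
constexpr bool is_valid_code_point(char32_t c)
{ return c <= 0x10FFFF && !is_surrogate(c); }

constexpr bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Decode the first code point of a UTF-16 sequence of n code units.
// Returns the number of units consumed (1 or 2). Returns 0 if the
// sequence is empty or begins with a surrogate unit that is not part
// of a proper high+low pair.
inline std::size_t decode_utf16(const char16_t* s, std::size_t n, char32_t& out)
{
    if (n == 0)
        return 0;
    if (!is_high_surrogate(s[0]) && !is_low_surrogate(s[0])) {
        out = s[0];                       // ordinary BMP code unit
        return 1;
    }
    if (is_high_surrogate(s[0]) && n >= 2 && is_low_surrogate(s[1])) {
        out = 0x10000
            + ((static_cast<char32_t>(s[0]) - 0xD800) << 10)
            + (static_cast<char32_t>(s[1]) - 0xDC00);
        return 2;                         // proper surrogate pair
    }
    return 0;                             // unpaired surrogate unit
}
```

A real facet would additionally distinguish a lone high surrogate at the
end of the input (incomplete) from a genuinely malformed sequence; this
sketch reports both as 0 for brevity.
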
> Additionally, it fixes a bug in std::codecvt_utf16::in(): when an odd
> number of bytes was given in the range [from, from_end), error was
> always returned. The last byte in such a range does not form a full
> UTF-16 code unit, so no decision about an error can be made; partial
> should be returned instead.
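
The odd-length case can be sketched as follows. This is an illustration
under stated assumptions, not the patched libstdc++ code: the enum and
the function name are hypothetical, the input is assumed to be
big-endian, and surrogate validation is left out so that only the
"trailing byte means partial" rule is shown.

```cpp
#include <cassert>
#include <vector>

// Hypothetical illustration; these names are not from libstdc++.
enum class result { ok, partial, error };

// Read big-endian UTF-16 code units from the byte range [from, from_end).
// Whole units are appended to out and from is advanced past them. A
// single trailing byte cannot be judged valid or invalid, so it is left
// unconsumed and reported as partial rather than error.
inline result read_utf16be(const unsigned char*& from,
                           const unsigned char* from_end,
                           std::vector<char16_t>& out)
{
    while (from_end - from >= 2) {
        out.push_back(static_cast<char16_t>((from[0] << 8) | from[1]));
        from += 2;
    }
    return from == from_end ? result::ok : result::partial;
}
```

With the three bytes 0x00 0x41 0x00, this reads one unit (u'A'), leaves
from pointing at the last byte, and returns partial, so the caller can
retry once more input arrives.
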
> The testsuite for testing these facets was updated in the following
> order:
> 1. All functions that test codecvts that work with UTF-8 were refactored
>    and made more generic so they accept codecvt that works with the char
>    type char8_t.
> 2. The same functions were updated with new test cases for transcoding
>    errors and now additionally test for surrogates, overlong UTF-8
>    sequences, code points out of the Unicode range, and more tests for
>    missing leading and trailing code units.
> 3. New tests were added to test codecvt_utf16 in both of its variants,
>    UTF-16 <-> UTF-32/UCS-4 and UTF-16 <-> UCS-2.
> libstdc++-v3/ChangeLog:
>         * src/c++11/ (read_utf8_code_point): Fix handling of
>         surrogates in UTF-8.
>         (ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-8.
>         (ucs4_in): Fix handling of range with odd number of bytes.
>         (ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-16.
>         (ucs2_out): Fix handling of surrogates in UCS-2 -> UTF-16.
>         (ucs2_in): Fix handling of range with odd number of bytes.
>         (__codecvt_utf16_base<char16_t>::do_in): Likewise.
>         (__codecvt_utf16_base<char32_t>::do_in): Likewise.
>         (__codecvt_utf16_base<wchar_t>::do_in): Likewise.
>         * testsuite/22_locale/codecvt/ Renames, add
>         tests for codecvt_utf16<char16_t> and codecvt_utf16<char32_t>.
>         * testsuite/22_locale/codecvt/codecvt_unicode.h: Refactor UTF-8
>         testing functions for char8_t, add more test cases for errors,
>         add testing functions for codecvt_utf16.
>         * testsuite/22_locale/codecvt/
>         Renames, add tests for codecvt_utf16<wchar_t>.
>         * testsuite/22_locale/codecvt/codecvt_utf16/ (test06):
>         Fix test.
>         * testsuite/22_locale/codecvt/ New test.

Thanks, your v2 patch was still on my TODO list. I've pushed this
version to trunk now.
