From: Jonathan Wakely <jwakely@redhat.com>
To: Dimitrij Mijoski <dmjpp@hotmail.com>
Cc: gcc-patches@gcc.gnu.org, libstdc++@gcc.gnu.org
Subject: Re: [PATCH] libstdc++: Fix handling of surrogate CP in codecvt [PR108976]
Date: Mon, 20 Mar 2023 15:21:04 +0000
Message-ID: <CACb0b4m-5r7aStZY=yc+75nf5bX18Q-nMgsktjqWK7-VT8WQXg@mail.gmail.com>
In-Reply-To: <AS1P192MB16201866FEA701E17850E8D3ACB49@AS1P192MB1620.EURP192.PROD.OUTLOOK.COM>

On Wed, 8 Mar 2023 at 14:09, Dimitrij Mijoski via Libstdc++ <libstdc++@gcc.gnu.org> wrote:

> This patch fixes the handling of surrogate code points in all standard
> facets for transcoding Unicode that are based on std::codecvt. Surrogate
> code points should always be treated as an error. Surrogate code units,
> on the other hand, can appear only in UTF-16, and only as part of a
> proper pair.
>
> Additionally, it fixes a bug in std::codecvt_utf16::in(): when an odd
> number of bytes was given in the range [from, from_end), error was
> always returned. The last byte in such a range does not form a full
> UTF-16 code unit, so we cannot yet decide whether it is an error;
> partial should be returned instead.
>
> The testsuite for testing these facets was updated in the following
> order:
>
> 1. All functions that test codecvts that work with UTF-8 were refactored
>    and made more generic, so they also accept a codecvt that works with
>    the character type char8_t.
> 2. The same functions were updated with new test cases for transcoding
>    errors and now additionally test for surrogates, overlong UTF-8
>    sequences, code points out of the Unicode range, and more tests for
>    missing leading and trailing code units.
> 3. New tests were added to test codecvt_utf16 in both of its variants,
>    UTF-16 <-> UTF-32/UCS-4 and UTF-16 <-> UCS-2.
>
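
As a rough illustration of the two rules in the description above (surrogate
code points are always an error, and a trailing odd byte in UTF-16 input means
"need more input" rather than an error), here is a minimal sketch. It is not
the libstdc++ implementation and is not taken from the patch; the names
decode_status, is_surrogate and decode_utf16be are hypothetical, and it
assumes big-endian input and decodes a single code point per call.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type, loosely mirroring std::codecvt_base::result.
enum class decode_status { ok, partial, error };

// True for code points in the surrogate range U+D800..U+DFFF.
constexpr bool is_surrogate(char32_t c) { return c >= 0xD800 && c <= 0xDFFF; }

// Decode one code point from big-endian UTF-16 bytes in [from, from_end).
// On ok, 'out' holds the code point and 'consumed' the number of bytes used.
// On partial or error, 'out' is left untouched and 'consumed' is 0.
decode_status decode_utf16be(const unsigned char* from,
                             const unsigned char* from_end,
                             char32_t& out, std::size_t& consumed)
{
    assert(from <= from_end);
    consumed = 0;
    const std::size_t n = static_cast<std::size_t>(from_end - from);

    // Fewer than two bytes is not a full code unit: ask for more input.
    if (n < 2)
        return decode_status::partial;

    const char32_t u1 = (static_cast<char32_t>(from[0]) << 8) | from[1];

    // A low surrogate with no preceding high surrogate is always an error.
    if (u1 >= 0xDC00 && u1 <= 0xDFFF)
        return decode_status::error;

    if (u1 >= 0xD800 && u1 <= 0xDBFF) {
        // High surrogate: a second full code unit is needed before deciding.
        if (n < 4)
            return decode_status::partial;
        const char32_t u2 = (static_cast<char32_t>(from[2]) << 8) | from[3];
        if (u2 < 0xDC00 || u2 > 0xDFFF)
            return decode_status::error;  // high surrogate not followed by low
        out = 0x10000 + ((u1 - 0xD800) << 10) + (u2 - 0xDC00);
        consumed = 4;
        return decode_status::ok;
    }

    // Ordinary BMP code unit.
    out = u1;
    consumed = 2;
    return decode_status::ok;
}
```

Under these assumptions, the three bytes 00 41 00 decode to U+0041 with two
bytes consumed, and a second call on the single remaining byte reports partial
rather than error.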

Thanks, the patch looks OK to my uninformed eye, but I'm seeing a new
regression:

/home/jwakely/src/gcc/gcc/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_utf16/79980.cc:86: void test06(): Assertion 'result == u"from_bytes failed"' failed.
FAIL: 22_locale/codecvt/codecvt_utf16/79980.cc execution test


Also, I see that libc++ fails some of your new tests the same way as
current libstdc++ does:

unicode:
/home/jwakely/src/gcc/gcc/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.h:298: void utf8_to_utf32_in_error(const std::codecvt<InternT, ExternT, mbstate_t> &) [InternT = char32_t, ExternT = char]: Assertion `res == cvt.error' failed.
Aborted (core dumped)

Does that mean libc++ has the same problem? Or is the test wrong? Or is your
patch implementing something that contradicts the requirements of the
standard? My guess is that libc++ handles surrogates the same way current
libstdc++ does, but I'd like to be sure that's right.
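
For comparing implementations, a small reproducer along the following lines
might help. This is only a sketch, not code from the testsuite: the helper
name utf8_in_result is hypothetical, and it assumes the
std::codecvt<char32_t, char, std::mbstate_t> facet (deprecated since C++20,
so expect a warning) obtained from the classic locale.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>
#include <string>

using utf8_to_utf32 = std::codecvt<char32_t, char, std::mbstate_t>;

// Hypothetical helper: run codecvt::in() over 'bytes' and return its result.
// Each input byte yields at most one code point, so 8 output slots are
// enough for inputs of up to 8 bytes.
std::codecvt_base::result utf8_in_result(const std::string& bytes)
{
    assert(bytes.size() <= 8);
    const auto& cvt = std::use_facet<utf8_to_utf32>(std::locale::classic());
    std::mbstate_t state{};
    char32_t out[8];
    const char* from_next = nullptr;
    char32_t* to_next = nullptr;
    return cvt.in(state,
                  bytes.data(), bytes.data() + bytes.size(), from_next,
                  out, out + 8, to_next);
}
```

Calling utf8_in_result("\xED\xA0\x80") feeds in the three bytes that U+D800
would occupy if it were encoded like an ordinary code point. The new test
expects std::codecvt_base::error for that kind of input; what a given library
actually returns is exactly the open question here.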
