On Wed, 8 Mar 2023 at 14:09, Dimitrij Mijoski via Libstdc++ <libstdc++@gcc.gnu.org> wrote:

> This patch fixes the handling of surrogate code points in all standard
> facets for transcoding Unicode that are based on std::codecvt. Surrogate
> code points should always be treated as error. On the other hand
> surrogate code units can only appear in UTF-16 and only when they come
> in a proper pair.
>
> Additionally, it fixes a bug in std::codecvt_utf16::in() when odd number
> of bytes were given in the range [from, from_end), error was returned
> always. The last byte in such range does not form a full UTF-16 code
> unit and we can not make any decisions for error, instead partial should
> be returned.
>
> The testsuite for testing these facets was updated in the following
> order:
>
> 1. All functions that test codecvts that work with UTF-8 were refactored
>    and made more generic so they accept codecvt that works with the char
>    type char8_t.
> 2. The same functions were updated with new test cases for transcoding
>    errors and now additionally test for surrogates, overlong UTF-8
>    sequences, code points out of the Unicode range, and more tests for
>    missing leading and trailing code units.
> 3. New tests were added to test codecvt_utf16 in both of its variants,
>    UTF-16 <-> UTF-32/UCS-4 and UTF-16 <-> UCS-2.
>

Thanks, the patch looks OK to my uninformed eye, but I'm seeing a new
regression:

/home/jwakely/src/gcc/gcc/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_utf16/79980.cc:86: void test06(): Assertion 'result == u"from_bytes failed"' failed.
FAIL: 22_locale/codecvt/codecvt_utf16/79980.cc execution test

Also, I see that libc++ fails some of your new tests the same way as
current libstdc++ does:

unicode: /home/jwakely/src/gcc/gcc/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.h:298: void utf8_to_utf32_in_error(const std::codecvt &) [InternT = char32_t, ExternT = char]: Assertion `res == cvt.error' failed.
Aborted (core dumped)

Does that mean they have the same problem? Or is the test wrong? Or is
your patch implementing something that contradicts the requirements of
the standard?

I think it's that libc++ has the same handling of surrogates, but I'd like
to be sure that's right.
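
For reference, here is a small standalone model of the UTF-16 -> UTF-32 rules described in the quoted patch summary. It is only a sketch and not the libstdc++ implementation: the names `Result` and `decode_utf16be` are hypothetical, big-endian input is assumed, and returning partial for a high surrogate whose partner has not arrived yet is my assumption about the intended behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for std::codecvt_base::result.
enum class Result { ok, partial, error };

constexpr bool is_high_surrogate(std::uint32_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool is_low_surrogate(std::uint32_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Decode big-endian UTF-16 bytes in [from, from_end) into UTF-32 code points.
// Code points decoded before a partial or error result are left in `out`.
Result decode_utf16be(const unsigned char* from, const unsigned char* from_end,
                      std::vector<char32_t>& out)
{
    assert(from <= from_end);
    while (from != from_end) {
        // A lone trailing byte is not a full code unit: no verdict is possible yet.
        if (from_end - from < 2)
            return Result::partial;
        const std::uint32_t u = (static_cast<std::uint32_t>(from[0]) << 8) | from[1];

        // A low surrogate with no preceding high surrogate is always an error.
        if (is_low_surrogate(u))
            return Result::error;

        if (!is_high_surrogate(u)) {
            out.push_back(static_cast<char32_t>(u));
            from += 2;
            continue;
        }

        // High surrogate: a second full code unit is needed (assumed: partial).
        if (from_end - from < 4)
            return Result::partial;
        const std::uint32_t lo = (static_cast<std::uint32_t>(from[2]) << 8) | from[3];
        if (!is_low_surrogate(lo))
            return Result::error;  // surrogates are only valid as a proper pair

        out.push_back(static_cast<char32_t>(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)));
        from += 4;
    }
    return Result::ok;
}
```

In this model an odd-length range such as `00 41 00` gives partial rather than error, and `DC 00` on its own gives error. A UCS-2 variant would presumably reject every surrogate code unit, since UCS-2 cannot represent pairs.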