Subject: [PATCH] libstdc++: Fix handling of surrogate CP in codecvt [PR108976]
From: Dimitrij Mijoski
To: gcc-patches@gcc.gnu.org, libstdc++@gcc.gnu.org
Date: Wed, 08 Mar 2023 15:08:49 +0100

This patch fixes the handling of surrogate code points in all standard facets for transcoding Unicode that are based on std::codecvt.
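
For readers less familiar with the term: a surrogate code point is any code point in the range U+D800..U+DFFF, and in UTF-8 such a value can only show up as a 3-byte sequence whose lead byte is 0xED and whose second byte is at least 0xA0. The following is a minimal standalone sketch of that check. The helper names are hypothetical and exist only for this illustration; it is not the libstdc++ code, and it assumes the bytes were already validated as a lead byte followed by continuation bytes.

```cpp
#include <cassert>

// Illustration only: hypothetical helpers, not part of libstdc++.

// A surrogate code point is any value in U+D800..U+DFFF.
constexpr bool is_surrogate_cp(char32_t cp)
{
  return 0xD800 <= cp && cp <= 0xDFFF;
}

// Decode a 3-byte UTF-8 sequence. Assumes c1 is in 0xE0..0xEF and that
// c2 and c3 are continuation bytes (0x80..0xBF); no validation is done.
constexpr char32_t decode_utf8_3(unsigned char c1, unsigned char c2,
                                 unsigned char c3)
{
  return (char32_t(c1 & 0x0F) << 12) | (char32_t(c2 & 0x3F) << 6)
         | char32_t(c3 & 0x3F);
}

// The same property tested on the first two bytes only, so a decoder can
// reject the sequence before it has seen the third byte. Assumes c2 is a
// continuation byte.
constexpr bool utf8_3_is_surrogate(unsigned char c1, unsigned char c2)
{
  return c1 == 0xED && c2 >= 0xA0;
}

// U+D7FF (ED 9F BF) is the last code point below the surrogate range.
static_assert(decode_utf8_3(0xED, 0x9F, 0xBF) == 0xD7FF, "");
static_assert(!utf8_3_is_surrogate(0xED, 0x9F), "");
// U+D800 (ED A0 80) and U+DFFF (ED BF BF) bound the surrogate range.
static_assert(decode_utf8_3(0xED, 0xA0, 0x80) == 0xD800, "");
static_assert(decode_utf8_3(0xED, 0xBF, 0xBF) == 0xDFFF, "");
static_assert(utf8_3_is_surrogate(0xED, 0xA0), "");
static_assert(utf8_3_is_surrogate(0xED, 0xBF), "");
// U+E000 (EE 80 80) is the first code point above the surrogate range.
static_assert(decode_utf8_3(0xEE, 0x80, 0x80) == 0xE000, "");
static_assert(!utf8_3_is_surrogate(0xEE, 0x80), "");
```

Testing the first two bytes rather than the decoded value lets a decoder report the error even when the third byte is not yet available.
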
Surrogate code points should always be treated as an error. Surrogate code units, on the other hand, can only appear in UTF-16, and only when they come in a proper pair.

Additionally, it fixes a bug in std::codecvt_utf16::in(): when an odd number of bytes was given in the range [from, from_end), error was always returned. The last byte in such a range does not form a full UTF-16 code unit, so no decision about an error can be made; partial should be returned instead.

The testsuite for these facets was updated in the following order:

1. All functions that test codecvts that work with UTF-8 were refactored and made more generic, so they accept a codecvt that works with the char type char8_t.
2. The same functions were updated with new test cases for transcoding errors and now additionally test for surrogates, overlong UTF-8 sequences, code points out of the Unicode range, and more tests for missing leading and trailing code units.
3. New tests were added to test codecvt_utf16 in both of its variants, UTF-16 <-> UTF-32/UCS-4 and UTF-16 <-> UCS-2.

libstdc++-v3/ChangeLog:

	* src/c++11/codecvt.cc (read_utf8_code_point): Fix handling of surrogates in UTF-8.
	(ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-8.
	(ucs4_in): Fix handling of range with odd number of bytes.
	(ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-16.
	(ucs2_out): Fix handling of surrogates in UCS-2 -> UTF-16.
	(ucs2_in): Fix handling of range with odd number of bytes.
	(__codecvt_utf16_base<char16_t>::do_in): Likewise.
	(__codecvt_utf16_base<char32_t>::do_in): Likewise.
	(__codecvt_utf16_base<wchar_t>::do_in): Likewise.
	* testsuite/22_locale/codecvt/codecvt_unicode.cc: Renames, add tests for codecvt_utf16<char16_t> and codecvt_utf16<char32_t>.
	* testsuite/22_locale/codecvt/codecvt_unicode.h: Refactor UTF-8 testing functions for char8_t, add more test cases for errors, add testing functions for codecvt_utf16.
	* testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc: Renames, add tests for codecvt_utf16<wchar_t>.
	* testsuite/22_locale/codecvt/codecvt_unicode_char8_t.cc: New test.
---
 libstdc++-v3/src/c++11/codecvt.cc             |   18 +-
 .../22_locale/codecvt/codecvt_unicode.cc      |   38 +-
 .../22_locale/codecvt/codecvt_unicode.h       | 1799 +++++++++++++----
 .../codecvt/codecvt_unicode_char8_t.cc        |   53 +
 .../codecvt/codecvt_unicode_wchar_t.cc        |   32 +-
 5 files changed, 1492 insertions(+), 448 deletions(-)
 create mode 100644 libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode_char8_t.cc

diff --git a/libstdc++-v3/src/c++11/codecvt.cc b/libstdc++-v3/src/c++11/codecvt.cc
index 02f05752d..2cc812cfc 100644
--- a/libstdc++-v3/src/c++11/codecvt.cc
+++ b/libstdc++-v3/src/c++11/codecvt.cc
@@ -284,6 +284,8 @@ namespace
           return invalid_mb_sequence;
         if (c1 == 0xE0 && c2 < 0xA0) [[unlikely]] // overlong
           return invalid_mb_sequence;
+        if (c1 == 0xED && c2 >= 0xA0) [[unlikely]] // surrogate
+          return invalid_mb_sequence;
         if (avail < 3) [[unlikely]]
           return incomplete_mb_character;
         char32_t c3 = (unsigned char) from[2];
@@ -484,6 +486,8 @@ namespace
     while (from.size())
       {
         const char32_t c = from[0];
+        if (0xD800 <= c && c <= 0xDFFF) [[unlikely]]
+          return codecvt_base::error;
         if (c > maxcode) [[unlikely]]
           return codecvt_base::error;
         if (!write_utf8_code_point(to, c)) [[unlikely]]
@@ -508,7 +512,7 @@ namespace
           return codecvt_base::error;
         to = codepoint;
       }
-    return from.size() ? codecvt_base::partial : codecvt_base::ok;
+    return from.nbytes() ? codecvt_base::partial : codecvt_base::ok;
   }
 
   // ucs4 -> utf16
@@ -521,6 +525,8 @@ namespace
     while (from.size())
       {
         const char32_t c = from[0];
+        if (0xD800 <= c && c <= 0xDFFF) [[unlikely]]
+          return codecvt_base::error;
         if (c > maxcode) [[unlikely]]
           return codecvt_base::error;
         if (!write_utf16_code_point(to, c, mode)) [[unlikely]]
@@ -653,7 +659,7 @@ namespace
     while (from.size() && to.size())
       {
         char16_t c = from[0];
-        if (is_high_surrogate(c))
+        if (0xD800 <= c && c <= 0xDFFF)
           return codecvt_base::error;
         if (c > maxcode)
           return codecvt_base::error;
@@ -680,7 +686,7 @@ namespace
           return codecvt_base::error;
         to = c;
       }
-    return from.size() == 0 ? codecvt_base::ok : codecvt_base::partial;
+    return from.nbytes() == 0 ? codecvt_base::ok : codecvt_base::partial;
   }
 
   const char16_t*
@@ -1344,8 +1350,6 @@ do_in(state_type&, const extern_type* __from, const extern_type* __from_end,
   auto res = ucs2_in(from, to, _M_maxcode, _M_mode);
   __from_next = reinterpret_cast(from.next);
   __to_next = to.next;
-  if (res == codecvt_base::ok && __from_next != __from_end)
-    res = codecvt_base::error;
   return res;
 }
 
@@ -1419,8 +1423,6 @@ do_in(state_type&, const extern_type* __from, const extern_type* __from_end,
   auto res = ucs4_in(from, to, _M_maxcode, _M_mode);
   __from_next = reinterpret_cast(from.next);
   __to_next = to.next;
-  if (res == codecvt_base::ok && __from_next != __from_end)
-    res = codecvt_base::error;
   return res;
 }
 
@@ -1521,8 +1523,6 @@ do_in(state_type&, const extern_type* __from, const extern_type* __from_end,
 #endif
   __from_next = reinterpret_cast(from.next);
   __to_next = reinterpret_cast(to.next);
-  if (res == codecvt_base::ok && __from_next != __from_end)
-    res = codecvt_base::error;
   return res;
 }
 
diff --git a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.cc b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.cc
index df1a2b4cc..c563781ca 100644
--- a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.cc
+++ b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.cc
@@ -27,38 +27,58 @@ void
 test_utf8_utf32_codecvts ()
 {
   using codecvt_c32 = codecvt;
-  auto loc_c = locale::classic ();
+  auto &loc_c = locale::classic ();
   VERIFY (has_facet (loc_c));
 
   auto &cvt = use_facet (loc_c);
-  test_utf8_utf32_codecvts (cvt);
+  test_utf8_utf32_cvt (cvt);
 
   codecvt_utf8 cvt2;
-  test_utf8_utf32_codecvts (cvt2);
+  test_utf8_utf32_cvt (cvt2);
 }
 
 void
 test_utf8_utf16_codecvts ()
 {
   using codecvt_c16 = codecvt;
-  auto loc_c = locale::classic ();
+  auto &loc_c = locale::classic ();
   VERIFY (has_facet (loc_c));
 
   auto &cvt = use_facet (loc_c);
-  test_utf8_utf16_cvts (cvt);
+  test_utf8_utf16_cvt (cvt);
 
   codecvt_utf8_utf16 cvt2;
-  test_utf8_utf16_cvts (cvt2);
+  test_utf8_utf16_cvt (cvt2);
 
   codecvt_utf8_utf16 cvt3;
-  test_utf8_utf16_cvts (cvt3);
+  test_utf8_utf16_cvt (cvt3);
 }
 
 void
 test_utf8_ucs2_codecvts ()
 {
   codecvt_utf8 cvt;
-  test_utf8_ucs2_cvts (cvt);
+  test_utf8_ucs2_cvt (cvt);
+}
+
+void
+test_utf16_utf32_codecvts ()
+{
+  codecvt_utf16 cvt;
+  test_utf16_utf32_cvt (cvt, utf16_big_endian);
+
+  codecvt_utf16 cvt2;
+  test_utf16_utf32_cvt (cvt2, utf16_little_endian);
+}
+
+void
+test_utf16_ucs2_codecvts ()
+{
+  codecvt_utf16 cvt;
+  test_utf16_ucs2_cvt (cvt, utf16_big_endian);
+
+  codecvt_utf16 cvt2;
+  test_utf16_ucs2_cvt (cvt2, utf16_little_endian);
 }
 
 int
@@ -67,4 +87,6 @@ main ()
   test_utf8_utf32_codecvts ();
   test_utf8_utf16_codecvts ();
   test_utf8_ucs2_codecvts ();
+  test_utf16_utf32_codecvts ();
+  test_utf16_ucs2_codecvts ();
 }
diff --git a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.h b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.h
index fbdc7a35b..f48f8e555 100644
--- a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.h
+++ b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.h
@@ -42,33 +42,33 @@ auto constexpr array_size (const T (&)[N]) ->
size_t return N; } =20 -template +template void -utf8_to_utf32_in_ok (const std::codecvt &cvt) +utf8_to_utf32_in_ok (const std::codecvt &cvt) { using namespace std; // UTF-8 string of 1-byte CP, 2-byte CP, 3-byte CP and 4-byte CP - const char in[] =3D "b=D1=88\uAAAA\U0010AAAA"; - const char32_t exp_literal[] =3D U"b=D1=88\uAAAA\U0010AAAA"; - CharT exp[array_size (exp_literal)] =3D {}; - std::copy (begin (exp_literal), end (exp_literal), begin (exp)); - - static_assert (array_size (in) =3D=3D 11, ""); - static_assert (array_size (exp_literal) =3D=3D 5, ""); - static_assert (array_size (exp) =3D=3D 5, ""); - VERIFY (char_traits::length (in) =3D=3D 10); - VERIFY (char_traits::length (exp_literal) =3D=3D 4); - VERIFY (char_traits::length (exp) =3D=3D 4); + const unsigned char input[] =3D "b\u0448\uAAAA\U0010AAAA"; + const char32_t expected[] =3D U"b\u0448\uAAAA\U0010AAAA"; + static_assert (array_size (input) =3D=3D 11, ""); + static_assert (array_size (expected) =3D=3D 5, ""); + + ExternT in[array_size (input)]; + InternT exp[array_size (expected)]; + copy (begin (input), end (input), begin (in)); + copy (begin (expected), end (expected), begin (exp)); + VERIFY (char_traits::length (in) =3D=3D 10); + VERIFY (char_traits::length (exp) =3D=3D 4); =20 test_offsets_ok offsets[] =3D {{0, 0}, {1, 1}, {3, 2}, {6, 3}, {10, 4}}; for (auto t : offsets) { - CharT out[array_size (exp) - 1] =3D {}; + InternT out[array_size (exp) - 1] =3D {}; VERIFY (t.in_size <=3D array_size (in)); VERIFY (t.out_size <=3D array_size (out)); auto state =3D mbstate_t{}; - auto in_next =3D (const char *) nullptr; - auto out_next =3D (CharT *) nullptr; + auto in_next =3D (const ExternT *) nullptr; + auto out_next =3D (InternT *) nullptr; auto res =3D codecvt_base::result (); =20 res =3D cvt.in (state, in, in + t.in_size, in_next, out, out + t.out= _size, @@ -76,19 +76,19 @@ utf8_to_utf32_in_ok (const std::codecvt &cvt) VERIFY (res =3D=3D cvt.ok); VERIFY (in_next =3D=3D in + t.in_size); VERIFY (out_next 
=3D=3D out + t.out_size); - VERIFY (char_traits::compare (out, exp, t.out_size) =3D=3D 0)= ; + VERIFY (char_traits::compare (out, exp, t.out_size) =3D=3D = 0); if (t.out_size < array_size (out)) VERIFY (out[t.out_size] =3D=3D 0); } =20 for (auto t : offsets) { - CharT out[array_size (exp)] =3D {}; + InternT out[array_size (exp)] =3D {}; VERIFY (t.in_size <=3D array_size (in)); VERIFY (t.out_size <=3D array_size (out)); auto state =3D mbstate_t{}; - auto in_next =3D (const char *) nullptr; - auto out_next =3D (CharT *) nullptr; + auto in_next =3D (const ExternT *) nullptr; + auto out_next =3D (InternT *) nullptr; auto res =3D codecvt_base::result (); =20 res @@ -96,29 +96,29 @@ utf8_to_utf32_in_ok (const std::codecvt &cvt) VERIFY (res =3D=3D cvt.ok); VERIFY (in_next =3D=3D in + t.in_size); VERIFY (out_next =3D=3D out + t.out_size); - VERIFY (char_traits::compare (out, exp, t.out_size) =3D=3D 0)= ; + VERIFY (char_traits::compare (out, exp, t.out_size) =3D=3D = 0); if (t.out_size < array_size (out)) VERIFY (out[t.out_size] =3D=3D 0); } } =20 -template +template void -utf8_to_utf32_in_partial (const std::codecvt &cvt) +utf8_to_utf32_in_partial (const std::codecvt = &cvt) { using namespace std; // UTF-8 string of 1-byte CP, 2-byte CP, 3-byte CP and 4-byte CP - const char in[] =3D "b=D1=88\uAAAA\U0010AAAA"; - const char32_t exp_literal[] =3D U"b=D1=88\uAAAA\U0010AAAA"; - CharT exp[array_size (exp_literal)] =3D {}; - std::copy (begin (exp_literal), end (exp_literal), begin (exp)); - - static_assert (array_size (in) =3D=3D 11, ""); - static_assert (array_size (exp_literal) =3D=3D 5, ""); - static_assert (array_size (exp) =3D=3D 5, ""); - VERIFY (char_traits::length (in) =3D=3D 10); - VERIFY (char_traits::length (exp_literal) =3D=3D 4); - VERIFY (char_traits::length (exp) =3D=3D 4); + const unsigned char input[] =3D "b\u0448\uAAAA\U0010AAAA"; + const char32_t expected[] =3D U"b\u0448\uAAAA\U0010AAAA"; + static_assert (array_size (input) =3D=3D 11, ""); + static_assert 
(array_size (expected) =3D=3D 5, ""); + + ExternT in[array_size (input)]; + InternT exp[array_size (expected)]; + copy (begin (input), end (input), begin (in)); + copy (begin (expected), end (expected), begin (exp)); + VERIFY (char_traits::length (in) =3D=3D 10); + VERIFY (char_traits::length (exp) =3D=3D 4); =20 test_offsets_partial offsets[] =3D { {1, 0, 0, 0}, // no space for first CP @@ -144,14 +144,14 @@ utf8_to_utf32_in_partial (const std::codecvt &cvt) =20 for (auto t : offsets) { - CharT out[array_size (exp) - 1] =3D {}; + InternT out[array_size (exp) - 1] =3D {}; VERIFY (t.in_size <=3D array_size (in)); VERIFY (t.out_size <=3D array_size (out)); VERIFY (t.expected_in_next <=3D t.in_size); VERIFY (t.expected_out_next <=3D t.out_size); auto state =3D mbstate_t{}; - auto in_next =3D (const char *) nullptr; - auto out_next =3D (CharT *) nullptr; + auto in_next =3D (const ExternT *) nullptr; + auto out_next =3D (InternT *) nullptr; auto res =3D codecvt_base::result (); =20 res =3D cvt.in (state, in, in + t.in_size, in_next, out, out + t.out= _size, @@ -159,37 +159,58 @@ utf8_to_utf32_in_partial (const std::codecvt &cvt) VERIFY (res =3D=3D cvt.partial); VERIFY (in_next =3D=3D in + t.expected_in_next); VERIFY (out_next =3D=3D out + t.expected_out_next); - VERIFY (char_traits::compare (out, exp, t.expected_out_next) = =3D=3D 0); + VERIFY (char_traits::compare (out, exp, t.expected_out_next= ) + =3D=3D 0); if (t.expected_out_next < array_size (out)) VERIFY (out[t.expected_out_next] =3D=3D 0); } } =20 -template +template void -utf8_to_utf32_in_error (const std::codecvt &cvt) +utf8_to_utf32_in_error (const std::codecvt &c= vt) { using namespace std; - // UTF-8 string of 1-byte CP, 2-byte CP, 3-byte CP and 4-byte CP - const char valid_in[] =3D "b=D1=88\uAAAA\U0010AAAA"; - const char32_t exp_literal[] =3D U"b=D1=88\uAAAA\U0010AAAA"; - CharT exp[array_size (exp_literal)] =3D {}; - std::copy (begin (exp_literal), end (exp_literal), begin (exp)); + // UTF-8 string of 
1-byte CP, 2-byte CP, 3-byte CP, 4-byte CP + const unsigned char input[] =3D "b\u0448\uD700\U0010AAAA"; + const char32_t expected[] =3D U"b\u0448\uD700\U0010AAAA"; + static_assert (array_size (input) =3D=3D 11, ""); + static_assert (array_size (expected) =3D=3D 5, ""); + + ExternT in[array_size (input)]; + InternT exp[array_size (expected)]; + copy (begin (input), end (input), begin (in)); + copy (begin (expected), end (expected), begin (exp)); + VERIFY (char_traits::length (in) =3D=3D 10); + VERIFY (char_traits::length (exp) =3D=3D 4); + + // There are 5 classes of errors in UTF-8 decoding + // 1. Missing leading byte + // 2. Missing trailing byte + // 3. Surrogate CP + // 4. Ovelong sequence + // 5. CP out of Unicode range + test_offsets_error offsets[] =3D { + + // 1. Missing leading byte. We will replace the leading byte with + // non-leading byte, such as a byte that is always invalid or a traili= ng + // byte. =20 - static_assert (array_size (valid_in) =3D=3D 11, ""); - static_assert (array_size (exp_literal) =3D=3D 5, ""); - static_assert (array_size (exp) =3D=3D 5, ""); - VERIFY (char_traits::length (valid_in) =3D=3D 10); - VERIFY (char_traits::length (exp_literal) =3D=3D 4); - VERIFY (char_traits::length (exp) =3D=3D 4); + // replace leading byte with invalid byte + {1, 4, 0, 0, 0xFF, 0}, + {3, 4, 1, 1, 0xFF, 1}, + {6, 4, 3, 2, 0xFF, 3}, + {10, 4, 6, 3, 0xFF, 6}, =20 - test_offsets_error offsets[] =3D { + // replace leading byte with trailing byte + {1, 4, 0, 0, 0b10101010, 0}, + {3, 4, 1, 1, 0b10101010, 1}, + {6, 4, 3, 2, 0b10101010, 3}, + {10, 4, 6, 3, 0b10101010, 6}, =20 - // replace leading byte with invalid byte - {1, 4, 0, 0, '\xFF', 0}, - {3, 4, 1, 1, '\xFF', 1}, - {6, 4, 3, 2, '\xFF', 3}, - {10, 4, 6, 3, '\xFF', 6}, + // 2. Missing trailing byte. We will replace the trailing byte with + // non-trailing byte, such as a byte that is always invalid or a leadi= ng + // byte (simple ASCII byte in our case). 
=20 // replace first trailing byte with ASCII byte {3, 4, 1, 1, 'z', 2}, @@ -197,21 +218,27 @@ utf8_to_utf32_in_error (const std::codecvt &cvt) {10, 4, 6, 3, 'z', 7}, =20 // replace first trailing byte with invalid byte - {3, 4, 1, 1, '\xFF', 2}, - {6, 4, 3, 2, '\xFF', 4}, - {10, 4, 6, 3, '\xFF', 7}, + {3, 4, 1, 1, 0xFF, 2}, + {6, 4, 3, 2, 0xFF, 4}, + {10, 4, 6, 3, 0xFF, 7}, =20 // replace second trailing byte with ASCII byte {6, 4, 3, 2, 'z', 5}, {10, 4, 6, 3, 'z', 8}, =20 // replace second trailing byte with invalid byte - {6, 4, 3, 2, '\xFF', 5}, - {10, 4, 6, 3, '\xFF', 8}, + {6, 4, 3, 2, 0xFF, 5}, + {10, 4, 6, 3, 0xFF, 8}, =20 // replace third trailing byte {10, 4, 6, 3, 'z', 9}, - {10, 4, 6, 3, '\xFF', 9}, + {10, 4, 6, 3, 0xFF, 9}, + + // 2.1 The following test-cases raise doubt whether error or partial s= hould + // be returned. For example, we have 4-byte sequence with valid leadin= g + // byte. If we hide the last byte we need to return partial. But, if t= he + // second or third byte, which are visible to the call to codecvt, are + // malformed then error should be returned. =20 // replace first trailing byte with ASCII byte, also incomplete at end {5, 4, 3, 2, 'z', 4}, @@ -219,30 +246,51 @@ utf8_to_utf32_in_error (const std::codecvt &cvt) {9, 4, 6, 3, 'z', 7}, =20 // replace first trailing byte with invalid byte, also incomplete at e= nd - {5, 4, 3, 2, '\xFF', 4}, - {8, 4, 6, 3, '\xFF', 7}, - {9, 4, 6, 3, '\xFF', 7}, + {5, 4, 3, 2, 0xFF, 4}, + {8, 4, 6, 3, 0xFF, 7}, + {9, 4, 6, 3, 0xFF, 7}, =20 // replace second trailing byte with ASCII byte, also incomplete at en= d {9, 4, 6, 3, 'z', 8}, =20 // replace second trailing byte with invalid byte, also incomplete at = end - {9, 4, 6, 3, '\xFF', 8}, + {9, 4, 6, 3, 0xFF, 8}, + + // 3. Surrogate CP. 
We modify the second byte (first trailing) of the = 3-byte + // CP U+D700 + {6, 4, 3, 2, 0b10100000, 4}, // turn U+D700 into U+D800 + {6, 4, 3, 2, 0b10101100, 4}, // turn U+D700 into U+DB00 + {6, 4, 3, 2, 0b10110000, 4}, // turn U+D700 into U+DC00 + {6, 4, 3, 2, 0b10111100, 4}, // turn U+D700 into U+DF00 + + // 4. Overlong sequence. The CPs in the input are chosen such as modif= ying + // just the leading byte is enough to make them overlong, i.e. for the + // 3-byte and 4-byte CP the second byte (first trailing) has enough le= ading + // zeroes. + {3, 4, 1, 1, 0b11000000, 1}, // make the 2-byte CP overlong + {3, 4, 1, 1, 0b11000001, 1}, // make the 2-byte CP overlong + {6, 4, 3, 2, 0b11100000, 3}, // make the 3-byte CP overlong + {10, 4, 6, 3, 0b11110000, 6}, // make the 4-byte CP overlong + + // 5. CP above range + // turn U+10AAAA into U+14AAAA by changing its leading byte + {10, 4, 6, 3, 0b11110101, 6}, + // turn U+10AAAA into U+11AAAA by changing its 2nd byte + {10, 4, 6, 3, 0b10011010, 7}, }; for (auto t : offsets) { - char in[array_size (valid_in)] =3D {}; - CharT out[array_size (exp) - 1] =3D {}; + InternT out[array_size (exp) - 1] =3D {}; VERIFY (t.in_size <=3D array_size (in)); VERIFY (t.out_size <=3D array_size (out)); VERIFY (t.expected_in_next <=3D t.in_size); VERIFY (t.expected_out_next <=3D t.out_size); - char_traits::copy (in, valid_in, array_size (valid_in)); + auto old_char =3D in[t.replace_pos]; in[t.replace_pos] =3D t.replace_char; =20 auto state =3D mbstate_t{}; - auto in_next =3D (const char *) nullptr; - auto out_next =3D (CharT *) nullptr; + auto in_next =3D (const ExternT *) nullptr; + auto out_next =3D (InternT *) nullptr; auto res =3D codecvt_base::result (); =20 res =3D cvt.in (state, in, in + t.in_size, in_next, out, out + t.out= _size, @@ -250,48 +298,51 @@ utf8_to_utf32_in_error (const std::codecvt &cvt) VERIFY (res =3D=3D cvt.error); VERIFY (in_next =3D=3D in + t.expected_in_next); VERIFY (out_next =3D=3D out + t.expected_out_next); 
- VERIFY (char_traits::compare (out, exp, t.expected_out_next) = =3D=3D 0); + VERIFY (char_traits::compare (out, exp, t.expected_out_next= ) + =3D=3D 0); if (t.expected_out_next < array_size (out)) VERIFY (out[t.expected_out_next] =3D=3D 0); + + in[t.replace_pos] =3D old_char; } } =20 -template +template void -utf8_to_utf32_in (const std::codecvt &cvt) +utf8_to_utf32_in (const std::codecvt &cvt) { utf8_to_utf32_in_ok (cvt); utf8_to_utf32_in_partial (cvt); utf8_to_utf32_in_error (cvt); } =20 -template +template void -utf32_to_utf8_out_ok (const std::codecvt &cvt) +utf32_to_utf8_out_ok (const std::codecvt &cvt= ) { using namespace std; // UTF-8 string of 1-byte CP, 2-byte CP, 3-byte CP and 4-byte CP - const char32_t in_literal[] =3D U"b=D1=88\uAAAA\U0010AAAA"; - const char exp[] =3D "b=D1=88\uAAAA\U0010AAAA"; - CharT in[array_size (in_literal)] =3D {}; - copy (begin (in_literal), end (in_literal), begin (in)); - - static_assert (array_size (in_literal) =3D=3D 5, ""); - static_assert (array_size (in) =3D=3D 5, ""); - static_assert (array_size (exp) =3D=3D 11, ""); - VERIFY (char_traits::length (in_literal) =3D=3D 4); - VERIFY (char_traits::length (in) =3D=3D 4); - VERIFY (char_traits::length (exp) =3D=3D 10); + const char32_t input[] =3D U"b\u0448\uAAAA\U0010AAAA"; + const unsigned char expected[] =3D "b\u0448\uAAAA\U0010AAAA"; + static_assert (array_size (input) =3D=3D 5, ""); + static_assert (array_size (expected) =3D=3D 11, ""); + + InternT in[array_size (input)]; + ExternT exp[array_size (expected)]; + copy (begin (input), end (input), begin (in)); + copy (begin (expected), end (expected), begin (exp)); + VERIFY (char_traits::length (in) =3D=3D 4); + VERIFY (char_traits::length (exp) =3D=3D 10); =20 const test_offsets_ok offsets[] =3D {{0, 0}, {1, 1}, {2, 3}, {3, 6}, {4,= 10}}; for (auto t : offsets) { - char out[array_size (exp) - 1] =3D {}; + ExternT out[array_size (exp) - 1] =3D {}; VERIFY (t.in_size <=3D array_size (in)); VERIFY (t.out_size <=3D array_size 
(out)); auto state =3D mbstate_t{}; - auto in_next =3D (const CharT *) nullptr; - auto out_next =3D (char *) nullptr; + auto in_next =3D (const InternT *) nullptr; + auto out_next =3D (ExternT *) nullptr; auto res =3D codecvt_base::result (); =20 res =3D cvt.out (state, in, in + t.in_size, in_next, out, out + t.ou= t_size, @@ -299,29 +350,29 @@ utf32_to_utf8_out_ok (const std::codecvt &cvt) VERIFY (res =3D=3D cvt.ok); VERIFY (in_next =3D=3D in + t.in_size); VERIFY (out_next =3D=3D out + t.out_size); - VERIFY (char_traits::compare (out, exp, t.out_size) =3D=3D 0); + VERIFY (char_traits::compare (out, exp, t.out_size) =3D=3D = 0); if (t.out_size < array_size (out)) VERIFY (out[t.out_size] =3D=3D 0); } } =20 -template +template void -utf32_to_utf8_out_partial (const std::codecvt &cvt= ) +utf32_to_utf8_out_partial (const std::codecvt= &cvt) { using namespace std; // UTF-8 string of 1-byte CP, 2-byte CP, 3-byte CP and 4-byte CP - const char32_t in_literal[] =3D U"b=D1=88\uAAAA\U0010AAAA"; - const char exp[] =3D "b=D1=88\uAAAA\U0010AAAA"; - CharT in[array_size (in_literal)] =3D {}; - copy (begin (in_literal), end (in_literal), begin (in)); - - static_assert (array_size (in_literal) =3D=3D 5, ""); - static_assert (array_size (in) =3D=3D 5, ""); - static_assert (array_size (exp) =3D=3D 11, ""); - VERIFY (char_traits::length (in_literal) =3D=3D 4); - VERIFY (char_traits::length (in) =3D=3D 4); - VERIFY (char_traits::length (exp) =3D=3D 10); + const char32_t input[] =3D U"b\u0448\uAAAA\U0010AAAA"; + const unsigned char expected[] =3D "b\u0448\uAAAA\U0010AAAA"; + static_assert (array_size (input) =3D=3D 5, ""); + static_assert (array_size (expected) =3D=3D 11, ""); + + InternT in[array_size (input)]; + ExternT exp[array_size (expected)]; + copy (begin (input), end (input), begin (in)); + copy (begin (expected), end (expected), begin (exp)); + VERIFY (char_traits::length (in) =3D=3D 4); + VERIFY (char_traits::length (exp) =3D=3D 10); =20 const test_offsets_partial offsets[] 
=3D { {1, 0, 0, 0}, // no space for first CP @@ -340,14 +391,14 @@ utf32_to_utf8_out_partial (const std::codecvt &cvt) }; for (auto t : offsets) { - char out[array_size (exp) - 1] =3D {}; + ExternT out[array_size (exp) - 1] =3D {}; VERIFY (t.in_size <=3D array_size (in)); VERIFY (t.out_size <=3D array_size (out)); VERIFY (t.expected_in_next <=3D t.in_size); VERIFY (t.expected_out_next <=3D t.out_size); auto state =3D mbstate_t{}; - auto in_next =3D (const CharT *) nullptr; - auto out_next =3D (char *) nullptr; + auto in_next =3D (const InternT *) nullptr; + auto out_next =3D (ExternT *) nullptr; auto res =3D codecvt_base::result (); =20 res =3D cvt.out (state, in, in + t.in_size, in_next, out, out + t.ou= t_size, @@ -355,44 +406,58 @@ utf32_to_utf8_out_partial (const std::codecvt &cvt) VERIFY (res =3D=3D cvt.partial); VERIFY (in_next =3D=3D in + t.expected_in_next); VERIFY (out_next =3D=3D out + t.expected_out_next); - VERIFY (char_traits::compare (out, exp, t.expected_out_next) = =3D=3D 0); + VERIFY (char_traits::compare (out, exp, t.expected_out_next= ) + =3D=3D 0); if (t.expected_out_next < array_size (out)) VERIFY (out[t.expected_out_next] =3D=3D 0); } } =20 -template +template void -utf32_to_utf8_out_error (const std::codecvt &cvt) +utf32_to_utf8_out_error (const std::codecvt &= cvt) { using namespace std; - const char32_t valid_in[] =3D U"b=D1=88\uAAAA\U0010AAAA"; - const char exp[] =3D "b=D1=88\uAAAA\U0010AAAA"; - - static_assert (array_size (valid_in) =3D=3D 5, ""); - static_assert (array_size (exp) =3D=3D 11, ""); - VERIFY (char_traits::length (valid_in) =3D=3D 4); - VERIFY (char_traits::length (exp) =3D=3D 10); - - test_offsets_error offsets[] =3D {{4, 10, 0, 0, 0x00110000, 0}, - {4, 10, 1, 1, 0x00110000, 1}, - {4, 10, 2, 3, 0x00110000, 2}, - {4, 10, 3, 6, 0x00110000, 3}}; + // UTF-8 string of 1-byte CP, 2-byte CP, 3-byte CP and 4-byte CP + const char32_t input[] =3D U"b\u0448\uAAAA\U0010AAAA"; + const unsigned char expected[] =3D 
"b\u0448\uAAAA\U0010AAAA"; + static_assert (array_size (input) =3D=3D 5, ""); + static_assert (array_size (expected) =3D=3D 11, ""); + + InternT in[array_size (input)]; + ExternT exp[array_size (expected)]; + copy (begin (input), end (input), begin (in)); + copy (begin (expected), end (expected), begin (exp)); + VERIFY (char_traits::length (in) =3D=3D 4); + VERIFY (char_traits::length (exp) =3D=3D 10); + + test_offsets_error offsets[] =3D { + + // Surrogate CP + {4, 10, 0, 0, 0xD800, 0}, + {4, 10, 1, 1, 0xDBFF, 1}, + {4, 10, 2, 3, 0xDC00, 2}, + {4, 10, 3, 6, 0xDFFF, 3}, + + // CP out of range + {4, 10, 0, 0, 0x00110000, 0}, + {4, 10, 1, 1, 0x00110000, 1}, + {4, 10, 2, 3, 0x00110000, 2}, + {4, 10, 3, 6, 0x00110000, 3}}; =20 for (auto t : offsets) { - CharT in[array_size (valid_in)] =3D {}; - char out[array_size (exp) - 1] =3D {}; + ExternT out[array_size (exp) - 1] =3D {}; VERIFY (t.in_size <=3D array_size (in)); VERIFY (t.out_size <=3D array_size (out)); VERIFY (t.expected_in_next <=3D t.in_size); VERIFY (t.expected_out_next <=3D t.out_size); - copy (begin (valid_in), end (valid_in), begin (in)); + auto old_char =3D in[t.replace_pos]; in[t.replace_pos] =3D t.replace_char; =20 auto state =3D mbstate_t{}; - auto in_next =3D (const CharT *) nullptr; - auto out_next =3D (char *) nullptr; + auto in_next =3D (const InternT *) nullptr; + auto out_next =3D (ExternT *) nullptr; auto res =3D codecvt_base::result (); =20 res =3D cvt.out (state, in, in + t.in_size, in_next, out, out + t.ou= t_size, @@ -400,56 +465,59 @@ utf32_to_utf8_out_error (const std::codecvt &cvt) VERIFY (res =3D=3D cvt.error); VERIFY (in_next =3D=3D in + t.expected_in_next); VERIFY (out_next =3D=3D out + t.expected_out_next); - VERIFY (char_traits::compare (out, exp, t.expected_out_next) = =3D=3D 0); + VERIFY (char_traits::compare (out, exp, t.expected_out_next= ) + =3D=3D 0); if (t.expected_out_next < array_size (out)) VERIFY (out[t.expected_out_next] =3D=3D 0); + + in[t.replace_pos] =3D old_char; } } 
=20 -template +template void -utf32_to_utf8_out (const std::codecvt &cvt) +utf32_to_utf8_out (const std::codecvt &cvt) { utf32_to_utf8_out_ok (cvt); utf32_to_utf8_out_partial (cvt); utf32_to_utf8_out_error (cvt); } =20 -template +template void -test_utf8_utf32_codecvts (const std::codecvt &cvt) +test_utf8_utf32_cvt (const std::codecvt &cvt) { utf8_to_utf32_in (cvt); utf32_to_utf8_out (cvt); } =20 -template +template void -utf8_to_utf16_in_ok (const std::codecvt &cvt) +utf8_to_utf16_in_ok (const std::codecvt &cvt) { using namespace std; // UTF-8 string of 1-byte CP, 2-byte CP, 3-byte CP and 4-byte CP - const char in[] =3D "b=D1=88\uAAAA\U0010AAAA"; - const char16_t exp_literal[] =3D u"b=D1=88\uAAAA\U0010AAAA"; - CharT exp[array_size (exp_literal)] =3D {}; - copy (begin (exp_literal), end (exp_literal), begin (exp)); - - static_assert (array_size (in) =3D=3D 11, ""); - static_assert (array_size (exp_literal) =3D=3D 6, ""); - static_assert (array_size (exp) =3D=3D 6, ""); - VERIFY (char_traits::length (in) =3D=3D 10); - VERIFY (char_traits::length (exp_literal) =3D=3D 5); - VERIFY (char_traits::length (exp) =3D=3D 5); + const unsigned char input[] =3D "b\u0448\uAAAA\U0010AAAA"; + const char16_t expected[] =3D u"b\u0448\uAAAA\U0010AAAA"; + static_assert (array_size (input) =3D=3D 11, ""); + static_assert (array_size (expected) =3D=3D 6, ""); + + ExternT in[array_size (input)]; + InternT exp[array_size (expected)]; + copy (begin (input), end (input), begin (in)); + copy (begin (expected), end (expected), begin (exp)); + VERIFY (char_traits::length (in) =3D=3D 10); + VERIFY (char_traits::length (exp) =3D=3D 5); =20 test_offsets_ok offsets[] =3D {{0, 0}, {1, 1}, {3, 2}, {6, 3}, {10, 5}}; for (auto t : offsets) { - CharT out[array_size (exp) - 1] =3D {}; + InternT out[array_size (exp) - 1] =3D {}; VERIFY (t.in_size <=3D array_size (in)); VERIFY (t.out_size <=3D array_size (out)); auto state =3D mbstate_t{}; - auto in_next =3D (const char *) nullptr; - auto out_next =3D 
(CharT *) nullptr; + auto in_next =3D (const ExternT *) nullptr; + auto out_next =3D (InternT *) nullptr; auto res =3D codecvt_base::result (); =20 res =3D cvt.in (state, in, in + t.in_size, in_next, out, out + t.out= _size, @@ -457,19 +525,19 @@ utf8_to_utf16_in_ok (const std::codecvt &cvt) VERIFY (res =3D=3D cvt.ok); VERIFY (in_next =3D=3D in + t.in_size); VERIFY (out_next =3D=3D out + t.out_size); - VERIFY (char_traits::compare (out, exp, t.out_size) =3D=3D 0)= ; + VERIFY (char_traits::compare (out, exp, t.out_size) =3D=3D = 0); if (t.out_size < array_size (out)) VERIFY (out[t.out_size] =3D=3D 0); } =20 for (auto t : offsets) { - CharT out[array_size (exp)] =3D {}; + InternT out[array_size (exp)] =3D {}; VERIFY (t.in_size <=3D array_size (in)); VERIFY (t.out_size <=3D array_size (out)); auto state =3D mbstate_t{}; - auto in_next =3D (const char *) nullptr; - auto out_next =3D (CharT *) nullptr; + auto in_next =3D (const ExternT *) nullptr; + auto out_next =3D (InternT *) nullptr; auto res =3D codecvt_base::result (); =20 res @@ -477,29 +545,29 @@ utf8_to_utf16_in_ok (const std::codecvt &cvt) VERIFY (res =3D=3D cvt.ok); VERIFY (in_next =3D=3D in + t.in_size); VERIFY (out_next =3D=3D out + t.out_size); - VERIFY (char_traits::compare (out, exp, t.out_size) =3D=3D 0)= ; + VERIFY (char_traits::compare (out, exp, t.out_size) =3D=3D = 0); if (t.out_size < array_size (out)) VERIFY (out[t.out_size] =3D=3D 0); } } =20 -template +template void -utf8_to_utf16_in_partial (const std::codecvt &cvt) +utf8_to_utf16_in_partial (const std::codecvt = &cvt) { using namespace std; // UTF-8 string of 1-byte CP, 2-byte CP, 3-byte CP and 4-byte CP - const char in[] =3D "b=D1=88\uAAAA\U0010AAAA"; - const char16_t exp_literal[] =3D u"b=D1=88\uAAAA\U0010AAAA"; - CharT exp[array_size (exp_literal)] =3D {}; - copy (begin (exp_literal), end (exp_literal), begin (exp)); - - static_assert (array_size (in) =3D=3D 11, ""); - static_assert (array_size (exp_literal) =3D=3D 6, ""); - static_assert 
(array_size (exp) =3D=3D 6, ""); - VERIFY (char_traits::length (in) =3D=3D 10); - VERIFY (char_traits::length (exp_literal) =3D=3D 5); - VERIFY (char_traits::length (exp) =3D=3D 5); + const unsigned char input[] =3D "b\u0448\uAAAA\U0010AAAA"; + const char16_t expected[] =3D u"b\u0448\uAAAA\U0010AAAA"; + static_assert (array_size (input) =3D=3D 11, ""); + static_assert (array_size (expected) =3D=3D 6, ""); + + ExternT in[array_size (input)]; + InternT exp[array_size (expected)]; + copy (begin (input), end (input), begin (in)); + copy (begin (expected), end (expected), begin (exp)); + VERIFY (char_traits::length (in) =3D=3D 10); + VERIFY (char_traits::length (exp) =3D=3D 5); =20 test_offsets_partial offsets[] =3D { {1, 0, 0, 0}, // no space for first CP @@ -530,14 +598,14 @@ utf8_to_utf16_in_partial (const std::codecvt &cvt) =20 for (auto t : offsets) { - CharT out[array_size (exp) - 1] =3D {}; + InternT out[array_size (exp) - 1] =3D {}; VERIFY (t.in_size <=3D array_size (in)); VERIFY (t.out_size <=3D array_size (out)); VERIFY (t.expected_in_next <=3D t.in_size); VERIFY (t.expected_out_next <=3D t.out_size); auto state =3D mbstate_t{}; - auto in_next =3D (const char *) nullptr; - auto out_next =3D (CharT *) nullptr; + auto in_next =3D (const ExternT *) nullptr; + auto out_next =3D (InternT *) nullptr; auto res =3D codecvt_base::result (); =20 res =3D cvt.in (state, in, in + t.in_size, in_next, out, out + t.out= _size, @@ -545,36 +613,58 @@ utf8_to_utf16_in_partial (const std::codecvt &cvt) VERIFY (res =3D=3D cvt.partial); VERIFY (in_next =3D=3D in + t.expected_in_next); VERIFY (out_next =3D=3D out + t.expected_out_next); - VERIFY (char_traits::compare (out, exp, t.expected_out_next) = =3D=3D 0); + VERIFY (char_traits::compare (out, exp, t.expected_out_next= ) + =3D=3D 0); if (t.expected_out_next < array_size (out)) VERIFY (out[t.expected_out_next] =3D=3D 0); } } =20 -template +template void -utf8_to_utf16_in_error (const std::codecvt &cvt) +utf8_to_utf16_in_error 
(const std::codecvt &c= vt) { using namespace std; - const char valid_in[] =3D "b=D1=88\uAAAA\U0010AAAA"; - const char16_t exp_literal[] =3D u"b=D1=88\uAAAA\U0010AAAA"; - CharT exp[array_size (exp_literal)] =3D {}; - copy (begin (exp_literal), end (exp_literal), begin (exp)); + // UTF-8 string of 1-byte CP, 2-byte CP, 3-byte CP, 4-byte CP + const unsigned char input[] =3D "b\u0448\uD700\U0010AAAA"; + const char16_t expected[] =3D u"b\u0448\uD700\U0010AAAA"; + static_assert (array_size (input) =3D=3D 11, ""); + static_assert (array_size (expected) =3D=3D 6, ""); + + ExternT in[array_size (input)]; + InternT exp[array_size (expected)]; + copy (begin (input), end (input), begin (in)); + copy (begin (expected), end (expected), begin (exp)); + VERIFY (char_traits::length (in) =3D=3D 10); + VERIFY (char_traits::length (exp) =3D=3D 5); + + // There are 5 classes of errors in UTF-8 decoding + // 1. Missing leading byte + // 2. Missing trailing byte + // 3. Surrogate CP + // 4. Overlong sequence + // 5. CP out of Unicode range + test_offsets_error offsets[] =3D { + + // 1. Missing leading byte. We will replace the leading byte with + // non-leading byte, such as a byte that is always invalid or a traili= ng + // byte.
=20 - static_assert (array_size (valid_in) =3D=3D 11, ""); - static_assert (array_size (exp_literal) =3D=3D 6, ""); - static_assert (array_size (exp) =3D=3D 6, ""); - VERIFY (char_traits::length (valid_in) =3D=3D 10); - VERIFY (char_traits::length (exp_literal) =3D=3D 5); - VERIFY (char_traits::length (exp) =3D=3D 5); + // replace leading byte with invalid byte + {1, 5, 0, 0, 0xFF, 0}, + {3, 5, 1, 1, 0xFF, 1}, + {6, 5, 3, 2, 0xFF, 3}, + {10, 5, 6, 3, 0xFF, 6}, =20 - test_offsets_error offsets[] =3D { + // replace leading byte with trailing byte + {1, 5, 0, 0, 0b10101010, 0}, + {3, 5, 1, 1, 0b10101010, 1}, + {6, 5, 3, 2, 0b10101010, 3}, + {10, 5, 6, 3, 0b10101010, 6}, =20 - // replace leading byte with invalid byte - {1, 5, 0, 0, '\xFF', 0}, - {3, 5, 1, 1, '\xFF', 1}, - {6, 5, 3, 2, '\xFF', 3}, - {10, 5, 6, 3, '\xFF', 6}, + // 2. Missing trailing byte. We will replace the trailing byte with + // non-trailing byte, such as a byte that is always invalid or a leadi= ng + // byte (simple ASCII byte in our case). =20 // replace first trailing byte with ASCII byte {3, 5, 1, 1, 'z', 2}, @@ -582,21 +672,27 @@ utf8_to_utf16_in_error (const std::codecvt &cvt) {10, 5, 6, 3, 'z', 7}, =20 // replace first trailing byte with invalid byte - {3, 5, 1, 1, '\xFF', 2}, - {6, 5, 3, 2, '\xFF', 4}, - {10, 5, 6, 3, '\xFF', 7}, + {3, 5, 1, 1, 0xFF, 2}, + {6, 5, 3, 2, 0xFF, 4}, + {10, 5, 6, 3, 0xFF, 7}, =20 // replace second trailing byte with ASCII byte {6, 5, 3, 2, 'z', 5}, {10, 5, 6, 3, 'z', 8}, =20 // replace second trailing byte with invalid byte - {6, 5, 3, 2, '\xFF', 5}, - {10, 5, 6, 3, '\xFF', 8}, + {6, 5, 3, 2, 0xFF, 5}, + {10, 5, 6, 3, 0xFF, 8}, =20 // replace third trailing byte {10, 5, 6, 3, 'z', 9}, - {10, 5, 6, 3, '\xFF', 9}, + {10, 5, 6, 3, 0xFF, 9}, + + // 2.1 The following test-cases raise doubt whether error or partial s= hould + // be returned. For example, we have 4-byte sequence with valid leadin= g + // byte. If we hide the last byte we need to return partial. 
But, if t= he + // second or third byte, which are visible to the call to codecvt, are + // malformed then error should be returned. =20 // replace first trailing byte with ASCII byte, also incomplete at end {5, 5, 3, 2, 'z', 4}, @@ -604,30 +700,51 @@ utf8_to_utf16_in_error (const std::codecvt &cvt) {9, 5, 6, 3, 'z', 7}, =20 // replace first trailing byte with invalid byte, also incomplete at e= nd - {5, 5, 3, 2, '\xFF', 4}, - {8, 5, 6, 3, '\xFF', 7}, - {9, 5, 6, 3, '\xFF', 7}, + {5, 5, 3, 2, 0xFF, 4}, + {8, 5, 6, 3, 0xFF, 7}, + {9, 5, 6, 3, 0xFF, 7}, =20 // replace second trailing byte with ASCII byte, also incomplete at en= d {9, 5, 6, 3, 'z', 8}, =20 // replace second trailing byte with invalid byte, also incomplete at = end - {9, 5, 6, 3, '\xFF', 8}, + {9, 5, 6, 3, 0xFF, 8}, + + // 3. Surrogate CP. We modify the second byte (first trailing) of the = 3-byte + // CP U+D700 + {6, 5, 3, 2, 0b10100000, 4}, // turn U+D700 into U+D800 + {6, 5, 3, 2, 0b10101100, 4}, // turn U+D700 into U+DB00 + {6, 5, 3, 2, 0b10110000, 4}, // turn U+D700 into U+DC00 + {6, 5, 3, 2, 0b10111100, 4}, // turn U+D700 into U+DF00 + + // 4. Overlong sequence. The CPs in the input are chosen such as modif= ying + // just the leading byte is enough to make them overlong, i.e. for the + // 3-byte and 4-byte CP the second byte (first trailing) has enough le= ading + // zeroes. + {3, 5, 1, 1, 0b11000000, 1}, // make the 2-byte CP overlong + {3, 5, 1, 1, 0b11000001, 1}, // make the 2-byte CP overlong + {6, 5, 3, 2, 0b11100000, 3}, // make the 3-byte CP overlong + {10, 5, 6, 3, 0b11110000, 6}, // make the 4-byte CP overlong + + // 5. 
CP above range + // turn U+10AAAA into U+14AAAA by changing its leading byte + {10, 5, 6, 3, 0b11110101, 6}, + // turn U+10AAAA into U+11AAAA by changing its 2nd byte + {10, 5, 6, 3, 0b10011010, 7}, }; for (auto t : offsets) { - char in[array_size (valid_in)] =3D {}; - CharT out[array_size (exp) - 1] =3D {}; + InternT out[array_size (exp) - 1] =3D {}; VERIFY (t.in_size <=3D array_size (in)); VERIFY (t.out_size <=3D array_size (out)); VERIFY (t.expected_in_next <=3D t.in_size); VERIFY (t.expected_out_next <=3D t.out_size); - char_traits::copy (in, valid_in, array_size (valid_in)); + auto old_char =3D in[t.replace_pos]; in[t.replace_pos] =3D t.replace_char; =20 auto state =3D mbstate_t{}; - auto in_next =3D (const char *) nullptr; - auto out_next =3D (CharT *) nullptr; + auto in_next =3D (const ExternT *) nullptr; + auto out_next =3D (InternT *) nullptr; auto res =3D codecvt_base::result (); =20 res =3D cvt.in (state, in, in + t.in_size, in_next, out, out + t.out= _size, @@ -635,48 +752,51 @@ utf8_to_utf16_in_error (const std::codecvt &cvt) VERIFY (res =3D=3D cvt.error); VERIFY (in_next =3D=3D in + t.expected_in_next); VERIFY (out_next =3D=3D out + t.expected_out_next); - VERIFY (char_traits::compare (out, exp, t.expected_out_next) = =3D=3D 0); + VERIFY (char_traits::compare (out, exp, t.expected_out_next= ) + =3D=3D 0); if (t.expected_out_next < array_size (out)) VERIFY (out[t.expected_out_next] =3D=3D 0); + + in[t.replace_pos] =3D old_char; } } =20 -template +template void -utf8_to_utf16_in (const std::codecvt &cvt) +utf8_to_utf16_in (const std::codecvt &cvt) { utf8_to_utf16_in_ok (cvt); utf8_to_utf16_in_partial (cvt); utf8_to_utf16_in_error (cvt); } =20 -template +template void -utf16_to_utf8_out_ok (const std::codecvt &cvt) +utf16_to_utf8_out_ok (const std::codecvt &cvt= ) { using namespace std; // UTF-8 string of 1-byte CP, 2-byte CP, 3-byte CP and 4-byte CP - const char16_t in_literal[] =3D u"b=D1=88\uAAAA\U0010AAAA"; - const char exp[] =3D 
"b=D1=88\uAAAA\U0010AAAA"; - CharT in[array_size (in_literal)]; - copy (begin (in_literal), end (in_literal), begin (in)); - - static_assert (array_size (in_literal) =3D=3D 6, ""); - static_assert (array_size (exp) =3D=3D 11, ""); - static_assert (array_size (in) =3D=3D 6, ""); - VERIFY (char_traits::length (in_literal) =3D=3D 5); - VERIFY (char_traits::length (exp) =3D=3D 10); - VERIFY (char_traits::length (in) =3D=3D 5); + const char16_t input[] =3D u"b\u0448\uAAAA\U0010AAAA"; + const unsigned char expected[] =3D "b\u0448\uAAAA\U0010AAAA"; + static_assert (array_size (input) =3D=3D 6, ""); + static_assert (array_size (expected) =3D=3D 11, ""); + + InternT in[array_size (input)]; + ExternT exp[array_size (expected)]; + copy (begin (input), end (input), begin (in)); + copy (begin (expected), end (expected), begin (exp)); + VERIFY (char_traits::length (in) =3D=3D 5); + VERIFY (char_traits::length (exp) =3D=3D 10); =20 const test_offsets_ok offsets[] =3D {{0, 0}, {1, 1}, {2, 3}, {3, 6}, {5,= 10}}; for (auto t : offsets) { - char out[array_size (exp) - 1] =3D {}; + ExternT out[array_size (exp) - 1] =3D {}; VERIFY (t.in_size <=3D array_size (in)); VERIFY (t.out_size <=3D array_size (out)); auto state =3D mbstate_t{}; - auto in_next =3D (const CharT *) nullptr; - auto out_next =3D (char *) nullptr; + auto in_next =3D (const InternT *) nullptr; + auto out_next =3D (ExternT *) nullptr; auto res =3D codecvt_base::result (); =20 res =3D cvt.out (state, in, in + t.in_size, in_next, out, out + t.ou= t_size, @@ -684,29 +804,29 @@ utf16_to_utf8_out_ok (const std::codecvt &cvt) VERIFY (res =3D=3D cvt.ok); VERIFY (in_next =3D=3D in + t.in_size); VERIFY (out_next =3D=3D out + t.out_size); - VERIFY (char_traits::compare (out, exp, t.out_size) =3D=3D 0); + VERIFY (char_traits::compare (out, exp, t.out_size) =3D=3D = 0); if (t.out_size < array_size (out)) VERIFY (out[t.out_size] =3D=3D 0); } } =20 -template +template void -utf16_to_utf8_out_partial (const std::codecvt &cvt= ) 
+utf16_to_utf8_out_partial (const std::codecvt= &cvt) { using namespace std; // UTF-8 string of 1-byte CP, 2-byte CP, 3-byte CP and 4-byte CP - const char16_t in_literal[] =3D u"b=D1=88\uAAAA\U0010AAAA"; - const char exp[] =3D "b=D1=88\uAAAA\U0010AAAA"; - CharT in[array_size (in_literal)]; - copy (begin (in_literal), end (in_literal), begin (in)); - - static_assert (array_size (in_literal) =3D=3D 6, ""); - static_assert (array_size (exp) =3D=3D 11, ""); - static_assert (array_size (in) =3D=3D 6, ""); - VERIFY (char_traits::length (in_literal) =3D=3D 5); - VERIFY (char_traits::length (exp) =3D=3D 10); - VERIFY (char_traits::length (in) =3D=3D 5); + const char16_t input[] =3D u"b\u0448\uAAAA\U0010AAAA"; + const unsigned char expected[] =3D "b\u0448\uAAAA\U0010AAAA"; + static_assert (array_size (input) =3D=3D 6, ""); + static_assert (array_size (expected) =3D=3D 11, ""); + + InternT in[array_size (input)]; + ExternT exp[array_size (expected)]; + copy (begin (input), end (input), begin (in)); + copy (begin (expected), end (expected), begin (exp)); + VERIFY (char_traits::length (in) =3D=3D 5); + VERIFY (char_traits::length (exp) =3D=3D 10); =20 const test_offsets_partial offsets[] =3D { {1, 0, 0, 0}, // no space for first CP @@ -732,14 +852,14 @@ utf16_to_utf8_out_partial (const std::codecvt &cvt) }; for (auto t : offsets) { - char out[array_size (exp) - 1] =3D {}; + ExternT out[array_size (exp) - 1] =3D {}; VERIFY (t.in_size <=3D array_size (in)); VERIFY (t.out_size <=3D array_size (out)); VERIFY (t.expected_in_next <=3D t.in_size); VERIFY (t.expected_out_next <=3D t.out_size); auto state =3D mbstate_t{}; - auto in_next =3D (const CharT *) nullptr; - auto out_next =3D (char *) nullptr; + auto in_next =3D (const InternT *) nullptr; + auto out_next =3D (ExternT *) nullptr; auto res =3D codecvt_base::result (); =20 res =3D cvt.out (state, in, in + t.in_size, in_next, out, out + t.ou= t_size, @@ -747,26 +867,34 @@ utf16_to_utf8_out_partial (const std::codecvt &cvt) VERIFY 
(res =3D=3D cvt.partial); VERIFY (in_next =3D=3D in + t.expected_in_next); VERIFY (out_next =3D=3D out + t.expected_out_next); - VERIFY (char_traits::compare (out, exp, t.expected_out_next) = =3D=3D 0); + VERIFY (char_traits::compare (out, exp, t.expected_out_next= ) + =3D=3D 0); if (t.expected_out_next < array_size (out)) VERIFY (out[t.expected_out_next] =3D=3D 0); } } =20 -template +template void -utf16_to_utf8_out_error (const std::codecvt &cvt) +utf16_to_utf8_out_error (const std::codecvt &= cvt) { using namespace std; - const char16_t valid_in[] =3D u"b=D1=88\uAAAA\U0010AAAA"; - const char exp[] =3D "b=D1=88\uAAAA\U0010AAAA"; - - static_assert (array_size (valid_in) =3D=3D 6, ""); - static_assert (array_size (exp) =3D=3D 11, ""); - VERIFY (char_traits::length (valid_in) =3D=3D 5); - VERIFY (char_traits::length (exp) =3D=3D 10); - - test_offsets_error offsets[] =3D { + // UTF-8 string of 1-byte CP, 2-byte CP, 3-byte CP and 4-byte CP + const char16_t input[] =3D u"b\u0448\uAAAA\U0010AAAA"; + const unsigned char expected[] =3D "b\u0448\uAAAA\U0010AAAA"; + static_assert (array_size (input) =3D=3D 6, ""); + static_assert (array_size (expected) =3D=3D 11, ""); + + InternT in[array_size (input)]; + ExternT exp[array_size (expected)]; + copy (begin (input), end (input), begin (in)); + copy (begin (expected), end (expected), begin (exp)); + VERIFY (char_traits::length (in) =3D=3D 5); + VERIFY (char_traits::length (exp) =3D=3D 10); + + // The only possible error in UTF-16 is unpaired surrogate code units. + // So we replace valid code points (scalar values) with lone surrogate C= U. 
+ test_offsets_error offsets[] =3D { {5, 10, 0, 0, 0xD800, 0}, {5, 10, 0, 0, 0xDBFF, 0}, {5, 10, 0, 0, 0xDC00, 0}, @@ -796,18 +924,17 @@ utf16_to_utf8_out_error (const std::codecvt &cvt) =20 for (auto t : offsets) { - CharT in[array_size (valid_in)] =3D {}; - char out[array_size (exp) - 1] =3D {}; + ExternT out[array_size (exp) - 1] =3D {}; VERIFY (t.in_size <=3D array_size (in)); VERIFY (t.out_size <=3D array_size (out)); VERIFY (t.expected_in_next <=3D t.in_size); VERIFY (t.expected_out_next <=3D t.out_size); - copy (begin (valid_in), end (valid_in), begin (in)); + auto old_char =3D in[t.replace_pos]; in[t.replace_pos] =3D t.replace_char; =20 auto state =3D mbstate_t{}; - auto in_next =3D (const CharT *) nullptr; - auto out_next =3D (char *) nullptr; + auto in_next =3D (const InternT *) nullptr; + auto out_next =3D (ExternT *) nullptr; auto res =3D codecvt_base::result (); =20 res =3D cvt.out (state, in, in + t.in_size, in_next, out, out + t.ou= t_size, @@ -815,56 +942,59 @@ utf16_to_utf8_out_error (const std::codecvt &cvt) VERIFY (res =3D=3D cvt.error); VERIFY (in_next =3D=3D in + t.expected_in_next); VERIFY (out_next =3D=3D out + t.expected_out_next); - VERIFY (char_traits::compare (out, exp, t.expected_out_next) = =3D=3D 0); + VERIFY (char_traits::compare (out, exp, t.expected_out_next= ) + =3D=3D 0); if (t.expected_out_next < array_size (out)) VERIFY (out[t.expected_out_next] =3D=3D 0); + + in[t.replace_pos] =3D old_char; } } =20 -template +template void -utf16_to_utf8_out (const std::codecvt &cvt) +utf16_to_utf8_out (const std::codecvt &cvt) { utf16_to_utf8_out_ok (cvt); utf16_to_utf8_out_partial (cvt); utf16_to_utf8_out_error (cvt); } =20 -template +template void -test_utf8_utf16_cvts (const std::codecvt &cvt) +test_utf8_utf16_cvt (const std::codecvt &cvt) { utf8_to_utf16_in (cvt); utf16_to_utf8_out (cvt); } =20 -template +template void -utf8_to_ucs2_in_ok (const std::codecvt &cvt) +utf8_to_ucs2_in_ok (const std::codecvt &cvt) { using namespace std; // 
UTF-8 string of 1-byte CP, 2-byte CP and 3-byte CP - const char in[] =3D "b=D1=88\uAAAA"; - const char16_t exp_literal[] =3D u"b=D1=88\uAAAA"; - CharT exp[array_size (exp_literal)] =3D {}; - copy (begin (exp_literal), end (exp_literal), begin (exp)); - - static_assert (array_size (in) =3D=3D 7, ""); - static_assert (array_size (exp_literal) =3D=3D 4, ""); - static_assert (array_size (exp) =3D=3D 4, ""); - VERIFY (char_traits::length (in) =3D=3D 6); - VERIFY (char_traits::length (exp_literal) =3D=3D 3); - VERIFY (char_traits::length (exp) =3D=3D 3); + const unsigned char input[] =3D "b\u0448\uAAAA"; + const char16_t expected[] =3D u"b\u0448\uAAAA"; + static_assert (array_size (input) =3D=3D 7, ""); + static_assert (array_size (expected) =3D=3D 4, ""); + + ExternT in[array_size (input)]; + InternT exp[array_size (expected)]; + copy (begin (input), end (input), begin (in)); + copy (begin (expected), end (expected), begin (exp)); + VERIFY (char_traits::length (in) =3D=3D 6); + VERIFY (char_traits::length (exp) =3D=3D 3); =20 test_offsets_ok offsets[] =3D {{0, 0}, {1, 1}, {3, 2}, {6, 3}}; for (auto t : offsets) { - CharT out[array_size (exp) - 1] =3D {}; + InternT out[array_size (exp) - 1] =3D {}; VERIFY (t.in_size <=3D array_size (in)); VERIFY (t.out_size <=3D array_size (out)); auto state =3D mbstate_t{}; - auto in_next =3D (const char *) nullptr; - auto out_next =3D (CharT *) nullptr; + auto in_next =3D (const ExternT *) nullptr; + auto out_next =3D (InternT *) nullptr; auto res =3D codecvt_base::result (); =20 res =3D cvt.in (state, in, in + t.in_size, in_next, out, out + t.out= _size, @@ -872,19 +1002,19 @@ utf8_to_ucs2_in_ok (const std::codecvt &cvt) VERIFY (res =3D=3D cvt.ok); VERIFY (in_next =3D=3D in + t.in_size); VERIFY (out_next =3D=3D out + t.out_size); - VERIFY (char_traits::compare (out, exp, t.out_size) =3D=3D 0)= ; + VERIFY (char_traits::compare (out, exp, t.out_size) =3D=3D = 0); if (t.out_size < array_size (out)) VERIFY (out[t.out_size] =3D=3D 0); } 
=20 for (auto t : offsets) { - CharT out[array_size (exp)] =3D {}; + InternT out[array_size (exp)] =3D {}; VERIFY (t.in_size <=3D array_size (in)); VERIFY (t.out_size <=3D array_size (out)); auto state =3D mbstate_t{}; - auto in_next =3D (const char *) nullptr; - auto out_next =3D (CharT *) nullptr; + auto in_next =3D (const ExternT *) nullptr; + auto out_next =3D (InternT *) nullptr; auto res =3D codecvt_base::result (); =20 res @@ -892,29 +1022,29 @@ utf8_to_ucs2_in_ok (const std::codecvt &cvt) VERIFY (res =3D=3D cvt.ok); VERIFY (in_next =3D=3D in + t.in_size); VERIFY (out_next =3D=3D out + t.out_size); - VERIFY (char_traits::compare (out, exp, t.out_size) =3D=3D 0)= ; + VERIFY (char_traits::compare (out, exp, t.out_size) =3D=3D = 0); if (t.out_size < array_size (out)) VERIFY (out[t.out_size] =3D=3D 0); } } =20 -template +template void -utf8_to_ucs2_in_partial (const std::codecvt &cvt) +utf8_to_ucs2_in_partial (const std::codecvt &= cvt) { using namespace std; // UTF-8 string of 1-byte CP, 2-byte CP and 3-byte CP - const char in[] =3D "b=D1=88\uAAAA"; - const char16_t exp_literal[] =3D u"b=D1=88\uAAAA"; - CharT exp[array_size (exp_literal)] =3D {}; - copy (begin (exp_literal), end (exp_literal), begin (exp)); - - static_assert (array_size (in) =3D=3D 7, ""); - static_assert (array_size (exp_literal) =3D=3D 4, ""); - static_assert (array_size (exp) =3D=3D 4, ""); - VERIFY (char_traits::length (in) =3D=3D 6); - VERIFY (char_traits::length (exp_literal) =3D=3D 3); - VERIFY (char_traits::length (exp) =3D=3D 3); + const unsigned char input[] =3D "b\u0448\uAAAA"; + const char16_t expected[] =3D u"b\u0448\uAAAA"; + static_assert (array_size (input) =3D=3D 7, ""); + static_assert (array_size (expected) =3D=3D 4, ""); + + ExternT in[array_size (input)]; + InternT exp[array_size (expected)]; + copy (begin (input), end (input), begin (in)); + copy (begin (expected), end (expected), begin (exp)); + VERIFY (char_traits::length (in) =3D=3D 6); + VERIFY (char_traits::length 
(exp) =3D=3D 3); =20 test_offsets_partial offsets[] =3D { {1, 0, 0, 0}, // no space for first CP @@ -932,14 +1062,14 @@ utf8_to_ucs2_in_partial (const std::codecvt &cvt) =20 for (auto t : offsets) { - CharT out[array_size (exp) - 1] =3D {}; + InternT out[array_size (exp) - 1] =3D {}; VERIFY (t.in_size <=3D array_size (in)); VERIFY (t.out_size <=3D array_size (out)); VERIFY (t.expected_in_next <=3D t.in_size); VERIFY (t.expected_out_next <=3D t.out_size); auto state =3D mbstate_t{}; - auto in_next =3D (const char *) nullptr; - auto out_next =3D (CharT *) nullptr; + auto in_next =3D (const ExternT *) nullptr; + auto out_next =3D (InternT *) nullptr; auto res =3D codecvt_base::result (); =20 res =3D cvt.in (state, in, in + t.in_size, in_next, out, out + t.out= _size, @@ -947,36 +1077,57 @@ utf8_to_ucs2_in_partial (const std::codecvt &cvt) VERIFY (res =3D=3D cvt.partial); VERIFY (in_next =3D=3D in + t.expected_in_next); VERIFY (out_next =3D=3D out + t.expected_out_next); - VERIFY (char_traits::compare (out, exp, t.expected_out_next) = =3D=3D 0); + VERIFY (char_traits::compare (out, exp, t.expected_out_next= ) + =3D=3D 0); if (t.expected_out_next < array_size (out)) VERIFY (out[t.expected_out_next] =3D=3D 0); } } =20 -template +template void -utf8_to_ucs2_in_error (const std::codecvt &cvt) +utf8_to_ucs2_in_error (const std::codecvt &cv= t) { using namespace std; - const char valid_in[] =3D "b=D1=88\uAAAA\U0010AAAA"; - const char16_t exp_literal[] =3D u"b=D1=88\uAAAA\U0010AAAA"; - CharT exp[array_size (exp_literal)] =3D {}; - copy (begin (exp_literal), end (exp_literal), begin (exp)); + const unsigned char input[] =3D "b\u0448\uD700\U0010AAAA"; + const char16_t expected[] =3D u"b\u0448\uD700\U0010AAAA"; + static_assert (array_size (input) =3D=3D 11, ""); + static_assert (array_size (expected) =3D=3D 6, ""); + + ExternT in[array_size (input)]; + InternT exp[array_size (expected)]; + copy (begin (input), end (input), begin (in)); + copy (begin (expected), end (expected), 
begin (exp)); + VERIFY (char_traits::length (in) =3D=3D 10); + VERIFY (char_traits::length (exp) =3D=3D 5); + + // There are 5 classes of errors in UTF-8 decoding + // 1. Missing leading byte + // 2. Missing trailing byte + // 3. Surrogate CP + // 4. Overlong sequence + // 5. CP out of Unicode range + test_offsets_error offsets[] =3D { + + // 1. Missing leading byte. We will replace the leading byte with + // non-leading byte, such as a byte that is always invalid or a traili= ng + // byte. =20 - static_assert (array_size (valid_in) =3D=3D 11, ""); - static_assert (array_size (exp_literal) =3D=3D 6, ""); - static_assert (array_size (exp) =3D=3D 6, ""); - VERIFY (char_traits::length (valid_in) =3D=3D 10); - VERIFY (char_traits::length (exp_literal) =3D=3D 5); - VERIFY (char_traits::length (exp) =3D=3D 5); + // replace leading byte with invalid byte + {1, 5, 0, 0, 0xFF, 0}, + {3, 5, 1, 1, 0xFF, 1}, + {6, 5, 3, 2, 0xFF, 3}, + {10, 5, 6, 3, 0xFF, 6}, =20 - test_offsets_error offsets[] =3D { + // replace leading byte with trailing byte + {1, 5, 0, 0, 0b10101010, 0}, + {3, 5, 1, 1, 0b10101010, 1}, + {6, 5, 3, 2, 0b10101010, 3}, + {10, 5, 6, 3, 0b10101010, 6}, =20 - // replace leading byte with invalid byte - {1, 5, 0, 0, '\xFF', 0}, - {3, 5, 1, 1, '\xFF', 1}, - {6, 5, 3, 2, '\xFF', 3}, - {10, 5, 6, 3, '\xFF', 6}, + // 2. Missing trailing byte. We will replace the trailing byte with + // non-trailing byte, such as a byte that is always invalid or a leadi= ng + // byte (simple ASCII byte in our case).
=20 // replace first trailing byte with ASCII byte {3, 5, 1, 1, 'z', 2}, @@ -984,72 +1135,90 @@ utf8_to_ucs2_in_error (const std::codecvt &cvt) {10, 5, 6, 3, 'z', 7}, =20 // replace first trailing byte with invalid byte - {3, 5, 1, 1, '\xFF', 2}, - {6, 5, 3, 2, '\xFF', 4}, - {10, 5, 6, 3, '\xFF', 7}, + {3, 5, 1, 1, 0xFF, 2}, + {6, 5, 3, 2, 0xFF, 4}, + {10, 5, 6, 3, 0xFF, 7}, =20 // replace second trailing byte with ASCII byte {6, 5, 3, 2, 'z', 5}, {10, 5, 6, 3, 'z', 8}, =20 // replace second trailing byte with invalid byte - {6, 5, 3, 2, '\xFF', 5}, - {10, 5, 6, 3, '\xFF', 8}, + {6, 5, 3, 2, 0xFF, 5}, + {10, 5, 6, 3, 0xFF, 8}, =20 // replace third trailing byte {10, 5, 6, 3, 'z', 9}, - {10, 5, 6, 3, '\xFF', 9}, - - // When we see a leading byte of 4-byte CP, we should return error, no - // matter if it is incomplete at the end or has errors in the trailing - // bytes. - - // Don't replace anything, show full 4-byte CP - {10, 4, 6, 3, 'b', 0}, - {10, 5, 6, 3, 'b', 0}, + {10, 5, 6, 3, 0xFF, 9}, =20 - // Don't replace anything, show incomplete 4-byte CP at the end - {7, 4, 6, 3, 'b', 0}, // incomplete fourth CP - {8, 4, 6, 3, 'b', 0}, // incomplete fourth CP - {9, 4, 6, 3, 'b', 0}, // incomplete fourth CP - {7, 5, 6, 3, 'b', 0}, // incomplete fourth CP - {8, 5, 6, 3, 'b', 0}, // incomplete fourth CP - {9, 5, 6, 3, 'b', 0}, // incomplete fourth CP + // 2.1 The following test-cases raise doubt whether error or partial s= hould + // be returned. For example, we have 4-byte sequence with valid leadin= g + // byte. If we hide the last byte we need to return partial. But, if t= he + // second or third byte, which are visible to the call to codecvt, are + // malformed then error should be returned. 
=20 // replace first trailing byte with ASCII byte, also incomplete at end {5, 5, 3, 2, 'z', 4}, - - // replace first trailing byte with invalid byte, also incomplete at e= nd - {5, 5, 3, 2, '\xFF', 4}, - - // replace first trailing byte with ASCII byte, also incomplete at end {8, 5, 6, 3, 'z', 7}, {9, 5, 6, 3, 'z', 7}, =20 // replace first trailing byte with invalid byte, also incomplete at e= nd - {8, 5, 6, 3, '\xFF', 7}, - {9, 5, 6, 3, '\xFF', 7}, + {5, 5, 3, 2, 0xFF, 4}, + {8, 5, 6, 3, 0xFF, 7}, + {9, 5, 6, 3, 0xFF, 7}, =20 // replace second trailing byte with ASCII byte, also incomplete at en= d {9, 5, 6, 3, 'z', 8}, =20 // replace second trailing byte with invalid byte, also incomplete at = end - {9, 5, 6, 3, '\xFF', 8}, + {9, 5, 6, 3, 0xFF, 8}, + + // 3. Surrogate CP. We modify the second byte (first trailing) of the = 3-byte + // CP U+D700 + {6, 5, 3, 2, 0b10100000, 4}, // turn U+D700 into U+D800 + {6, 5, 3, 2, 0b10101100, 4}, // turn U+D700 into U+DB00 + {6, 5, 3, 2, 0b10110000, 4}, // turn U+D700 into U+DC00 + {6, 5, 3, 2, 0b10111100, 4}, // turn U+D700 into U+DF00 + + // 4. Overlong sequence. The CPs in the input are chosen such as modif= ying + // just the leading byte is enough to make them overlong, i.e. for the + // 3-byte and 4-byte CP the second byte (first trailing) has enough le= ading + // zeroes. + {3, 5, 1, 1, 0b11000000, 1}, // make the 2-byte CP overlong + {3, 5, 1, 1, 0b11000001, 1}, // make the 2-byte CP overlong + {6, 5, 3, 2, 0b11100000, 3}, // make the 3-byte CP overlong + {10, 5, 6, 3, 0b11110000, 6}, // make the 4-byte CP overlong + + // 5. CP above range + // turn U+10AAAA into U+14AAAA by changing its leading byte + {10, 5, 6, 3, 0b11110101, 6}, + // turn U+10AAAA into U+11AAAA by changing its 2nd byte + {10, 5, 6, 3, 0b10011010, 7}, + // Don't replace anything, show full 4-byte CP U+10AAAA + {10, 4, 6, 3, 'b', 0}, + {10, 5, 6, 3, 'b', 0}, + // Don't replace anything, show incomplete 4-byte CP at the end. 
It's = still + // out of UCS2 range just by seeing the first byte. + {7, 4, 6, 3, 'b', 0}, // incomplete fourth CP + {8, 4, 6, 3, 'b', 0}, // incomplete fourth CP + {9, 4, 6, 3, 'b', 0}, // incomplete fourth CP + {7, 5, 6, 3, 'b', 0}, // incomplete fourth CP + {8, 5, 6, 3, 'b', 0}, // incomplete fourth CP + {9, 5, 6, 3, 'b', 0}, // incomplete fourth CP }; for (auto t : offsets) { - char in[array_size (valid_in)] =3D {}; - CharT out[array_size (exp) - 1] =3D {}; + InternT out[array_size (exp) - 1] =3D {}; VERIFY (t.in_size <=3D array_size (in)); VERIFY (t.out_size <=3D array_size (out)); VERIFY (t.expected_in_next <=3D t.in_size); VERIFY (t.expected_out_next <=3D t.out_size); - char_traits::copy (in, valid_in, array_size (valid_in)); + auto old_char =3D in[t.replace_pos]; in[t.replace_pos] =3D t.replace_char; =20 auto state =3D mbstate_t{}; - auto in_next =3D (const char *) nullptr; - auto out_next =3D (CharT *) nullptr; + auto in_next =3D (const ExternT *) nullptr; + auto out_next =3D (InternT *) nullptr; auto res =3D codecvt_base::result (); =20 res =3D cvt.in (state, in, in + t.in_size, in_next, out, out + t.out= _size, @@ -1057,48 +1226,51 @@ utf8_to_ucs2_in_error (const std::codecvt &cvt) VERIFY (res =3D=3D cvt.error); VERIFY (in_next =3D=3D in + t.expected_in_next); VERIFY (out_next =3D=3D out + t.expected_out_next); - VERIFY (char_traits::compare (out, exp, t.expected_out_next) = =3D=3D 0); + VERIFY (char_traits::compare (out, exp, t.expected_out_next= ) + =3D=3D 0); if (t.expected_out_next < array_size (out)) VERIFY (out[t.expected_out_next] =3D=3D 0); + + in[t.replace_pos] =3D old_char; } } =20 -template +template void -utf8_to_ucs2_in (const std::codecvt &cvt) +utf8_to_ucs2_in (const std::codecvt &cvt) { utf8_to_ucs2_in_ok (cvt); utf8_to_ucs2_in_partial (cvt); utf8_to_ucs2_in_error (cvt); } =20 -template +template void -ucs2_to_utf8_out_ok (const std::codecvt &cvt) +ucs2_to_utf8_out_ok (const std::codecvt &cvt) { using namespace std; // UTF-8 string of 
1-byte CP, 2-byte CP and 3-byte CP - const char16_t in_literal[] =3D u"b=D1=88\uAAAA"; - const char exp[] =3D "b=D1=88\uAAAA"; - CharT in[array_size (in_literal)] =3D {}; - copy (begin (in_literal), end (in_literal), begin (in)); - - static_assert (array_size (in_literal) =3D=3D 4, ""); - static_assert (array_size (exp) =3D=3D 7, ""); - static_assert (array_size (in) =3D=3D 4, ""); - VERIFY (char_traits::length (in_literal) =3D=3D 3); - VERIFY (char_traits::length (exp) =3D=3D 6); - VERIFY (char_traits::length (in) =3D=3D 3); + const char16_t input[] =3D u"b\u0448\uAAAA"; + const unsigned char expected[] =3D "b\u0448\uAAAA"; + static_assert (array_size (input) =3D=3D 4, ""); + static_assert (array_size (expected) =3D=3D 7, ""); + + InternT in[array_size (input)]; + ExternT exp[array_size (expected)]; + copy (begin (input), end (input), begin (in)); + copy (begin (expected), end (expected), begin (exp)); + VERIFY (char_traits::length (in) =3D=3D 3); + VERIFY (char_traits::length (exp) =3D=3D 6); =20 const test_offsets_ok offsets[] =3D {{0, 0}, {1, 1}, {2, 3}, {3, 6}}; for (auto t : offsets) { - char out[array_size (exp) - 1] =3D {}; + ExternT out[array_size (exp) - 1] =3D {}; VERIFY (t.in_size <=3D array_size (in)); VERIFY (t.out_size <=3D array_size (out)); auto state =3D mbstate_t{}; - auto in_next =3D (const CharT *) nullptr; - auto out_next =3D (char *) nullptr; + auto in_next =3D (const InternT *) nullptr; + auto out_next =3D (ExternT *) nullptr; auto res =3D codecvt_base::result (); =20 res =3D cvt.out (state, in, in + t.in_size, in_next, out, out + t.ou= t_size, @@ -1106,29 +1278,29 @@ ucs2_to_utf8_out_ok (const std::codecvt &cvt) VERIFY (res =3D=3D cvt.ok); VERIFY (in_next =3D=3D in + t.in_size); VERIFY (out_next =3D=3D out + t.out_size); - VERIFY (char_traits::compare (out, exp, t.out_size) =3D=3D 0); + VERIFY (char_traits::compare (out, exp, t.out_size) =3D=3D = 0); if (t.out_size < array_size (out)) VERIFY (out[t.out_size] =3D=3D 0); } } =20 -template 
+template void -ucs2_to_utf8_out_partial (const std::codecvt &cvt) +ucs2_to_utf8_out_partial (const std::codecvt = &cvt) { using namespace std; // UTF-8 string of 1-byte CP, 2-byte CP and 3-byte CP - const char16_t in_literal[] =3D u"b=D1=88\uAAAA"; - const char exp[] =3D "b=D1=88\uAAAA"; - CharT in[array_size (in_literal)] =3D {}; - copy (begin (in_literal), end (in_literal), begin (in)); - - static_assert (array_size (in_literal) =3D=3D 4, ""); - static_assert (array_size (exp) =3D=3D 7, ""); - static_assert (array_size (in) =3D=3D 4, ""); - VERIFY (char_traits::length (in_literal) =3D=3D 3); - VERIFY (char_traits::length (exp) =3D=3D 6); - VERIFY (char_traits::length (in) =3D=3D 3); + const char16_t input[] =3D u"b\u0448\uAAAA"; + const unsigned char expected[] =3D "b\u0448\uAAAA"; + static_assert (array_size (input) =3D=3D 4, ""); + static_assert (array_size (expected) =3D=3D 7, ""); + + InternT in[array_size (input)]; + ExternT exp[array_size (expected)]; + copy (begin (input), end (input), begin (in)); + copy (begin (expected), end (expected), begin (exp)); + VERIFY (char_traits::length (in) =3D=3D 3); + VERIFY (char_traits::length (exp) =3D=3D 6); =20 const test_offsets_partial offsets[] =3D { {1, 0, 0, 0}, // no space for first CP @@ -1142,14 +1314,14 @@ ucs2_to_utf8_out_partial (const std::codecvt &cvt) }; for (auto t : offsets) { - char out[array_size (exp) - 1] =3D {}; + ExternT out[array_size (exp) - 1] =3D {}; VERIFY (t.in_size <=3D array_size (in)); VERIFY (t.out_size <=3D array_size (out)); VERIFY (t.expected_in_next <=3D t.in_size); VERIFY (t.expected_out_next <=3D t.out_size); auto state =3D mbstate_t{}; - auto in_next =3D (const CharT *) nullptr; - auto out_next =3D (char *) nullptr; + auto in_next =3D (const InternT *) nullptr; + auto out_next =3D (ExternT *) nullptr; auto res =3D codecvt_base::result (); =20 res =3D cvt.out (state, in, in + t.in_size, in_next, out, out + t.ou= t_size, @@ -1157,43 +1329,45 @@ ucs2_to_utf8_out_partial (const 
std::codecvt &cvt) VERIFY (res =3D=3D cvt.partial); VERIFY (in_next =3D=3D in + t.expected_in_next); VERIFY (out_next =3D=3D out + t.expected_out_next); - VERIFY (char_traits::compare (out, exp, t.expected_out_next) = =3D=3D 0); + VERIFY (char_traits::compare (out, exp, t.expected_out_next= ) + =3D=3D 0); if (t.expected_out_next < array_size (out)) VERIFY (out[t.expected_out_next] =3D=3D 0); } } =20 -template +template void -ucs2_to_utf8_out_error (const std::codecvt &cvt) +ucs2_to_utf8_out_error (const std::codecvt &c= vt) { using namespace std; - const char16_t valid_in[] =3D u"b=D1=88\uAAAA\U0010AAAA"; - const char exp[] =3D "b=D1=88\uAAAA\U0010AAAA"; - - static_assert (array_size (valid_in) =3D=3D 6, ""); - static_assert (array_size (exp) =3D=3D 11, ""); - VERIFY (char_traits::length (valid_in) =3D=3D 5); - VERIFY (char_traits::length (exp) =3D=3D 10); - - test_offsets_error offsets[] =3D { - {5, 10, 0, 0, 0xD800, 0}, - {5, 10, 0, 0, 0xDBFF, 0}, - {5, 10, 0, 0, 0xDC00, 0}, - {5, 10, 0, 0, 0xDFFF, 0}, - - {5, 10, 1, 1, 0xD800, 1}, - {5, 10, 1, 1, 0xDBFF, 1}, - {5, 10, 1, 1, 0xDC00, 1}, - {5, 10, 1, 1, 0xDFFF, 1}, - - {5, 10, 2, 3, 0xD800, 2}, - {5, 10, 2, 3, 0xDBFF, 2}, - {5, 10, 2, 3, 0xDC00, 2}, - {5, 10, 2, 3, 0xDFFF, 2}, - - // dont replace anything, just show the surrogate pair - {5, 10, 3, 6, u'b', 0}, + const char16_t input[] =3D u"b\u0448\uAAAA\U0010AAAA"; + const unsigned char expected[] =3D "b\u0448\uAAAA\U0010AAAA"; + static_assert (array_size (input) =3D=3D 6, ""); + static_assert (array_size (expected) =3D=3D 11, ""); + + InternT in[array_size (input)]; + ExternT exp[array_size (expected)]; + copy (begin (input), end (input), begin (in)); + copy (begin (expected), end (expected), begin (exp)); + VERIFY (char_traits::length (in) =3D=3D 5); + VERIFY (char_traits::length (exp) =3D=3D 10); + + test_offsets_error offsets[] =3D { + {3, 6, 0, 0, 0xD800, 0}, + {3, 6, 0, 0, 0xDBFF, 0}, + {3, 6, 0, 0, 0xDC00, 0}, + {3, 6, 0, 0, 0xDFFF, 0}, + + {3, 6, 1, 1, 
0xD800, 1}, + {3, 6, 1, 1, 0xDBFF, 1}, + {3, 6, 1, 1, 0xDC00, 1}, + {3, 6, 1, 1, 0xDFFF, 1}, + + {3, 6, 2, 3, 0xD800, 2}, + {3, 6, 2, 3, 0xDBFF, 2}, + {3, 6, 2, 3, 0xDC00, 2}, + {3, 6, 2, 3, 0xDFFF, 2}, =20 // make the leading surrogate a trailing one {5, 10, 3, 6, 0xDC00, 3}, @@ -1206,6 +1380,9 @@ ucs2_to_utf8_out_error (const std::codecvt &cvt) // make the trailing surrogate a BMP char {5, 10, 3, 6, u'z', 4}, =20 + // don't replace anything in the test cases below, just show the surr= ogate + // pair (fourth CP) fully or partially + {5, 10, 3, 6, u'b', 0}, {5, 7, 3, 6, u'b', 0}, // no space for fourth CP {5, 8, 3, 6, u'b', 0}, // no space for fourth CP {5, 9, 3, 6, u'b', 0}, // no space for fourth CP @@ -1214,23 +1391,21 @@ ucs2_to_utf8_out_error (const std::codecvt &cvt) {4, 7, 3, 6, u'b', 0}, // incomplete fourth CP, and no space for it {4, 8, 3, 6, u'b', 0}, // incomplete fourth CP, and no space for it {4, 9, 3, 6, u'b', 0}, // incomplete fourth CP, and no space for it - }; =20 for (auto t : offsets) { - CharT in[array_size (valid_in)] =3D {}; - char out[array_size (exp) - 1] =3D {}; + ExternT out[array_size (exp) - 1] =3D {}; VERIFY (t.in_size <=3D array_size (in)); VERIFY (t.out_size <=3D array_size (out)); VERIFY (t.expected_in_next <=3D t.in_size); VERIFY (t.expected_out_next <=3D t.out_size); - copy (begin (valid_in), end (valid_in), begin (in)); + auto old_char =3D in[t.replace_pos]; in[t.replace_pos] =3D t.replace_char; =20 auto state =3D mbstate_t{}; - auto in_next =3D (const CharT *) nullptr; - auto out_next =3D (char *) nullptr; + auto in_next =3D (const InternT *) nullptr; + auto out_next =3D (ExternT *) nullptr; auto res =3D codecvt_base::result (); =20 res =3D cvt.out (state, in, in + t.in_size, in_next, out, out + t.ou= t_size, @@ -1238,25 +1413,793 @@ ucs2_to_utf8_out_error (const std::codecvt &cvt) VERIFY (res =3D=3D cvt.error); VERIFY (in_next =3D=3D in + t.expected_in_next); VERIFY (out_next =3D=3D out + t.expected_out_next); - VERIFY
(char_traits::compare (out, exp, t.expected_out_next) = =3D=3D 0); + VERIFY (char_traits::compare (out, exp, t.expected_out_next= ) + =3D=3D 0); if (t.expected_out_next < array_size (out)) VERIFY (out[t.expected_out_next] =3D=3D 0); + + in[t.replace_pos] =3D old_char; } } =20 -template +template void -ucs2_to_utf8_out (const std::codecvt &cvt) +ucs2_to_utf8_out (const std::codecvt &cvt) { ucs2_to_utf8_out_ok (cvt); ucs2_to_utf8_out_partial (cvt); ucs2_to_utf8_out_error (cvt); } =20 -template +template void -test_utf8_ucs2_cvts (const std::codecvt &cvt) +test_utf8_ucs2_cvt (const std::codecvt &cvt) { utf8_to_ucs2_in (cvt); ucs2_to_utf8_out (cvt); } + +enum utf16_endianess +{ + utf16_big_endian, + utf16_little_endian +}; + +template +Iter2 +utf16_to_bytes (Iter1 f, Iter1 l, Iter2 o, utf16_endianess e) +{ + if (e =3D=3D utf16_big_endian) + for (; f !=3D l; ++f) + { + *o++ =3D (*f >> 8) & 0xFF; + *o++ =3D *f & 0xFF; + } + else + for (; f !=3D l; ++f) + { + *o++ =3D *f & 0xFF; + *o++ =3D (*f >> 8) & 0xFF; + } + return o; +} + +template +void +utf16_to_utf32_in_ok (const std::codecvt &cvt, + utf16_endianess endianess) +{ + using namespace std; + const char16_t input[] =3D u"b\u0448\uAAAA\U0010AAAA"; + const char32_t expected[] =3D U"b\u0448\uAAAA\U0010AAAA"; + static_assert (array_size (input) =3D=3D 6, ""); + static_assert (array_size (expected) =3D=3D 5, ""); + + char in[array_size (input) * 2]; + InternT exp[array_size (expected)]; + utf16_to_bytes (begin (input), end (input), begin (in), endianess); + copy (begin (expected), end (expected), begin (exp)); + + test_offsets_ok offsets[] =3D {{0, 0}, {2, 1}, {4, 2}, {6, 3}, {10, 4}}; + for (auto t : offsets) + { + InternT out[array_size (exp) - 1] =3D {}; + VERIFY (t.in_size <=3D array_size (in)); + VERIFY (t.out_size <=3D array_size (out)); + auto state =3D mbstate_t{}; + auto in_next =3D (const char *) nullptr; + auto out_next =3D (InternT *) nullptr; + auto res =3D codecvt_base::result (); + + res =3D cvt.in (state, 
in, in + t.in_size, in_next, out, out + t.out= _size, + out_next); + VERIFY (res =3D=3D cvt.ok); + VERIFY (in_next =3D=3D in + t.in_size); + VERIFY (out_next =3D=3D out + t.out_size); + VERIFY (char_traits::compare (out, exp, t.out_size) =3D=3D = 0); + if (t.out_size < array_size (out)) + VERIFY (out[t.out_size] =3D=3D 0); + } + + for (auto t : offsets) + { + InternT out[array_size (exp)] =3D {}; + VERIFY (t.in_size <=3D array_size (in)); + VERIFY (t.out_size <=3D array_size (out)); + auto state =3D mbstate_t{}; + auto in_next =3D (const char *) nullptr; + auto out_next =3D (InternT *) nullptr; + auto res =3D codecvt_base::result (); + + res + =3D cvt.in (state, in, in + t.in_size, in_next, out, end (out), out_next)= ; + VERIFY (res =3D=3D cvt.ok); + VERIFY (in_next =3D=3D in + t.in_size); + VERIFY (out_next =3D=3D out + t.out_size); + VERIFY (char_traits::compare (out, exp, t.out_size) =3D=3D = 0); + if (t.out_size < array_size (out)) + VERIFY (out[t.out_size] =3D=3D 0); + } +} + +template +void +utf16_to_utf32_in_partial (const std::codecvt &c= vt, + utf16_endianess endianess) +{ + using namespace std; + const char16_t input[] =3D u"b\u0448\uAAAA\U0010AAAA"; + const char32_t expected[] =3D U"b\u0448\uAAAA\U0010AAAA"; + static_assert (array_size (input) =3D=3D 6, ""); + static_assert (array_size (expected) =3D=3D 5, ""); + + char in[array_size (input) * 2]; + InternT exp[array_size (expected)]; + auto in_iter =3D begin (in); + utf16_to_bytes (begin (input), end (input), begin (in), endianess); + copy (begin (expected), end (expected), begin (exp)); + + test_offsets_partial offsets[] =3D { + {2, 0, 0, 0}, // no space for first CP + {1, 1, 0, 0}, // incomplete first CP + {1, 0, 0, 0}, // incomplete first CP, and no space for it + + {4, 1, 2, 1}, // no space for second CP + {3, 2, 2, 1}, // incomplete second CP + {3, 1, 2, 1}, // incomplete second CP, and no space for it + + {6, 2, 4, 2}, // no space for third CP + {5, 3, 4, 2}, // incomplete third CP + {5, 2, 4, 2}, 
// incomplete third CP, and no space for it + + {10, 3, 6, 3}, // no space for fourth CP + {7, 4, 6, 3}, // incomplete fourth CP + {8, 4, 6, 3}, // incomplete fourth CP + {9, 4, 6, 3}, // incomplete fourth CP + {7, 3, 6, 3}, // incomplete fourth CP, and no space for it + {8, 3, 6, 3}, // incomplete fourth CP, and no space for it + {9, 3, 6, 3}, // incomplete fourth CP, and no space for it + }; + + for (auto t : offsets) + { + InternT out[array_size (exp) - 1] =3D {}; + VERIFY (t.in_size <=3D array_size (in)); + VERIFY (t.out_size <=3D array_size (out)); + VERIFY (t.expected_in_next <=3D t.in_size); + VERIFY (t.expected_out_next <=3D t.out_size); + auto state =3D mbstate_t{}; + auto in_next =3D (const char *) nullptr; + auto out_next =3D (InternT *) nullptr; + auto res =3D codecvt_base::result (); + + res =3D cvt.in (state, in, in + t.in_size, in_next, out, out + t.out= _size, + out_next); + VERIFY (res =3D=3D cvt.partial); + VERIFY (in_next =3D=3D in + t.expected_in_next); + VERIFY (out_next =3D=3D out + t.expected_out_next); + VERIFY (char_traits::compare (out, exp, t.expected_out_next= ) + =3D=3D 0); + if (t.expected_out_next < array_size (out)) + VERIFY (out[t.expected_out_next] =3D=3D 0); + } +} + +template +void +utf16_to_utf32_in_error (const std::codecvt &cvt= , + utf16_endianess endianess) +{ + using namespace std; + char16_t input[] =3D u"b\u0448\uAAAA\U0010AAAA"; + const char32_t expected[] =3D U"b\u0448\uAAAA\U0010AAAA"; + static_assert (array_size (input) =3D=3D 6, ""); + static_assert (array_size (expected) =3D=3D 5, ""); + + InternT exp[array_size (expected)]; + copy (begin (expected), end (expected), begin (exp)); + + // The only possible error in UTF-16 is unpaired surrogate code units. + // So we replace valid code points (scalar values) with lone surrogate C= U. 
+ test_offsets_error offsets[] =3D { + {10, 4, 0, 0, 0xD800, 0}, + {10, 4, 0, 0, 0xDBFF, 0}, + {10, 4, 0, 0, 0xDC00, 0}, + {10, 4, 0, 0, 0xDFFF, 0}, + + {10, 4, 2, 1, 0xD800, 1}, + {10, 4, 2, 1, 0xDBFF, 1}, + {10, 4, 2, 1, 0xDC00, 1}, + {10, 4, 2, 1, 0xDFFF, 1}, + + {10, 4, 4, 2, 0xD800, 2}, + {10, 4, 4, 2, 0xDBFF, 2}, + {10, 4, 4, 2, 0xDC00, 2}, + {10, 4, 4, 2, 0xDFFF, 2}, + + // make the leading surrogate a trailing one + {10, 4, 6, 3, 0xDC00, 3}, + {10, 4, 6, 3, 0xDFFF, 3}, + + // make the trailing surrogate a leading one + {10, 4, 6, 3, 0xD800, 4}, + {10, 4, 6, 3, 0xDBFF, 4}, + + // make the trailing surrogate a BMP char + {10, 4, 6, 3, u'z', 4}, + }; + + for (auto t : offsets) + { + char in[array_size (input) * 2]; + InternT out[array_size (exp) - 1] =3D {}; + VERIFY (t.in_size <=3D array_size (in)); + VERIFY (t.out_size <=3D array_size (out)); + VERIFY (t.expected_in_next <=3D t.in_size); + VERIFY (t.expected_out_next <=3D t.out_size); + auto old_char =3D input[t.replace_pos]; + input[t.replace_pos] =3D t.replace_char; // replace in input, not in= in + utf16_to_bytes (begin (input), end (input), begin (in), endianess); + + auto state =3D mbstate_t{}; + auto in_next =3D (const char *) nullptr; + auto out_next =3D (InternT *) nullptr; + auto res =3D codecvt_base::result (); + + res =3D cvt.in (state, in, in + t.in_size, in_next, out, out + t.out= _size, + out_next); + VERIFY (res =3D=3D cvt.error); + VERIFY (in_next =3D=3D in + t.expected_in_next); + VERIFY (out_next =3D=3D out + t.expected_out_next); + VERIFY (char_traits::compare (out, exp, t.expected_out_next= ) + =3D=3D 0); + if (t.expected_out_next < array_size (out)) + VERIFY (out[t.expected_out_next] =3D=3D 0); + + input[t.replace_pos] =3D old_char; + } +} + +template +void +utf32_to_utf16_out_ok (const std::codecvt &cvt, + utf16_endianess endianess) +{ + using namespace std; + const char32_t input[] =3D U"b\u0448\uAAAA\U0010AAAA"; + const char16_t expected[] =3D u"b\u0448\uAAAA\U0010AAAA"; + 
static_assert (array_size (input) =3D=3D 5, ""); + static_assert (array_size (expected) =3D=3D 6, ""); + + InternT in[array_size (input)]; + char exp[array_size (expected) * 2]; + copy (begin (input), end (input), begin (in)); + utf16_to_bytes (begin (expected), end (expected), begin (exp), endianess= ); + + const test_offsets_ok offsets[] =3D {{0, 0}, {1, 2}, {2, 4}, {3, 6}, {4,= 10}}; + for (auto t : offsets) + { + char out[array_size (exp) - 2] =3D {}; + VERIFY (t.in_size <=3D array_size (in)); + VERIFY (t.out_size <=3D array_size (out)); + auto state =3D mbstate_t{}; + auto in_next =3D (const InternT *) nullptr; + auto out_next =3D (char *) nullptr; + auto res =3D codecvt_base::result (); + + res =3D cvt.out (state, in, in + t.in_size, in_next, out, out + t.ou= t_size, + out_next); + VERIFY (res =3D=3D cvt.ok); + VERIFY (in_next =3D=3D in + t.in_size); + VERIFY (out_next =3D=3D out + t.out_size); + VERIFY (char_traits::compare (out, exp, t.out_size) =3D=3D 0); + if (t.out_size < array_size (out)) + VERIFY (out[t.out_size] =3D=3D 0); + } +} + +template +void +utf32_to_utf16_out_partial (const std::codecvt &= cvt, + utf16_endianess endianess) +{ + using namespace std; + const char32_t input[] =3D U"b\u0448\uAAAA\U0010AAAA"; + const char16_t expected[] =3D u"b\u0448\uAAAA\U0010AAAA"; + static_assert (array_size (input) =3D=3D 5, ""); + static_assert (array_size (expected) =3D=3D 6, ""); + + InternT in[array_size (input)]; + char exp[array_size (expected) * 2]; + copy (begin (input), end (input), begin (in)); + utf16_to_bytes (begin (expected), end (expected), begin (exp), endianess= ); + + const test_offsets_partial offsets[] =3D { + {1, 0, 0, 0}, // no space for first CP + {1, 1, 0, 0}, // no space for first CP + + {2, 2, 1, 2}, // no space for second CP + {2, 3, 1, 2}, // no space for second CP + + {3, 4, 2, 4}, // no space for third CP + {3, 5, 2, 4}, // no space for third CP + + {4, 6, 3, 6}, // no space for fourth CP + {4, 7, 3, 6}, // no space for fourth CP 
+ {4, 8, 3, 6}, // no space for fourth CP + {4, 9, 3, 6}, // no space for fourth CP + }; + for (auto t : offsets) + { + char out[array_size (exp) - 2] =3D {}; + VERIFY (t.in_size <=3D array_size (in)); + VERIFY (t.out_size <=3D array_size (out)); + VERIFY (t.expected_in_next <=3D t.in_size); + VERIFY (t.expected_out_next <=3D t.out_size); + auto state =3D mbstate_t{}; + auto in_next =3D (const InternT *) nullptr; + auto out_next =3D (char *) nullptr; + auto res =3D codecvt_base::result (); + + res =3D cvt.out (state, in, in + t.in_size, in_next, out, out + t.ou= t_size, + out_next); + VERIFY (res =3D=3D cvt.partial); + VERIFY (in_next =3D=3D in + t.expected_in_next); + VERIFY (out_next =3D=3D out + t.expected_out_next); + VERIFY (char_traits::compare (out, exp, t.expected_out_next) = =3D=3D 0); + if (t.expected_out_next < array_size (out)) + VERIFY (out[t.expected_out_next] =3D=3D 0); + } +} + +template +void +utf32_to_utf16_out_error (const std::codecvt &cv= t, + utf16_endianess endianess) +{ + using namespace std; + const char32_t input[] =3D U"b\u0448\uAAAA\U0010AAAA"; + const char16_t expected[] =3D u"b\u0448\uAAAA\U0010AAAA"; + static_assert (array_size (input) =3D=3D 5, ""); + static_assert (array_size (expected) =3D=3D 6, ""); + + InternT in[array_size (input)]; + char exp[array_size (expected) * 2]; + copy (begin (input), end (input), begin (in)); + utf16_to_bytes (begin (expected), end (expected), begin (exp), endianess= ); + + test_offsets_error offsets[] =3D { + + // Surrogate CP + {4, 10, 0, 0, 0xD800, 0}, + {4, 10, 1, 2, 0xDBFF, 1}, + {4, 10, 2, 4, 0xDC00, 2}, + {4, 10, 3, 6, 0xDFFF, 3}, + + // CP out of range + {4, 10, 0, 0, 0x00110000, 0}, + {4, 10, 1, 2, 0x00110000, 1}, + {4, 10, 2, 4, 0x00110000, 2}, + {4, 10, 3, 6, 0x00110000, 3}}; + + for (auto t : offsets) + { + char out[array_size (exp) - 2] =3D {}; + VERIFY (t.in_size <=3D array_size (in)); + VERIFY (t.out_size <=3D array_size (out)); + VERIFY (t.expected_in_next <=3D t.in_size); + VERIFY 
(t.expected_out_next <=3D t.out_size); + auto old_char =3D in[t.replace_pos]; + in[t.replace_pos] =3D t.replace_char; + + auto state =3D mbstate_t{}; + auto in_next =3D (const InternT *) nullptr; + auto out_next =3D (char *) nullptr; + auto res =3D codecvt_base::result (); + + res =3D cvt.out (state, in, in + t.in_size, in_next, out, out + t.ou= t_size, + out_next); + VERIFY (res =3D=3D cvt.error); + VERIFY (in_next =3D=3D in + t.expected_in_next); + VERIFY (out_next =3D=3D out + t.expected_out_next); + VERIFY (char_traits::compare (out, exp, t.expected_out_next) = =3D=3D 0); + if (t.expected_out_next < array_size (out)) + VERIFY (out[t.expected_out_next] =3D=3D 0); + + in[t.replace_pos] =3D old_char; + } +} + +template +void +test_utf16_utf32_cvt (const std::codecvt &cvt, + utf16_endianess endianess) +{ + utf16_to_utf32_in_ok (cvt, endianess); + utf16_to_utf32_in_partial (cvt, endianess); + utf16_to_utf32_in_error (cvt, endianess); + utf32_to_utf16_out_ok (cvt, endianess); + utf32_to_utf16_out_partial (cvt, endianess); + utf32_to_utf16_out_error (cvt, endianess); +} + +template +void +utf16_to_ucs2_in_ok (const std::codecvt &cvt, + utf16_endianess endianess) +{ + using namespace std; + const char16_t input[] =3D u"b\u0448\uAAAA"; + const char16_t expected[] =3D u"b\u0448\uAAAA"; + static_assert (array_size (input) =3D=3D 4, ""); + static_assert (array_size (expected) =3D=3D 4, ""); + + char in[array_size (input) * 2]; + InternT exp[array_size (expected)]; + utf16_to_bytes (begin (input), end (input), begin (in), endianess); + copy (begin (expected), end (expected), begin (exp)); + + test_offsets_ok offsets[] =3D {{0, 0}, {2, 1}, {4, 2}, {6, 3}}; + for (auto t : offsets) + { + InternT out[array_size (exp) - 1] =3D {}; + VERIFY (t.in_size <=3D array_size (in)); + VERIFY (t.out_size <=3D array_size (out)); + auto state =3D mbstate_t{}; + auto in_next =3D (const char *) nullptr; + auto out_next =3D (InternT *) nullptr; + auto res =3D codecvt_base::result (); + + res 
=3D cvt.in (state, in, in + t.in_size, in_next, out, out + t.out= _size, + out_next); + VERIFY (res =3D=3D cvt.ok); + VERIFY (in_next =3D=3D in + t.in_size); + VERIFY (out_next =3D=3D out + t.out_size); + VERIFY (char_traits::compare (out, exp, t.out_size) =3D=3D = 0); + if (t.out_size < array_size (out)) + VERIFY (out[t.out_size] =3D=3D 0); + } + + for (auto t : offsets) + { + InternT out[array_size (exp)] =3D {}; + VERIFY (t.in_size <=3D array_size (in)); + VERIFY (t.out_size <=3D array_size (out)); + auto state =3D mbstate_t{}; + auto in_next =3D (const char *) nullptr; + auto out_next =3D (InternT *) nullptr; + auto res =3D codecvt_base::result (); + + res + =3D cvt.in (state, in, in + t.in_size, in_next, out, end (out), out_next)= ; + VERIFY (res =3D=3D cvt.ok); + VERIFY (in_next =3D=3D in + t.in_size); + VERIFY (out_next =3D=3D out + t.out_size); + VERIFY (char_traits::compare (out, exp, t.out_size) =3D=3D = 0); + if (t.out_size < array_size (out)) + VERIFY (out[t.out_size] =3D=3D 0); + } +} + +template +void +utf16_to_ucs2_in_partial (const std::codecvt &cv= t, + utf16_endianess endianess) +{ + using namespace std; + const char16_t input[] =3D u"b\u0448\uAAAA"; + const char16_t expected[] =3D u"b\u0448\uAAAA"; + static_assert (array_size (input) =3D=3D 4, ""); + static_assert (array_size (expected) =3D=3D 4, ""); + + char in[array_size (input) * 2]; + InternT exp[array_size (expected)]; + auto in_iter =3D begin (in); + utf16_to_bytes (begin (input), end (input), begin (in), endianess); + copy (begin (expected), end (expected), begin (exp)); + + test_offsets_partial offsets[] =3D { + {2, 0, 0, 0}, // no space for first CP + {1, 1, 0, 0}, // incomplete first CP + {1, 0, 0, 0}, // incomplete first CP, and no space for it + + {4, 1, 2, 1}, // no space for second CP + {3, 2, 2, 1}, // incomplete second CP + {3, 1, 2, 1}, // incomplete second CP, and no space for it + + {6, 2, 4, 2}, // no space for third CP + {5, 3, 4, 2}, // incomplete third CP + {5, 2, 4, 2}, 
// incomplete third CP, and no space for it + }; + + for (auto t : offsets) + { + InternT out[array_size (exp) - 1] =3D {}; + VERIFY (t.in_size <=3D array_size (in)); + VERIFY (t.out_size <=3D array_size (out)); + VERIFY (t.expected_in_next <=3D t.in_size); + VERIFY (t.expected_out_next <=3D t.out_size); + auto state =3D mbstate_t{}; + auto in_next =3D (const char *) nullptr; + auto out_next =3D (InternT *) nullptr; + auto res =3D codecvt_base::result (); + + res =3D cvt.in (state, in, in + t.in_size, in_next, out, out + t.out= _size, + out_next); + VERIFY (res =3D=3D cvt.partial); + VERIFY (in_next =3D=3D in + t.expected_in_next); + VERIFY (out_next =3D=3D out + t.expected_out_next); + VERIFY (char_traits::compare (out, exp, t.expected_out_next= ) + =3D=3D 0); + if (t.expected_out_next < array_size (out)) + VERIFY (out[t.expected_out_next] =3D=3D 0); + } +} + +template +void +utf16_to_ucs2_in_error (const std::codecvt &cvt, + utf16_endianess endianess) +{ + using namespace std; + char16_t input[] =3D u"b\u0448\uAAAA\U0010AAAA"; + const char16_t expected[] =3D u"b\u0448\uAAAA\U0010AAAA"; + static_assert (array_size (input) =3D=3D 6, ""); + static_assert (array_size (expected) =3D=3D 6, ""); + + InternT exp[array_size (expected)]; + copy (begin (expected), end (expected), begin (exp)); + + // The only possible error in UTF-16 is unpaired surrogate code units. + // Additionally, because the target encoding is UCS-2, a proper pair of + // surrogates is also error. Simply, any surrogate CU is error. 
+ test_offsets_error offsets[] =3D { + {6, 3, 0, 0, 0xD800, 0}, + {6, 3, 0, 0, 0xDBFF, 0}, + {6, 3, 0, 0, 0xDC00, 0}, + {6, 3, 0, 0, 0xDFFF, 0}, + + {6, 3, 2, 1, 0xD800, 1}, + {6, 3, 2, 1, 0xDBFF, 1}, + {6, 3, 2, 1, 0xDC00, 1}, + {6, 3, 2, 1, 0xDFFF, 1}, + + {6, 3, 4, 2, 0xD800, 2}, + {6, 3, 4, 2, 0xDBFF, 2}, + {6, 3, 4, 2, 0xDC00, 2}, + {6, 3, 4, 2, 0xDFFF, 2}, + + // make the leading surrogate a trailing one + {10, 5, 6, 3, 0xDC00, 3}, + {10, 5, 6, 3, 0xDFFF, 3}, + + // make the trailing surrogate a leading one + {10, 5, 6, 3, 0xD800, 4}, + {10, 5, 6, 3, 0xDBFF, 4}, + + // make the trailing surrogate a BMP char + {10, 5, 6, 3, u'z', 4}, + + // don't replace anything in the test cases below, just show the surr= ogate + // pair (fourth CP) fully or partially (just the first surrogate) + {10, 5, 6, 3, u'b', 0}, + {8, 5, 6, 3, u'b', 0}, + {9, 5, 6, 3, u'b', 0}, + + {10, 4, 6, 3, u'b', 0}, + {8, 4, 6, 3, u'b', 0}, + {9, 4, 6, 3, u'b', 0}, + }; + + for (auto t : offsets) + { + char in[array_size (input) * 2]; + InternT out[array_size (exp) - 1] =3D {}; + VERIFY (t.in_size <=3D array_size (in)); + VERIFY (t.out_size <=3D array_size (out)); + VERIFY (t.expected_in_next <=3D t.in_size); + VERIFY (t.expected_out_next <=3D t.out_size); + auto old_char =3D input[t.replace_pos]; + input[t.replace_pos] =3D t.replace_char; // replace in input, not in= in + utf16_to_bytes (begin (input), end (input), begin (in), endianess); + + auto state =3D mbstate_t{}; + auto in_next =3D (const char *) nullptr; + auto out_next =3D (InternT *) nullptr; + auto res =3D codecvt_base::result (); + + res =3D cvt.in (state, in, in + t.in_size, in_next, out, out + t.out= _size, + out_next); + VERIFY (res =3D=3D cvt.error); + VERIFY (in_next =3D=3D in + t.expected_in_next); + VERIFY (out_next =3D=3D out + t.expected_out_next); + VERIFY (char_traits::compare (out, exp, t.expected_out_next= ) + =3D=3D 0); + if (t.expected_out_next < array_size (out)) + VERIFY (out[t.expected_out_next] =3D=3D 0); + +
input[t.replace_pos] = old_char;
+    }
+}
+
+template <class InternT>
+void
+ucs2_to_utf16_out_ok (const std::codecvt<InternT, char, mbstate_t> &cvt,
+                      utf16_endianess endianess)
+{
+  using namespace std;
+  const char16_t input[] = u"b\u0448\uAAAA";
+  const char16_t expected[] = u"b\u0448\uAAAA";
+  static_assert (array_size (input) == 4, "");
+  static_assert (array_size (expected) == 4, "");
+
+  InternT in[array_size (input)];
+  char exp[array_size (expected) * 2];
+  copy (begin (input), end (input), begin (in));
+  utf16_to_bytes (begin (expected), end (expected), begin (exp), endianess);
+
+  const test_offsets_ok offsets[] = {{0, 0}, {1, 2}, {2, 4}, {3, 6}};
+  for (auto t : offsets)
+    {
+      char out[array_size (exp) - 2] = {};
+      VERIFY (t.in_size <= array_size (in));
+      VERIFY (t.out_size <= array_size (out));
+      auto state = mbstate_t{};
+      auto in_next = (const InternT *) nullptr;
+      auto out_next = (char *) nullptr;
+      auto res = codecvt_base::result ();
+
+      res = cvt.out (state, in, in + t.in_size, in_next, out, out + t.out_size,
+                     out_next);
+      VERIFY (res == cvt.ok);
+      VERIFY (in_next == in + t.in_size);
+      VERIFY (out_next == out + t.out_size);
+      VERIFY (char_traits<char>::compare (out, exp, t.out_size) == 0);
+      if (t.out_size < array_size (out))
+        VERIFY (out[t.out_size] == 0);
+    }
+}
+
+template <class InternT>
+void
+ucs2_to_utf16_out_partial (const std::codecvt<InternT, char, mbstate_t> &cvt,
+                           utf16_endianess endianess)
+{
+  using namespace std;
+  const char16_t input[] = u"b\u0448\uAAAA";
+  const char16_t expected[] = u"b\u0448\uAAAA";
+  static_assert (array_size (input) == 4, "");
+  static_assert (array_size (expected) == 4, "");
+
+  InternT in[array_size (input)];
+  char exp[array_size (expected) * 2];
+  copy (begin (input), end (input), begin (in));
+  utf16_to_bytes (begin (expected), end (expected), begin (exp), endianess);
+
+  const test_offsets_partial offsets[] = {
+    {1, 0, 0, 0}, // no space for first CP
+    {1, 1, 0, 0}, // no space for first CP
+
+    {2, 2, 1, 2}, // no space for second CP
+    {2, 3, 1, 2}, // no space for second CP
+
+    {3, 4, 2, 4}, // no space for third CP
+    {3, 5, 2, 4}, // no space for third CP
+  };
+  for (auto t : offsets)
+    {
+      char out[array_size (exp) - 2] = {};
+      VERIFY (t.in_size <= array_size (in));
+      VERIFY (t.out_size <= array_size (out));
+      VERIFY (t.expected_in_next <= t.in_size);
+      VERIFY (t.expected_out_next <= t.out_size);
+      auto state = mbstate_t{};
+      auto in_next = (const InternT *) nullptr;
+      auto out_next = (char *) nullptr;
+      auto res = codecvt_base::result ();
+
+      res = cvt.out (state, in, in + t.in_size, in_next, out, out + t.out_size,
+                     out_next);
+      VERIFY (res == cvt.partial);
+      VERIFY (in_next == in + t.expected_in_next);
+      VERIFY (out_next == out + t.expected_out_next);
+      VERIFY (char_traits<char>::compare (out, exp, t.expected_out_next) == 0);
+      if (t.expected_out_next < array_size (out))
+        VERIFY (out[t.expected_out_next] == 0);
+    }
+}
+
+template <class InternT>
+void
+ucs2_to_utf16_out_error (const std::codecvt<InternT, char, mbstate_t> &cvt,
+                         utf16_endianess endianess)
+{
+  using namespace std;
+  const char16_t input[] = u"b\u0448\uAAAA\U0010AAAA";
+  const char16_t expected[] = u"b\u0448\uAAAA\U0010AAAA";
+  static_assert (array_size (input) == 6, "");
+  static_assert (array_size (expected) == 6, "");
+
+  InternT in[array_size (input)];
+  char exp[array_size (expected) * 2];
+  copy (begin (input), end (input), begin (in));
+  utf16_to_bytes (begin (expected), end (expected), begin (exp), endianess);
+
+  test_offsets_error<InternT> offsets[] = {
+    {3, 6, 0, 0, 0xD800, 0},
+    {3, 6, 0, 0, 0xDBFF, 0},
+    {3, 6, 0, 0, 0xDC00, 0},
+    {3, 6, 0, 0, 0xDFFF, 0},
+
+    {3, 6, 1, 2, 0xD800, 1},
+    {3, 6, 1, 2, 0xDBFF, 1},
+    {3, 6, 1, 2, 0xDC00, 1},
+    {3, 6, 1, 2, 0xDFFF, 1},
+
+    {3, 6, 2, 4, 0xD800, 2},
+    {3, 6, 2, 4, 0xDBFF, 2},
+    {3, 6, 2, 4, 0xDC00, 2},
+    {3, 6, 2, 4, 0xDFFF, 2},
+
+    // make the leading surrogate a trailing one
+    {5, 10, 3, 6, 0xDC00, 3},
+    {5, 10, 3, 6, 0xDFFF, 3},
+
+    // make the trailing surrogate a leading one
+    {5, 10, 3, 6, 0xD800, 4},
+    {5, 10, 3, 6, 0xDBFF, 4},
+
+    // make the trailing surrogate a BMP char
+    {5, 10, 3, 6, u'z', 4},
+
+    // don't replace anything in the test cases below, just show the surrogate
+    // pair (fourth CP) fully or partially (just the first surrogate)
+    {5, 10, 3, 6, u'b', 0},
+    {5, 8, 3, 6, u'b', 0},
+    {5, 9, 3, 6, u'b', 0},
+
+    {4, 10, 3, 6, u'b', 0},
+    {4, 8, 3, 6, u'b', 0},
+    {4, 9, 3, 6, u'b', 0},
+  };
+
+  for (auto t : offsets)
+    {
+      char out[array_size (exp) - 2] = {};
+      VERIFY (t.in_size <= array_size (in));
+      VERIFY (t.out_size <= array_size (out));
+      VERIFY (t.expected_in_next <= t.in_size);
+      VERIFY (t.expected_out_next <= t.out_size);
+      auto old_char = in[t.replace_pos];
+      in[t.replace_pos] = t.replace_char;
+
+      auto state = mbstate_t{};
+      auto in_next = (const InternT *) nullptr;
+      auto out_next = (char *) nullptr;
+      auto res = codecvt_base::result ();
+
+      res = cvt.out (state, in, in + t.in_size, in_next, out, out + t.out_size,
+                     out_next);
+      VERIFY (res == cvt.error);
+      VERIFY (in_next == in + t.expected_in_next);
+      VERIFY (out_next == out + t.expected_out_next);
+      VERIFY (char_traits<char>::compare (out, exp, t.expected_out_next) == 0);
+      if (t.expected_out_next < array_size (out))
+        VERIFY (out[t.expected_out_next] == 0);
+
+      in[t.replace_pos] = old_char;
+    }
+}
+
+template <class InternT>
+void
+test_utf16_ucs2_cvt (const std::codecvt<InternT, char, mbstate_t> &cvt,
+                     utf16_endianess endianess)
+{
+  utf16_to_ucs2_in_ok (cvt, endianess);
+  utf16_to_ucs2_in_partial (cvt, endianess);
+  utf16_to_ucs2_in_error (cvt, endianess);
+  ucs2_to_utf16_out_ok (cvt, endianess);
+  ucs2_to_utf16_out_partial (cvt, endianess);
+  ucs2_to_utf16_out_error (cvt, endianess);
+}
diff --git a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode_char8_t.cc b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode_char8_t.cc
new file mode 100644
index 000000000..8ab5ba79f
--- /dev/null
+++ b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode_char8_t.cc
@@ -0,0 +1,53 @@
+// Copyright (C) 2020-2023 Free Software Foundation, Inc.
+//
+// This file is part of the GNU ISO C++ Library.  This library is free
+// software; you can redistribute it and/or modify it under the
+// terms of the GNU General Public License as published by the
+// Free Software Foundation; either version 3, or (at your option)
+// any later version.
+
+// This library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+
+// You should have received a copy of the GNU General Public License along
+// with this library; see the file COPYING3.  If not see
+// <http://www.gnu.org/licenses/>.
+
+// { dg-do run { target c++11 } }
+// { dg-require-cstdint "" }
+// { dg-options "-fchar8_t" }
+
+#include "codecvt_unicode.h"
+
+using namespace std;
+
+void
+test_utf8_utf32_codecvts ()
+{
+  using codecvt_c32_c8 = codecvt<char32_t, char8_t, mbstate_t>;
+  auto &loc_c = locale::classic ();
+  VERIFY (has_facet<codecvt_c32_c8> (loc_c));
+
+  auto &cvt = use_facet<codecvt_c32_c8> (loc_c);
+  test_utf8_utf32_cvt (cvt);
+}
+
+void
+test_utf8_utf16_codecvts ()
+{
+  using codecvt_c16_c8 = codecvt<char16_t, char8_t, mbstate_t>;
+  auto &loc_c = locale::classic ();
+  VERIFY (has_facet<codecvt_c16_c8> (loc_c));
+
+  auto &cvt = use_facet<codecvt_c16_c8> (loc_c);
+  test_utf8_utf16_cvt (cvt);
+}
+
+int
+main ()
+{
+  test_utf8_utf32_codecvts ();
+  test_utf8_utf16_codecvts ();
+}
diff --git a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc
index 4fd1bfec6..d6e5b20e8 100644
--- a/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc
+++ b/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc
@@ -28,7 +28,7 @@ test_utf8_utf32_codecvts ()
 {
 #if __SIZEOF_WCHAR_T__ == 4
   codecvt_utf8<wchar_t> cvt;
-  test_utf8_utf32_codecvts (cvt);
+  test_utf8_utf32_cvt (cvt);
 #endif
 }
 
@@ -37,7 +37,7 @@ test_utf8_utf16_codecvts ()
 {
 #if __SIZEOF_WCHAR_T__ >= 2
   codecvt_utf8_utf16<wchar_t> cvt;
-  test_utf8_utf16_cvts (cvt);
+  test_utf8_utf16_cvt (cvt);
 #endif
 }
 
@@ -46,7 +46,31 @@ test_utf8_ucs2_codecvts ()
 {
 #if __SIZEOF_WCHAR_T__ == 2
   codecvt_utf8<wchar_t> cvt;
-  test_utf8_ucs2_cvts (cvt);
+  test_utf8_ucs2_cvt (cvt);
+#endif
+}
+
+void
+test_utf16_utf32_codecvts ()
+{
+#if __SIZEOF_WCHAR_T__ == 4
+  codecvt_utf16<wchar_t> cvt3;
+  test_utf16_utf32_cvt (cvt3, utf16_big_endian);
+
+  codecvt_utf16<wchar_t, 0x10FFFF, codecvt_mode::little_endian> cvt4;
+  test_utf16_utf32_cvt (cvt4, utf16_little_endian);
+#endif
+}
+
+void
+test_utf16_ucs2_codecvts ()
+{
+#if __SIZEOF_WCHAR_T__ == 2
+  codecvt_utf16<wchar_t> cvt3;
+  test_utf16_ucs2_cvt (cvt3, utf16_big_endian);
+
+  codecvt_utf16<wchar_t, 0x10FFFF, codecvt_mode::little_endian> cvt4;
+  test_utf16_ucs2_cvt (cvt4, utf16_little_endian);
 #endif
 }
 
@@ -56,4 +80,6 @@ main ()
   test_utf8_utf32_codecvts ();
   test_utf8_utf16_codecvts ();
   test_utf8_ucs2_codecvts ();
+  test_utf16_utf32_codecvts ();
+  test_utf16_ucs2_codecvts ();
 }
-- 
2.34.1