From: Jonathan Wakely
Date: Fri, 29 Sep 2023 16:02:37 +0100
Subject: Re: [PATCH v3] libstdc++: Fix handling of surrogate CP in codecvt [PR108976]
To: Dimitrij Mijoski
Cc: gcc-patches@gcc.gnu.org, libstdc++@gcc.gnu.org

On Thu, 28 Sept 2023 at 20:39, Dimitrij Mijoski via Libstdc++ wrote:
>
> This patch fixes the handling of surrogate code points in all standard
> facets for transcoding Unicode that are based on std::codecvt. Surrogate
> code points should always be treated as error. On the other hand
> surrogate code units can only appear in UTF-16 and only when they come
> in a proper pair.
>
> Additionally, it fixes a bug in std::codecvt_utf16::in() when odd number
> of bytes were given in the range [from, from_end), error was returned
> always.
> The last byte in such range does not form a full UTF-16 code
> unit and we can not make any decisions for error, instead partial should
> be returned.
>
> The testsuite for testing these facets was updated in the following
> order:
>
> 1. All functions that test codecvts that work with UTF-8 were refactored
>    and made more generic so they accept codecvt that works with the char
>    type char8_t.
> 2. The same functions were updated with new test cases for transcoding
>    errors and now additionally test for surrogates, overlong UTF-8
>    sequences, code points out of the Unicode range, and more tests for
>    missing leading and trailing code units.
> 3. New tests were added to test codecvt_utf16 in both of its variants,
>    UTF-16 <-> UTF-32/UCS-4 and UTF-16 <-> UCS-2.
>
> libstdc++-v3/ChangeLog:
>
>         * src/c++11/codecvt.cc (read_utf8_code_point): Fix handling of
>         surrogates in UTF-8.
>         (ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-8.
>         (ucs4_in): Fix handling of range with odd number of bytes.
>         (ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-16.
>         (ucs2_out): Fix handling of surrogates in UCS-2 -> UTF-16.
>         (ucs2_in): Fix handling of range with odd number of bytes.
>         (__codecvt_utf16_base::do_in): Likewise.
>         (__codecvt_utf16_base::do_in): Likewise.
>         (__codecvt_utf16_base::do_in): Likewise.
>         * testsuite/22_locale/codecvt/codecvt_unicode.cc: Renames, add
>         tests for codecvt_utf16 and codecvt_utf16.
>         * testsuite/22_locale/codecvt/codecvt_unicode.h: Refactor UTF-8
>         testing functions for char8_t, add more test cases for errors,
>         add testing functions for codecvt_utf16.
>         * testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc:
>         Renames, add tests for codecvt_utf16.
>         * testsuite/22_locale/codecvt/codecvt_utf16/79980.cc (test06):
>         Fix test.
>         * testsuite/22_locale/codecvt/codecvt_unicode_char8_t.cc: New test.

Thanks, your v2 patch was still on my TODO list. I've pushed this version
to trunk now.
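
For readers of the archive, the two rules in the quoted description can be
illustrated with a small standalone sketch. It is not the code in
src/c++11/codecvt.cc: the names Result, is_valid_code_point, read_unit,
utf16be_to_utf32 and demo are hypothetical, and the input is assumed to be
big-endian UTF-16 without a BOM. It only shows the intended outcomes: a
surrogate code point is never valid on its own, an unpaired surrogate code
unit is an error, and an incomplete trailing unit or pair yields "partial"
rather than "error".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names for illustration; this is not the libstdc++ code.
enum class Result { ok, partial, error };

constexpr bool is_high_surrogate(char32_t c) { return c >= 0xD800 && c <= 0xDBFF; }
constexpr bool is_low_surrogate(char32_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// A code point may be encoded only if it is in range and not a surrogate.
constexpr bool is_valid_code_point(char32_t c)
{
    return c <= 0x10FFFF && !is_high_surrogate(c) && !is_low_surrogate(c);
}

// Read one big-endian UTF-16 code unit from two bytes.
inline char32_t read_unit(const unsigned char* p)
{
    return static_cast<char32_t>((p[0] << 8) | p[1]);
}

// Decode n bytes of big-endian UTF-16 into UTF-32 code points.
// `consumed` receives the number of bytes that were fully converted.
Result utf16be_to_utf32(const unsigned char* from, std::size_t n,
                        std::vector<char32_t>& out, std::size_t& consumed)
{
    consumed = 0;
    while (n - consumed >= 2) {
        const char32_t u1 = read_unit(from + consumed);
        if (is_low_surrogate(u1))
            return Result::error;        // low surrogate with no leading high
        if (!is_high_surrogate(u1)) {
            out.push_back(u1);           // ordinary BMP code point
            consumed += 2;
            continue;
        }
        if (n - consumed < 4)
            return Result::partial;      // second half of the pair not seen yet
        const char32_t u2 = read_unit(from + consumed + 2);
        if (!is_low_surrogate(u2))
            return Result::error;        // high surrogate not followed by low
        out.push_back(0x10000 + ((u1 - 0xD800) << 10) + (u2 - 0xDC00));
        consumed += 4;
    }
    // A single leftover byte is an incomplete code unit, not an error.
    return consumed == n ? Result::ok : Result::partial;
}

// Usage: the cases discussed in the patch description.
inline void demo()
{
    std::vector<char32_t> out;
    std::size_t used = 0;

    const unsigned char odd[] = {0x00, 0x41, 0x00};        // 'A' plus one stray byte
    assert(utf16be_to_utf32(odd, sizeof odd, out, used) == Result::partial);
    assert(used == 2 && out.size() == 1 && out[0] == U'A');

    out.clear();
    const unsigned char pair[] = {0xD8, 0x3D, 0xDE, 0x00}; // proper surrogate pair
    assert(utf16be_to_utf32(pair, sizeof pair, out, used) == Result::ok);
    assert(used == 4 && out.size() == 1 && out[0] == 0x1F600);

    out.clear();
    const unsigned char lone[] = {0xDC, 0x00};             // unpaired low surrogate
    assert(utf16be_to_utf32(lone, sizeof lone, out, used) == Result::error);

    assert(!is_valid_code_point(0xD800));                  // surrogate code point
    assert(is_valid_code_point(0x1F600));
}
```

Calling demo() exercises each case. Reporting the consumed byte count
alongside the result mirrors how a caller can resume once more input
arrives, which is why an odd-length range is treated as partial here.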