From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by sourceware.org (Postfix) with ESMTPS id B21E73898C73 for ; Mon, 14 Mar 2022 13:10:53 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org B21E73898C73 Received: from mimecast-mx02.redhat.com (mx3-rdu2.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-36-CJnSjbmJNf-cznPbHUsFxw-1; Mon, 14 Mar 2022 09:10:50 -0400 X-MC-Unique: CJnSjbmJNf-cznPbHUsFxw-1 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.rdu2.redhat.com [10.11.54.2]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id F23D12999B30; Mon, 14 Mar 2022 13:10:49 +0000 (UTC) Received: from localhost (unknown [10.33.36.3]) by smtp.corp.redhat.com (Postfix) with ESMTP id BA76B407EE62; Mon, 14 Mar 2022 13:10:49 +0000 (UTC) From: Jonathan Wakely To: libstdc++@gcc.gnu.org, gcc-patches@gcc.gnu.org Subject: [committed] libstdc++: Fix reading UTF-8 characters for 16-bit targets [PR104875] Date: Mon, 14 Mar 2022 13:10:49 +0000 Message-Id: <20220314131049.2080585-1-jwakely@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.84 on 10.11.54.2 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset="US-ASCII" X-Spam-Status: No, score=-12.5 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H5, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_NONE, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libstdc++@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libstdc++ mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 14 Mar 2022 13:10:56 -0000 Tested powerpc64le-linux and sparc-sun-solaris2.11, pushed to trunk. This should be backported too. It's not a regression, but this code is just broken on 16-bit targets. -- >8 -- The current code in read_utf8_code_point assumes that integer promotion will create a 32-bit int, but that's not true for 16-bit targets like msp430 and avr. This changes the intermediate variables used for each octet from unsigned char to char32_t, so that (c << N) works correctly when N > 8. libstdc++-v3/ChangeLog: PR libstdc++/104875 * src/c++11/codecvt.cc (read_utf8_code_point): Use char32_t to hold octets that will be left-shifted. --- libstdc++-v3/src/c++11/codecvt.cc | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/libstdc++-v3/src/c++11/codecvt.cc b/libstdc++-v3/src/c++11/codecvt.cc index d9f2dacb647..9f8cb767732 100644 --- a/libstdc++-v3/src/c++11/codecvt.cc +++ b/libstdc++-v3/src/c++11/codecvt.cc @@ -254,7 +254,7 @@ namespace const size_t avail = from.size(); if (avail == 0) return incomplete_mb_character; - unsigned char c1 = from[0]; + char32_t c1 = (unsigned char) from[0]; // https://en.wikipedia.org/wiki/UTF-8#Sample_code if (c1 < 0x80) { @@ -267,7 +267,7 @@ namespace { if (avail < 2) return incomplete_mb_character; - unsigned char c2 = from[1]; + char32_t c2 = (unsigned char) from[1]; if ((c2 & 0xC0) != 0x80) return invalid_mb_sequence; char32_t c = (c1 << 6) + c2 - 0x3080; @@ -279,12 +279,12 @@ namespace { if (avail < 3) return incomplete_mb_character; - unsigned char c2 = from[1]; + char32_t c2 = (unsigned char) from[1]; if ((c2 & 0xC0) != 0x80) return invalid_mb_sequence; if (c1 == 0xE0 && c2 < 0xA0) // overlong return invalid_mb_sequence; - unsigned char c3 = from[2]; + char32_t c3 = (unsigned char) from[2]; if ((c3 & 0xC0) != 0x80) return invalid_mb_sequence; char32_t c = (c1 << 12) + (c2 << 6) + c3 - 0xE2080; @@ -296,17 +296,17 @@ namespace { if (avail < 4) return incomplete_mb_character; - unsigned char c2 = from[1]; + char32_t c2 = (unsigned char) from[1]; if ((c2 & 0xC0) != 0x80) return invalid_mb_sequence; if (c1 == 0xF0 && c2 < 0x90) // overlong return invalid_mb_sequence; if (c1 == 0xF4 && c2 >= 0x90) // > U+10FFFF return invalid_mb_sequence; - unsigned char c3 = from[2]; + char32_t c3 = (unsigned char) from[2]; if ((c3 & 0xC0) != 0x80) return invalid_mb_sequence; - unsigned char c4 = from[3]; + char32_t c4 = (unsigned char) from[3]; if ((c4 & 0xC0) != 0x80) return invalid_mb_sequence; char32_t c = (c1 << 18) + (c2 << 12) + (c3 << 6) + c4 - 0x3C82080; -- 2.34.1