From: Ben Boeckel
To: gcc-patches@gcc.gnu.org
Cc: Ben Boeckel, jason@redhat.com, nathan@acm.org, fortran@gcc.gnu.org, gcc@gcc.gnu.org, brad.king@kitware.com
Subject: [PATCH v4 1/3] libcpp: reject codepoints above 0x10FFFF
Date: Sat, 10 Dec 2022 17:20:48 -0500
Message-Id: <20221210222050.1674457-2-ben.boeckel@kitware.com>
In-Reply-To: <20221210222050.1674457-1-ben.boeckel@kitware.com>
References: <20221210222050.1674457-1-ben.boeckel@kitware.com>

Unicode does not support such values because they are unrepresentable in
UTF-16.

libcpp/

	* charset.cc: Reject encodings of codepoints above 0x10FFFF.
	UTF-16 does not support such codepoints and therefore all
	Unicode rejects such values.

Signed-off-by: Ben Boeckel
---
 libcpp/charset.cc | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/libcpp/charset.cc b/libcpp/charset.cc
index 12a398e7527..324b5b19136 100644
--- a/libcpp/charset.cc
+++ b/libcpp/charset.cc
@@ -158,6 +158,10 @@ struct _cpp_strbuf
    encoded as any of DF 80, E0 9F 80, F0 80 9F 80, F8 80 80 9F 80, or
    FC 80 80 80 9F 80. Only the first is valid.
 
+   Additionally, Unicode declares that all codepoints above 0010FFFF are
+   invalid because they cannot be represented in UTF-16. As such, all 5- and
+   6-byte encodings are invalid.
+
    An implementation note: the transformation from UTF-16 to UTF-8, or
    vice versa, is easiest done by using UTF-32 as an intermediary. */
 
@@ -216,7 +220,7 @@ one_utf8_to_cppchar (const uchar **inbufp, size_t *inbytesleftp,
   if (c <= 0x3FFFFFF && nbytes > 5) return EILSEQ;
 
   /* Make sure the character is valid. */
-  if (c > 0x7FFFFFFF || (c >= 0xD800 && c <= 0xDFFF)) return EILSEQ;
+  if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return EILSEQ;
 
   *cp = c;
   *inbufp = inbuf;
@@ -320,7 +324,7 @@ one_utf32_to_utf8 (iconv_t bigend, const uchar **inbufp, size_t *inbytesleftp,
   s += inbuf[bigend ? 2 : 1] << 8;
   s += inbuf[bigend ? 3 : 0];
 
-  if (s >= 0x7FFFFFFF || (s >= 0xD800 && s <= 0xDFFF))
+  if (s > 0x10FFFF || (s >= 0xD800 && s <= 0xDFFF))
     return EILSEQ;
 
   rval = one_cppchar_to_utf8 (s, outbufp, outbytesleftp);
-- 
2.38.1
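
For reviewers who want to try the new bounds outside of libcpp, here is a minimal standalone sketch of the same validity rule: reject anything above 0x10FFFF, reject the surrogate range D800..DFFF, and reject non-shortest forms. The names `is_unicode_scalar` and `decode_one_utf8` are hypothetical and do not exist in charset.cc. The return convention (bytes consumed, 0 on error) is an assumption made for brevity and differs from the EILSEQ convention used by `one_utf8_to_cppchar`.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of libcpp): true if c is at most 0x10FFFF
// and lies outside the UTF-16 surrogate range D800..DFFF.
constexpr bool is_unicode_scalar (std::uint32_t c)
{
  return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

// Hypothetical decoder (not part of libcpp): decode one UTF-8 sequence
// from buf[0..len).  Returns the number of bytes consumed and stores the
// codepoint in *out, or returns 0 for truncated or ill-formed input.
inline std::size_t decode_one_utf8 (const unsigned char *buf, std::size_t len,
                                    std::uint32_t *out)
{
  if (len == 0)
    return 0;

  const unsigned char b0 = buf[0];
  std::size_t nbytes;
  std::uint32_t c;
  std::uint32_t min;  // smallest codepoint that needs nbytes bytes

  if (b0 < 0x80)
    { nbytes = 1; c = b0; min = 0; }
  else if ((b0 & 0xE0) == 0xC0)
    { nbytes = 2; c = b0 & 0x1F; min = 0x80; }
  else if ((b0 & 0xF0) == 0xE0)
    { nbytes = 3; c = b0 & 0x0F; min = 0x800; }
  else if ((b0 & 0xF8) == 0xF0)
    { nbytes = 4; c = b0 & 0x07; min = 0x10000; }
  else
    return 0;  // stray continuation byte, 5-/6-byte lead byte, or FE/FF

  if (len < nbytes)
    return 0;

  for (std::size_t i = 1; i < nbytes; i++)
    {
      if ((buf[i] & 0xC0) != 0x80)
        return 0;  // not a continuation byte
      c = (c << 6) | (buf[i] & 0x3F);
    }

  // Reject overlong encodings, surrogates, and values above 0x10FFFF.
  if (c < min || !is_unicode_scalar (c))
    return 0;

  *out = c;
  return nbytes;
}
```

With this sketch, DF 80 decodes to U+07C0, while E0 9F 80 (overlong), ED A0 80 (surrogate D800), F4 90 80 80 (0x110000), and any 5- or 6-byte sequence are all rejected.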