From: Ben Boeckel
To: gcc-patches@gcc.gnu.org
Cc: Ben Boeckel, jason@redhat.com, nathan@acm.org, fortran@gcc.gnu.org, gcc@gcc.gnu.org, brad.king@kitware.com
Subject: [PATCH v3 1/3] libcpp: reject codepoints above 0x10FFFF
Date: Tue, 8 Nov 2022 21:10:46 -0500
Message-Id: <20221109021048.2123704-2-ben.boeckel@kitware.com>
In-Reply-To: <20221109021048.2123704-1-ben.boeckel@kitware.com>
References: <20221109021048.2123704-1-ben.boeckel@kitware.com>

Unicode does not support such values because they are unrepresentable in
UTF-16.

libcpp/

	* charset.cc: Reject encodings of codepoints above 0x10FFFF.
	UTF-16 does not support such codepoints and therefore all
	Unicode rejects such values.

Signed-off-by: Ben Boeckel
---
 libcpp/charset.cc | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/libcpp/charset.cc b/libcpp/charset.cc
index 12a398e7527..324b5b19136 100644
--- a/libcpp/charset.cc
+++ b/libcpp/charset.cc
@@ -158,6 +158,10 @@ struct _cpp_strbuf
    encoded as any of DF 80, E0 9F 80, F0 80 9F 80, F8 80 80 9F 80, or
    FC 80 80 80 9F 80.  Only the first is valid.
 
+   Additionally, Unicode declares that all codepoints above 0010FFFF are
+   invalid because they cannot be represented in UTF-16.  As such, all 5- and
+   6-byte encodings are invalid.
+
    An implementation note: the transformation from UTF-16 to UTF-8, or
    vice versa, is easiest done by using UTF-32 as an intermediary.  */
 
@@ -216,7 +220,7 @@ one_utf8_to_cppchar (const uchar **inbufp, size_t *inbytesleftp,
   if (c <= 0x3FFFFFF && nbytes > 5) return EILSEQ;
 
   /* Make sure the character is valid.  */
-  if (c > 0x7FFFFFFF || (c >= 0xD800 && c <= 0xDFFF)) return EILSEQ;
+  if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return EILSEQ;
 
   *cp = c;
   *inbufp = inbuf;
@@ -320,7 +324,7 @@ one_utf32_to_utf8 (iconv_t bigend, const uchar **inbufp, size_t *inbytesleftp,
   s += inbuf[bigend ? 2 : 1] << 8;
   s += inbuf[bigend ? 3 : 0];
 
-  if (s >= 0x7FFFFFFF || (s >= 0xD800 && s <= 0xDFFF))
+  if (s > 0x10FFFF || (s >= 0xD800 && s <= 0xDFFF))
     return EILSEQ;
 
   rval = one_cppchar_to_utf8 (s, outbufp, outbytesleftp);
-- 
2.38.1
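
For reference, the 0x10FFFF limit and the claim about 5- and 6-byte
encodings can be checked with a little arithmetic. The following is only
an illustrative sketch, not part of the patch: the helper names
(max_utf16_codepoint, min_codepoint_for_utf8_length, is_valid_codepoint)
are hypothetical and do not exist in libcpp. It assumes the usual
UTF-16 surrogate layout (10 payload bits per surrogate, offset by
0x10000) and the historical UTF-8 length thresholds that the existing
shortest-encoding checks in one_utf8_to_cppchar already use.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only; none of these are libcpp
// functions.

// Largest codepoint a UTF-16 surrogate pair can express: each surrogate
// carries 10 payload bits, and the combined 20-bit value is offset by
// 0x10000.
constexpr std::uint32_t
max_utf16_codepoint ()
{
  return 0x10000u + ((0x3FFu << 10) | 0x3FFu);
}

// Smallest codepoint whose shortest (historical) UTF-8 encoding needs
// NBYTES bytes.  Returns 0 for lengths outside 2..6.
constexpr std::uint32_t
min_codepoint_for_utf8_length (int nbytes)
{
  return nbytes == 2 ? 0x80u
	 : nbytes == 3 ? 0x800u
	 : nbytes == 4 ? 0x10000u
	 : nbytes == 5 ? 0x200000u
	 : nbytes == 6 ? 0x4000000u
	 : 0u;
}

// Same predicate as the patched condition, with the sense inverted:
// true when C is neither above 0x10FFFF nor a surrogate.
constexpr bool
is_valid_codepoint (std::uint32_t c)
{
  return !(c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF));
}

// The UTF-16 ceiling is exactly the new limit.
static_assert (max_utf16_codepoint () == 0x10FFFF,
	       "surrogate pairs top out at 0x10FFFF");

// Any shortest-form 5- or 6-byte sequence decodes to a value above the
// limit, so the validity check rejects it.
static_assert (!is_valid_codepoint (min_codepoint_for_utf8_length (5)),
	       "5-byte encodings are out of range");
static_assert (!is_valid_codepoint (min_codepoint_for_utf8_length (6)),
	       "6-byte encodings are out of range");
```

Note that 4-byte sequences are only partly affected: they can encode
values up to 0x1FFFFF, so those decoding to anything above 0x10FFFF are
rejected by the same check.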