From mboxrd@z Thu Jan  1 00:00:00 1970
From: Ben Boeckel
To: gcc-patches@gcc.gnu.org
Cc: Ben Boeckel, jason@redhat.com, nathan@acm.org, fortran@gcc.gnu.org,
	gcc@gcc.gnu.org, brad.king@kitware.com
Subject: [PATCH v5 1/5] libcpp: reject codepoints above 0x10FFFF
Date: Wed, 25 Jan 2023 16:06:32 -0500
Message-Id: <20230125210636.2960049-2-ben.boeckel@kitware.com>
X-Mailer: git-send-email 2.39.0
In-Reply-To: <20230125210636.2960049-1-ben.boeckel@kitware.com>
References: <20230125210636.2960049-1-ben.boeckel@kitware.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Unicode does not support such values because they are unrepresentable in
UTF-16.

libcpp/

	* charset.cc: Reject encodings of codepoints above 0x10FFFF.
	UTF-16 does not support such codepoints and therefore all
	Unicode rejects such values.

Signed-off-by: Ben Boeckel
---
 libcpp/charset.cc | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/libcpp/charset.cc b/libcpp/charset.cc
index 3c47d4f868b..f7ae12ea5a2 100644
--- a/libcpp/charset.cc
+++ b/libcpp/charset.cc
@@ -158,6 +158,10 @@ struct _cpp_strbuf
    encoded as any of DF 80, E0 9F 80, F0 80 9F 80, F8 80 80 9F 80, or
    FC 80 80 80 9F 80.  Only the first is valid.
 
+   Additionally, Unicode declares that all codepoints above 0010FFFF are
+   invalid because they cannot be represented in UTF-16.  As such, all 5- and
+   6-byte encodings are invalid.
+
    An implementation note: the transformation from UTF-16 to UTF-8, or
    vice versa, is easiest done by using UTF-32 as an intermediary.  */
 
@@ -216,7 +220,7 @@ one_utf8_to_cppchar (const uchar **inbufp, size_t *inbytesleftp,
   if (c <= 0x3FFFFFF && nbytes > 5) return EILSEQ;
 
   /* Make sure the character is valid.  */
-  if (c > 0x7FFFFFFF || (c >= 0xD800 && c <= 0xDFFF)) return EILSEQ;
+  if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return EILSEQ;
 
   *cp = c;
   *inbufp = inbuf;
@@ -320,7 +324,7 @@ one_utf32_to_utf8 (iconv_t bigend, const uchar **inbufp, size_t *inbytesleftp,
   s += inbuf[bigend ? 2 : 1] << 8;
   s += inbuf[bigend ? 3 : 0];
 
-  if (s >= 0x7FFFFFFF || (s >= 0xD800 && s <= 0xDFFF))
+  if (s > 0x10FFFF || (s >= 0xD800 && s <= 0xDFFF))
     return EILSEQ;
 
   rval = one_cppchar_to_utf8 (s, outbufp, outbytesleftp);
-- 
2.39.0
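
For readers who want to see the check in isolation, here is a minimal
standalone sketch of the validity rule that both hunks apply. It is not
part of the patch or of libcpp: the helper name is_unicode_scalar_value
and the use of uint32_t are assumptions made only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not in libcpp): return true if C is a value that
   the patched checks would accept, i.e. at most 0x10FFFF and outside the
   UTF-16 surrogate range D800..DFFF.  */
static bool
is_unicode_scalar_value (uint32_t c)
{
  /* Codepoints above 0x10FFFF cannot be represented in UTF-16.  */
  if (c > 0x10FFFF)
    return false;

  /* Surrogates are reserved for UTF-16 pairs and are never valid
     on their own.  */
  if (c >= 0xD800 && c <= 0xDFFF)
    return false;

  return true;
}
```

Under this rule 0x10FFFF is still accepted while 0x110000 is rejected,
which is why the patch uses ">" rather than ">=" in both comparisons.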