From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 29 Sep 2021 14:16:03 +0100
From: Jonathan Wakely
To: libstdc++@gcc.gnu.org, gcc-patches@gcc.gnu.org
Subject: [committed] libstdc++: std::basic_regex should treat '\0' as an ordinary char [PR84110]
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="7bQJOceIl41nOVfE"
Content-Disposition: inline
List-Id: Gcc-patches mailing list
X-List-Received-Date: Wed, 29 Sep 2021 13:16:11 -0000

--7bQJOceIl41nOVfE
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

When the input sequence contains a _CharT(0) character, the strchr call
in _Scanner<_CharT>::_M_scan_normal() will search for '\0' and so return
a pointer to the terminating null at the end of the string. This makes
the scanner think it's found a special character. Because it doesn't
match any of the actual special characters, we fall off the end of the
function (or assert in debug mode).

We should check for a null character explicitly and either treat it as
an ordinary character (for the ECMAScript grammar) or an error (for all
others). I'm not 100% sure that's right, but it seems consistent with
the POSIX RE rules where a '\0' means the end of the regex pattern or
the end of the sequence being matched.

Signed-off-by: Jonathan Wakely

libstdc++-v3/ChangeLog:

	PR libstdc++/84110
	* include/bits/regex_error.h (regex_constants::_S_null): New
	error code for internal use.
	* include/bits/regex_scanner.tcc (_Scanner::_M_scan_normal()):
	Check for null character.
	* testsuite/28_regex/basic_regex/84110.cc: New test.

Tested x86_64-linux. Committed to trunk.

--7bQJOceIl41nOVfE
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="patch.txt"

commit b701e1f8f6870c0f8cb4050674da489101dd05a5
Author: Jonathan Wakely
Date:   Wed Sep 29 13:48:11 2021

    libstdc++: std::basic_regex should treat '\0' as an ordinary char [PR84110]

    When the input sequence contains a _CharT(0) character, the strchr call
    in _Scanner<_CharT>::_M_scan_normal() will search for '\0' and so return
    a pointer to the terminating null at the end of the string. This makes
    the scanner think it's found a special character. Because it doesn't
    match any of the actual special characters, we fall off the end of the
    function (or assert in debug mode).

    We should check for a null character explicitly and either treat it as
    an ordinary character (for the ECMAScript grammar) or an error (for all
    others). I'm not 100% sure that's right, but it seems consistent with
    the POSIX RE rules where a '\0' means the end of the regex pattern or
    the end of the sequence being matched.

    Signed-off-by: Jonathan Wakely

    libstdc++-v3/ChangeLog:

            PR libstdc++/84110
            * include/bits/regex_error.h (regex_constants::_S_null): New
            error code for internal use.
            * include/bits/regex_scanner.tcc (_Scanner::_M_scan_normal()):
            Check for null character.
            * testsuite/28_regex/basic_regex/84110.cc: New test.

diff --git a/libstdc++-v3/include/bits/regex_error.h b/libstdc++-v3/include/bits/regex_error.h
index d3713fa5f47..722ce26cda3 100644
--- a/libstdc++-v3/include/bits/regex_error.h
+++ b/libstdc++-v3/include/bits/regex_error.h
@@ -61,6 +61,7 @@ namespace regex_constants
       _S_error_badrepeat,
       _S_error_complexity,
       _S_error_stack,
+      _S_null
     };
 
   /** The expression contained an invalid collating element name. */
diff --git a/libstdc++-v3/include/bits/regex_scanner.tcc b/libstdc++-v3/include/bits/regex_scanner.tcc
index b2b709ce3cb..d81627dc3e9 100644
--- a/libstdc++-v3/include/bits/regex_scanner.tcc
+++ b/libstdc++-v3/include/bits/regex_scanner.tcc
@@ -175,6 +175,16 @@ namespace __detail
           _M_state = _S_state_in_brace;
           _M_token = _S_token_interval_begin;
         }
+      else if (__builtin_expect(__c == _CharT(0), false))
+        {
+          if (!_M_is_ecma())
+            {
+              __throw_regex_error(regex_constants::_S_null,
+                                  "Unexpected null character in regular expression");
+            }
+          _M_token = _S_token_ord_char;
+          _M_value.assign(1, __c);
+        }
       else if (__c != ']' && __c != '}')
         {
           auto __it = _M_token_tbl;
diff --git a/libstdc++-v3/testsuite/28_regex/basic_regex/84110.cc b/libstdc++-v3/testsuite/28_regex/basic_regex/84110.cc
new file mode 100644
index 00000000000..b9971dcaac5
--- /dev/null
+++ b/libstdc++-v3/testsuite/28_regex/basic_regex/84110.cc
@@ -0,0 +1,39 @@
+// { dg-do run { target c++11 } }
+#include <regex>
+#include <string>
+#include <testsuite_hooks.h>
+
+void test01()
+{
+  const std::string s(1ul, '\0');
+  std::regex re(s);
+  VERIFY( std::regex_match(s, re) ); // PR libstdc++/84110
+
+#if __cpp_exceptions
+  using namespace std::regex_constants;
+  for (auto syn : {basic, extended, awk, grep, egrep})
+  {
+    try
+    {
+      std::regex{s, syn}; // '\0' is not valid for other grammars
+      VERIFY( false );
+    }
+    catch (const std::regex_error&)
+    {
+    }
+  }
+#endif
+}
+
+void test02()
+{
+  const std::string s("uh-\0h", 5);
+  std::regex re(s);
+  VERIFY( std::regex_match(s, re) );
+}
+
+int main()
+{
+  test01();
+  test02();
+}

--7bQJOceIl41nOVfE--
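
The strchr behaviour described in the commit message can be reproduced outside libstdc++. The following is only a minimal sketch: the function names and the list of special characters are hypothetical and are not taken from regex_scanner.tcc. It shows why a lookup based on strchr reports '\0' as "found", and how an explicit null check (the approach the patch takes) avoids that.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical list of "special" characters, for illustration only.
// It is not the table used by the real _Scanner.
static const char* const kSpecials = "^$\\.*+?()[]{}|";

// Naive lookup: strchr treats the terminating null as part of the
// string, so searching for '\0' returns a non-null pointer.
bool is_special_naive(char c)
{
  return std::strchr(kSpecials, c) != nullptr;
}

// Lookup with an explicit null check before calling strchr.
bool is_special_checked(char c)
{
  if (c == '\0')
    return false;
  return std::strchr(kSpecials, c) != nullptr;
}

// Small self-check exercising both helpers.
void check_examples()
{
  assert(is_special_naive('*'));
  assert(!is_special_naive('a'));
  // The problem case: '\0' is reported as special by the naive lookup.
  assert(is_special_naive('\0'));
  // strchr points at the terminator, i.e. one past the last real char.
  assert(std::strchr(kSpecials, '\0') == kSpecials + std::strlen(kSpecials));
  // With the explicit check, '\0' is no longer misclassified.
  assert(!is_special_checked('\0'));
  assert(is_special_checked('*'));
}
```

Calling check_examples() from a main function exercises both helpers. What to do with the null character once it is detected (ordinary character or error) is a separate grammar decision, as discussed in the commit message.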