* [Patch, libstdc++/77356] Support escape in regex bracket expression
@ 2016-08-24 7:18 Tim Shen
2016-08-24 8:41 ` Jonathan Wakely
0 siblings, 1 reply; 5+ messages in thread
From: Tim Shen @ 2016-08-24 7:18 UTC
To: libstdc++, gcc-patches
[-- Attachment #1: Type: text/plain, Size: 279 bytes --]
I didn't realize that we can actually escape a dash inside a bracket
expression: R"([\-])", in which case the dash should be treated
literally.
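As a rough illustration of the behaviour described above (a sketch, not part of the patch), the snippet below checks that an escaped dash in an ECMAScript bracket expression matches a literal dash. `matches_whole` and `escaped_dash_demo` are hypothetical names introduced here, and the checks assume a standard library that already handles the escaped dash.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper for illustration: true if `text` matches `pattern`
// in full, using the default ECMAScript grammar.
bool matches_whole(const std::string& pattern, const std::string& text)
{
  const std::regex re(pattern);
  return std::regex_match(text, re);
}

// Assumes a standard library that already handles the escaped dash.
void escaped_dash_demo()
{
  assert(matches_whole(R"([\-])", "-"));    // escaped dash is a literal
  assert(!matches_whole(R"([\-])", "a"));   // nothing else is in the set
  // Bracket expression taken from the new testcase:
  // digits, ',', '.', '-', whitespace and 'k'.
  assert(matches_whole(R"([0-9,\.\-\sk]+)", "1,000.5-2k"));
}
```

Calling `escaped_dash_demo()` from a test driver is enough to exercise all three checks.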
Tell me if you feel like we need more documentation. :P
Bootstrapped and tested on x86_64-linux-gnu.
Thanks!
--
Regards,
Tim Shen
[-- Attachment #2: a.diff --]
[-- Type: text/plain, Size: 7281 bytes --]
commit 404a69b17f7fddc9856e9664e304f122601f212f
Author: Tim Shen <timshen@google.com>
Date: Wed Aug 24 00:06:54 2016 -0700
2016-08-25 Tim Shen <timshen@google.com>
PR libstdc++/77356
* include/bits/regex_compiler.tcc(_M_insert_bracket_matcher,
_M_expression_term): Modify to support dash literal.
* include/bits/regex_scanner.h: Add dash as a token type to make
it different from the mandated dash literal by escaping.
* include/bits/regex_scanner.tcc(_M_scan_in_bracket): Emit dash
token in bracket expression parsing.
* testsuite/28_regex/regression.cc: Add new testcase.
diff --git a/libstdc++-v3/include/bits/regex_compiler.tcc b/libstdc++-v3/include/bits/regex_compiler.tcc
index ff69e16..3ffa170 100644
--- a/libstdc++-v3/include/bits/regex_compiler.tcc
+++ b/libstdc++-v3/include/bits/regex_compiler.tcc
@@ -426,13 +426,21 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
pair<bool, _CharT> __last_char; // Optional<_CharT>
__last_char.first = false;
if (!(_M_flags & regex_constants::ECMAScript))
- if (_M_try_char())
- {
- __matcher._M_add_char(_M_value[0]);
- __last_char.first = true;
- __last_char.second = _M_value[0];
- }
+ {
+ if (_M_try_char())
+ {
+ __last_char.first = true;
+ __last_char.second = _M_value[0];
+ }
+ else if (_M_match_token(_ScannerT::_S_token_bracket_dash))
+ {
+ __last_char.first = true;
+ __last_char.second = '-';
+ }
+ }
while (_M_expression_term(__last_char, __matcher));
+ if (__last_char.first)
+ __matcher._M_add_char(__last_char.second);
__matcher._M_ready();
_M_stack.push(_StateSeqT(
*_M_nfa,
@@ -449,19 +457,35 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
if (_M_match_token(_ScannerT::_S_token_bracket_end))
return false;
+ const auto __push_char = [&](_CharT __ch)
+ {
+ if (__last_char.first)
+ __matcher._M_add_char(__last_char.second);
+ else
+ __last_char.first = true;
+ __last_char.second = __ch;
+ };
+
if (_M_match_token(_ScannerT::_S_token_collsymbol))
{
auto __symbol = __matcher._M_add_collate_element(_M_value);
if (__symbol.size() == 1)
- {
- __last_char.first = true;
- __last_char.second = __symbol[0];
- }
+ __push_char(__symbol[0]);
+ else
+ __last_char.first = false;
}
else if (_M_match_token(_ScannerT::_S_token_equiv_class_name))
- __matcher._M_add_equivalence_class(_M_value);
+ {
+ __last_char.first = false;
+ __matcher._M_add_equivalence_class(_M_value);
+ }
else if (_M_match_token(_ScannerT::_S_token_char_class_name))
- __matcher._M_add_character_class(_M_value, false);
+ {
+ __last_char.first = false;
+ __matcher._M_add_character_class(_M_value, false);
+ }
+ else if (_M_try_char())
+ __push_char(_M_value[0]);
// POSIX doesn't allow '-' as a start-range char (say [a-z--0]),
// except when the '-' is the first or last character in the bracket
// expression ([--0]). ECMAScript treats all '-' after a range as a
@@ -472,55 +496,55 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
// Clang (3.5) always uses ECMAScript style even in its POSIX syntax.
//
// It turns out that no one reads BNFs ;)
- else if (_M_try_char())
+ else if (_M_match_token(_ScannerT::_S_token_bracket_dash))
{
if (!__last_char.first)
{
- __matcher._M_add_char(_M_value[0]);
- if (_M_value[0] == '-'
- && !(_M_flags & regex_constants::ECMAScript))
+ if (!(_M_flags & regex_constants::ECMAScript))
{
if (_M_match_token(_ScannerT::_S_token_bracket_end))
- return false;
+ {
+ __push_char('-');
+ return false;
+ }
__throw_regex_error(
regex_constants::error_range,
"Unexpected dash in bracket expression. For POSIX syntax, "
"a dash is not treated literally only when it is at "
"beginning or end.");
}
- __last_char.first = true;
- __last_char.second = _M_value[0];
+ __push_char('-');
}
else
{
- if (_M_value[0] == '-')
+ if (_M_try_char())
{
- if (_M_try_char())
- {
- __matcher._M_make_range(__last_char.second , _M_value[0]);
- __last_char.first = false;
- }
- else
- {
- if (_M_scanner._M_get_token()
- != _ScannerT::_S_token_bracket_end)
- __throw_regex_error(
- regex_constants::error_range,
- "Unexpected end of bracket expression.");
- __matcher._M_add_char(_M_value[0]);
- }
+ __matcher._M_make_range(__last_char.second, _M_value[0]);
+ __last_char.first = false;
+ }
+ else if (_M_match_token(_ScannerT::_S_token_bracket_dash))
+ {
+ __matcher._M_make_range(__last_char.second, '-');
+ __last_char.first = false;
}
else
{
- __matcher._M_add_char(_M_value[0]);
- __last_char.second = _M_value[0];
+ if (_M_scanner._M_get_token()
+ != _ScannerT::_S_token_bracket_end)
+ __throw_regex_error(
+ regex_constants::error_range,
+ "Character is expected after a dash.");
+ __push_char(_M_value[0]);
}
}
}
else if (_M_match_token(_ScannerT::_S_token_quoted_class))
- __matcher._M_add_character_class(_M_value,
- _M_ctype.is(_CtypeT::upper,
- _M_value[0]));
+ {
+ __last_char.first = false;
+ __matcher._M_add_character_class(_M_value,
+ _M_ctype.is(_CtypeT::upper,
+ _M_value[0]));
+ }
else
__throw_regex_error(regex_constants::error_brack,
"Unexpected character in bracket expression.");
diff --git a/libstdc++-v3/include/bits/regex_scanner.h b/libstdc++-v3/include/bits/regex_scanner.h
index 37dea84..2a83d1c 100644
--- a/libstdc++-v3/include/bits/regex_scanner.h
+++ b/libstdc++-v3/include/bits/regex_scanner.h
@@ -73,6 +73,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
_S_token_comma,
_S_token_dup_count,
_S_token_eof,
+ _S_token_bracket_dash,
_S_token_unknown
};
diff --git a/libstdc++-v3/include/bits/regex_scanner.tcc b/libstdc++-v3/include/bits/regex_scanner.tcc
index fedba09..a734bb1 100644
--- a/libstdc++-v3/include/bits/regex_scanner.tcc
+++ b/libstdc++-v3/include/bits/regex_scanner.tcc
@@ -210,7 +210,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
auto __c = *_M_current++;
- if (__c == '[')
+ if (__c == '-')
+ _M_token = _S_token_bracket_dash;
+ else if (__c == '[')
{
if (_M_current == _M_end)
__throw_regex_error(regex_constants::error_brack,
diff --git a/libstdc++-v3/testsuite/28_regex/regression.cc b/libstdc++-v3/testsuite/28_regex/regression.cc
index d367c8b..0896a74 100644
--- a/libstdc++-v3/testsuite/28_regex/regression.cc
+++ b/libstdc++-v3/testsuite/28_regex/regression.cc
@@ -61,12 +61,23 @@ test03()
VERIFY(!regex_search_debug("a", regex(R"(\b$)"), regex_constants::match_not_eow));
}
+// PR libstdc++/77356
+void
+test04()
+{
+ bool test __attribute__((unused)) = true;
+ static const char* kNumericAnchor ="(\\$|usd)(usd|\\$|to|and|up to|[0-9,\\.\\-\\sk])+";
+ const std::regex re(kNumericAnchor);
+ (void)re;
+}
+
int
main()
{
test01();
test02();
test03();
+ test04();
return 0;
}
* Re: [Patch, libstdc++/77356] Support escape in regex bracket expression
2016-08-24 7:18 [Patch, libstdc++/77356] Support escape in regex bracket expression Tim Shen
@ 2016-08-24 8:41 ` Jonathan Wakely
2016-08-24 19:48 ` Tim Shen
0 siblings, 1 reply; 5+ messages in thread
From: Jonathan Wakely @ 2016-08-24 8:41 UTC
To: Tim Shen; +Cc: libstdc++, gcc-patches
On 24/08/16 00:18 -0700, Tim Shen wrote:
>I didn't realized that we can actually escape a dash inside bracket
>expression: R"([\-])", in which case the dash should be treated
>literally.
With this patch applied I no longer get a match for:
regex_match("-", std::regex{"[a-]", std::regex_constants::basic})
"[-a]" still works correctly, but they should be equivalent for POSIX
BREs and EREs.
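The equivalence can be checked with a small sketch like the one below. It is not taken from the testsuite: `full_match` and `trailing_dash_demo` are hypothetical names, and the assertions assume a library in which the regression reported here is fixed.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: full match of `text` against `pattern` under `flags`.
bool full_match(const std::string& pattern, const std::string& text,
                std::regex_constants::syntax_option_type flags)
{
  const std::regex re(pattern, flags);
  return std::regex_match(text, re);
}

void trailing_dash_demo()
{
  using namespace std::regex_constants;
  const syntax_option_type grammars[] = { basic, extended };
  for (syntax_option_type g : grammars)
    {
      // Dash last and dash first should accept the same two characters.
      assert(full_match("[a-]", "-", g));
      assert(full_match("[a-]", "a", g));
      assert(full_match("[-a]", "-", g));
      assert(full_match("[-a]", "a", g));
      assert(!full_match("[a-]", "b", g));
    }
}
```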
>diff --git a/libstdc++-v3/include/bits/regex_scanner.h b/libstdc++-v3/include/bits/regex_scanner.h
>index 37dea84..2a83d1c 100644
>--- a/libstdc++-v3/include/bits/regex_scanner.h
>+++ b/libstdc++-v3/include/bits/regex_scanner.h
>@@ -73,6 +73,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> _S_token_comma,
> _S_token_dup_count,
> _S_token_eof,
>+ _S_token_bracket_dash,
> _S_token_unknown
> };
I wonder if we want to give _S_token_unknown an initializer with a
specific value, so it doesn't change whenever we add new tokens.
If we use -1 it would change the underlying type of
_ScannerBase::_TokenT from unsigned int to signed int, so we should
give it a fixed type too:
  struct _ScannerBase
  {
  public:
    /// Token types returned from the scanner.
    enum _TokenT : unsigned
    {
      ...
      _S_token_unknown = -1u
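A self-contained sketch of the same idea follows. The enum and its enumerator names are hypothetical stand-ins, not the real `_TokenT`; the point is only that a fixed underlying type plus an explicit `-1u` keeps the "unknown" value stable when tokens are added.

```cpp
#include <cassert>
#include <limits>
#include <type_traits>

// Hypothetical stand-in for the scanner's token enum; names are
// illustrative only.
enum ToyToken : unsigned
{
  toy_token_comma,
  toy_token_eof,
  toy_token_bracket_dash,   // new tokens can be added here...
  toy_token_unknown = -1u   // ...without changing this value
};

static_assert(std::is_same<std::underlying_type<ToyToken>::type,
                           unsigned>::value,
              "the underlying type is fixed by the enum-base");

void toy_token_demo()
{
  // -1u converts to the largest unsigned value.
  assert(static_cast<unsigned>(toy_token_unknown)
         == std::numeric_limits<unsigned>::max());
  // Ordinary enumerators still count up from zero.
  assert(static_cast<unsigned>(toy_token_bracket_dash) == 2u);
}
```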
* Re: [Patch, libstdc++/77356] Support escape in regex bracket expression
2016-08-24 8:41 ` Jonathan Wakely
@ 2016-08-24 19:48 ` Tim Shen
2016-08-26 10:09 ` Jonathan Wakely
0 siblings, 1 reply; 5+ messages in thread
From: Tim Shen @ 2016-08-24 19:48 UTC
To: Jonathan Wakely; +Cc: libstdc++, gcc-patches
[-- Attachment #1: Type: text/plain, Size: 961 bytes --]
On Wed, Aug 24, 2016 at 1:41 AM, Jonathan Wakely <jwakely@redhat.com> wrote:
> On 24/08/16 00:18 -0700, Tim Shen wrote:
>>
>> I didn't realized that we can actually escape a dash inside bracket
>> expression: R"([\-])", in which case the dash should be treated
>> literally.
>
>
> With this patch applied I no longer get a match for:
>
> regex_match("-", std::regex{"[a-]", std::regex_constants::basic})
Nice catch. I'm surprised that there is no test for it. Added one.
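For readers following the diff, here is a toy model of the "pending character" idea that `__last_char` and `__push_char` implement. It is a simplified sketch, not the libstdc++ code: `ToyBracket` and `parse_toy_bracket` are hypothetical names, and it handles only single characters, backslash escapes and ranges (no character classes, no error for a reversed range, no grammar differences).

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy model only: single characters and ranges collected from a bracket body.
struct ToyBracket
{
  std::set<char> chars;
  std::vector<std::pair<char, char>> ranges;

  bool matches(char c) const
  {
    if (chars.count(c))
      return true;
    for (const auto& r : ranges)
      if (r.first <= c && c <= r.second)
        return true;
    return false;
  }
};

// Parses the text between '[' and ']'. A backslash makes the next character
// literal; an unescaped '-' forms a range only when a character is pending
// before it and another character follows.
ToyBracket parse_toy_bracket(const std::string& body)
{
  ToyBracket out;
  std::pair<bool, char> last(false, '\0');   // plays the role of __last_char

  auto push_char = [&](char c) {
    if (last.first)
      out.chars.insert(last.second);   // the pending char was not a range start
    last.first = true;
    last.second = c;
  };
  // Returns the character at position i, stepping over a leading backslash.
  auto next_literal = [&](std::size_t& i) {
    if (body[i] == '\\' && ++i == body.size())
      throw std::invalid_argument("trailing backslash");
    return body[i];
  };

  for (std::size_t i = 0; i < body.size(); ++i)
    {
      if (body[i] == '-' && last.first && i + 1 < body.size())
        {
          ++i;
          out.ranges.emplace_back(last.second, next_literal(i));
          last.first = false;
        }
      else
        push_char(next_literal(i));   // includes a dash at the beginning or end
    }
  if (last.first)
    out.chars.insert(last.second);     // flush whatever is still pending
  return out;
}
```

With this model, `parse_toy_bracket("a-")` keeps both `a` and `-` as single characters, `parse_toy_bracket("a-c")` yields one range, and `parse_toy_bracket("a\\-c")` treats the escaped dash as a literal between two literals.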
> I wonder if we want to give _S_token_unknown an initializer with a
> specific value, so it doesn't change whenever we add new tokens.
>
> If we use -1 it would change the underlying type of
> _ScannerBase::_TokenT from unsigned int to signed int, so we should
> give it a fixed type too:
>
> struct _ScannerBase
> {
> public:
> /// Token types returned from the scanner.
> enum _TokenT : unsigned
> {
> ...
> _S_token_unknown = -1u
Done.
--
Regards,
Tim Shen
[-- Attachment #2: c.diff --]
[-- Type: text/plain, Size: 7871 bytes --]
commit 4d35cb02470ae73560df67fd7d1c2d901304cbf3
Author: Tim Shen <timshen@google.com>
Date: Wed Aug 24 12:43:22 2016 -0700
2016-08-24 Tim Shen <timshen@google.com>
PR libstdc++/77356
* include/bits/regex_compiler.tcc(_M_insert_bracket_matcher,
_M_expression_term): Modify to support dash literal.
* include/bits/regex_scanner.h: Add dash as a token type to make
it different from the mandated dash literal by escaping.
* include/bits/regex_scanner.tcc(_M_scan_in_bracket): Emit dash
token in bracket expression parsing.
* testsuite/28_regex/regression.cc: Add new testcases.
diff --git a/libstdc++-v3/include/bits/regex_compiler.tcc b/libstdc++-v3/include/bits/regex_compiler.tcc
index ff69e16..ef6ebdd 100644
--- a/libstdc++-v3/include/bits/regex_compiler.tcc
+++ b/libstdc++-v3/include/bits/regex_compiler.tcc
@@ -426,13 +426,21 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
pair<bool, _CharT> __last_char; // Optional<_CharT>
__last_char.first = false;
if (!(_M_flags & regex_constants::ECMAScript))
- if (_M_try_char())
- {
- __matcher._M_add_char(_M_value[0]);
- __last_char.first = true;
- __last_char.second = _M_value[0];
- }
+ {
+ if (_M_try_char())
+ {
+ __last_char.first = true;
+ __last_char.second = _M_value[0];
+ }
+ else if (_M_match_token(_ScannerT::_S_token_bracket_dash))
+ {
+ __last_char.first = true;
+ __last_char.second = '-';
+ }
+ }
while (_M_expression_term(__last_char, __matcher));
+ if (__last_char.first)
+ __matcher._M_add_char(__last_char.second);
__matcher._M_ready();
_M_stack.push(_StateSeqT(
*_M_nfa,
@@ -449,19 +457,43 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
if (_M_match_token(_ScannerT::_S_token_bracket_end))
return false;
+ const auto __push_char = [&](_CharT __ch)
+ {
+ if (__last_char.first)
+ __matcher._M_add_char(__last_char.second);
+ else
+ __last_char.first = true;
+ __last_char.second = __ch;
+ };
+ const auto __flush = [&]
+ {
+ if (__last_char.first)
+ {
+ __matcher._M_add_char(__last_char.second);
+ __last_char.first = false;
+ }
+ };
+
if (_M_match_token(_ScannerT::_S_token_collsymbol))
{
auto __symbol = __matcher._M_add_collate_element(_M_value);
if (__symbol.size() == 1)
- {
- __last_char.first = true;
- __last_char.second = __symbol[0];
- }
+ __push_char(__symbol[0]);
+ else
+ __flush();
}
else if (_M_match_token(_ScannerT::_S_token_equiv_class_name))
- __matcher._M_add_equivalence_class(_M_value);
+ {
+ __flush();
+ __matcher._M_add_equivalence_class(_M_value);
+ }
else if (_M_match_token(_ScannerT::_S_token_char_class_name))
- __matcher._M_add_character_class(_M_value, false);
+ {
+ __flush();
+ __matcher._M_add_character_class(_M_value, false);
+ }
+ else if (_M_try_char())
+ __push_char(_M_value[0]);
// POSIX doesn't allow '-' as a start-range char (say [a-z--0]),
// except when the '-' is the first or last character in the bracket
// expression ([--0]). ECMAScript treats all '-' after a range as a
@@ -472,55 +504,55 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
// Clang (3.5) always uses ECMAScript style even in its POSIX syntax.
//
// It turns out that no one reads BNFs ;)
- else if (_M_try_char())
+ else if (_M_match_token(_ScannerT::_S_token_bracket_dash))
{
if (!__last_char.first)
{
- __matcher._M_add_char(_M_value[0]);
- if (_M_value[0] == '-'
- && !(_M_flags & regex_constants::ECMAScript))
+ if (!(_M_flags & regex_constants::ECMAScript))
{
if (_M_match_token(_ScannerT::_S_token_bracket_end))
- return false;
+ {
+ __push_char('-');
+ return false;
+ }
__throw_regex_error(
regex_constants::error_range,
"Unexpected dash in bracket expression. For POSIX syntax, "
"a dash is not treated literally only when it is at "
"beginning or end.");
}
- __last_char.first = true;
- __last_char.second = _M_value[0];
+ __push_char('-');
}
else
{
- if (_M_value[0] == '-')
+ if (_M_try_char())
{
- if (_M_try_char())
- {
- __matcher._M_make_range(__last_char.second , _M_value[0]);
- __last_char.first = false;
- }
- else
- {
- if (_M_scanner._M_get_token()
- != _ScannerT::_S_token_bracket_end)
- __throw_regex_error(
- regex_constants::error_range,
- "Unexpected end of bracket expression.");
- __matcher._M_add_char(_M_value[0]);
- }
+ __matcher._M_make_range(__last_char.second, _M_value[0]);
+ __last_char.first = false;
+ }
+ else if (_M_match_token(_ScannerT::_S_token_bracket_dash))
+ {
+ __matcher._M_make_range(__last_char.second, '-');
+ __last_char.first = false;
}
else
{
- __matcher._M_add_char(_M_value[0]);
- __last_char.second = _M_value[0];
+ if (_M_scanner._M_get_token()
+ != _ScannerT::_S_token_bracket_end)
+ __throw_regex_error(
+ regex_constants::error_range,
+ "Character is expected after a dash.");
+ __push_char('-');
}
}
}
else if (_M_match_token(_ScannerT::_S_token_quoted_class))
- __matcher._M_add_character_class(_M_value,
- _M_ctype.is(_CtypeT::upper,
- _M_value[0]));
+ {
+ __flush();
+ __matcher._M_add_character_class(_M_value,
+ _M_ctype.is(_CtypeT::upper,
+ _M_value[0]));
+ }
else
__throw_regex_error(regex_constants::error_brack,
"Unexpected character in bracket expression.");
diff --git a/libstdc++-v3/include/bits/regex_scanner.h b/libstdc++-v3/include/bits/regex_scanner.h
index 37dea84..ed0b723 100644
--- a/libstdc++-v3/include/bits/regex_scanner.h
+++ b/libstdc++-v3/include/bits/regex_scanner.h
@@ -43,7 +43,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
{
public:
/// Token types returned from the scanner.
- enum _TokenT
+ enum _TokenT : unsigned
{
_S_token_anychar,
_S_token_ord_char,
@@ -73,7 +73,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
_S_token_comma,
_S_token_dup_count,
_S_token_eof,
- _S_token_unknown
+ _S_token_bracket_dash,
+ _S_token_unknown = -1u
};
protected:
diff --git a/libstdc++-v3/include/bits/regex_scanner.tcc b/libstdc++-v3/include/bits/regex_scanner.tcc
index fedba09..a734bb1 100644
--- a/libstdc++-v3/include/bits/regex_scanner.tcc
+++ b/libstdc++-v3/include/bits/regex_scanner.tcc
@@ -210,7 +210,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
auto __c = *_M_current++;
- if (__c == '[')
+ if (__c == '-')
+ _M_token = _S_token_bracket_dash;
+ else if (__c == '[')
{
if (_M_current == _M_end)
__throw_regex_error(regex_constants::error_brack,
diff --git a/libstdc++-v3/testsuite/28_regex/regression.cc b/libstdc++-v3/testsuite/28_regex/regression.cc
index d367c8b..fac7fa2 100644
--- a/libstdc++-v3/testsuite/28_regex/regression.cc
+++ b/libstdc++-v3/testsuite/28_regex/regression.cc
@@ -61,12 +61,35 @@ test03()
VERIFY(!regex_search_debug("a", regex(R"(\b$)"), regex_constants::match_not_eow));
}
+// PR libstdc++/77356
+void
+test04()
+{
+ bool test __attribute__((unused)) = true;
+
+ static const char* kNumericAnchor ="(\\$|usd)(usd|\\$|to|and|up to|[0-9,\\.\\-\\sk])+";
+ const std::regex re(kNumericAnchor);
+ (void)re;
+}
+
+void
+test05()
+{
+ bool test __attribute__((unused)) = true;
+
+ VERIFY(regex_match_debug("!", std::regex("[![:alnum:]]")));
+ VERIFY(regex_match_debug("-", std::regex("[a-]", regex_constants::basic)));
+ VERIFY(regex_match_debug("-", std::regex("[a-]")));
+}
+
int
main()
{
test01();
test02();
test03();
+ test04();
+ test05();
return 0;
}
* Re: [Patch, libstdc++/77356] Support escape in regex bracket expression
2016-08-24 19:48 ` Tim Shen
@ 2016-08-26 10:09 ` Jonathan Wakely
2016-08-27 2:04 ` Tim Shen
0 siblings, 1 reply; 5+ messages in thread
From: Jonathan Wakely @ 2016-08-26 10:09 UTC
To: Tim Shen; +Cc: libstdc++, gcc-patches
On 24/08/16 12:48 -0700, Tim Shen wrote:
>On Wed, Aug 24, 2016 at 1:41 AM, Jonathan Wakely <jwakely@redhat.com> wrote:
>> On 24/08/16 00:18 -0700, Tim Shen wrote:
>>>
>>> I didn't realized that we can actually escape a dash inside bracket
>>> expression: R"([\-])", in which case the dash should be treated
>>> literally.
>>
>>
>> With this patch applied I no longer get a match for:
>>
>> regex_match("-", std::regex{"[a-]", std::regex_constants::basic})
>
>Nice catch. I'm surprised that there is no test for it. Added one.
>
>> I wonder if we want to give _S_token_unknown an initializer with a
>> specific value, so it doesn't change whenever we add new tokens.
>>
>> If we use -1 it would change the underlying type of
>> _ScannerBase::_TokenT from unsigned int to signed int, so we should
>> give it a fixed type too:
>>
>> struct _ScannerBase
>> {
>> public:
>> /// Token types returned from the scanner.
>> enum _TokenT : unsigned
>> {
>> ...
>> _S_token_unknown = -1u
>
>Done.
OK for trunk, thanks.
* Re: [Patch, libstdc++/77356] Support escape in regex bracket expression
2016-08-26 10:09 ` Jonathan Wakely
@ 2016-08-27 2:04 ` Tim Shen
0 siblings, 0 replies; 5+ messages in thread
From: Tim Shen @ 2016-08-27 2:04 UTC
To: Jonathan Wakely; +Cc: libstdc++, gcc-patches
On Fri, Aug 26, 2016 at 3:09 AM, Jonathan Wakely <jwakely@redhat.com> wrote:
> OK for trunk, thanks.
Committed as r239794.
Thanks!
--
Regards,
Tim Shen