public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [Patch] Add regex_constants::__polynomial
@ 2015-04-25  8:10 Tim Shen
  2015-04-27  0:55 ` Tim Shen
  0 siblings, 1 reply; 4+ messages in thread
From: Tim Shen @ 2015-04-25  8:10 UTC (permalink / raw)
  To: libstdc++, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1497 bytes --]

On Tue, Feb 17, 2015 at 1:54 AM, Jonathan Wakely wrote:
> On 16/02/15 22:10 +0000, Tim Shen wrote:
>>
>> Hi Jon,
>>
>> The Thompson NFA algorithm (BFS approach) in libstdc++ regex exists for a
>> while, and I do think we can add it to the standard as a flag
>> regex_constants::syntax_option_type::polynomial, which requires the regex
>> compiler to produce a NFA that never consumes exponentially large space
>> and
>> and exectuer not to consume exponential time for matching. It's good for
>> security reasons.
>>
>> When compiling a regex with this flag, a back-reference in regex will
>> cause
>> a regex_error::error_type::unexpected_backref exception, since matching a
>> backref in polynomial time is as hard as proving P=NP :P. I think that's
>> the only requirement of this flag.
>>
>> re2 <https://code.google.com/p/re2/> is a library that actually implements
>> this requirement.
>>
>> Do you think it's an appropriate proposal?
>>
>> Thanks! :)
>
>
> The names would have to be uglified so e.g. __polynomial and
> __unexpected_backref but I think it's an excellent idea.

As discussed with Jon, we think it's a good idea to add the
__polynomial as an extension flag for exposing our BFS executer
(Thompson NFA matching algorithm).

Currently ECMAScript doesn't correctly support Thompson NFA though;
I'm actually (slowly) doing a refactoring starting with ECMAScript,
but I think it's still good to let users use it earlier.

Bootstrapped and tested.

Thanks!


-- 
Regards,
Tim Shen

[-- Attachment #2: a.diff --]
[-- Type: text/plain, Size: 4311 bytes --]

commit 38469980d4b5cf2a71d0a340415864394a404509
Author: Tim Shen <timshen@google.com>
Date:   Sat Apr 25 00:23:52 2015 -0700

    	* include/bits/regex.tcc: Handle regex_constants::__polynomial.
    	* include/bits/regex_automaton.tcc: Throw exception when parsing
    	back-reference with flag __polynomial.
    	* include/bits/regex_constants.h: Add extension flag
    	syntax_option_type __polynomial.

diff --git a/libstdc++-v3/include/bits/regex.tcc b/libstdc++-v3/include/bits/regex.tcc
index 823ad51..fedc2b9 100644
--- a/libstdc++-v3/include/bits/regex.tcc
+++ b/libstdc++-v3/include/bits/regex.tcc
@@ -28,13 +28,6 @@
  *  Do not attempt to use it directly. @headername{regex}
  */
 
-// A non-standard switch to let the user pick the matching algorithm.
-// If _GLIBCXX_REGEX_USE_THOMPSON_NFA is defined, the thompson NFA
-// algorithm will be used. This algorithm is not enabled by default,
-// and cannot be used if the regex contains back-references, but has better
-// (polynomial instead of exponential) worst case performance.
-// See __regex_algo_impl below.
-
 namespace std _GLIBCXX_VISIBILITY(default)
 {
 namespace __detail
@@ -67,16 +60,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
       for (auto& __it : __res)
 	__it.matched = false;
 
-      // __policy is used by testsuites so that they can use Thompson NFA
-      // without defining a macro. Users should define
-      // _GLIBCXX_REGEX_USE_THOMPSON_NFA if they need to use this approach.
       bool __ret;
-      if (!__re._M_automaton->_M_has_backref
-	  && !(__re._M_flags & regex_constants::ECMAScript)
-#ifndef _GLIBCXX_REGEX_USE_THOMPSON_NFA
-	  && __policy == _RegexExecutorPolicy::_S_alternate
-#endif
-	  )
+      if ((__re.flags() & regex_constants::__polynomial)
+	  || (__policy == _RegexExecutorPolicy::_S_alternate
+	      && !__re._M_automaton->_M_has_backref))
 	{
 	  _Executor<_BiIter, _Alloc, _TraitsT, false>
 	    __executor(__s, __e, __m, __re, __flags);
diff --git a/libstdc++-v3/include/bits/regex_automaton.tcc b/libstdc++-v3/include/bits/regex_automaton.tcc
index fbc3389..72fe978 100644
--- a/libstdc++-v3/include/bits/regex_automaton.tcc
+++ b/libstdc++-v3/include/bits/regex_automaton.tcc
@@ -148,6 +148,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
     _StateIdT
     _NFA<_TraitsT>::_M_insert_backref(size_t __index)
     {
+      if (this->_M_flags & regex_constants::__polynomial)
+	__throw_regex_error(regex_constants::error_complexity);
       // To figure out whether a backref is valid, a stack is used to store
       // unfinished sub-expressions. For example, when parsing
       // "(a(b)(c\\1(d)))" at '\\1', _M_subexpr_count is 3, indicating that 3
diff --git a/libstdc++-v3/include/bits/regex_constants.h b/libstdc++-v3/include/bits/regex_constants.h
index e2c7631..80ec761 100644
--- a/libstdc++-v3/include/bits/regex_constants.h
+++ b/libstdc++-v3/include/bits/regex_constants.h
@@ -63,6 +63,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
     _S_awk,
     _S_grep,
     _S_egrep,
+    _S_polynomial,
     _S_syntax_last
   };
 
@@ -169,6 +170,15 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   constexpr syntax_option_type egrep =
     static_cast<syntax_option_type>(1 << _S_egrep);
 
+  /**
+   * Extension: Ensure both space complexity of compiled regex and
+   * time complexity execution are not exponential.
+   * If specified in a regex with back-references, the exception
+   * regex_constants::error_complexity will be thrown.
+   */
+  constexpr syntax_option_type __polynomial =
+    static_cast<syntax_option_type>(1 << _S_polynomial);
+
   constexpr inline syntax_option_type
   operator&(syntax_option_type __a, syntax_option_type __b)
   {
diff --git a/libstdc++-v3/testsuite/28_regex/algorithms/regex_search/61424.cc b/libstdc++-v3/testsuite/28_regex/algorithms/regex_search/61424.cc
index 784be29..82449ce 100644
--- a/libstdc++-v3/testsuite/28_regex/algorithms/regex_search/61424.cc
+++ b/libstdc++-v3/testsuite/28_regex/algorithms/regex_search/61424.cc
@@ -45,7 +45,9 @@ int main()
     regex re("tour|tournament|tourn", g);
     const char str[] = "tournament";
     cmatch m;
-    VERIFY(regex_search_debug(str, m, re));
+    VERIFY(regex_search(str, m, re));
+    // TODO: Fix ECMAScript BFS matcher.
+    //VERIFY(regex_search_debug(str, m, re));
     VERIFY(sol[i] == m[0]);
     i++;
   }

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Patch] Add regex_constants::__polynomial
  2015-04-25  8:10 [Patch] Add regex_constants::__polynomial Tim Shen
@ 2015-04-27  0:55 ` Tim Shen
  2015-04-27 11:58   ` Jonathan Wakely
  0 siblings, 1 reply; 4+ messages in thread
From: Tim Shen @ 2015-04-27  0:55 UTC (permalink / raw)
  To: libstdc++, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 227 bytes --]

On Sat, Apr 25, 2015 at 1:10 AM, Tim Shen <timshen@google.com> wrote:
> Bootstrapped and tested.

I didn't test with _GLIBCXX_DEBUG though. Updated the patch for
removing DFS restriction for ECMAScript.


-- 
Regards,
Tim Shen

[-- Attachment #2: b.diff --]
[-- Type: text/plain, Size: 5233 bytes --]

commit 32cd325c4acffdcdf16caca4233a2455ea483d69
Author: Tim Shen <timshen@google.com>
Date:   Sat Apr 25 00:23:52 2015 -0700

    	* include/bits/regex.tcc: Handle regex_constants::__polynomial.
    	* include/bits/regex_automaton.tcc: Throw exception when parsing
    	back-reference with flag __polynomial.
    	* include/bits/regex_constants.h: Add extension flag
    	syntax_option_type __polynomial.
    	* bits/regex_executor.tcc: Still let BFS process ECMAScript.
    	Alternative operation will be fixed in the coming refactoring.
    	* testsuite/28_regex/algorithms/regex_search/61424.cc: Turn
    	loose match_search_debug to use DFS only.

diff --git a/libstdc++-v3/include/bits/regex.tcc b/libstdc++-v3/include/bits/regex.tcc
index 823ad51..fedc2b9 100644
--- a/libstdc++-v3/include/bits/regex.tcc
+++ b/libstdc++-v3/include/bits/regex.tcc
@@ -28,13 +28,6 @@
  *  Do not attempt to use it directly. @headername{regex}
  */
 
-// A non-standard switch to let the user pick the matching algorithm.
-// If _GLIBCXX_REGEX_USE_THOMPSON_NFA is defined, the thompson NFA
-// algorithm will be used. This algorithm is not enabled by default,
-// and cannot be used if the regex contains back-references, but has better
-// (polynomial instead of exponential) worst case performance.
-// See __regex_algo_impl below.
-
 namespace std _GLIBCXX_VISIBILITY(default)
 {
 namespace __detail
@@ -67,16 +60,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
       for (auto& __it : __res)
 	__it.matched = false;
 
-      // __policy is used by testsuites so that they can use Thompson NFA
-      // without defining a macro. Users should define
-      // _GLIBCXX_REGEX_USE_THOMPSON_NFA if they need to use this approach.
       bool __ret;
-      if (!__re._M_automaton->_M_has_backref
-	  && !(__re._M_flags & regex_constants::ECMAScript)
-#ifndef _GLIBCXX_REGEX_USE_THOMPSON_NFA
-	  && __policy == _RegexExecutorPolicy::_S_alternate
-#endif
-	  )
+      if ((__re.flags() & regex_constants::__polynomial)
+	  || (__policy == _RegexExecutorPolicy::_S_alternate
+	      && !__re._M_automaton->_M_has_backref))
 	{
 	  _Executor<_BiIter, _Alloc, _TraitsT, false>
 	    __executor(__s, __e, __m, __re, __flags);
diff --git a/libstdc++-v3/include/bits/regex_automaton.tcc b/libstdc++-v3/include/bits/regex_automaton.tcc
index fbc3389..72fe978 100644
--- a/libstdc++-v3/include/bits/regex_automaton.tcc
+++ b/libstdc++-v3/include/bits/regex_automaton.tcc
@@ -148,6 +148,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
     _StateIdT
     _NFA<_TraitsT>::_M_insert_backref(size_t __index)
     {
+      if (this->_M_flags & regex_constants::__polynomial)
+	__throw_regex_error(regex_constants::error_complexity);
       // To figure out whether a backref is valid, a stack is used to store
       // unfinished sub-expressions. For example, when parsing
       // "(a(b)(c\\1(d)))" at '\\1', _M_subexpr_count is 3, indicating that 3
diff --git a/libstdc++-v3/include/bits/regex_constants.h b/libstdc++-v3/include/bits/regex_constants.h
index e2c7631..80ec761 100644
--- a/libstdc++-v3/include/bits/regex_constants.h
+++ b/libstdc++-v3/include/bits/regex_constants.h
@@ -63,6 +63,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
     _S_awk,
     _S_grep,
     _S_egrep,
+    _S_polynomial,
     _S_syntax_last
   };
 
@@ -169,6 +170,15 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   constexpr syntax_option_type egrep =
     static_cast<syntax_option_type>(1 << _S_egrep);
 
+  /**
+   * Extension: Ensure both space complexity of compiled regex and
+   * time complexity execution are not exponential.
+   * If specified in a regex with back-references, the exception
+   * regex_constants::error_complexity will be thrown.
+   */
+  constexpr syntax_option_type __polynomial =
+    static_cast<syntax_option_type>(1 << _S_polynomial);
+
   constexpr inline syntax_option_type
   operator&(syntax_option_type __a, syntax_option_type __b)
   {
diff --git a/libstdc++-v3/include/bits/regex_executor.tcc b/libstdc++-v3/include/bits/regex_executor.tcc
index f065499..9b5c1c6 100644
--- a/libstdc++-v3/include/bits/regex_executor.tcc
+++ b/libstdc++-v3/include/bits/regex_executor.tcc
@@ -382,8 +382,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	case _S_opcode_alternative:
 	  if (_M_nfa._M_flags & regex_constants::ECMAScript)
 	    {
-	      // TODO: Let BFS support ECMAScript's alternative operation.
-	      _GLIBCXX_DEBUG_ASSERT(__dfs_mode);
+	      // TODO: Fix BFS support. It is wrong.
 	      _M_dfs(__match_mode, __state._M_alt);
 	      // Pick lhs if it matches. Only try rhs if it doesn't.
 	      if (!_M_has_sol)
diff --git a/libstdc++-v3/testsuite/28_regex/algorithms/regex_search/61424.cc b/libstdc++-v3/testsuite/28_regex/algorithms/regex_search/61424.cc
index 784be29..82449ce 100644
--- a/libstdc++-v3/testsuite/28_regex/algorithms/regex_search/61424.cc
+++ b/libstdc++-v3/testsuite/28_regex/algorithms/regex_search/61424.cc
@@ -45,7 +45,9 @@ int main()
     regex re("tour|tournament|tourn", g);
     const char str[] = "tournament";
     cmatch m;
-    VERIFY(regex_search_debug(str, m, re));
+    VERIFY(regex_search(str, m, re));
+    // TODO: Fix ECMAScript BFS matcher.
+    //VERIFY(regex_search_debug(str, m, re));
     VERIFY(sol[i] == m[0]);
     i++;
   }

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Patch] Add regex_constants::__polynomial
  2015-04-27  0:55 ` Tim Shen
@ 2015-04-27 11:58   ` Jonathan Wakely
  2015-04-28  6:53     ` Tim Shen
  0 siblings, 1 reply; 4+ messages in thread
From: Jonathan Wakely @ 2015-04-27 11:58 UTC (permalink / raw)
  To: Tim Shen; +Cc: libstdc++, gcc-patches

On 26/04/15 17:55 -0700, Tim Shen wrote:
>I didn't test with _GLIBCXX_DEBUG though. Updated the patch for
>removing DFS restriction for ECMAScript.

OK for trunk.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Patch] Add regex_constants::__polynomial
  2015-04-27 11:58   ` Jonathan Wakely
@ 2015-04-28  6:53     ` Tim Shen
  0 siblings, 0 replies; 4+ messages in thread
From: Tim Shen @ 2015-04-28  6:53 UTC (permalink / raw)
  To: Jonathan Wakely; +Cc: libstdc++, gcc-patches

On Mon, Apr 27, 2015 at 4:57 AM, Jonathan Wakely <jwakely@redhat.com> wrote:
> OK for trunk.

Committed.


-- 
Regards,
Tim Shen

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2015-04-28  4:18 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-25  8:10 [Patch] Add regex_constants::__polynomial Tim Shen
2015-04-27  0:55 ` Tim Shen
2015-04-27 11:58   ` Jonathan Wakely
2015-04-28  6:53     ` Tim Shen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).