public inbox for gcc-bugs@sourceware.org
* [Bug libstdc++/98723] New: On Windows with CP936 encoding, regex compiles very slow.
@ 2021-01-18 10:37 goughostt at gmail dot com
  2021-01-18 11:06 ` [Bug libstdc++/98723] " redi at gcc dot gnu.org
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: goughostt at gmail dot com @ 2021-01-18 10:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98723

            Bug ID: 98723
           Summary: On Windows with CP936 encoding, regex compiles very
                    slow.
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: goughostt at gmail dot com
  Target Milestone: ---

example code:

#include <clocale>   // std::setlocale
#include <iostream>
#include <regex>
int main() {
   // Switch to the user's default locale (CP936 on the affected system).
   std::setlocale(LC_ALL, "");
   std::regex rgx{"[a-z][a-z][a-z]"};
   std::cerr<<rgx.mark_count()<<std::endl;
   return 0;
}

When built and run in a mingw64 environment (gcc 10.2.0), the program blocks
for a long time while compiling the regex.

my finding is that:

compiling '[a-z]' needs to cache info for all 256 chars;
for each char, a call to std::collate<char>::do_transform() is made;
do_transform() will use the result of strxfrm() to allocate buffer;
on Windows, strxfrm() returns INT_MAX to indicate error;
if char > 0x7f, and the system encoding is CP936, strxfrm() will fail;
thus, compiling '[a-z]' will repeatedly allocate large buffers.
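
The chain above can be probed with a short sketch. The helper name
count_failing_bytes is hypothetical and is not part of libstdc++; it assumes,
as stated above, that a failing strxfrm() reports INT_MAX (or more) as the
required length. It counts how many one-byte strings trigger that condition
in a given locale:

```cpp
#include <cassert>
#include <climits>
#include <clocale>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical probe (not libstdc++ code): for every non-zero byte value,
// ask strxfrm() how long the transformed one-byte string would be and count
// the values for which it reports INT_MAX or more.
// Returns -1 if the requested locale cannot be selected.
int count_failing_bytes(const char* locale_name)
{
    assert(locale_name != nullptr);
    // Remember the active collation locale so it can be restored afterwards.
    const std::string saved = std::setlocale(LC_COLLATE, nullptr);
    if (std::setlocale(LC_COLLATE, locale_name) == nullptr)
        return -1;

    int failures = 0;
    for (int c = 1; c < 256; ++c) {
        const char one[2] = { static_cast<char>(c), '\0' };
        // With a null destination and size 0, strxfrm() only reports the
        // length it would need.
        const std::size_t need = std::strxfrm(nullptr, one, 0);
        if (need >= static_cast<std::size_t>(INT_MAX))
            ++failures;
    }
    std::setlocale(LC_COLLATE, saved.c_str());
    return failures;
}
```

Calling count_failing_bytes("") selects the user's default locale, as the
example program does. A non-zero result would indicate the byte values for
which the oversized allocation described above can be expected.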

issues:

1. Regex compilation is affected by the current locale even if
std::regex::collate is not set, because it calls strxfrm().

2. The code in bits/locale_classes.tcc should handle the documented error
return of strxfrm() on Windows:

         size_t __res = _M_transform(__c, __p, __len); //*** calls strxfrm()
         // If the buffer was not large enough, try again with the
         // correct size.
         if (__res >= __len)
           {
             __len = __res + 1;
             delete [] __c, __c = 0;
             __c = new _CharT[__len];
             __res = _M_transform(__c, __p, __len);
           }
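
One possible shape for such handling, as a standalone sketch rather than a
patch to libstdc++: the helper name checked_transform and the choice of
exception type are assumptions made for illustration. It treats a result of
INT_MAX or more as an error before attempting the retry with a larger buffer:

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper (not libstdc++ code): transform `src` with strxfrm(),
// rejecting the INT_MAX error indication instead of allocating that much.
std::string checked_transform(const char* src)
{
    assert(src != nullptr);
    std::size_t len = (std::strlen(src) + 1) * 2;   // initial guess
    std::string buf(len, '\0');

    std::size_t res = std::strxfrm(&buf[0], src, len);
    // Assumption from the report: on Windows a failed call returns INT_MAX.
    if (res >= static_cast<std::size_t>(INT_MAX))
        throw std::runtime_error("strxfrm failed");

    // The buffer was too small: retry once with the size strxfrm() asked for.
    if (res >= len) {
        len = res + 1;
        buf.assign(len, '\0');
        res = std::strxfrm(&buf[0], src, len);
        if (res >= len)
            throw std::runtime_error("strxfrm failed on retry");
    }
    buf.resize(res);   // drop the unused tail, keep only the sort key
    return buf;
}
```

In the "C" locale this returns the input unchanged, so
checked_transform("abc") yields "abc". Whether throwing is acceptable in the
real code depends on the exception specification of the function involved.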

* [Bug libstdc++/98723] On Windows with CP936 encoding, regex compiles very slow.
  2021-01-18 10:37 [Bug libstdc++/98723] New: On Windows with CP936 encoding, regex compiles very slow goughostt at gmail dot com
@ 2021-01-18 11:06 ` redi at gcc dot gnu.org
  2021-01-18 11:39 ` redi at gcc dot gnu.org
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: redi at gcc dot gnu.org @ 2021-01-18 11:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98723

Jonathan Wakely <redi at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2021-01-18

* [Bug libstdc++/98723] On Windows with CP936 encoding, regex compiles very slow.
  2021-01-18 10:37 [Bug libstdc++/98723] New: On Windows with CP936 encoding, regex compiles very slow goughostt at gmail dot com
  2021-01-18 11:06 ` [Bug libstdc++/98723] " redi at gcc dot gnu.org
@ 2021-01-18 11:39 ` redi at gcc dot gnu.org
  2021-01-18 14:31 ` goughostt at gmail dot com
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: redi at gcc dot gnu.org @ 2021-01-18 11:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98723

Jonathan Wakely <redi at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEW

--- Comment #1 from Jonathan Wakely <redi at gcc dot gnu.org> ---
The Windows behaviour fails to conform to the C and C++ standards. I think
_M_transform should check errno and throw an exception on error (which means
removing the non-throwing exception specification from that function).

* [Bug libstdc++/98723] On Windows with CP936 encoding, regex compiles very slow.
  2021-01-18 10:37 [Bug libstdc++/98723] New: On Windows with CP936 encoding, regex compiles very slow goughostt at gmail dot com
  2021-01-18 11:06 ` [Bug libstdc++/98723] " redi at gcc dot gnu.org
  2021-01-18 11:39 ` redi at gcc dot gnu.org
@ 2021-01-18 14:31 ` goughostt at gmail dot com
  2021-01-24 19:17 ` unlvsur at live dot com
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: goughostt at gmail dot com @ 2021-01-18 14:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98723

--- Comment #2 from goughost <goughostt at gmail dot com> ---
That may be acceptable for issue 2.
But additional fixes are needed; otherwise, users cannot use regex after calling
setlocale(LC_ALL,"") in such a situation.
Can the regex compiler work without calling _M_transform? (at least when
std::regex::collate is not set)

On the other hand, maybe the error condition can be handled by the regex
compiler code.
To some extent, the bug is in the regex compiler.
Building the cache for '\xee' calls strxfrm() with "\xee\x00", which is not a
valid string if the current encoding is UTF-8.
Also, on GNU/Linux, the strings resulting from such (successful) calls might
not help in building the cache.

Examples of calling strxfrm() on GNU/Linux with various locales.
(Note that, in the cases where Windows fails, Linux gives trivial results.)

// C
input 61 00, errno 0, res 1, outbuf:  61
input 62 00, errno 0, res 1, outbuf:  62
input aa 00, errno 0, res 1, outbuf:  aa
input bb 00, errno 0, res 1, outbuf:  bb

// C.UTF-8
input 61 00, errno 0, res 1, outbuf:  63
input 62 00, errno 0, res 1, outbuf:  64
input aa 00, errno 0, res 1, outbuf:  03
input bb 00, errno 0, res 1, outbuf:  03

// en_US.UTF-8
input 61 00, errno 0, res 10, outbuf:  51 01 02 01 02 01 00 00 00 00
input 62 00, errno 0, res 10, outbuf:  5e 01 02 01 02 01 00 00 00 00
input aa 00, errno 0, res 5, outbuf:  01 01 01 01 03
input bb 00, errno 0, res 5, outbuf:  01 01 01 01 03

// zh_CN.GB2312 
input 61 00, errno 0, res 11, outbuf:  e1 a9 bd 01 02 01 02 01 00 00 61
input 62 00, errno 0, res 11, outbuf:  e1 a9 be 01 02 01 02 01 00 00 62
input aa 00, errno 0, res 5, outbuf:  01 01 01 01 03
input bb 00, errno 0, res 5, outbuf:  01 01 01 01 03
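
Lines in the format shown above could be produced by a small helper along
these lines. This is a sketch, not the program that was actually used: the
name dump_strxfrm and the 64-byte output buffer are assumptions. It formats
one line for the one-byte string {byte, 0} under whatever collation locale is
currently active:

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper: call strxfrm() on the one-byte string {byte, 0} and
// format the result as "input XX 00, errno E, res N, outbuf:  b0 b1 ...".
std::string dump_strxfrm(unsigned char byte)
{
    assert(byte != 0);   // an empty input string is not interesting here
    const char in[2] = { static_cast<char>(byte), '\0' };
    char out[64] = {};   // assumed large enough for one character's key

    errno = 0;
    const std::size_t res = std::strxfrm(out, in, sizeof out);
    const int err = errno;

    char text[64];
    std::snprintf(text, sizeof text,
                  "input %02x 00, errno %d, res %lu, outbuf: ",
                  static_cast<unsigned>(byte), err,
                  static_cast<unsigned long>(res));
    std::string line = text;

    // On truncation or failure the buffer contents are unspecified,
    // so print the key bytes only when the whole key fit.
    const std::size_t shown = res < sizeof out ? res : 0;
    for (std::size_t i = 0; i < shown; ++i) {
        std::snprintf(text, sizeof text, " %02x",
                      static_cast<unsigned>(static_cast<unsigned char>(out[i])));
        line += text;
    }
    return line;
}
```

To cover another locale, call std::setlocale(LC_COLLATE, ...) with its name
before calling the helper; which locale names are installed depends on the
system.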

* [Bug libstdc++/98723] On Windows with CP936 encoding, regex compiles very slow.
  2021-01-18 10:37 [Bug libstdc++/98723] New: On Windows with CP936 encoding, regex compiles very slow goughostt at gmail dot com
                   ` (2 preceding siblings ...)
  2021-01-18 14:31 ` goughostt at gmail dot com
@ 2021-01-24 19:17 ` unlvsur at live dot com
  2021-11-26 13:34 ` egallager at gcc dot gnu.org
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: unlvsur at live dot com @ 2021-01-24 19:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98723

cqwrteur <unlvsur at live dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |unlvsur at live dot com

--- Comment #3 from cqwrteur <unlvsur at live dot com> ---
This should be reported as a CVE.

* [Bug libstdc++/98723] On Windows with CP936 encoding, regex compiles very slow.
  2021-01-18 10:37 [Bug libstdc++/98723] New: On Windows with CP936 encoding, regex compiles very slow goughostt at gmail dot com
                   ` (3 preceding siblings ...)
  2021-01-24 19:17 ` unlvsur at live dot com
@ 2021-11-26 13:34 ` egallager at gcc dot gnu.org
  2021-11-26 16:40 ` unlvsur at live dot com
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: egallager at gcc dot gnu.org @ 2021-11-26 13:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98723

Eric Gallager <egallager at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |egallager at gcc dot gnu.org

--- Comment #4 from Eric Gallager <egallager at gcc dot gnu.org> ---
This is affecting The Battle for Wesnoth:
https://github.com/wesnoth/wesnoth/issues/6291

* [Bug libstdc++/98723] On Windows with CP936 encoding, regex compiles very slow.
  2021-01-18 10:37 [Bug libstdc++/98723] New: On Windows with CP936 encoding, regex compiles very slow goughostt at gmail dot com
                   ` (4 preceding siblings ...)
  2021-11-26 13:34 ` egallager at gcc dot gnu.org
@ 2021-11-26 16:40 ` unlvsur at live dot com
  2022-05-29 11:13 ` jeroen at berkeley dot edu
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: unlvsur at live dot com @ 2021-11-26 16:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98723

--- Comment #5 from cqwrteur <unlvsur at live dot com> ---
(In reply to Eric Gallager from comment #4)
> This is affecting The Battle for Wesnoth:
> https://github.com/wesnoth/wesnoth/issues/6291

C++ std::regex is just terrible and is highly likely to be deprecated in a
future standard. I think you had better switch to a third-party implementation.

* [Bug libstdc++/98723] On Windows with CP936 encoding, regex compiles very slow.
  2021-01-18 10:37 [Bug libstdc++/98723] New: On Windows with CP936 encoding, regex compiles very slow goughostt at gmail dot com
                   ` (5 preceding siblings ...)
  2021-11-26 16:40 ` unlvsur at live dot com
@ 2022-05-29 11:13 ` jeroen at berkeley dot edu
  2022-05-30 16:24 ` unlvsur at live dot com
  2023-12-09 13:52 ` luca.bacci at outlook dot com
  8 siblings, 0 replies; 10+ messages in thread
From: jeroen at berkeley dot edu @ 2022-05-29 11:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98723

Jeroen Ooms <jeroen at berkeley dot edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jeroen at berkeley dot edu

--- Comment #6 from Jeroen Ooms <jeroen at berkeley dot edu> ---
This bug has become more problematic because it also affects any program
running under recent versions of Windows UCRT in UTF-8 locale[1], and therefore
all users of the R programming language.

The only solution right now seems to be to avoid std::regex, e.g.:
https://github.com/tesseract-ocr/tesseract/issues/3830


[1]
https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page

* [Bug libstdc++/98723] On Windows with CP936 encoding, regex compiles very slow.
  2021-01-18 10:37 [Bug libstdc++/98723] New: On Windows with CP936 encoding, regex compiles very slow goughostt at gmail dot com
                   ` (6 preceding siblings ...)
  2022-05-29 11:13 ` jeroen at berkeley dot edu
@ 2022-05-30 16:24 ` unlvsur at live dot com
  2023-12-09 13:52 ` luca.bacci at outlook dot com
  8 siblings, 0 replies; 10+ messages in thread
From: unlvsur at live dot com @ 2022-05-30 16:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98723

--- Comment #7 from cqwrteur <unlvsur at live dot com> ---
Well, the right solution is to write the regex yourself. C++ regex might be
deprecated in the future.

* [Bug libstdc++/98723] On Windows with CP936 encoding, regex compiles very slow.
  2021-01-18 10:37 [Bug libstdc++/98723] New: On Windows with CP936 encoding, regex compiles very slow goughostt at gmail dot com
                   ` (7 preceding siblings ...)
  2022-05-30 16:24 ` unlvsur at live dot com
@ 2023-12-09 13:52 ` luca.bacci at outlook dot com
  8 siblings, 0 replies; 10+ messages in thread
From: luca.bacci at outlook dot com @ 2023-12-09 13:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98723

Luca Bacci <luca.bacci at outlook dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |luca.bacci at outlook dot com

--- Comment #8 from Luca Bacci <luca.bacci at outlook dot com> ---
(In reply to Jonathan Wakely from comment #1)
> The Windows behaviour fails to conform to the C and C++ standards. I think
> _M_transform should check errno and throw an exception on error (which means
> removing the non-throwing exception specification from that function).

Hi Jonathan! I'm giving it a go, but I have one question: in which encoding are
the strings passed to _M_transform() / _M_compare()
(libstdc++-v3/config/locale/generic/collate_members.cc)? Is it the execution
character set, or is it always UTF-8?

I am asking because we have to convert to UTF-16 and call wcsxfrm().
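
A rough sketch of that conversion step, under the assumption that the input is
encoded in the current C locale's multibyte encoding (which is exactly the open
question). The helper name wide_sort_key is hypothetical. On Windows wchar_t
holds UTF-16 code units; on other platforms it is typically 32 bits wide:

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <cwchar>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a multibyte string to wide characters using
// the current locale, then build its collation key with wcsxfrm().
std::wstring wide_sort_key(const char* src)
{
    assert(src != nullptr);
    // Every wide character consumes at least one input byte, so
    // strlen(src) + 1 wide characters is always enough room.
    const std::size_t cap = std::strlen(src) + 1;
    std::wstring wide(cap, L'\0');
    const std::size_t n = std::mbstowcs(&wide[0], src, cap);
    if (n == static_cast<std::size_t>(-1))
        throw std::invalid_argument("input is not valid in the locale encoding");
    wide.resize(n);

    // Ask for the key length first (null destination is allowed with size 0).
    const std::size_t need = std::wcsxfrm(nullptr, wide.c_str(), 0);
    // Assumption: as with strxfrm(), Windows reports errors as INT_MAX.
    if (need >= static_cast<std::size_t>(INT_MAX))
        throw std::runtime_error("wcsxfrm reported an error");

    std::wstring key(need + 1, L'\0');
    std::wcsxfrm(&key[0], wide.c_str(), need + 1);
    key.resize(need);   // drop the terminator slot
    return key;
}
```

Two keys produced this way compare with operator< in the same order that
wcscoll() would give for the original wide strings.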

Many thanks,
Luca
