public inbox for gcc-bugs@sourceware.org
* [Bug libstdc++/98723] New: On Windows with CP936 encoding, regex compiles very slow.
From: goughostt at gmail dot com @ 2021-01-18 10:37 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98723

            Bug ID: 98723
           Summary: On Windows with CP936 encoding, regex compiles very
                    slow.
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: goughostt at gmail dot com
  Target Milestone: ---

example code:

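#include <clocale>   // std::setlocale
#include <iostream>
#include <regex>

int main() {
   // Adopt the user's default locale (CP936 on the affected system).
   std::setlocale(LC_ALL, "");
   // Compiling the bracket expressions is the slow step.
   std::regex rgx{"[a-z][a-z][a-z]"};
   std::cerr << rgx.mark_count() << std::endl;
   return 0;
}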

when built and run in a mingw64 environment (gcc 10.2.0), the program blocks
for a long time while compiling the regex.
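
To put a number on "a long time", the construction of the regex can be timed on its own. The following is only a minimal sketch: the helper name regex_compile_ms is hypothetical and not part of the report, and the value it returns depends entirely on the platform and the active locale.

```cpp
#include <chrono>
#include <regex>
#include <string>

// Hypothetical helper (not from the report): returns the wall-clock
// time, in milliseconds, spent constructing a std::regex from pattern.
double regex_compile_ms(const std::string& pattern) {
    const auto start = std::chrono::steady_clock::now();
    std::regex rgx(pattern);  // the step that blocks in the report
    const auto stop = std::chrono::steady_clock::now();
    (void)rgx.mark_count();   // touch the object so it is clearly used
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling regex_compile_ms("[a-z][a-z][a-z]") once before and once after std::setlocale(LC_ALL, "") would be one way to check whether the locale is what makes the difference.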

my findings:

compiling '[a-z]' needs to cache info for all 256 char values;
for each char, std::collate<char>::do_transform() is called;
do_transform() uses the return value of strxfrm() to size its buffer;
on Windows, strxfrm() returns INT_MAX to indicate an error;
strxfrm() fails for any char > 0x7f when the system encoding is CP936;
thus, compiling '[a-z]' repeatedly allocates very large buffers.
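
The chain above can be probed without std::regex by asking strxfrm() directly how much space each single-byte string needs. This is a rough sketch under stated assumptions: the helper names transformed_length and count_suspicious_bytes are hypothetical, and the INT_MAX threshold is taken from the Windows behaviour described above, not from anything the C standard guarantees.

```cpp
#include <climits>
#include <cstddef>
#include <cstring>

// Hypothetical helper: query the transformed length of a one-character
// string. With a size of 0 the destination may be a null pointer.
std::size_t transformed_length(unsigned char ch) {
    const char src[2] = { static_cast<char>(ch), '\0' };
    return std::strxfrm(nullptr, src, 0);
}

// Hypothetical helper: count the byte values 1..255 whose reported
// length is at least INT_MAX, i.e. the error value described above.
std::size_t count_suspicious_bytes() {
    std::size_t n = 0;
    for (int c = 1; c < 256; ++c) {
        if (transformed_length(static_cast<unsigned char>(c))
                >= static_cast<std::size_t>(INT_MAX)) {
            ++n;
        }
    }
    return n;
}
```

In the default "C" locale the count would be expected to be zero. Per the findings above, calling std::setlocale(LC_ALL, "") first on a CP936 system should make it non-zero for values above 0x7f.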

issues:

1. regex compilation is affected by the current locale, through its calls to
strxfrm(), even when std::regex::collate is not set.

2. the code in bits/locale_classes.tcc should handle the documented return
conditions of strxfrm() on Windows (INT_MAX on error):

    size_t __res = _M_transform(__c, __p, __len); //*** calls strxfrm()
    // If the buffer was not large enough, try again with the
    // correct size.
    if (__res >= __len)
      {
        __len = __res + 1;
        delete [] __c, __c = 0;
        __c = new _CharT[__len];
        __res = _M_transform(__c, __p, __len);
      }
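
One possible shape for such handling is sketched below. It is not libstdc++ code and not a proposed patch: the function guarded_transform is hypothetical, and returning the input unchanged on failure is an assumption made purely for illustration. The point is only that a result of INT_MAX or more is treated as an error before it is used as an allocation size.

```cpp
#include <climits>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper, not libstdc++ code. Transforms s with strxfrm(),
// but never uses an error result as a buffer size.
// Limitation: s.c_str() stops at the first embedded NUL character.
std::string guarded_transform(const std::string& s) {
    const std::size_t failure = static_cast<std::size_t>(INT_MAX);
    std::vector<char> buf(2 * s.size() + 1);
    std::size_t res = std::strxfrm(buf.data(), s.c_str(), buf.size());
    if (res >= failure) {
        return s;  // assumed fallback: give up and return the input as-is
    }
    if (res >= buf.size()) {
        // Buffer too small: retry once with the size strxfrm() asked for.
        buf.resize(res + 1);
        res = std::strxfrm(buf.data(), s.c_str(), buf.size());
        if (res >= buf.size()) {
            return s;  // still failing; same assumed fallback
        }
    }
    return std::string(buf.data(), res);
}
```

Whether a fallback, an exception, or some other behaviour is appropriate for std::collate is a design decision for the library maintainers; the sketch only separates the "buffer too small" case from the "error" case.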

