[Bug regex/12811] New: regexec/re_search consumes huge amounts of memory

public inbox for glibc-bugs-regex@sourceware.org

* [Bug regex/12811] New: regexec/re_search consumes huge amounts of memory
@ 2011-05-26 15:30 emil at wojak dot eu
  2011-05-26 15:30 ` [Bug regex/12811] " emil at wojak dot eu
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: emil at wojak dot eu @ 2011-05-26 15:30 UTC (permalink / raw)
  To: glibc-bugs-regex

http://sourceware.org/bugzilla/show_bug.cgi?id=12811

           Summary: regexec/re_search consumes huge amounts of memory
           Product: glibc
           Version: 2.13
            Status: NEW
          Severity: normal
          Priority: P2
         Component: regex
        AssignedTo: drepper.fsp@gmail.com
        ReportedBy: emil@wojak.eu

Created attachment 5753
  --> http://sourceware.org/bugzilla/attachment.cgi?id=5753
Fix for huge memory usage

The bug is triggered under the following circumstances:
- a multibyte character encoding is in use, such as pl_PL.UTF-8
- either a translation table is used or the RE_ICASE flag is set
- the input buffer ends with a UTF-8 character cut in the middle, e.g.
aaaaaaaaaaaa\xc4
- the regex does not match the input buffer and is of a kind that re_search
applies starting at each position of the input buffer, e.g. [^b]*ab or
simply .*ab
Here's a sample program that consumes 1.4 GB on a 32-bit architecture and
5.2 GB on a 64-bit machine (measured with valgrind --tool=massif).

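#include <regex.h>
#include <locale.h>

int main(void) {
        regex_t preg;
        /* A multibyte (UTF-8) locale is one of the triggering conditions. */
        if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)
                return 1;
        /* Case-insensitive matching is another triggering condition. */
        if (regcomp(&preg, ".*ab", REG_ICASE) != 0)
                return 1;
        /* The input ends with \xc4, a UTF-8 character cut in the middle. */
        regexec(&preg, "aaaaaaaaaaaa\xc4", 0, NULL, 0);
        regfree(&preg);
        return 0;
}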
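
For anyone checking whether their own input hits the third condition above,
here is a minimal sketch of a byte-level test for a buffer that ends with a
UTF-8 character cut in the middle. The helper name ends_with_truncated_utf8 is
hypothetical; it is not part of glibc and does not use the current locale.

```c
#include <stddef.h>

/* Hypothetical helper (not a glibc function): returns 1 if buf[0..len)
   ends with an incomplete UTF-8 sequence, 0 otherwise. */
int ends_with_truncated_utf8(const unsigned char *buf, size_t len)
{
    size_t back = 0;
    unsigned char lead;
    size_t need;

    /* Walk back over at most three continuation bytes (10xxxxxx). */
    while (back < len && back < 3 && (buf[len - 1 - back] & 0xC0) == 0x80)
        back++;
    if (back == len)
        return 0;               /* empty, or no lead byte found */

    lead = buf[len - 1 - back];
    if ((lead & 0x80) == 0x00)
        need = 1;               /* ASCII */
    else if ((lead & 0xE0) == 0xC0)
        need = 2;               /* 110xxxxx */
    else if ((lead & 0xF0) == 0xE0)
        need = 3;               /* 1110xxxx */
    else if ((lead & 0xF8) == 0xF0)
        need = 4;               /* 11110xxx */
    else
        return 0;               /* stray continuation or invalid lead byte */

    /* Truncated if fewer bytes are present than the lead byte announces. */
    return back + 1 < need;
}
```

For the input used in the sample program, the final byte \xc4 announces a
two-byte sequence but only one byte is present, so the helper reports a
truncated tail.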

The excessive memory usage is caused by calling extend_buffers on each
re_search_internal iteration, even though the internal buffers are already long
enough to hold the whole string. When the matching procedure reaches
mctx->input.valid_len, the internal buffer size is doubled and the rest of the
input buffer is converted to wchar_t, except for the last byte, which is a
UTF-8 character cut in the middle. This last character is never converted,
because its continuation never comes, but the internal buffers are still
needlessly doubled.
A patch solving this problem is attached.
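
To illustrate why this grows so quickly, here is a toy model of the behaviour
described above. It is not the glibc code and not the attached patch:
simulate_growth and its parameters are hypothetical, and it assumes exactly
one doubling per search iteration, which is a simplification.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model (hypothetical, not glibc code): returns the buffer length after
   `iterations` search iterations. Without the guard the buffer is doubled
   on every iteration; with the guard it is left alone once it can already
   hold the whole input. */
size_t simulate_growth(size_t buf_len, size_t input_len,
                       unsigned iterations, int guard)
{
    unsigned i;

    for (i = 0; i < iterations; i++) {
        if (guard && buf_len >= input_len)
            continue;           /* already long enough: do not extend */
        if (buf_len > SIZE_MAX / 2)
            break;              /* avoid overflow in the model */
        buf_len *= 2;
    }
    return buf_len;
}
```

In this model, a 16-element buffer searched from 13 start positions ends up
with 131072 elements without the guard and stays at 16 with it. The real
figures depend on how many buffers extend_buffers grows and on the element
sizes, which the model does not attempt to reproduce.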

There's another issue. Once the internal buffers are long enough to hold at
least half of the input buffer, they shouldn't get doubled, because that's a
waste of memory. Instead it's enough to extend them to the actual length of the
input buffer. This can save significant amounts of memory for long input
buffers.
A patch for this issue is attached as well.
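
The sizing rule described above can be sketched as follows. This is only an
illustration under my reading of the paragraph, not the attached patch;
next_buffer_len is a hypothetical name.

```c
#include <stddef.h>

/* Hypothetical sizing rule (not the actual patch): double the buffer while
   it holds less than half of the input, otherwise extend it to exactly the
   input length. A buffer that already holds the input is left unchanged. */
size_t next_buffer_len(size_t cur, size_t input_len)
{
    if (cur >= input_len)
        return cur;             /* nothing to extend */
    if (cur >= input_len - cur)
        return input_len;       /* doubling would reach or overshoot */
    return cur == 0 ? 1 : cur * 2;
}
```

For a 100-byte input, a 16-element buffer grows to 32, while a 64-element
buffer grows to 100 rather than 128.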

-- 
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.


* [Bug regex/12811] regexec/re_search consumes huge amounts of memory
  2011-05-26 15:30 [Bug regex/12811] New: regexec/re_search consumes huge amounts of memory emil at wojak dot eu
@ 2011-05-26 15:30 ` emil at wojak dot eu
  2011-05-26 15:33 ` bonzini at gnu dot org
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: emil at wojak dot eu @ 2011-05-26 15:30 UTC (permalink / raw)
  To: glibc-bugs-regex

http://sourceware.org/bugzilla/show_bug.cgi?id=12811

--- Comment #1 from Emil Wojak <emil at wojak dot eu> 2011-05-26 15:30:14 UTC ---
Created attachment 5754
  --> http://sourceware.org/bugzilla/attachment.cgi?id=5754
A patch for optimal size of internal buffers.


* [Bug regex/12811] regexec/re_search consumes huge amounts of memory
  2011-05-26 15:30 [Bug regex/12811] New: regexec/re_search consumes huge amounts of memory emil at wojak dot eu
  2011-05-26 15:30 ` [Bug regex/12811] " emil at wojak dot eu
@ 2011-05-26 15:33 ` bonzini at gnu dot org
  2011-05-26 15:37 ` ppluzhnikov at google dot com
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: bonzini at gnu dot org @ 2011-05-26 15:33 UTC (permalink / raw)
  To: glibc-bugs-regex

http://sourceware.org/bugzilla/show_bug.cgi?id=12811

Paolo Bonzini <bonzini at gnu dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bonzini at gnu dot org

--- Comment #2 from Paolo Bonzini <bonzini at gnu dot org> 2011-05-26 15:32:45 UTC ---
The patches look good apart from extra-long lines.  Thanks.


* [Bug regex/12811] regexec/re_search consumes huge amounts of memory
  2011-05-26 15:30 [Bug regex/12811] New: regexec/re_search consumes huge amounts of memory emil at wojak dot eu
  2011-05-26 15:30 ` [Bug regex/12811] " emil at wojak dot eu
  2011-05-26 15:33 ` bonzini at gnu dot org
@ 2011-05-26 15:37 ` ppluzhnikov at google dot com
  2011-05-28 21:18 ` drepper.fsp at gmail dot com
  2014-06-13 10:53 ` fweimer at redhat dot com
  4 siblings, 0 replies; 6+ messages in thread
From: ppluzhnikov at google dot com @ 2011-05-26 15:37 UTC (permalink / raw)
  To: glibc-bugs-regex

http://sourceware.org/bugzilla/show_bug.cgi?id=12811

Paul Pluzhnikov <ppluzhnikov at google dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ppluzhnikov at google dot
                   |                            |com


* [Bug regex/12811] regexec/re_search consumes huge amounts of memory
  2011-05-26 15:30 [Bug regex/12811] New: regexec/re_search consumes huge amounts of memory emil at wojak dot eu
                   ` (2 preceding siblings ...)
  2011-05-26 15:37 ` ppluzhnikov at google dot com
@ 2011-05-28 21:18 ` drepper.fsp at gmail dot com
  2014-06-13 10:53 ` fweimer at redhat dot com
  4 siblings, 0 replies; 6+ messages in thread
From: drepper.fsp at gmail dot com @ 2011-05-28 21:18 UTC (permalink / raw)
  To: glibc-bugs-regex

http://sourceware.org/bugzilla/show_bug.cgi?id=12811

Ulrich Drepper <drepper.fsp at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

--- Comment #3 from Ulrich Drepper <drepper.fsp at gmail dot com> 2011-05-28 21:17:21 UTC ---
The patches missed the crucial change which fixed this specific problem.  I
added the patches and then this one additional test.


* [Bug regex/12811] regexec/re_search consumes huge amounts of memory
  2011-05-26 15:30 [Bug regex/12811] New: regexec/re_search consumes huge amounts of memory emil at wojak dot eu
                   ` (3 preceding siblings ...)
  2011-05-28 21:18 ` drepper.fsp at gmail dot com
@ 2014-06-13 10:53 ` fweimer at redhat dot com
  4 siblings, 0 replies; 6+ messages in thread
From: fweimer at redhat dot com @ 2014-06-13 10:53 UTC (permalink / raw)
  To: glibc-bugs-regex

https://sourceware.org/bugzilla/show_bug.cgi?id=12811

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|                            |security-
