public inbox for glibc-bugs-regex@sourceware.org
From: "emil at wojak dot eu" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs-regex@sources.redhat.com
Subject: [Bug regex/12811] New: regexec/re_search consumes huge amounts of memory
Date: Thu, 26 May 2011 15:30:00 -0000
Message-ID: <bug-12811-132@http.sourceware.org/bugzilla/>

http://sourceware.org/bugzilla/show_bug.cgi?id=12811

           Summary: regexec/re_search consumes huge amounts of memory
           Product: glibc
           Version: 2.13
            Status: NEW
          Severity: normal
          Priority: P2
         Component: regex
        AssignedTo: drepper.fsp@gmail.com
        ReportedBy: emil@wojak.eu

Created attachment 5753
  --> http://sourceware.org/bugzilla/attachment.cgi?id=5753
Fix for huge memory usage

The bug is triggered under the following circumstances:
- a multibyte character encoding, like pl_PL.UTF-8
- either a translation table is used or the RE_ICASE flag is set
- an input buffer that ends with a UTF-8 character cut in the middle,
  e.g. aaaaaaaaaaaa\xc4
- a specific kind of regex that does not match the input buffer, and that
  re_search would apply starting at each position of the input buffer,
  e.g. [^b]*ab or simply .*ab

Here's a sample program that consumes 1.4 GB on a 32-bit architecture and
5.2 GB on 64-bit machines (measured with valgrind --tool=massif).

    #include <regex.h>
    #include <locale.h>

    int main(void)
    {
        regex_t preg;

        setlocale(LC_CTYPE, "en_US.UTF-8");
        regcomp(&preg, ".*ab", REG_ICASE);
        regexec(&preg, "aaaaaaaaaaaa\xc4", 0, NULL, 0);
        regfree(&preg);
        return 0;
    }

The excessive memory usage is caused by calling extend_buffers on each
re_search_internal iteration, even though the internal buffers are already
long enough to hold the whole string. When the matching procedure reaches
mctx->input.valid_len, the internal buffer size is doubled and the rest of
the input buffer is converted to wchar_t, except for the last byte, which is
a UTF-8 character cut in the middle. This last character is never converted,
because its continuation never comes, but the internal buffers are still
needlessly doubled.
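To illustrate why the trailing byte is never converted, here is a minimal sketch using the standard mbrtowc function. It is not the conversion code from glibc's regex internals; unconverted_tail and use_utf8_locale are hypothetical helpers, and the locale names tried are assumptions that may not be installed on a given system. In a UTF-8 locale, mbrtowc reports a lone \xc4 at the end of the buffer as an incomplete sequence ((size_t)-2), so one byte is always left over.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: try a few UTF-8 locale names (assumed; they may not
   be installed).  Returns 1 on success, 0 if none is available. */
int use_utf8_locale(void)
{
    static const char *const names[] = { "C.UTF-8", "en_US.UTF-8", "pl_PL.UTF-8" };
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/* Hypothetical helper: walk buf with mbrtowc and return the number of
   trailing bytes that cannot be converted because they form an incomplete
   multibyte sequence.  Returns 0 if the buffer ends on a character boundary. */
size_t unconverted_tail(const char *buf, size_t len)
{
    mbstate_t st;
    size_t pos = 0;

    memset(&st, 0, sizeof st);
    while (pos < len) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, buf + pos, len - pos, &st);

        if (n == (size_t)-2)        /* incomplete: more bytes would be needed */
            return len - pos;
        if (n == (size_t)-1) {      /* invalid byte: skip it and reset the state */
            memset(&st, 0, sizeof st);
            n = 1;
        } else if (n == 0) {        /* embedded NUL occupies one byte */
            n = 1;
        }
        pos += n;
    }
    return 0;
}
```

With a UTF-8 locale active, calling unconverted_tail on the 13-byte input from the report should leave one byte unconverted, which matches the description of a buffer whose valid length never reaches its full length.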
A patch solving this problem is attached.

There's another issue. Once the internal buffers are long enough to hold at
least half of the input buffer, they shouldn't be doubled, because that
wastes memory. Instead, it is enough to extend them to the actual length of
the input buffer. This can save significant amounts of memory for long input
buffers. A patch for this issue is attached as well.
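The growth policy proposed in the report can be sketched as follows. This is only an illustration of the idea, not the attached patch and not glibc's extend_buffers; next_buf_len is a hypothetical name, and starting an empty buffer at length 1 is an assumption made to keep the sketch self-contained.

```c
#include <stddef.h>

/* Hypothetical helper: choose the next internal buffer length.
   - If the buffer already holds the whole input, do not grow at all.
   - If doubling would overshoot the input, grow to the input length only.
   - Otherwise double as usual. */
size_t next_buf_len(size_t cur_len, size_t input_len)
{
    if (cur_len >= input_len)
        return cur_len;             /* already long enough: no growth */
    if (cur_len > input_len / 2)
        return input_len;           /* cap at the actual input length */
    /* Here cur_len * 2 <= input_len, so the multiplication cannot overflow. */
    return cur_len == 0 ? 1 : cur_len * 2;
}
```

For the 13-byte input in the sample program, a buffer of length 8 would grow to 13 rather than 16, and a buffer of length 16 would not grow at all.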
Thread overview: 6+ messages
2011-05-26 15:30 emil at wojak dot eu [this message]
2011-05-26 15:30 ` [Bug regex/12811] " emil at wojak dot eu
2011-05-26 15:33 ` bonzini at gnu dot org
2011-05-26 15:37 ` ppluzhnikov at google dot com
2011-05-28 21:18 ` drepper.fsp at gmail dot com
2014-06-13 10:53 ` fweimer at redhat dot com