From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 23679 invoked by alias); 17 May 2010 12:14:17 -0000 Received: (qmail 23601 invoked by uid 48); 17 May 2010 12:13:55 -0000 Date: Mon, 17 May 2010 12:14:00 -0000 Message-ID: <20100517121355.23600.qmail@sourceware.org> From: "eblake at redhat dot com" To: glibc-bugs@sources.redhat.com In-Reply-To: <20100517092705.11607.xoswald@debian.org> References: <20100517092705.11607.xoswald@debian.org> Reply-To: sourceware-bugzilla@sourceware.org Subject: [Bug libc/11607] strstr function very slow. X-Bugzilla-Reason: CC Mailing-List: contact glibc-bugs-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Subscribe: List-Post: List-Help: , Sender: glibc-bugs-owner@sourceware.org X-SW-Source: 2010-05/txt/msg00075.txt.bz2 ------- Additional Comments From eblake at redhat dot com 2010-05-17 12:13 ------- In glibc 2.7, the strstr() implementation has worst-case quadratic complexity of O(m*n) (m the length of the haystack, n the length of the string); and gnulib introduced a test case that demonstrated this quadratic complexity could be used as a denial of service attack on any program that searches for a large needle within a large string. With glibc 2.10, the strstr() has a guaranteed linear complexity of O(m+n), but that reduced algorithmic complexity comes with a larger startup cost of factoring the needle, where the shorter the needle is, the larger proportion of the overall strstr() is spent on that factoring. Furthermore, the glibc 2.7 implementation has had years of tweaking to be cache-line friendly and tuned for various architectures, while my 2.10 implementation is relatively new, and probably still has some room for improvements (at any rate, I know that when I wrote it, I was catering to large needles and large haystacks, and was focused on the algorithmic improvement, rather than the cache-line effects of small needles). With appropriate benchmarking, it may be worth altering the implementation to be a hybrid, where short-enough needles are handled with the older brute force implementation - even if it is quadratic, less startup cost might be better for short needles even if it results in more comparisons against the haystack for some of those needles. Furthermore, strstr() has a lousy interface when compared with regex searching. That is, if you repeatedly search for the same needle, the use of re_compile() is spent only once, while you can reuse that compiled needle across multiple re_search() calls. But with strstr(), the cost of factoring the needle is spent on every call. Perhaps we should consider exposing some new interfaces that separate the strstr() factorization startup code from the body that uses the factored needle for the actual search. -- What |Removed |Added ---------------------------------------------------------------------------- CC|ebb9 at byu dot net |eblake at redhat dot com http://sourceware.org/bugzilla/show_bug.cgi?id=11607 ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.