From: Ondřej Bílka
To: Steve Ellcey
Cc: libc-ports@sourceware.org
Subject: Re: [Patch, mips] Faster strcmp for mips
Date: Thu, 14 Nov 2013 23:14:00 -0000
Message-ID: <20131114231434.GA5331@domone.podge>
In-Reply-To: <1384464221.2484.86.camel@ubuntu-sellcey>

On Thu, Nov 14, 2013 at 01:23:41PM -0800, Steve Ellcey wrote:
> 
> I have created an optimized strcmp routine for MIPS and would like to check
> it in.  I verified it by running 'make check' and I also ran 'make bench'
> to check the performance.  This function is slightly slower than the current
> generic strcmp for strings that are not aligned, but much faster when strings
> are aligned because it will then load and compare either 2, 4, or 8 bytes at
> a time.
> 
The problem is that the aligned case may cover only a third of all calls.
Also, around 90% of calls need to check only the first 16 bytes.  These
numbers vary by application; on x64 they are the rule rather than the
exception.  I collected data from running gcc+gnuplot here:

http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strcmp_profile/results_gcc/result.html

> This means it could be loading bytes beyond the end of the strings being
> compared but it looks like other architecture specific strcmp functions
> are also doing this optimization and the newlib version of strcmp also does
> this.
> 
The unaligned case can also be made faster with a bit of extra effort.  How
fast are unaligned loads on MIPS?  They should be faster than assembling the
value manually with shifts from two aligned loads, which in turn should be
faster than byte-by-byte checking.

What matters most for performance is handling the first bytes quickly.  You
need to try several alternatives: one is to unroll the check of the first
8-16 bytes, another is to do a single vector check.
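For illustration only (this is a rough sketch, not code from the patch:
strcmp_header16 and strcmp_bytewise are made-up names, strcmp_bytewise stands
in for the generic byte loop, and DETECTNULL8 is assumed to be the usual
"word contains a zero byte" test), an unrolled check of the first 16 bytes
could look roughly like this:

#include <stdint.h>
#include <string.h>

/* Assumed: nonzero iff the 64-bit word X contains a zero byte.  */
#define DETECTNULL8(X) \
  ((((X) - 0x0101010101010101ULL) & ~(X)) & 0x8080808080808080ULL)

/* Stand-in for the generic byte-by-byte strcmp.  */
extern int strcmp_bytewise (const char *a, const char *b);

static int
strcmp_header16 (const char *a, const char *b)
{
  int i;

  /* Conservative page check: if either pointer is within 16 bytes of a
     page end, an 8-byte load below could fault, so fall back to bytes.  */
  if (((((uintptr_t) a) | ((uintptr_t) b)) & 4095) > 4096 - 16)
    return strcmp_bytewise (a, b);

  for (i = 0; i < 16; i += 8)
    {
      uint64_t x, y;
      memcpy (&x, a + i, 8);	/* unaligned 8-byte loads */
      memcpy (&y, b + i, 8);
      /* Nonzero iff these 8 bytes differ or contain the terminator;
         resolve the exact byte with the plain loop.  */
      if (DETECTNULL8 (x) | (x ^ y))
	return strcmp_bytewise (a + i, b + i);
    }

  /* First 16 bytes are equal and hold no terminator; continue with the
     main (aligned) loop, not shown here.  */
  return strcmp_bytewise (a + 16, b + 16);
}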
Then you need to implement a loop.  What I did on x64 was to align one
argument and load the second one unaligned.  You need to precompute when an
unaligned access would cross a page boundary and switch to a byte-by-byte
loop when that happens.

Here is modified code that I used to generate the x64 implementation; I hope
it will be useful.  You need to replace TEST (a, b) with code that checks 8
bytes starting at a and b.  You may also try to unroll these loops to do
16/32 bytes per iteration.

int
strcmp_new (char *a, char *b)
{
  int i;

  /* Conservative check: could an 8-byte load at a or b cross a page?  */
  int or = (((unsigned long) a) | ((unsigned long) b)) & 4095;
  if (or <= 4096 - 8)
    {
      TEST (a, b);	/* check 8 bytes at a and b, return on difference/end */
    }
  else
    {
      for (i = 0; i < 8; i++)
	{
	  if (!a[i] || (a[i] - b[i]))
	    return a[i] - b[i];
	}
    }

  /* Align a; b stays possibly unaligned.  */
  char *a_al = ALIGN_DOWN (a + 8, 8);
  int dif = a_al - a;
  a += dif;
  b += dif;

  /* Number of 8-byte steps before a load of b would cross a page.  */
  unsigned long b_next = ((unsigned long) b) & (4096 - 1);
  unsigned long next_page = (4096 - b_next) / 8;

  while (1)
    {
      if (__builtin_expect (next_page-- == 0, 0))
	{
	  next_page = 4096 / 8;
	  next_page--;
	  /* Byte-check this chunk first; if nothing is found, the bytes
	     are readable and the word load below is safe.  */
	  for (i = 0; i < 8; i++)
	    {
	      if (!a[i] || (a[i] - b[i]))
		return a[i] - b[i];
	    }
	}
      TEST (a, b);
      a += 8;
      b += 8;
    }
}

Also, here:

> +  bx.v = x; by.v = y;
> +  if (bx.b.B0 - by.b.B0 != 0 || bx.b.B0 == 0)
> +    return bx.b.B0 - by.b.B0;
> +  if (bx.b.B1 - by.b.B1 != 0 || bx.b.B1 == 0)
> +    return bx.b.B1 - by.b.B1;
> +  if (bx.b.B2 - by.b.B2 != 0 || bx.b.B2 == 0)
> +    return bx.b.B2 - by.b.B2;
> +  if (bx.b.B3 - by.b.B3 != 0 || bx.b.B3 == 0)
> +    return bx.b.B3 - by.b.B3;
> +  if (bx.b.B4 - by.b.B4 != 0 || bx.b.B4 == 0)
> +    return bx.b.B4 - by.b.B4;
> +  if (bx.b.B5 - by.b.B5 != 0 || bx.b.B5 == 0)
> +    return bx.b.B5 - by.b.B5;
> +  if (bx.b.B6 - by.b.B6 != 0 || bx.b.B6 == 0)
> +    return bx.b.B6 - by.b.B6;
> +  return bx.b.B7 - by.b.B7;

There is an alternative way to find the answer without branching.  You need
to compute a word that has a nonzero i-th byte when the i-th bytes of x and y
differ or are zero.  Then find its first nonzero bit, which can be done
quickly with a CLZ or find-first-set instruction.  This could be implemented
as follows:

  uint64_t bitmask = DETECTNULL8 (x) | (x ^ y);
  uint64_t bitfirst = bitmask & -bitmask;	/* isolate the lowest set bit */
  int pos = (ffsl (bitfirst) - 1) / 8;
  return a[pos] - b[pos];
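To make that concrete, here is a self-contained sketch of such a branchless
8-byte check (again not code from the patch: compare8 is a made-up name,
DETECTNULL8 is assumed to be the usual "word contains a zero byte" test, and
the byte-index computation assumes little-endian order; a big-endian target
would count leading zeros of the mask instead, e.g. with CLZ):

#include <stdint.h>
#include <string.h>

/* Assumed: nonzero iff the 64-bit word X contains a zero byte.  */
#define DETECTNULL8(X) \
  ((((X) - 0x0101010101010101ULL) & ~(X)) & 0x8080808080808080ULL)

/* Compare 8 bytes starting at a and b.  Returns 0 if they are equal in
   these 8 bytes, otherwise the strcmp-style difference.  */
static int
compare8 (const unsigned char *a, const unsigned char *b)
{
  uint64_t x, y;
  memcpy (&x, a, 8);	/* unaligned loads; the compiler picks the best code */
  memcpy (&y, b, 8);

  /* Nonzero iff some byte differs or x contains the terminator; the
     lowest set bit falls inside the first such byte.  */
  uint64_t bitmask = DETECTNULL8 (x) | (x ^ y);
  if (bitmask == 0)
    return 0;

  /* Index of the first interesting byte (little-endian).  */
  int pos = __builtin_ctzll (bitmask) / 8;
  return a[pos] - b[pos];
}

Something along these lines could also serve as the TEST (a, b) step in the
skeleton above, though a real TEST would additionally have to tell the caller
whether the terminator was reached, since 0 here means both "equal so far"
and "equal, done".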