Date: Fri, 15 Nov 2013 19:02:00 -0000
From: Ondřej Bílka
To: Steve Ellcey
Cc: libc-ports@sourceware.org
Subject: Re: [Patch, mips] Faster strcmp for mips
Message-ID: <20131115190200.GA28546@domone.podge>
In-Reply-To: <1384539604.2484.102.camel@ubuntu-sellcey>

On Fri, Nov 15, 2013 at 10:20:04AM -0800, Steve Ellcey wrote:
> On Fri, 2013-11-15 at 00:14 +0100, Ondřej Bílka wrote:
>
> > The problem is that aligned accesses could be only a third of total calls.
> > Also, around 90% of calls need to check only the first 16 bytes. These
> > numbers differ by application; on x64 they are the rule rather than the
> > exception. I collected data from running gcc+gnuplot here:
> > http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strcmp_profile/results_gcc/result.html
>
> Very interesting, you have obviously spent a lot more time than I have
> on strcmp.
>
> > > This means it could be loading bytes beyond the end of the strings being
> > > compared, but it looks like other architecture-specific strcmp functions
> > > are also doing this optimization, and the newlib version of strcmp does
> > > this as well.
> >
> > Also, the unaligned case can be made faster with a bit of extra effort.
> > How fast are unaligned loads on MIPS? They should be faster than
> > assembling a value by shifts from two loads manually, which should be
> > faster than byte-by-byte checking.
>
> MIPS doesn't have unaligned loads, but it has 'partial' loads where two
> loads can give you four (or eight) bytes in two load operations
> regardless of alignment. I may need to look at this some more, though
> that will probably mean going to assembly language instead of C.
>
You can also first try this in C by using shifts.
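Something like the following, just a minimal sketch of the two-aligned-loads-plus-shifts idea (the helper name is made up here, and it assumes a little-endian target; on big-endian the shift directions swap):

#include <stdint.h>

/* Emulate an unaligned 64-bit load with two aligned loads and shifts.
   Assumes little-endian byte order.  Note the second load can read
   past the end of the string, like the over-read discussed above,
   but never crosses into a new aligned granule beyond it.  */
static uint64_t
load64_shifted (const char *p)
{
  uintptr_t addr = (uintptr_t) p;
  const uint64_t *w = (const uint64_t *) (addr & ~(uintptr_t) 7);
  unsigned int shift = (addr & 7) * 8;

  if (shift == 0)            /* already aligned; avoid shifting by 64 */
    return w[0];

  /* Take the high part of the first word and the low part of the
     second word to reconstruct the unaligned value.  */
  return (w[0] >> shift) | (w[1] << (64 - shift));
}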
> > > Also here
> > > +  bx.v = x; by.v = y;
> > > +  if (bx.b.B0 - by.b.B0 != 0 || bx.b.B0 == 0)
> > > +    return bx.b.B0 - by.b.B0;
> > > +  if (bx.b.B1 - by.b.B1 != 0 || bx.b.B1 == 0)
> > > +    return bx.b.B1 - by.b.B1;
> > > +  if (bx.b.B2 - by.b.B2 != 0 || bx.b.B2 == 0)
> > > +    return bx.b.B2 - by.b.B2;
> > > +  if (bx.b.B3 - by.b.B3 != 0 || bx.b.B3 == 0)
> > > +    return bx.b.B3 - by.b.B3;
> > > +  if (bx.b.B4 - by.b.B4 != 0 || bx.b.B4 == 0)
> > > +    return bx.b.B4 - by.b.B4;
> > > +  if (bx.b.B5 - by.b.B5 != 0 || bx.b.B5 == 0)
> > > +    return bx.b.B5 - by.b.B5;
> > > +  if (bx.b.B6 - by.b.B6 != 0 || bx.b.B6 == 0)
> > > +    return bx.b.B6 - by.b.B6;
> > > +  return bx.b.B7 - by.b.B7;
> >
> > There is an alternative way to find the answer without branching.
> > You need to compute a number that has a nonzero i-th byte if the i-th
> > bytes of x and y differ or are 0. Then find the first nonzero bit,
> > which can be done quickly by a CLZ instruction.
> >
> > This could be implemented as follows:
> >
> > uint64_t bitmask = DETECTNULL8(x) | (x ^ y);
> > uint64_t bitfirst = bitmask ^ (bitmask - 1);
> > int pos = (ffsl(bitfirst) - 1) / 8;
> > return a[pos] - b[pos];
>
> This looks very interesting. I tried just plugging it in, but it didn't
> work in some cases. I need to stare at this some more to understand
> exactly how it should work and what cases are failing for me.
>
I wrote that late at night; my first idea was to use __builtin_clz, which
would yield the following code:

uint64_t bitmask = DETECTNULL8(x) | (x ^ y);
uint64_t bitfirst = bitmask ^ (bitmask - 1);
int pos = (63 - __builtin_clzl(bitfirst)) / 8;
return a[pos] - b[pos];

I decided that using ffsl was shorter, but for some reason I kept bitfirst
there. A correct version is:

uint64_t bitmask = DETECTNULL8(x) | (x ^ y);
int pos = (ffsl(bitmask) - 1) / 8;
return a[pos] - b[pos];
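For completeness, here is that corrected version as a self-contained sketch.
The DETECTNULL8 definition below is only my assumption of the usual
null-byte idiom (the patch may define it differently), and the wrapper
function is made up for illustration:

#include <stdint.h>
#include <strings.h>   /* for ffsl (glibc) */

/* Assumed definition: sets the high bit of every byte of X that is
   zero, and no other bits.  */
#define DETECTNULL8(X) \
  (((X) - 0x0101010101010101ULL) & ~(X) & 0x8080808080808080ULL)

/* Compare one 8-byte chunk of strings a and b, already loaded into x
   and y.  Assumes 64-bit long and a little-endian target, where ffsl
   finds the first differing byte in memory order; a big-endian target
   (common for MIPS) needs the CLZ form instead, which could be one
   reason plugging this in fails in some cases.  */
static int
compare_chunk (uint64_t x, uint64_t y,
               const unsigned char *a, const unsigned char *b)
{
  uint64_t bitmask = DETECTNULL8 (x) | (x ^ y);
  if (bitmask == 0)
    return 0;          /* bytes equal, no terminator in this chunk */
  int pos = (ffsl (bitmask) - 1) / 8;
  return a[pos] - b[pos];
}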