From: Steve Ellcey <sellcey@mips.com>
To: Ondřej Bílka
Subject: Re: [Patch, mips] Faster strcmp for mips
Date: Fri, 15 Nov 2013 18:21:00 -0000
Message-ID: <1384539604.2484.102.camel@ubuntu-sellcey>
In-Reply-To: <20131114231434.GA5331@domone.podge>
References: <1384464221.2484.86.camel@ubuntu-sellcey> <20131114231434.GA5331@domone.podge>

On Fri, 2013-11-15 at 00:14 +0100, Ondřej Bílka wrote:

> The problem is that aligned accesses may be only a third of the total
> calls.  Also, around 90% of calls need to check only the first 16
> bytes.  These numbers differ by application; on x64 they are the rule
> rather than the exception.  I collected data from running gcc+gnuplot
> here:
> http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strcmp_profile/results_gcc/result.html

Very interesting, you have obviously spent a lot more time than I have
on strcmp.

> > This means it could be loading bytes beyond the end of the strings
> > being compared, but it looks like other architecture-specific strcmp
> > functions are also doing this optimization, and the newlib version
> > of strcmp does this too.
> >
> Also, the unaligned case can be made faster with a bit of extra
> effort.  How fast are unaligned loads on MIPS?  They should be faster
> than assembling the value by shifts from two loads manually, which in
> turn should be faster than byte-by-byte checking.

MIPS doesn't have unaligned loads, but it has 'partial' loads where two
load operations can give you four (or eight) bytes regardless of
alignment.  I may need to look at this some more, though that will
probably mean going to assembly language instead of C.
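To make the two-load idea concrete, here is a minimal C sketch of
assembling an unaligned word with shifts.  It is not from the patch:
load32_unaligned is a hypothetical name, a little-endian layout is
assumed, and the caller must route 4-byte-aligned pointers to a
separate fast path.

    #include <stdint.h>

    /* Fetch 32 bits starting at the unaligned pointer P by merging two
       aligned loads.  Precondition: P is not 4-byte aligned, so the
       four bytes at P straddle both aligned words and neither load
       strays onto a page those bytes do not touch.  On MIPS, lwl/lwr
       do the same job in two instructions; a production version would
       also use a __may_alias__ word type for the loads.  */
    static uint32_t
    load32_unaligned (const unsigned char *p)
    {
      uintptr_t ofs = (uintptr_t) p & 3;          /* 1, 2 or 3 */
      const uint32_t *w = (const uint32_t *) (p - ofs);
      uint32_t lo = w[0];                         /* aligned load */
      uint32_t hi = w[1];                         /* aligned load */
      /* Little-endian: the low word supplies the first bytes; on
         big-endian the two shift directions swap.  */
      return (lo >> (8 * ofs)) | (hi << (8 * (4 - ofs)));
    }

In a loop, the second load of one iteration can serve as the first load
of the next, so the steady-state cost is one load plus two shifts per
word.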
> Also, here:
>
> > +  bx.v = x; by.v = y;
> > +  if (bx.b.B0 - by.b.B0 != 0 || bx.b.B0 == 0)
> > +    return bx.b.B0 - by.b.B0;
> > +  if (bx.b.B1 - by.b.B1 != 0 || bx.b.B1 == 0)
> > +    return bx.b.B1 - by.b.B1;
> > +  if (bx.b.B2 - by.b.B2 != 0 || bx.b.B2 == 0)
> > +    return bx.b.B2 - by.b.B2;
> > +  if (bx.b.B3 - by.b.B3 != 0 || bx.b.B3 == 0)
> > +    return bx.b.B3 - by.b.B3;
> > +  if (bx.b.B4 - by.b.B4 != 0 || bx.b.B4 == 0)
> > +    return bx.b.B4 - by.b.B4;
> > +  if (bx.b.B5 - by.b.B5 != 0 || bx.b.B5 == 0)
> > +    return bx.b.B5 - by.b.B5;
> > +  if (bx.b.B6 - by.b.B6 != 0 || bx.b.B6 == 0)
> > +    return bx.b.B6 - by.b.B6;
> > +  return bx.b.B7 - by.b.B7;
>
> there is an alternative way to find the answer without branching.
> You need to compute a number that has a nonzero i-th byte if the i-th
> bytes of x and y differ or are 0.  Then find the first nonzero bit,
> which can be done quickly with a CLZ instruction.
>
> This could be implemented as follows:
>
>   uint64_t bitmask = DETECTNULL8(x) | (x ^ y);
>   uint64_t bitfirst = bitmask ^ (bitmask - 1);
>   int pos = (ffsl(bitfirst) - 1) / 8;
>   return a[pos] - b[pos];

This looks very interesting.  I tried just plugging it in, but it
didn't work in some cases.  I need to stare at this some more to
understand exactly how it should work and what cases are failing for
me.

Steve Ellcey
sellcey@mips.com
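For reference, a self-contained sketch of the branchless compare
described above.  The quoted four-liner assumes a little-endian byte
order (ffsl scans from the least-significant bit) and a 64-bit long;
on a big-endian or 32-bit MIPS target those assumptions break, which
may be the failing cases.  The sketch is not from the patch: final_cmp
is a hypothetical name, and it uses the exact (carry-free) zero-byte
mask because the cheaper borrow-based mask can flag a byte above the
first real zero, which is harmless when scanning from the low end but
wrong under CLZ.

    #include <stdint.h>

    /* 0x80 in exactly the bytes of X that are zero.  */
    #define DETECTNULL8(x) \
      (~((((x) & 0x7f7f7f7f7f7f7f7fULL) + 0x7f7f7f7f7f7f7f7fULL) \
         | (x) | 0x7f7f7f7f7f7f7f7fULL))

    /* Final step once the 8-byte chunks X (loaded from A) and Y
       (loaded from B) are known to contain a difference or a NUL:
       locate the first byte, in memory order, that differs or is
       zero, and return the byte difference.  */
    static int
    final_cmp (const unsigned char *a, const unsigned char *b,
               uint64_t x, uint64_t y)
    {
      uint64_t bitmask = DETECTNULL8 (x) | (x ^ y); /* nonzero here */
    #if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
      /* Big-endian: the first byte in memory is the most significant,
         so count leading zeros (the CLZ instruction).  */
      int pos = __builtin_clzll (bitmask) / 8;
    #else
      /* Little-endian: the first byte in memory is the least
         significant, so count trailing zeros instead.  */
      int pos = __builtin_ctzll (bitmask) / 8;
    #endif
      return a[pos] - b[pos];
    }

With the exact mask, the bitmask ^ (bitmask - 1) and ffsl steps become
unnecessary: a single count-zeros instruction yields the byte index
directly.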