From: Ondřej Bílka
To: Steve Ellcey
Cc: libc-ports@sourceware.org
Subject: Re: [Patch, mips] Faster strcmp for mips
Date: Thu, 14 Nov 2013 23:14:00 -0000
Message-ID: <20131114231434.GA5331@domone.podge>
In-Reply-To: <1384464221.2484.86.camel@ubuntu-sellcey>

On Thu, Nov 14, 2013 at 01:23:41PM -0800, Steve Ellcey wrote:
> 
> I have created an optimized strcmp routine for MIPS and would like to check
> it in.  I verified it by running 'make check' and I also ran 'make bench'
> to check the performance.  This function is slightly slower than the current
> generic strcmp for strings that are not aligned, but much faster when strings
> are aligned because it will then load and compare either 2, 4, or 8 bytes at
> a time.
> 
The problem is that the aligned case may cover only a third of all calls.
Also, around 90% of calls need to check only the first 16 bytes.  These
numbers vary by application; on x64 they are the rule rather than the
exception.  I collected data from running gcc+gnuplot here:

http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strcmp_profile/results_gcc/result.html

> This means it could be loading bytes beyond the end of the strings being
> compared but it looks like other architecture specific strcmp functions
> are also doing this optimization and the newlib version of strcmp also does
> this.
> 
The unaligned case can also be made faster with a bit of extra effort.  How
fast are unaligned loads on MIPS?  They should be faster than assembling the
value manually with shifts from two aligned loads, which in turn should be
faster than byte-by-byte checking.

What matters most for performance is handling the first bytes quickly.  You
need to try several alternatives: one is to unroll the check of the first
8-16 bytes, another is to do a single vector check.
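For illustration only (this is a rough sketch, not code from the patch:
strcmp_header16 and strcmp_bytewise are made-up names, strcmp_bytewise stands
in for the generic byte loop, and DETECTNULL8 is assumed to be the usual
"word contains a zero byte" test), an unrolled check of the first 16 bytes
could look roughly like this:

#include <stdint.h>
#include <string.h>

/* Assumed: nonzero iff the 64-bit word X contains a zero byte.  */
#define DETECTNULL8(X) \
  ((((X) - 0x0101010101010101ULL) & ~(X)) & 0x8080808080808080ULL)

/* Stand-in for the generic byte-by-byte strcmp.  */
extern int strcmp_bytewise (const char *a, const char *b);

static int
strcmp_header16 (const char *a, const char *b)
{
  int i;

  /* Conservative page check: if either pointer is within 16 bytes of a
     page end, an 8-byte load below could fault, so fall back to bytes.  */
  if (((((uintptr_t) a) | ((uintptr_t) b)) & 4095) > 4096 - 16)
    return strcmp_bytewise (a, b);

  for (i = 0; i < 16; i += 8)
    {
      uint64_t x, y;
      memcpy (&x, a + i, 8);	/* unaligned 8-byte loads */
      memcpy (&y, b + i, 8);
      /* Nonzero iff these 8 bytes differ or contain the terminator;
         resolve the exact byte with the plain loop.  */
      if (DETECTNULL8 (x) | (x ^ y))
	return strcmp_bytewise (a + i, b + i);
    }

  /* First 16 bytes are equal and hold no terminator; continue with the
     main (aligned) loop, not shown here.  */
  return strcmp_bytewise (a + 16, b + 16);
}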
Then you need to implement a loop.  What I did on x64 was to align one
argument and load the second one unaligned.  You need to precompute when an
unaligned access would cross a page boundary and switch to a byte-by-byte
loop when that happens.

Here is modified code that I used to generate the x64 implementation; I hope
it will be useful.  You need to replace TEST (a, b) with code that checks 8
bytes starting at a and b.  You may also try to unroll these loops to do
16/32 bytes per iteration.

int
strcmp_new (char *a, char *b)
{
  int i;

  /* Conservative check: could an 8-byte load at a or b cross a page?  */
  int or = (((unsigned long) a) | ((unsigned long) b)) & 4095;
  if (or <= 4096 - 8)
    {
      TEST (a, b);	/* check 8 bytes at a and b, return on difference/end */
    }
  else
    {
      for (i = 0; i < 8; i++)
	{
	  if (!a[i] || (a[i] - b[i]))
	    return a[i] - b[i];
	}
    }

  /* Align a; b stays possibly unaligned.  */
  char *a_al = ALIGN_DOWN (a + 8, 8);
  int dif = a_al - a;
  a += dif;
  b += dif;

  /* Number of 8-byte steps before a load of b would cross a page.  */
  unsigned long b_next = ((unsigned long) b) & (4096 - 1);
  unsigned long next_page = (4096 - b_next) / 8;

  while (1)
    {
      if (__builtin_expect (next_page-- == 0, 0))
	{
	  next_page = 4096 / 8;
	  next_page--;
	  /* Byte-check this chunk first; if nothing is found, the bytes
	     are readable and the word load below is safe.  */
	  for (i = 0; i < 8; i++)
	    {
	      if (!a[i] || (a[i] - b[i]))
		return a[i] - b[i];
	    }
	}
      TEST (a, b);
      a += 8;
      b += 8;
    }
}

Also, here:

> +  bx.v = x; by.v = y;
> +  if (bx.b.B0 - by.b.B0 != 0 || bx.b.B0 == 0)
> +    return bx.b.B0 - by.b.B0;
> +  if (bx.b.B1 - by.b.B1 != 0 || bx.b.B1 == 0)
> +    return bx.b.B1 - by.b.B1;
> +  if (bx.b.B2 - by.b.B2 != 0 || bx.b.B2 == 0)
> +    return bx.b.B2 - by.b.B2;
> +  if (bx.b.B3 - by.b.B3 != 0 || bx.b.B3 == 0)
> +    return bx.b.B3 - by.b.B3;
> +  if (bx.b.B4 - by.b.B4 != 0 || bx.b.B4 == 0)
> +    return bx.b.B4 - by.b.B4;
> +  if (bx.b.B5 - by.b.B5 != 0 || bx.b.B5 == 0)
> +    return bx.b.B5 - by.b.B5;
> +  if (bx.b.B6 - by.b.B6 != 0 || bx.b.B6 == 0)
> +    return bx.b.B6 - by.b.B6;
> +  return bx.b.B7 - by.b.B7;

There is an alternative way to find the answer without branching.  You need
to compute a word that has a nonzero i-th byte when the i-th bytes of x and y
differ or are zero.  Then find its first nonzero bit, which can be done
quickly with a CLZ or find-first-set instruction.  This could be implemented
as follows:

  uint64_t bitmask = DETECTNULL8 (x) | (x ^ y);
  uint64_t bitfirst = bitmask & -bitmask;	/* isolate the lowest set bit */
  int pos = (ffsl (bitfirst) - 1) / 8;
  return a[pos] - b[pos];
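To make that concrete, here is a self-contained sketch of such a branchless
8-byte check (again not code from the patch: compare8 is a made-up name,
DETECTNULL8 is assumed to be the usual "word contains a zero byte" test, and
the byte-index computation assumes little-endian order; a big-endian target
would count leading zeros of the mask instead, e.g. with CLZ):

#include <stdint.h>
#include <string.h>

/* Assumed: nonzero iff the 64-bit word X contains a zero byte.  */
#define DETECTNULL8(X) \
  ((((X) - 0x0101010101010101ULL) & ~(X)) & 0x8080808080808080ULL)

/* Compare 8 bytes starting at a and b.  Returns 0 if they are equal in
   these 8 bytes, otherwise the strcmp-style difference.  */
static int
compare8 (const unsigned char *a, const unsigned char *b)
{
  uint64_t x, y;
  memcpy (&x, a, 8);	/* unaligned loads; the compiler picks the best code */
  memcpy (&y, b, 8);

  /* Nonzero iff some byte differs or x contains the terminator; the
     lowest set bit falls inside the first such byte.  */
  uint64_t bitmask = DETECTNULL8 (x) | (x ^ y);
  if (bitmask == 0)
    return 0;

  /* Index of the first interesting byte (little-endian).  */
  int pos = __builtin_ctzll (bitmask) / 8;
  return a[pos] - b[pos];
}

Something along these lines could also serve as the TEST (a, b) step in the
skeleton above, though a real TEST would additionally have to tell the caller
whether the terminator was reached, since 0 here means both "equal so far"
and "equal, done".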