From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-alpha-return-54676-listarch-libc-alpha=sources.redhat.com@sourceware.org>
Received: (qmail 26191 invoked by alias); 23 Nov 2014 23:47:42 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Received: (qmail 26181 invoked by uid 89); 23 Nov 2014 23:47:41 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-0.7 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,SPF_NEUTRAL autolearn=no version=3.3.2
X-HELO: popelka.ms.mff.cuni.cz
Date: Sun, 23 Nov 2014 23:47:00 -0000
From: =?utf-8?B?T25kxZllaiBCw61sa2E=?= <neleai@seznam.cz>
To: Leonhard Holz <leonhard.holz@web.de>
Cc: libc-alpha@sourceware.org
Subject: Re: [RFC] Add fast path for strcoll and strcasecmp
Message-ID: <20141123234728.GA31572@domone>
References: <20141123214718.GA28222@domone>
 <54726516.5060409@web.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <54726516.5060409@web.de>
User-Agent: Mutt/1.5.20 (2009-06-14)
X-SW-Source: 2014-11/txt/msg00650.txt.bz2

On Sun, Nov 23, 2014 at 11:52:06PM +0100, Leonhard Holz wrote:
> Hi OndÅej,
> 
> as far as I understood, the current strcoll implementation scans
> both strings for collation sequences and compares the weights of
> them, whereby a collation sequence can be multiple bytes long. So
> whatever strcmp_l returns as index, you would need a general way of
> finding the start of the collation sequence this index is in.
> Unfortunately I cannot tell if or how this can be done.
> 
As I wrote below you do not have to do that. Just precompute a table that
is zero for characters that are part of some collation sequence and use
old method when one of compared characters is in that table.

>From performance perspective these are not problem as they should be
infrequent enough. Ignored ones are worse as they could make otherwise
identical long prefixes different.


> BTW I have implemented a benchmark for strcoll that is
> not-yet-pushed because I didn't manage to patch the bench-tests
> Makefile to generate additionally needed locales
> (https://sourceware.org/ml/libc-alpha/2014-10/msg00431.html).
> 
> Leonhard
> 
> Am 23.11.2014 22:47, schrieb OndÅej BÃ­lka:
> >Hi,
> >
> >Now when I looked to strcoll improvement I recalled that strcoll(_l) and
> >strcasecmp could be made faster by following approach. We find first
> >position where strings differ and we will likely decide how they differ
> >by looking at few bytes.
> >
> >For that we need to do two things. First is to determine where character
> >containing differing byte starts. That is easy to do for single byte
> >encodings, UTF and prefix-free encodings in general. Then we need to
> >decide if decision can be made by characters alone. For that we need to
> >autogenerate new weigth table for each locale. It is basically primary
> >weigth table but ignored characters and characters that are part of
> >sequences get zero. If safe weigths are equal or one is zero then we
> >call original function.
> >
> >A strcasecmp works in same way with appropriate collation table.
> >
> >To determine performance gain I made a simple incorrect implementation
> >where I ignored calculating safe weigths. Another optimization is to use
> >strcmp_l optimized assembly which is easy to derive from strcmp by
> >returning index instead how these characters differ.
> >
> >A sample implementation is here, comments?
> >
> >
> >#include <locale.h>
> >#include <xlocale.h>
> >#include <string.h>
> >#include <stdlib.h>
> >
> >static size_t
> >strcmp_l (const char *a, const char *b)
> >{
> >  size_t size = 0;
> >  while (a[size] != b[size] && a[size] != '\000')
> >    size++;
> >  return size;
> >}
> >
> >char safe_characters[256];
> >
> >int (*rewind_weigth_p)(const char *);
> >
> >
> >int rewind_weigth(const char *p)
> >{
> >   return safe_characters[(unsigned char) *p];
> >}
> >
> >static void __attribute__ ((constructor)) init ()
> >{
> >   int i;
> >   rewind_weigth_p = rewind_weigth;
> >   for (i=0; i < 256; i++)
> >     safe_characters[i] = i + 1;
> >}
> >
> >
> >int strcoll (const char *a, const char *b)
> >{
> >   size_t dif = strcmp_l (a, b);
> >
> >   if (a[dif] == '\000')
> >     return 0;
> >
> >   int a_weigth = rewind_weigth_p (a + dif);
> >
> >   int b_weigth = rewind_weigth_p (b + dif);
> >
> >   if (a_weigth && b_weigth && a_weigth != b_weigth)
> >     return a_weigth - b_weigth;
> >   else
> >     return strcoll_l (a, b, newlocale (LC_ALL, setlocale(LC_ALL, NULL), (locale_t) 0));
> >}
> >

-- 

runaway cat on system.