[RFC] Add fast path for strcoll and strcasecmp

public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed

* [RFC] Add fast path for strcoll and strcasecmp
@ 2014-11-23 21:47 Ondřej Bílka
  2014-11-23 22:52 ` Leonhard Holz
  0 siblings, 1 reply; 7+ messages in thread
From: Ondřej Bílka @ 2014-11-23 21:47 UTC (permalink / raw)
  To: libc-alpha; +Cc: leonhard.holz

Hi,

Now when I looked to strcoll improvement I recalled that strcoll(_l) and
strcasecmp could be made faster by following approach. We find first
position where strings differ and we will likely decide how they differ
by looking at few bytes. 

For that we need to do two things. First is to determine where character
containing differing byte starts. That is easy to do for single byte
encodings, UTF and prefix-free encodings in general. Then we need to
decide if decision can be made by characters alone. For that we need to
autogenerate new weigth table for each locale. It is basically primary
weigth table but ignored characters and characters that are part of
sequences get zero. If safe weigths are equal or one is zero then we
call original function.

A strcasecmp works in same way with appropriate collation table.

To determine performance gain I made a simple incorrect implementation
where I ignored calculating safe weigths. Another optimization is to use
strcmp_l optimized assembly which is easy to derive from strcmp by
returning index instead how these characters differ.

A sample implementation is here, comments?

#include <locale.h>
#include <xlocale.h>
#include <string.h>
#include <stdlib.h>

static size_t 
strcmp_l (const char *a, const char *b)
{
 size_t size = 0;
 while (a[size] != b[size] && a[size] != '\000')
   size++;
 return size;
}

char safe_characters[256];

int (*rewind_weigth_p)(const char *);

int rewind_weigth(const char *p)
{
  return safe_characters[(unsigned char) *p];
}

static void __attribute__ ((constructor)) init ()
{
  int i;
  rewind_weigth_p = rewind_weigth;
  for (i=0; i < 256; i++)
    safe_characters[i] = i + 1;
}

int strcoll (const char *a, const char *b)
{
  size_t dif = strcmp_l (a, b);

  if (a[dif] == '\000')
    return 0;

  int a_weigth = rewind_weigth_p (a + dif);

  int b_weigth = rewind_weigth_p (b + dif);

  if (a_weigth && b_weigth && a_weigth != b_weigth)
    return a_weigth - b_weigth;
  else
    return strcoll_l (a, b, newlocale (LC_ALL, setlocale(LC_ALL, NULL), (locale_t) 0));
}

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] Add fast path for strcoll and strcasecmp
  2014-11-23 21:47 [RFC] Add fast path for strcoll and strcasecmp Ondřej Bílka
@ 2014-11-23 22:52 ` Leonhard Holz
  2014-11-23 23:47   ` Ondřej Bílka
  0 siblings, 1 reply; 7+ messages in thread
From: Leonhard Holz @ 2014-11-23 22:52 UTC (permalink / raw)
  To: libc-alpha

Hi OndÅ™ej,

as far as I understood, the current strcoll implementation scans both 
strings for collation sequences and compares the weights of them, 
whereby a collation sequence can be multiple bytes long. So whatever 
strcmp_l returns as index, you would need a general way of finding the 
start of the collation sequence this index is in. Unfortunately I cannot 
tell if or how this can be done.

BTW I have implemented a benchmark for strcoll that is not-yet-pushed 
because I didn't manage to patch the bench-tests Makefile to generate 
additionally needed locales 
(https://sourceware.org/ml/libc-alpha/2014-10/msg00431.html).

Leonhard

Am 23.11.2014 22:47, schrieb OndÅ™ej BÃlka:
> Hi,
>
> Now when I looked to strcoll improvement I recalled that strcoll(_l) and
> strcasecmp could be made faster by following approach. We find first
> position where strings differ and we will likely decide how they differ
> by looking at few bytes.
>
> For that we need to do two things. First is to determine where character
> containing differing byte starts. That is easy to do for single byte
> encodings, UTF and prefix-free encodings in general. Then we need to
> decide if decision can be made by characters alone. For that we need to
> autogenerate new weigth table for each locale. It is basically primary
> weigth table but ignored characters and characters that are part of
> sequences get zero. If safe weigths are equal or one is zero then we
> call original function.
>
> A strcasecmp works in same way with appropriate collation table.
>
> To determine performance gain I made a simple incorrect implementation
> where I ignored calculating safe weigths. Another optimization is to use
> strcmp_l optimized assembly which is easy to derive from strcmp by
> returning index instead how these characters differ.
>
> A sample implementation is here, comments?
>
>
> #include <locale.h>
> #include <xlocale.h>
> #include <string.h>
> #include <stdlib.h>
>
> static size_t
> strcmp_l (const char *a, const char *b)
> {
>   size_t size = 0;
>   while (a[size] != b[size] && a[size] != '\000')
>     size++;
>   return size;
> }
>
> char safe_characters[256];
>
> int (*rewind_weigth_p)(const char *);
>
>
> int rewind_weigth(const char *p)
> {
>    return safe_characters[(unsigned char) *p];
> }
>
> static void __attribute__ ((constructor)) init ()
> {
>    int i;
>    rewind_weigth_p = rewind_weigth;
>    for (i=0; i < 256; i++)
>      safe_characters[i] = i + 1;
> }
>
>
> int strcoll (const char *a, const char *b)
> {
>    size_t dif = strcmp_l (a, b);
>
>    if (a[dif] == '\000')
>      return 0;
>
>    int a_weigth = rewind_weigth_p (a + dif);
>
>    int b_weigth = rewind_weigth_p (b + dif);
>
>    if (a_weigth && b_weigth && a_weigth != b_weigth)
>      return a_weigth - b_weigth;
>    else
>      return strcoll_l (a, b, newlocale (LC_ALL, setlocale(LC_ALL, NULL), (locale_t) 0));
> }
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] Add fast path for strcoll and strcasecmp
  2014-11-23 22:52 ` Leonhard Holz
@ 2014-11-23 23:47   ` Ondřej Bílka
  2014-11-25 20:04     ` Leonhard Holz
  0 siblings, 1 reply; 7+ messages in thread
From: Ondřej Bílka @ 2014-11-23 23:47 UTC (permalink / raw)
  To: Leonhard Holz; +Cc: libc-alpha

On Sun, Nov 23, 2014 at 11:52:06PM +0100, Leonhard Holz wrote:
> Hi OndÅ™ej,
> 
> as far as I understood, the current strcoll implementation scans
> both strings for collation sequences and compares the weights of
> them, whereby a collation sequence can be multiple bytes long. So
> whatever strcmp_l returns as index, you would need a general way of
> finding the start of the collation sequence this index is in.
> Unfortunately I cannot tell if or how this can be done.
> 
As I wrote below you do not have to do that. Just precompute a table that
is zero for characters that are part of some collation sequence and use
old method when one of compared characters is in that table.

From performance perspective these are not problem as they should be
infrequent enough. Ignored ones are worse as they could make otherwise
identical long prefixes different.


> BTW I have implemented a benchmark for strcoll that is
> not-yet-pushed because I didn't manage to patch the bench-tests
> Makefile to generate additionally needed locales
> (https://sourceware.org/ml/libc-alpha/2014-10/msg00431.html).
> 
> Leonhard
> 
> Am 23.11.2014 22:47, schrieb OndÅ™ej BÃlka:
> >Hi,
> >
> >Now when I looked to strcoll improvement I recalled that strcoll(_l) and
> >strcasecmp could be made faster by following approach. We find first
> >position where strings differ and we will likely decide how they differ
> >by looking at few bytes.
> >
> >For that we need to do two things. First is to determine where character
> >containing differing byte starts. That is easy to do for single byte
> >encodings, UTF and prefix-free encodings in general. Then we need to
> >decide if decision can be made by characters alone. For that we need to
> >autogenerate new weigth table for each locale. It is basically primary
> >weigth table but ignored characters and characters that are part of
> >sequences get zero. If safe weigths are equal or one is zero then we
> >call original function.
> >
> >A strcasecmp works in same way with appropriate collation table.
> >
> >To determine performance gain I made a simple incorrect implementation
> >where I ignored calculating safe weigths. Another optimization is to use
> >strcmp_l optimized assembly which is easy to derive from strcmp by
> >returning index instead how these characters differ.
> >
> >A sample implementation is here, comments?
> >
> >
> >#include <locale.h>
> >#include <xlocale.h>
> >#include <string.h>
> >#include <stdlib.h>
> >
> >static size_t
> >strcmp_l (const char *a, const char *b)
> >{
> >  size_t size = 0;
> >  while (a[size] != b[size] && a[size] != '\000')
> >    size++;
> >  return size;
> >}
> >
> >char safe_characters[256];
> >
> >int (*rewind_weigth_p)(const char *);
> >
> >
> >int rewind_weigth(const char *p)
> >{
> >   return safe_characters[(unsigned char) *p];
> >}
> >
> >static void __attribute__ ((constructor)) init ()
> >{
> >   int i;
> >   rewind_weigth_p = rewind_weigth;
> >   for (i=0; i < 256; i++)
> >     safe_characters[i] = i + 1;
> >}
> >
> >
> >int strcoll (const char *a, const char *b)
> >{
> >   size_t dif = strcmp_l (a, b);
> >
> >   if (a[dif] == '\000')
> >     return 0;
> >
> >   int a_weigth = rewind_weigth_p (a + dif);
> >
> >   int b_weigth = rewind_weigth_p (b + dif);
> >
> >   if (a_weigth && b_weigth && a_weigth != b_weigth)
> >     return a_weigth - b_weigth;
> >   else
> >     return strcoll_l (a, b, newlocale (LC_ALL, setlocale(LC_ALL, NULL), (locale_t) 0));
> >}
> >

-- 

runaway cat on system.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] Add fast path for strcoll and strcasecmp
  2014-11-23 23:47   ` Ondřej Bílka
@ 2014-11-25 20:04     ` Leonhard Holz
  2014-11-25 20:36       ` Ondřej Bílka
  0 siblings, 1 reply; 7+ messages in thread
From: Leonhard Holz @ 2014-11-25 20:04 UTC (permalink / raw)
  To: libc-alpha

Am 24.11.2014 00:47, schrieb OndÅ™ej BÃlka:
> On Sun, Nov 23, 2014 at 11:52:06PM +0100, Leonhard Holz wrote:
>> Hi OndÅ™ej,
>>
>> as far as I understood, the current strcoll implementation scans
>> both strings for collation sequences and compares the weights of
>> them, whereby a collation sequence can be multiple bytes long. So
>> whatever strcmp_l returns as index, you would need a general way of
>> finding the start of the collation sequence this index is in.
>> Unfortunately I cannot tell if or how this can be done.
>>
> As I wrote below you do not have to do that. Just precompute a table that
> is zero for characters that are part of some collation sequence and use
> old method when one of compared characters is in that table.
>

Ok, I understand the idea and it would be great if it worked. BTW do you 
know how UTF-8 chars above 7F are handled?

>  From performance perspective these are not problem as they should be
> infrequent enough. Ignored ones are worse as they could make otherwise
> identical long prefixes different.
>
>
>> BTW I have implemented a benchmark for strcoll that is
>> not-yet-pushed because I didn't manage to patch the bench-tests
>> Makefile to generate additionally needed locales
>> (https://sourceware.org/ml/libc-alpha/2014-10/msg00431.html).
>>

I can send you the test files if you like.

Leonhard

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] Add fast path for strcoll and strcasecmp
  2014-11-25 20:04     ` Leonhard Holz
@ 2014-11-25 20:36       ` Ondřej Bílka
  2014-11-26  9:07         ` Leonhard Holz
  0 siblings, 1 reply; 7+ messages in thread
From: Ondřej Bílka @ 2014-11-25 20:36 UTC (permalink / raw)
  To: Leonhard Holz; +Cc: libc-alpha

On Tue, Nov 25, 2014 at 09:04:04PM +0100, Leonhard Holz wrote:
> Am 24.11.2014 00:47, schrieb OndÅ™ej BÃlka:
> >On Sun, Nov 23, 2014 at 11:52:06PM +0100, Leonhard Holz wrote:
> >>Hi OndÅ™ej,
> >>
> >>as far as I understood, the current strcoll implementation scans
> >>both strings for collation sequences and compares the weights of
> >>them, whereby a collation sequence can be multiple bytes long. So
> >>whatever strcmp_l returns as index, you would need a general way of
> >>finding the start of the collation sequence this index is in.
> >>Unfortunately I cannot tell if or how this can be done.
> >>
> >As I wrote below you do not have to do that. Just precompute a table that
> >is zero for characters that are part of some collation sequence and use
> >old method when one of compared characters is in that table.
> >
> 
> Ok, I understand the idea and it would be great if it worked. BTW do
> you know how UTF-8 chars above 7F are handled?
>
A UTF-8 char consist of starting byte larger than 0xbf followed by
characters in 0x80-0xbf range, see

http://en.wikipedia.org/wiki/UTF-8

 
> > From performance perspective these are not problem as they should be
> >infrequent enough. Ignored ones are worse as they could make otherwise
> >identical long prefixes different.
> >
> >
> >>BTW I have implemented a benchmark for strcoll that is
> >>not-yet-pushed because I didn't manage to patch the bench-tests
> >>Makefile to generate additionally needed locales
> >>(https://sourceware.org/ml/libc-alpha/2014-10/msg00431.html).
> >>
> 
> I can send you the test files if you like.
> 
> Leonhard

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] Add fast path for strcoll and strcasecmp
  2014-11-25 20:36       ` Ondřej Bílka
@ 2014-11-26  9:07         ` Leonhard Holz
  2014-11-27 15:25           ` Ondřej Bílka
  0 siblings, 1 reply; 7+ messages in thread
From: Leonhard Holz @ 2014-11-26  9:07 UTC (permalink / raw)
  To: libc-alpha



Am 25.11.2014 21:36, schrieb OndÅ™ej BÃlka:
> On Tue, Nov 25, 2014 at 09:04:04PM +0100, Leonhard Holz wrote:
>> Am 24.11.2014 00:47, schrieb OndÅ™ej BÃlka:
>>> On Sun, Nov 23, 2014 at 11:52:06PM +0100, Leonhard Holz wrote:
>>>> Hi OndÅ™ej,
>>>>
>>>> as far as I understood, the current strcoll implementation scans
>>>> both strings for collation sequences and compares the weights of
>>>> them, whereby a collation sequence can be multiple bytes long. So
>>>> whatever strcmp_l returns as index, you would need a general way of
>>>> finding the start of the collation sequence this index is in.
>>>> Unfortunately I cannot tell if or how this can be done.
>>>>
>>> As I wrote below you do not have to do that. Just precompute a table that
>>> is zero for characters that are part of some collation sequence and use
>>> old method when one of compared characters is in that table.
>>>
>>
>> Ok, I understand the idea and it would be great if it worked. BTW do
>> you know how UTF-8 chars above 7F are handled?
>>
> A UTF-8 char consist of starting byte larger than 0xbf followed by
> characters in 0x80-0xbf range, see
>
> http://en.wikipedia.org/wiki/UTF-8
>

Sorry for confusion. The question was ought to ask how the algorithm 
handles them. E.g. what to do when strcmp stops at a char with value 0x81.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] Add fast path for strcoll and strcasecmp
  2014-11-26  9:07         ` Leonhard Holz
@ 2014-11-27 15:25           ` Ondřej Bílka
  0 siblings, 0 replies; 7+ messages in thread
From: Ondřej Bílka @ 2014-11-27 15:25 UTC (permalink / raw)
  To: Leonhard Holz; +Cc: libc-alpha

On Wed, Nov 26, 2014 at 10:07:21AM +0100, Leonhard Holz wrote:
> 
> 
> Am 25.11.2014 21:36, schrieb OndÅ™ej BÃlka:
> >On Tue, Nov 25, 2014 at 09:04:04PM +0100, Leonhard Holz wrote:
> >>Am 24.11.2014 00:47, schrieb OndÅ™ej BÃlka:
> >>>On Sun, Nov 23, 2014 at 11:52:06PM +0100, Leonhard Holz wrote:
> >>>>Hi OndÅ™ej,
> >>>>
> >>>>as far as I understood, the current strcoll implementation scans
> >>>>both strings for collation sequences and compares the weights of
> >>>>them, whereby a collation sequence can be multiple bytes long. So
> >>>>whatever strcmp_l returns as index, you would need a general way of
> >>>>finding the start of the collation sequence this index is in.
> >>>>Unfortunately I cannot tell if or how this can be done.
> >>>>
> >>>As I wrote below you do not have to do that. Just precompute a table that
> >>>is zero for characters that are part of some collation sequence and use
> >>>old method when one of compared characters is in that table.
> >>>
> >>
> >>Ok, I understand the idea and it would be great if it worked. BTW do
> >>you know how UTF-8 chars above 7F are handled?
> >>
> >A UTF-8 char consist of starting byte larger than 0xbf followed by
> >characters in 0x80-0xbf range, see
> >
> >http://en.wikipedia.org/wiki/UTF-8
> >
> 
> Sorry for confusion. The question was ought to ask how the algorithm
> handles them. E.g. what to do when strcmp stops at a char with value
> 0x81.

check which of three characters before is starting one. 

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2014-11-27 15:25 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-23 21:47 [RFC] Add fast path for strcoll and strcasecmp Ondřej Bílka
2014-11-23 22:52 ` Leonhard Holz
2014-11-23 23:47   ` Ondřej Bílka
2014-11-25 20:04     ` Leonhard Holz
2014-11-25 20:36       ` Ondřej Bílka
2014-11-26  9:07         ` Leonhard Holz
2014-11-27 15:25           ` Ondřej Bílka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).