From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 26191 invoked by alias); 23 Nov 2014 23:47:42 -0000 Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: libc-alpha-owner@sourceware.org Received: (qmail 26181 invoked by uid 89); 23 Nov 2014 23:47:41 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-0.7 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,SPF_NEUTRAL autolearn=no version=3.3.2 X-HELO: popelka.ms.mff.cuni.cz Date: Sun, 23 Nov 2014 23:47:00 -0000 From: =?utf-8?B?T25kxZllaiBCw61sa2E=?= To: Leonhard Holz Cc: libc-alpha@sourceware.org Subject: Re: [RFC] Add fast path for strcoll and strcasecmp Message-ID: <20141123234728.GA31572@domone> References: <20141123214718.GA28222@domone> <54726516.5060409@web.de> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <54726516.5060409@web.de> User-Agent: Mutt/1.5.20 (2009-06-14) X-SW-Source: 2014-11/txt/msg00650.txt.bz2 On Sun, Nov 23, 2014 at 11:52:06PM +0100, Leonhard Holz wrote: > Hi Ondřej, > > as far as I understood, the current strcoll implementation scans > both strings for collation sequences and compares the weights of > them, whereby a collation sequence can be multiple bytes long. So > whatever strcmp_l returns as index, you would need a general way of > finding the start of the collation sequence this index is in. > Unfortunately I cannot tell if or how this can be done. > As I wrote below you do not have to do that. Just precompute a table that is zero for characters that are part of some collation sequence and use old method when one of compared characters is in that table. >From performance perspective these are not problem as they should be infrequent enough. Ignored ones are worse as they could make otherwise identical long prefixes different. > BTW I have implemented a benchmark for strcoll that is > not-yet-pushed because I didn't manage to patch the bench-tests > Makefile to generate additionally needed locales > (https://sourceware.org/ml/libc-alpha/2014-10/msg00431.html). > > Leonhard > > Am 23.11.2014 22:47, schrieb Ondřej Bílka: > >Hi, > > > >Now when I looked to strcoll improvement I recalled that strcoll(_l) and > >strcasecmp could be made faster by following approach. We find first > >position where strings differ and we will likely decide how they differ > >by looking at few bytes. > > > >For that we need to do two things. First is to determine where character > >containing differing byte starts. That is easy to do for single byte > >encodings, UTF and prefix-free encodings in general. Then we need to > >decide if decision can be made by characters alone. For that we need to > >autogenerate new weigth table for each locale. It is basically primary > >weigth table but ignored characters and characters that are part of > >sequences get zero. If safe weigths are equal or one is zero then we > >call original function. > > > >A strcasecmp works in same way with appropriate collation table. > > > >To determine performance gain I made a simple incorrect implementation > >where I ignored calculating safe weigths. Another optimization is to use > >strcmp_l optimized assembly which is easy to derive from strcmp by > >returning index instead how these characters differ. > > > >A sample implementation is here, comments? > > > > > >#include > >#include > >#include > >#include > > > >static size_t > >strcmp_l (const char *a, const char *b) > >{ > > size_t size = 0; > > while (a[size] != b[size] && a[size] != '\000') > > size++; > > return size; > >} > > > >char safe_characters[256]; > > > >int (*rewind_weigth_p)(const char *); > > > > > >int rewind_weigth(const char *p) > >{ > > return safe_characters[(unsigned char) *p]; > >} > > > >static void __attribute__ ((constructor)) init () > >{ > > int i; > > rewind_weigth_p = rewind_weigth; > > for (i=0; i < 256; i++) > > safe_characters[i] = i + 1; > >} > > > > > >int strcoll (const char *a, const char *b) > >{ > > size_t dif = strcmp_l (a, b); > > > > if (a[dif] == '\000') > > return 0; > > > > int a_weigth = rewind_weigth_p (a + dif); > > > > int b_weigth = rewind_weigth_p (b + dif); > > > > if (a_weigth && b_weigth && a_weigth != b_weigth) > > return a_weigth - b_weigth; > > else > > return strcoll_l (a, b, newlocale (LC_ALL, setlocale(LC_ALL, NULL), (locale_t) 0)); > >} > > -- runaway cat on system.