From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 13297 invoked by alias); 23 Nov 2014 21:47:30 -0000 Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: libc-alpha-owner@sourceware.org Received: (qmail 13282 invoked by uid 89); 23 Nov 2014 21:47:29 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-0.7 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,SPF_NEUTRAL autolearn=no version=3.3.2 X-HELO: popelka.ms.mff.cuni.cz Date: Sun, 23 Nov 2014 21:47:00 -0000 From: =?utf-8?B?T25kxZllaiBCw61sa2E=?= To: libc-alpha@sourceware.org Cc: leonhard.holz@web.de Subject: [RFC] Add fast path for strcoll and strcasecmp Message-ID: <20141123214718.GA28222@domone> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.20 (2009-06-14) X-SW-Source: 2014-11/txt/msg00648.txt.bz2 Hi, Now when I looked to strcoll improvement I recalled that strcoll(_l) and strcasecmp could be made faster by following approach. We find first position where strings differ and we will likely decide how they differ by looking at few bytes. For that we need to do two things. First is to determine where character containing differing byte starts. That is easy to do for single byte encodings, UTF and prefix-free encodings in general. Then we need to decide if decision can be made by characters alone. For that we need to autogenerate new weigth table for each locale. It is basically primary weigth table but ignored characters and characters that are part of sequences get zero. If safe weigths are equal or one is zero then we call original function. A strcasecmp works in same way with appropriate collation table. To determine performance gain I made a simple incorrect implementation where I ignored calculating safe weigths. Another optimization is to use strcmp_l optimized assembly which is easy to derive from strcmp by returning index instead how these characters differ. A sample implementation is here, comments? #include #include #include #include static size_t strcmp_l (const char *a, const char *b) { size_t size = 0; while (a[size] != b[size] && a[size] != '\000') size++; return size; } char safe_characters[256]; int (*rewind_weigth_p)(const char *); int rewind_weigth(const char *p) { return safe_characters[(unsigned char) *p]; } static void __attribute__ ((constructor)) init () { int i; rewind_weigth_p = rewind_weigth; for (i=0; i < 256; i++) safe_characters[i] = i + 1; } int strcoll (const char *a, const char *b) { size_t dif = strcmp_l (a, b); if (a[dif] == '\000') return 0; int a_weigth = rewind_weigth_p (a + dif); int b_weigth = rewind_weigth_p (b + dif); if (a_weigth && b_weigth && a_weigth != b_weigth) return a_weigth - b_weigth; else return strcoll_l (a, b, newlocale (LC_ALL, setlocale(LC_ALL, NULL), (locale_t) 0)); }