From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 3339 invoked by alias); 4 Oct 2007 16:50:27 -0000 Received: (qmail 3321 invoked by uid 22791); 4 Oct 2007 16:50:26 -0000 X-Spam-Check-By: sourceware.org Received: from sunsite.ms.mff.cuni.cz (HELO sunsite.mff.cuni.cz) (195.113.15.26) by sourceware.org (qpsmtpd/0.31) with ESMTP; Thu, 04 Oct 2007 16:50:13 +0000 Received: from sunsite.mff.cuni.cz (localhost.localdomain [127.0.0.1]) by sunsite.mff.cuni.cz (8.13.8/8.13.8) with ESMTP id l94GqS5S032197; Thu, 4 Oct 2007 18:52:28 +0200 Received: (from jakub@localhost) by sunsite.mff.cuni.cz (8.13.8/8.13.8/Submit) id l94GqSnh032196; Thu, 4 Oct 2007 18:52:28 +0200 Date: Thu, 04 Oct 2007 16:50:00 -0000 From: Jakub Jelinek To: Ulrich Drepper , belazougui djamel Cc: Glibc hackers Subject: [PATCH] qsort speedups Message-ID: <20071004165228.GF2896@sunsite.mff.cuni.cz> Reply-To: Jakub Jelinek References: <1178431001.6204.12.camel@localhost> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="vtzGhvizbBRQ85DL" Content-Disposition: inline In-Reply-To: <1178431001.6204.12.camel@localhost> User-Agent: Mutt/1.4.2.2i Mailing-List: contact libc-hacker-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: libc-hacker-owner@sourceware.org X-SW-Source: 2007-10/txt/msg00004.txt.bz2 --vtzGhvizbBRQ85DL Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-length: 5556 Hi! On Sun, May 06, 2007 at 06:56:39AM +0100, belazougui djamel wrote: > I am writing again about glibc qsort algorithms and its performance. I > don't know exactly what to do to attract attention on this subject, but > speed of GNU qsort could greatly improved. At least in my experiments, > i can get quite large improvements. One of the reasons that could make > qsort so slow is the use of memcpy function in the inner loop. this is > inefficient because memcpy is slow when copied buffers are of variable > lengths.One of the solutions i have experimented was to use 4 different > sort functions, each one implemented for a class of operands lengths: > 1-sort function for operands of size 4. > 2-1-sort function for operands of size 8. > 3-sort function for operands of size 12,20,28,,8n+4. > 4-sort function for operands of size 16,24,32,,8n. > > for each of those functions i wrote my own memory copy function > optimized for size class and this made the algorithm run faster. Another > solution that made even larger speedup is to use indirect merge sorting. Here is a patch which implements that. For code size reasons it actually doesn't use different functions, as that made msort.os much bigger, just a switch. Below are some benchmark numbers from x86_64-linux, system glibc doesn't have that patch, while libc build tree in cwd does. ~/timing elf/ld.so --library-path . stdlib/tst-qsort2 100000 2048 Strip out best and worst realtime result minimum: 0.286614861 sec real / 0.000026731 sec CPU maximum: 0.296511410 sec real / 0.000049066 sec CPU average: 0.292100834 sec real / 0.000042651 sec CPU stdev : 0.001431638 sec real / 0.000003603 sec CPU ~/timing stdlib/tst-qsort2 100000 2048 Strip out best and worst realtime result minimum: 2.032786138 sec real / 0.000042467 sec CPU maximum: 2.039871554 sec real / 0.000055063 sec CPU average: 2.037650687 sec real / 0.000046376 sec CPU stdev : 0.001668754 sec real / 0.000001625 sec CPU ~/timing elf/ld.so --library-path . stdlib/tst-qsort2 100000 256 Strip out best and worst realtime result minimum: 0.079751795 sec real / 0.000037594 sec CPU maximum: 0.086948837 sec real / 0.000047276 sec CPU average: 0.080501879 sec real / 0.000041131 sec CPU stdev : 0.001015890 sec real / 0.000002226 sec CPU ~/timing stdlib/tst-qsort2 100000 256 Strip out best and worst realtime result minimum: 0.219361708 sec real / 0.000037121 sec CPU maximum: 0.221073085 sec real / 0.000048177 sec CPU average: 0.220079997 sec real / 0.000040878 sec CPU stdev : 0.000306212 sec real / 0.000002211 sec CPU ~/timing elf/ld.so --library-path . stdlib/tst-qsort2 100000 32 Strip out best and worst realtime result minimum: 0.032166783 sec real / 0.000036703 sec CPU maximum: 0.042934657 sec real / 0.000044877 sec CPU average: 0.040185916 sec real / 0.000041135 sec CPU stdev : 0.002841348 sec real / 0.000002156 sec CPU ~/timing stdlib/tst-qsort2 100000 32 Strip out best and worst realtime result minimum: 0.036521886 sec real / 0.000027210 sec CPU maximum: 0.049771413 sec real / 0.000050816 sec CPU average: 0.045404896 sec real / 0.000040574 sec CPU stdev : 0.003352199 sec real / 0.000003306 sec CPU ~/timing elf/ld.so --library-path . stdlib/tst-qsort2 100000 16 Strip out best and worst realtime result minimum: 0.022185376 sec real / 0.000034847 sec CPU maximum: 0.036350224 sec real / 0.000050040 sec CPU average: 0.032392322 sec real / 0.000041894 sec CPU stdev : 0.003810001 sec real / 0.000002991 sec CPU ~/timing stdlib/tst-qsort2 100000 16 Strip out best and worst realtime result minimum: 0.024453106 sec real / 0.000035883 sec CPU maximum: 0.038619840 sec real / 0.000044240 sec CPU average: 0.035602517 sec real / 0.000040279 sec CPU stdev : 0.002733362 sec real / 0.000001737 sec CPU ~/timing elf/ld.so --library-path . stdlib/tst-qsort2 100000 8 Strip out best and worst realtime result minimum: 0.021042132 sec real / 0.000021910 sec CPU maximum: 0.024142820 sec real / 0.000038728 sec CPU average: 0.023244031 sec real / 0.000030432 sec CPU stdev : 0.001035261 sec real / 0.000002934 sec CPU ~/timing stdlib/tst-qsort2 100000 8 Strip out best and worst realtime result minimum: 0.014727083 sec real / 0.000021330 sec CPU maximum: 0.024788724 sec real / 0.000039112 sec CPU average: 0.023562170 sec real / 0.000030172 sec CPU stdev : 0.001994661 sec real / 0.000003953 sec CPU ~/timing elf/ld.so --library-path . stdlib/tst-qsort2 100000 4 Strip out best and worst realtime result minimum: 0.020438105 sec real / 0.000018788 sec CPU maximum: 0.023082930 sec real / 0.000039773 sec CPU average: 0.022272783 sec real / 0.000027106 sec CPU stdev : 0.000949993 sec real / 0.000004309 sec CPU ~/timing stdlib/tst-qsort2 100000 4 Strip out best and worst realtime result minimum: 0.020050736 sec real / 0.000016737 sec CPU maximum: 0.033928475 sec real / 0.000034287 sec CPU average: 0.027609688 sec real / 0.000023064 sec CPU stdev : 0.004832571 sec real / 0.000004405 sec CPU ~/timing elf/ld.so --library-path . stdlib/tst-qsort2 100000 2 Strip out best and worst realtime result minimum: 0.026925996 sec real / 0.000021625 sec CPU maximum: 0.032363087 sec real / 0.000032854 sec CPU average: 0.029911853 sec real / 0.000023378 sec CPU stdev : 0.002086839 sec real / 0.000001389 sec CPU ~/timing stdlib/tst-qsort2 100000 2 Strip out best and worst realtime result minimum: 0.026910302 sec real / 0.000020686 sec CPU maximum: 0.032150177 sec real / 0.000040594 sec CPU average: 0.029878779 sec real / 0.000023112 sec CPU stdev : 0.002115564 sec real / 0.000001469 sec CPU Jakub --vtzGhvizbBRQ85DL Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename=Q2 Content-length: 10357 2007-10-04 Jakub Jelinek * stdlib/msort.c: Include stdint.h. (struct msort_param): New type. (msort_with_tmp): Use struct msort_param pointer for unchanging parameters. Add optimized handling for several common sizes and indirect sorting mode. (qsort): Adjust msort_with_tmp callers. For big S use indirect sorting. Suggested by Belazougui Djamel . * stdlib/Makefile (tests): Add tst-qsort2. * stdlib/tst-qsort2.c: New test. --- libc/stdlib/msort.c.jj 2004-02-07 16:57:34.000000000 +0100 +++ libc/stdlib/msort.c 2007-10-04 18:28:43.000000000 +0200 @@ -1,6 +1,6 @@ /* An alternative to qsort, with an identical interface. This file is part of the GNU C Library. - Copyright (C) 1992,95-97,99,2000,01,02,04 Free Software Foundation, Inc. + Copyright (C) 1992,95-97,99,2000,01,02,04,07 Free Software Foundation, Inc. Written by Mike Haertel, September 1988. The GNU C Library is free software; you can redistribute it and/or @@ -19,20 +19,25 @@ 02111-1307 USA. */ #include +#include #include #include #include #include #include -static void msort_with_tmp (void *b, size_t n, size_t s, - __compar_fn_t cmp, char *t); +struct msort_param +{ + size_t s; + size_t var; + __compar_fn_t cmp; + char *t; +}; +static void msort_with_tmp (const struct msort_param *p, void *b, size_t n); static void -msort_with_tmp (void *b, size_t n, size_t s, __compar_fn_t cmp, - char *t) +msort_with_tmp (const struct msort_param *p, void *b, size_t n) { - char *tmp; char *b1, *b2; size_t n1, n2; @@ -42,65 +47,131 @@ msort_with_tmp (void *b, size_t n, size_ n1 = n / 2; n2 = n - n1; b1 = b; - b2 = (char *) b + (n1 * s); + b2 = (char *) b + (n1 * p->s); - msort_with_tmp (b1, n1, s, cmp, t); - msort_with_tmp (b2, n2, s, cmp, t); + msort_with_tmp (p, b1, n1); + msort_with_tmp (p, b2, n2); - tmp = t; + char *tmp = p->t; + const size_t s = p->s; + __compar_fn_t cmp = p->cmp; + switch (p->var) + { + case 0: + while (n1 > 0 && n2 > 0) + { + if ((*cmp) (b1, b2) <= 0) + { + *(uint32_t *) tmp = *(uint32_t *) b1; + b1 += sizeof (uint32_t); + --n1; + } + else + { + *(uint32_t *) tmp = *(uint32_t *) b2; + b2 += sizeof (uint32_t); + --n2; + } + tmp += sizeof (uint32_t); + } + break; + case 1: + while (n1 > 0 && n2 > 0) + { + if ((*cmp) (b1, b2) <= 0) + { + *(uint64_t *) tmp = *(uint64_t *) b1; + b1 += sizeof (uint64_t); + --n1; + } + else + { + *(uint64_t *) tmp = *(uint64_t *) b2; + b2 += sizeof (uint64_t); + --n2; + } + tmp += sizeof (uint64_t); + } + break; + case 2: + while (n1 > 0 && n2 > 0) + { + unsigned long *tmpl = (unsigned long *) tmp; + unsigned long *bl; + + tmp += s; + if ((*cmp) (b1, b2) <= 0) + { + bl = (unsigned long *) b1; + b1 += s; + --n1; + } + else + { + bl = (unsigned long *) b2; + b2 += s; + --n2; + } + while (tmpl < (unsigned long *) tmp) + *tmpl++ = *bl++; + } + break; + case 3: + while (n1 > 0 && n2 > 0) + { + if ((*cmp) (*(const void **) b1, *(const void **) b2) <= 0) + { + *(void **) tmp = *(void **) b1; + b1 += sizeof (void *); + --n1; + } + else + { + *(void **) tmp = *(void **) b2; + b2 += sizeof (void *); + --n2; + } + tmp += sizeof (void *); + } + break; + default: + while (n1 > 0 && n2 > 0) + { + if ((*cmp) (b1, b2) <= 0) + { + tmp = (char *) __mempcpy (tmp, b1, s); + b1 += s; + --n1; + } + else + { + tmp = (char *) __mempcpy (tmp, b2, s); + b2 += s; + --n2; + } + } + break; + } - if (s == OPSIZ && (b1 - (char *) 0) % OPSIZ == 0) - /* We are operating on aligned words. Use direct word stores. */ - while (n1 > 0 && n2 > 0) - { - if ((*cmp) (b1, b2) <= 0) - { - --n1; - *((op_t *) tmp) = *((op_t *) b1); - tmp += sizeof (op_t); - b1 += sizeof (op_t); - } - else - { - --n2; - *((op_t *) tmp) = *((op_t *) b2); - tmp += sizeof (op_t); - b2 += sizeof (op_t); - } - } - else - while (n1 > 0 && n2 > 0) - { - if ((*cmp) (b1, b2) <= 0) - { - tmp = (char *) __mempcpy (tmp, b1, s); - b1 += s; - --n1; - } - else - { - tmp = (char *) __mempcpy (tmp, b2, s); - b2 += s; - --n2; - } - } if (n1 > 0) memcpy (tmp, b1, n1 * s); - memcpy (b, t, (n - n2) * s); + memcpy (b, p->t, (n - n2) * s); } void qsort (void *b, size_t n, size_t s, __compar_fn_t cmp) { - const size_t size = n * s; + size_t size = n * s; + char *tmp = NULL; + struct msort_param p; + + /* For large object sizes use indirect sorting. */ + if (s > 32) + size = 2 * n * sizeof (void *) + s; if (size < 1024) - { - void *buf = __alloca (size); - - /* The temporary array is small, so put it on the stack. */ - msort_with_tmp (b, n, s, cmp, buf); - } + /* The temporary array is small, so put it on the stack. */ + p.t = __alloca (size); else { /* We should avoid allocating too much memory since this might @@ -135,26 +206,89 @@ qsort (void *b, size_t n, size_t s, __co /* If the memory requirements are too high don't allocate memory. */ if (size / pagesize > (size_t) phys_pages) - _quicksort (b, n, s, cmp); - else { - /* It's somewhat large, so malloc it. */ - int save = errno; - char *tmp = malloc (size); - if (tmp == NULL) - { - /* Couldn't get space, so use the slower algorithm - that doesn't need a temporary array. */ - __set_errno (save); - _quicksort (b, n, s, cmp); - } - else - { - __set_errno (save); - msort_with_tmp (b, n, s, cmp, tmp); - free (tmp); - } + _quicksort (b, n, s, cmp); + return; + } + + /* It's somewhat large, so malloc it. */ + int save = errno; + tmp = malloc (size); + __set_errno (save); + if (tmp == NULL) + { + /* Couldn't get space, so use the slower algorithm + that doesn't need a temporary array. */ + _quicksort (b, n, s, cmp); + return; + } + p.t = tmp; + } + + p.s = s; + p.cmp = cmp; + p.var = 4; + + if (s > 32) + { + /* Indirect sorting. */ + char *ip = (char *) b; + void **tp = (void **) (p.t + n * sizeof (void *)); + void **t = tp; + void *tmp_storage = (void *) (tp + n); + + while ((void *) t < tmp_storage) + { + *t++ = ip; + ip += s; + } + p.s = sizeof (void *); + p.var = 3; + msort_with_tmp (&p, p.t + n * sizeof (void *), n); + + /* tp[0] .. tp[n - 1] is now sorted, copy around entries of + the original array. Knuth vol. 3 (2nd ed.) exercise 5.2-10. */ + char *kp; + size_t i; + for (i = 0, ip = (char *) b; i < n; i++, ip += s) + if ((kp = tp[i]) != ip) + { + size_t j = i; + char *jp = ip; + memcpy (tmp_storage, ip, s); + + do + { + size_t k = (kp - (char *) b) / s; + tp[j] = jp; + memcpy (jp, kp, s); + j = k; + jp = kp; + kp = tp[k]; + } + while (kp != ip); + + tp[j] = jp; + memcpy (jp, tmp_storage, s); + } + } + else + { + if ((s & (sizeof (uint32_t) - 1)) == 0 + && ((char *) b - (char *) 0) % __alignof__ (uint32_t) == 0) + { + if (s == sizeof (uint32_t)) + p.var = 0; + else if (s == sizeof (uint64_t) + && ((char *) b - (char *) 0) % __alignof__ (uint64_t) == 0) + p.var = 1; + else if ((s & (sizeof (unsigned long) - 1)) == 0 + && ((char *) b - (char *) 0) + % __alignof__ (unsigned long) == 0) + p.var = 2; } + msort_with_tmp (&p, b, n); } + free (tmp); } libc_hidden_def (qsort) --- libc/stdlib/Makefile.jj 2007-08-04 21:42:30.000000000 +0200 +++ libc/stdlib/Makefile 2007-10-04 17:47:27.000000000 +0200 @@ -68,7 +68,7 @@ tests := tst-strtol tst-strtod testmb t tst-limits tst-rand48 bug-strtod tst-setcontext \ test-a64l tst-qsort tst-system testmb2 bug-strtod2 \ tst-atof1 tst-atof2 tst-strtod2 tst-strtod3 tst-rand48-2 \ - tst-makecontext tst-strtod4 tst-strtod5 + tst-makecontext tst-strtod4 tst-strtod5 tst-qsort2 include ../Makeconfig --- libc/stdlib/tst-qsort2.c.jj 2007-10-04 17:01:39.000000000 +0200 +++ libc/stdlib/tst-qsort2.c 2007-10-04 17:34:43.000000000 +0200 @@ -0,0 +1,89 @@ +#include +#include + +char *array; +char *array_end; +size_t member_size; + +int +compare (const void *a1, const void *b1) +{ + const char *a = a1; + const char *b = b1; + + if (! (array <= a && a < array_end + && array <= b && b < array_end)) + { + puts ("compare arguments not inside of the array"); + exit (EXIT_FAILURE); + } + int ret = b[0] - a[0]; + if (ret) + return ret; + if (member_size > 1) + return b[1] - a[1]; + return 0; +} + +int +test (size_t nmemb, size_t size) +{ + array = malloc (nmemb * size); + if (array == NULL) + { + printf ("%zd x %zd: no memory", nmemb, size); + return 1; + } + + array_end = array + nmemb * size; + member_size = size; + + char *p; + size_t i; + size_t bias = random (); + for (i = 0, p = array; i < nmemb; i++, p += size) + { + p[0] = (char) (i + bias); + if (size > 1) + p[1] = (char) ((i + bias) >> 8); + } + + qsort (array, nmemb, size, compare); + + for (i = 0, p = array; i < nmemb - 1; i++, p += size) + { + if (p[0] < p[size] + || (size > 1 && p[0] == p[size] && p[1] < p[size + 1])) + { + printf ("%zd x %zd: failure at offset %zd\n", nmemb, + size, i); + free (array); + return 1; + } + } + + free (array); + return 0; +} + +int +main (int argc, char **argv) +{ + int ret = 0; + if (argc >= 2) + ret |= test (atoi (argv[1]), atoi (argv[2])); + else + { + ret |= test (10000, 1); + ret |= test (200000, 2); + ret |= test (2000000, 3); + ret |= test (2132310, 4); + ret |= test (1202730, 7); + ret |= test (1184710, 8); + ret |= test (272710, 12); + ret |= test (14170, 32); + ret |= test (4170, 320); + } + + return ret; +} --vtzGhvizbBRQ85DL--