public inbox for libc-hacker@sourceware.org
 help / color / mirror / Atom feed
* [PATCH] qsort speedups
       [not found] <1178431001.6204.12.camel@localhost>
@ 2007-10-04 16:50 ` Jakub Jelinek
  0 siblings, 0 replies; only message in thread
From: Jakub Jelinek @ 2007-10-04 16:50 UTC (permalink / raw)
  To: Ulrich Drepper, belazougui djamel; +Cc: Glibc hackers

[-- Attachment #1: Type: text/plain, Size: 5556 bytes --]

Hi!

On Sun, May 06, 2007 at 06:56:39AM +0100, belazougui djamel wrote:
> I am writing again about glibc qsort algorithms and its performance. I
> don't know exactly what to do to attract attention on this subject, but
> speed of GNU qsort  could greatly improved. At least in my experiments,
> i can get quite large improvements. One of the reasons that could make
> qsort so slow is the use of memcpy function in the inner loop. this is
> inefficient because memcpy is slow when copied buffers are of variable
> lengths.One of the solutions i have experimented was to use 4 different
> sort functions, each one implemented for a class of operands lengths:
> 1-sort function for operands of size 4.
> 2-1-sort function for operands of size 8.
> 3-sort function for operands of size 12,20,28,,8n+4.
> 4-sort function for operands of size 16,24,32,,8n.
> 
> for each of those functions i wrote my own memory copy function
> optimized for size class and this made the algorithm run faster. Another
> solution that made even larger speedup is to use indirect merge sorting.

Here is a patch which implements that.  For code size reasons it actually
doesn't use different functions, as that made msort.os much bigger,
just a switch.
Below are some benchmark numbers from x86_64-linux, system glibc doesn't
have that patch, while libc build tree in cwd does.

~/timing elf/ld.so --library-path . stdlib/tst-qsort2 100000 2048
Strip out best and worst realtime result
minimum: 0.286614861 sec real / 0.000026731 sec CPU
maximum: 0.296511410 sec real / 0.000049066 sec CPU
average: 0.292100834 sec real / 0.000042651 sec CPU
stdev  : 0.001431638 sec real / 0.000003603 sec CPU
~/timing stdlib/tst-qsort2 100000 2048
Strip out best and worst realtime result
minimum: 2.032786138 sec real / 0.000042467 sec CPU
maximum: 2.039871554 sec real / 0.000055063 sec CPU
average: 2.037650687 sec real / 0.000046376 sec CPU
stdev  : 0.001668754 sec real / 0.000001625 sec CPU
~/timing elf/ld.so --library-path . stdlib/tst-qsort2 100000 256
Strip out best and worst realtime result
minimum: 0.079751795 sec real / 0.000037594 sec CPU
maximum: 0.086948837 sec real / 0.000047276 sec CPU
average: 0.080501879 sec real / 0.000041131 sec CPU
stdev  : 0.001015890 sec real / 0.000002226 sec CPU
~/timing stdlib/tst-qsort2 100000 256
Strip out best and worst realtime result
minimum: 0.219361708 sec real / 0.000037121 sec CPU
maximum: 0.221073085 sec real / 0.000048177 sec CPU
average: 0.220079997 sec real / 0.000040878 sec CPU
stdev  : 0.000306212 sec real / 0.000002211 sec CPU
~/timing elf/ld.so --library-path . stdlib/tst-qsort2 100000 32
Strip out best and worst realtime result
minimum: 0.032166783 sec real / 0.000036703 sec CPU
maximum: 0.042934657 sec real / 0.000044877 sec CPU
average: 0.040185916 sec real / 0.000041135 sec CPU
stdev  : 0.002841348 sec real / 0.000002156 sec CPU
~/timing stdlib/tst-qsort2 100000 32
Strip out best and worst realtime result
minimum: 0.036521886 sec real / 0.000027210 sec CPU
maximum: 0.049771413 sec real / 0.000050816 sec CPU
average: 0.045404896 sec real / 0.000040574 sec CPU
stdev  : 0.003352199 sec real / 0.000003306 sec CPU
~/timing elf/ld.so --library-path . stdlib/tst-qsort2 100000 16
Strip out best and worst realtime result
minimum: 0.022185376 sec real / 0.000034847 sec CPU
maximum: 0.036350224 sec real / 0.000050040 sec CPU
average: 0.032392322 sec real / 0.000041894 sec CPU
stdev  : 0.003810001 sec real / 0.000002991 sec CPU
~/timing stdlib/tst-qsort2 100000 16
Strip out best and worst realtime result
minimum: 0.024453106 sec real / 0.000035883 sec CPU
maximum: 0.038619840 sec real / 0.000044240 sec CPU
average: 0.035602517 sec real / 0.000040279 sec CPU
stdev  : 0.002733362 sec real / 0.000001737 sec CPU
~/timing elf/ld.so --library-path . stdlib/tst-qsort2 100000 8
Strip out best and worst realtime result
minimum: 0.021042132 sec real / 0.000021910 sec CPU
maximum: 0.024142820 sec real / 0.000038728 sec CPU
average: 0.023244031 sec real / 0.000030432 sec CPU
stdev  : 0.001035261 sec real / 0.000002934 sec CPU
~/timing stdlib/tst-qsort2 100000 8
Strip out best and worst realtime result
minimum: 0.014727083 sec real / 0.000021330 sec CPU
maximum: 0.024788724 sec real / 0.000039112 sec CPU
average: 0.023562170 sec real / 0.000030172 sec CPU
stdev  : 0.001994661 sec real / 0.000003953 sec CPU
~/timing elf/ld.so --library-path . stdlib/tst-qsort2 100000 4
Strip out best and worst realtime result
minimum: 0.020438105 sec real / 0.000018788 sec CPU
maximum: 0.023082930 sec real / 0.000039773 sec CPU
average: 0.022272783 sec real / 0.000027106 sec CPU
stdev  : 0.000949993 sec real / 0.000004309 sec CPU
~/timing stdlib/tst-qsort2 100000 4
Strip out best and worst realtime result
minimum: 0.020050736 sec real / 0.000016737 sec CPU
maximum: 0.033928475 sec real / 0.000034287 sec CPU
average: 0.027609688 sec real / 0.000023064 sec CPU
stdev  : 0.004832571 sec real / 0.000004405 sec CPU
~/timing elf/ld.so --library-path . stdlib/tst-qsort2 100000 2
Strip out best and worst realtime result
minimum: 0.026925996 sec real / 0.000021625 sec CPU
maximum: 0.032363087 sec real / 0.000032854 sec CPU
average: 0.029911853 sec real / 0.000023378 sec CPU
stdev  : 0.002086839 sec real / 0.000001389 sec CPU
~/timing stdlib/tst-qsort2 100000 2
Strip out best and worst realtime result
minimum: 0.026910302 sec real / 0.000020686 sec CPU
maximum: 0.032150177 sec real / 0.000040594 sec CPU
average: 0.029878779 sec real / 0.000023112 sec CPU
stdev  : 0.002115564 sec real / 0.000001469 sec CPU

	Jakub

[-- Attachment #2: Q2 --]
[-- Type: text/plain, Size: 10357 bytes --]

2007-10-04  Jakub Jelinek  <jakub@redhat.com>

	* stdlib/msort.c: Include stdint.h.
	(struct msort_param): New type.
	(msort_with_tmp): Use struct msort_param pointer for unchanging
	parameters.  Add optimized handling for several common sizes
	and indirect sorting mode.
	(qsort): Adjust msort_with_tmp callers.  For big S use indirect
	sorting.
	Suggested by Belazougui Djamel <djamel8192@yahoo.fr>.

	* stdlib/Makefile (tests): Add tst-qsort2.
	* stdlib/tst-qsort2.c: New test.

--- libc/stdlib/msort.c.jj	2004-02-07 16:57:34.000000000 +0100
+++ libc/stdlib/msort.c	2007-10-04 18:28:43.000000000 +0200
@@ -1,6 +1,6 @@
 /* An alternative to qsort, with an identical interface.
    This file is part of the GNU C Library.
-   Copyright (C) 1992,95-97,99,2000,01,02,04 Free Software Foundation, Inc.
+   Copyright (C) 1992,95-97,99,2000,01,02,04,07 Free Software Foundation, Inc.
    Written by Mike Haertel, September 1988.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -19,20 +19,25 @@
    02111-1307 USA.  */
 
 #include <alloca.h>
+#include <stdint.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
 #include <memcopy.h>
 #include <errno.h>
 
-static void msort_with_tmp (void *b, size_t n, size_t s,
-			    __compar_fn_t cmp, char *t);
+struct msort_param
+{
+  size_t s;
+  size_t var;
+  __compar_fn_t cmp;
+  char *t;
+};
+static void msort_with_tmp (const struct msort_param *p, void *b, size_t n);
 
 static void
-msort_with_tmp (void *b, size_t n, size_t s, __compar_fn_t cmp,
-		char *t)
+msort_with_tmp (const struct msort_param *p, void *b, size_t n)
 {
-  char *tmp;
   char *b1, *b2;
   size_t n1, n2;
 
@@ -42,65 +47,131 @@ msort_with_tmp (void *b, size_t n, size_
   n1 = n / 2;
   n2 = n - n1;
   b1 = b;
-  b2 = (char *) b + (n1 * s);
+  b2 = (char *) b + (n1 * p->s);
 
-  msort_with_tmp (b1, n1, s, cmp, t);
-  msort_with_tmp (b2, n2, s, cmp, t);
+  msort_with_tmp (p, b1, n1);
+  msort_with_tmp (p, b2, n2);
 
-  tmp = t;
+  char *tmp = p->t;
+  const size_t s = p->s;
+  __compar_fn_t cmp = p->cmp;
+  switch (p->var)
+    {
+    case 0:
+      while (n1 > 0 && n2 > 0)
+	{
+	  if ((*cmp) (b1, b2) <= 0)
+	    {
+	      *(uint32_t *) tmp = *(uint32_t *) b1;
+	      b1 += sizeof (uint32_t);
+	      --n1;
+	    }
+	  else
+	    {
+	      *(uint32_t *) tmp = *(uint32_t *) b2;
+	      b2 += sizeof (uint32_t);
+	      --n2;
+	    }
+	  tmp += sizeof (uint32_t);
+	}
+      break;
+    case 1:
+      while (n1 > 0 && n2 > 0)
+	{
+	  if ((*cmp) (b1, b2) <= 0)
+	    {
+	      *(uint64_t *) tmp = *(uint64_t *) b1;
+	      b1 += sizeof (uint64_t);
+	      --n1;
+	    }
+	  else
+	    {
+	      *(uint64_t *) tmp = *(uint64_t *) b2;
+	      b2 += sizeof (uint64_t);
+	      --n2;
+	    }
+	  tmp += sizeof (uint64_t);
+	}
+      break;
+    case 2:
+      while (n1 > 0 && n2 > 0)
+	{
+	  unsigned long *tmpl = (unsigned long *) tmp;
+	  unsigned long *bl;
+
+	  tmp += s;
+	  if ((*cmp) (b1, b2) <= 0)
+	    {
+	      bl = (unsigned long *) b1;
+	      b1 += s;
+	      --n1;
+	    }
+	  else
+	    {
+	      bl = (unsigned long *) b2;
+	      b2 += s;
+	      --n2;
+	    }
+	  while (tmpl < (unsigned long *) tmp)
+	    *tmpl++ = *bl++;
+	}
+      break;
+    case 3:
+      while (n1 > 0 && n2 > 0)
+	{
+	  if ((*cmp) (*(const void **) b1, *(const void **) b2) <= 0)
+	    {
+	      *(void **) tmp = *(void **) b1;
+	      b1 += sizeof (void *);
+	      --n1;
+	    }
+	  else
+	    {
+	      *(void **) tmp = *(void **) b2;
+	      b2 += sizeof (void *);
+	      --n2;
+	    }
+	  tmp += sizeof (void *);
+	}
+      break;
+    default:
+      while (n1 > 0 && n2 > 0)
+	{
+	  if ((*cmp) (b1, b2) <= 0)
+	    {
+	      tmp = (char *) __mempcpy (tmp, b1, s);
+	      b1 += s;
+	      --n1;
+	    }
+	  else
+	    {
+	      tmp = (char *) __mempcpy (tmp, b2, s);
+	      b2 += s;
+	      --n2;
+	    }
+	}
+      break;
+    }
 
-  if (s == OPSIZ && (b1 - (char *) 0) % OPSIZ == 0)
-    /* We are operating on aligned words.  Use direct word stores.  */
-    while (n1 > 0 && n2 > 0)
-      {
-	if ((*cmp) (b1, b2) <= 0)
-	  {
-	    --n1;
-	    *((op_t *) tmp) = *((op_t *) b1);
-	    tmp += sizeof (op_t);
-	    b1 += sizeof (op_t);
-	  }
-	else
-	  {
-	    --n2;
-	    *((op_t *) tmp) = *((op_t *) b2);
-	    tmp += sizeof (op_t);
-	    b2 += sizeof (op_t);
-	  }
-      }
-  else
-    while (n1 > 0 && n2 > 0)
-      {
-	if ((*cmp) (b1, b2) <= 0)
-	  {
-	    tmp = (char *) __mempcpy (tmp, b1, s);
-	    b1 += s;
-	    --n1;
-	  }
-	else
-	  {
-	    tmp = (char *) __mempcpy (tmp, b2, s);
-	    b2 += s;
-	    --n2;
-	  }
-      }
   if (n1 > 0)
     memcpy (tmp, b1, n1 * s);
-  memcpy (b, t, (n - n2) * s);
+  memcpy (b, p->t, (n - n2) * s);
 }
 
 void
 qsort (void *b, size_t n, size_t s, __compar_fn_t cmp)
 {
-  const size_t size = n * s;
+  size_t size = n * s;
+  char *tmp = NULL;
+  struct msort_param p;
+
+  /* For large object sizes use indirect sorting.  */
+  if (s > 32)
+    size = 2 * n * sizeof (void *) + s;
 
   if (size < 1024)
-    {
-      void *buf = __alloca (size);
-
-      /* The temporary array is small, so put it on the stack.  */
-      msort_with_tmp (b, n, s, cmp, buf);
-    }
+    /* The temporary array is small, so put it on the stack.  */
+    p.t = __alloca (size);
   else
     {
       /* We should avoid allocating too much memory since this might
@@ -135,26 +206,89 @@ qsort (void *b, size_t n, size_t s, __co
 
       /* If the memory requirements are too high don't allocate memory.  */
       if (size / pagesize > (size_t) phys_pages)
-	_quicksort (b, n, s, cmp);
-      else
 	{
-	  /* It's somewhat large, so malloc it.  */
-	  int save = errno;
-	  char *tmp = malloc (size);
-	  if (tmp == NULL)
-	    {
-	      /* Couldn't get space, so use the slower algorithm
-		 that doesn't need a temporary array.  */
-	      __set_errno (save);
-	      _quicksort (b, n, s, cmp);
-	    }
-	  else
-	    {
-	      __set_errno (save);
-	      msort_with_tmp (b, n, s, cmp, tmp);
-	      free (tmp);
-	    }
+	  _quicksort (b, n, s, cmp);
+	  return;
+	}
+
+      /* It's somewhat large, so malloc it.  */
+      int save = errno;
+      tmp = malloc (size);
+      __set_errno (save);
+      if (tmp == NULL)
+	{
+	  /* Couldn't get space, so use the slower algorithm
+	     that doesn't need a temporary array.  */
+	  _quicksort (b, n, s, cmp);
+	  return;
+	}
+      p.t = tmp;
+    }
+
+  p.s = s;
+  p.cmp = cmp;
+  p.var = 4;
+
+  if (s > 32)
+    {
+      /* Indirect sorting.  */
+      char *ip = (char *) b;
+      void **tp = (void **) (p.t + n * sizeof (void *));
+      void **t = tp;
+      void *tmp_storage = (void *) (tp + n);
+
+      while ((void *) t < tmp_storage)
+	{
+	  *t++ = ip;
+	  ip += s;
+	}
+      p.s = sizeof (void *);
+      p.var = 3;
+      msort_with_tmp (&p, p.t + n * sizeof (void *), n);
+
+      /* tp[0] .. tp[n - 1] is now sorted, copy around entries of
+	 the original array.  Knuth vol. 3 (2nd ed.) exercise 5.2-10.  */
+      char *kp;
+      size_t i;
+      for (i = 0, ip = (char *) b; i < n; i++, ip += s)
+	if ((kp = tp[i]) != ip)
+	  {
+	    size_t j = i;
+	    char *jp = ip;
+	    memcpy (tmp_storage, ip, s);
+
+	    do
+	      {
+		size_t k = (kp - (char *) b) / s;
+		tp[j] = jp;
+		memcpy (jp, kp, s);
+		j = k;
+		jp = kp;
+		kp = tp[k];
+	      }
+	    while (kp != ip);
+
+	    tp[j] = jp;
+	    memcpy (jp, tmp_storage, s); 
+	  }
+    }
+  else
+    {
+      if ((s & (sizeof (uint32_t) - 1)) == 0
+	  && ((char *) b - (char *) 0) % __alignof__ (uint32_t) == 0)
+	{
+	  if (s == sizeof (uint32_t))
+	    p.var = 0;
+	  else if (s == sizeof (uint64_t)
+		   && ((char *) b - (char *) 0) % __alignof__ (uint64_t) == 0)
+	    p.var = 1;
+	  else if ((s & (sizeof (unsigned long) - 1)) == 0
+		   && ((char *) b - (char *) 0)
+		      % __alignof__ (unsigned long) == 0)
+	    p.var = 2;
 	}
+      msort_with_tmp (&p, b, n);
     }
+  free (tmp);
 }
 libc_hidden_def (qsort)
--- libc/stdlib/Makefile.jj	2007-08-04 21:42:30.000000000 +0200
+++ libc/stdlib/Makefile	2007-10-04 17:47:27.000000000 +0200
@@ -68,7 +68,7 @@ tests		:= tst-strtol tst-strtod testmb t
 		   tst-limits tst-rand48 bug-strtod tst-setcontext	    \
 		   test-a64l tst-qsort tst-system testmb2 bug-strtod2	    \
 		   tst-atof1 tst-atof2 tst-strtod2 tst-strtod3 tst-rand48-2 \
-		   tst-makecontext tst-strtod4 tst-strtod5
+		   tst-makecontext tst-strtod4 tst-strtod5 tst-qsort2
 
 include ../Makeconfig
 
--- libc/stdlib/tst-qsort2.c.jj	2007-10-04 17:01:39.000000000 +0200
+++ libc/stdlib/tst-qsort2.c	2007-10-04 17:34:43.000000000 +0200
@@ -0,0 +1,89 @@
+#include <stdio.h>
+#include <stdlib.h>
+
+char *array;
+char *array_end;
+size_t member_size;
+
+int
+compare (const void *a1, const void *b1)
+{
+  const char *a = a1;
+  const char *b = b1;
+
+  if (! (array <= a && a < array_end
+	 && array <= b && b < array_end))
+    {
+      puts ("compare arguments not inside of the array");
+      exit (EXIT_FAILURE);
+    }
+  int ret = b[0] - a[0];
+  if (ret)
+    return ret;
+  if (member_size > 1)
+    return b[1] - a[1];
+  return 0;
+}
+
+int
+test (size_t nmemb, size_t size)
+{
+  array = malloc (nmemb * size);
+  if (array == NULL)
+    {
+      printf ("%zd x %zd: no memory", nmemb, size);
+      return 1;
+    }
+
+  array_end = array + nmemb * size;
+  member_size = size;
+
+  char *p;
+  size_t i;
+  size_t bias = random ();
+  for (i = 0, p = array; i < nmemb; i++, p += size)
+    {
+      p[0] = (char) (i + bias);
+      if (size > 1)
+	p[1] = (char) ((i + bias) >> 8);
+    }
+
+  qsort (array, nmemb, size, compare);
+
+  for (i = 0, p = array; i < nmemb - 1; i++, p += size)
+    {
+      if (p[0] < p[size]
+	  || (size > 1 && p[0] == p[size] && p[1] < p[size + 1]))
+	{
+	  printf ("%zd x %zd: failure at offset %zd\n", nmemb,
+		  size, i);
+	  free (array);
+	  return 1;
+	}
+    }
+
+  free (array);
+  return 0;
+}
+
+int
+main (int argc, char **argv)
+{
+  int ret = 0;
+  if (argc >= 2)
+    ret |= test (atoi (argv[1]), atoi (argv[2]));
+  else
+    {
+      ret |= test (10000, 1);
+      ret |= test (200000, 2);
+      ret |= test (2000000, 3);
+      ret |= test (2132310, 4);
+      ret |= test (1202730, 7);
+      ret |= test (1184710, 8);
+      ret |= test (272710, 12);
+      ret |= test (14170, 32);
+      ret |= test (4170, 320);
+    }
+
+  return ret;
+}

^ permalink raw reply	[flat|nested] only message in thread

only message in thread, other threads:[~2007-10-04 16:50 UTC | newest]

Thread overview: (only message) (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1178431001.6204.12.camel@localhost>
2007-10-04 16:50 ` [PATCH] qsort speedups Jakub Jelinek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).