From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=YXQL=GM=gmail.com=goldstein.w.n@sourceware.org>
Received: from mail-oa1-x33.google.com (mail-oa1-x33.google.com [IPv6:2001:4860:4864:20::33])
	by sourceware.org (Postfix) with ESMTPS id C5C07385840D
	for <libc-alpha@sourceware.org>; Mon, 30 Oct 2023 19:03:09 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org C5C07385840D
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org C5C07385840D
Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=2001:4860:4864:20::33
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1698692596; cv=none;
	b=SgW/CAt2XEBfC25olac5jVFlejNYMZzlF5r977dKUpIp7VR72T0spyjZH0KKQJ/amIj68KhPOX2B1xY5HXJMs6QJ/L6LzxI2Qcn9xrr25LR04qsJ+xAg5cHRE/d9yEEJtrT2sobbO0gtrKKYWbkF6X/QH6imyJrklfit5abwR/Y=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
	t=1698692596; c=relaxed/simple;
	bh=JHYfiOCWC0hqJL/4H3+omzTd6gjil32ol9QV7HClwjc=;
	h=DKIM-Signature:MIME-Version:From:Date:Message-ID:Subject:To; b=xIEL4U4Jc5PpU2xitRF+7TQI+p03Tkz9pJdh+fNgrW9G9jK7lULGGNIlKR9NJS+Q8NbdVQCH6ybjjoX7rHDrbM3TQztevyVtKsvsKK4cwZoFzkoY36lo/pKRzvBGP51qPMiZp86oeyC3PLbz8tdkFyYP7nOLCDdqi3JWfyeOUeA=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: by mail-oa1-x33.google.com with SMTP id 586e51a60fabf-1e98e97c824so2979188fac.1
        for <libc-alpha@sourceware.org>; Mon, 30 Oct 2023 12:03:09 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20230601; t=1698692589; x=1699297389; darn=sourceware.org;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:from:to:cc:subject:date
         :message-id:reply-to;
        bh=d9Dry7KbZpDqsqEaPorgR6PMGQl/LBFWs1bLF6zrvsg=;
        b=XGUH7GniLL8ABMa42l/sQi9/sSzjv61E/8lDQ0ahurlEdjyGsMAQXRp52GwEAfKvHJ
         ZgmwkOIA54Gj9tG0pYeh7gIsA6sxu/B3GvmiXMTyvolLnGV9aeAs1qBfsUi47HvbyKEC
         OP/XQ8bU59gMYf2DTxStW49SS9/XSotc+3wV9aRHryRWlOcSJWgC/dxysRT5G8+OVzGK
         mDhSHfpDgfPAMJuoMgRUVC87dcylJ7JWRiPq4Akr34oIRGIjibb17Rz5zzL1eCge72yX
         Bm+Y6IPDIOU8SX+cvwnT/Yj2bS+MJjbIVoqgxw0+qO9xW+ni8oDx/Pk81QNpOWgogVZi
         W8nw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1698692589; x=1699297389;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=d9Dry7KbZpDqsqEaPorgR6PMGQl/LBFWs1bLF6zrvsg=;
        b=xA1ix5/BP3HV72q/i+XrRMb9R6kEbORjyS6c8cybEEp/01PvGHqBfVWA+k1NopZ23r
         5xRWLtY6PHWgeWdeXFcK54Wz/vsbS+Fj0Azh3f8TcZs0+I1Mv051D5pG+SJgjjnfyE4W
         uMlwwbZZilhw1imJC3/LcOYsoso7DnWYeCJsoGSa27weKlkVW4appun6Cg3vcgamnuaj
         TFxgn6u1UY31Ie3byvR/5KehNB+UCM3ClwYFZ/GkpRX+D/NzikDytN3M4S3Z0kq2spd4
         uOylOiOZnQWjMkUoxs3xjhBIGost+zCvqZd4l07NuZ0kJJGCnFmBrDuNkKTWZINbjd3i
         TbMw==
X-Gm-Message-State: AOJu0YyuqtUuyWpVQfYMVg20HSGEPyrCEBK1YE/Lh5qKJS+zFwcchSxM
	9Z6JaLCbKeJnkCIsw+mKI1/M73LeXYAeoHxUyKc=
X-Google-Smtp-Source: AGHT+IHAd9mnyTbAMz2g7v+Ke7X1iXrKC+99WkAlllmKLNJHkVBknl2/rNuQtCKPQ6JSOHTjrOii66110oBYTaLe5rU=
X-Received: by 2002:a05:6870:114f:b0:1e9:9c24:4498 with SMTP id
 15-20020a056870114f00b001e99c244498mr237687oag.9.1698692588948; Mon, 30 Oct
 2023 12:03:08 -0700 (PDT)
MIME-Version: 1.0
References: <20231003122251.3325435-1-adhemerval.zanella@linaro.org> <20231003122251.3325435-3-adhemerval.zanella@linaro.org>
In-Reply-To: <20231003122251.3325435-3-adhemerval.zanella@linaro.org>
From: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Mon, 30 Oct 2023 14:02:57 -0500
Message-ID: <CAFUsyfKDo1UofOxv1+q-WCgbOQn-qxBZsNUN1K3JhDuwmWj=aQ@mail.gmail.com>
Subject: Re: [PATCH v8 2/7] stdlib: Optimization qsort{_r} swap implementation
To: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Cc: libc-alpha@sourceware.org, Paul Eggert <eggert@cs.ucla.edu>, 
	Florian Weimer <fweimer@redhat.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-9.3 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,GIT_PATCH_0,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <libc-alpha.sourceware.org>

On Tue, Oct 3, 2023 at 7:23=E2=80=AFAM Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
> The optimization takes in consideration both the most common elements
> are either 32 or 64 bit in size and inputs are aligned to the word
> boundary.  This is similar to what msort does.
>
> For large buffer the swap operation uses memcpy/mempcpy with a
> small fixed size buffer (so compiler might inline the operations).
>
> Checked on x86_64-linux-gnu.
> ---
>  stdlib/qsort.c | 95 ++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 77 insertions(+), 18 deletions(-)
>
> diff --git a/stdlib/qsort.c b/stdlib/qsort.c
> index 728a0ed370..072ccdfb95 100644
> --- a/stdlib/qsort.c
> +++ b/stdlib/qsort.c
> @@ -21,22 +21,73 @@
>
>  #include <alloca.h>
>  #include <limits.h>
> +#include <memswap.h>
>  #include <stdlib.h>
>  #include <string.h>
> +#include <stdbool.h>
>
> -/* Byte-wise swap two items of size SIZE. */
> -#define SWAP(a, b, size)                                                =
     \
> -  do                                                                    =
     \
> -    {                                                                   =
     \
> -      size_t __size =3D (size);                                         =
       \
> -      char *__a =3D (a), *__b =3D (b);                                  =
         \
> -      do                                                                =
     \
> -       {                                                                =
     \
> -         char __tmp =3D *__a;                                           =
       \
> -         *__a++ =3D *__b;                                               =
       \
> -         *__b++ =3D __tmp;                                              =
       \
> -       } while (--__size > 0);                                          =
     \
> -    } while (0)
> +/* Swap SIZE bytes between addresses A and B.  These helpers are provide=
d
> +   along the generic one as an optimization.  */
> +
> +enum swap_type_t
> +  {
> +    SWAP_WORDS_64,
> +    SWAP_WORDS_32,
> +    SWAP_BYTES
> +  };
> +
> +/* If this function returns true, elements can be safely copied using wo=
rd
> +   loads and stores.  Otherwise, it might not be safe.  BASE (as an inte=
ger)
> +   must be a multiple of the word alignment.  SIZE must be a multiple of
> +   WORDSIZE.  Since WORDSIZE must be a multiple of the word alignment, a=
nd
> +   WORDSIZE is a power of two on all supported platforms, this function =
for
> +   speed merely checks that BASE and SIZE are both multiples of the word
> +   size.  */
> +static inline bool
> +is_aligned (const void *base, size_t size, size_t wordsize)
> +{
> +  return (((uintptr_t) base | size) & (wordsize - 1)) =3D=3D 0;
> +}
> +
> +static inline void
> +swap_words_64 (void * restrict a, void * restrict b, size_t n)
> +{
> +  typedef uint64_t __attribute__ ((__may_alias__)) u64_alias_t;
> +  do
> +   {
> +     n -=3D 8;
> +     u64_alias_t t =3D *(u64_alias_t *)(a + n);
> +     *(u64_alias_t *)(a + n) =3D *(u64_alias_t *)(b + n);
> +     *(u64_alias_t *)(b + n) =3D t;
> +   } while (n);
> +}
> +
> +static inline void
> +swap_words_32 (void * restrict a, void * restrict b, size_t n)
> +{
> +  typedef uint32_t __attribute__ ((__may_alias__)) u32_alias_t;
> +  do
> +   {
> +     n -=3D 4;
> +     u32_alias_t t =3D *(u32_alias_t *)(a + n);
> +     *(u32_alias_t *)(a + n) =3D *(u32_alias_t *)(b + n);
> +     *(u32_alias_t *)(b + n) =3D t;
> +   } while (n);
> +}
> +
> +/* Replace the indirect call with a serie of if statements.  It should h=
elp
> +   the branch predictor.  */
> +static void
> +do_swap (void * restrict a, void * restrict b, size_t size,
> +        enum swap_type_t swap_type)
> +{
> +  if (swap_type =3D=3D SWAP_WORDS_64)
> +    swap_words_64 (a, b, size);
> +  else if (swap_type =3D=3D SWAP_WORDS_32)
> +    swap_words_32 (a, b, size);
> +  else
> +    __memswap (a, b, size);
> +}
>
>  /* Discontinue quicksort algorithm when partition gets below this size.
>     This particular magic number was chosen to work best on a Sun 4/260. =
*/
> @@ -96,6 +147,14 @@ _quicksort (void *const pbase, size_t total_elems, si=
ze_t size,
>      /* Avoid lossage with unsigned arithmetic below.  */
>      return;
>
> +  enum swap_type_t swap_type;
> +  if (is_aligned (pbase, size, 8))
> +    swap_type =3D SWAP_WORDS_64;
> +  else if (is_aligned (pbase, size, 4))
> +    swap_type =3D SWAP_WORDS_32;
> +  else
> +    swap_type =3D SWAP_BYTES;
> +
>    if (total_elems > MAX_THRESH)
>      {
>        char *lo =3D base_ptr;
> @@ -119,13 +178,13 @@ _quicksort (void *const pbase, size_t total_elems, =
size_t size,
>           char *mid =3D lo + size * ((hi - lo) / size >> 1);
>
>           if ((*cmp) ((void *) mid, (void *) lo, arg) < 0)
> -           SWAP (mid, lo, size);
> +           do_swap (mid, lo, size, swap_type);
>           if ((*cmp) ((void *) hi, (void *) mid, arg) < 0)
> -           SWAP (mid, hi, size);
> +           do_swap (mid, hi, size, swap_type);
>           else
>             goto jump_over;
>           if ((*cmp) ((void *) mid, (void *) lo, arg) < 0)
> -           SWAP (mid, lo, size);
> +           do_swap (mid, lo, size, swap_type);
>         jump_over:;
>
>           left_ptr  =3D lo + size;
> @@ -144,7 +203,7 @@ _quicksort (void *const pbase, size_t total_elems, si=
ze_t size,
>
>               if (left_ptr < right_ptr)
>                 {
> -                 SWAP (left_ptr, right_ptr, size);
> +                 do_swap (left_ptr, right_ptr, size, swap_type);
>                   if (mid =3D=3D left_ptr)
>                     mid =3D right_ptr;
>                   else if (mid =3D=3D right_ptr)
> @@ -216,7 +275,7 @@ _quicksort (void *const pbase, size_t total_elems, si=
ze_t size,
>          tmp_ptr =3D run_ptr;
>
>      if (tmp_ptr !=3D base_ptr)
> -      SWAP (tmp_ptr, base_ptr, size);
> +      do_swap (tmp_ptr, base_ptr, size, swap_type);
>
>      /* Insertion sort, running from left-hand-side up to right-hand-side=
.  */
>
> --
> 2.34.1
>
LGTM
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>