From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <pc@us.ibm.com>
Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com
 [148.163.158.5])
 by sourceware.org (Postfix) with ESMTPS id 5CAEA3858409
 for <libc-alpha@sourceware.org>; Wed, 17 Nov 2021 19:02:57 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 5CAEA3858409
Received: from pps.filterd (m0098420.ppops.net [127.0.0.1])
 by mx0b-001b2d01.pphosted.com (8.16.1.2/8.16.1.2) with SMTP id 1AHIlOQZ028365; 
 Wed, 17 Nov 2021 19:02:54 GMT
Received: from ppma02dal.us.ibm.com (a.bd.3ea9.ip4.static.sl-reverse.com
 [169.62.189.10])
 by mx0b-001b2d01.pphosted.com with ESMTP id 3cd7aw0a42-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Wed, 17 Nov 2021 19:02:53 +0000
Received: from pps.filterd (ppma02dal.us.ibm.com [127.0.0.1])
 by ppma02dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 1AHIoQ0m007112;
 Wed, 17 Nov 2021 19:02:53 GMT
Received: from b03cxnp08026.gho.boulder.ibm.com
 (b03cxnp08026.gho.boulder.ibm.com [9.17.130.18])
 by ppma02dal.us.ibm.com with ESMTP id 3ca50cxdu5-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Wed, 17 Nov 2021 19:02:53 +0000
Received: from b03ledav006.gho.boulder.ibm.com
 (b03ledav006.gho.boulder.ibm.com [9.17.130.237])
 by b03cxnp08026.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id
 1AHJ2qOw50594100
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
 Wed, 17 Nov 2021 19:02:52 GMT
Received: from b03ledav006.gho.boulder.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 24F21C6059;
 Wed, 17 Nov 2021 19:02:52 +0000 (GMT)
Received: from b03ledav006.gho.boulder.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 95F07C6074;
 Wed, 17 Nov 2021 19:02:51 +0000 (GMT)
Received: from li-24c3614c-2adc-11b2-a85c-85f334518bdb.ibm.com (unknown
 [9.65.92.51])
 by b03ledav006.gho.boulder.ibm.com (Postfix) with ESMTPS;
 Wed, 17 Nov 2021 19:02:51 +0000 (GMT)
Date: Wed, 17 Nov 2021 13:02:49 -0600
From: "Paul A. Clarke" <pc@us.ibm.com>
To: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Cc: Adhemerval Zanella <adhemerval.zanella@linaro.org>,
 "'GNU C Library'" <libc-alpha@sourceware.org>
Subject: Re: RFC: Improve hypot performance
Message-ID: <20211117190249.GC7755@li-24c3614c-2adc-11b2-a85c-85f334518bdb.ibm.com>
References: <VE1PR08MB5599C92B47C5EE081422D0DE839A9@VE1PR08MB5599.eurprd08.prod.outlook.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <VE1PR08MB5599C92B47C5EE081422D0DE839A9@VE1PR08MB5599.eurprd08.prod.outlook.com>
User-Agent: Mutt/1.10.1 (2018-07-13)
X-TM-AS-GCONF: 00
X-Proofpoint-ORIG-GUID: OcxN4TytWZyuYbQQn3OynCi95G2FmJJc
X-Proofpoint-GUID: OcxN4TytWZyuYbQQn3OynCi95G2FmJJc
X-Proofpoint-Virus-Version: vendor=baseguard
 engine=ICAP:2.0.205,Aquarius:18.0.790,Hydra:6.0.425,FMLib:17.0.607.475
 definitions=2021-11-17_06,2021-11-17_01,2020-04-07_01
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
 mlxlogscore=801
 malwarescore=0 priorityscore=1501 bulkscore=0 phishscore=0 clxscore=1015
 spamscore=0 suspectscore=0 mlxscore=0 lowpriorityscore=0 adultscore=0
 impostorscore=0 classifier=spam adjust=0 reason=mlx scancount=1
 engine=8.12.0-2110150000 definitions=main-2111170084
X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_EF, GIT_PATCH_0, RCVD_IN_MSPIKE_H4, RCVD_IN_MSPIKE_WL,
 SPF_HELO_NONE, SPF_NONE, TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Wed, 17 Nov 2021 19:02:59 -0000

On Wed, Nov 17, 2021 at 03:58:53PM +0000, Wilco Dijkstra via Libc-alpha wrote:
> Hi Adhemerval,
> 
> Here is an early version of a much faster hypot implementation. It uses fma
> to significantly outperform the powerpc version both in throughput and
> latency. It has a worst-case ULP of ~0.949 and passes the testsuite. The
> powerpc version has a worst-case ULP of ~1.21 and several test failures.
> 
> It applies on top of your hypot patch series. I didn't optimize the non-fma
> case since modern targets have fma. It'll be interesting to compare it on
> Power. You'll need to correctly set FAST_FMINMAX to indicate support for
> inlined fmin/fmax instructions (this will be added to math_private.h for
> targets that have it), without it the code tries to use a conditional move
> since using a branch here is really bad for performance.

On Power10, this implementation is still has a large delta compared to the
current implementation:

current on Power10:
  "hypot": {
   "workload-random": {
    "duration": 5.27306e+08,
    "iterations": 4.8e+07,
    "reciprocal-throughput": 8.26834,
    "latency": 13.7027,
    "max-throughput": 1.20943e+08,
    "min-throughput": 7.29781e+07
   }
  }

with Adhemerval's patches and this patch:
  "hypot": {
   "workload-random": {
    "duration": 5.34629e+08,
    "iterations": 3.6e+07,
    "reciprocal-throughput": 12.5006,
    "latency": 17.201,
    "max-throughput": 7.99959e+07,
    "min-throughput": 5.81362e+07
   }
  }

PC

> ---
> 
> diff --git a/sysdeps/ieee754/dbl-64/e_hypot.c b/sysdeps/ieee754/dbl-64/e_hypot.c
> index d20bc3e3657e350a1103a8f8477db35ee60399e0..3906711788cff5d66725e5879bb4d6c36dd24dc7 100644
> --- a/sysdeps/ieee754/dbl-64/e_hypot.c
> +++ b/sysdeps/ieee754/dbl-64/e_hypot.c
> @@ -37,73 +37,40 @@
>  #include <math-svid-compat.h>
>  #include <errno.h>
>  
> +#define FAST_FMINMAX 1
> +//#undef __FP_FAST_FMA
> +
> +#define SCALE     0x1p-600
> +#define LARGE_VAL 0x1p+511
> +#define TINY_VAL  0x1p-459
> +#define EPS       0x1p-54
> +
> +
>  static inline double
>  handle_errno (double r)
>  {
> +  r = math_narrow_eval (r);
>    if (isinf (r))
>      __set_errno (ERANGE);
>    return r;
>  }
>  
> -/* sqrt (DBL_EPSILON / 2.0)  */
> -#define SQRT_EPS_DIV_2     0x1.6a09e667f3bcdp-27
> -/* DBL_MIN / (sqrt (DBL_EPSILON / 2.0))   */
> -#define DBL_MIN_THRESHOLD  0x1.6a09e667f3bcdp-996
> -/* eps (double) * sqrt (DBL_MIN))  */
> -#define SCALE              0x1p-563
> -/* 1 / eps (sqrt (DBL_MIN)  */
> -#define INV_SCALE          0x1p+563
> -/* sqrt (DBL_MAX)  */
> -#define SQRT_DBL_MAX       0x1.6a09e667f3bccp+511
> -/* sqrt (DBL_MIN)  */
> -#define SQRT_DBL_MIN       0x1p-511
> -
> -double
> -__hypot (double x, double y)
> +static inline double
> +kernel (double ax, double ay)
>  {
> -  if ((isinf (x) || isinf (y))
> -      && !issignaling (x) && !issignaling (y))
> -    return INFINITY;
> -  if (isnan (x) || isnan (y))
> -    return x + y;
> -
> -  double ax = fabs (x);
> -  double ay = fabs (y);
> -  if (ay > ax)
> -    {
> -      double tmp = ax;
> -      ax = ay;
> -      ay = tmp;
> -    }
> -
> -  /* Widely varying operands.  The DBL_MIN_THRESHOLD check is used to avoid
> -     a spurious underflow from the multiplication.  */
> -  if (ax >= DBL_MIN_THRESHOLD && ay <= ax * SQRT_EPS_DIV_2)
> -    return (ay == 0.0)
> -	   ? ax
> -	   : handle_errno (math_narrow_eval (ax + DBL_TRUE_MIN));
> +  double t1, t2;
> +#ifdef __FP_FAST_FMA
> +  t1 = ay + ay;
> +  t2 = ax - ay;
>  
> -  double scale = SCALE;
> -  if (ax > SQRT_DBL_MAX)
> -    {
> -      ax *= scale;
> -      ay *= scale;
> -      scale = INV_SCALE;
> -    }
> -  else if (ay < SQRT_DBL_MIN)
> -    {
> -      ax /= scale;
> -      ay /= scale;
> -    }
> +  if (t1 >= ax)
> +    return sqrt (fma (t1, ax, t2 * t2));
>    else
> -    scale = 1.0;
> -
> +    return sqrt (fma (ax, ax, ay * ay));
> +#else
>    double h = sqrt (ax * ax + ay * ay);
>  
> -  double t1, t2;
> -  if (h == 0.0)
> -    return h;
> -  else if (h <= 2.0 * ay)
> +  if (h <= 2.0 * ay)
>      {
>        double delta = h - ay;
>        t1 = ax * (2.0 * delta - ax);
> @@ -112,14 +79,57 @@ __hypot (double x, double y)
>    else
>      {
>        double delta = h - ax;
> -      t1 = 2.0 * delta * (ax - 2 * ay);
> +      t1 = 2.0 * delta * (ax - 2.0 * ay);
>        t2 = (4.0 * delta - ay) * ay + delta * delta;
>      }
>    h -= (t1 + t2) / (2.0 * h);
> -  h = math_narrow_eval (h * scale);
> -  math_check_force_underflow_nonneg (h);
> -  return handle_errno (h);
> +  return h;
> +#endif
> +}
> +
> +
> +double
> +__hypot (double x, double y)
> +{
> +  if (!isfinite (x) || !isfinite (y))
> +    {
> +      if ((isinf (x) || isinf (y))
> +	  && !issignaling_inline (x) && !issignaling_inline (y))
> +	return INFINITY;
> +      return x + y;
> +    }
> +
> +  x = fabs (x);
> +  y = fabs (y);
> +
> +  double ax = FAST_FMINMAX ? fmax (x, y) : (x < y ? y : x);
> +  double ay = FAST_FMINMAX ? fmin (x, y) : (x < y ? x : y);
> +
> +  if (__glibc_unlikely (ax > LARGE_VAL))
> +    {
> +      if (__glibc_unlikely (ay <= ax * EPS))
> +	return handle_errno (ax + ay);
> +
> +      return handle_errno (kernel (ax * SCALE, ay * SCALE) / SCALE);
> +    }
> +
> +  if (__glibc_unlikely (ay < TINY_VAL))
> +    {
> +      if (__glibc_unlikely (ax >= ay / EPS))
> +	return math_narrow_eval (ax + ay);
> +
> +      ax = math_narrow_eval (kernel (ax / SCALE, ay / SCALE) * SCALE);
> +      math_check_force_underflow_nonneg (ax);
> +      return ax;
> +    }
> +
> +  /* Common case: ax is not huge and ay is not tiny.  */
> +  if (__glibc_unlikely (ay <= ax * EPS))
> +    return math_narrow_eval (ax + ay);
> +
> +  return math_narrow_eval (kernel (ax, ay));
>  }
> +
>  strong_alias (__hypot, __ieee754_hypot)
>  libm_alias_finite (__ieee754_hypot, __hypot)
>  #if LIBM_SVID_COMPAT