From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 30 Jun 2023 08:51:18 +0100
From: Szabolcs Nagy
To: Joe Ramsay,
Subject: Re: [PATCH v4 1/4] aarch64: Add vector implementations of cos routines
Message-ID:
References: <20230628111939.48140-1-Joe.Ramsay@arm.com>
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20230628111939.48140-1-Joe.Ramsay@arm.com>
MIME-Version: 1.0
The 06/28/2023 12:19, Joe Ramsay via Libc-alpha wrote:
> Replace the loop-over-scalar placeholder routines with optimised
> implementations from Arm Optimized Routines (AOR).
>
> Also add some headers containing utilities for aarch64 libmvec
> routines, and update libm-test-ulps.
>
> Data tables for new routines are used via a pointer with a
> barrier on it, in order to prevent overly aggressive constant
> inlining in GCC. This allows a single adrp, combined with offset
> loads, to be used for every constant in the table.
>
> Special-case handlers are marked NOINLINE in order to confine the
> save/restore overhead of switching from vector to normal calling
> standard. This way we only incur the extra memory access in the
> exceptional cases. NOINLINE definitions have been moved to
> math_private.h in order to reduce duplication.
>
> AOR exposes a config option, WANT_SIMD_EXCEPT, to enable
> selective masking (and later fixing up) of invalid lanes, in
> order to trigger fp exceptions correctly (AdvSIMD only). This is
> tested and maintained in AOR, however it is configured off at
> source level here for performance reasons. We keep the
> WANT_SIMD_EXCEPT blocks in routine sources to greatly simplify
> the upstreaming process from AOR to glibc.

looks good.
Reviewed-by: Szabolcs Nagy > --- > Changes to v3: > * Remove volatile from AdvSIMD data - use ptr_barrier instead > Thanks, > Joe > sysdeps/aarch64/fpu/advsimd_utils.h | 39 ------- > sysdeps/aarch64/fpu/cos_advsimd.c | 82 +++++++++++++-- > sysdeps/aarch64/fpu/cos_sve.c | 74 +++++++++++++- > sysdeps/aarch64/fpu/cosf_advsimd.c | 77 ++++++++++++-- > sysdeps/aarch64/fpu/cosf_sve.c | 72 ++++++++++++- > sysdeps/aarch64/fpu/sv_math.h | 141 ++++++++++++++++++++++++++ > sysdeps/aarch64/fpu/sve_utils.h | 55 ---------- > sysdeps/aarch64/fpu/v_math.h | 145 +++++++++++++++++++++++++++ > sysdeps/aarch64/fpu/vecmath_config.h | 38 +++++++ > sysdeps/aarch64/libm-test-ulps | 2 +- > sysdeps/generic/math_private.h | 2 +- > sysdeps/ieee754/dbl-64/math_config.h | 3 - > sysdeps/ieee754/flt-32/math_config.h | 2 - > 13 files changed, 609 insertions(+), 123 deletions(-) > delete mode 100644 sysdeps/aarch64/fpu/advsimd_utils.h > create mode 100644 sysdeps/aarch64/fpu/sv_math.h > delete mode 100644 sysdeps/aarch64/fpu/sve_utils.h > create mode 100644 sysdeps/aarch64/fpu/v_math.h > create mode 100644 sysdeps/aarch64/fpu/vecmath_config.h > > diff --git a/sysdeps/aarch64/fpu/advsimd_utils.h b/sysdeps/aarch64/fpu/advsimd_utils.h > deleted file mode 100644 > index 8a0fcc0e06..0000000000 > --- a/sysdeps/aarch64/fpu/advsimd_utils.h > +++ /dev/null > @@ -1,39 +0,0 @@ > -/* Helpers for Advanced SIMD vector math functions. > - > - Copyright (C) 2023 Free Software Foundation, Inc. > - This file is part of the GNU C Library. > - > - The GNU C Library is free software; you can redistribute it and/or > - modify it under the terms of the GNU Lesser General Public > - License as published by the Free Software Foundation; either > - version 2.1 of the License, or (at your option) any later version. > - > - The GNU C Library is distributed in the hope that it will be useful, > - but WITHOUT ANY WARRANTY; without even the implied warranty of > - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU > - Lesser General Public License for more details. > - > - You should have received a copy of the GNU Lesser General Public > - License along with the GNU C Library; if not, see > - . */ > - > -#include > - > -#define VPCS_ATTR __attribute__ ((aarch64_vector_pcs)) > - > -#define V_NAME_F1(fun) _ZGVnN4v_##fun##f > -#define V_NAME_D1(fun) _ZGVnN2v_##fun > -#define V_NAME_F2(fun) _ZGVnN4vv_##fun##f > -#define V_NAME_D2(fun) _ZGVnN2vv_##fun > - > -static __always_inline float32x4_t > -v_call_f32 (float (*f) (float), float32x4_t x) > -{ > - return (float32x4_t){ f (x[0]), f (x[1]), f (x[2]), f (x[3]) }; > -} > - > -static __always_inline float64x2_t > -v_call_f64 (double (*f) (double), float64x2_t x) > -{ > - return (float64x2_t){ f (x[0]), f (x[1]) }; > -} > diff --git a/sysdeps/aarch64/fpu/cos_advsimd.c b/sysdeps/aarch64/fpu/cos_advsimd.c > index 40831e6b0d..e8f1d10540 100644 > --- a/sysdeps/aarch64/fpu/cos_advsimd.c > +++ b/sysdeps/aarch64/fpu/cos_advsimd.c > @@ -17,13 +17,83 @@ > License along with the GNU C Library; if not, see > . */ > > -#include > +#include "v_math.h" > > -#include "advsimd_utils.h" > +static const struct data > +{ > + float64x2_t poly[7]; > + float64x2_t range_val, shift, inv_pi, half_pi, pi_1, pi_2, pi_3; > +} data = { > + /* Worst-case error is 3.3 ulp in [-pi/2, pi/2]. 
*/ > + .poly = { V2 (-0x1.555555555547bp-3), V2 (0x1.1111111108a4dp-7), > + V2 (-0x1.a01a019936f27p-13), V2 (0x1.71de37a97d93ep-19), > + V2 (-0x1.ae633919987c6p-26), V2 (0x1.60e277ae07cecp-33), > + V2 (-0x1.9e9540300a1p-41) }, > + .inv_pi = V2 (0x1.45f306dc9c883p-2), > + .half_pi = V2 (0x1.921fb54442d18p+0), > + .pi_1 = V2 (0x1.921fb54442d18p+1), > + .pi_2 = V2 (0x1.1a62633145c06p-53), > + .pi_3 = V2 (0x1.c1cd129024e09p-106), > + .shift = V2 (0x1.8p52), > + .range_val = V2 (0x1p23) > +}; > + > +#define C(i) d->poly[i] > + > +static float64x2_t VPCS_ATTR NOINLINE > +special_case (float64x2_t x, float64x2_t y, uint64x2_t odd, uint64x2_t cmp) > +{ > + y = vreinterpretq_f64_u64 (veorq_u64 (vreinterpretq_u64_f64 (y), odd)); > + return v_call_f64 (cos, x, y, cmp); > +} > > -VPCS_ATTR > -float64x2_t > -V_NAME_D1 (cos) (float64x2_t x) > +float64x2_t VPCS_ATTR V_NAME_D1 (cos) (float64x2_t x) > { > - return v_call_f64 (cos, x); > + const struct data *d = ptr_barrier (&data); > + float64x2_t n, r, r2, r3, r4, t1, t2, t3, y; > + uint64x2_t odd, cmp; > + > +#if WANT_SIMD_EXCEPT > + r = vabsq_f64 (x); > + cmp = vcgeq_u64 (vreinterpretq_u64_f64 (r), > + vreinterpretq_u64_f64 (d->range_val)); > + if (__glibc_unlikely (v_any_u64 (cmp))) > + /* If fenv exceptions are to be triggered correctly, set any special lanes > + to 1 (which is neutral w.r.t. fenv). These lanes will be fixed by > + special-case handler later. */ > + r = vbslq_f64 (cmp, v_f64 (1.0), r); > +#else > + cmp = vcageq_f64 (d->range_val, x); > + cmp = vceqzq_u64 (cmp); /* cmp = ~cmp. */ > + r = x; > +#endif > + > + /* n = rint((|x|+pi/2)/pi) - 0.5. */ > + n = vfmaq_f64 (d->shift, d->inv_pi, vaddq_f64 (r, d->half_pi)); > + odd = vshlq_n_u64 (vreinterpretq_u64_f64 (n), 63); > + n = vsubq_f64 (n, d->shift); > + n = vsubq_f64 (n, v_f64 (0.5)); > + > + /* r = |x| - n*pi (range reduction into -pi/2 .. pi/2). 
*/ > + r = vfmsq_f64 (r, d->pi_1, n); > + r = vfmsq_f64 (r, d->pi_2, n); > + r = vfmsq_f64 (r, d->pi_3, n); > + > + /* sin(r) poly approx. */ > + r2 = vmulq_f64 (r, r); > + r3 = vmulq_f64 (r2, r); > + r4 = vmulq_f64 (r2, r2); > + > + t1 = vfmaq_f64 (C (4), C (5), r2); > + t2 = vfmaq_f64 (C (2), C (3), r2); > + t3 = vfmaq_f64 (C (0), C (1), r2); > + > + y = vfmaq_f64 (t1, C (6), r4); > + y = vfmaq_f64 (t2, y, r4); > + y = vfmaq_f64 (t3, y, r4); > + y = vfmaq_f64 (r, y, r3); > + > + if (__glibc_unlikely (v_any_u64 (cmp))) > + return special_case (x, y, odd, cmp); > + return vreinterpretq_f64_u64 (veorq_u64 (vreinterpretq_u64_f64 (y), odd)); > } > diff --git a/sysdeps/aarch64/fpu/cos_sve.c b/sysdeps/aarch64/fpu/cos_sve.c > index 55501e5000..d7a9b134da 100644 > --- a/sysdeps/aarch64/fpu/cos_sve.c > +++ b/sysdeps/aarch64/fpu/cos_sve.c > @@ -17,12 +17,76 @@ > License along with the GNU C Library; if not, see > . */ > > -#include > +#include "sv_math.h" > > -#include "sve_utils.h" > +static const struct data > +{ > + double inv_pio2, pio2_1, pio2_2, pio2_3, shift; > +} data = { > + /* Polynomial coefficients are hardwired in FTMAD instructions. */ > + .inv_pio2 = 0x1.45f306dc9c882p-1, > + .pio2_1 = 0x1.921fb50000000p+0, > + .pio2_2 = 0x1.110b460000000p-26, > + .pio2_3 = 0x1.1a62633145c07p-54, > + /* Original shift used in AdvSIMD cos, > + plus a contribution to set the bit #0 of q > + as expected by trigonometric instructions. */ > + .shift = 0x1.8000000000001p52 > +}; > + > +#define RangeVal 0x4160000000000000 /* asuint64 (0x1p23). */ > + > +static svfloat64_t NOINLINE > +special_case (svfloat64_t x, svfloat64_t y, svbool_t out_of_bounds) > +{ > + return sv_call_f64 (cos, x, y, out_of_bounds); > +} > > -svfloat64_t > -SV_NAME_D1 (cos) (svfloat64_t x, svbool_t pg) > +/* A fast SVE implementation of cos based on trigonometric > + instructions (FTMAD, FTSSEL, FTSMUL). > + Maximum measured error: 2.108 ULPs. 
> + SV_NAME_D1 (cos)(0x1.9b0ba158c98f3p+7) got -0x1.fddd4c65c7f07p-3 > + want -0x1.fddd4c65c7f05p-3. */ > +svfloat64_t SV_NAME_D1 (cos) (svfloat64_t x, const svbool_t pg) > { > - return sv_call_f64 (cos, x, svdup_n_f64 (0), pg); > + const struct data *d = ptr_barrier (&data); > + > + svfloat64_t r = svabs_f64_x (pg, x); > + svbool_t out_of_bounds > + = svcmpge_n_u64 (pg, svreinterpret_u64_f64 (r), RangeVal); > + > + /* Load some constants in quad-word chunks to minimise memory access. */ > + svbool_t ptrue = svptrue_b64 (); > + svfloat64_t invpio2_and_pio2_1 = svld1rq_f64 (ptrue, &d->inv_pio2); > + svfloat64_t pio2_23 = svld1rq_f64 (ptrue, &d->pio2_2); > + > + /* n = rint(|x|/(pi/2)). */ > + svfloat64_t q = svmla_lane_f64 (sv_f64 (d->shift), r, invpio2_and_pio2_1, 0); > + svfloat64_t n = svsub_n_f64_x (pg, q, d->shift); > + > + /* r = |x| - n*(pi/2) (range reduction into -pi/4 .. pi/4). */ > + r = svmls_lane_f64 (r, n, invpio2_and_pio2_1, 1); > + r = svmls_lane_f64 (r, n, pio2_23, 0); > + r = svmls_lane_f64 (r, n, pio2_23, 1); > + > + /* cos(r) poly approx. */ > + svfloat64_t r2 = svtsmul_f64 (r, svreinterpret_u64_f64 (q)); > + svfloat64_t y = sv_f64 (0.0); > + y = svtmad_f64 (y, r2, 7); > + y = svtmad_f64 (y, r2, 6); > + y = svtmad_f64 (y, r2, 5); > + y = svtmad_f64 (y, r2, 4); > + y = svtmad_f64 (y, r2, 3); > + y = svtmad_f64 (y, r2, 2); > + y = svtmad_f64 (y, r2, 1); > + y = svtmad_f64 (y, r2, 0); > + > + /* Final multiplicative factor: 1.0 or x depending on bit #0 of q. */ > + svfloat64_t f = svtssel_f64 (r, svreinterpret_u64_f64 (q)); > + /* Apply factor. 
*/ > + y = svmul_f64_x (pg, f, y); > + > + if (__glibc_unlikely (svptest_any (pg, out_of_bounds))) > + return special_case (x, y, out_of_bounds); > + return y; > } > diff --git a/sysdeps/aarch64/fpu/cosf_advsimd.c b/sysdeps/aarch64/fpu/cosf_advsimd.c > index 35bb81aead..f05dd2bcda 100644 > --- a/sysdeps/aarch64/fpu/cosf_advsimd.c > +++ b/sysdeps/aarch64/fpu/cosf_advsimd.c > @@ -17,13 +17,78 @@ > License along with the GNU C Library; if not, see > . */ > > -#include > +#include "v_math.h" > > -#include "advsimd_utils.h" > +static const struct data > +{ > + float32x4_t poly[4]; > + float32x4_t range_val, inv_pi, half_pi, shift, pi_1, pi_2, pi_3; > +} data = { > + /* 1.886 ulp error. */ > + .poly = { V4 (-0x1.555548p-3f), V4 (0x1.110df4p-7f), V4 (-0x1.9f42eap-13f), > + V4 (0x1.5b2e76p-19f) }, > + > + .pi_1 = V4 (0x1.921fb6p+1f), > + .pi_2 = V4 (-0x1.777a5cp-24f), > + .pi_3 = V4 (-0x1.ee59dap-49f), > + > + .inv_pi = V4 (0x1.45f306p-2f), > + .shift = V4 (0x1.8p+23f), > + .half_pi = V4 (0x1.921fb6p0f), > + .range_val = V4 (0x1p20f) > +}; > + > +#define C(i) d->poly[i] > + > +static float32x4_t VPCS_ATTR NOINLINE > +special_case (float32x4_t x, float32x4_t y, uint32x4_t odd, uint32x4_t cmp) > +{ > + /* Fall back to scalar code. */ > + y = vreinterpretq_f32_u32 (veorq_u32 (vreinterpretq_u32_f32 (y), odd)); > + return v_call_f32 (cosf, x, y, cmp); > +} > > -VPCS_ATTR > -float32x4_t > -V_NAME_F1 (cos) (float32x4_t x) > +float32x4_t VPCS_ATTR V_NAME_F1 (cos) (float32x4_t x) > { > - return v_call_f32 (cosf, x); > + const struct data *d = ptr_barrier (&data); > + float32x4_t n, r, r2, r3, y; > + uint32x4_t odd, cmp; > + > +#if WANT_SIMD_EXCEPT > + r = vabsq_f32 (x); > + cmp = vcgeq_u32 (vreinterpretq_u32_f32 (r), > + vreinterpretq_u32_f32 (d->range_val)); > + if (__glibc_unlikely (v_any_u32 (cmp))) > + /* If fenv exceptions are to be triggered correctly, set any special lanes > + to 1 (which is neutral w.r.t. fenv). These lanes will be fixed by > + special-case handler later. 
*/ > + r = vbslq_f32 (cmp, v_f32 (1.0f), r); > +#else > + cmp = vcageq_f32 (d->range_val, x); > + cmp = vceqzq_u32 (cmp); /* cmp = ~cmp. */ > + r = x; > +#endif > + > + /* n = rint((|x|+pi/2)/pi) - 0.5. */ > + n = vfmaq_f32 (d->shift, d->inv_pi, vaddq_f32 (r, d->half_pi)); > + odd = vshlq_n_u32 (vreinterpretq_u32_f32 (n), 31); > + n = vsubq_f32 (n, d->shift); > + n = vsubq_f32 (n, v_f32 (0.5f)); > + > + /* r = |x| - n*pi (range reduction into -pi/2 .. pi/2). */ > + r = vfmsq_f32 (r, d->pi_1, n); > + r = vfmsq_f32 (r, d->pi_2, n); > + r = vfmsq_f32 (r, d->pi_3, n); > + > + /* y = sin(r). */ > + r2 = vmulq_f32 (r, r); > + r3 = vmulq_f32 (r2, r); > + y = vfmaq_f32 (C (2), C (3), r2); > + y = vfmaq_f32 (C (1), y, r2); > + y = vfmaq_f32 (C (0), y, r2); > + y = vfmaq_f32 (r, y, r3); > + > + if (__glibc_unlikely (v_any_u32 (cmp))) > + return special_case (x, y, odd, cmp); > + return vreinterpretq_f32_u32 (veorq_u32 (vreinterpretq_u32_f32 (y), odd)); > } > diff --git a/sysdeps/aarch64/fpu/cosf_sve.c b/sysdeps/aarch64/fpu/cosf_sve.c > index 16c68f387b..577cbd864e 100644 > --- a/sysdeps/aarch64/fpu/cosf_sve.c > +++ b/sysdeps/aarch64/fpu/cosf_sve.c > @@ -17,12 +17,74 @@ > License along with the GNU C Library; if not, see > . */ > > -#include > +#include "sv_math.h" > > -#include "sve_utils.h" > +static const struct data > +{ > + float neg_pio2_1, neg_pio2_2, neg_pio2_3, inv_pio2, shift; > +} data = { > + /* Polynomial coefficients are hard-wired in FTMAD instructions. */ > + .neg_pio2_1 = -0x1.921fb6p+0f, > + .neg_pio2_2 = 0x1.777a5cp-25f, > + .neg_pio2_3 = 0x1.ee59dap-50f, > + .inv_pio2 = 0x1.45f306p-1f, > + /* Original shift used in AdvSIMD cosf, > + plus a contribution to set the bit #0 of q > + as expected by trigonometric instructions. */ > + .shift = 0x1.800002p+23f > +}; > + > +#define RangeVal 0x49800000 /* asuint32(0x1p20f). 
*/ > > -svfloat32_t > -SV_NAME_F1 (cos) (svfloat32_t x, svbool_t pg) > +static svfloat32_t NOINLINE > +special_case (svfloat32_t x, svfloat32_t y, svbool_t out_of_bounds) > { > - return sv_call_f32 (cosf, x, svdup_n_f32 (0), pg); > + return sv_call_f32 (cosf, x, y, out_of_bounds); > +} > + > +/* A fast SVE implementation of cosf based on trigonometric > + instructions (FTMAD, FTSSEL, FTSMUL). > + Maximum measured error: 2.06 ULPs. > + SV_NAME_F1 (cos)(0x1.dea2f2p+19) got 0x1.fffe7ap-6 > + want 0x1.fffe76p-6. */ > +svfloat32_t SV_NAME_F1 (cos) (svfloat32_t x, const svbool_t pg) > +{ > + const struct data *d = ptr_barrier (&data); > + > + svfloat32_t r = svabs_f32_x (pg, x); > + svbool_t out_of_bounds > + = svcmpge_n_u32 (pg, svreinterpret_u32_f32 (r), RangeVal); > + > + /* Load some constants in quad-word chunks to minimise memory access. */ > + svfloat32_t negpio2_and_invpio2 > + = svld1rq_f32 (svptrue_b32 (), &d->neg_pio2_1); > + > + /* n = rint(|x|/(pi/2)). */ > + svfloat32_t q > + = svmla_lane_f32 (sv_f32 (d->shift), r, negpio2_and_invpio2, 3); > + svfloat32_t n = svsub_n_f32_x (pg, q, d->shift); > + > + /* r = |x| - n*(pi/2) (range reduction into -pi/4 .. pi/4). */ > + r = svmla_lane_f32 (r, n, negpio2_and_invpio2, 0); > + r = svmla_lane_f32 (r, n, negpio2_and_invpio2, 1); > + r = svmla_lane_f32 (r, n, negpio2_and_invpio2, 2); > + > + /* Final multiplicative factor: 1.0 or x depending on bit #0 of q. */ > + svfloat32_t f = svtssel_f32 (r, svreinterpret_u32_f32 (q)); > + > + /* cos(r) poly approx. */ > + svfloat32_t r2 = svtsmul_f32 (r, svreinterpret_u32_f32 (q)); > + svfloat32_t y = sv_f32 (0.0f); > + y = svtmad_f32 (y, r2, 4); > + y = svtmad_f32 (y, r2, 3); > + y = svtmad_f32 (y, r2, 2); > + y = svtmad_f32 (y, r2, 1); > + y = svtmad_f32 (y, r2, 0); > + > + /* Apply factor. 
*/ > + y = svmul_f32_x (pg, f, y); > + > + if (__glibc_unlikely (svptest_any (pg, out_of_bounds))) > + return special_case (x, y, out_of_bounds); > + return y; > } > diff --git a/sysdeps/aarch64/fpu/sv_math.h b/sysdeps/aarch64/fpu/sv_math.h > new file mode 100644 > index 0000000000..fc8e690e80 > --- /dev/null > +++ b/sysdeps/aarch64/fpu/sv_math.h > @@ -0,0 +1,141 @@ > +/* Utilities for SVE libmvec routines. > + Copyright (C) 2023 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + . */ > + > +#ifndef SV_MATH_H > +#define SV_MATH_H > + > +#include > +#include > + > +#include "vecmath_config.h" > + > +#define SV_NAME_F1(fun) _ZGVsMxv_##fun##f > +#define SV_NAME_D1(fun) _ZGVsMxv_##fun > +#define SV_NAME_F2(fun) _ZGVsMxvv_##fun##f > +#define SV_NAME_D2(fun) _ZGVsMxvv_##fun > + > +/* Double precision. 
*/ > +static inline svint64_t > +sv_s64 (int64_t x) > +{ > + return svdup_n_s64 (x); > +} > + > +static inline svuint64_t > +sv_u64 (uint64_t x) > +{ > + return svdup_n_u64 (x); > +} > + > +static inline svfloat64_t > +sv_f64 (double x) > +{ > + return svdup_n_f64 (x); > +} > + > +static inline svfloat64_t > +sv_call_f64 (double (*f) (double), svfloat64_t x, svfloat64_t y, svbool_t cmp) > +{ > + svbool_t p = svpfirst (cmp, svpfalse ()); > + while (svptest_any (cmp, p)) > + { > + double elem = svclastb_n_f64 (p, 0, x); > + elem = (*f) (elem); > + svfloat64_t y2 = svdup_n_f64 (elem); > + y = svsel_f64 (p, y2, y); > + p = svpnext_b64 (cmp, p); > + } > + return y; > +} > + > +static inline svfloat64_t > +sv_call2_f64 (double (*f) (double, double), svfloat64_t x1, svfloat64_t x2, > + svfloat64_t y, svbool_t cmp) > +{ > + svbool_t p = svpfirst (cmp, svpfalse ()); > + while (svptest_any (cmp, p)) > + { > + double elem1 = svclastb_n_f64 (p, 0, x1); > + double elem2 = svclastb_n_f64 (p, 0, x2); > + double ret = (*f) (elem1, elem2); > + svfloat64_t y2 = svdup_n_f64 (ret); > + y = svsel_f64 (p, y2, y); > + p = svpnext_b64 (cmp, p); > + } > + return y; > +} > + > +static inline svuint64_t > +sv_mod_n_u64_x (svbool_t pg, svuint64_t x, uint64_t y) > +{ > + svuint64_t q = svdiv_n_u64_x (pg, x, y); > + return svmls_n_u64_x (pg, x, q, y); > +} > + > +/* Single precision. 
*/ > +static inline svint32_t > +sv_s32 (int32_t x) > +{ > + return svdup_n_s32 (x); > +} > + > +static inline svuint32_t > +sv_u32 (uint32_t x) > +{ > + return svdup_n_u32 (x); > +} > + > +static inline svfloat32_t > +sv_f32 (float x) > +{ > + return svdup_n_f32 (x); > +} > + > +static inline svfloat32_t > +sv_call_f32 (float (*f) (float), svfloat32_t x, svfloat32_t y, svbool_t cmp) > +{ > + svbool_t p = svpfirst (cmp, svpfalse ()); > + while (svptest_any (cmp, p)) > + { > + float elem = svclastb_n_f32 (p, 0, x); > + elem = f (elem); > + svfloat32_t y2 = svdup_n_f32 (elem); > + y = svsel_f32 (p, y2, y); > + p = svpnext_b32 (cmp, p); > + } > + return y; > +} > + > +static inline svfloat32_t > +sv_call2_f32 (float (*f) (float, float), svfloat32_t x1, svfloat32_t x2, > + svfloat32_t y, svbool_t cmp) > +{ > + svbool_t p = svpfirst (cmp, svpfalse ()); > + while (svptest_any (cmp, p)) > + { > + float elem1 = svclastb_n_f32 (p, 0, x1); > + float elem2 = svclastb_n_f32 (p, 0, x2); > + float ret = f (elem1, elem2); > + svfloat32_t y2 = svdup_n_f32 (ret); > + y = svsel_f32 (p, y2, y); > + p = svpnext_b32 (cmp, p); > + } > + return y; > +} > + > +#endif > diff --git a/sysdeps/aarch64/fpu/sve_utils.h b/sysdeps/aarch64/fpu/sve_utils.h > deleted file mode 100644 > index 5ce3d2e8d6..0000000000 > --- a/sysdeps/aarch64/fpu/sve_utils.h > +++ /dev/null > @@ -1,55 +0,0 @@ > -/* Helpers for SVE vector math functions. > - > - Copyright (C) 2023 Free Software Foundation, Inc. > - This file is part of the GNU C Library. > - > - The GNU C Library is free software; you can redistribute it and/or > - modify it under the terms of the GNU Lesser General Public > - License as published by the Free Software Foundation; either > - version 2.1 of the License, or (at your option) any later version. > - > - The GNU C Library is distributed in the hope that it will be useful, > - but WITHOUT ANY WARRANTY; without even the implied warranty of > - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU > - Lesser General Public License for more details. > - > - You should have received a copy of the GNU Lesser General Public > - License along with the GNU C Library; if not, see > - . */ > - > -#include > - > -#define SV_NAME_F1(fun) _ZGVsMxv_##fun##f > -#define SV_NAME_D1(fun) _ZGVsMxv_##fun > -#define SV_NAME_F2(fun) _ZGVsMxvv_##fun##f > -#define SV_NAME_D2(fun) _ZGVsMxvv_##fun > - > -static __always_inline svfloat32_t > -sv_call_f32 (float (*f) (float), svfloat32_t x, svfloat32_t y, svbool_t cmp) > -{ > - svbool_t p = svpfirst (cmp, svpfalse ()); > - while (svptest_any (cmp, p)) > - { > - float elem = svclastb_n_f32 (p, 0, x); > - elem = (*f) (elem); > - svfloat32_t y2 = svdup_n_f32 (elem); > - y = svsel_f32 (p, y2, y); > - p = svpnext_b32 (cmp, p); > - } > - return y; > -} > - > -static __always_inline svfloat64_t > -sv_call_f64 (double (*f) (double), svfloat64_t x, svfloat64_t y, svbool_t cmp) > -{ > - svbool_t p = svpfirst (cmp, svpfalse ()); > - while (svptest_any (cmp, p)) > - { > - double elem = svclastb_n_f64 (p, 0, x); > - elem = (*f) (elem); > - svfloat64_t y2 = svdup_n_f64 (elem); > - y = svsel_f64 (p, y2, y); > - p = svpnext_b64 (cmp, p); > - } > - return y; > -} > diff --git a/sysdeps/aarch64/fpu/v_math.h b/sysdeps/aarch64/fpu/v_math.h > new file mode 100644 > index 0000000000..43efd8f99d > --- /dev/null > +++ b/sysdeps/aarch64/fpu/v_math.h > @@ -0,0 +1,145 @@ > +/* Utilities for Advanced SIMD libmvec routines. > + Copyright (C) 2023 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. 
> + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#ifndef _V_MATH_H > +#define _V_MATH_H > + > +#include <arm_neon.h> > +#include "vecmath_config.h" > + > +#define VPCS_ATTR __attribute__ ((aarch64_vector_pcs)) > + > +#define V_NAME_F1(fun) _ZGVnN4v_##fun##f > +#define V_NAME_D1(fun) _ZGVnN2v_##fun > +#define V_NAME_F2(fun) _ZGVnN4vv_##fun##f > +#define V_NAME_D2(fun) _ZGVnN2vv_##fun > + > +/* Shorthand helpers for declaring constants. */ > +#define V2(x) \ > + { \ > + x, x \ > + } > + > +#define V4(x) \ > + { \ > + x, x, x, x \ > + } > + > +static inline float32x4_t > +v_f32 (float x) > +{ > + return (float32x4_t) V4 (x); > +} > +static inline uint32x4_t > +v_u32 (uint32_t x) > +{ > + return (uint32x4_t) V4 (x); > +} > +static inline int32x4_t > +v_s32 (int32_t x) > +{ > + return (int32x4_t) V4 (x); > +} > + > +/* true if any elements of a vector compare result is non-zero. */ > +static inline int > +v_any_u32 (uint32x4_t x) > +{ > + /* assume elements in x are either 0 or -1u. */ > + return vpaddd_u64 (vreinterpretq_u64_u32 (x)) != 0; > +} > +static inline float32x4_t > +v_lookup_f32 (const float *tab, uint32x4_t idx) > +{ > + return (float32x4_t){ tab[idx[0]], tab[idx[1]], tab[idx[2]], tab[idx[3]] }; > +} > +static inline uint32x4_t > +v_lookup_u32 (const uint32_t *tab, uint32x4_t idx) > +{ > + return (uint32x4_t){ tab[idx[0]], tab[idx[1]], tab[idx[2]], tab[idx[3]] }; > +} > +static inline float32x4_t > +v_call_f32 (float (*f) (float), float32x4_t x, float32x4_t y, uint32x4_t p) > +{ > + return (float32x4_t){ p[0] ? f (x[0]) : y[0], p[1] ? f (x[1]) : y[1], > + p[2] ? f (x[2]) : y[2], p[3] ? 
f (x[3]) : y[3] }; > +} > +static inline float32x4_t > +v_call2_f32 (float (*f) (float, float), float32x4_t x1, float32x4_t x2, > + float32x4_t y, uint32x4_t p) > +{ > + return (float32x4_t){ p[0] ? f (x1[0], x2[0]) : y[0], > + p[1] ? f (x1[1], x2[1]) : y[1], > + p[2] ? f (x1[2], x2[2]) : y[2], > + p[3] ? f (x1[3], x2[3]) : y[3] }; > +} > + > +static inline float64x2_t > +v_f64 (double x) > +{ > + return (float64x2_t) V2 (x); > +} > +static inline uint64x2_t > +v_u64 (uint64_t x) > +{ > + return (uint64x2_t) V2 (x); > +} > +static inline int64x2_t > +v_s64 (int64_t x) > +{ > + return (int64x2_t) V2 (x); > +} > + > +/* true if any elements of a vector compare result is non-zero. */ > +static inline int > +v_any_u64 (uint64x2_t x) > +{ > + /* assume elements in x are either 0 or -1u. */ > + return vpaddd_u64 (x) != 0; > +} > +/* true if all elements of a vector compare result is 1. */ > +static inline int > +v_all_u64 (uint64x2_t x) > +{ > + /* assume elements in x are either 0 or -1u. */ > + return vpaddd_s64 (vreinterpretq_s64_u64 (x)) == -2; > +} > +static inline float64x2_t > +v_lookup_f64 (const double *tab, uint64x2_t idx) > +{ > + return (float64x2_t){ tab[idx[0]], tab[idx[1]] }; > +} > +static inline uint64x2_t > +v_lookup_u64 (const uint64_t *tab, uint64x2_t idx) > +{ > + return (uint64x2_t){ tab[idx[0]], tab[idx[1]] }; > +} > +static inline float64x2_t > +v_call_f64 (double (*f) (double), float64x2_t x, float64x2_t y, uint64x2_t p) > +{ > + return (float64x2_t){ p[0] ? f (x[0]) : y[0], p[1] ? f (x[1]) : y[1] }; > +} > +static inline float64x2_t > +v_call2_f64 (double (*f) (double, double), float64x2_t x1, float64x2_t x2, > + float64x2_t y, uint64x2_t p) > +{ > + return (float64x2_t){ p[0] ? f (x1[0], x2[0]) : y[0], > + p[1] ? 
f (x1[1], x2[1]) : y[1] }; > +} > + > +#endif > diff --git a/sysdeps/aarch64/fpu/vecmath_config.h b/sysdeps/aarch64/fpu/vecmath_config.h > new file mode 100644 > index 0000000000..d0bdbb4ae8 > --- /dev/null > +++ b/sysdeps/aarch64/fpu/vecmath_config.h > @@ -0,0 +1,38 @@ > +/* Configuration for libmvec routines. > + Copyright (C) 2023 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#ifndef _VECMATH_CONFIG_H > +#define _VECMATH_CONFIG_H > + > +#include <math.h> > + > +/* Deprecated config option from Arm Optimized Routines which ensures > + fp exceptions are correctly triggered. This is not intended to be > + supported in GLIBC, however we keep it for ease of development. */ > +#define WANT_SIMD_EXCEPT 0 > + > +/* Return ptr but hide its value from the compiler so accesses through it > + cannot be optimized based on the contents. 
*/ > +#define ptr_barrier(ptr) \ > + ({ \ > + __typeof (ptr) __ptr = (ptr); \ > + __asm("" : "+r"(__ptr)); \ > + __ptr; \ > + }) > + > +#endif > diff --git a/sysdeps/aarch64/libm-test-ulps b/sysdeps/aarch64/libm-test-ulps > index da7c64942c..07da4ab843 100644 > --- a/sysdeps/aarch64/libm-test-ulps > +++ b/sysdeps/aarch64/libm-test-ulps > @@ -642,7 +642,7 @@ float: 1 > ldouble: 2 > > Function: "cos_advsimd": > -double: 1 > +double: 2 > float: 1 > > Function: "cos_downward": > diff --git a/sysdeps/generic/math_private.h b/sysdeps/generic/math_private.h > index 98b54875f5..3865dcf905 100644 > --- a/sysdeps/generic/math_private.h > +++ b/sysdeps/generic/math_private.h > @@ -193,7 +193,7 @@ do { \ > # undef _Mdouble_ > #endif > > - > +#define NOINLINE __attribute__ ((noinline)) > > /* Prototypes for functions of the IBM Accurate Mathematical Library. */ > extern double __sin (double __x); > diff --git a/sysdeps/ieee754/dbl-64/math_config.h b/sysdeps/ieee754/dbl-64/math_config.h > index 499ca7591b..19af33fd86 100644 > --- a/sysdeps/ieee754/dbl-64/math_config.h > +++ b/sysdeps/ieee754/dbl-64/math_config.h > @@ -148,9 +148,6 @@ make_double (uint64_t x, int64_t ep, uint64_t s) > return asdouble (s + x + (ep << MANTISSA_WIDTH)); > } > > - > -#define NOINLINE __attribute__ ((noinline)) > - > /* Error handling tail calls for special cases, with a sign argument. > The sign of the return value is set if the argument is non-zero. 
*/ > > diff --git a/sysdeps/ieee754/flt-32/math_config.h b/sysdeps/ieee754/flt-32/math_config.h > index a7fb332767..d1b06a1a90 100644 > --- a/sysdeps/ieee754/flt-32/math_config.h > +++ b/sysdeps/ieee754/flt-32/math_config.h > @@ -151,8 +151,6 @@ make_float (uint32_t x, int ep, uint32_t s) > return asfloat (s + x + (ep << MANTISSA_WIDTH)); > } > > -#define NOINLINE __attribute__ ((noinline)) > - > attribute_hidden float __math_oflowf (uint32_t); > attribute_hidden float __math_uflowf (uint32_t); > attribute_hidden float __math_may_uflowf (uint32_t); > -- > 2.27.0 >