From: Joe Ramsay
CC: Joe Ramsay
Subject: [PATCH] aarch64: Fix AdvSIMD libmvec routines for big-endian
Date: Tue, 30 Apr 2024 13:47:14 +0100
Message-ID: <20240430124714.49857-1-Joe.Ramsay@arm.com>

Many routines relied on implicit loads. Force explicit ld1 loads
instead, so that the lanes are in the right order on big-endian
targets.
---
Thanks,
Joe

 sysdeps/aarch64/fpu/asinh_advsimd.c   | 15 ++++++++-----
 sysdeps/aarch64/fpu/cosh_advsimd.c    |  9 +++++---
 sysdeps/aarch64/fpu/erf_advsimd.c     |  4 ++--
 sysdeps/aarch64/fpu/erfc_advsimd.c    | 31 ++++++++++++++++-----------
 sysdeps/aarch64/fpu/erfcf_advsimd.c   | 28 ++++++++++++++----------
 sysdeps/aarch64/fpu/erff_advsimd.c    | 12 +++++------
 sysdeps/aarch64/fpu/exp10f_advsimd.c  | 10 +++++----
 sysdeps/aarch64/fpu/expm1_advsimd.c   |  9 +++++---
 sysdeps/aarch64/fpu/expm1f_advsimd.c  | 11 +++++-----
 sysdeps/aarch64/fpu/log10_advsimd.c   |  6 ++++--
 sysdeps/aarch64/fpu/log2_advsimd.c    |  6 ++++--
 sysdeps/aarch64/fpu/log_advsimd.c     |  9 ++------
 sysdeps/aarch64/fpu/sinh_advsimd.c    | 13 ++++++-----
 sysdeps/aarch64/fpu/tan_advsimd.c     |  8 ++++---
 sysdeps/aarch64/fpu/tanf_advsimd.c    | 11 +++++-----
 sysdeps/aarch64/fpu/v_expf_inline.h   | 10 +++++----
 sysdeps/aarch64/fpu/v_expm1f_inline.h | 12 ++++++-----
 17 files changed, 119 insertions(+), 85 deletions(-)

diff --git a/sysdeps/aarch64/fpu/asinh_advsimd.c b/sysdeps/aarch64/fpu/asinh_advsimd.c
index 544a52f651..6207e7da95 100644
--- a/sysdeps/aarch64/fpu/asinh_advsimd.c
+++ b/sysdeps/aarch64/fpu/asinh_advsimd.c
@@ -22,6 +22,7 @@
 
 #define A(i) v_f64 (__v_log_data.poly[i])
 #define N (1 << V_LOG_TABLE_BITS)
+#define IndexMask (N - 1)
 
 const static struct data
 {
@@ -63,11 +64,15 @@ struct entry
 static inline struct entry
 lookup (uint64x2_t i)
 {
-  float64x2_t e0 = vld1q_f64 (
-      &__v_log_data.table[(i[0] >> (52 - V_LOG_TABLE_BITS)) & (N - 1)].invc);
-  float64x2_t e1 = vld1q_f64 (
-      &__v_log_data.table[(i[1] >> (52 - V_LOG_TABLE_BITS)) & (N - 1)].invc);
-  return (struct entry){ vuzp1q_f64 (e0, e1), vuzp2q_f64 (e0, e1) };
+  /* Since N is a power of 2, n % N = n & (N - 1).  */
+  struct entry e;
+  uint64_t i0 = (vgetq_lane_u64 (i, 0) >> (52 - V_LOG_TABLE_BITS)) & IndexMask;
+  uint64_t i1 = (vgetq_lane_u64 (i, 1) >> (52 - V_LOG_TABLE_BITS)) & IndexMask;
+  float64x2_t e0 = vld1q_f64 (&__v_log_data.table[i0].invc);
+  float64x2_t e1 = vld1q_f64 (&__v_log_data.table[i1].invc);
+  e.invc = vuzp1q_f64 (e0, e1);
+  e.logc = vuzp2q_f64 (e0, e1);
+  return e;
 }
 
 static inline float64x2_t
diff --git a/sysdeps/aarch64/fpu/cosh_advsimd.c b/sysdeps/aarch64/fpu/cosh_advsimd.c
index ec7b59637e..4bee734f00 100644
--- a/sysdeps/aarch64/fpu/cosh_advsimd.c
+++ b/sysdeps/aarch64/fpu/cosh_advsimd.c
@@ -22,7 +22,9 @@
 static const struct data
 {
   float64x2_t poly[3];
-  float64x2_t inv_ln2, ln2, shift, thres;
+  float64x2_t inv_ln2;
+  double ln2[2];
+  float64x2_t shift, thres;
   uint64x2_t index_mask, special_bound;
 } data = {
   .poly = { V2 (0x1.fffffffffffd4p-2), V2 (0x1.5555571d6b68cp-3),
@@ -58,8 +60,9 @@ exp_inline (float64x2_t x)
   float64x2_t n = vsubq_f64 (z, d->shift);
 
   /* r = x - n*ln2/N.  */
-  float64x2_t r = vfmaq_laneq_f64 (x, n, d->ln2, 0);
-  r = vfmaq_laneq_f64 (r, n, d->ln2, 1);
+  float64x2_t ln2 = vld1q_f64 (d->ln2);
+  float64x2_t r = vfmaq_laneq_f64 (x, n, ln2, 0);
+  r = vfmaq_laneq_f64 (r, n, ln2, 1);
 
   uint64x2_t e = vshlq_n_u64 (u, 52 - V_EXP_TAIL_TABLE_BITS);
   uint64x2_t i = vandq_u64 (u, d->index_mask);
diff --git a/sysdeps/aarch64/fpu/erf_advsimd.c b/sysdeps/aarch64/fpu/erf_advsimd.c
index 3e70cbc025..19cbb7d0f4 100644
--- a/sysdeps/aarch64/fpu/erf_advsimd.c
+++ b/sysdeps/aarch64/fpu/erf_advsimd.c
@@ -56,8 +56,8 @@ static inline struct entry
 lookup (uint64x2_t i)
 {
   struct entry e;
-  float64x2_t e1 = vld1q_f64 ((float64_t *) (__erf_data.tab + i[0])),
-              e2 = vld1q_f64 ((float64_t *) (__erf_data.tab + i[1]));
+  float64x2_t e1 = vld1q_f64 (&__erf_data.tab[vgetq_lane_u64 (i, 0)].erf),
+              e2 = vld1q_f64 (&__erf_data.tab[vgetq_lane_u64 (i, 1)].erf);
   e.erf = vuzp1q_f64 (e1, e2);
   e.scale = vuzp2q_f64 (e1, e2);
   return e;
diff --git a/sysdeps/aarch64/fpu/erfc_advsimd.c b/sysdeps/aarch64/fpu/erfc_advsimd.c
index 548f21a3d6..f1b3bfe830 100644
--- a/sysdeps/aarch64/fpu/erfc_advsimd.c
+++ b/sysdeps/aarch64/fpu/erfc_advsimd.c
@@ -26,7 +26,7 @@ static const struct data
   float64x2_t max, shift;
   float64x2_t p20, p40, p41, p42;
   float64x2_t p51, p52;
-  float64x2_t qr5, qr6, qr7, qr8, qr9;
+  double qr5[2], qr6[2], qr7[2], qr8[2], qr9[2];
 #if WANT_SIMD_EXCEPT
   float64x2_t uflow_bound;
 #endif
@@ -68,8 +68,10 @@ static inline struct entry
 lookup (uint64x2_t i)
 {
   struct entry e;
-  float64x2_t e1 = vld1q_f64 ((float64_t *) (__erfc_data.tab - Off + i[0])),
-              e2 = vld1q_f64 ((float64_t *) (__erfc_data.tab - Off + i[1]));
+  float64x2_t e1
+      = vld1q_f64 (&__erfc_data.tab[vgetq_lane_u64 (i, 0) - Off].erfc);
+  float64x2_t e2
+      = vld1q_f64 (&__erfc_data.tab[vgetq_lane_u64 (i, 1) - Off].erfc);
   e.erfc = vuzp1q_f64 (e1, e2);
   e.scale = vuzp2q_f64 (e1, e2);
   return e;
@@ -161,16 +163,19 @@ float64x2_t V_NAME_D1 (erfc) (float64x2_t x)
   p5 = vmulq_f64 (r, vfmaq_f64 (vmulq_f64 (v_f64 (0.5), dat->p20), r2, p5));
   /* Compute p_i using recurrence relation:
      p_{i+2} = (p_i + r * Q_{i+1} * p_{i+1}) * R_{i+1}.  */
-  float64x2_t p6 = vfmaq_f64 (p4, p5, vmulq_laneq_f64 (r, dat->qr5, 0));
-  p6 = vmulq_laneq_f64 (p6, dat->qr5, 1);
-  float64x2_t p7 = vfmaq_f64 (p5, p6, vmulq_laneq_f64 (r, dat->qr6, 0));
-  p7 = vmulq_laneq_f64 (p7, dat->qr6, 1);
-  float64x2_t p8 = vfmaq_f64 (p6, p7, vmulq_laneq_f64 (r, dat->qr7, 0));
-  p8 = vmulq_laneq_f64 (p8, dat->qr7, 1);
-  float64x2_t p9 = vfmaq_f64 (p7, p8, vmulq_laneq_f64 (r, dat->qr8, 0));
-  p9 = vmulq_laneq_f64 (p9, dat->qr8, 1);
-  float64x2_t p10 = vfmaq_f64 (p8, p9, vmulq_laneq_f64 (r, dat->qr9, 0));
-  p10 = vmulq_laneq_f64 (p10, dat->qr9, 1);
+  float64x2_t qr5 = vld1q_f64 (dat->qr5), qr6 = vld1q_f64 (dat->qr6),
+              qr7 = vld1q_f64 (dat->qr7), qr8 = vld1q_f64 (dat->qr8),
+              qr9 = vld1q_f64 (dat->qr9);
+  float64x2_t p6 = vfmaq_f64 (p4, p5, vmulq_laneq_f64 (r, qr5, 0));
+  p6 = vmulq_laneq_f64 (p6, qr5, 1);
+  float64x2_t p7 = vfmaq_f64 (p5, p6, vmulq_laneq_f64 (r, qr6, 0));
+  p7 = vmulq_laneq_f64 (p7, qr6, 1);
+  float64x2_t p8 = vfmaq_f64 (p6, p7, vmulq_laneq_f64 (r, qr7, 0));
+  p8 = vmulq_laneq_f64 (p8, qr7, 1);
+  float64x2_t p9 = vfmaq_f64 (p7, p8, vmulq_laneq_f64 (r, qr8, 0));
+  p9 = vmulq_laneq_f64 (p9, qr8, 1);
+  float64x2_t p10 = vfmaq_f64 (p8, p9, vmulq_laneq_f64 (r, qr9, 0));
+  p10 = vmulq_laneq_f64 (p10, qr9, 1);
   /* Compute polynomial in d using pairwise Horner scheme.  */
   float64x2_t p90 = vfmaq_f64 (p9, d, p10);
   float64x2_t p78 = vfmaq_f64 (p7, d, p8);
diff --git a/sysdeps/aarch64/fpu/erfcf_advsimd.c b/sysdeps/aarch64/fpu/erfcf_advsimd.c
index 30b9e48dd4..ca5bc3ab33 100644
--- a/sysdeps/aarch64/fpu/erfcf_advsimd.c
+++ b/sysdeps/aarch64/fpu/erfcf_advsimd.c
@@ -23,7 +23,8 @@ static const struct data
 {
   uint32x4_t offset, table_scale;
   float32x4_t max, shift;
-  float32x4_t coeffs, third, two_over_five, tenth;
+  float coeffs[4];
+  float32x4_t third, two_over_five, tenth;
 #if WANT_SIMD_EXCEPT
   float32x4_t uflow_bound;
 #endif
@@ -37,7 +38,7 @@ static const struct data
   .shift = V4 (0x1p17f),
   /* Store 1/3, 2/3 and 2/15 in a single register for use with indexed muls and
      fmas.  */
-  .coeffs = (float32x4_t){ 0x1.555556p-2f, 0x1.555556p-1f, 0x1.111112p-3f, 0 },
+  .coeffs = { 0x1.555556p-2f, 0x1.555556p-1f, 0x1.111112p-3f, 0 },
   .third = V4 (0x1.555556p-2f),
   .two_over_five = V4 (-0x1.99999ap-2f),
   .tenth = V4 (-0x1.99999ap-4f),
@@ -60,12 +61,16 @@ static inline struct entry
 lookup (uint32x4_t i)
 {
   struct entry e;
-  float64_t t0 = *((float64_t *) (__erfcf_data.tab - Off + i[0]));
-  float64_t t1 = *((float64_t *) (__erfcf_data.tab - Off + i[1]));
-  float64_t t2 = *((float64_t *) (__erfcf_data.tab - Off + i[2]));
-  float64_t t3 = *((float64_t *) (__erfcf_data.tab - Off + i[3]));
-  float32x4_t e1 = vreinterpretq_f32_f64 ((float64x2_t){ t0, t1 });
-  float32x4_t e2 = vreinterpretq_f32_f64 ((float64x2_t){ t2, t3 });
+  float32x2_t t0
+      = vld1_f32 (&__erfcf_data.tab[vgetq_lane_u32 (i, 0) - Off].erfc);
+  float32x2_t t1
+      = vld1_f32 (&__erfcf_data.tab[vgetq_lane_u32 (i, 1) - Off].erfc);
+  float32x2_t t2
+      = vld1_f32 (&__erfcf_data.tab[vgetq_lane_u32 (i, 2) - Off].erfc);
+  float32x2_t t3
+      = vld1_f32 (&__erfcf_data.tab[vgetq_lane_u32 (i, 3) - Off].erfc);
+  float32x4_t e1 = vcombine_f32 (t0, t1);
+  float32x4_t e2 = vcombine_f32 (t2, t3);
   e.erfc = vuzp1q_f32 (e1, e2);
   e.scale = vuzp2q_f32 (e1, e2);
   return e;
@@ -140,10 +145,11 @@ float32x4_t NOINLINE V_NAME_F1 (erfc) (float32x4_t x)
   float32x4_t r2 = vmulq_f32 (r, r);
 
   float32x4_t p1 = r;
-  float32x4_t p2 = vfmsq_laneq_f32 (dat->third, r2, dat->coeffs, 1);
+  float32x4_t coeffs = vld1q_f32 (dat->coeffs);
+  float32x4_t p2 = vfmsq_laneq_f32 (dat->third, r2, coeffs, 1);
   float32x4_t p3
-      = vmulq_f32 (r, vfmaq_laneq_f32 (v_f32 (-0.5), r2, dat->coeffs, 0));
-  float32x4_t p4 = vfmaq_laneq_f32 (dat->two_over_five, r2, dat->coeffs, 2);
+      = vmulq_f32 (r, vfmaq_laneq_f32 (v_f32 (-0.5), r2, coeffs, 0));
+  float32x4_t p4 = vfmaq_laneq_f32 (dat->two_over_five, r2, coeffs, 2);
   p4 = vfmsq_f32 (dat->tenth, r2, p4);
 
   float32x4_t y = vfmaq_f32 (p3, d, p4);
diff --git a/sysdeps/aarch64/fpu/erff_advsimd.c b/sysdeps/aarch64/fpu/erff_advsimd.c
index c44644a71c..f2fe6ff236 100644
--- a/sysdeps/aarch64/fpu/erff_advsimd.c
+++ b/sysdeps/aarch64/fpu/erff_advsimd.c
@@ -47,12 +47,12 @@ static inline struct entry
 lookup (uint32x4_t i)
 {
   struct entry e;
-  float64_t t0 = *((float64_t *) (__erff_data.tab + i[0]));
-  float64_t t1 = *((float64_t *) (__erff_data.tab + i[1]));
-  float64_t t2 = *((float64_t *) (__erff_data.tab + i[2]));
-  float64_t t3 = *((float64_t *) (__erff_data.tab + i[3]));
-  float32x4_t e1 = vreinterpretq_f32_f64 ((float64x2_t){ t0, t1 });
-  float32x4_t e2 = vreinterpretq_f32_f64 ((float64x2_t){ t2, t3 });
+  float32x2_t t0 = vld1_f32 (&__erff_data.tab[vgetq_lane_u32 (i, 0)].erf);
+  float32x2_t t1 = vld1_f32 (&__erff_data.tab[vgetq_lane_u32 (i, 1)].erf);
+  float32x2_t t2 = vld1_f32 (&__erff_data.tab[vgetq_lane_u32 (i, 2)].erf);
+  float32x2_t t3 = vld1_f32 (&__erff_data.tab[vgetq_lane_u32 (i, 3)].erf);
+  float32x4_t e1 = vcombine_f32 (t0, t1);
+  float32x4_t e2 = vcombine_f32 (t2, t3);
   e.erf = vuzp1q_f32 (e1, e2);
   e.scale = vuzp2q_f32 (e1, e2);
   return e;
diff --git a/sysdeps/aarch64/fpu/exp10f_advsimd.c b/sysdeps/aarch64/fpu/exp10f_advsimd.c
index ab117b69da..cf53e73290 100644
--- a/sysdeps/aarch64/fpu/exp10f_advsimd.c
+++ b/sysdeps/aarch64/fpu/exp10f_advsimd.c
@@ -25,7 +25,8 @@
 static const struct data
 {
   float32x4_t poly[5];
-  float32x4_t log10_2_and_inv, shift;
+  float log10_2_and_inv[4];
+  float32x4_t shift;
 
 #if !WANT_SIMD_EXCEPT
   float32x4_t scale_thresh;
@@ -111,10 +112,11 @@ float32x4_t VPCS_ATTR NOINLINE V_NAME_F1 (exp10) (float32x4_t x)
   /* exp10(x) = 2^n * 10^r = 2^n * (1 + poly (r)),
      with poly(r) in [1/sqrt(2), sqrt(2)] and
      x = r + n * log10 (2), with r in [-log10(2)/2, log10(2)/2].  */
-  float32x4_t z = vfmaq_laneq_f32 (d->shift, x, d->log10_2_and_inv, 0);
+  float32x4_t log10_2_and_inv = vld1q_f32 (d->log10_2_and_inv);
+  float32x4_t z = vfmaq_laneq_f32 (d->shift, x, log10_2_and_inv, 0);
   float32x4_t n = vsubq_f32 (z, d->shift);
-  float32x4_t r = vfmsq_laneq_f32 (x, n, d->log10_2_and_inv, 1);
-  r = vfmsq_laneq_f32 (r, n, d->log10_2_and_inv, 2);
+  float32x4_t r = vfmsq_laneq_f32 (x, n, log10_2_and_inv, 1);
+  r = vfmsq_laneq_f32 (r, n, log10_2_and_inv, 2);
   uint32x4_t e = vshlq_n_u32 (vreinterpretq_u32_f32 (z), 23);
   float32x4_t scale = vreinterpretq_f32_u32 (vaddq_u32 (e, ExponentBias));
 
diff --git a/sysdeps/aarch64/fpu/expm1_advsimd.c b/sysdeps/aarch64/fpu/expm1_advsimd.c
index 3628398674..3db3b80c49 100644
--- a/sysdeps/aarch64/fpu/expm1_advsimd.c
+++ b/sysdeps/aarch64/fpu/expm1_advsimd.c
@@ -23,7 +23,9 @@
 static const struct data
 {
   float64x2_t poly[11];
-  float64x2_t invln2, ln2, shift;
+  float64x2_t invln2;
+  double ln2[2];
+  float64x2_t shift;
   int64x2_t exponent_bias;
 #if WANT_SIMD_EXCEPT
   uint64x2_t thresh, tiny_bound;
@@ -92,8 +94,9 @@ float64x2_t VPCS_ATTR V_NAME_D1 (expm1) (float64x2_t x)
      where 2^i is exact because i is an integer.  */
   float64x2_t n = vsubq_f64 (vfmaq_f64 (d->shift, d->invln2, x), d->shift);
   int64x2_t i = vcvtq_s64_f64 (n);
-  float64x2_t f = vfmsq_laneq_f64 (x, n, d->ln2, 0);
-  f = vfmsq_laneq_f64 (f, n, d->ln2, 1);
+  float64x2_t ln2 = vld1q_f64 (&d->ln2[0]);
+  float64x2_t f = vfmsq_laneq_f64 (x, n, ln2, 0);
+  f = vfmsq_laneq_f64 (f, n, ln2, 1);
 
   /* Approximate expm1(f) using polynomial.
      Taylor expansion for expm1(x) has the form:
diff --git a/sysdeps/aarch64/fpu/expm1f_advsimd.c b/sysdeps/aarch64/fpu/expm1f_advsimd.c
index 93db200f61..a0616ec754 100644
--- a/sysdeps/aarch64/fpu/expm1f_advsimd.c
+++ b/sysdeps/aarch64/fpu/expm1f_advsimd.c
@@ -23,7 +23,7 @@
 static const struct data
 {
   float32x4_t poly[5];
-  float32x4_t invln2_and_ln2;
+  float invln2_and_ln2[4];
   float32x4_t shift;
   int32x4_t exponent_bias;
 #if WANT_SIMD_EXCEPT
@@ -88,11 +88,12 @@ float32x4_t VPCS_ATTR NOINLINE V_NAME_F1 (expm1) (float32x4_t x)
      and f = x - i * ln2, then f is in [-ln2/2, ln2/2].
      exp(x) - 1 = 2^i * (expm1(f) + 1) - 1
      where 2^i is exact because i is an integer.  */
-  float32x4_t j = vsubq_f32 (
-      vfmaq_laneq_f32 (d->shift, x, d->invln2_and_ln2, 0), d->shift);
+  float32x4_t invln2_and_ln2 = vld1q_f32 (d->invln2_and_ln2);
+  float32x4_t j
+      = vsubq_f32 (vfmaq_laneq_f32 (d->shift, x, invln2_and_ln2, 0), d->shift);
   int32x4_t i = vcvtq_s32_f32 (j);
-  float32x4_t f = vfmsq_laneq_f32 (x, j, d->invln2_and_ln2, 1);
-  f = vfmsq_laneq_f32 (f, j, d->invln2_and_ln2, 2);
+  float32x4_t f = vfmsq_laneq_f32 (x, j, invln2_and_ln2, 1);
+  f = vfmsq_laneq_f32 (f, j, invln2_and_ln2, 2);
 
   /* Approximate expm1(f) using polynomial.
      Taylor expansion for expm1(x) has the form:
diff --git a/sysdeps/aarch64/fpu/log10_advsimd.c b/sysdeps/aarch64/fpu/log10_advsimd.c
index 1e5ef99e89..c065aaebae 100644
--- a/sysdeps/aarch64/fpu/log10_advsimd.c
+++ b/sysdeps/aarch64/fpu/log10_advsimd.c
@@ -58,8 +58,10 @@ static inline struct entry
 lookup (uint64x2_t i)
 {
   struct entry e;
-  uint64_t i0 = (i[0] >> (52 - V_LOG10_TABLE_BITS)) & IndexMask;
-  uint64_t i1 = (i[1] >> (52 - V_LOG10_TABLE_BITS)) & IndexMask;
+  uint64_t i0
+      = (vgetq_lane_u64 (i, 0) >> (52 - V_LOG10_TABLE_BITS)) & IndexMask;
+  uint64_t i1
+      = (vgetq_lane_u64 (i, 1) >> (52 - V_LOG10_TABLE_BITS)) & IndexMask;
   float64x2_t e0 = vld1q_f64 (&__v_log10_data.table[i0].invc);
   float64x2_t e1 = vld1q_f64 (&__v_log10_data.table[i1].invc);
   e.invc = vuzp1q_f64 (e0, e1);
diff --git a/sysdeps/aarch64/fpu/log2_advsimd.c b/sysdeps/aarch64/fpu/log2_advsimd.c
index a34978f6cf..4057c552d8 100644
--- a/sysdeps/aarch64/fpu/log2_advsimd.c
+++ b/sysdeps/aarch64/fpu/log2_advsimd.c
@@ -55,8 +55,10 @@ static inline struct entry
 lookup (uint64x2_t i)
 {
   struct entry e;
-  uint64_t i0 = (i[0] >> (52 - V_LOG2_TABLE_BITS)) & IndexMask;
-  uint64_t i1 = (i[1] >> (52 - V_LOG2_TABLE_BITS)) & IndexMask;
+  uint64_t i0
+      = (vgetq_lane_u64 (i, 0) >> (52 - V_LOG2_TABLE_BITS)) & IndexMask;
+  uint64_t i1
+      = (vgetq_lane_u64 (i, 1) >> (52 - V_LOG2_TABLE_BITS)) & IndexMask;
   float64x2_t e0 = vld1q_f64 (&__v_log2_data.table[i0].invc);
   float64x2_t e1 = vld1q_f64 (&__v_log2_data.table[i1].invc);
   e.invc = vuzp1q_f64 (e0, e1);
diff --git a/sysdeps/aarch64/fpu/log_advsimd.c b/sysdeps/aarch64/fpu/log_advsimd.c
index 21df61728c..015a6da7d7 100644
--- a/sysdeps/aarch64/fpu/log_advsimd.c
+++ b/sysdeps/aarch64/fpu/log_advsimd.c
@@ -54,17 +54,12 @@ lookup (uint64x2_t i)
 {
   /* Since N is a power of 2, n % N = n & (N - 1).  */
   struct entry e;
-  uint64_t i0 = (i[0] >> (52 - V_LOG_TABLE_BITS)) & IndexMask;
-  uint64_t i1 = (i[1] >> (52 - V_LOG_TABLE_BITS)) & IndexMask;
+  uint64_t i0 = (vgetq_lane_u64 (i, 0) >> (52 - V_LOG_TABLE_BITS)) & IndexMask;
+  uint64_t i1 = (vgetq_lane_u64 (i, 1) >> (52 - V_LOG_TABLE_BITS)) & IndexMask;
   float64x2_t e0 = vld1q_f64 (&__v_log_data.table[i0].invc);
   float64x2_t e1 = vld1q_f64 (&__v_log_data.table[i1].invc);
-#if __BYTE_ORDER == __LITTLE_ENDIAN
   e.invc = vuzp1q_f64 (e0, e1);
   e.logc = vuzp2q_f64 (e0, e1);
-#else
-  e.invc = vuzp1q_f64 (e1, e0);
-  e.logc = vuzp2q_f64 (e1, e0);
-#endif
   return e;
 }
 
diff --git a/sysdeps/aarch64/fpu/sinh_advsimd.c b/sysdeps/aarch64/fpu/sinh_advsimd.c
index fa3723b10c..3e3b76c502 100644
--- a/sysdeps/aarch64/fpu/sinh_advsimd.c
+++ b/sysdeps/aarch64/fpu/sinh_advsimd.c
@@ -22,8 +22,9 @@
 
 static const struct data
 {
-  float64x2_t poly[11];
-  float64x2_t inv_ln2, m_ln2, shift;
+  float64x2_t poly[11], inv_ln2;
+  double m_ln2[2];
+  float64x2_t shift;
   uint64x2_t halff;
   int64x2_t onef;
 #if WANT_SIMD_EXCEPT
@@ -40,7 +41,7 @@ static const struct data
             V2 (0x1.af5eedae67435p-26), V2 (0x1.1f143d060a28ap-29), },
 
   .inv_ln2 = V2 (0x1.71547652b82fep0),
-  .m_ln2 = (float64x2_t) {-0x1.62e42fefa39efp-1, -0x1.abc9e3b39803fp-56},
+  .m_ln2 = {-0x1.62e42fefa39efp-1, -0x1.abc9e3b39803fp-56},
   .shift = V2 (0x1.8p52),
 
   .halff = V2 (0x3fe0000000000000),
@@ -67,8 +68,10 @@ expm1_inline (float64x2_t x)
      and f = x - i * ln2 (f in [-ln2/2, ln2/2]).  */
   float64x2_t j = vsubq_f64 (vfmaq_f64 (d->shift, d->inv_ln2, x), d->shift);
   int64x2_t i = vcvtq_s64_f64 (j);
-  float64x2_t f = vfmaq_laneq_f64 (x, j, d->m_ln2, 0);
-  f = vfmaq_laneq_f64 (f, j, d->m_ln2, 1);
+
+  float64x2_t m_ln2 = vld1q_f64 (d->m_ln2);
+  float64x2_t f = vfmaq_laneq_f64 (x, j, m_ln2, 0);
+  f = vfmaq_laneq_f64 (f, j, m_ln2, 1);
   /* Approximate expm1(f) using polynomial.  */
   float64x2_t f2 = vmulq_f64 (f, f);
   float64x2_t f4 = vmulq_f64 (f2, f2);
diff --git a/sysdeps/aarch64/fpu/tan_advsimd.c b/sysdeps/aarch64/fpu/tan_advsimd.c
index 0459821ab2..d56a102dd1 100644
--- a/sysdeps/aarch64/fpu/tan_advsimd.c
+++ b/sysdeps/aarch64/fpu/tan_advsimd.c
@@ -23,7 +23,8 @@
 static const struct data
 {
   float64x2_t poly[9];
-  float64x2_t half_pi, two_over_pi, shift;
+  double half_pi[2];
+  float64x2_t two_over_pi, shift;
 #if !WANT_SIMD_EXCEPT
   float64x2_t range_val;
 #endif
@@ -81,8 +82,9 @@ float64x2_t VPCS_ATTR V_NAME_D1 (tan) (float64x2_t x)
   /* Use q to reduce x to r in [-pi/4, pi/4], by:
      r = x - q * pi/2, in extended precision.  */
   float64x2_t r = x;
-  r = vfmsq_laneq_f64 (r, q, dat->half_pi, 0);
-  r = vfmsq_laneq_f64 (r, q, dat->half_pi, 1);
+  float64x2_t half_pi = vld1q_f64 (dat->half_pi);
+  r = vfmsq_laneq_f64 (r, q, half_pi, 0);
+  r = vfmsq_laneq_f64 (r, q, half_pi, 1);
   /* Further reduce r to [-pi/8, pi/8], to be reconstructed using double angle
      formula.  */
   r = vmulq_n_f64 (r, 0.5);
diff --git a/sysdeps/aarch64/fpu/tanf_advsimd.c b/sysdeps/aarch64/fpu/tanf_advsimd.c
index 5a7489390a..705586f0c0 100644
--- a/sysdeps/aarch64/fpu/tanf_advsimd.c
+++ b/sysdeps/aarch64/fpu/tanf_advsimd.c
@@ -23,7 +23,7 @@
 static const struct data
 {
   float32x4_t poly[6];
-  float32x4_t pi_consts;
+  float pi_consts[4];
   float32x4_t shift;
 #if !WANT_SIMD_EXCEPT
   float32x4_t range_val;
@@ -95,16 +95,17 @@ float32x4_t VPCS_ATTR NOINLINE V_NAME_F1 (tan) (float32x4_t x)
 #endif
 
   /* n = rint(x/(pi/2)).  */
-  float32x4_t q = vfmaq_laneq_f32 (d->shift, x, d->pi_consts, 3);
+  float32x4_t pi_consts = vld1q_f32 (d->pi_consts);
+  float32x4_t q = vfmaq_laneq_f32 (d->shift, x, pi_consts, 3);
   float32x4_t n = vsubq_f32 (q, d->shift);
   /* Determine if x lives in an interval, where |tan(x)| grows to infinity.  */
   uint32x4_t pred_alt = vtstq_u32 (vreinterpretq_u32_f32 (q), v_u32 (1));
 
   /* r = x - n * (pi/2) (range reduction into -pi./4 .. pi/4).  */
   float32x4_t r;
-  r = vfmaq_laneq_f32 (x, n, d->pi_consts, 0);
-  r = vfmaq_laneq_f32 (r, n, d->pi_consts, 1);
-  r = vfmaq_laneq_f32 (r, n, d->pi_consts, 2);
+  r = vfmaq_laneq_f32 (x, n, pi_consts, 0);
+  r = vfmaq_laneq_f32 (r, n, pi_consts, 1);
+  r = vfmaq_laneq_f32 (r, n, pi_consts, 2);
 
   /* If x lives in an interval, where |tan(x)|
      - is finite, then use a polynomial approximation of the form
diff --git a/sysdeps/aarch64/fpu/v_expf_inline.h b/sysdeps/aarch64/fpu/v_expf_inline.h
index a3b0e32f9e..08b06e0a6b 100644
--- a/sysdeps/aarch64/fpu/v_expf_inline.h
+++ b/sysdeps/aarch64/fpu/v_expf_inline.h
@@ -25,7 +25,8 @@
 struct v_expf_data
 {
   float32x4_t poly[5];
-  float32x4_t shift, invln2_and_ln2;
+  float32x4_t shift;
+  float invln2_and_ln2[4];
 };
 
 /* maxerr: 1.45358 +0.5 ulp.  */
@@ -50,10 +51,11 @@ v_expf_inline (float32x4_t x, const struct v_expf_data *d)
   /* exp(x) = 2^n (1 + poly(r)), with 1 + poly(r) in [1/sqrt(2),sqrt(2)]
      x = ln2*n + r, with r in [-ln2/2, ln2/2].  */
   float32x4_t n, r, z;
-  z = vfmaq_laneq_f32 (d->shift, x, d->invln2_and_ln2, 0);
+  float32x4_t invln2_and_ln2 = vld1q_f32 (d->invln2_and_ln2);
+  z = vfmaq_laneq_f32 (d->shift, x, invln2_and_ln2, 0);
   n = vsubq_f32 (z, d->shift);
-  r = vfmsq_laneq_f32 (x, n, d->invln2_and_ln2, 1);
-  r = vfmsq_laneq_f32 (r, n, d->invln2_and_ln2, 2);
+  r = vfmsq_laneq_f32 (x, n, invln2_and_ln2, 1);
+  r = vfmsq_laneq_f32 (r, n, invln2_and_ln2, 2);
   uint32x4_t e = vshlq_n_u32 (vreinterpretq_u32_f32 (z), 23);
   float32x4_t scale = vreinterpretq_f32_u32 (vaddq_u32 (e, ExponentBias));
 
diff --git a/sysdeps/aarch64/fpu/v_expm1f_inline.h b/sysdeps/aarch64/fpu/v_expm1f_inline.h
index 337ccfbfab..59b552da6b 100644
--- a/sysdeps/aarch64/fpu/v_expm1f_inline.h
+++ b/sysdeps/aarch64/fpu/v_expm1f_inline.h
@@ -26,7 +26,8 @@
 struct v_expm1f_data
 {
   float32x4_t poly[5];
-  float32x4_t invln2_and_ln2, shift;
+  float invln2_and_ln2[4];
+  float32x4_t shift;
   int32x4_t exponent_bias;
 };
 
@@ -49,11 +50,12 @@ expm1f_inline (float32x4_t x, const struct v_expm1f_data *d)
      calling routine should handle special values if required.  */
 
   /* Reduce argument: f in [-ln2/2, ln2/2], i is exact.  */
-  float32x4_t j = vsubq_f32 (
-      vfmaq_laneq_f32 (d->shift, x, d->invln2_and_ln2, 0), d->shift);
+  float32x4_t invln2_and_ln2 = vld1q_f32 (d->invln2_and_ln2);
+  float32x4_t j
+      = vsubq_f32 (vfmaq_laneq_f32 (d->shift, x, invln2_and_ln2, 0), d->shift);
   int32x4_t i = vcvtq_s32_f32 (j);
-  float32x4_t f = vfmsq_laneq_f32 (x, j, d->invln2_and_ln2, 1);
-  f = vfmsq_laneq_f32 (f, j, d->invln2_and_ln2, 2);
+  float32x4_t f = vfmsq_laneq_f32 (x, j, invln2_and_ln2, 1);
+  f = vfmsq_laneq_f32 (f, j, invln2_and_ln2, 2);
 
   /* Approximate expm1(f) with polynomial P, expm1(f) ~= f + f^2 * P(f).
      Uses Estrin scheme, where the main _ZGVnN4v_expm1f routine uses
-- 
2.27.0