From: Wilco Dijkstra
To: GCC Patches
CC: Richard Sandiford, Richard Earnshaw
Subject: [PATCH v2] AArch64: Add inline memmove expansion
Date: Mon, 16 Oct 2023 12:27:05 +0000
v2: further cleanups, improved comments

Add support for inline memmove expansion.  The generated code is identical
to that for memcpy, except that all loads are emitted before stores rather
than being interleaved.
The maximum size is 256 bytes, which requires at most 16 registers.

Passes regress/bootstrap, OK for commit?

gcc/ChangeLog:
	* config/aarch64/aarch64.opt (aarch64_mops_memmove_size_threshold):
	Change default.
	* config/aarch64/aarch64.md (cpymemdi): Add a parameter.
	(movmemdi): Call aarch64_expand_cpymem.
	* config/aarch64/aarch64.cc (aarch64_copy_one_block): Rename function,
	simplify, support storing generated loads/stores.
	(aarch64_expand_cpymem): Support expansion of memmove.
	* config/aarch64/aarch64-protos.h (aarch64_expand_cpymem): Add bool arg.

gcc/testsuite/ChangeLog:
	* gcc.target/aarch64/memmove.c: Add new test.

---

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 60a55f4bc1956786ea687fc7cad7ec9e4a84e1f0..0d39622bd2826a3fde54d67b5c5da9ee9286cbbd 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -769,7 +769,7 @@ bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
 tree aarch64_vector_load_decl (tree);
 void aarch64_expand_call (rtx, rtx, rtx, bool);
 bool aarch64_expand_cpymem_mops (rtx *, bool);
-bool aarch64_expand_cpymem (rtx *);
+bool aarch64_expand_cpymem (rtx *, bool);
 bool aarch64_expand_setmem (rtx *);
 bool aarch64_float_const_zero_rtx_p (rtx);
 bool aarch64_float_const_rtx_p (rtx);
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 2fa5d09de85d385c1165e399bcc97681ef170916..e19e2d1de2e5b30eca672df05d9dcc1bc106ecc8 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -25238,52 +25238,37 @@ aarch64_progress_pointer (rtx pointer)
   return aarch64_move_pointer (pointer, GET_MODE_SIZE (GET_MODE (pointer)));
 }
 
-/* Copy one MODE sized block from SRC to DST, then progress SRC and DST by
-   MODE bytes.  */
+/* Copy one block of size MODE from SRC to DST at offset OFFSET.  */
 
 static void
-aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
-					      machine_mode mode)
+aarch64_copy_one_block (rtx *load, rtx *store, rtx src, rtx dst,
+			int offset, machine_mode mode)
 {
-  /* Handle 256-bit memcpy separately.  We do this by making 2 adjacent memory
-     address copies using V4SImode so that we can use Q registers.  */
-  if (known_eq (GET_MODE_BITSIZE (mode), 256))
+  /* Emit explicit load/store pair instructions for 32-byte copies.  */
+  if (known_eq (GET_MODE_SIZE (mode), 32))
     {
       mode = V4SImode;
+      rtx src1 = adjust_address (src, mode, offset);
+      rtx src2 = adjust_address (src, mode, offset + 16);
+      rtx dst1 = adjust_address (dst, mode, offset);
+      rtx dst2 = adjust_address (dst, mode, offset + 16);
       rtx reg1 = gen_reg_rtx (mode);
       rtx reg2 = gen_reg_rtx (mode);
-      /* "Cast" the pointers to the correct mode.  */
-      *src = adjust_address (*src, mode, 0);
-      *dst = adjust_address (*dst, mode, 0);
-      /* Emit the memcpy.  */
-      emit_insn (aarch64_gen_load_pair (mode, reg1, *src, reg2,
-					aarch64_progress_pointer (*src)));
-      emit_insn (aarch64_gen_store_pair (mode, *dst, reg1,
-					 aarch64_progress_pointer (*dst), reg2));
-      /* Move the pointers forward.  */
-      *src = aarch64_move_pointer (*src, 32);
-      *dst = aarch64_move_pointer (*dst, 32);
+      *load = aarch64_gen_load_pair (mode, reg1, src1, reg2, src2);
+      *store = aarch64_gen_store_pair (mode, dst1, reg1, dst2, reg2);
       return;
     }
 
   rtx reg = gen_reg_rtx (mode);
-
-  /* "Cast" the pointers to the correct mode.  */
-  *src = adjust_address (*src, mode, 0);
-  *dst = adjust_address (*dst, mode, 0);
-  /* Emit the memcpy.  */
-  emit_move_insn (reg, *src);
-  emit_move_insn (*dst, reg);
-  /* Move the pointers forward.  */
-  *src = aarch64_progress_pointer (*src);
-  *dst = aarch64_progress_pointer (*dst);
+  *load = gen_move_insn (reg, adjust_address (src, mode, offset));
+  *store = gen_move_insn (adjust_address (dst, mode, offset), reg);
 }
 
 /* Expand a cpymem/movmem using the MOPS extension.  OPERANDS are taken
    from the cpymem/movmem pattern.  IS_MEMMOVE is true if this is a memmove
    rather than memcpy.  Return true iff we succeeded.  */
 bool
-aarch64_expand_cpymem_mops (rtx *operands, bool is_memmove = false)
+aarch64_expand_cpymem_mops (rtx *operands, bool is_memmove)
 {
   if (!TARGET_MOPS)
     return false;
@@ -25302,51 +25287,48 @@ aarch64_expand_cpymem_mops (rtx *operands, bool is_memmove = false)
   return true;
 }
 
-/* Expand cpymem, as if from a __builtin_memcpy.  Return true if
-   we succeed, otherwise return false, indicating that a libcall to
-   memcpy should be emitted.  */
-
+/* Expand cpymem/movmem, as if from a __builtin_memcpy/memmove.
+   OPERANDS are taken from the cpymem/movmem pattern.  IS_MEMMOVE is true
+   if this is a memmove rather than memcpy.  Return true if we succeed,
+   otherwise return false, indicating that a libcall should be emitted.  */
 bool
-aarch64_expand_cpymem (rtx *operands)
+aarch64_expand_cpymem (rtx *operands, bool is_memmove)
 {
-  int mode_bits;
+  int mode_bytes;
   rtx dst = operands[0];
   rtx src = operands[1];
   unsigned align = UINTVAL (operands[3]);
   rtx base;
-  machine_mode cur_mode = BLKmode;
-  bool size_p = optimize_function_for_size_p (cfun);
+  machine_mode cur_mode = BLKmode, next_mode;
 
   /* Variable-sized or strict-align copies may use the MOPS expansion.  */
   if (!CONST_INT_P (operands[2]) || (STRICT_ALIGNMENT && align < 16))
-    return aarch64_expand_cpymem_mops (operands);
+    return aarch64_expand_cpymem_mops (operands, is_memmove);
 
   unsigned HOST_WIDE_INT size = UINTVAL (operands[2]);
 
-  /* Try to inline up to 256 bytes.  */
-  unsigned max_copy_size = 256;
-  unsigned mops_threshold = aarch64_mops_memcpy_size_threshold;
+  /* Set inline limits for memmove/memcpy.  MOPS has a separate threshold.  */
+  unsigned max_copy_size = TARGET_SIMD ? 256 : 128;
+  unsigned mops_threshold = is_memmove ? aarch64_mops_memmove_size_threshold
+				       : aarch64_mops_memcpy_size_threshold;
+
+  /* Reduce the maximum size with -Os.  */
+  if (optimize_function_for_size_p (cfun))
+    max_copy_size /= 4;
 
   /* Large copies use MOPS when available or a library call.  */
   if (size > max_copy_size || (TARGET_MOPS && size > mops_threshold))
-    return aarch64_expand_cpymem_mops (operands);
+    return aarch64_expand_cpymem_mops (operands, is_memmove);
 
-  int copy_bits = 256;
+  unsigned copy_max = 32;
 
-  /* Default to 256-bit LDP/STP on large copies, however small copies, no SIMD
-     support or slow 256-bit LDP/STP fall back to 128-bit chunks.  */
+  /* Default to 32-byte LDP/STP on large copies, however small copies, no SIMD
+     support or slow LDP/STP fall back to 16-byte chunks.  */
   if (size <= 24
       || !TARGET_SIMD
       || (aarch64_tune_params.extra_tuning_flags
	  & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
-    copy_bits = 128;
-
-  /* Emit an inline load+store sequence and count the number of operations
-     involved.  We use a simple count of just the loads and stores emitted
-     rather than rtx_insn count as all the pointer adjustments and reg copying
-     in this function will get optimized away later in the pipeline.  */
-  start_sequence ();
-  unsigned nops = 0;
+    copy_max = 16;
 
   base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
   dst = adjust_automodify_address (dst, VOIDmode, base, 0);
@@ -25354,69 +25336,60 @@ aarch64_expand_cpymem (rtx *operands)
   base = copy_to_mode_reg (Pmode, XEXP (src, 0));
   src = adjust_automodify_address (src, VOIDmode, base, 0);
 
-  /* Convert size to bits to make the rest of the code simpler.  */
-  int n = size * BITS_PER_UNIT;
+  const int max_ops = 40;
+  rtx load[max_ops], store[max_ops];
 
-  while (n > 0)
+  int nops, offset;
+
+  for (nops = 0, offset = 0; size > 0; nops++)
     {
       /* Find the largest mode in which to do the copy in without over reading
	 or writing.  */
       opt_scalar_int_mode mode_iter;
       FOR_EACH_MODE_IN_CLASS (mode_iter, MODE_INT)
-	if (GET_MODE_BITSIZE (mode_iter.require ()) <= MIN (n, copy_bits))
+	if (GET_MODE_SIZE (mode_iter.require ()) <= MIN (size, copy_max))
	  cur_mode = mode_iter.require ();
 
-      gcc_assert (cur_mode != BLKmode);
+      gcc_assert (cur_mode != BLKmode && nops < max_ops);
 
-      mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant ();
+      mode_bytes = GET_MODE_SIZE (cur_mode).to_constant ();
 
       /* Prefer Q-register accesses for the last bytes.  */
-      if (mode_bits == 128 && copy_bits == 256)
+      if (mode_bytes == 16 && copy_max == 32)
	cur_mode = V4SImode;
 
-      aarch64_copy_one_block_and_progress_pointers (&src, &dst, cur_mode);
-      /* A single block copy is 1 load + 1 store.  */
-      nops += 2;
-      n -= mode_bits;
+      aarch64_copy_one_block (&load[nops], &store[nops], src, dst, offset, cur_mode);
+      size -= mode_bytes;
+      offset += mode_bytes;
 
       /* Emit trailing copies using overlapping unaligned accesses
-	 (when !STRICT_ALIGNMENT) - this is smaller and faster.  */
-      if (n > 0 && n < copy_bits / 2 && !STRICT_ALIGNMENT)
+	 (when !STRICT_ALIGNMENT) - this is smaller and faster.  */
+      if (size > 0 && size < copy_max / 2 && !STRICT_ALIGNMENT)
	{
-	  machine_mode next_mode = smallest_mode_for_size (n, MODE_INT);
-	  int n_bits = GET_MODE_BITSIZE (next_mode).to_constant ();
-	  gcc_assert (n_bits <= mode_bits);
-	  src = aarch64_move_pointer (src, (n - n_bits) / BITS_PER_UNIT);
-	  dst = aarch64_move_pointer (dst, (n - n_bits) / BITS_PER_UNIT);
-	  n = n_bits;
+	  next_mode = smallest_mode_for_size (size * BITS_PER_UNIT, MODE_INT);
+	  int n_bytes = GET_MODE_SIZE (next_mode).to_constant ();
+	  gcc_assert (n_bytes <= mode_bytes);
+	  offset -= n_bytes - size;
+	  size = n_bytes;
	}
     }
-  rtx_insn *seq = get_insns ();
-  end_sequence ();
-  /* MOPS sequence requires 3 instructions for the memory copying + 1 to move
-     the constant size into a register.  */
-  unsigned mops_cost = 3 + 1;
-
-  /* If MOPS is available at this point we don't consider the libcall as it's
-     not a win even on code size.  At this point only consider MOPS if
-     optimizing for size.  For speed optimizations we will have chosen between
-     the two based on copy size already.  */
-  if (TARGET_MOPS)
-    {
-      if (size_p && mops_cost < nops)
-	return aarch64_expand_cpymem_mops (operands);
-      emit_insn (seq);
-      return true;
-    }
 
-  /* A memcpy libcall in the worst case takes 3 instructions to prepare the
-     arguments + 1 for the call.  When MOPS is not available and we're
-     optimizing for size a libcall may be preferable.  */
-  unsigned libcall_cost = 4;
-  if (size_p && libcall_cost < nops)
-    return false;
+  /* Memcpy interleaves loads with stores, memmove emits all loads first.  */
+  int i, j, m, inc;
+  inc = is_memmove ? nops : 3;
+  if (nops == inc + 1)
+    inc = nops / 2;
+  for (i = 0; i < nops; i += inc)
+    {
+      m = inc;
+      if (i + m > nops)
+	m = nops - i;
 
-  emit_insn (seq);
+      for (j = 0; j < m; j++)
+	emit_insn (load[i + j]);
+      for (j = 0; j < m; j++)
+	emit_insn (store[i + j]);
+    }
   return true;
 }
 
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 1cb3a01d6791a48dc0b08df5783d97805448c7f2..18dd629c2456041b1185eae6d39de074709b2a39 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -1629,7 +1629,7 @@ (define_expand "cpymemdi"
    (match_operand:DI 3 "immediate_operand")]
   ""
 {
-  if (aarch64_expand_cpymem (operands))
+  if (aarch64_expand_cpymem (operands, false))
     DONE;
   FAIL;
 }
@@ -1673,17 +1673,9 @@ (define_expand "movmemdi"
    (match_operand:BLK 1 "memory_operand")
    (match_operand:DI 2 "general_operand")
    (match_operand:DI 3 "immediate_operand")]
-  "TARGET_MOPS"
+  ""
 {
-  rtx sz_reg = operands[2];
-  /* For constant-sized memmoves check the threshold.
-     FIXME: We should add a non-MOPS memmove expansion for smaller,
-     constant-sized memmove to avoid going to a libcall.  */
-  if (CONST_INT_P (sz_reg)
-      && INTVAL (sz_reg) < aarch64_mops_memmove_size_threshold)
-    FAIL;
-
-  if (aarch64_expand_cpymem_mops (operands, true))
+  if (aarch64_expand_cpymem (operands, true))
     DONE;
   FAIL;
 }
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index f5a518202a157b5b5bc2b2aa14ac1177fded7d66..0ac9d8c578d706e7bf0f0ae399d84544f0c619dc 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -327,7 +327,7 @@ Target Joined UInteger Var(aarch64_mops_memcpy_size_threshold) Init(256) Param
 Constant memcpy size in bytes above which to start using MOPS sequence.
 
 -param=aarch64-mops-memmove-size-threshold=
-Target Joined UInteger Var(aarch64_mops_memmove_size_threshold) Init(0) Param
+Target Joined UInteger Var(aarch64_mops_memmove_size_threshold) Init(256) Param
 Constant memmove size in bytes above which to start using MOPS sequence.
 
 -param=aarch64-mops-memset-size-threshold=
diff --git a/gcc/testsuite/gcc.target/aarch64/memmove.c b/gcc/testsuite/gcc.target/aarch64/memmove.c
new file mode 100644
index 0000000000000000000000000000000000000000..6926a97761eb2578d3f1db7e6eb19dba17b888be
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/memmove.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+void
+copy1 (int *x, int *y)
+{
+  __builtin_memmove (x, y, 12);
+}
+
+void
+copy2 (int *x, int *y)
+{
+  __builtin_memmove (x, y, 128);
+}
+
+void
+copy3 (int *x, int *y)
+{
+  __builtin_memmove (x, y, 255);
+}
+
+/* { dg-final { scan-assembler-not {\tb\tmemmove} } } */