public inbox for gcc-patches@gcc.gnu.org
* [PATCH v3] aarch64: Fine-grained policies to control ldp-stp formation.
From: Manos Anagnostakis @ 2023-09-25 19:49 UTC
  To: gcc-patches; +Cc: Kyrylo Tkachov, Philipp Tomsich, Manos Anagnostakis

This patch implements the following TODO in gcc/config/aarch64/aarch64.cc
to provide the requested behaviour for handling ldp and stp:

  /* Allow the tuning structure to disable LDP instruction formation
     from combining instructions (e.g., in peephole2).
     TODO: Implement fine-grained tuning control for LDP and STP:
           1. control policies for load and store separately;
           2. support the following policies:
              - default (use what is in the tuning structure)
              - always
              - never
              - aligned (only if the compiler can prove that the
                load will be aligned to 2 * element_size)  */

It provides two new target-specific command-line parameters,
--param=aarch64-ldp-policy= and --param=aarch64-stp-policy=,
which control the load and store policies separately, as
requested in part 1 of the TODO.
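
As an illustration of the kind of code these parameters affect, here is a
minimal sketch that is not part of the patch.  The function names are
hypothetical, and whether an ldp is actually formed depends on the
optimization level, the tuning, and the alignment the compiler can prove.
__builtin_assume_aligned is used only as one way to hand alignment
information to GCC; that this is sufficient for the "aligned" policy is an
assumption, not something verified here.

```c
#include <stdint.h>

/* Illustration only; compile for AArch64 with, e.g.:
     gcc -O2 --param=aarch64-ldp-policy=3 -mcpu=generic -S example.c
   and inspect the assembly for ldp versus two ldr instructions.  */

/* Two adjacent 64-bit loads: a candidate for a single ldp.  Nothing is
   known about the alignment of A beyond that of int64_t.  */
int64_t
sum_pair (const int64_t *a)
{
  return a[0] + a[1];
}

/* The same loads, but the caller promises 16-byte alignment, i.e.
   2 * element_size for int64_t (assumed to help the "aligned" policy).  */
int64_t
sum_pair_aligned (const int64_t *a)
{
  const int64_t *p = __builtin_assume_aligned (a, 16);
  return p[0] + p[1];
}
```

Both functions return the same value; only the code generated for the two
loads may differ between policies.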

The accepted values for both parameters are:
- 0: Use the policy of the tuning structure (default).
- 1: Emit ldp/stp regardless of alignment.
- 2: Do not emit ldp/stp.
- 3: Emit ldp/stp only if the load/store can be shown to be aligned
  to 2 * element_size.
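
The decision these values drive can be sketched as follows.  This is a
simplified, standalone model and not the code in the patch: the enum and
both function names are hypothetical stand-ins, and alignments are passed
as plain bit counts instead of being read from RTL via MEM_ALIGN and
GET_MODE_ALIGNMENT.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-tuning LDP/STP policy enums.  */
enum pair_policy { POLICY_ALIGNED, POLICY_ALWAYS, POLICY_NEVER };

/* Map a parameter value (0..3) onto a policy.  Value 0 keeps whatever
   the tuning structure already specifies.  */
enum pair_policy
resolve_policy (unsigned value, enum pair_policy tuning_default)
{
  switch (value)
    {
    case 1: return POLICY_ALWAYS;
    case 2: return POLICY_NEVER;
    case 3: return POLICY_ALIGNED;
    default: return tuning_default;
    }
}

/* Return true if a load/store pair may be formed.  Alignments are in
   bits.  Under the aligned policy the check is skipped when optimizing
   for size, as in the patch.  */
bool
pair_allowed (enum pair_policy policy, unsigned mem_align_bits,
	      unsigned mode_align_bits, bool optimize_for_size)
{
  if (policy == POLICY_NEVER)
    return false;
  if (policy == POLICY_ALIGNED
      && !optimize_for_size
      && mem_align_bits < 2 * mode_align_bits)
    return false;
  return true;
}
```

For example, under this model a pair of 32-bit accesses known only to be
32-bit aligned is rejected by the aligned policy, while the same pair with
64-bit alignment is accepted.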

gcc/ChangeLog:
	* config/aarch64/aarch64-protos.h (struct tune_params): Add
	appropriate enums for the policies.
	* config/aarch64/aarch64-tuning-flags.def
	(AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
	options.
	* config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
	function to parse ldp-policy parameter.
	(aarch64_parse_stp_policy): New function to parse stp-policy
	parameter.
	(aarch64_override_options_internal): Call parsing functions.
	(aarch64_operands_ok_for_ldpstp): Add parameter-value check and
	alignment check and remove superseded ones.
	(aarch64_operands_adjust_ok_for_ldpstp): Add parameter-value
	check and alignment check and remove superseded ones.
	* config/aarch64/aarch64.opt: Add options.
	* doc/invoke.texi: Document the parameters accordingly.

gcc/testsuite/ChangeLog:
	* gcc.target/aarch64/ampere1-no_ldp_combine.c: Removed.
	* gcc.target/aarch64/ldp_aligned.c: New test.
	* gcc.target/aarch64/ldp_always.c: New test.
	* gcc.target/aarch64/ldp_never.c: New test.
	* gcc.target/aarch64/stp_aligned.c: New test.
	* gcc.target/aarch64/stp_always.c: New test.
	* gcc.target/aarch64/stp_never.c: New test.

Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
---
Changes in v3:
        - Changed command-line options to target-specific parameters
          and documented them accordingly in doc/invoke.texi.
        - Removed ampere1-no_ldp_combine.c test as superseded.

 gcc/config/aarch64/aarch64-protos.h           |  24 ++
 gcc/config/aarch64/aarch64-tuning-flags.def   |   8 -
 gcc/config/aarch64/aarch64.cc                 | 215 +++++++++++++-----
 gcc/config/aarch64/aarch64.opt                |   8 +
 gcc/doc/invoke.texi                           |  30 +++
 .../aarch64/ampere1-no_ldp_combine.c          |  11 -
 .../gcc.target/aarch64/ldp_aligned.c          |  66 ++++++
 gcc/testsuite/gcc.target/aarch64/ldp_always.c |  66 ++++++
 gcc/testsuite/gcc.target/aarch64/ldp_never.c  |  66 ++++++
 .../gcc.target/aarch64/stp_aligned.c          |  60 +++++
 gcc/testsuite/gcc.target/aarch64/stp_always.c |  60 +++++
 gcc/testsuite/gcc.target/aarch64/stp_never.c  |  60 +++++
 12 files changed, 600 insertions(+), 74 deletions(-)
 delete mode 100644 gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_always.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_never.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_aligned.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_always.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_never.c

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 70303d6fd95..be1d73490ed 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -568,6 +568,30 @@ struct tune_params
   /* Place prefetch struct pointer at the end to enable type checking
      errors when tune_params misses elements (e.g., from erroneous merges).  */
   const struct cpu_prefetch_tune *prefetch;
+/* An enum specifying how to handle load pairs using a fine-grained policy:
+   - LDP_POLICY_ALIGNED: Emit ldp if the source pointer is aligned
+   to at least double the alignment of the type.
+   - LDP_POLICY_ALWAYS: Emit ldp regardless of alignment.
+   - LDP_POLICY_NEVER: Do not emit ldp.  */
+
+  enum aarch64_ldp_policy_model
+  {
+    LDP_POLICY_ALIGNED,
+    LDP_POLICY_ALWAYS,
+    LDP_POLICY_NEVER
+  } ldp_policy_model;
+/* An enum specifying how to handle store pairs using a fine-grained policy:
+   - STP_POLICY_ALIGNED: Emit stp if the source pointer is aligned
+   to at least double the alignment of the type.
+   - STP_POLICY_ALWAYS: Emit stp regardless of alignment.
+   - STP_POLICY_NEVER: Do not emit stp.  */
+
+  enum aarch64_stp_policy_model
+  {
+    STP_POLICY_ALIGNED,
+    STP_POLICY_ALWAYS,
+    STP_POLICY_NEVER
+  } stp_policy_model;
 };
 
 /* Classifies an address.
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 52112ba7c48..774568e9106 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -30,11 +30,6 @@
 
 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
 
-/* Don't create non-8 byte aligned load/store pair.  That is if the
-two load/stores are not at least 8 byte aligned don't create load/store
-pairs.   */
-AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)
-
 /* Some of the optional shift to some arthematic instructions are
    considered cheap.  Logical shift left <=4 with or without a
    zero extend are considered cheap.  Sign extend; non logical shift left
@@ -44,9 +39,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
 /* Disallow load/store pair instructions on Q-registers.  */
 AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)
 
-/* Disallow load-pair instructions to be formed in combine/peephole.  */
-AARCH64_EXTRA_TUNING_OPTION ("no_ldp_combine", NO_LDP_COMBINE)
-
 AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
 
 AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 219c4ee6d4c..9eeb5469cf9 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -1357,7 +1357,9 @@ static const struct tune_params generic_tunings =
      Neoverse V1.  It does not have a noticeable effect on A64FX and should
      have at most a very minor effect on SVE2 cores.  */
   (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS),	/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params cortexa35_tunings =
@@ -1391,7 +1393,9 @@ static const struct tune_params cortexa35_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params cortexa53_tunings =
@@ -1425,7 +1429,9 @@ static const struct tune_params cortexa53_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params cortexa57_tunings =
@@ -1459,7 +1465,9 @@ static const struct tune_params cortexa57_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),	/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params cortexa72_tunings =
@@ -1493,7 +1501,9 @@ static const struct tune_params cortexa72_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params cortexa73_tunings =
@@ -1527,7 +1537,9 @@ static const struct tune_params cortexa73_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 
@@ -1562,7 +1574,9 @@ static const struct tune_params exynosm1_tunings =
   48,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
-  &exynosm1_prefetch_tune
+  &exynosm1_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params thunderxt88_tunings =
@@ -1594,8 +1608,10 @@ static const struct tune_params thunderxt88_tunings =
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),	/* tune_flags.  */
-  &thunderxt88_prefetch_tune
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  &thunderxt88_prefetch_tune,
+  tune_params::LDP_POLICY_ALIGNED,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALIGNED	/* stp_policy_model.  */
 };
 
 static const struct tune_params thunderx_tunings =
@@ -1627,9 +1643,10 @@ static const struct tune_params thunderx_tunings =
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW
-   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),	/* tune_flags.  */
-  &thunderx_prefetch_tune
+  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),	/* tune_flags.  */
+  &thunderx_prefetch_tune,
+  tune_params::LDP_POLICY_ALIGNED,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALIGNED	/* stp_policy_model.  */
 };
 
 static const struct tune_params tsv110_tunings =
@@ -1663,7 +1680,9 @@ static const struct tune_params tsv110_tunings =
   0,    /* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,     /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),     /* tune_flags.  */
-  &tsv110_prefetch_tune
+  &tsv110_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params xgene1_tunings =
@@ -1696,7 +1715,9 @@ static const struct tune_params xgene1_tunings =
   17,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),	/* tune_flags.  */
-  &xgene1_prefetch_tune
+  &xgene1_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params emag_tunings =
@@ -1729,7 +1750,9 @@ static const struct tune_params emag_tunings =
   17,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),	/* tune_flags.  */
-  &xgene1_prefetch_tune
+  &xgene1_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params qdf24xx_tunings =
@@ -1763,7 +1786,9 @@ static const struct tune_params qdf24xx_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS, /* tune_flags.  */
-  &qdf24xx_prefetch_tune
+  &qdf24xx_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 /* Tuning structure for the Qualcomm Saphira core.  Default to falkor values
@@ -1799,7 +1824,9 @@ static const struct tune_params saphira_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),		/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params thunderx2t99_tunings =
@@ -1833,7 +1860,9 @@ static const struct tune_params thunderx2t99_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
-  &thunderx2t99_prefetch_tune
+  &thunderx2t99_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params thunderx3t110_tunings =
@@ -1867,7 +1896,9 @@ static const struct tune_params thunderx3t110_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
-  &thunderx3t110_prefetch_tune
+  &thunderx3t110_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params neoversen1_tunings =
@@ -1900,7 +1931,9 @@ static const struct tune_params neoversen1_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),	/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params ampere1_tunings =
@@ -1936,8 +1969,10 @@ static const struct tune_params ampere1_tunings =
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE),	/* tune_flags.  */
-  &ampere1_prefetch_tune
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  &ampere1_prefetch_tune,
+  tune_params::LDP_POLICY_ALIGNED,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALIGNED	/* stp_policy_model.  */
 };
 
 static const struct tune_params ampere1a_tunings =
@@ -1974,8 +2009,10 @@ static const struct tune_params ampere1a_tunings =
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE),	/* tune_flags.  */
-  &ampere1_prefetch_tune
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  &ampere1_prefetch_tune,
+  tune_params::LDP_POLICY_ALIGNED,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALIGNED	/* stp_policy_model.  */
 };
 
 static const advsimd_vec_cost neoversev1_advsimd_vector_cost =
@@ -2156,7 +2193,9 @@ static const struct tune_params neoversev1_tunings =
    | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
    | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),	/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const sve_vec_cost neoverse512tvb_sve_vector_cost =
@@ -2293,7 +2332,9 @@ static const struct tune_params neoverse512tvb_tunings =
   (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
    | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),	/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const advsimd_vec_cost neoversen2_advsimd_vector_cost =
@@ -2483,7 +2524,9 @@ static const struct tune_params neoversen2_tunings =
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
    | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),	/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const advsimd_vec_cost neoversev2_advsimd_vector_cost =
@@ -2673,7 +2716,9 @@ static const struct tune_params neoversev2_tunings =
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
    | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),	/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params a64fx_tunings =
@@ -2706,7 +2751,9 @@ static const struct tune_params a64fx_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
-  &a64fx_prefetch_tune
+  &a64fx_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 /* Support for fine-grained override of the tuning structures.  */
@@ -17819,6 +17866,34 @@ aarch64_parse_tune (const char *to_parse, const struct processor **res)
   return AARCH_PARSE_INVALID_ARG;
 }
 
+/* Parse a command-line -param=aarch64-ldp-policy= parameter.  VALUE is
+   the value of the parameter.  */
+
+static void
+aarch64_parse_ldp_policy (const unsigned int value, struct tune_params* tune)
+{
+  if (value == 1)
+    tune->ldp_policy_model = tune_params::LDP_POLICY_ALWAYS;
+  else if (value == 2)
+    tune->ldp_policy_model = tune_params::LDP_POLICY_NEVER;
+  else if (value == 3)
+    tune->ldp_policy_model = tune_params::LDP_POLICY_ALIGNED;
+}
+
+/* Parse a command-line -param=aarch64-stp-policy= parameter.  VALUE is
+   the value of the parameter.  */
+
+static void
+aarch64_parse_stp_policy (const unsigned int value, struct tune_params* tune)
+{
+  if (value == 1)
+    tune->stp_policy_model = tune_params::STP_POLICY_ALWAYS;
+  else if (value == 2)
+    tune->stp_policy_model = tune_params::STP_POLICY_NEVER;
+  else if (value == 3)
+    tune->stp_policy_model = tune_params::STP_POLICY_ALIGNED;
+}
+
 /* Parse TOKEN, which has length LENGTH to see if it is an option
    described in FLAG.  If it is, return the index bit for that fusion type.
    If not, error (printing OPTION_NAME) and return zero.  */
@@ -18167,6 +18242,12 @@ aarch64_override_options_internal (struct gcc_options *opts)
     aarch64_parse_override_string (opts->x_aarch64_override_tune_string,
 				   &aarch64_tune_params);
 
+  aarch64_parse_ldp_policy (aarch64_ldp_policy,
+			      &aarch64_tune_params);
+
+  aarch64_parse_stp_policy (aarch64_stp_policy,
+			      &aarch64_tune_params);
+
   /* This target defaults to strict volatile bitfields.  */
   if (opts->x_flag_strict_volatile_bitfields < 0 && abi_version_at_least (2))
     opts->x_flag_strict_volatile_bitfields = 1;
@@ -26468,18 +26549,14 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
   enum reg_class rclass_1, rclass_2;
   rtx mem_1, mem_2, reg_1, reg_2;
 
-  /* Allow the tuning structure to disable LDP instruction formation
-     from combining instructions (e.g., in peephole2).
-     TODO: Implement fine-grained tuning control for LDP and STP:
-	   1. control policies for load and store separately;
-	   2. support the following policies:
-	      - default (use what is in the tuning structure)
-	      - always
-	      - never
-	      - aligned (only if the compiler can prove that the
-		load will be aligned to 2 * element_size)  */
-  if (load && (aarch64_tune_params.extra_tuning_flags
-	       & AARCH64_EXTRA_TUNE_NO_LDP_COMBINE))
+  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
+  if (load
+      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
+    return false;
+
+  /* If we have STP_POLICY_NEVER, reject the store pair.  */
+  if (!load
+      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
     return false;
 
   if (load)
@@ -26506,13 +26583,22 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
   if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
     return false;
 
-  /* If we have SImode and slow unaligned ldp,
-     check the alignment to be at least 8 byte. */
-  if (mode == SImode
-      && (aarch64_tune_params.extra_tuning_flags
-          & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
-      && !optimize_size
-      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
+  /* If we have LDP_POLICY_ALIGNED,
+     do not emit the load pair unless the alignment is checked to be
+     at least double the alignment of the type.  */
+  if (load
+      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
+      && !optimize_function_for_size_p (cfun)
+      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
+    return false;
+
+  /* If we have STP_POLICY_ALIGNED,
+     do not emit the store pair unless the alignment is checked to be
+     at least double the alignment of the type.  */
+  if (!load
+      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
+      && !optimize_function_for_size_p (cfun)
+      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
     return false;
 
   /* Check if the addresses are in the form of [base+offset].  */
@@ -26640,6 +26726,16 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
   HOST_WIDE_INT offvals[num_insns], msize;
   rtx mem[num_insns], reg[num_insns], base[num_insns], offset[num_insns];
 
+  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
+  if (load
+      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
+    return false;
+
+  /* If we have STP_POLICY_NEVER, reject the store pair.  */
+  if (!load
+      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
+    return false;
+
   if (load)
     {
       for (int i = 0; i < num_insns; i++)
@@ -26729,13 +26825,22 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
   if (offvals[0] % msize != offvals[2] % msize)
     return false;
 
-  /* If we have SImode and slow unaligned ldp,
-     check the alignment to be at least 8 byte. */
-  if (mode == SImode
-      && (aarch64_tune_params.extra_tuning_flags
-	  & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
-      && !optimize_size
-      && MEM_ALIGN (mem[0]) < 8 * BITS_PER_UNIT)
+  /* If we have LDP_POLICY_ALIGNED,
+     do not emit the load pair unless the alignment is checked to be
+     at least double the alignment of the type.  */
+  if (load
+      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
+      && !optimize_function_for_size_p (cfun)
+      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
+    return false;
+
+  /* If we have STP_POLICY_ALIGNED,
+     do not emit the store pair unless the alignment is checked to be
+     at least double the alignment of the type.  */
+  if (!load
+      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
+      && !optimize_function_for_size_p (cfun)
+      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
     return false;
 
   return true;
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index 4a0580435a8..f61e3f968d4 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -337,3 +337,11 @@ Constant memset size in bytes from which to start using MOPS sequence.
 -param=aarch64-vect-unroll-limit=
 Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param
 Limit how much the autovectorizer may unroll a loop.
+
+-param=aarch64-ldp-policy=
+Target Joined UInteger Var(aarch64_ldp_policy) Init(0) IntegerRange(0, 3) Param
+Fine-grained policy for load pairs.
+
+-param=aarch64-stp-policy=
+Target Joined UInteger Var(aarch64_stp_policy) Init(0) IntegerRange(0, 3) Param
+Fine-grained policy for store pairs.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 146b40414b0..000dd5541f4 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -16508,6 +16508,36 @@ Use both Advanced SIMD and SVE.  Prefer SVE when the costs are deemed equal.
 @end table
 The default value is 0.
 
+@item aarch64-ldp-policy
+Fine-grained policy for load pairs.  Accepts values from 0 to 3, inclusive.
+@table @samp
+@item 0
+Use the policy of the tuning structure.
+@item 1
+Emit ldp regardless of alignment.
+@item 2
+Do not emit ldp.
+@item 3
+Emit ldp only if the source pointer is aligned to at least double the alignment
+of the type.
+@end table
+The default value is 0.
+
+@item aarch64-stp-policy
+Fine-grained policy for store pairs.  Accepts values from 0 to 3, inclusive.
+@table @samp
+@item 0
+Use the policy of the tuning structure.
+@item 1
+Emit stp regardless of alignment.
+@item 2
+Do not emit stp.
+@item 3
+Emit stp only if the source pointer is aligned to at least double the alignment
+of the type.
+@end table
+The default value is 0.
+
 @item aarch64-loop-vect-issue-rate-niters
 The tuning for some AArch64 CPUs tries to take both latencies and issue
 rates into account when deciding whether a loop should be vectorized
diff --git a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c b/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
deleted file mode 100644
index bc871f4481d..00000000000
--- a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
+++ /dev/null
@@ -1,11 +0,0 @@
-/* { dg-options "-O3 -mtune=ampere1" } */
-
-long
-foo (long a[])
-{
-  return a[0] + a[1];
-}
-
-/* We should see two ldrs instead of one ldp. */
-/* { dg-final { scan-assembler {\tldr\t} } } */
-/* { dg-final { scan-assembler-not {\tldp\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
new file mode 100644
index 00000000000..8e43faab70d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
@@ -0,0 +1,66 @@
+/* { dg-options "-O2 --param=aarch64-ldp-policy=3 -mcpu=generic" } */
+
+#include <stdlib.h>
+#include <stdint.h>
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+#define LDP_TEST_ALIGNED(TYPE) \
+TYPE ldp_aligned_##TYPE(char* ptr){ \
+    TYPE a_0, a_1; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    a_0 = arr[0]; \
+    a_1 = arr[1]; \
+    return a_0 + a_1; \
+}
+
+#define LDP_TEST_UNALIGNED(TYPE) \
+TYPE ldp_unaligned_##TYPE(char* ptr){ \
+    TYPE a_0, a_1; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a_0 = a[0]; \
+    a_1 = a[1]; \
+    return a_0 + a_1; \
+}
+
+#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
+TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
+    TYPE a_0, a_1, a_2, a_3, a_4; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    a_0 = arr[100]; \
+    a_1 = arr[101]; \
+    a_2 = arr[102]; \
+    a_3 = arr[103]; \
+    a_4 = arr[110]; \
+    return a_0 + a_1 + a_2 + a_3 + a_4; \
+}
+
+#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
+TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
+    TYPE a_0, a_1, a_2, a_3, a_4; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a_0 = a[100]; \
+    a_1 = a[101]; \
+    a_2 = a[102]; \
+    a_3 = a[103]; \
+    a_4 = a[110]; \
+    return a_0 + a_1 + a_2 + a_3 + a_4; \
+}
+
+LDP_TEST_ALIGNED(int32_t);
+LDP_TEST_ALIGNED(int64_t);
+LDP_TEST_ALIGNED(v4si);
+LDP_TEST_UNALIGNED(int32_t);
+LDP_TEST_UNALIGNED(int64_t);
+LDP_TEST_UNALIGNED(v4si);
+LDP_TEST_ADJUST_ALIGNED(int32_t);
+LDP_TEST_ADJUST_ALIGNED(int64_t);
+LDP_TEST_ADJUST_UNALIGNED(int32_t);
+LDP_TEST_ADJUST_UNALIGNED(int64_t);
+
+/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
+/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
+/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
+
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_always.c b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
new file mode 100644
index 00000000000..532ca607565
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
@@ -0,0 +1,66 @@
+/* { dg-options "-O2 --param=aarch64-ldp-policy=1 -mcpu=generic" } */
+
+#include <stdlib.h>
+#include <stdint.h>
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+#define LDP_TEST_ALIGNED(TYPE) \
+TYPE ldp_aligned_##TYPE(char* ptr){ \
+    TYPE a_0, a_1; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    a_0 = arr[0]; \
+    a_1 = arr[1]; \
+    return a_0 + a_1; \
+}
+
+#define LDP_TEST_UNALIGNED(TYPE) \
+TYPE ldp_unaligned_##TYPE(char* ptr){ \
+    TYPE a_0, a_1; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a_0 = a[0]; \
+    a_1 = a[1]; \
+    return a_0 + a_1; \
+}
+
+#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
+TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
+    TYPE a_0, a_1, a_2, a_3, a_4; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    a_0 = arr[100]; \
+    a_1 = arr[101]; \
+    a_2 = arr[102]; \
+    a_3 = arr[103]; \
+    a_4 = arr[110]; \
+    return a_0 + a_1 + a_2 + a_3 + a_4; \
+}
+
+#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
+TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
+    TYPE a_0, a_1, a_2, a_3, a_4; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a_0 = a[100]; \
+    a_1 = a[101]; \
+    a_2 = a[102]; \
+    a_3 = a[103]; \
+    a_4 = a[110]; \
+    return a_0 + a_1 + a_2 + a_3 + a_4; \
+}
+
+LDP_TEST_ALIGNED(int32_t);
+LDP_TEST_ALIGNED(int64_t);
+LDP_TEST_ALIGNED(v4si);
+LDP_TEST_UNALIGNED(int32_t);
+LDP_TEST_UNALIGNED(int64_t);
+LDP_TEST_UNALIGNED(v4si);
+LDP_TEST_ADJUST_ALIGNED(int32_t);
+LDP_TEST_ADJUST_ALIGNED(int64_t);
+LDP_TEST_ADJUST_UNALIGNED(int32_t);
+LDP_TEST_ADJUST_UNALIGNED(int64_t);
+
+/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 6 } } */
+/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 6 } } */
+/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 2 } } */
+
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_never.c b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
new file mode 100644
index 00000000000..b39941c18d7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
@@ -0,0 +1,66 @@
+/* { dg-options "-O2 --param=aarch64-ldp-policy=2 -mcpu=generic" } */
+
+#include <stdlib.h>
+#include <stdint.h>
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+#define LDP_TEST_ALIGNED(TYPE) \
+TYPE ldp_aligned_##TYPE(char* ptr){ \
+    TYPE a_0, a_1; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    a_0 = arr[0]; \
+    a_1 = arr[1]; \
+    return a_0 + a_1; \
+}
+
+#define LDP_TEST_UNALIGNED(TYPE) \
+TYPE ldp_unaligned_##TYPE(char* ptr){ \
+    TYPE a_0, a_1; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a_0 = a[0]; \
+    a_1 = a[1]; \
+    return a_0 + a_1; \
+}
+
+#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
+TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
+    TYPE a_0, a_1, a_2, a_3, a_4; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    a_0 = arr[100]; \
+    a_1 = arr[101]; \
+    a_2 = arr[102]; \
+    a_3 = arr[103]; \
+    a_4 = arr[110]; \
+    return a_0 + a_1 + a_2 + a_3 + a_4; \
+}
+
+#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
+TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
+    TYPE a_0, a_1, a_2, a_3, a_4; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a_0 = a[100]; \
+    a_1 = a[101]; \
+    a_2 = a[102]; \
+    a_3 = a[103]; \
+    a_4 = a[110]; \
+    return a_0 + a_1 + a_2 + a_3 + a_4; \
+}
+
+LDP_TEST_ALIGNED(int32_t);
+LDP_TEST_ALIGNED(int64_t);
+LDP_TEST_ALIGNED(v4si);
+LDP_TEST_UNALIGNED(int32_t);
+LDP_TEST_UNALIGNED(int64_t);
+LDP_TEST_UNALIGNED(v4si);
+LDP_TEST_ADJUST_ALIGNED(int32_t);
+LDP_TEST_ADJUST_ALIGNED(int64_t);
+LDP_TEST_ADJUST_UNALIGNED(int32_t);
+LDP_TEST_ADJUST_UNALIGNED(int64_t);
+
+/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
+/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
+/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
+
diff --git a/gcc/testsuite/gcc.target/aarch64/stp_aligned.c b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
new file mode 100644
index 00000000000..01f294bb090
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
@@ -0,0 +1,60 @@
+/* { dg-options "-O2 --param=aarch64-stp-policy=3 -mcpu=generic" } */
+
+#include <stdlib.h>
+#include <stdint.h>
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+#define STP_TEST_ALIGNED(TYPE) \
+TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    arr[0] = x; \
+    arr[1] = x; \
+    return arr; \
+}
+
+#define STP_TEST_UNALIGNED(TYPE) \
+TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a[0] = x; \
+    a[1] = x; \
+    return a; \
+}
+
+#define STP_TEST_ADJUST_ALIGNED(TYPE) \
+TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    arr[100] = x; \
+    arr[101] = x; \
+    arr[102] = x; \
+    arr[103] = x; \
+    return arr; \
+}
+
+#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
+TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a[100] = x; \
+    a[101] = x; \
+    a[102] = x; \
+    a[103] = x; \
+    return a; \
+}
+
+STP_TEST_ALIGNED(int32_t);
+STP_TEST_ALIGNED(int64_t);
+STP_TEST_ALIGNED(v4si);
+STP_TEST_UNALIGNED(int32_t);
+STP_TEST_UNALIGNED(int64_t);
+STP_TEST_UNALIGNED(v4si);
+STP_TEST_ADJUST_ALIGNED(int32_t);
+STP_TEST_ADJUST_ALIGNED(int64_t);
+STP_TEST_ADJUST_UNALIGNED(int32_t);
+STP_TEST_ADJUST_UNALIGNED(int64_t);
+
+/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 3 } } */
+/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 3 } } */
+/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 1 } } */
+
diff --git a/gcc/testsuite/gcc.target/aarch64/stp_always.c b/gcc/testsuite/gcc.target/aarch64/stp_always.c
new file mode 100644
index 00000000000..cedb461b5b2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stp_always.c
@@ -0,0 +1,60 @@
+/* { dg-options "-O2 --param=aarch64-stp-policy=1 -mcpu=generic" } */
+
+#include <stdlib.h>
+#include <stdint.h>
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+#define STP_TEST_ALIGNED(TYPE) \
+TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    arr[0] = x; \
+    arr[1] = x; \
+    return arr; \
+}
+
+#define STP_TEST_UNALIGNED(TYPE) \
+TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a[0] = x; \
+    a[1] = x; \
+    return a; \
+}
+
+#define STP_TEST_ADJUST_ALIGNED(TYPE) \
+TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    arr[100] = x; \
+    arr[101] = x; \
+    arr[102] = x; \
+    arr[103] = x; \
+    return arr; \
+}
+
+#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
+TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a[100] = x; \
+    a[101] = x; \
+    a[102] = x; \
+    a[103] = x; \
+    return a; \
+}
+
+STP_TEST_ALIGNED(int32_t);
+STP_TEST_ALIGNED(int64_t);
+STP_TEST_ALIGNED(v4si);
+STP_TEST_UNALIGNED(int32_t);
+STP_TEST_UNALIGNED(int64_t);
+STP_TEST_UNALIGNED(v4si);
+STP_TEST_ADJUST_ALIGNED(int32_t);
+STP_TEST_ADJUST_ALIGNED(int64_t);
+STP_TEST_ADJUST_UNALIGNED(int32_t);
+STP_TEST_ADJUST_UNALIGNED(int64_t);
+
+/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 6 } } */
+/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 6 } } */
+/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 2 } } */
+
diff --git a/gcc/testsuite/gcc.target/aarch64/stp_never.c b/gcc/testsuite/gcc.target/aarch64/stp_never.c
new file mode 100644
index 00000000000..ddde658f807
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stp_never.c
@@ -0,0 +1,60 @@
+/* { dg-options "-O2 --param=aarch64-stp-policy=2 -mcpu=generic" } */
+
+#include <stdlib.h>
+#include <stdint.h>
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+#define STP_TEST_ALIGNED(TYPE) \
+TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    arr[0] = x; \
+    arr[1] = x; \
+    return arr; \
+}
+
+#define STP_TEST_UNALIGNED(TYPE) \
+TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a[0] = x; \
+    a[1] = x; \
+    return a; \
+}
+
+#define STP_TEST_ADJUST_ALIGNED(TYPE) \
+TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    arr[100] = x; \
+    arr[101] = x; \
+    arr[102] = x; \
+    arr[103] = x; \
+    return arr; \
+}
+
+#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
+TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a[100] = x; \
+    a[101] = x; \
+    a[102] = x; \
+    a[103] = x; \
+    return a; \
+}
+
+STP_TEST_ALIGNED(int32_t);
+STP_TEST_ALIGNED(int64_t);
+STP_TEST_ALIGNED(v4si);
+STP_TEST_UNALIGNED(int32_t);
+STP_TEST_UNALIGNED(int64_t);
+STP_TEST_UNALIGNED(v4si);
+STP_TEST_ADJUST_ALIGNED(int32_t);
+STP_TEST_ADJUST_ALIGNED(int64_t);
+STP_TEST_ADJUST_UNALIGNED(int32_t);
+STP_TEST_ADJUST_UNALIGNED(int64_t);
+
+/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 0 } } */
+/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 0 } } */
+/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 0 } } */
+
-- 
2.40.1


* Re: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp formation.
  2023-09-25 19:49 [PATCH v3] aarch64: Fine-grained policies to control ldp-stp formation Manos Anagnostakis
@ 2023-09-25 19:54 ` Andrew Pinski
  2023-09-25 19:59   ` Philipp Tomsich
  2023-09-25 20:00   ` Manos Anagnostakis
  0 siblings, 2 replies; 12+ messages in thread
From: Andrew Pinski @ 2023-09-25 19:54 UTC (permalink / raw)
  To: Manos Anagnostakis; +Cc: gcc-patches, Kyrylo Tkachov, Philipp Tomsich

On Mon, Sep 25, 2023 at 12:50 PM Manos Anagnostakis
<manos.anagnostakis@vrull.eu> wrote:
>
> This patch implements the following TODO in gcc/config/aarch64/aarch64.cc
> to provide the requested behaviour for handling ldp and stp:
>
>   /* Allow the tuning structure to disable LDP instruction formation
>      from combining instructions (e.g., in peephole2).
>      TODO: Implement fine-grained tuning control for LDP and STP:
>            1. control policies for load and store separately;
>            2. support the following policies:
>               - default (use what is in the tuning structure)
>               - always
>               - never
>               - aligned (only if the compiler can prove that the
>                 load will be aligned to 2 * element_size)  */
>
> It provides two new and concrete target-specific command-line parameters
> --param=aarch64-ldp-policy= and --param=aarch64-stp-policy=
> to give the ability to control load and store policies separately as
> stated in part 1 of the TODO.
>
> The accepted values for both parameters are:
> - 0: Use the policy of the tuning structure (default).
> - 1: Emit ldp/stp regardless of alignment.
> - 2: Do not emit ldp/stp.
> - 3: In order to emit ldp/stp, first check if the load/store will
>   be aligned to 2 * element_size.

Instead of a number, would it make sense to use a string (ENUM) for
this param?
Also, I think using --param is a bad idea if it is going to be
documented in the user manual.
Maybe a -m option should be used instead.
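
To illustrate the string idea, here is a minimal sketch in plain C of
mapping policy names to an enum. It is not GCC's option machinery, and
the enum, the function name and the accepted spellings are all
hypothetical; only the four policies themselves come from the patch
description.

```c
#include <string.h>

/* Hypothetical enum: one entry per policy listed in the patch
   description.  The identifiers are illustrative, not from the patch.  */
enum ldp_stp_policy
{
  POLICY_DEFAULT,
  POLICY_ALWAYS,
  POLICY_NEVER,
  POLICY_ALIGNED
};

/* Map NAME to a policy.  Store the result in *OUT and return 0 on
   success; return -1 and leave *OUT untouched if NAME is not a
   recognised policy.  */
static int
parse_policy (const char *name, enum ldp_stp_policy *out)
{
  static const struct
  {
    const char *name;
    enum ldp_stp_policy value;
  } table[] = {
    { "default", POLICY_DEFAULT },
    { "always", POLICY_ALWAYS },
    { "never", POLICY_NEVER },
    { "aligned", POLICY_ALIGNED },
  };

  for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
    if (strcmp (name, table[i].name) == 0)
      {
	*out = table[i].value;
	return 0;
      }

  return -1;
}
```

An unrecognised name returns -1, so the caller can report an error
rather than guess what a bare number was meant to select.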

Thanks,
Andrew

>
> gcc/ChangeLog:
>         * config/aarch64/aarch64-protos.h (struct tune_params): Add
>         appropriate enums for the policies.
>         * config/aarch64/aarch64-tuning-flags.def
>         (AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
>         options.
>         * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
>         function to parse ldp-policy parameter.
>         (aarch64_parse_stp_policy): New function to parse stp-policy parameter.
>         (aarch64_override_options_internal): Call parsing functions.
>         (aarch64_operands_ok_for_ldpstp): Add parameter-value check and
>         alignment check and remove superseded ones.
>         (aarch64_operands_adjust_ok_for_ldpstp): Add parameter-value check and
>         alignment check and remove superseded ones.
>         * config/aarch64/aarch64.opt: Add options.
>         * doc/invoke.texi: Document the parameters accordingly.
>
> gcc/testsuite/ChangeLog:
>         * gcc.target/aarch64/ampere1-no_ldp_combine.c: Removed.
>         * gcc.target/aarch64/ldp_aligned.c: New test.
>         * gcc.target/aarch64/ldp_always.c: New test.
>         * gcc.target/aarch64/ldp_never.c: New test.
>         * gcc.target/aarch64/stp_aligned.c: New test.
>         * gcc.target/aarch64/stp_always.c: New test.
>         * gcc.target/aarch64/stp_never.c: New test.
>
> Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> ---
> Changes in v3:
>         - Changed command-line options to target-specific parameters
>           and documented them accordingly in doc/invoke.texi.
>         - Removed ampere1-no_ldp_combine.c test as superseded.
>
>  gcc/config/aarch64/aarch64-protos.h           |  24 ++
>  gcc/config/aarch64/aarch64-tuning-flags.def   |   8 -
>  gcc/config/aarch64/aarch64.cc                 | 215 +++++++++++++-----
>  gcc/config/aarch64/aarch64.opt                |   8 +
>  gcc/doc/invoke.texi                           |  30 +++
>  .../aarch64/ampere1-no_ldp_combine.c          |  11 -
>  .../gcc.target/aarch64/ldp_aligned.c          |  66 ++++++
>  gcc/testsuite/gcc.target/aarch64/ldp_always.c |  66 ++++++
>  gcc/testsuite/gcc.target/aarch64/ldp_never.c  |  66 ++++++
>  .../gcc.target/aarch64/stp_aligned.c          |  60 +++++
>  gcc/testsuite/gcc.target/aarch64/stp_always.c |  60 +++++
>  gcc/testsuite/gcc.target/aarch64/stp_never.c  |  60 +++++
>  12 files changed, 600 insertions(+), 74 deletions(-)
>  delete mode 100644 gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_always.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_never.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_aligned.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_always.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_never.c
>
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index 70303d6fd95..be1d73490ed 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -568,6 +568,30 @@ struct tune_params
>    /* Place prefetch struct pointer at the end to enable type checking
>       errors when tune_params misses elements (e.g., from erroneous merges).  */
>    const struct cpu_prefetch_tune *prefetch;
> +/* An enum specifying how to handle load pairs using a fine-grained policy:
> +   - LDP_POLICY_ALIGNED: Emit ldp if the source pointer is aligned
> +   to at least double the alignment of the type.
> +   - LDP_POLICY_ALWAYS: Emit ldp regardless of alignment.
> +   - LDP_POLICY_NEVER: Do not emit ldp.  */
> +
> +  enum aarch64_ldp_policy_model
> +  {
> +    LDP_POLICY_ALIGNED,
> +    LDP_POLICY_ALWAYS,
> +    LDP_POLICY_NEVER
> +  } ldp_policy_model;
> +/* An enum specifying how to handle store pairs using a fine-grained policy:
> +   - STP_POLICY_ALIGNED: Emit stp if the source pointer is aligned
> +   to at least double the alignment of the type.
> +   - STP_POLICY_ALWAYS: Emit stp regardless of alignment.
> +   - STP_POLICY_NEVER: Do not emit stp.  */
> +
> +  enum aarch64_stp_policy_model
> +  {
> +    STP_POLICY_ALIGNED,
> +    STP_POLICY_ALWAYS,
> +    STP_POLICY_NEVER
> +  } stp_policy_model;
>  };
>
>  /* Classifies an address.
> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
> index 52112ba7c48..774568e9106 100644
> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> @@ -30,11 +30,6 @@
>
>  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
>
> -/* Don't create non-8 byte aligned load/store pair.  That is if the
> -two load/stores are not at least 8 byte aligned don't create load/store
> -pairs.   */
> -AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)
> -
>  /* Some of the optional shift to some arthematic instructions are
>     considered cheap.  Logical shift left <=4 with or without a
>     zero extend are considered cheap.  Sign extend; non logical shift left
> @@ -44,9 +39,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
>  /* Disallow load/store pair instructions on Q-registers.  */
>  AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)
>
> -/* Disallow load-pair instructions to be formed in combine/peephole.  */
> -AARCH64_EXTRA_TUNING_OPTION ("no_ldp_combine", NO_LDP_COMBINE)
> -
>  AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
>
>  AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 219c4ee6d4c..9eeb5469cf9 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -1357,7 +1357,9 @@ static const struct tune_params generic_tunings =
>       Neoverse V1.  It does not have a noticeable effect on A64FX and should
>       have at most a very minor effect on SVE2 cores.  */
>    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS),   /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params cortexa35_tunings =
> @@ -1391,7 +1393,9 @@ static const struct tune_params cortexa35_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params cortexa53_tunings =
> @@ -1425,7 +1429,9 @@ static const struct tune_params cortexa53_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params cortexa57_tunings =
> @@ -1459,7 +1465,9 @@ static const struct tune_params cortexa57_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),        /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params cortexa72_tunings =
> @@ -1493,7 +1501,9 @@ static const struct tune_params cortexa72_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params cortexa73_tunings =
> @@ -1527,7 +1537,9 @@ static const struct tune_params cortexa73_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>
> @@ -1562,7 +1574,9 @@ static const struct tune_params exynosm1_tunings =
>    48,  /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> -  &exynosm1_prefetch_tune
> +  &exynosm1_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params thunderxt88_tunings =
> @@ -1594,8 +1608,10 @@ static const struct tune_params thunderxt88_tunings =
>    2,   /* min_div_recip_mul_df.  */
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),    /* tune_flags.  */
> -  &thunderxt88_prefetch_tune
> +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> +  &thunderxt88_prefetch_tune,
> +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
>  };
>
>  static const struct tune_params thunderx_tunings =
> @@ -1627,9 +1643,10 @@ static const struct tune_params thunderx_tunings =
>    2,   /* min_div_recip_mul_df.  */
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW
> -   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> -  &thunderx_prefetch_tune
> +  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /* tune_flags.  */
> +  &thunderx_prefetch_tune,
> +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
>  };
>
>  static const struct tune_params tsv110_tunings =
> @@ -1663,7 +1680,9 @@ static const struct tune_params tsv110_tunings =
>    0,    /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,     /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),     /* tune_flags.  */
> -  &tsv110_prefetch_tune
> +  &tsv110_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params xgene1_tunings =
> @@ -1696,7 +1715,9 @@ static const struct tune_params xgene1_tunings =
>    17,  /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> -  &xgene1_prefetch_tune
> +  &xgene1_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params emag_tunings =
> @@ -1729,7 +1750,9 @@ static const struct tune_params emag_tunings =
>    17,  /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> -  &xgene1_prefetch_tune
> +  &xgene1_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params qdf24xx_tunings =
> @@ -1763,7 +1786,9 @@ static const struct tune_params qdf24xx_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS, /* tune_flags.  */
> -  &qdf24xx_prefetch_tune
> +  &qdf24xx_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  /* Tuning structure for the Qualcomm Saphira core.  Default to falkor values
> @@ -1799,7 +1824,9 @@ static const struct tune_params saphira_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),           /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params thunderx2t99_tunings =
> @@ -1833,7 +1860,9 @@ static const struct tune_params thunderx2t99_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> -  &thunderx2t99_prefetch_tune
> +  &thunderx2t99_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params thunderx3t110_tunings =
> @@ -1867,7 +1896,9 @@ static const struct tune_params thunderx3t110_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> -  &thunderx3t110_prefetch_tune
> +  &thunderx3t110_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params neoversen1_tunings =
> @@ -1900,7 +1931,9 @@ static const struct tune_params neoversen1_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params ampere1_tunings =
> @@ -1936,8 +1969,10 @@ static const struct tune_params ampere1_tunings =
>    2,   /* min_div_recip_mul_df.  */
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> -  &ampere1_prefetch_tune
> +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> +  &ampere1_prefetch_tune,
> +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
>  };
>
>  static const struct tune_params ampere1a_tunings =
> @@ -1974,8 +2009,10 @@ static const struct tune_params ampere1a_tunings =
>    2,   /* min_div_recip_mul_df.  */
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> -  &ampere1_prefetch_tune
> +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> +  &ampere1_prefetch_tune,
> +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
>  };
>
>  static const advsimd_vec_cost neoversev1_advsimd_vector_cost =
> @@ -2156,7 +2193,9 @@ static const struct tune_params neoversev1_tunings =
>     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>     | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const sve_vec_cost neoverse512tvb_sve_vector_cost =
> @@ -2293,7 +2332,9 @@ static const struct tune_params neoverse512tvb_tunings =
>    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const advsimd_vec_cost neoversen2_advsimd_vector_cost =
> @@ -2483,7 +2524,9 @@ static const struct tune_params neoversen2_tunings =
>     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const advsimd_vec_cost neoversev2_advsimd_vector_cost =
> @@ -2673,7 +2716,9 @@ static const struct tune_params neoversev2_tunings =
>     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params a64fx_tunings =
> @@ -2706,7 +2751,9 @@ static const struct tune_params a64fx_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> -  &a64fx_prefetch_tune
> +  &a64fx_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  /* Support for fine-grained override of the tuning structures.  */
> @@ -17819,6 +17866,34 @@ aarch64_parse_tune (const char *to_parse, const struct processor **res)
>    return AARCH_PARSE_INVALID_ARG;
>  }
>
> +/* Parse a command-line -param=aarch64-ldp-policy= parameter.  VALUE is
> +   the value of the parameter.  */
> +
> +static void
> +aarch64_parse_ldp_policy (const unsigned int value, struct tune_params* tune)
> +{
> +  if (value == 1)
> +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALWAYS;
> +  else if (value == 2)
> +    tune->ldp_policy_model = tune_params::LDP_POLICY_NEVER;
> +  else if (value == 3)
> +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALIGNED;
> +}
> +
> +/* Parse a command-line -param=aarch64-stp-policy= parameter.  VALUE is
> +   the value of the parameter.  */
> +
> +static void
> +aarch64_parse_stp_policy (const unsigned int value, struct tune_params* tune)
> +{
> +  if (value == 1)
> +    tune->stp_policy_model = tune_params::STP_POLICY_ALWAYS;
> +  else if (value == 2)
> +    tune->stp_policy_model = tune_params::STP_POLICY_NEVER;
> +  else if (value == 3)
> +    tune->stp_policy_model = tune_params::STP_POLICY_ALIGNED;
> +}
> +
>  /* Parse TOKEN, which has length LENGTH to see if it is an option
>     described in FLAG.  If it is, return the index bit for that fusion type.
>     If not, error (printing OPTION_NAME) and return zero.  */
> @@ -18167,6 +18242,12 @@ aarch64_override_options_internal (struct gcc_options *opts)
>      aarch64_parse_override_string (opts->x_aarch64_override_tune_string,
>                                    &aarch64_tune_params);
>
> +  aarch64_parse_ldp_policy (aarch64_ldp_policy,
> +                             &aarch64_tune_params);
> +
> +  aarch64_parse_stp_policy (aarch64_stp_policy,
> +                             &aarch64_tune_params);
> +
>    /* This target defaults to strict volatile bitfields.  */
>    if (opts->x_flag_strict_volatile_bitfields < 0 && abi_version_at_least (2))
>      opts->x_flag_strict_volatile_bitfields = 1;
> @@ -26468,18 +26549,14 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
>    enum reg_class rclass_1, rclass_2;
>    rtx mem_1, mem_2, reg_1, reg_2;
>
> -  /* Allow the tuning structure to disable LDP instruction formation
> -     from combining instructions (e.g., in peephole2).
> -     TODO: Implement fine-grained tuning control for LDP and STP:
> -          1. control policies for load and store separately;
> -          2. support the following policies:
> -             - default (use what is in the tuning structure)
> -             - always
> -             - never
> -             - aligned (only if the compiler can prove that the
> -               load will be aligned to 2 * element_size)  */
> -  if (load && (aarch64_tune_params.extra_tuning_flags
> -              & AARCH64_EXTRA_TUNE_NO_LDP_COMBINE))
> +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> +  if (load
> +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> +    return false;
> +
> +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> +  if (!load
> +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
>      return false;
>
>    if (load)
> @@ -26506,13 +26583,22 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
>    if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
>      return false;
>
> -  /* If we have SImode and slow unaligned ldp,
> -     check the alignment to be at least 8 byte. */
> -  if (mode == SImode
> -      && (aarch64_tune_params.extra_tuning_flags
> -          & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> -      && !optimize_size
> -      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
> +  /* If we have LDP_POLICY_ALIGNED,
> +     do not emit the load pair unless the alignment is checked to be
> +     at least double the alignment of the type.  */
> +  if (load
> +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> +      && !optimize_function_for_size_p (cfun)
> +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> +    return false;
> +
> +  /* If we have STP_POLICY_ALIGNED,
> +     do not emit the store pair unless the alignment is checked to be
> +     at least double the alignment of the type.  */
> +  if (!load
> +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> +      && !optimize_function_for_size_p (cfun)
> +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
>      return false;
>
>    /* Check if the addresses are in the form of [base+offset].  */
> @@ -26640,6 +26726,16 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
>    HOST_WIDE_INT offvals[num_insns], msize;
>    rtx mem[num_insns], reg[num_insns], base[num_insns], offset[num_insns];
>
> +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> +  if (load
> +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> +    return false;
> +
> +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> +  if (!load
> +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
> +    return false;
> +
>    if (load)
>      {
>        for (int i = 0; i < num_insns; i++)
> @@ -26729,13 +26825,22 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
>    if (offvals[0] % msize != offvals[2] % msize)
>      return false;
>
> -  /* If we have SImode and slow unaligned ldp,
> -     check the alignment to be at least 8 byte. */
> -  if (mode == SImode
> -      && (aarch64_tune_params.extra_tuning_flags
> -         & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> -      && !optimize_size
> -      && MEM_ALIGN (mem[0]) < 8 * BITS_PER_UNIT)
> +  /* If we have LDP_POLICY_ALIGNED,
> +     do not emit the load pair unless the alignment is checked to be
> +     at least double the alignment of the type.  */
> +  if (load
> +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> +      && !optimize_function_for_size_p (cfun)
> +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> +    return false;
> +
> +  /* If we have STP_POLICY_ALIGNED,
> +     do not emit the store pair unless the alignment is checked to be
> +     at least double the alignment of the type.  */
> +  if (!load
> +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> +      && !optimize_function_for_size_p (cfun)
> +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
>      return false;
>
>    return true;
> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> index 4a0580435a8..f61e3f968d4 100644
> --- a/gcc/config/aarch64/aarch64.opt
> +++ b/gcc/config/aarch64/aarch64.opt
> @@ -337,3 +337,11 @@ Constant memset size in bytes from which to start using MOPS sequence.
>  -param=aarch64-vect-unroll-limit=
>  Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param
>  Limit how much the autovectorizer may unroll a loop.
> +
> +-param=aarch64-ldp-policy=
> +Target Joined UInteger Var(aarch64_ldp_policy) Init(0) IntegerRange(0, 3) Param
> +Fine-grained policy for load pairs.
> +
> +-param=aarch64-stp-policy=
> +Target Joined UInteger Var(aarch64_stp_policy) Init(0) IntegerRange(0, 3) Param
> +Fine-grained policy for store pairs.
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 146b40414b0..000dd5541f4 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -16508,6 +16508,36 @@ Use both Advanced SIMD and SVE.  Prefer SVE when the costs are deemed equal.
>  @end table
>  The default value is 0.
>
> +@item aarch64-ldp-policy
> +Fine-grained policy for load pairs.  Accepts values from 0 to 3, inclusive.
> +@table @samp
> +@item 0
> +Use the policy of the tuning structure.
> +@item 1
> +Emit ldp regardless of alignment.
> +@item 2
> +Do not emit ldp.
> +@item 3
> +Emit ldp only if the source pointer is aligned to at least double the alignment
> +of the type.
> +@end table
> +The default value is 0.
> +
> +@item aarch64-stp-policy
> +Fine-grained policy for store pairs.  Accepts values from 0 to 3, inclusive.
> +@table @samp
> +@item 0
> +Use the policy of the tuning structure.
> +@item 1
> +Emit stp regardless of alignment.
> +@item 2
> +Do not emit stp.
> +@item 3
> +Emit stp only if the source pointer is aligned to at least double the alignment
> +of the type.
> +@end table
> +The default value is 0.
> +
>  @item aarch64-loop-vect-issue-rate-niters
>  The tuning for some AArch64 CPUs tries to take both latencies and issue
>  rates into account when deciding whether a loop should be vectorized
> diff --git a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c b/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> deleted file mode 100644
> index bc871f4481d..00000000000
> --- a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> +++ /dev/null
> @@ -1,11 +0,0 @@
> -/* { dg-options "-O3 -mtune=ampere1" } */
> -
> -long
> -foo (long a[])
> -{
> -  return a[0] + a[1];
> -}
> -
> -/* We should see two ldrs instead of one ldp. */
> -/* { dg-final { scan-assembler {\tldr\t} } } */
> -/* { dg-final { scan-assembler-not {\tldp\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> new file mode 100644
> index 00000000000..8e43faab70d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> @@ -0,0 +1,66 @@
> +/* { dg-options "-O2 --param=aarch64-ldp-policy=3 -mcpu=generic" } */
> +
> +#include <stdlib.h>
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +#define LDP_TEST_ALIGNED(TYPE) \
> +TYPE ldp_aligned_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    a_0 = arr[0]; \
> +    a_1 = arr[1]; \
> +    return a_0 + a_1; \
> +}
> +
> +#define LDP_TEST_UNALIGNED(TYPE) \
> +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a_0 = a[0]; \
> +    a_1 = a[1]; \
> +    return a_0 + a_1; \
> +}
> +
> +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1, a_2, a_3, a_4; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    a_0 = arr[100]; \
> +    a_1 = arr[101]; \
> +    a_2 = arr[102]; \
> +    a_3 = arr[103]; \
> +    a_4 = arr[110]; \
> +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> +}
> +
> +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1, a_2, a_3, a_4; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a_0 = a[100]; \
> +    a_1 = a[101]; \
> +    a_2 = a[102]; \
> +    a_3 = a[103]; \
> +    a_4 = a[110]; \
> +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> +}
> +
> +LDP_TEST_ALIGNED(int32_t);
> +LDP_TEST_ALIGNED(int64_t);
> +LDP_TEST_ALIGNED(v4si);
> +LDP_TEST_UNALIGNED(int32_t);
> +LDP_TEST_UNALIGNED(int64_t);
> +LDP_TEST_UNALIGNED(v4si);
> +LDP_TEST_ADJUST_ALIGNED(int32_t);
> +LDP_TEST_ADJUST_ALIGNED(int64_t);
> +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> +
> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_always.c b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> new file mode 100644
> index 00000000000..532ca607565
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> @@ -0,0 +1,66 @@
> +/* { dg-options "-O2 --param=aarch64-ldp-policy=1 -mcpu=generic" } */
> +
> +#include <stdlib.h>
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +#define LDP_TEST_ALIGNED(TYPE) \
> +TYPE ldp_aligned_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    a_0 = arr[0]; \
> +    a_1 = arr[1]; \
> +    return a_0 + a_1; \
> +}
> +
> +#define LDP_TEST_UNALIGNED(TYPE) \
> +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a_0 = a[0]; \
> +    a_1 = a[1]; \
> +    return a_0 + a_1; \
> +}
> +
> +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1, a_2, a_3, a_4; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    a_0 = arr[100]; \
> +    a_1 = arr[101]; \
> +    a_2 = arr[102]; \
> +    a_3 = arr[103]; \
> +    a_4 = arr[110]; \
> +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> +}
> +
> +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1, a_2, a_3, a_4; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a_0 = a[100]; \
> +    a_1 = a[101]; \
> +    a_2 = a[102]; \
> +    a_3 = a[103]; \
> +    a_4 = a[110]; \
> +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> +}
> +
> +LDP_TEST_ALIGNED(int32_t);
> +LDP_TEST_ALIGNED(int64_t);
> +LDP_TEST_ALIGNED(v4si);
> +LDP_TEST_UNALIGNED(int32_t);
> +LDP_TEST_UNALIGNED(int64_t);
> +LDP_TEST_UNALIGNED(v4si);
> +LDP_TEST_ADJUST_ALIGNED(int32_t);
> +LDP_TEST_ADJUST_ALIGNED(int64_t);
> +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> +
> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_never.c b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> new file mode 100644
> index 00000000000..b39941c18d7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> @@ -0,0 +1,66 @@
> +/* { dg-options "-O2 --param=aarch64-ldp-policy=2 -mcpu=generic" } */
> +
> +#include <stdlib.h>
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +#define LDP_TEST_ALIGNED(TYPE) \
> +TYPE ldp_aligned_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    a_0 = arr[0]; \
> +    a_1 = arr[1]; \
> +    return a_0 + a_1; \
> +}
> +
> +#define LDP_TEST_UNALIGNED(TYPE) \
> +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a_0 = a[0]; \
> +    a_1 = a[1]; \
> +    return a_0 + a_1; \
> +}
> +
> +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1, a_2, a_3, a_4; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    a_0 = arr[100]; \
> +    a_1 = arr[101]; \
> +    a_2 = arr[102]; \
> +    a_3 = arr[103]; \
> +    a_4 = arr[110]; \
> +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> +}
> +
> +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1, a_2, a_3, a_4; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a_0 = a[100]; \
> +    a_1 = a[101]; \
> +    a_2 = a[102]; \
> +    a_3 = a[103]; \
> +    a_4 = a[110]; \
> +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> +}
> +
> +LDP_TEST_ALIGNED(int32_t);
> +LDP_TEST_ALIGNED(int64_t);
> +LDP_TEST_ALIGNED(v4si);
> +LDP_TEST_UNALIGNED(int32_t);
> +LDP_TEST_UNALIGNED(int64_t);
> +LDP_TEST_UNALIGNED(v4si);
> +LDP_TEST_ADJUST_ALIGNED(int32_t);
> +LDP_TEST_ADJUST_ALIGNED(int64_t);
> +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> +
> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/stp_aligned.c b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> new file mode 100644
> index 00000000000..01f294bb090
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> @@ -0,0 +1,60 @@
> +/* { dg-options "-O2 --param=aarch64-stp-policy=3 -mcpu=generic" } */
> +
> +#include <stdlib.h>
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +#define STP_TEST_ALIGNED(TYPE) \
> +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    arr[0] = x; \
> +    arr[1] = x; \
> +    return arr; \
> +}
> +
> +#define STP_TEST_UNALIGNED(TYPE) \
> +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a[0] = x; \
> +    a[1] = x; \
> +    return a; \
> +}
> +
> +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    arr[100] = x; \
> +    arr[101] = x; \
> +    arr[102] = x; \
> +    arr[103] = x; \
> +    return arr; \
> +}
> +
> +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a[100] = x; \
> +    a[101] = x; \
> +    a[102] = x; \
> +    a[103] = x; \
> +    return a; \
> +}
> +
> +STP_TEST_ALIGNED(int32_t);
> +STP_TEST_ALIGNED(int64_t);
> +STP_TEST_ALIGNED(v4si);
> +STP_TEST_UNALIGNED(int32_t);
> +STP_TEST_UNALIGNED(int64_t);
> +STP_TEST_UNALIGNED(v4si);
> +STP_TEST_ADJUST_ALIGNED(int32_t);
> +STP_TEST_ADJUST_ALIGNED(int64_t);
> +STP_TEST_ADJUST_UNALIGNED(int32_t);
> +STP_TEST_ADJUST_UNALIGNED(int64_t);
> +
> +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/stp_always.c b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> new file mode 100644
> index 00000000000..cedb461b5b2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> @@ -0,0 +1,60 @@
> +/* { dg-options "-O2 --param=aarch64-stp-policy=1 -mcpu=generic" } */
> +
> +#include <stdlib.h>
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +#define STP_TEST_ALIGNED(TYPE) \
> +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    arr[0] = x; \
> +    arr[1] = x; \
> +    return arr; \
> +}
> +
> +#define STP_TEST_UNALIGNED(TYPE) \
> +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a[0] = x; \
> +    a[1] = x; \
> +    return a; \
> +}
> +
> +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    arr[100] = x; \
> +    arr[101] = x; \
> +    arr[102] = x; \
> +    arr[103] = x; \
> +    return arr; \
> +}
> +
> +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a[100] = x; \
> +    a[101] = x; \
> +    a[102] = x; \
> +    a[103] = x; \
> +    return a; \
> +}
> +
> +STP_TEST_ALIGNED(int32_t);
> +STP_TEST_ALIGNED(int64_t);
> +STP_TEST_ALIGNED(v4si);
> +STP_TEST_UNALIGNED(int32_t);
> +STP_TEST_UNALIGNED(int64_t);
> +STP_TEST_UNALIGNED(v4si);
> +STP_TEST_ADJUST_ALIGNED(int32_t);
> +STP_TEST_ADJUST_ALIGNED(int64_t);
> +STP_TEST_ADJUST_UNALIGNED(int32_t);
> +STP_TEST_ADJUST_UNALIGNED(int64_t);
> +
> +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/stp_never.c b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> new file mode 100644
> index 00000000000..ddde658f807
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> @@ -0,0 +1,60 @@
> +/* { dg-options "-O2 --param=aarch64-stp-policy=2 -mcpu=generic" } */
> +
> +#include <stdlib.h>
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +#define STP_TEST_ALIGNED(TYPE) \
> +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    arr[0] = x; \
> +    arr[1] = x; \
> +    return arr; \
> +}
> +
> +#define STP_TEST_UNALIGNED(TYPE) \
> +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a[0] = x; \
> +    a[1] = x; \
> +    return a; \
> +}
> +
> +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    arr[100] = x; \
> +    arr[101] = x; \
> +    arr[102] = x; \
> +    arr[103] = x; \
> +    return arr; \
> +}
> +
> +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a[100] = x; \
> +    a[101] = x; \
> +    a[102] = x; \
> +    a[103] = x; \
> +    return a; \
> +}
> +
> +STP_TEST_ALIGNED(int32_t);
> +STP_TEST_ALIGNED(int64_t);
> +STP_TEST_ALIGNED(v4si);
> +STP_TEST_UNALIGNED(int32_t);
> +STP_TEST_UNALIGNED(int64_t);
> +STP_TEST_UNALIGNED(v4si);
> +STP_TEST_ADJUST_ALIGNED(int32_t);
> +STP_TEST_ADJUST_ALIGNED(int64_t);
> +STP_TEST_ADJUST_UNALIGNED(int32_t);
> +STP_TEST_ADJUST_UNALIGNED(int64_t);
> +
> +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> +
> --
> 2.40.1
>
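
The "aligned" policy in the hunks above rejects a pair when `MEM_ALIGN` is below `2 * GET_MODE_ALIGNMENT (mode)`, and the tests build their aligned and unaligned cases by masking the incoming pointer and then optionally stepping one element past it. A minimal stand-alone sketch of that arithmetic follows. The helper names `align_down`, `alignment_of_address` and `pair_allowed_when_aligned` are hypothetical and not part of the patch. The sketch works in bytes and inspects run-time addresses, whereas the compiler compares bit alignments it can prove statically.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: round ADDR down to a multiple of ALIGN bytes.
   ALIGN must be a non-zero power of two, as in the tests' mask
   ~(2 * 8 * _Alignof(TYPE) - 1).  */
static uintptr_t
align_down (uintptr_t addr, size_t align)
{
  assert (align != 0 && (align & (align - 1)) == 0);
  return addr & ~((uintptr_t) align - 1);
}

/* Hypothetical helper: largest power of two that divides ADDR.
   A run-time stand-in for the alignment the compiler tracks statically.  */
static size_t
alignment_of_address (uintptr_t addr)
{
  assert (addr != 0);
  return (size_t) (addr & (~addr + 1));
}

/* Hypothetical model of the "aligned" policy: allow the pair only when
   the known alignment is at least twice the element alignment.  */
static bool
pair_allowed_when_aligned (size_t known_align, size_t elem_align)
{
  return known_align >= 2 * elem_align;
}
```

Assuming `_Alignof(int32_t)` is 4, the tests' mask rounds the pointer down to a 64-byte boundary, so `arr` passes the check, while `arr + 1` is only 4-byte aligned and fails it.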


* Re: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp formation.
  2023-09-25 19:54 ` Andrew Pinski
@ 2023-09-25 19:59   ` Philipp Tomsich
  2023-09-25 20:04     ` Andrew Pinski
  2023-09-25 20:00   ` Manos Anagnostakis
  1 sibling, 1 reply; 12+ messages in thread
From: Philipp Tomsich @ 2023-09-25 19:59 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: Manos Anagnostakis, gcc-patches, Kyrylo Tkachov

On Mon, 25 Sept 2023 at 21:54, Andrew Pinski <pinskia@gmail.com> wrote:
>
> On Mon, Sep 25, 2023 at 12:50 PM Manos Anagnostakis
> <manos.anagnostakis@vrull.eu> wrote:
> >
> > This patch implements the following TODO in gcc/config/aarch64/aarch64.cc
> > to provide the requested behaviour for handling ldp and stp:
> >
> >   /* Allow the tuning structure to disable LDP instruction formation
> >      from combining instructions (e.g., in peephole2).
> >      TODO: Implement fine-grained tuning control for LDP and STP:
> >            1. control policies for load and store separately;
> >            2. support the following policies:
> >               - default (use what is in the tuning structure)
> >               - always
> >               - never
> >               - aligned (only if the compiler can prove that the
> >                 load will be aligned to 2 * element_size)  */
> >
> > It provides two new and concrete target-specific command-line parameters
> > -param=aarch64-ldp-policy= and -param=aarch64-stp-policy=
> > > to give the ability to control load and store policies separately as
> > stated in part 1 of the TODO.
> >
> > The accepted values for both parameters are:
> > - 0: Use the policy of the tuning structure (default).
> > - 1: Emit ldp/stp regardless of alignment.
> > - 2: Do not emit ldp/stp.
> > - 3: In order to emit ldp/stp, first check if the load/store will
> >   be aligned to 2 * element_size.
>
> Instead of a number, does it make sense to use a string
> (ENUM) for this param?
> Also I think using --param is a bad idea if it is going to be
> documented in the user manual.
> Maybe a -m option should be used instead.

See https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631283.html
for the discussion triggering the change from -m... to --param and the
change to using a number instead of a string.
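
For readers following the numeric encoding, here is a minimal sketch of how the values 0 to 3 listed in the patch description could be resolved against a per-CPU default. It is not the patch's `aarch64_parse_ldp_policy` or `aarch64_parse_stp_policy`, whose bodies are not shown in this part of the thread. The names `pair_policy` and `resolve_policy` are hypothetical.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-direction policy enums in the patch.  */
enum pair_policy
{
  POLICY_ALIGNED,
  POLICY_ALWAYS,
  POLICY_NEVER
};

/* Map the numeric parameter onto a policy:
     0 -> keep the tuning structure's default,
     1 -> always, 2 -> never, 3 -> aligned.
   The option machinery is assumed to have enforced the 0..3 range
   (IntegerRange(0, 3) in aarch64.opt), so other values are a caller bug.  */
static enum pair_policy
resolve_policy (unsigned param, enum pair_policy tuning_default)
{
  assert (param <= 3);
  switch (param)
    {
    case 1:
      return POLICY_ALWAYS;
    case 2:
      return POLICY_NEVER;
    case 3:
      return POLICY_ALIGNED;
    default:
      return tuning_default;
    }
}
```

With the tunings in the patch, a value of 0 would leave thunderx at the aligned policy and generic at the always policy, while any non-zero value overrides both.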

Thanks,
Philipp.

>
> Thanks,
> Andrew
>
> >
> > gcc/ChangeLog:
> >         * config/aarch64/aarch64-protos.h (struct tune_params): Add
> >         appropriate enums for the policies.
> >         * config/aarch64/aarch64-tuning-flags.def
> >         (AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
> >         options.
> >         * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
> >         function to parse ldp-policy parameter.
> >         (aarch64_parse_stp_policy): New function to parse stp-policy parameter.
> >         (aarch64_override_options_internal): Call parsing functions.
> >         (aarch64_operands_ok_for_ldpstp): Add parameter-value check and
> >         alignment check and remove superseded ones.
> >         (aarch64_operands_adjust_ok_for_ldpstp): Add parameter-value check and
> >         alignment check and remove superseded ones.
> >         * config/aarch64/aarch64.opt: Add options.
> >         * doc/invoke.texi: Document the parameters accordingly.
> >
> > gcc/testsuite/ChangeLog:
> >         * gcc.target/aarch64/ampere1-no_ldp_combine.c: Removed.
> >         * gcc.target/aarch64/ldp_aligned.c: New test.
> >         * gcc.target/aarch64/ldp_always.c: New test.
> >         * gcc.target/aarch64/ldp_never.c: New test.
> >         * gcc.target/aarch64/stp_aligned.c: New test.
> >         * gcc.target/aarch64/stp_always.c: New test.
> >         * gcc.target/aarch64/stp_never.c: New test.
> >
> > Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> > ---
> > Changes in v3:
> >         - Changed command-line options to target-specific parameters
> >           and documented them accordingly in doc/invoke.texi.
> >         - Removed ampere1-no_ldp_combine.c test as superseded.
> >
> >  gcc/config/aarch64/aarch64-protos.h           |  24 ++
> >  gcc/config/aarch64/aarch64-tuning-flags.def   |   8 -
> >  gcc/config/aarch64/aarch64.cc                 | 215 +++++++++++++-----
> >  gcc/config/aarch64/aarch64.opt                |   8 +
> >  gcc/doc/invoke.texi                           |  30 +++
> >  .../aarch64/ampere1-no_ldp_combine.c          |  11 -
> >  .../gcc.target/aarch64/ldp_aligned.c          |  66 ++++++
> >  gcc/testsuite/gcc.target/aarch64/ldp_always.c |  66 ++++++
> >  gcc/testsuite/gcc.target/aarch64/ldp_never.c  |  66 ++++++
> >  .../gcc.target/aarch64/stp_aligned.c          |  60 +++++
> >  gcc/testsuite/gcc.target/aarch64/stp_always.c |  60 +++++
> >  gcc/testsuite/gcc.target/aarch64/stp_never.c  |  60 +++++
> >  12 files changed, 600 insertions(+), 74 deletions(-)
> >  delete mode 100644 gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_always.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_never.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_always.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_never.c
> >
> > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> > index 70303d6fd95..be1d73490ed 100644
> > --- a/gcc/config/aarch64/aarch64-protos.h
> > +++ b/gcc/config/aarch64/aarch64-protos.h
> > @@ -568,6 +568,30 @@ struct tune_params
> >    /* Place prefetch struct pointer at the end to enable type checking
> >       errors when tune_params misses elements (e.g., from erroneous merges).  */
> >    const struct cpu_prefetch_tune *prefetch;
> > +/* An enum specifying how to handle load pairs using a fine-grained policy:
> > +   - LDP_POLICY_ALIGNED: Emit ldp if the source pointer is aligned
> > +   to at least double the alignment of the type.
> > +   - LDP_POLICY_ALWAYS: Emit ldp regardless of alignment.
> > +   - LDP_POLICY_NEVER: Do not emit ldp.  */
> > +
> > +  enum aarch64_ldp_policy_model
> > +  {
> > +    LDP_POLICY_ALIGNED,
> > +    LDP_POLICY_ALWAYS,
> > +    LDP_POLICY_NEVER
> > +  } ldp_policy_model;
> > +/* An enum specifying how to handle store pairs using a fine-grained policy:
> > +   - STP_POLICY_ALIGNED: Emit stp if the source pointer is aligned
> > +   to at least double the alignment of the type.
> > +   - STP_POLICY_ALWAYS: Emit stp regardless of alignment.
> > +   - STP_POLICY_NEVER: Do not emit stp.  */
> > +
> > +  enum aarch64_stp_policy_model
> > +  {
> > +    STP_POLICY_ALIGNED,
> > +    STP_POLICY_ALWAYS,
> > +    STP_POLICY_NEVER
> > +  } stp_policy_model;
> >  };
> >
> >  /* Classifies an address.
> > diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
> > index 52112ba7c48..774568e9106 100644
> > --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> > +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> > @@ -30,11 +30,6 @@
> >
> >  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
> >
> > -/* Don't create non-8 byte aligned load/store pair.  That is if the
> > -two load/stores are not at least 8 byte aligned don't create load/store
> > -pairs.   */
> > -AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)
> > -
> >  /* Some of the optional shift to some arthematic instructions are
> >     considered cheap.  Logical shift left <=4 with or without a
> >     zero extend are considered cheap.  Sign extend; non logical shift left
> > @@ -44,9 +39,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
> >  /* Disallow load/store pair instructions on Q-registers.  */
> >  AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)
> >
> > -/* Disallow load-pair instructions to be formed in combine/peephole.  */
> > -AARCH64_EXTRA_TUNING_OPTION ("no_ldp_combine", NO_LDP_COMBINE)
> > -
> >  AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
> >
> >  AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 219c4ee6d4c..9eeb5469cf9 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -1357,7 +1357,9 @@ static const struct tune_params generic_tunings =
> >       Neoverse V1.  It does not have a noticeable effect on A64FX and should
> >       have at most a very minor effect on SVE2 cores.  */
> >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS),   /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params cortexa35_tunings =
> > @@ -1391,7 +1393,9 @@ static const struct tune_params cortexa35_tunings =
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params cortexa53_tunings =
> > @@ -1425,7 +1429,9 @@ static const struct tune_params cortexa53_tunings =
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params cortexa57_tunings =
> > @@ -1459,7 +1465,9 @@ static const struct tune_params cortexa57_tunings =
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),        /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params cortexa72_tunings =
> > @@ -1493,7 +1501,9 @@ static const struct tune_params cortexa72_tunings =
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params cortexa73_tunings =
> > @@ -1527,7 +1537,9 @@ static const struct tune_params cortexa73_tunings =
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >
> > @@ -1562,7 +1574,9 @@ static const struct tune_params exynosm1_tunings =
> >    48,  /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> > -  &exynosm1_prefetch_tune
> > +  &exynosm1_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params thunderxt88_tunings =
> > @@ -1594,8 +1608,10 @@ static const struct tune_params thunderxt88_tunings =
> >    2,   /* min_div_recip_mul_df.  */
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),    /* tune_flags.  */
> > -  &thunderxt88_prefetch_tune
> > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > +  &thunderxt88_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params thunderx_tunings =
> > @@ -1627,9 +1643,10 @@ static const struct tune_params thunderx_tunings =
> >    2,   /* min_div_recip_mul_df.  */
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW
> > -   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> > -  &thunderx_prefetch_tune
> > +  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /* tune_flags.  */
> > +  &thunderx_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params tsv110_tunings =
> > @@ -1663,7 +1680,9 @@ static const struct tune_params tsv110_tunings =
> >    0,    /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,     /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE),     /* tune_flags.  */
> > -  &tsv110_prefetch_tune
> > +  &tsv110_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params xgene1_tunings =
> > @@ -1696,7 +1715,9 @@ static const struct tune_params xgene1_tunings =
> >    17,  /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> > -  &xgene1_prefetch_tune
> > +  &xgene1_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params emag_tunings =
> > @@ -1729,7 +1750,9 @@ static const struct tune_params emag_tunings =
> >    17,  /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> > -  &xgene1_prefetch_tune
> > +  &xgene1_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params qdf24xx_tunings =
> > @@ -1763,7 +1786,9 @@ static const struct tune_params qdf24xx_tunings =
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >    AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS, /* tune_flags.  */
> > -  &qdf24xx_prefetch_tune
> > +  &qdf24xx_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  /* Tuning structure for the Qualcomm Saphira core.  Default to falkor values
> > @@ -1799,7 +1824,9 @@ static const struct tune_params saphira_tunings =
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE),           /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params thunderx2t99_tunings =
> > @@ -1833,7 +1860,9 @@ static const struct tune_params thunderx2t99_tunings =
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > -  &thunderx2t99_prefetch_tune
> > +  &thunderx2t99_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params thunderx3t110_tunings =
> > @@ -1867,7 +1896,9 @@ static const struct tune_params thunderx3t110_tunings =
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > -  &thunderx3t110_prefetch_tune
> > +  &thunderx3t110_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params neoversen1_tunings =
> > @@ -1900,7 +1931,9 @@ static const struct tune_params neoversen1_tunings =
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params ampere1_tunings =
> > @@ -1936,8 +1969,10 @@ static const struct tune_params ampere1_tunings =
> >    2,   /* min_div_recip_mul_df.  */
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> > -  &ampere1_prefetch_tune
> > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > +  &ampere1_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params ampere1a_tunings =
> > @@ -1974,8 +2009,10 @@ static const struct tune_params ampere1a_tunings =
> >    2,   /* min_div_recip_mul_df.  */
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> > -  &ampere1_prefetch_tune
> > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > +  &ampere1_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> >  };
> >
> >  static const advsimd_vec_cost neoversev1_advsimd_vector_cost =
> > @@ -2156,7 +2193,9 @@ static const struct tune_params neoversev1_tunings =
> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >     | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const sve_vec_cost neoverse512tvb_sve_vector_cost =
> > @@ -2293,7 +2332,9 @@ static const struct tune_params neoverse512tvb_tunings =
> >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const advsimd_vec_cost neoversen2_advsimd_vector_cost =
> > @@ -2483,7 +2524,9 @@ static const struct tune_params neoversen2_tunings =
> >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const advsimd_vec_cost neoversev2_advsimd_vector_cost =
> > @@ -2673,7 +2716,9 @@ static const struct tune_params neoversev2_tunings =
> >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params a64fx_tunings =
> > @@ -2706,7 +2751,9 @@ static const struct tune_params a64fx_tunings =
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > -  &a64fx_prefetch_tune
> > +  &a64fx_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  /* Support for fine-grained override of the tuning structures.  */
> > @@ -17819,6 +17866,34 @@ aarch64_parse_tune (const char *to_parse, const struct processor **res)
> >    return AARCH_PARSE_INVALID_ARG;
> >  }
> >
> > +/* Parse a command-line -param=aarch64-ldp-policy= parameter.  VALUE is
> > +   the value of the parameter.  */
> > +
> > +static void
> > +aarch64_parse_ldp_policy (const unsigned int value, struct tune_params* tune)
> > +{
> > +  if (value == 1)
> > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALWAYS;
> > +  else if (value == 2)
> > +    tune->ldp_policy_model = tune_params::LDP_POLICY_NEVER;
> > +  else if (value == 3)
> > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALIGNED;
> > +}
> > +
> > +/* Parse a command-line -param=aarch64-stp-policy= parameter.  VALUE is
> > +   the value of the parameter.  */
> > +
> > +static void
> > +aarch64_parse_stp_policy (const unsigned int value, struct tune_params* tune)
> > +{
> > +  if (value == 1)
> > +    tune->stp_policy_model = tune_params::STP_POLICY_ALWAYS;
> > +  else if (value == 2)
> > +    tune->stp_policy_model = tune_params::STP_POLICY_NEVER;
> > +  else if (value == 3)
> > +    tune->stp_policy_model = tune_params::STP_POLICY_ALIGNED;
> > +}
> > +
> >  /* Parse TOKEN, which has length LENGTH to see if it is an option
> >     described in FLAG.  If it is, return the index bit for that fusion type.
> >     If not, error (printing OPTION_NAME) and return zero.  */
> > @@ -18167,6 +18242,12 @@ aarch64_override_options_internal (struct gcc_options *opts)
> >      aarch64_parse_override_string (opts->x_aarch64_override_tune_string,
> >                                    &aarch64_tune_params);
> >
> > +  aarch64_parse_ldp_policy (aarch64_ldp_policy,
> > +                             &aarch64_tune_params);
> > +
> > +  aarch64_parse_stp_policy (aarch64_stp_policy,
> > +                             &aarch64_tune_params);
> > +
> >    /* This target defaults to strict volatile bitfields.  */
> >    if (opts->x_flag_strict_volatile_bitfields < 0 && abi_version_at_least (2))
> >      opts->x_flag_strict_volatile_bitfields = 1;
> > @@ -26468,18 +26549,14 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
> >    enum reg_class rclass_1, rclass_2;
> >    rtx mem_1, mem_2, reg_1, reg_2;
> >
> > -  /* Allow the tuning structure to disable LDP instruction formation
> > -     from combining instructions (e.g., in peephole2).
> > -     TODO: Implement fine-grained tuning control for LDP and STP:
> > -          1. control policies for load and store separately;
> > -          2. support the following policies:
> > -             - default (use what is in the tuning structure)
> > -             - always
> > -             - never
> > -             - aligned (only if the compiler can prove that the
> > -               load will be aligned to 2 * element_size)  */
> > -  if (load && (aarch64_tune_params.extra_tuning_flags
> > -              & AARCH64_EXTRA_TUNE_NO_LDP_COMBINE))
> > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> > +  if (load
> > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> > +    return false;
> > +
> > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> > +  if (!load
> > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
> >      return false;
> >
> >    if (load)
> > @@ -26506,13 +26583,22 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
> >    if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
> >      return false;
> >
> > -  /* If we have SImode and slow unaligned ldp,
> > -     check the alignment to be at least 8 byte. */
> > -  if (mode == SImode
> > -      && (aarch64_tune_params.extra_tuning_flags
> > -          & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> > -      && !optimize_size
> > -      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
> > +  /* If we have LDP_POLICY_ALIGNED,
> > +     do not emit the load pair unless the alignment is checked to be
> > +     at least double the alignment of the type.  */
> > +  if (load
> > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> > +      && !optimize_function_for_size_p (cfun)
> > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> > +    return false;
> > +
> > +  /* If we have STP_POLICY_ALIGNED,
> > +     do not emit the store pair unless the alignment is checked to be
> > +     at least double the alignment of the type.  */
> > +  if (!load
> > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> > +      && !optimize_function_for_size_p (cfun)
> > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> >      return false;
> >
> >    /* Check if the addresses are in the form of [base+offset].  */
> > @@ -26640,6 +26726,16 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
> >    HOST_WIDE_INT offvals[num_insns], msize;
> >    rtx mem[num_insns], reg[num_insns], base[num_insns], offset[num_insns];
> >
> > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> > +  if (load
> > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> > +    return false;
> > +
> > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> > +  if (!load
> > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
> > +    return false;
> > +
> >    if (load)
> >      {
> >        for (int i = 0; i < num_insns; i++)
> > @@ -26729,13 +26825,22 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
> >    if (offvals[0] % msize != offvals[2] % msize)
> >      return false;
> >
> > -  /* If we have SImode and slow unaligned ldp,
> > -     check the alignment to be at least 8 byte. */
> > -  if (mode == SImode
> > -      && (aarch64_tune_params.extra_tuning_flags
> > -         & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> > -      && !optimize_size
> > -      && MEM_ALIGN (mem[0]) < 8 * BITS_PER_UNIT)
> > +  /* If we have LDP_POLICY_ALIGNED,
> > +     do not emit the load pair unless the alignment is checked to be
> > +     at least double the alignment of the type.  */
> > +  if (load
> > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> > +      && !optimize_function_for_size_p (cfun)
> > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> > +    return false;
> > +
> > +  /* If we have STP_POLICY_ALIGNED,
> > +     do not emit the store pair unless the alignment is checked to be
> > +     at least double the alignment of the type.  */
> > +  if (!load
> > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> > +      && !optimize_function_for_size_p (cfun)
> > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> >      return false;
> >
> >    return true;
> > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> > index 4a0580435a8..f61e3f968d4 100644
> > --- a/gcc/config/aarch64/aarch64.opt
> > +++ b/gcc/config/aarch64/aarch64.opt
> > @@ -337,3 +337,11 @@ Constant memset size in bytes from which to start using MOPS sequence.
> >  -param=aarch64-vect-unroll-limit=
> >  Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param
> >  Limit how much the autovectorizer may unroll a loop.
> > +
> > +-param=aarch64-ldp-policy=
> > +Target Joined UInteger Var(aarch64_ldp_policy) Init(0) IntegerRange(0, 3) Param
> > +Fine-grained policy for load pairs.
> > +
> > +-param=aarch64-stp-policy=
> > +Target Joined UInteger Var(aarch64_stp_policy) Init(0) IntegerRange(0, 3) Param
> > +Fine-grained policy for store pairs.
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index 146b40414b0..000dd5541f4 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -16508,6 +16508,36 @@ Use both Advanced SIMD and SVE.  Prefer SVE when the costs are deemed equal.
> >  @end table
> >  The default value is 0.
> >
> > +@item aarch64-ldp-policy
> > +Fine-grained policy for load pairs.  Accepts values from 0 to 3, inclusive.
> > +@table @samp
> > +@item 0
> > +Use the policy of the tuning structure.
> > +@item 1
> > +Emit ldp regardless of alignment.
> > +@item 2
> > +Do not emit ldp.
> > +@item 3
> > +Emit ldp only if the source pointer is aligned to at least double the alignment
> > +of the type.
> > +@end table
> > +The default value is 0.
> > +
> > +@item aarch64-stp-policy
> > +Fine-grained policy for store pairs.  Accepts values from 0 to 3, inclusive.
> > +@table @samp
> > +@item 0
> > +Use the policy of the tuning structure.
> > +@item 1
> > +Emit stp regardless of alignment.
> > +@item 2
> > +Do not emit stp.
> > +@item 3
> > +Emit stp only if the destination pointer is aligned to at least double the
> > +alignment of the type.
> > +@end table
> > +The default value is 0.
> > +
> >  @item aarch64-loop-vect-issue-rate-niters
> >  The tuning for some AArch64 CPUs tries to take both latencies and issue
> >  rates into account when deciding whether a loop should be vectorized
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c b/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > deleted file mode 100644
> > index bc871f4481d..00000000000
> > --- a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > +++ /dev/null
> > @@ -1,11 +0,0 @@
> > -/* { dg-options "-O3 -mtune=ampere1" } */
> > -
> > -long
> > -foo (long a[])
> > -{
> > -  return a[0] + a[1];
> > -}
> > -
> > -/* We should see two ldrs instead of one ldp. */
> > -/* { dg-final { scan-assembler {\tldr\t} } } */
> > -/* { dg-final { scan-assembler-not {\tldp\t} } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > new file mode 100644
> > index 00000000000..8e43faab70d
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > @@ -0,0 +1,66 @@
> > +/* { dg-options "-O2 --param=aarch64-ldp-policy=3 -mcpu=generic" } */
> > +
> > +#include <stdlib.h>
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +#define LDP_TEST_ALIGNED(TYPE) \
> > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    a_0 = arr[0]; \
> > +    a_1 = arr[1]; \
> > +    return a_0 + a_1; \
> > +}
> > +
> > +#define LDP_TEST_UNALIGNED(TYPE) \
> > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a_0 = a[0]; \
> > +    a_1 = a[1]; \
> > +    return a_0 + a_1; \
> > +}
> > +
> > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    a_0 = arr[100]; \
> > +    a_1 = arr[101]; \
> > +    a_2 = arr[102]; \
> > +    a_3 = arr[103]; \
> > +    a_4 = arr[110]; \
> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > +}
> > +
> > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a_0 = a[100]; \
> > +    a_1 = a[101]; \
> > +    a_2 = a[102]; \
> > +    a_3 = a[103]; \
> > +    a_4 = a[110]; \
> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > +}
> > +
> > +LDP_TEST_ALIGNED(int32_t);
> > +LDP_TEST_ALIGNED(int64_t);
> > +LDP_TEST_ALIGNED(v4si);
> > +LDP_TEST_UNALIGNED(int32_t);
> > +LDP_TEST_UNALIGNED(int64_t);
> > +LDP_TEST_UNALIGNED(v4si);
> > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > +
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_always.c b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > new file mode 100644
> > index 00000000000..532ca607565
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > @@ -0,0 +1,66 @@
> > +/* { dg-options "-O2 --param=aarch64-ldp-policy=1 -mcpu=generic" } */
> > +
> > +#include <stdlib.h>
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +#define LDP_TEST_ALIGNED(TYPE) \
> > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    a_0 = arr[0]; \
> > +    a_1 = arr[1]; \
> > +    return a_0 + a_1; \
> > +}
> > +
> > +#define LDP_TEST_UNALIGNED(TYPE) \
> > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a_0 = a[0]; \
> > +    a_1 = a[1]; \
> > +    return a_0 + a_1; \
> > +}
> > +
> > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    a_0 = arr[100]; \
> > +    a_1 = arr[101]; \
> > +    a_2 = arr[102]; \
> > +    a_3 = arr[103]; \
> > +    a_4 = arr[110]; \
> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > +}
> > +
> > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a_0 = a[100]; \
> > +    a_1 = a[101]; \
> > +    a_2 = a[102]; \
> > +    a_3 = a[103]; \
> > +    a_4 = a[110]; \
> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > +}
> > +
> > +LDP_TEST_ALIGNED(int32_t);
> > +LDP_TEST_ALIGNED(int64_t);
> > +LDP_TEST_ALIGNED(v4si);
> > +LDP_TEST_UNALIGNED(int32_t);
> > +LDP_TEST_UNALIGNED(int64_t);
> > +LDP_TEST_UNALIGNED(v4si);
> > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> > +
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_never.c b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > new file mode 100644
> > index 00000000000..b39941c18d7
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > @@ -0,0 +1,66 @@
> > +/* { dg-options "-O2 --param=aarch64-ldp-policy=2 -mcpu=generic" } */
> > +
> > +#include <stdlib.h>
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +#define LDP_TEST_ALIGNED(TYPE) \
> > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    a_0 = arr[0]; \
> > +    a_1 = arr[1]; \
> > +    return a_0 + a_1; \
> > +}
> > +
> > +#define LDP_TEST_UNALIGNED(TYPE) \
> > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a_0 = a[0]; \
> > +    a_1 = a[1]; \
> > +    return a_0 + a_1; \
> > +}
> > +
> > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    a_0 = arr[100]; \
> > +    a_1 = arr[101]; \
> > +    a_2 = arr[102]; \
> > +    a_3 = arr[103]; \
> > +    a_4 = arr[110]; \
> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > +}
> > +
> > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a_0 = a[100]; \
> > +    a_1 = a[101]; \
> > +    a_2 = a[102]; \
> > +    a_3 = a[103]; \
> > +    a_4 = a[110]; \
> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > +}
> > +
> > +LDP_TEST_ALIGNED(int32_t);
> > +LDP_TEST_ALIGNED(int64_t);
> > +LDP_TEST_ALIGNED(v4si);
> > +LDP_TEST_UNALIGNED(int32_t);
> > +LDP_TEST_UNALIGNED(int64_t);
> > +LDP_TEST_UNALIGNED(v4si);
> > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > +
> > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_aligned.c b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > new file mode 100644
> > index 00000000000..01f294bb090
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > @@ -0,0 +1,60 @@
> > +/* { dg-options "-O2 --param=aarch64-stp-policy=3 -mcpu=generic" } */
> > +
> > +#include <stdlib.h>
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +#define STP_TEST_ALIGNED(TYPE) \
> > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    arr[0] = x; \
> > +    arr[1] = x; \
> > +    return arr; \
> > +}
> > +
> > +#define STP_TEST_UNALIGNED(TYPE) \
> > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a[0] = x; \
> > +    a[1] = x; \
> > +    return a; \
> > +}
> > +
> > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    arr[100] = x; \
> > +    arr[101] = x; \
> > +    arr[102] = x; \
> > +    arr[103] = x; \
> > +    return arr; \
> > +}
> > +
> > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a[100] = x; \
> > +    a[101] = x; \
> > +    a[102] = x; \
> > +    a[103] = x; \
> > +    return a; \
> > +}
> > +
> > +STP_TEST_ALIGNED(int32_t);
> > +STP_TEST_ALIGNED(int64_t);
> > +STP_TEST_ALIGNED(v4si);
> > +STP_TEST_UNALIGNED(int32_t);
> > +STP_TEST_UNALIGNED(int64_t);
> > +STP_TEST_UNALIGNED(v4si);
> > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > +
> > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > +
> > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_always.c b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> > new file mode 100644
> > index 00000000000..cedb461b5b2
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> > @@ -0,0 +1,60 @@
> > +/* { dg-options "-O2 --param=aarch64-stp-policy=1 -mcpu=generic" } */
> > +
> > +#include <stdlib.h>
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +#define STP_TEST_ALIGNED(TYPE) \
> > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    arr[0] = x; \
> > +    arr[1] = x; \
> > +    return arr; \
> > +}
> > +
> > +#define STP_TEST_UNALIGNED(TYPE) \
> > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a[0] = x; \
> > +    a[1] = x; \
> > +    return a; \
> > +}
> > +
> > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    arr[100] = x; \
> > +    arr[101] = x; \
> > +    arr[102] = x; \
> > +    arr[103] = x; \
> > +    return arr; \
> > +}
> > +
> > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a[100] = x; \
> > +    a[101] = x; \
> > +    a[102] = x; \
> > +    a[103] = x; \
> > +    return a; \
> > +}
> > +
> > +STP_TEST_ALIGNED(int32_t);
> > +STP_TEST_ALIGNED(int64_t);
> > +STP_TEST_ALIGNED(v4si);
> > +STP_TEST_UNALIGNED(int32_t);
> > +STP_TEST_UNALIGNED(int64_t);
> > +STP_TEST_UNALIGNED(v4si);
> > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > +
> > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> > +
> > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_never.c b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> > new file mode 100644
> > index 00000000000..ddde658f807
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> > @@ -0,0 +1,60 @@
> > +/* { dg-options "-O2 --param=aarch64-stp-policy=2 -mcpu=generic" } */
> > +
> > +#include <stdlib.h>
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +#define STP_TEST_ALIGNED(TYPE) \
> > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    arr[0] = x; \
> > +    arr[1] = x; \
> > +    return arr; \
> > +}
> > +
> > +#define STP_TEST_UNALIGNED(TYPE) \
> > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a[0] = x; \
> > +    a[1] = x; \
> > +    return a; \
> > +}
> > +
> > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    arr[100] = x; \
> > +    arr[101] = x; \
> > +    arr[102] = x; \
> > +    arr[103] = x; \
> > +    return arr; \
> > +}
> > +
> > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a[100] = x; \
> > +    a[101] = x; \
> > +    a[102] = x; \
> > +    a[103] = x; \
> > +    return a; \
> > +}
> > +
> > +STP_TEST_ALIGNED(int32_t);
> > +STP_TEST_ALIGNED(int64_t);
> > +STP_TEST_ALIGNED(v4si);
> > +STP_TEST_UNALIGNED(int32_t);
> > +STP_TEST_UNALIGNED(int64_t);
> > +STP_TEST_UNALIGNED(v4si);
> > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > +
> > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > +
> > --
> > 2.40.1
> >
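
A note on the pointer arithmetic in the test macros quoted above: they round ptr down to a multiple of 2 * 8 * _Alignof(TYPE) bytes, so the "aligned" variants start from a pointer the compiler can prove is over-aligned, while the "unaligned" variants step one element past it. The sketch below restates that idea in C++ as I read it; the helper names round_down_for_pair and aligned_for_pair are hypothetical and appear nowhere in the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the test macros: round PTR down to a
// multiple of 2 * 8 * alignof(T) bytes by clearing the low address bits.
template <typename T>
T *round_down_for_pair(char *ptr)
{
  const std::uintptr_t mask = 2 * 8 * alignof(T) - 1;
  return reinterpret_cast<T *>(reinterpret_cast<std::uintptr_t>(ptr) & ~mask);
}

// Hypothetical predicate: true if P is aligned to at least twice the
// alignment of T, which is the condition the "aligned" policy asks for.
template <typename T>
bool aligned_for_pair(const T *p)
{
  return reinterpret_cast<std::uintptr_t>(p) % (2 * alignof(T)) == 0;
}
```

With this, round_down_for_pair<T>(ptr) satisfies aligned_for_pair, and for the element types used in the tests the pointer one element further on does not, which is what separates the "aligned" and "unaligned" cases. None of the pointers are dereferenced here; only the addresses are inspected.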


* Re: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp formation.
  2023-09-25 19:54 ` Andrew Pinski
  2023-09-25 19:59   ` Philipp Tomsich
@ 2023-09-25 20:00   ` Manos Anagnostakis
  1 sibling, 0 replies; 12+ messages in thread
From: Manos Anagnostakis @ 2023-09-25 20:00 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: gcc-patches, Kyrylo Tkachov, Philipp Tomsich


Hello Andrew,

What you describe is what my previous version did, but @Kyrylo Tkachov
<kyrylo.tkachov@arm.com> prompted me to use -param instead.

Thank you for taking a look anyway!

Manos Anagnostakis | Compiler Engineer
| E: manos.anagnostakis@vrull.eu

VRULL GmbH | Beatrixgasse 32 1030 Vienna | W: www.vrull.eu

On Mon, 25 Sep 2023, 22:54, Andrew Pinski <pinskia@gmail.com>
wrote:

> On Mon, Sep 25, 2023 at 12:50 PM Manos Anagnostakis
> <manos.anagnostakis@vrull.eu> wrote:
> >
> > This patch implements the following TODO in gcc/config/aarch64/aarch64.cc
> > to provide the requested behaviour for handling ldp and stp:
> >
> >   /* Allow the tuning structure to disable LDP instruction formation
> >      from combining instructions (e.g., in peephole2).
> >      TODO: Implement fine-grained tuning control for LDP and STP:
> >            1. control policies for load and store separately;
> >            2. support the following policies:
> >               - default (use what is in the tuning structure)
> >               - always
> >               - never
> >               - aligned (only if the compiler can prove that the
> >                 load will be aligned to 2 * element_size)  */
> >
> > It provides two new and concrete target-specific command-line parameters
> > -param=aarch64-ldp-policy= and -param=aarch64-stp-policy=
> > to give the ability to control load and store policies separately as
> > stated in part 1 of the TODO.
> >
> > The accepted values for both parameters are:
> > - 0: Use the policy of the tuning structure (default).
> > - 1: Emit ldp/stp regardless of alignment.
> > - 2: Do not emit ldp/stp.
> > - 3: In order to emit ldp/stp, first check if the load/store will
> >   be aligned to 2 * element_size.
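
For readers following along, the value list above together with the checks in the quoted patch can be summarised in a small standalone sketch. It is only an illustration under stated assumptions: PairPolicy, resolve_policy and pair_allowed are hypothetical names, the alignments are plain integers in bits (standing in for MEM_ALIGN and GET_MODE_ALIGNMENT), and out-of-range values simply fall back to the default here, whereas the patch restricts the range in aarch64.opt.

```cpp
#include <cassert>

// Hypothetical stand-in for the per-direction policy enums the patch
// adds to tune_params.
enum class PairPolicy { Always, Never, Aligned };

// Resolve a parameter value (0..3) against the tuning-structure default.
// Assumption: anything other than 1, 2 or 3 keeps the default.
PairPolicy resolve_policy(unsigned value, PairPolicy tuning_default)
{
  switch (value) {
    case 1: return PairPolicy::Always;
    case 2: return PairPolicy::Never;
    case 3: return PairPolicy::Aligned;
    default: return tuning_default;  // 0: use the tuning structure
  }
}

// Decide whether a pair may be formed.  Alignments are in bits.
bool pair_allowed(PairPolicy policy, unsigned known_align_bits,
                  unsigned mode_align_bits, bool optimize_for_size)
{
  if (policy == PairPolicy::Never)
    return false;
  // The aligned policy only rejects when not optimizing for size and the
  // provable alignment is below twice the mode alignment.
  if (policy == PairPolicy::Aligned
      && !optimize_for_size
      && known_align_bits < 2 * mode_align_bits)
    return false;
  return true;
}
```

For example, with a 32-bit element, a known alignment of 32 bits is rejected under the aligned policy while 64 bits is accepted; when optimizing for size the alignment test is skipped, as in the patch.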
>
> Instead of a number, would it make sense to use a string
> (ENUM) for this param?
> Also I think using --param is a bad idea if it is going to be
> documented in the user manual.
> Maybe a -m option should be used instead.
>
> Thanks,
> Andrew
>
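
To make the suggestion above concrete, here is a rough sketch of what accepting names instead of numbers could look like. This is not what the v3 patch does; the function parse_policy_name and the enum PolicyName are hypothetical, and the four spellings are simply taken from the policy list in the TODO comment (default, always, never, aligned). It assumes C++17 for std::optional and std::string_view.

```cpp
#include <cassert>
#include <optional>
#include <string_view>

// Hypothetical enum covering the four policies named in the TODO.
enum class PolicyName { Default, Always, Never, Aligned };

// Map a policy spelling to its enum value; unknown spellings yield
// std::nullopt so the caller can report an error.
std::optional<PolicyName> parse_policy_name(std::string_view s)
{
  if (s == "default") return PolicyName::Default;
  if (s == "always")  return PolicyName::Always;
  if (s == "never")   return PolicyName::Never;
  if (s == "aligned") return PolicyName::Aligned;
  return std::nullopt;
}
```

A caller would treat Default the same way value 0 is treated today, that is, leave the tuning structure's choice untouched.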
> >
> > gcc/ChangeLog:
> >         * config/aarch64/aarch64-protos.h (struct tune_params): Add
> >         appropriate enums for the policies.
> >         * config/aarch64/aarch64-tuning-flags.def
> >         (AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
> >         options.
> >         * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
> >         function to parse ldp-policy parameter.
> >         (aarch64_parse_stp_policy): New function to parse stp-policy parameter.
> >         (aarch64_override_options_internal): Call parsing functions.
> >         (aarch64_operands_ok_for_ldpstp): Add parameter-value check and
> >         alignment check and remove superseded ones.
> >         (aarch64_operands_adjust_ok_for_ldpstp): Add parameter-value check and
> >         alignment check and remove superseded ones.
> >         * config/aarch64/aarch64.opt: Add options.
> >         * doc/invoke.texi: Document the parameters accordingly.
> >
> > gcc/testsuite/ChangeLog:
> >         * gcc.target/aarch64/ampere1-no_ldp_combine.c: Removed.
> >         * gcc.target/aarch64/ldp_aligned.c: New test.
> >         * gcc.target/aarch64/ldp_always.c: New test.
> >         * gcc.target/aarch64/ldp_never.c: New test.
> >         * gcc.target/aarch64/stp_aligned.c: New test.
> >         * gcc.target/aarch64/stp_always.c: New test.
> >         * gcc.target/aarch64/stp_never.c: New test.
> >
> > Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> > ---
> > Changes in v3:
> >         - Changed command-line options to target-specific parameters
> >           and documented them accordingly in doc/invoke.texi.
> >         - Removed ampere1-no_ldp_combine.c test as superseded.
> >
> >  gcc/config/aarch64/aarch64-protos.h           |  24 ++
> >  gcc/config/aarch64/aarch64-tuning-flags.def   |   8 -
> >  gcc/config/aarch64/aarch64.cc                 | 215 +++++++++++++-----
> >  gcc/config/aarch64/aarch64.opt                |   8 +
> >  gcc/doc/invoke.texi                           |  30 +++
> >  .../aarch64/ampere1-no_ldp_combine.c          |  11 -
> >  .../gcc.target/aarch64/ldp_aligned.c          |  66 ++++++
> >  gcc/testsuite/gcc.target/aarch64/ldp_always.c |  66 ++++++
> >  gcc/testsuite/gcc.target/aarch64/ldp_never.c  |  66 ++++++
> >  .../gcc.target/aarch64/stp_aligned.c          |  60 +++++
> >  gcc/testsuite/gcc.target/aarch64/stp_always.c |  60 +++++
> >  gcc/testsuite/gcc.target/aarch64/stp_never.c  |  60 +++++
> >  12 files changed, 600 insertions(+), 74 deletions(-)
> >  delete mode 100644 gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_always.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_never.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_always.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_never.c
> >
> > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> > index 70303d6fd95..be1d73490ed 100644
> > --- a/gcc/config/aarch64/aarch64-protos.h
> > +++ b/gcc/config/aarch64/aarch64-protos.h
> > @@ -568,6 +568,30 @@ struct tune_params
> >    /* Place prefetch struct pointer at the end to enable type checking
> >       errors when tune_params misses elements (e.g., from erroneous merges).  */
> >    const struct cpu_prefetch_tune *prefetch;
> > +/* An enum specifying how to handle load pairs using a fine-grained policy:
> > +   - LDP_POLICY_ALIGNED: Emit ldp if the source pointer is aligned
> > +   to at least double the alignment of the type.
> > +   - LDP_POLICY_ALWAYS: Emit ldp regardless of alignment.
> > +   - LDP_POLICY_NEVER: Do not emit ldp.  */
> > +
> > +  enum aarch64_ldp_policy_model
> > +  {
> > +    LDP_POLICY_ALIGNED,
> > +    LDP_POLICY_ALWAYS,
> > +    LDP_POLICY_NEVER
> > +  } ldp_policy_model;
> > +/* An enum specifying how to handle store pairs using a fine-grained policy:
> > +   - STP_POLICY_ALIGNED: Emit stp if the source pointer is aligned
> > +   to at least double the alignment of the type.
> > +   - STP_POLICY_ALWAYS: Emit stp regardless of alignment.
> > +   - STP_POLICY_NEVER: Do not emit stp.  */
> > +
> > +  enum aarch64_stp_policy_model
> > +  {
> > +    STP_POLICY_ALIGNED,
> > +    STP_POLICY_ALWAYS,
> > +    STP_POLICY_NEVER
> > +  } stp_policy_model;
> >  };
> >
> >  /* Classifies an address.
> > diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
> > index 52112ba7c48..774568e9106 100644
> > --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> > +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> > @@ -30,11 +30,6 @@
> >
> >  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
> >
> > -/* Don't create non-8 byte aligned load/store pair.  That is if the
> > -two load/stores are not at least 8 byte aligned don't create load/store
> > -pairs.   */
> > -AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)
> > -
> >  /* Some of the optional shift to some arthematic instructions are
> >     considered cheap.  Logical shift left <=4 with or without a
> >     zero extend are considered cheap.  Sign extend; non logical shift left
> > @@ -44,9 +39,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
> >  /* Disallow load/store pair instructions on Q-registers.  */
> >  AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)
> >
> > -/* Disallow load-pair instructions to be formed in combine/peephole.  */
> > -AARCH64_EXTRA_TUNING_OPTION ("no_ldp_combine", NO_LDP_COMBINE)
> > -
> >  AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
> >
> >  AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 219c4ee6d4c..9eeb5469cf9 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -1357,7 +1357,9 @@ static const struct tune_params generic_tunings =
> >       Neoverse V1.  It does not have a noticeable effect on A64FX and should
> >       have at most a very minor effect on SVE2 cores.  */
> >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS),   /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params cortexa35_tunings =
> > @@ -1391,7 +1393,9 @@ static const struct tune_params cortexa35_tunings =
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params cortexa53_tunings =
> > @@ -1425,7 +1429,9 @@ static const struct tune_params cortexa53_tunings =
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params cortexa57_tunings =
> > @@ -1459,7 +1465,9 @@ static const struct tune_params cortexa57_tunings =
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),        /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params cortexa72_tunings =
> > @@ -1493,7 +1501,9 @@ static const struct tune_params cortexa72_tunings =
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params cortexa73_tunings =
> > @@ -1527,7 +1537,9 @@ static const struct tune_params cortexa73_tunings =
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >
> > @@ -1562,7 +1574,9 @@ static const struct tune_params exynosm1_tunings =
> >    48,  /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> > -  &exynosm1_prefetch_tune
> > +  &exynosm1_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params thunderxt88_tunings =
> > @@ -1594,8 +1608,10 @@ static const struct tune_params thunderxt88_tunings =
> >    2,   /* min_div_recip_mul_df.  */
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),    /* tune_flags.  */
> > -  &thunderxt88_prefetch_tune
> > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > +  &thunderxt88_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params thunderx_tunings =
> > @@ -1627,9 +1643,10 @@ static const struct tune_params thunderx_tunings =
> >    2,   /* min_div_recip_mul_df.  */
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW
> > -   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> > -  &thunderx_prefetch_tune
> > +  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /* tune_flags.  */
> > +  &thunderx_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params tsv110_tunings =
> > @@ -1663,7 +1680,9 @@ static const struct tune_params tsv110_tunings =
> >    0,    /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,     /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE),     /* tune_flags.  */
> > -  &tsv110_prefetch_tune
> > +  &tsv110_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params xgene1_tunings =
> > @@ -1696,7 +1715,9 @@ static const struct tune_params xgene1_tunings =
> >    17,  /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> > -  &xgene1_prefetch_tune
> > +  &xgene1_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params emag_tunings =
> > @@ -1729,7 +1750,9 @@ static const struct tune_params emag_tunings =
> >    17,  /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> > -  &xgene1_prefetch_tune
> > +  &xgene1_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params qdf24xx_tunings =
> > @@ -1763,7 +1786,9 @@ static const struct tune_params qdf24xx_tunings =
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >    AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS, /* tune_flags.  */
> > -  &qdf24xx_prefetch_tune
> > +  &qdf24xx_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  /* Tuning structure for the Qualcomm Saphira core.  Default to falkor values
> > @@ -1799,7 +1824,9 @@ static const struct tune_params saphira_tunings =
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE),           /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params thunderx2t99_tunings =
> > @@ -1833,7 +1860,9 @@ static const struct tune_params thunderx2t99_tunings =
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > -  &thunderx2t99_prefetch_tune
> > +  &thunderx2t99_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params thunderx3t110_tunings =
> > @@ -1867,7 +1896,9 @@ static const struct tune_params thunderx3t110_tunings =
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > -  &thunderx3t110_prefetch_tune
> > +  &thunderx3t110_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params neoversen1_tunings =
> > @@ -1900,7 +1931,9 @@ static const struct tune_params neoversen1_tunings =
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params ampere1_tunings =
> > @@ -1936,8 +1969,10 @@ static const struct tune_params ampere1_tunings =
> >    2,   /* min_div_recip_mul_df.  */
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> > -  &ampere1_prefetch_tune
> > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > +  &ampere1_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params ampere1a_tunings =
> > @@ -1974,8 +2009,10 @@ static const struct tune_params ampere1a_tunings =
> >    2,   /* min_div_recip_mul_df.  */
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> > -  &ampere1_prefetch_tune
> > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > +  &ampere1_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> >  };
> >
> >  static const advsimd_vec_cost neoversev1_advsimd_vector_cost =
> > @@ -2156,7 +2193,9 @@ static const struct tune_params neoversev1_tunings =
> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >     | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const sve_vec_cost neoverse512tvb_sve_vector_cost =
> > @@ -2293,7 +2332,9 @@ static const struct tune_params neoverse512tvb_tunings =
> >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const advsimd_vec_cost neoversen2_advsimd_vector_cost =
> > @@ -2483,7 +2524,9 @@ static const struct tune_params neoversen2_tunings =
> >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const advsimd_vec_cost neoversev2_advsimd_vector_cost =
> > @@ -2673,7 +2716,9 @@ static const struct tune_params neoversev2_tunings =
> >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params a64fx_tunings =
> > @@ -2706,7 +2751,9 @@ static const struct tune_params a64fx_tunings =
> >    0,   /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > -  &a64fx_prefetch_tune
> > +  &a64fx_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> >  };
> >
> >  /* Support for fine-grained override of the tuning structures.  */
> > @@ -17819,6 +17866,34 @@ aarch64_parse_tune (const char *to_parse, const struct processor **res)
> >    return AARCH_PARSE_INVALID_ARG;
> >  }
> >
> > +/* Parse a command-line -param=aarch64-ldp-policy= parameter.  VALUE is
> > +   the value of the parameter.  */
> > +
> > +static void
> > +aarch64_parse_ldp_policy (const unsigned int value, struct tune_params* tune)
> > +{
> > +  if (value == 1)
> > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALWAYS;
> > +  else if (value == 2)
> > +    tune->ldp_policy_model = tune_params::LDP_POLICY_NEVER;
> > +  else if (value == 3)
> > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALIGNED;
> > +}
> > +
> > +/* Parse a command-line -param=aarch64-stp-policy= parameter.  VALUE is
> > +   the value of the parameter.  */
> > +
> > +static void
> > +aarch64_parse_stp_policy (const unsigned int value, struct tune_params* tune)
> > +{
> > +  if (value == 1)
> > +    tune->stp_policy_model = tune_params::STP_POLICY_ALWAYS;
> > +  else if (value == 2)
> > +    tune->stp_policy_model = tune_params::STP_POLICY_NEVER;
> > +  else if (value == 3)
> > +    tune->stp_policy_model = tune_params::STP_POLICY_ALIGNED;
> > +}
> > +
> >  /* Parse TOKEN, which has length LENGTH to see if it is an option
> >     described in FLAG.  If it is, return the index bit for that fusion type.
> >     If not, error (printing OPTION_NAME) and return zero.  */
> > @@ -18167,6 +18242,12 @@ aarch64_override_options_internal (struct gcc_options *opts)
> >      aarch64_parse_override_string (opts->x_aarch64_override_tune_string,
> >                                    &aarch64_tune_params);
> >
> > +  aarch64_parse_ldp_policy (aarch64_ldp_policy,
> > +                             &aarch64_tune_params);
> > +
> > +  aarch64_parse_stp_policy (aarch64_stp_policy,
> > +                             &aarch64_tune_params);
> > +
> >    /* This target defaults to strict volatile bitfields.  */
> >    if (opts->x_flag_strict_volatile_bitfields < 0 && abi_version_at_least (2))
> >      opts->x_flag_strict_volatile_bitfields = 1;
> > @@ -26468,18 +26549,14 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
> >    enum reg_class rclass_1, rclass_2;
> >    rtx mem_1, mem_2, reg_1, reg_2;
> >
> > -  /* Allow the tuning structure to disable LDP instruction formation
> > -     from combining instructions (e.g., in peephole2).
> > -     TODO: Implement fine-grained tuning control for LDP and STP:
> > -          1. control policies for load and store separately;
> > -          2. support the following policies:
> > -             - default (use what is in the tuning structure)
> > -             - always
> > -             - never
> > -             - aligned (only if the compiler can prove that the
> > -               load will be aligned to 2 * element_size)  */
> > -  if (load && (aarch64_tune_params.extra_tuning_flags
> > -              & AARCH64_EXTRA_TUNE_NO_LDP_COMBINE))
> > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> > +  if (load
> > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> > +    return false;
> > +
> > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> > +  if (!load
> > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
> >      return false;
> >
> >    if (load)
> > @@ -26506,13 +26583,22 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
> >    if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
> >      return false;
> >
> > -  /* If we have SImode and slow unaligned ldp,
> > -     check the alignment to be at least 8 byte. */
> > -  if (mode == SImode
> > -      && (aarch64_tune_params.extra_tuning_flags
> > -          & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> > -      && !optimize_size
> > -      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
> > +  /* If we have LDP_POLICY_ALIGNED,
> > +     do not emit the load pair unless the alignment is checked to be
> > +     at least double the alignment of the type.  */
> > +  if (load
> > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> > +      && !optimize_function_for_size_p (cfun)
> > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> > +    return false;
> > +
> > +  /* If we have STP_POLICY_ALIGNED,
> > +     do not emit the store pair unless the alignment is checked to be
> > +     at least double the alignment of the type.  */
> > +  if (!load
> > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> > +      && !optimize_function_for_size_p (cfun)
> > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> >      return false;
> >
> >    /* Check if the addresses are in the form of [base+offset].  */
> > @@ -26640,6 +26726,16 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
> >    HOST_WIDE_INT offvals[num_insns], msize;
> >    rtx mem[num_insns], reg[num_insns], base[num_insns], offset[num_insns];
> >
> > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> > +  if (load
> > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> > +    return false;
> > +
> > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> > +  if (!load
> > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
> > +    return false;
> > +
> >    if (load)
> >      {
> >        for (int i = 0; i < num_insns; i++)
> > @@ -26729,13 +26825,22 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
> >    if (offvals[0] % msize != offvals[2] % msize)
> >      return false;
> >
> > -  /* If we have SImode and slow unaligned ldp,
> > -     check the alignment to be at least 8 byte. */
> > -  if (mode == SImode
> > -      && (aarch64_tune_params.extra_tuning_flags
> > -         & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> > -      && !optimize_size
> > -      && MEM_ALIGN (mem[0]) < 8 * BITS_PER_UNIT)
> > +  /* If we have LDP_POLICY_ALIGNED,
> > +     do not emit the load pair unless the alignment is checked to be
> > +     at least double the alignment of the type.  */
> > +  if (load
> > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> > +      && !optimize_function_for_size_p (cfun)
> > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> > +    return false;
> > +
> > +  /* If we have STP_POLICY_ALIGNED,
> > +     do not emit the store pair unless the alignment is checked to be
> > +     at least double the alignment of the type.  */
> > +  if (!load
> > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> > +      && !optimize_function_for_size_p (cfun)
> > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> >      return false;
> >
> >    return true;
> > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> > index 4a0580435a8..f61e3f968d4 100644
> > --- a/gcc/config/aarch64/aarch64.opt
> > +++ b/gcc/config/aarch64/aarch64.opt
> > @@ -337,3 +337,11 @@ Constant memset size in bytes from which to start using MOPS sequence.
> >  -param=aarch64-vect-unroll-limit=
> >  Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param
> >  Limit how much the autovectorizer may unroll a loop.
> > +
> > +-param=aarch64-ldp-policy=
> > +Target Joined UInteger Var(aarch64_ldp_policy) Init(0) IntegerRange(0, 3) Param
> > +Fine-grained policy for load pairs.
> > +
> > +-param=aarch64-stp-policy=
> > +Target Joined UInteger Var(aarch64_stp_policy) Init(0) IntegerRange(0, 3) Param
> > +Fine-grained policy for store pairs.
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index 146b40414b0..000dd5541f4 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -16508,6 +16508,36 @@ Use both Advanced SIMD and SVE.  Prefer SVE when the costs are deemed equal.
> >  @end table
> >  The default value is 0.
> >
> > +@item aarch64-ldp-policy
> > +Fine-grained policy for load pairs.  Accepts values from 0 to 3, inclusive.
> > +@table @samp
> > +@item 0
> > +Use the policy of the tuning structure.
> > +@item 1
> > +Emit ldp regardless of alignment.
> > +@item 2
> > +Do not emit ldp.
> > +@item 3
> > +Emit ldp only if the source pointer is aligned to at least double the alignment
> > +of the type.
> > +@end table
> > +The default value is 0.
> > +
> > +@item aarch64-stp-policy
> > +Fine-grained policy for store pairs.  Accepts values from 0 to 3, inclusive.
> > +@table @samp
> > +@item 0
> > +Use the policy of the tuning structure.
> > +@item 1
> > +Emit stp regardless of alignment.
> > +@item 2
> > +Do not emit stp.
> > +@item 3
> > +Emit stp only if the source pointer is aligned to at least double the alignment
> > +of the type.
> > +@end table
> > +The default value is 0.
> > +
> >  @item aarch64-loop-vect-issue-rate-niters
> >  The tuning for some AArch64 CPUs tries to take both latencies and issue
> >  rates into account when deciding whether a loop should be vectorized
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c b/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > deleted file mode 100644
> > index bc871f4481d..00000000000
> > --- a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > +++ /dev/null
> > @@ -1,11 +0,0 @@
> > -/* { dg-options "-O3 -mtune=ampere1" } */
> > -
> > -long
> > -foo (long a[])
> > -{
> > -  return a[0] + a[1];
> > -}
> > -
> > -/* We should see two ldrs instead of one ldp. */
> > -/* { dg-final { scan-assembler {\tldr\t} } } */
> > -/* { dg-final { scan-assembler-not {\tldp\t} } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > new file mode 100644
> > index 00000000000..8e43faab70d
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > @@ -0,0 +1,66 @@
> > +/* { dg-options "-O2 --param=aarch64-ldp-policy=3 -mcpu=generic" } */
> > +
> > +#include <stdlib.h>
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +#define LDP_TEST_ALIGNED(TYPE) \
> > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    a_0 = arr[0]; \
> > +    a_1 = arr[1]; \
> > +    return a_0 + a_1; \
> > +}
> > +
> > +#define LDP_TEST_UNALIGNED(TYPE) \
> > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a_0 = a[0]; \
> > +    a_1 = a[1]; \
> > +    return a_0 + a_1; \
> > +}
> > +
> > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    a_0 = arr[100]; \
> > +    a_1 = arr[101]; \
> > +    a_2 = arr[102]; \
> > +    a_3 = arr[103]; \
> > +    a_4 = arr[110]; \
> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > +}
> > +
> > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a_0 = a[100]; \
> > +    a_1 = a[101]; \
> > +    a_2 = a[102]; \
> > +    a_3 = a[103]; \
> > +    a_4 = a[110]; \
> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > +}
> > +
> > +LDP_TEST_ALIGNED(int32_t);
> > +LDP_TEST_ALIGNED(int64_t);
> > +LDP_TEST_ALIGNED(v4si);
> > +LDP_TEST_UNALIGNED(int32_t);
> > +LDP_TEST_UNALIGNED(int64_t);
> > +LDP_TEST_UNALIGNED(v4si);
> > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > +
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_always.c b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > new file mode 100644
> > index 00000000000..532ca607565
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > @@ -0,0 +1,66 @@
> > +/* { dg-options "-O2 --param=aarch64-ldp-policy=1 -mcpu=generic" } */
> > +
> > +#include <stdlib.h>
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +#define LDP_TEST_ALIGNED(TYPE) \
> > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    a_0 = arr[0]; \
> > +    a_1 = arr[1]; \
> > +    return a_0 + a_1; \
> > +}
> > +
> > +#define LDP_TEST_UNALIGNED(TYPE) \
> > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a_0 = a[0]; \
> > +    a_1 = a[1]; \
> > +    return a_0 + a_1; \
> > +}
> > +
> > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    a_0 = arr[100]; \
> > +    a_1 = arr[101]; \
> > +    a_2 = arr[102]; \
> > +    a_3 = arr[103]; \
> > +    a_4 = arr[110]; \
> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > +}
> > +
> > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a_0 = a[100]; \
> > +    a_1 = a[101]; \
> > +    a_2 = a[102]; \
> > +    a_3 = a[103]; \
> > +    a_4 = a[110]; \
> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > +}
> > +
> > +LDP_TEST_ALIGNED(int32_t);
> > +LDP_TEST_ALIGNED(int64_t);
> > +LDP_TEST_ALIGNED(v4si);
> > +LDP_TEST_UNALIGNED(int32_t);
> > +LDP_TEST_UNALIGNED(int64_t);
> > +LDP_TEST_UNALIGNED(v4si);
> > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> > +
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_never.c b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > new file mode 100644
> > index 00000000000..b39941c18d7
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > @@ -0,0 +1,66 @@
> > +/* { dg-options "-O2 --param=aarch64-ldp-policy=2 -mcpu=generic" } */
> > +
> > +#include <stdlib.h>
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +#define LDP_TEST_ALIGNED(TYPE) \
> > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    a_0 = arr[0]; \
> > +    a_1 = arr[1]; \
> > +    return a_0 + a_1; \
> > +}
> > +
> > +#define LDP_TEST_UNALIGNED(TYPE) \
> > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a_0 = a[0]; \
> > +    a_1 = a[1]; \
> > +    return a_0 + a_1; \
> > +}
> > +
> > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    a_0 = arr[100]; \
> > +    a_1 = arr[101]; \
> > +    a_2 = arr[102]; \
> > +    a_3 = arr[103]; \
> > +    a_4 = arr[110]; \
> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > +}
> > +
> > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a_0 = a[100]; \
> > +    a_1 = a[101]; \
> > +    a_2 = a[102]; \
> > +    a_3 = a[103]; \
> > +    a_4 = a[110]; \
> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > +}
> > +
> > +LDP_TEST_ALIGNED(int32_t);
> > +LDP_TEST_ALIGNED(int64_t);
> > +LDP_TEST_ALIGNED(v4si);
> > +LDP_TEST_UNALIGNED(int32_t);
> > +LDP_TEST_UNALIGNED(int64_t);
> > +LDP_TEST_UNALIGNED(v4si);
> > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > +
> > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_aligned.c b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > new file mode 100644
> > index 00000000000..01f294bb090
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > @@ -0,0 +1,60 @@
> > +/* { dg-options "-O2 --param=aarch64-stp-policy=3 -mcpu=generic" } */
> > +
> > +#include <stdlib.h>
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +#define STP_TEST_ALIGNED(TYPE) \
> > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    arr[0] = x; \
> > +    arr[1] = x; \
> > +    return arr; \
> > +}
> > +
> > +#define STP_TEST_UNALIGNED(TYPE) \
> > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a[0] = x; \
> > +    a[1] = x; \
> > +    return a; \
> > +}
> > +
> > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    arr[100] = x; \
> > +    arr[101] = x; \
> > +    arr[102] = x; \
> > +    arr[103] = x; \
> > +    return arr; \
> > +}
> > +
> > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a[100] = x; \
> > +    a[101] = x; \
> > +    a[102] = x; \
> > +    a[103] = x; \
> > +    return a; \
> > +}
> > +
> > +STP_TEST_ALIGNED(int32_t);
> > +STP_TEST_ALIGNED(int64_t);
> > +STP_TEST_ALIGNED(v4si);
> > +STP_TEST_UNALIGNED(int32_t);
> > +STP_TEST_UNALIGNED(int64_t);
> > +STP_TEST_UNALIGNED(v4si);
> > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > +
> > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > +
> > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_always.c b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> > new file mode 100644
> > index 00000000000..cedb461b5b2
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> > @@ -0,0 +1,60 @@
> > +/* { dg-options "-O2 --param=aarch64-stp-policy=1 -mcpu=generic" } */
> > +
> > +#include <stdlib.h>
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +#define STP_TEST_ALIGNED(TYPE) \
> > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    arr[0] = x; \
> > +    arr[1] = x; \
> > +    return arr; \
> > +}
> > +
> > +#define STP_TEST_UNALIGNED(TYPE) \
> > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a[0] = x; \
> > +    a[1] = x; \
> > +    return a; \
> > +}
> > +
> > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    arr[100] = x; \
> > +    arr[101] = x; \
> > +    arr[102] = x; \
> > +    arr[103] = x; \
> > +    return arr; \
> > +}
> > +
> > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a[100] = x; \
> > +    a[101] = x; \
> > +    a[102] = x; \
> > +    a[103] = x; \
> > +    return a; \
> > +}
> > +
> > +STP_TEST_ALIGNED(int32_t);
> > +STP_TEST_ALIGNED(int64_t);
> > +STP_TEST_ALIGNED(v4si);
> > +STP_TEST_UNALIGNED(int32_t);
> > +STP_TEST_UNALIGNED(int64_t);
> > +STP_TEST_UNALIGNED(v4si);
> > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > +
> > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> > +
> > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_never.c b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> > new file mode 100644
> > index 00000000000..ddde658f807
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> > @@ -0,0 +1,60 @@
> > +/* { dg-options "-O2 --param=aarch64-stp-policy=2 -mcpu=generic" } */
> > +
> > +#include <stdlib.h>
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +#define STP_TEST_ALIGNED(TYPE) \
> > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    arr[0] = x; \
> > +    arr[1] = x; \
> > +    return arr; \
> > +}
> > +
> > +#define STP_TEST_UNALIGNED(TYPE) \
> > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a[0] = x; \
> > +    a[1] = x; \
> > +    return a; \
> > +}
> > +
> > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    arr[100] = x; \
> > +    arr[101] = x; \
> > +    arr[102] = x; \
> > +    arr[103] = x; \
> > +    return arr; \
> > +}
> > +
> > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > +    TYPE *a = arr+1; \
> > +    a[100] = x; \
> > +    a[101] = x; \
> > +    a[102] = x; \
> > +    a[103] = x; \
> > +    return a; \
> > +}
> > +
> > +STP_TEST_ALIGNED(int32_t);
> > +STP_TEST_ALIGNED(int64_t);
> > +STP_TEST_ALIGNED(v4si);
> > +STP_TEST_UNALIGNED(int32_t);
> > +STP_TEST_UNALIGNED(int64_t);
> > +STP_TEST_UNALIGNED(v4si);
> > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > +
> > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > +
> > --
> > 2.40.1
> >
>


* Re: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp formation.
  2023-09-25 19:59   ` Philipp Tomsich
@ 2023-09-25 20:04     ` Andrew Pinski
  2023-09-25 20:57       ` Andrew Pinski
  2023-09-26  8:34       ` Kyrylo Tkachov
  0 siblings, 2 replies; 12+ messages in thread
From: Andrew Pinski @ 2023-09-25 20:04 UTC (permalink / raw)
  To: Philipp Tomsich; +Cc: Manos Anagnostakis, gcc-patches, Kyrylo Tkachov

On Mon, Sep 25, 2023 at 12:59 PM Philipp Tomsich
<philipp.tomsich@vrull.eu> wrote:
>
> On Mon, 25 Sept 2023 at 21:54, Andrew Pinski <pinskia@gmail.com> wrote:
> >
> > On Mon, Sep 25, 2023 at 12:50 PM Manos Anagnostakis
> > <manos.anagnostakis@vrull.eu> wrote:
> > >
> > > This patch implements the following TODO in gcc/config/aarch64/aarch64.cc
> > > to provide the requested behaviour for handling ldp and stp:
> > >
> > >   /* Allow the tuning structure to disable LDP instruction formation
> > >      from combining instructions (e.g., in peephole2).
> > >      TODO: Implement fine-grained tuning control for LDP and STP:
> > >            1. control policies for load and store separately;
> > >            2. support the following policies:
> > >               - default (use what is in the tuning structure)
> > >               - always
> > >               - never
> > >               - aligned (only if the compiler can prove that the
> > >                 load will be aligned to 2 * element_size)  */
> > >
> > > It provides two new and concrete target-specific command-line parameters
> > > -param=aarch64-ldp-policy= and -param=aarch64-stp-policy=
> > > > to give the ability to control load and store policies separately as
> > > stated in part 1 of the TODO.
> > >
> > > The accepted values for both parameters are:
> > > - 0: Use the policy of the tuning structure (default).
> > > - 1: Emit ldp/stp regardless of alignment.
> > > - 2: Do not emit ldp/stp.
> > > - 3: In order to emit ldp/stp, first check if the load/store will
> > >   be aligned to 2 * element_size.
> >
> > > Instead of a number, would it make sense to use a string
> > > (ENUM) for this param?
> > Also I think using --param is a bad idea if it is going to be
> > documented in the user manual.
> > Maybe a -m option should be used instead.
>
> See https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631283.html
> for the discussion triggering the change from -m... to --param and the
> change to using a number instead of a string.

That is the opposite of current GCC practice across all targets.
Things like this should be consistent; if one target decides to do
it differently, then maybe it should not.
Anyway, we should document the correct coding style for options so we
don't have this back-and-forth again.


Thanks,
Andrew

>
> Thanks,
> Philipp.
>
> >
> > Thanks,
> > Andrew
> >
> > >
> > > gcc/ChangeLog:
> > >         * config/aarch64/aarch64-protos.h (struct tune_params): Add
> > >         appropriate enums for the policies.
> > >         * config/aarch64/aarch64-tuning-flags.def
> > >         (AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
> > >         options.
> > >         * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
> > >         function to parse ldp-policy parameter.
> > >         (aarch64_parse_stp_policy): New function to parse stp-policy parameter.
> > >         (aarch64_override_options_internal): Call parsing functions.
> > >         (aarch64_operands_ok_for_ldpstp): Add parameter-value check and
> > >         alignment check and remove superseded ones.
> > >         (aarch64_operands_adjust_ok_for_ldpstp): Add parameter-value check and
> > >         alignment check and remove superseded ones.
> > >         * config/aarch64/aarch64.opt: Add options.
> > >         * doc/invoke.texi: Document the parameters accordingly.
> > >
> > > gcc/testsuite/ChangeLog:
> > >         * gcc.target/aarch64/ampere1-no_ldp_combine.c: Removed.
> > >         * gcc.target/aarch64/ldp_aligned.c: New test.
> > >         * gcc.target/aarch64/ldp_always.c: New test.
> > >         * gcc.target/aarch64/ldp_never.c: New test.
> > >         * gcc.target/aarch64/stp_aligned.c: New test.
> > >         * gcc.target/aarch64/stp_always.c: New test.
> > >         * gcc.target/aarch64/stp_never.c: New test.
> > >
> > > Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> > > ---
> > > Changes in v3:
> > >         - Changed command-line options to target-specific parameters
> > >           and documented them accordingly in doc/invoke.texi.
> > >         - Removed ampere1-no_ldp_combine.c test as superseded.
> > >
> > >  gcc/config/aarch64/aarch64-protos.h           |  24 ++
> > >  gcc/config/aarch64/aarch64-tuning-flags.def   |   8 -
> > >  gcc/config/aarch64/aarch64.cc                 | 215 +++++++++++++-----
> > >  gcc/config/aarch64/aarch64.opt                |   8 +
> > >  gcc/doc/invoke.texi                           |  30 +++
> > >  .../aarch64/ampere1-no_ldp_combine.c          |  11 -
> > >  .../gcc.target/aarch64/ldp_aligned.c          |  66 ++++++
> > >  gcc/testsuite/gcc.target/aarch64/ldp_always.c |  66 ++++++
> > >  gcc/testsuite/gcc.target/aarch64/ldp_never.c  |  66 ++++++
> > >  .../gcc.target/aarch64/stp_aligned.c          |  60 +++++
> > >  gcc/testsuite/gcc.target/aarch64/stp_always.c |  60 +++++
> > >  gcc/testsuite/gcc.target/aarch64/stp_never.c  |  60 +++++
> > >  12 files changed, 600 insertions(+), 74 deletions(-)
> > >  delete mode 100644 gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_always.c
> > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_never.c
> > >
> > > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> > > index 70303d6fd95..be1d73490ed 100644
> > > --- a/gcc/config/aarch64/aarch64-protos.h
> > > +++ b/gcc/config/aarch64/aarch64-protos.h
> > > @@ -568,6 +568,30 @@ struct tune_params
> > >    /* Place prefetch struct pointer at the end to enable type checking
> > >       errors when tune_params misses elements (e.g., from erroneous merges).  */
> > >    const struct cpu_prefetch_tune *prefetch;
> > > +/* An enum specifying how to handle load pairs using a fine-grained policy:
> > > +   - LDP_POLICY_ALIGNED: Emit ldp if the source pointer is aligned
> > > +   to at least double the alignment of the type.
> > > +   - LDP_POLICY_ALWAYS: Emit ldp regardless of alignment.
> > > +   - LDP_POLICY_NEVER: Do not emit ldp.  */
> > > +
> > > +  enum aarch64_ldp_policy_model
> > > +  {
> > > +    LDP_POLICY_ALIGNED,
> > > +    LDP_POLICY_ALWAYS,
> > > +    LDP_POLICY_NEVER
> > > +  } ldp_policy_model;
> > > +/* An enum specifying how to handle store pairs using a fine-grained policy:
> > > +   - STP_POLICY_ALIGNED: Emit stp if the source pointer is aligned
> > > +   to at least double the alignment of the type.
> > > +   - STP_POLICY_ALWAYS: Emit stp regardless of alignment.
> > > +   - STP_POLICY_NEVER: Do not emit stp.  */
> > > +
> > > +  enum aarch64_stp_policy_model
> > > +  {
> > > +    STP_POLICY_ALIGNED,
> > > +    STP_POLICY_ALWAYS,
> > > +    STP_POLICY_NEVER
> > > +  } stp_policy_model;
> > >  };
> > >
> > >  /* Classifies an address.
> > > diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
> > > index 52112ba7c48..774568e9106 100644
> > > --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> > > +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> > > @@ -30,11 +30,6 @@
> > >
> > >  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
> > >
> > > -/* Don't create non-8 byte aligned load/store pair.  That is if the
> > > -two load/stores are not at least 8 byte aligned don't create load/store
> > > -pairs.   */
> > > -AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)
> > > -
> > >  /* Some of the optional shift to some arthematic instructions are
> > >     considered cheap.  Logical shift left <=4 with or without a
> > >     zero extend are considered cheap.  Sign extend; non logical shift left
> > > @@ -44,9 +39,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
> > >  /* Disallow load/store pair instructions on Q-registers.  */
> > >  AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)
> > >
> > > -/* Disallow load-pair instructions to be formed in combine/peephole.  */
> > > -AARCH64_EXTRA_TUNING_OPTION ("no_ldp_combine", NO_LDP_COMBINE)
> > > -
> > >  AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
> > >
> > >  AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
> > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > > index 219c4ee6d4c..9eeb5469cf9 100644
> > > --- a/gcc/config/aarch64/aarch64.cc
> > > +++ b/gcc/config/aarch64/aarch64.cc
> > > @@ -1357,7 +1357,9 @@ static const struct tune_params generic_tunings =
> > >       Neoverse V1.  It does not have a noticeable effect on A64FX and should
> > >       have at most a very minor effect on SVE2 cores.  */
> > >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS),   /* tune_flags.  */
> > > -  &generic_prefetch_tune
> > > +  &generic_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > >  };
> > >
> > >  static const struct tune_params cortexa35_tunings =
> > > @@ -1391,7 +1393,9 @@ static const struct tune_params cortexa35_tunings =
> > >    0,   /* max_case_values.  */
> > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > -  &generic_prefetch_tune
> > > +  &generic_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > >  };
> > >
> > >  static const struct tune_params cortexa53_tunings =
> > > @@ -1425,7 +1429,9 @@ static const struct tune_params cortexa53_tunings =
> > >    0,   /* max_case_values.  */
> > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > -  &generic_prefetch_tune
> > > +  &generic_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > >  };
> > >
> > >  static const struct tune_params cortexa57_tunings =
> > > @@ -1459,7 +1465,9 @@ static const struct tune_params cortexa57_tunings =
> > >    0,   /* max_case_values.  */
> > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > >    (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),        /* tune_flags.  */
> > > -  &generic_prefetch_tune
> > > +  &generic_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > >  };
> > >
> > >  static const struct tune_params cortexa72_tunings =
> > > @@ -1493,7 +1501,9 @@ static const struct tune_params cortexa72_tunings =
> > >    0,   /* max_case_values.  */
> > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > -  &generic_prefetch_tune
> > > +  &generic_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > >  };
> > >
> > >  static const struct tune_params cortexa73_tunings =
> > > @@ -1527,7 +1537,9 @@ static const struct tune_params cortexa73_tunings =
> > >    0,   /* max_case_values.  */
> > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > -  &generic_prefetch_tune
> > > +  &generic_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > >  };
> > >
> > >
> > > @@ -1562,7 +1574,9 @@ static const struct tune_params exynosm1_tunings =
> > >    48,  /* max_case_values.  */
> > >    tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
> > >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> > > -  &exynosm1_prefetch_tune
> > > +  &exynosm1_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > >  };
> > >
> > >  static const struct tune_params thunderxt88_tunings =
> > > @@ -1594,8 +1608,10 @@ static const struct tune_params thunderxt88_tunings =
> > >    2,   /* min_div_recip_mul_df.  */
> > >    0,   /* max_case_values.  */
> > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),    /* tune_flags.  */
> > > -  &thunderxt88_prefetch_tune
> > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > +  &thunderxt88_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > >  };
> > >
> > >  static const struct tune_params thunderx_tunings =
> > > @@ -1627,9 +1643,10 @@ static const struct tune_params thunderx_tunings =
> > >    2,   /* min_div_recip_mul_df.  */
> > >    0,   /* max_case_values.  */
> > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW
> > > -   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> > > -  &thunderx_prefetch_tune
> > > +  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /* tune_flags.  */
> > > +  &thunderx_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > >  };
> > >
> > >  static const struct tune_params tsv110_tunings =
> > > @@ -1663,7 +1680,9 @@ static const struct tune_params tsv110_tunings =
> > >    0,    /* max_case_values.  */
> > >    tune_params::AUTOPREFETCHER_WEAK,     /* autoprefetcher_model.  */
> > >    (AARCH64_EXTRA_TUNE_NONE),     /* tune_flags.  */
> > > -  &tsv110_prefetch_tune
> > > +  &tsv110_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > >  };
> > >
> > >  static const struct tune_params xgene1_tunings =
> > > @@ -1696,7 +1715,9 @@ static const struct tune_params xgene1_tunings =
> > >    17,  /* max_case_values.  */
> > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> > > -  &xgene1_prefetch_tune
> > > +  &xgene1_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > >  };
> > >
> > >  static const struct tune_params emag_tunings =
> > > @@ -1729,7 +1750,9 @@ static const struct tune_params emag_tunings =
> > >    17,  /* max_case_values.  */
> > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> > > -  &xgene1_prefetch_tune
> > > +  &xgene1_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > >  };
> > >
> > >  static const struct tune_params qdf24xx_tunings =
> > > @@ -1763,7 +1786,9 @@ static const struct tune_params qdf24xx_tunings =
> > >    0,   /* max_case_values.  */
> > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > >    AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS, /* tune_flags.  */
> > > -  &qdf24xx_prefetch_tune
> > > +  &qdf24xx_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > >  };
> > >
> > >  /* Tuning structure for the Qualcomm Saphira core.  Default to falkor values
> > > @@ -1799,7 +1824,9 @@ static const struct tune_params saphira_tunings =
> > >    0,   /* max_case_values.  */
> > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > >    (AARCH64_EXTRA_TUNE_NONE),           /* tune_flags.  */
> > > -  &generic_prefetch_tune
> > > +  &generic_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > >  };
> > >
> > >  static const struct tune_params thunderx2t99_tunings =
> > > @@ -1833,7 +1860,9 @@ static const struct tune_params thunderx2t99_tunings =
> > >    0,   /* max_case_values.  */
> > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > -  &thunderx2t99_prefetch_tune
> > > +  &thunderx2t99_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > >  };
> > >
> > >  static const struct tune_params thunderx3t110_tunings =
> > > @@ -1867,7 +1896,9 @@ static const struct tune_params thunderx3t110_tunings =
> > >    0,   /* max_case_values.  */
> > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > -  &thunderx3t110_prefetch_tune
> > > +  &thunderx3t110_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > >  };
> > >
> > >  static const struct tune_params neoversen1_tunings =
> > > @@ -1900,7 +1931,9 @@ static const struct tune_params neoversen1_tunings =
> > >    0,   /* max_case_values.  */
> > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > >    (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /* tune_flags.  */
> > > -  &generic_prefetch_tune
> > > +  &generic_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > >  };
> > >
> > >  static const struct tune_params ampere1_tunings =
> > > @@ -1936,8 +1969,10 @@ static const struct tune_params ampere1_tunings =
> > >    2,   /* min_div_recip_mul_df.  */
> > >    0,   /* max_case_values.  */
> > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> > > -  &ampere1_prefetch_tune
> > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > +  &ampere1_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > >  };
> > >
> > >  static const struct tune_params ampere1a_tunings =
> > > @@ -1974,8 +2009,10 @@ static const struct tune_params ampere1a_tunings =
> > >    2,   /* min_div_recip_mul_df.  */
> > >    0,   /* max_case_values.  */
> > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> > > -  &ampere1_prefetch_tune
> > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > +  &ampere1_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > >  };
> > >
> > >  static const advsimd_vec_cost neoversev1_advsimd_vector_cost =
> > > @@ -2156,7 +2193,9 @@ static const struct tune_params neoversev1_tunings =
> > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> > >     | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> > > -  &generic_prefetch_tune
> > > +  &generic_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > >  };
> > >
> > >  static const sve_vec_cost neoverse512tvb_sve_vector_cost =
> > > @@ -2293,7 +2332,9 @@ static const struct tune_params neoverse512tvb_tunings =
> > >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > > -  &generic_prefetch_tune
> > > +  &generic_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > >  };
> > >
> > >  static const advsimd_vec_cost neoversen2_advsimd_vector_cost =
> > > @@ -2483,7 +2524,9 @@ static const struct tune_params neoversen2_tunings =
> > >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > > -  &generic_prefetch_tune
> > > +  &generic_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > >  };
> > >
> > >  static const advsimd_vec_cost neoversev2_advsimd_vector_cost =
> > > @@ -2673,7 +2716,9 @@ static const struct tune_params neoversev2_tunings =
> > >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > > -  &generic_prefetch_tune
> > > +  &generic_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > >  };
> > >
> > >  static const struct tune_params a64fx_tunings =
> > > @@ -2706,7 +2751,9 @@ static const struct tune_params a64fx_tunings =
> > >    0,   /* max_case_values.  */
> > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > -  &a64fx_prefetch_tune
> > > +  &a64fx_prefetch_tune,
> > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > >  };
> > >
> > >  /* Support for fine-grained override of the tuning structures.  */
> > > @@ -17819,6 +17866,34 @@ aarch64_parse_tune (const char *to_parse, const struct processor **res)
> > >    return AARCH_PARSE_INVALID_ARG;
> > >  }
> > >
> > > +/* Parse a command-line -param=aarch64-ldp-policy= parameter.  VALUE is
> > > +   the value of the parameter.  */
> > > +
> > > +static void
> > > +aarch64_parse_ldp_policy (const unsigned int value, struct tune_params* tune)
> > > +{
> > > +  if (value == 1)
> > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALWAYS;
> > > +  else if (value == 2)
> > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_NEVER;
> > > +  else if (value == 3)
> > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALIGNED;
> > > +}
> > > +
> > > +/* Parse a command-line -param=aarch64-stp-policy= parameter.  VALUE is
> > > +   the value of the parameter.  */
> > > +
> > > +static void
> > > +aarch64_parse_stp_policy (const unsigned int value, struct tune_params* tune)
> > > +{
> > > +  if (value == 1)
> > > +    tune->stp_policy_model = tune_params::STP_POLICY_ALWAYS;
> > > +  else if (value == 2)
> > > +    tune->stp_policy_model = tune_params::STP_POLICY_NEVER;
> > > +  else if (value == 3)
> > > +    tune->stp_policy_model = tune_params::STP_POLICY_ALIGNED;
> > > +}
> > > +
> > >  /* Parse TOKEN, which has length LENGTH to see if it is an option
> > >     described in FLAG.  If it is, return the index bit for that fusion type.
> > >     If not, error (printing OPTION_NAME) and return zero.  */
> > > @@ -18167,6 +18242,12 @@ aarch64_override_options_internal (struct gcc_options *opts)
> > >      aarch64_parse_override_string (opts->x_aarch64_override_tune_string,
> > >                                    &aarch64_tune_params);
> > >
> > > +  aarch64_parse_ldp_policy (aarch64_ldp_policy,
> > > +                             &aarch64_tune_params);
> > > +
> > > +  aarch64_parse_stp_policy (aarch64_stp_policy,
> > > +                             &aarch64_tune_params);
> > > +
> > >    /* This target defaults to strict volatile bitfields.  */
> > >    if (opts->x_flag_strict_volatile_bitfields < 0 && abi_version_at_least (2))
> > >      opts->x_flag_strict_volatile_bitfields = 1;
> > > @@ -26468,18 +26549,14 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
> > >    enum reg_class rclass_1, rclass_2;
> > >    rtx mem_1, mem_2, reg_1, reg_2;
> > >
> > > -  /* Allow the tuning structure to disable LDP instruction formation
> > > -     from combining instructions (e.g., in peephole2).
> > > -     TODO: Implement fine-grained tuning control for LDP and STP:
> > > -          1. control policies for load and store separately;
> > > -          2. support the following policies:
> > > -             - default (use what is in the tuning structure)
> > > -             - always
> > > -             - never
> > > -             - aligned (only if the compiler can prove that the
> > > -               load will be aligned to 2 * element_size)  */
> > > -  if (load && (aarch64_tune_params.extra_tuning_flags
> > > -              & AARCH64_EXTRA_TUNE_NO_LDP_COMBINE))
> > > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> > > +  if (load
> > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> > > +    return false;
> > > +
> > > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> > > +  if (!load
> > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
> > >      return false;
> > >
> > >    if (load)
> > > @@ -26506,13 +26583,22 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
> > >    if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
> > >      return false;
> > >
> > > -  /* If we have SImode and slow unaligned ldp,
> > > -     check the alignment to be at least 8 byte. */
> > > -  if (mode == SImode
> > > -      && (aarch64_tune_params.extra_tuning_flags
> > > -          & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> > > -      && !optimize_size
> > > -      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
> > > +  /* If we have LDP_POLICY_ALIGNED,
> > > +     do not emit the load pair unless the alignment is checked to be
> > > +     at least double the alignment of the type.  */
> > > +  if (load
> > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> > > +      && !optimize_function_for_size_p (cfun)
> > > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> > > +    return false;
> > > +
> > > +  /* If we have STP_POLICY_ALIGNED,
> > > +     do not emit the store pair unless the alignment is checked to be
> > > +     at least double the alignment of the type.  */
> > > +  if (!load
> > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> > > +      && !optimize_function_for_size_p (cfun)
> > > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> > >      return false;
> > >
> > >    /* Check if the addresses are in the form of [base+offset].  */
> > > @@ -26640,6 +26726,16 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
> > >    HOST_WIDE_INT offvals[num_insns], msize;
> > >    rtx mem[num_insns], reg[num_insns], base[num_insns], offset[num_insns];
> > >
> > > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> > > +  if (load
> > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> > > +    return false;
> > > +
> > > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> > > +  if (!load
> > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
> > > +    return false;
> > > +
> > >    if (load)
> > >      {
> > >        for (int i = 0; i < num_insns; i++)
> > > @@ -26729,13 +26825,22 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
> > >    if (offvals[0] % msize != offvals[2] % msize)
> > >      return false;
> > >
> > > -  /* If we have SImode and slow unaligned ldp,
> > > -     check the alignment to be at least 8 byte. */
> > > -  if (mode == SImode
> > > -      && (aarch64_tune_params.extra_tuning_flags
> > > -         & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> > > -      && !optimize_size
> > > -      && MEM_ALIGN (mem[0]) < 8 * BITS_PER_UNIT)
> > > +  /* If we have LDP_POLICY_ALIGNED,
> > > +     do not emit the load pair unless the alignment is checked to be
> > > +     at least double the alignment of the type.  */
> > > +  if (load
> > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> > > +      && !optimize_function_for_size_p (cfun)
> > > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> > > +    return false;
> > > +
> > > +  /* If we have STP_POLICY_ALIGNED,
> > > +     do not emit the store pair unless the alignment is checked to be
> > > +     at least double the alignment of the type.  */
> > > +  if (!load
> > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> > > +      && !optimize_function_for_size_p (cfun)
> > > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> > >      return false;
> > >
> > >    return true;
> > > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> > > index 4a0580435a8..f61e3f968d4 100644
> > > --- a/gcc/config/aarch64/aarch64.opt
> > > +++ b/gcc/config/aarch64/aarch64.opt
> > > @@ -337,3 +337,11 @@ Constant memset size in bytes from which to start using MOPS sequence.
> > >  -param=aarch64-vect-unroll-limit=
> > >  Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param
> > >  Limit how much the autovectorizer may unroll a loop.
> > > +
> > > +-param=aarch64-ldp-policy=
> > > +Target Joined UInteger Var(aarch64_ldp_policy) Init(0) IntegerRange(0, 3) Param
> > > +Fine-grained policy for load pairs.
> > > +
> > > +-param=aarch64-stp-policy=
> > > +Target Joined UInteger Var(aarch64_stp_policy) Init(0) IntegerRange(0, 3) Param
> > > +Fine-grained policy for store pairs.
> > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > index 146b40414b0..000dd5541f4 100644
> > > --- a/gcc/doc/invoke.texi
> > > +++ b/gcc/doc/invoke.texi
> > > @@ -16508,6 +16508,36 @@ Use both Advanced SIMD and SVE.  Prefer SVE when the costs are deemed equal.
> > >  @end table
> > >  The default value is 0.
> > >
> > > +@item aarch64-ldp-policy
> > > +Fine-grained policy for load pairs.  Accepts values from 0 to 3, inclusive.
> > > +@table @samp
> > > +@item 0
> > > +Use the policy of the tuning structure.
> > > +@item 1
> > > +Emit ldp regardless of alignment.
> > > +@item 2
> > > +Do not emit ldp.
> > > +@item 3
> > > +Emit ldp only if the source pointer is aligned to at least double the alignment
> > > +of the type.
> > > +@end table
> > > +The default value is 0.
> > > +
> > > +@item aarch64-stp-policy
> > > +Fine-grained policy for store pairs.  Accepts values from 0 to 3, inclusive.
> > > +@table @samp
> > > +@item 0
> > > +Use the policy of the tuning structure.
> > > +@item 1
> > > +Emit stp regardless of alignment.
> > > +@item 2
> > > +Do not emit stp.
> > > +@item 3
> > > +Emit stp only if the source pointer is aligned to at least double the alignment
> > > +of the type.
> > > +@end table
> > > +The default value is 0.
> > > +
> > >  @item aarch64-loop-vect-issue-rate-niters
> > >  The tuning for some AArch64 CPUs tries to take both latencies and issue
> > >  rates into account when deciding whether a loop should be vectorized
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c b/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > > deleted file mode 100644
> > > index bc871f4481d..00000000000
> > > --- a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > > +++ /dev/null
> > > @@ -1,11 +0,0 @@
> > > -/* { dg-options "-O3 -mtune=ampere1" } */
> > > -
> > > -long
> > > -foo (long a[])
> > > -{
> > > -  return a[0] + a[1];
> > > -}
> > > -
> > > -/* We should see two ldrs instead of one ldp. */
> > > -/* { dg-final { scan-assembler {\tldr\t} } } */
> > > -/* { dg-final { scan-assembler-not {\tldp\t} } } */
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > > new file mode 100644
> > > index 00000000000..8e43faab70d
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > > @@ -0,0 +1,66 @@
> > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=3 -mcpu=generic" } */
> > > +
> > > +#include <stdlib.h>
> > > +#include <stdint.h>
> > > +
> > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > +
> > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > +    TYPE a_0, a_1; \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    a_0 = arr[0]; \
> > > +    a_1 = arr[1]; \
> > > +    return a_0 + a_1; \
> > > +}
> > > +
> > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > +    TYPE a_0, a_1; \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    TYPE *a = arr+1; \
> > > +    a_0 = a[0]; \
> > > +    a_1 = a[1]; \
> > > +    return a_0 + a_1; \
> > > +}
> > > +
> > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    a_0 = arr[100]; \
> > > +    a_1 = arr[101]; \
> > > +    a_2 = arr[102]; \
> > > +    a_3 = arr[103]; \
> > > +    a_4 = arr[110]; \
> > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > +}
> > > +
> > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    TYPE *a = arr+1; \
> > > +    a_0 = a[100]; \
> > > +    a_1 = a[101]; \
> > > +    a_2 = a[102]; \
> > > +    a_3 = a[103]; \
> > > +    a_4 = a[110]; \
> > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > +}
> > > +
> > > +LDP_TEST_ALIGNED(int32_t);
> > > +LDP_TEST_ALIGNED(int64_t);
> > > +LDP_TEST_ALIGNED(v4si);
> > > +LDP_TEST_UNALIGNED(int32_t);
> > > +LDP_TEST_UNALIGNED(int64_t);
> > > +LDP_TEST_UNALIGNED(v4si);
> > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > +
> > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > > +
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_always.c b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > > new file mode 100644
> > > index 00000000000..532ca607565
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > > @@ -0,0 +1,66 @@
> > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=1 -mcpu=generic" } */
> > > +
> > > +#include <stdlib.h>
> > > +#include <stdint.h>
> > > +
> > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > +
> > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > +    TYPE a_0, a_1; \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    a_0 = arr[0]; \
> > > +    a_1 = arr[1]; \
> > > +    return a_0 + a_1; \
> > > +}
> > > +
> > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > +    TYPE a_0, a_1; \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    TYPE *a = arr+1; \
> > > +    a_0 = a[0]; \
> > > +    a_1 = a[1]; \
> > > +    return a_0 + a_1; \
> > > +}
> > > +
> > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    a_0 = arr[100]; \
> > > +    a_1 = arr[101]; \
> > > +    a_2 = arr[102]; \
> > > +    a_3 = arr[103]; \
> > > +    a_4 = arr[110]; \
> > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > +}
> > > +
> > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    TYPE *a = arr+1; \
> > > +    a_0 = a[100]; \
> > > +    a_1 = a[101]; \
> > > +    a_2 = a[102]; \
> > > +    a_3 = a[103]; \
> > > +    a_4 = a[110]; \
> > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > +}
> > > +
> > > +LDP_TEST_ALIGNED(int32_t);
> > > +LDP_TEST_ALIGNED(int64_t);
> > > +LDP_TEST_ALIGNED(v4si);
> > > +LDP_TEST_UNALIGNED(int32_t);
> > > +LDP_TEST_UNALIGNED(int64_t);
> > > +LDP_TEST_UNALIGNED(v4si);
> > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > +
> > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> > > +
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_never.c b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > > new file mode 100644
> > > index 00000000000..b39941c18d7
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > > @@ -0,0 +1,66 @@
> > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=2 -mcpu=generic" } */
> > > +
> > > +#include <stdlib.h>
> > > +#include <stdint.h>
> > > +
> > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > +
> > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > +    TYPE a_0, a_1; \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    a_0 = arr[0]; \
> > > +    a_1 = arr[1]; \
> > > +    return a_0 + a_1; \
> > > +}
> > > +
> > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > +    TYPE a_0, a_1; \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    TYPE *a = arr+1; \
> > > +    a_0 = a[0]; \
> > > +    a_1 = a[1]; \
> > > +    return a_0 + a_1; \
> > > +}
> > > +
> > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    a_0 = arr[100]; \
> > > +    a_1 = arr[101]; \
> > > +    a_2 = arr[102]; \
> > > +    a_3 = arr[103]; \
> > > +    a_4 = arr[110]; \
> > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > +}
> > > +
> > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    TYPE *a = arr+1; \
> > > +    a_0 = a[100]; \
> > > +    a_1 = a[101]; \
> > > +    a_2 = a[102]; \
> > > +    a_3 = a[103]; \
> > > +    a_4 = a[110]; \
> > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > +}
> > > +
> > > +LDP_TEST_ALIGNED(int32_t);
> > > +LDP_TEST_ALIGNED(int64_t);
> > > +LDP_TEST_ALIGNED(v4si);
> > > +LDP_TEST_UNALIGNED(int32_t);
> > > +LDP_TEST_UNALIGNED(int64_t);
> > > +LDP_TEST_UNALIGNED(v4si);
> > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > +
> > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > > +
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_aligned.c b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > > new file mode 100644
> > > index 00000000000..01f294bb090
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > > @@ -0,0 +1,60 @@
> > > +/* { dg-options "-O2 --param=aarch64-stp-policy=3 -mcpu=generic" } */
> > > +
> > > +#include <stdlib.h>
> > > +#include <stdint.h>
> > > +
> > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > +
> > > +#define STP_TEST_ALIGNED(TYPE) \
> > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    arr[0] = x; \
> > > +    arr[1] = x; \
> > > +    return arr; \
> > > +}
> > > +
> > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    TYPE *a = arr+1; \
> > > +    a[0] = x; \
> > > +    a[1] = x; \
> > > +    return a; \
> > > +}
> > > +
> > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    arr[100] = x; \
> > > +    arr[101] = x; \
> > > +    arr[102] = x; \
> > > +    arr[103] = x; \
> > > +    return arr; \
> > > +}
> > > +
> > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    TYPE *a = arr+1; \
> > > +    a[100] = x; \
> > > +    a[101] = x; \
> > > +    a[102] = x; \
> > > +    a[103] = x; \
> > > +    return a; \
> > > +}
> > > +
> > > +STP_TEST_ALIGNED(int32_t);
> > > +STP_TEST_ALIGNED(int64_t);
> > > +STP_TEST_ALIGNED(v4si);
> > > +STP_TEST_UNALIGNED(int32_t);
> > > +STP_TEST_UNALIGNED(int64_t);
> > > +STP_TEST_UNALIGNED(v4si);
> > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > +
> > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > > +
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_always.c b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> > > new file mode 100644
> > > index 00000000000..cedb461b5b2
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> > > @@ -0,0 +1,60 @@
> > > +/* { dg-options "-O2 --param=aarch64-stp-policy=1 -mcpu=generic" } */
> > > +
> > > +#include <stdlib.h>
> > > +#include <stdint.h>
> > > +
> > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > +
> > > +#define STP_TEST_ALIGNED(TYPE) \
> > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    arr[0] = x; \
> > > +    arr[1] = x; \
> > > +    return arr; \
> > > +}
> > > +
> > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    TYPE *a = arr+1; \
> > > +    a[0] = x; \
> > > +    a[1] = x; \
> > > +    return a; \
> > > +}
> > > +
> > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    arr[100] = x; \
> > > +    arr[101] = x; \
> > > +    arr[102] = x; \
> > > +    arr[103] = x; \
> > > +    return arr; \
> > > +}
> > > +
> > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    TYPE *a = arr+1; \
> > > +    a[100] = x; \
> > > +    a[101] = x; \
> > > +    a[102] = x; \
> > > +    a[103] = x; \
> > > +    return a; \
> > > +}
> > > +
> > > +STP_TEST_ALIGNED(int32_t);
> > > +STP_TEST_ALIGNED(int64_t);
> > > +STP_TEST_ALIGNED(v4si);
> > > +STP_TEST_UNALIGNED(int32_t);
> > > +STP_TEST_UNALIGNED(int64_t);
> > > +STP_TEST_UNALIGNED(v4si);
> > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > +
> > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> > > +
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_never.c b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> > > new file mode 100644
> > > index 00000000000..ddde658f807
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> > > @@ -0,0 +1,60 @@
> > > +/* { dg-options "-O2 --param=aarch64-stp-policy=2 -mcpu=generic" } */
> > > +
> > > +#include <stdlib.h>
> > > +#include <stdint.h>
> > > +
> > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > +
> > > +#define STP_TEST_ALIGNED(TYPE) \
> > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    arr[0] = x; \
> > > +    arr[1] = x; \
> > > +    return arr; \
> > > +}
> > > +
> > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    TYPE *a = arr+1; \
> > > +    a[0] = x; \
> > > +    a[1] = x; \
> > > +    return a; \
> > > +}
> > > +
> > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    arr[100] = x; \
> > > +    arr[101] = x; \
> > > +    arr[102] = x; \
> > > +    arr[103] = x; \
> > > +    return arr; \
> > > +}
> > > +
> > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > +    TYPE *a = arr+1; \
> > > +    a[100] = x; \
> > > +    a[101] = x; \
> > > +    a[102] = x; \
> > > +    a[103] = x; \
> > > +    return a; \
> > > +}
> > > +
> > > +STP_TEST_ALIGNED(int32_t);
> > > +STP_TEST_ALIGNED(int64_t);
> > > +STP_TEST_ALIGNED(v4si);
> > > +STP_TEST_UNALIGNED(int32_t);
> > > +STP_TEST_UNALIGNED(int64_t);
> > > +STP_TEST_UNALIGNED(v4si);
> > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > +
> > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > > +
> > > --
> > > 2.40.1
> > >

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp formation.
  2023-09-25 20:04     ` Andrew Pinski
@ 2023-09-25 20:57       ` Andrew Pinski
  2023-09-26  6:06         ` Manos Anagnostakis
  2023-09-26  8:34       ` Kyrylo Tkachov
  1 sibling, 1 reply; 12+ messages in thread
From: Andrew Pinski @ 2023-09-25 20:57 UTC (permalink / raw)
  To: Philipp Tomsich; +Cc: Manos Anagnostakis, gcc-patches, Kyrylo Tkachov

On Mon, Sep 25, 2023 at 1:04 PM Andrew Pinski <pinskia@gmail.com> wrote:
>
> On Mon, Sep 25, 2023 at 12:59 PM Philipp Tomsich
> <philipp.tomsich@vrull.eu> wrote:
> >
> > On Mon, 25 Sept 2023 at 21:54, Andrew Pinski <pinskia@gmail.com> wrote:
> > >
> > > On Mon, Sep 25, 2023 at 12:50 PM Manos Anagnostakis
> > > <manos.anagnostakis@vrull.eu> wrote:
> > > >
> > > > This patch implements the following TODO in gcc/config/aarch64/aarch64.cc
> > > > to provide the requested behaviour for handling ldp and stp:
> > > >
> > > >   /* Allow the tuning structure to disable LDP instruction formation
> > > >      from combining instructions (e.g., in peephole2).
> > > >      TODO: Implement fine-grained tuning control for LDP and STP:
> > > >            1. control policies for load and store separately;
> > > >            2. support the following policies:
> > > >               - default (use what is in the tuning structure)
> > > >               - always
> > > >               - never
> > > >               - aligned (only if the compiler can prove that the
> > > >                 load will be aligned to 2 * element_size)  */
> > > >
> > > > It provides two new and concrete target-specific command-line parameters
> > > > -param=aarch64-ldp-policy= and -param=aarch64-stp-policy=
> > > > to give the ability to control load and store policies separately as
> > > > stated in part 1 of the TODO.
> > > >
> > > > The accepted values for both parameters are:
> > > > - 0: Use the policy of the tuning structure (default).
> > > > - 1: Emit ldp/stp regardless of alignment.
> > > > - 2: Do not emit ldp/stp.
> > > > - 3: In order to emit ldp/stp, first check if the load/store will
> > > >   be aligned to 2 * element_size.
> > >
> > > Instead of a number, would it make sense to use a string (ENUM)
> > > for this param?
> > > Also, I think using --param is a bad idea if it is going to be
> > > documented in the user manual.
> > > Maybe a -m option should be used instead.
> >
> > See https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631283.html
> > for the discussion triggering the change from -m... to --param and the
> > change to using a number instead of a string.
>
> That is the opposite of the current GCC practice across all targets.
> Things like this should be consistent, and if one target decides to do
> it differently, then maybe it should NOT.
> Anyway, we should document the correct coding style for options so we
> don't have these back-and-forths again.

Kyrylo:
>  It will have to take a number rather than a string but that should be okay, as long as the right values are documented in invoke.texi.

No, it does not need to be a number. --param=ranger-debug= does not
take a number; it takes an enum.
One of the benefits of moving --param support over to .opt is that it
allows more than just numbers.
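
To illustrate the enum approach, here is a rough standalone sketch. It is
not code from the patch or from GCC, and every identifier in it is
hypothetical. It assumes the parameter accepts the policy names from the
TODO (default, always, never, aligned) as strings, and it leaves out the
optimize-for-size exception that the patch applies to the aligned check:

```cpp
#include <cassert>
#include <string>

// Hypothetical policy enum; "use_tuning" stands for the TODO's "default".
enum class pair_policy { use_tuning, always, never, aligned };

// Map a user-supplied string onto a policy.  Returns false (and leaves
// OUT untouched) when the string is not one of the four known names.
bool parse_pair_policy (const std::string &text, pair_policy &out)
{
  if (text == "default")
    out = pair_policy::use_tuning;
  else if (text == "always")
    out = pair_policy::always;
  else if (text == "never")
    out = pair_policy::never;
  else if (text == "aligned")
    out = pair_policy::aligned;
  else
    return false;
  return true;
}

// Decide whether a load/store pair may be formed.  REQUESTED is the
// user's choice, TUNING is the concrete policy of the CPU tuning.
// KNOWN_ALIGN_BITS and MODE_ALIGN_BITS are assumed to play the roles of
// MEM_ALIGN and GET_MODE_ALIGNMENT in the patch (both in bits).
bool may_form_pair (pair_policy requested, pair_policy tuning,
                    unsigned known_align_bits, unsigned mode_align_bits)
{
  // The tuning structure must always name a concrete policy.
  assert (tuning != pair_policy::use_tuning);

  pair_policy effective
    = requested == pair_policy::use_tuning ? tuning : requested;

  switch (effective)
    {
    case pair_policy::always:
      return true;
    case pair_policy::never:
      return false;
    case pair_policy::aligned:
      return known_align_bits >= 2 * mode_align_bits;
    case pair_policy::use_tuning:
      break;  // Unreachable: resolved against TUNING above.
    }
  return false;
}
```

With names instead of numbers, a numeric value such as "3" is simply
rejected by the parser, and "default" falls back to whatever the tuning
structure specifies.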

Thanks,
Andrew


>
>
> Thanks,
> Andrew
>
> >
> > Thanks,
> > Philipp.
> >
> > >
> > > Thanks,
> > > Andrew
> > >
> > > >
> > > > gcc/ChangeLog:
> > > >         * config/aarch64/aarch64-protos.h (struct tune_params): Add
> > > >         appropriate enums for the policies.
> > > >         * config/aarch64/aarch64-tuning-flags.def
> > > >         (AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
> > > >         options.
> > > >         * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
> > > >         function to parse ldp-policy parameter.
> > > >         (aarch64_parse_stp_policy): New function to parse stp-policy parameter.
> > > >         (aarch64_override_options_internal): Call parsing functions.
> > > >         (aarch64_operands_ok_for_ldpstp): Add parameter-value check and
> > > >         alignment check and remove superseded ones.
> > > >         (aarch64_operands_adjust_ok_for_ldpstp): Add parameter-value check and
> > > >         alignment check and remove superseded ones.
> > > >         * config/aarch64/aarch64.opt: Add options.
> > > >         * doc/invoke.texi: Document the parameters accordingly.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >         * gcc.target/aarch64/ampere1-no_ldp_combine.c: Removed.
> > > >         * gcc.target/aarch64/ldp_aligned.c: New test.
> > > >         * gcc.target/aarch64/ldp_always.c: New test.
> > > >         * gcc.target/aarch64/ldp_never.c: New test.
> > > >         * gcc.target/aarch64/stp_aligned.c: New test.
> > > >         * gcc.target/aarch64/stp_always.c: New test.
> > > >         * gcc.target/aarch64/stp_never.c: New test.
> > > >
> > > > Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> > > > ---
> > > > Changes in v3:
> > > >         - Changed command-line options to target-specific parameters
> > > >           and documented them accordingly in doc/invoke.texi.
> > > >         - Removed ampere1-no_ldp_combine.c test as superseded.
> > > >
> > > >  gcc/config/aarch64/aarch64-protos.h           |  24 ++
> > > >  gcc/config/aarch64/aarch64-tuning-flags.def   |   8 -
> > > >  gcc/config/aarch64/aarch64.cc                 | 215 +++++++++++++-----
> > > >  gcc/config/aarch64/aarch64.opt                |   8 +
> > > >  gcc/doc/invoke.texi                           |  30 +++
> > > >  .../aarch64/ampere1-no_ldp_combine.c          |  11 -
> > > >  .../gcc.target/aarch64/ldp_aligned.c          |  66 ++++++
> > > >  gcc/testsuite/gcc.target/aarch64/ldp_always.c |  66 ++++++
> > > >  gcc/testsuite/gcc.target/aarch64/ldp_never.c  |  66 ++++++
> > > >  .../gcc.target/aarch64/stp_aligned.c          |  60 +++++
> > > >  gcc/testsuite/gcc.target/aarch64/stp_always.c |  60 +++++
> > > >  gcc/testsuite/gcc.target/aarch64/stp_never.c  |  60 +++++
> > > >  12 files changed, 600 insertions(+), 74 deletions(-)
> > > >  delete mode 100644 gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_always.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_never.c
> > > >
> > > > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> > > > index 70303d6fd95..be1d73490ed 100644
> > > > --- a/gcc/config/aarch64/aarch64-protos.h
> > > > +++ b/gcc/config/aarch64/aarch64-protos.h
> > > > @@ -568,6 +568,30 @@ struct tune_params
> > > >    /* Place prefetch struct pointer at the end to enable type checking
> > > >       errors when tune_params misses elements (e.g., from erroneous merges).  */
> > > >    const struct cpu_prefetch_tune *prefetch;
> > > > +/* An enum specifying how to handle load pairs using a fine-grained policy:
> > > > +   - LDP_POLICY_ALIGNED: Emit ldp if the source pointer is aligned
> > > > +   to at least double the alignment of the type.
> > > > +   - LDP_POLICY_ALWAYS: Emit ldp regardless of alignment.
> > > > +   - LDP_POLICY_NEVER: Do not emit ldp.  */
> > > > +
> > > > +  enum aarch64_ldp_policy_model
> > > > +  {
> > > > +    LDP_POLICY_ALIGNED,
> > > > +    LDP_POLICY_ALWAYS,
> > > > +    LDP_POLICY_NEVER
> > > > +  } ldp_policy_model;
> > > > +/* An enum specifying how to handle store pairs using a fine-grained policy:
> > > > +   - STP_POLICY_ALIGNED: Emit stp if the source pointer is aligned
> > > > +   to at least double the alignment of the type.
> > > > +   - STP_POLICY_ALWAYS: Emit stp regardless of alignment.
> > > > +   - STP_POLICY_NEVER: Do not emit stp.  */
> > > > +
> > > > +  enum aarch64_stp_policy_model
> > > > +  {
> > > > +    STP_POLICY_ALIGNED,
> > > > +    STP_POLICY_ALWAYS,
> > > > +    STP_POLICY_NEVER
> > > > +  } stp_policy_model;
> > > >  };
> > > >
> > > >  /* Classifies an address.
> > > > diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
> > > > index 52112ba7c48..774568e9106 100644
> > > > --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> > > > +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> > > > @@ -30,11 +30,6 @@
> > > >
> > > >  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
> > > >
> > > > -/* Don't create non-8 byte aligned load/store pair.  That is if the
> > > > -two load/stores are not at least 8 byte aligned don't create load/store
> > > > -pairs.   */
> > > > -AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)
> > > > -
> > > >  /* Some of the optional shift to some arthematic instructions are
> > > >     considered cheap.  Logical shift left <=4 with or without a
> > > >     zero extend are considered cheap.  Sign extend; non logical shift left
> > > > @@ -44,9 +39,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
> > > >  /* Disallow load/store pair instructions on Q-registers.  */
> > > >  AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)
> > > >
> > > > -/* Disallow load-pair instructions to be formed in combine/peephole.  */
> > > > -AARCH64_EXTRA_TUNING_OPTION ("no_ldp_combine", NO_LDP_COMBINE)
> > > > -
> > > >  AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
> > > >
> > > >  AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
> > > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > > > index 219c4ee6d4c..9eeb5469cf9 100644
> > > > --- a/gcc/config/aarch64/aarch64.cc
> > > > +++ b/gcc/config/aarch64/aarch64.cc
> > > > @@ -1357,7 +1357,9 @@ static const struct tune_params generic_tunings =
> > > >       Neoverse V1.  It does not have a noticeable effect on A64FX and should
> > > >       have at most a very minor effect on SVE2 cores.  */
> > > >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params cortexa35_tunings =
> > > > @@ -1391,7 +1393,9 @@ static const struct tune_params cortexa35_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params cortexa53_tunings =
> > > > @@ -1425,7 +1429,9 @@ static const struct tune_params cortexa53_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params cortexa57_tunings =
> > > > @@ -1459,7 +1465,9 @@ static const struct tune_params cortexa57_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),        /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params cortexa72_tunings =
> > > > @@ -1493,7 +1501,9 @@ static const struct tune_params cortexa72_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params cortexa73_tunings =
> > > > @@ -1527,7 +1537,9 @@ static const struct tune_params cortexa73_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >
> > > > @@ -1562,7 +1574,9 @@ static const struct tune_params exynosm1_tunings =
> > > >    48,  /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> > > > -  &exynosm1_prefetch_tune
> > > > +  &exynosm1_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params thunderxt88_tunings =
> > > > @@ -1594,8 +1608,10 @@ static const struct tune_params thunderxt88_tunings =
> > > >    2,   /* min_div_recip_mul_df.  */
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > > > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),    /* tune_flags.  */
> > > > -  &thunderxt88_prefetch_tune
> > > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > +  &thunderxt88_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params thunderx_tunings =
> > > > @@ -1627,9 +1643,10 @@ static const struct tune_params thunderx_tunings =
> > > >    2,   /* min_div_recip_mul_df.  */
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > > > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW
> > > > -   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> > > > -  &thunderx_prefetch_tune
> > > > +  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /* tune_flags.  */
> > > > +  &thunderx_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params tsv110_tunings =
> > > > @@ -1663,7 +1680,9 @@ static const struct tune_params tsv110_tunings =
> > > >    0,    /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,     /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),     /* tune_flags.  */
> > > > -  &tsv110_prefetch_tune
> > > > +  &tsv110_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params xgene1_tunings =
> > > > @@ -1696,7 +1715,9 @@ static const struct tune_params xgene1_tunings =
> > > >    17,  /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> > > > -  &xgene1_prefetch_tune
> > > > +  &xgene1_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params emag_tunings =
> > > > @@ -1729,7 +1750,9 @@ static const struct tune_params emag_tunings =
> > > >    17,  /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> > > > -  &xgene1_prefetch_tune
> > > > +  &xgene1_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params qdf24xx_tunings =
> > > > @@ -1763,7 +1786,9 @@ static const struct tune_params qdf24xx_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS, /* tune_flags.  */
> > > > -  &qdf24xx_prefetch_tune
> > > > +  &qdf24xx_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  /* Tuning structure for the Qualcomm Saphira core.  Default to falkor values
> > > > @@ -1799,7 +1824,9 @@ static const struct tune_params saphira_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),           /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params thunderx2t99_tunings =
> > > > @@ -1833,7 +1860,9 @@ static const struct tune_params thunderx2t99_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &thunderx2t99_prefetch_tune
> > > > +  &thunderx2t99_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params thunderx3t110_tunings =
> > > > @@ -1867,7 +1896,9 @@ static const struct tune_params thunderx3t110_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &thunderx3t110_prefetch_tune
> > > > +  &thunderx3t110_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params neoversen1_tunings =
> > > > @@ -1900,7 +1931,9 @@ static const struct tune_params neoversen1_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params ampere1_tunings =
> > > > @@ -1936,8 +1969,10 @@ static const struct tune_params ampere1_tunings =
> > > >    2,   /* min_div_recip_mul_df.  */
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> > > > -  &ampere1_prefetch_tune
> > > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > +  &ampere1_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params ampere1a_tunings =
> > > > @@ -1974,8 +2009,10 @@ static const struct tune_params ampere1a_tunings =
> > > >    2,   /* min_div_recip_mul_df.  */
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> > > > -  &ampere1_prefetch_tune
> > > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > +  &ampere1_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const advsimd_vec_cost neoversev1_advsimd_vector_cost =
> > > > @@ -2156,7 +2193,9 @@ static const struct tune_params neoversev1_tunings =
> > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> > > >     | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const sve_vec_cost neoverse512tvb_sve_vector_cost =
> > > > @@ -2293,7 +2332,9 @@ static const struct tune_params neoverse512tvb_tunings =
> > > >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const advsimd_vec_cost neoversen2_advsimd_vector_cost =
> > > > @@ -2483,7 +2524,9 @@ static const struct tune_params neoversen2_tunings =
> > > >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const advsimd_vec_cost neoversev2_advsimd_vector_cost =
> > > > @@ -2673,7 +2716,9 @@ static const struct tune_params neoversev2_tunings =
> > > >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params a64fx_tunings =
> > > > @@ -2706,7 +2751,9 @@ static const struct tune_params a64fx_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &a64fx_prefetch_tune
> > > > +  &a64fx_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  /* Support for fine-grained override of the tuning structures.  */
> > > > @@ -17819,6 +17866,34 @@ aarch64_parse_tune (const char *to_parse, const struct processor **res)
> > > >    return AARCH_PARSE_INVALID_ARG;
> > > >  }
> > > >
> > > > +/* Parse a command-line -param=aarch64-ldp-policy= parameter.  VALUE is
> > > > +   the value of the parameter.  */
> > > > +
> > > > +static void
> > > > +aarch64_parse_ldp_policy (const unsigned int value, struct tune_params* tune)
> > > > +{
> > > > +  if (value == 1)
> > > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALWAYS;
> > > > +  else if (value == 2)
> > > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_NEVER;
> > > > +  else if (value == 3)
> > > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALIGNED;
> > > > +}
> > > > +
> > > > +/* Parse a command-line -param=aarch64-stp-policy= parameter.  VALUE is
> > > > +   the value of the parameter.  */
> > > > +
> > > > +static void
> > > > +aarch64_parse_stp_policy (const unsigned int value, struct tune_params* tune)
> > > > +{
> > > > +  if (value == 1)
> > > > +    tune->stp_policy_model = tune_params::STP_POLICY_ALWAYS;
> > > > +  else if (value == 2)
> > > > +    tune->stp_policy_model = tune_params::STP_POLICY_NEVER;
> > > > +  else if (value == 3)
> > > > +    tune->stp_policy_model = tune_params::STP_POLICY_ALIGNED;
> > > > +}
> > > > +
> > > >  /* Parse TOKEN, which has length LENGTH to see if it is an option
> > > >     described in FLAG.  If it is, return the index bit for that fusion type.
> > > >     If not, error (printing OPTION_NAME) and return zero.  */
> > > > @@ -18167,6 +18242,12 @@ aarch64_override_options_internal (struct gcc_options *opts)
> > > >      aarch64_parse_override_string (opts->x_aarch64_override_tune_string,
> > > >                                    &aarch64_tune_params);
> > > >
> > > > +  aarch64_parse_ldp_policy (aarch64_ldp_policy,
> > > > +                             &aarch64_tune_params);
> > > > +
> > > > +  aarch64_parse_stp_policy (aarch64_stp_policy,
> > > > +                             &aarch64_tune_params);
> > > > +
> > > >    /* This target defaults to strict volatile bitfields.  */
> > > >    if (opts->x_flag_strict_volatile_bitfields < 0 && abi_version_at_least (2))
> > > >      opts->x_flag_strict_volatile_bitfields = 1;
> > > > @@ -26468,18 +26549,14 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
> > > >    enum reg_class rclass_1, rclass_2;
> > > >    rtx mem_1, mem_2, reg_1, reg_2;
> > > >
> > > > -  /* Allow the tuning structure to disable LDP instruction formation
> > > > -     from combining instructions (e.g., in peephole2).
> > > > -     TODO: Implement fine-grained tuning control for LDP and STP:
> > > > -          1. control policies for load and store separately;
> > > > -          2. support the following policies:
> > > > -             - default (use what is in the tuning structure)
> > > > -             - always
> > > > -             - never
> > > > -             - aligned (only if the compiler can prove that the
> > > > -               load will be aligned to 2 * element_size)  */
> > > > -  if (load && (aarch64_tune_params.extra_tuning_flags
> > > > -              & AARCH64_EXTRA_TUNE_NO_LDP_COMBINE))
> > > > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> > > > +  if (load
> > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> > > > +    return false;
> > > > +
> > > > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> > > > +  if (!load
> > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
> > > >      return false;
> > > >
> > > >    if (load)
> > > > @@ -26506,13 +26583,22 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
> > > >    if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
> > > >      return false;
> > > >
> > > > -  /* If we have SImode and slow unaligned ldp,
> > > > -     check the alignment to be at least 8 byte. */
> > > > -  if (mode == SImode
> > > > -      && (aarch64_tune_params.extra_tuning_flags
> > > > -          & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> > > > -      && !optimize_size
> > > > -      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
> > > > +  /* If we have LDP_POLICY_ALIGNED,
> > > > +     do not emit the load pair unless the alignment is checked to be
> > > > +     at least double the alignment of the type.  */
> > > > +  if (load
> > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> > > > +      && !optimize_function_for_size_p (cfun)
> > > > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> > > > +    return false;
> > > > +
> > > > +  /* If we have STP_POLICY_ALIGNED,
> > > > +     do not emit the store pair unless the alignment is checked to be
> > > > +     at least double the alignment of the type.  */
> > > > +  if (!load
> > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> > > > +      && !optimize_function_for_size_p (cfun)
> > > > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> > > >      return false;
> > > >
> > > >    /* Check if the addresses are in the form of [base+offset].  */
> > > > @@ -26640,6 +26726,16 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
> > > >    HOST_WIDE_INT offvals[num_insns], msize;
> > > >    rtx mem[num_insns], reg[num_insns], base[num_insns], offset[num_insns];
> > > >
> > > > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> > > > +  if (load
> > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> > > > +    return false;
> > > > +
> > > > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> > > > +  if (!load
> > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
> > > > +    return false;
> > > > +
> > > >    if (load)
> > > >      {
> > > >        for (int i = 0; i < num_insns; i++)
> > > > @@ -26729,13 +26825,22 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
> > > >    if (offvals[0] % msize != offvals[2] % msize)
> > > >      return false;
> > > >
> > > > -  /* If we have SImode and slow unaligned ldp,
> > > > -     check the alignment to be at least 8 byte. */
> > > > -  if (mode == SImode
> > > > -      && (aarch64_tune_params.extra_tuning_flags
> > > > -         & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> > > > -      && !optimize_size
> > > > -      && MEM_ALIGN (mem[0]) < 8 * BITS_PER_UNIT)
> > > > +  /* If we have LDP_POLICY_ALIGNED,
> > > > +     do not emit the load pair unless the alignment is checked to be
> > > > +     at least double the alignment of the type.  */
> > > > +  if (load
> > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> > > > +      && !optimize_function_for_size_p (cfun)
> > > > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> > > > +    return false;
> > > > +
> > > > +  /* If we have STP_POLICY_ALIGNED,
> > > > +     do not emit the store pair unless the alignment is checked to be
> > > > +     at least double the alignment of the type.  */
> > > > +  if (!load
> > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> > > > +      && !optimize_function_for_size_p (cfun)
> > > > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> > > >      return false;
> > > >
> > > >    return true;
> > > > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> > > > index 4a0580435a8..f61e3f968d4 100644
> > > > --- a/gcc/config/aarch64/aarch64.opt
> > > > +++ b/gcc/config/aarch64/aarch64.opt
> > > > @@ -337,3 +337,11 @@ Constant memset size in bytes from which to start using MOPS sequence.
> > > >  -param=aarch64-vect-unroll-limit=
> > > >  Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param
> > > >  Limit how much the autovectorizer may unroll a loop.
> > > > +
> > > > +-param=aarch64-ldp-policy=
> > > > +Target Joined UInteger Var(aarch64_ldp_policy) Init(0) IntegerRange(0, 3) Param
> > > > +Fine-grained policy for load pairs.
> > > > +
> > > > +-param=aarch64-stp-policy=
> > > > +Target Joined UInteger Var(aarch64_stp_policy) Init(0) IntegerRange(0, 3) Param
> > > > +Fine-grained policy for store pairs.
> > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > > index 146b40414b0..000dd5541f4 100644
> > > > --- a/gcc/doc/invoke.texi
> > > > +++ b/gcc/doc/invoke.texi
> > > > @@ -16508,6 +16508,36 @@ Use both Advanced SIMD and SVE.  Prefer SVE when the costs are deemed equal.
> > > >  @end table
> > > >  The default value is 0.
> > > >
> > > > +@item aarch64-ldp-policy
> > > > +Fine-grained policy for load pairs.  Accepts values from 0 to 3, inclusive.
> > > > +@table @samp
> > > > +@item 0
> > > > +Use the policy of the tuning structure.
> > > > +@item 1
> > > > +Emit ldp regardless of alignment.
> > > > +@item 2
> > > > +Do not emit ldp.
> > > > +@item 3
> > > > +Emit ldp only if the source pointer is aligned to at least double the alignment
> > > > +of the type.
> > > > +@end table
> > > > +The default value is 0.
> > > > +
> > > > +@item aarch64-stp-policy
> > > > +Fine-grained policy for store pairs.  Accepts values from 0 to 3, inclusive.
> > > > +@table @samp
> > > > +@item 0
> > > > +Use the policy of the tuning structure.
> > > > +@item 1
> > > > +Emit stp regardless of alignment.
> > > > +@item 2
> > > > +Do not emit stp.
> > > > +@item 3
> > > > +Emit stp only if the source pointer is aligned to at least double the alignment
> > > > +of the type.
> > > > +@end table
> > > > +The default value is 0.
> > > > +
> > > >  @item aarch64-loop-vect-issue-rate-niters
> > > >  The tuning for some AArch64 CPUs tries to take both latencies and issue
> > > >  rates into account when deciding whether a loop should be vectorized
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c b/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > > > deleted file mode 100644
> > > > index bc871f4481d..00000000000
> > > > --- a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > > > +++ /dev/null
> > > > @@ -1,11 +0,0 @@
> > > > -/* { dg-options "-O3 -mtune=ampere1" } */
> > > > -
> > > > -long
> > > > -foo (long a[])
> > > > -{
> > > > -  return a[0] + a[1];
> > > > -}
> > > > -
> > > > -/* We should see two ldrs instead of one ldp. */
> > > > -/* { dg-final { scan-assembler {\tldr\t} } } */
> > > > -/* { dg-final { scan-assembler-not {\tldp\t} } } */
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > > > new file mode 100644
> > > > index 00000000000..8e43faab70d
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > > > @@ -0,0 +1,66 @@
> > > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=3 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[0]; \
> > > > +    a_1 = arr[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[0]; \
> > > > +    a_1 = a[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[100]; \
> > > > +    a_1 = arr[101]; \
> > > > +    a_2 = arr[102]; \
> > > > +    a_3 = arr[103]; \
> > > > +    a_4 = arr[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[100]; \
> > > > +    a_1 = a[101]; \
> > > > +    a_2 = a[102]; \
> > > > +    a_3 = a[103]; \
> > > > +    a_4 = a[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +LDP_TEST_ALIGNED(int32_t);
> > > > +LDP_TEST_ALIGNED(int64_t);
> > > > +LDP_TEST_ALIGNED(v4si);
> > > > +LDP_TEST_UNALIGNED(int32_t);
> > > > +LDP_TEST_UNALIGNED(int64_t);
> > > > +LDP_TEST_UNALIGNED(v4si);
> > > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > > > +
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_always.c b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > > > new file mode 100644
> > > > index 00000000000..532ca607565
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > > > @@ -0,0 +1,66 @@
> > > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=1 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[0]; \
> > > > +    a_1 = arr[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[0]; \
> > > > +    a_1 = a[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[100]; \
> > > > +    a_1 = arr[101]; \
> > > > +    a_2 = arr[102]; \
> > > > +    a_3 = arr[103]; \
> > > > +    a_4 = arr[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[100]; \
> > > > +    a_1 = a[101]; \
> > > > +    a_2 = a[102]; \
> > > > +    a_3 = a[103]; \
> > > > +    a_4 = a[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +LDP_TEST_ALIGNED(int32_t);
> > > > +LDP_TEST_ALIGNED(int64_t);
> > > > +LDP_TEST_ALIGNED(v4si);
> > > > +LDP_TEST_UNALIGNED(int32_t);
> > > > +LDP_TEST_UNALIGNED(int64_t);
> > > > +LDP_TEST_UNALIGNED(v4si);
> > > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> > > > +
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_never.c b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > > > new file mode 100644
> > > > index 00000000000..b39941c18d7
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > > > @@ -0,0 +1,66 @@
> > > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=2 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[0]; \
> > > > +    a_1 = arr[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[0]; \
> > > > +    a_1 = a[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[100]; \
> > > > +    a_1 = arr[101]; \
> > > > +    a_2 = arr[102]; \
> > > > +    a_3 = arr[103]; \
> > > > +    a_4 = arr[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[100]; \
> > > > +    a_1 = a[101]; \
> > > > +    a_2 = a[102]; \
> > > > +    a_3 = a[103]; \
> > > > +    a_4 = a[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +LDP_TEST_ALIGNED(int32_t);
> > > > +LDP_TEST_ALIGNED(int64_t);
> > > > +LDP_TEST_ALIGNED(v4si);
> > > > +LDP_TEST_UNALIGNED(int32_t);
> > > > +LDP_TEST_UNALIGNED(int64_t);
> > > > +LDP_TEST_UNALIGNED(v4si);
> > > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > > > +
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_aligned.c b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > > > new file mode 100644
> > > > index 00000000000..01f294bb090
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > > > @@ -0,0 +1,60 @@
> > > > +/* { dg-options "-O2 --param=aarch64-stp-policy=3 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define STP_TEST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[0] = x; \
> > > > +    arr[1] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[0] = x; \
> > > > +    a[1] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[100] = x; \
> > > > +    arr[101] = x; \
> > > > +    arr[102] = x; \
> > > > +    arr[103] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[100] = x; \
> > > > +    a[101] = x; \
> > > > +    a[102] = x; \
> > > > +    a[103] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +STP_TEST_ALIGNED(int32_t);
> > > > +STP_TEST_ALIGNED(int64_t);
> > > > +STP_TEST_ALIGNED(v4si);
> > > > +STP_TEST_UNALIGNED(int32_t);
> > > > +STP_TEST_UNALIGNED(int64_t);
> > > > +STP_TEST_UNALIGNED(v4si);
> > > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > > > +
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_always.c b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> > > > new file mode 100644
> > > > index 00000000000..cedb461b5b2
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> > > > @@ -0,0 +1,60 @@
> > > > +/* { dg-options "-O2 --param=aarch64-stp-policy=1 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define STP_TEST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[0] = x; \
> > > > +    arr[1] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[0] = x; \
> > > > +    a[1] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[100] = x; \
> > > > +    arr[101] = x; \
> > > > +    arr[102] = x; \
> > > > +    arr[103] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[100] = x; \
> > > > +    a[101] = x; \
> > > > +    a[102] = x; \
> > > > +    a[103] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +STP_TEST_ALIGNED(int32_t);
> > > > +STP_TEST_ALIGNED(int64_t);
> > > > +STP_TEST_ALIGNED(v4si);
> > > > +STP_TEST_UNALIGNED(int32_t);
> > > > +STP_TEST_UNALIGNED(int64_t);
> > > > +STP_TEST_UNALIGNED(v4si);
> > > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> > > > +
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_never.c b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> > > > new file mode 100644
> > > > index 00000000000..ddde658f807
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> > > > @@ -0,0 +1,60 @@
> > > > +/* { dg-options "-O2 --param=aarch64-stp-policy=2 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define STP_TEST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[0] = x; \
> > > > +    arr[1] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[0] = x; \
> > > > +    a[1] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[100] = x; \
> > > > +    arr[101] = x; \
> > > > +    arr[102] = x; \
> > > > +    arr[103] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[100] = x; \
> > > > +    a[101] = x; \
> > > > +    a[102] = x; \
> > > > +    a[103] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +STP_TEST_ALIGNED(int32_t);
> > > > +STP_TEST_ALIGNED(int64_t);
> > > > +STP_TEST_ALIGNED(v4si);
> > > > +STP_TEST_UNALIGNED(int32_t);
> > > > +STP_TEST_UNALIGNED(int64_t);
> > > > +STP_TEST_UNALIGNED(v4si);
> > > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > > > +
> > > > --
> > > > 2.40.1
> > > >

* Re: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp formation.
  2023-09-25 20:57       ` Andrew Pinski
@ 2023-09-26  6:06         ` Manos Anagnostakis
  2023-09-26  8:34           ` Tamar Christina
  2023-09-26  8:35           ` Kyrylo Tkachov
  0 siblings, 2 replies; 12+ messages in thread
From: Manos Anagnostakis @ 2023-09-26  6:06 UTC (permalink / raw)
  To: gcc-patches; +Cc: Philipp Tomsich, Kyrylo Tkachov, Andrew Pinski

Thank you for the input, Andrew.

I've prepared an alternative patch that uses --param with an enum, which seems
the more suitable approach to me, since strings are more descriptive.

The current patch needed an adjustment to how the parsing functions are
called, to match the compiler coding style.

Both are bootstrapped and regtested.

I can send a V4 of whichever is preferred.
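
For anyone reading the tests in the quoted patch below, here is a minimal
standalone sketch (not part of the patch) of the pointer masking that the
LDP_TEST_* and STP_TEST_* macros use to build the "aligned" and "unaligned"
cases. The helper names align_down and is_aligned are hypothetical and exist
only for illustration; they say nothing about how the compiler itself proves
alignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Round ADDR down to a multiple of ALIGN, which is assumed to be a
   power of two.  This is the same masking the test macros apply with
   ALIGN = 2 * 8 * _Alignof (TYPE).  */
static uintptr_t
align_down (uintptr_t addr, uintptr_t align)
{
  return addr & ~(align - 1);
}

/* Return nonzero if ADDR is a multiple of ALIGN (a power of two).
   With ALIGN = 2 * sizeof (TYPE) this is the "2 * element_size"
   condition named in the TODO, checked here on a concrete address.  */
static int
is_aligned (uintptr_t addr, uintptr_t align)
{
  return (addr & (align - 1)) == 0;
}
```

With int64_t, the masked pointer is a multiple of 128 bytes, so it meets the
2 * element_size (16-byte) condition, while the same pointer advanced by one
element (arr+1 in the tests) does not.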

Thanks!

Manos.

On Mon, Sep 25, 2023 at 11:57 PM Andrew Pinski <pinskia@gmail.com> wrote:

> On Mon, Sep 25, 2023 at 1:04 PM Andrew Pinski <pinskia@gmail.com> wrote:
> >
> > On Mon, Sep 25, 2023 at 12:59 PM Philipp Tomsich
> > <philipp.tomsich@vrull.eu> wrote:
> > >
> > > On Mon, 25 Sept 2023 at 21:54, Andrew Pinski <pinskia@gmail.com>
> wrote:
> > > >
> > > > On Mon, Sep 25, 2023 at 12:50 PM Manos Anagnostakis
> > > > <manos.anagnostakis@vrull.eu> wrote:
> > > > >
> > > > > This patch implements the following TODO in
> gcc/config/aarch64/aarch64.cc
> > > > > to provide the requested behaviour for handling ldp and stp:
> > > > >
> > > > >   /* Allow the tuning structure to disable LDP instruction
> formation
> > > > >      from combining instructions (e.g., in peephole2).
> > > > >      TODO: Implement fine-grained tuning control for LDP and STP:
> > > > >            1. control policies for load and store separately;
> > > > >            2. support the following policies:
> > > > >               - default (use what is in the tuning structure)
> > > > >               - always
> > > > >               - never
> > > > >               - aligned (only if the compiler can prove that the
> > > > >                 load will be aligned to 2 * element_size)  */
> > > > >
> > > > > It provides two new and concrete target-specific command-line
> parameters
> > > > > -param=aarch64-ldp-policy= and -param=aarch64-stp-policy=
> > > > > to give the ability to control load and store policies separately
> as
> > > > > stated in part 1 of the TODO.
> > > > >
> > > > > The accepted values for both parameters are:
> > > > > - 0: Use the policy of the tuning structure (default).
> > > > > - 1: Emit ldp/stp regardless of alignment.
> > > > > - 2: Do not emit ldp/stp.
> > > > > - 3: In order to emit ldp/stp, first check if the load/store will
> > > > >   be aligned to 2 * element_size.
> > > >
> > > > Instead of a number, would it make sense to use a string
> > > > (ENUM) for this param?
> > > > Also I think using --param is a bad idea if it is going to be
> > > > documented in the user manual.
> > > > Maybe a -m option should be used instead.
> > >
> > > See
> https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631283.html
> > > for the discussion triggering the change from -m... to --param and the
> > > change to using a number instead of a string.
> >
> > That is the opposite of the current GCC practice across all targets.
> > Things like this should be consistent and if one target decides to do
> > it different, then maybe it should NOT.
> > Anyways we should document the correct coding style for options so we
> > don't have these back and forths again.
>
> Kyrylo:
> >  It will have to take a number rather than a string but that should be
> okay, as long as the right values are documented in invoke.texi.
>
> No, it does not need to be a number. --param=ranger-debug= does not
> take a number; it takes an enum.
> One of the benefits of moving --param support over to .opt is that it
> allows more than just numbers.
>
> Thanks,
> Andrew
>
>
> >
> >
> > Thanks,
> > Andrew
> >
> > >
> > > Thanks,
> > > Philipp.
> > >
> > > >
> > > > Thanks,
> > > > Andrew
> > > >
> > > > >
> > > > > gcc/ChangeLog:
> > > > >         * config/aarch64/aarch64-protos.h (struct tune_params): Add
> > > > >         appropriate enums for the policies.
> > > > >         * config/aarch64/aarch64-tuning-flags.def
> > > > >         (AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
> > > > >         options.
> > > > >         * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
> > > > >         function to parse ldp-policy parameter.
> > > > >         (aarch64_parse_stp_policy): New function to parse
> stp-policy parameter.
> > > > >         (aarch64_override_options_internal): Call parsing
> functions.
> > > > >         (aarch64_operands_ok_for_ldpstp): Add parameter-value
> check and
> > > > >         alignment check and remove superseded ones.
> > > > >         (aarch64_operands_adjust_ok_for_ldpstp): Add
> parameter-value check and
> > > > >         alignment check and remove superseded ones.
> > > > >         * config/aarch64/aarch64.opt: Add options.
> > > > >         * doc/invoke.texi: Document the parameters accordingly.
> > > > >
> > > > > gcc/testsuite/ChangeLog:
> > > > >         * gcc.target/aarch64/ampere1-no_ldp_combine.c: Removed.
> > > > >         * gcc.target/aarch64/ldp_aligned.c: New test.
> > > > >         * gcc.target/aarch64/ldp_always.c: New test.
> > > > >         * gcc.target/aarch64/ldp_never.c: New test.
> > > > >         * gcc.target/aarch64/stp_aligned.c: New test.
> > > > >         * gcc.target/aarch64/stp_always.c: New test.
> > > > >         * gcc.target/aarch64/stp_never.c: New test.
> > > > >
> > > > > Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> > > > > ---
> > > > > Changes in v3:
> > > > >         - Changed command-line options to target-specific
> parameters
> > > > >           and documented them accordingly in doc/invoke.texi.
> > > > >         - Removed ampere1-no_ldp_combine.c test as superseded.
> > > > >
> > > > >  gcc/config/aarch64/aarch64-protos.h           |  24 ++
> > > > >  gcc/config/aarch64/aarch64-tuning-flags.def   |   8 -
> > > > >  gcc/config/aarch64/aarch64.cc                 | 215
> +++++++++++++-----
> > > > >  gcc/config/aarch64/aarch64.opt                |   8 +
> > > > >  gcc/doc/invoke.texi                           |  30 +++
> > > > >  .../aarch64/ampere1-no_ldp_combine.c          |  11 -
> > > > >  .../gcc.target/aarch64/ldp_aligned.c          |  66 ++++++
> > > > >  gcc/testsuite/gcc.target/aarch64/ldp_always.c |  66 ++++++
> > > > >  gcc/testsuite/gcc.target/aarch64/ldp_never.c  |  66 ++++++
> > > > >  .../gcc.target/aarch64/stp_aligned.c          |  60 +++++
> > > > >  gcc/testsuite/gcc.target/aarch64/stp_always.c |  60 +++++
> > > > >  gcc/testsuite/gcc.target/aarch64/stp_never.c  |  60 +++++
> > > > >  12 files changed, 600 insertions(+), 74 deletions(-)
> > > > >  delete mode 100644
> gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_always.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_never.c
> > > > >
> > > > > diff --git a/gcc/config/aarch64/aarch64-protos.h
> b/gcc/config/aarch64/aarch64-protos.h
> > > > > index 70303d6fd95..be1d73490ed 100644
> > > > > --- a/gcc/config/aarch64/aarch64-protos.h
> > > > > +++ b/gcc/config/aarch64/aarch64-protos.h
> > > > > @@ -568,6 +568,30 @@ struct tune_params
> > > > >    /* Place prefetch struct pointer at the end to enable type
> checking
> > > > >       errors when tune_params misses elements (e.g., from
> erroneous merges).  */
> > > > >    const struct cpu_prefetch_tune *prefetch;
> > > > > +/* An enum specifying how to handle load pairs using a
> fine-grained policy:
> > > > > +   - LDP_POLICY_ALIGNED: Emit ldp if the source pointer is aligned
> > > > > +   to at least double the alignment of the type.
> > > > > +   - LDP_POLICY_ALWAYS: Emit ldp regardless of alignment.
> > > > > +   - LDP_POLICY_NEVER: Do not emit ldp.  */
> > > > > +
> > > > > +  enum aarch64_ldp_policy_model
> > > > > +  {
> > > > > +    LDP_POLICY_ALIGNED,
> > > > > +    LDP_POLICY_ALWAYS,
> > > > > +    LDP_POLICY_NEVER
> > > > > +  } ldp_policy_model;
> > > > > +/* An enum specifying how to handle store pairs using a
> fine-grained policy:
> > > > > +   - STP_POLICY_ALIGNED: Emit stp if the source pointer is aligned
> > > > > +   to at least double the alignment of the type.
> > > > > +   - STP_POLICY_ALWAYS: Emit stp regardless of alignment.
> > > > > +   - STP_POLICY_NEVER: Do not emit stp.  */
> > > > > +
> > > > > +  enum aarch64_stp_policy_model
> > > > > +  {
> > > > > +    STP_POLICY_ALIGNED,
> > > > > +    STP_POLICY_ALWAYS,
> > > > > +    STP_POLICY_NEVER
> > > > > +  } stp_policy_model;
> > > > >  };
> > > > >
> > > > >  /* Classifies an address.
> > > > > diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def
> b/gcc/config/aarch64/aarch64-tuning-flags.def
> > > > > index 52112ba7c48..774568e9106 100644
> > > > > --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> > > > > +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> > > > > @@ -30,11 +30,6 @@
> > > > >
> > > > >  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
> > > > >
> > > > > -/* Don't create non-8 byte aligned load/store pair.  That is if
> the
> > > > > -two load/stores are not at least 8 byte aligned don't create
> load/store
> > > > > -pairs.   */
> > > > > -AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw",
> SLOW_UNALIGNED_LDPW)
> > > > > -
> > > > >  /* Some of the optional shift to some arthematic instructions are
> > > > >     considered cheap.  Logical shift left <=4 with or without a
> > > > >     zero extend are considered cheap.  Sign extend; non logical
> shift left
> > > > > @@ -44,9 +39,6 @@ AARCH64_EXTRA_TUNING_OPTION
> ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
> > > > >  /* Disallow load/store pair instructions on Q-registers.  */
> > > > >  AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)
> > > > >
> > > > > -/* Disallow load-pair instructions to be formed in
> combine/peephole.  */
> > > > > -AARCH64_EXTRA_TUNING_OPTION ("no_ldp_combine", NO_LDP_COMBINE)
> > > > > -
> > > > >  AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
> > > > >
> > > > >  AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants",
> CSE_SVE_VL_CONSTANTS)
> > > > > diff --git a/gcc/config/aarch64/aarch64.cc
> b/gcc/config/aarch64/aarch64.cc
> > > > > index 219c4ee6d4c..9eeb5469cf9 100644
> > > > > --- a/gcc/config/aarch64/aarch64.cc
> > > > > +++ b/gcc/config/aarch64/aarch64.cc
> > > > > @@ -1357,7 +1357,9 @@ static const struct tune_params
> generic_tunings =
> > > > >       Neoverse V1.  It does not have a noticeable effect on A64FX
> and should
> > > > >       have at most a very minor effect on SVE2 cores.  */
> > > > >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS),   /* tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params cortexa35_tunings =
> > > > > @@ -1391,7 +1393,9 @@ static const struct tune_params
> cortexa35_tunings =
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.
> */
> > > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params cortexa53_tunings =
> > > > > @@ -1425,7 +1429,9 @@ static const struct tune_params
> cortexa53_tunings =
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.
> */
> > > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params cortexa57_tunings =
> > > > > @@ -1459,7 +1465,9 @@ static const struct tune_params
> cortexa57_tunings =
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.
> */
> > > > >    (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),        /* tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params cortexa72_tunings =
> > > > > @@ -1493,7 +1501,9 @@ static const struct tune_params
> cortexa72_tunings =
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.
> */
> > > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params cortexa73_tunings =
> > > > > @@ -1527,7 +1537,9 @@ static const struct tune_params
> cortexa73_tunings =
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.
> */
> > > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >
> > > > > @@ -1562,7 +1574,9 @@ static const struct tune_params
> exynosm1_tunings =
> > > > >    48,  /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
> > > > >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> > > > > -  &exynosm1_prefetch_tune
> > > > > +  &exynosm1_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params thunderxt88_tunings =
> > > > > @@ -1594,8 +1608,10 @@ static const struct tune_params
> thunderxt88_tunings =
> > > > >    2,   /* min_div_recip_mul_df.  */
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.
> */
> > > > > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),    /* tune_flags.  */
> > > > > -  &thunderxt88_prefetch_tune
> > > > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > +  &thunderxt88_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params thunderx_tunings =
> > > > > @@ -1627,9 +1643,10 @@ static const struct tune_params
> thunderx_tunings =
> > > > >    2,   /* min_div_recip_mul_df.  */
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.
> */
> > > > > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW
> > > > > -   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> > > > > -  &thunderx_prefetch_tune
> > > > > +  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /* tune_flags.  */
> > > > > +  &thunderx_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params tsv110_tunings =
> > > > > @@ -1663,7 +1680,9 @@ static const struct tune_params
> tsv110_tunings =
> > > > >    0,    /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,     /* autoprefetcher_model.
> */
> > > > >    (AARCH64_EXTRA_TUNE_NONE),     /* tune_flags.  */
> > > > > -  &tsv110_prefetch_tune
> > > > > +  &tsv110_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params xgene1_tunings =
> > > > > @@ -1696,7 +1715,9 @@ static const struct tune_params
> xgene1_tunings =
> > > > >    17,  /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.
> */
> > > > >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> > > > > -  &xgene1_prefetch_tune
> > > > > +  &xgene1_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params emag_tunings =
> > > > > @@ -1729,7 +1750,9 @@ static const struct tune_params emag_tunings
> =
> > > > >    17,  /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.
> */
> > > > >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> > > > > -  &xgene1_prefetch_tune
> > > > > +  &xgene1_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params qdf24xx_tunings =
> > > > > @@ -1763,7 +1786,9 @@ static const struct tune_params
> qdf24xx_tunings =
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.
> */
> > > > >    AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS, /* tune_flags.  */
> > > > > -  &qdf24xx_prefetch_tune
> > > > > +  &qdf24xx_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > > >  /* Tuning structure for the Qualcomm Saphira core.  Default to falkor values
> > > > > > @@ -1799,7 +1824,9 @@ static const struct tune_params saphira_tunings =
> > > > > >    0,   /* max_case_values.  */
> > > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > > >    (AARCH64_EXTRA_TUNE_NONE),           /* tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params thunderx2t99_tunings =
> > > > > @@ -1833,7 +1860,9 @@ static const struct tune_params
> thunderx2t99_tunings =
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.
> */
> > > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > -  &thunderx2t99_prefetch_tune
> > > > > +  &thunderx2t99_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params thunderx3t110_tunings =
> > > > > @@ -1867,7 +1896,9 @@ static const struct tune_params
> thunderx3t110_tunings =
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.
> */
> > > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > -  &thunderx3t110_prefetch_tune
> > > > > +  &thunderx3t110_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params neoversen1_tunings =
> > > > > @@ -1900,7 +1931,9 @@ static const struct tune_params
> neoversen1_tunings =
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.
> */
> > > > >    (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /* tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params ampere1_tunings =
> > > > > @@ -1936,8 +1969,10 @@ static const struct tune_params
> ampere1_tunings =
> > > > >    2,   /* min_div_recip_mul_df.  */
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.
> */
> > > > > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> > > > > -  &ampere1_prefetch_tune
> > > > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > +  &ampere1_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params ampere1a_tunings =
> > > > > @@ -1974,8 +2009,10 @@ static const struct tune_params
> ampere1a_tunings =
> > > > >    2,   /* min_div_recip_mul_df.  */
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.
> */
> > > > > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> > > > > -  &ampere1_prefetch_tune
> > > > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > +  &ampere1_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const advsimd_vec_cost neoversev1_advsimd_vector_cost =
> > > > > @@ -2156,7 +2193,9 @@ static const struct tune_params
> neoversev1_tunings =
> > > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> > > > >     | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const sve_vec_cost neoverse512tvb_sve_vector_cost =
> > > > > @@ -2293,7 +2332,9 @@ static const struct tune_params
> neoverse512tvb_tunings =
> > > > >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /*
> tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const advsimd_vec_cost neoversen2_advsimd_vector_cost =
> > > > > @@ -2483,7 +2524,9 @@ static const struct tune_params
> neoversen2_tunings =
> > > > >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /*
> tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const advsimd_vec_cost neoversev2_advsimd_vector_cost =
> > > > > @@ -2673,7 +2716,9 @@ static const struct tune_params
> neoversev2_tunings =
> > > > >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /*
> tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params a64fx_tunings =
> > > > > @@ -2706,7 +2751,9 @@ static const struct tune_params
> a64fx_tunings =
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.
> */
> > > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > -  &a64fx_prefetch_tune
> > > > > +  &a64fx_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  /* Support for fine-grained override of the tuning structures.  */
> > > > > @@ -17819,6 +17866,34 @@ aarch64_parse_tune (const char *to_parse,
> const struct processor **res)
> > > > >    return AARCH_PARSE_INVALID_ARG;
> > > > >  }
> > > > >
> > > > > > +/* Parse a command-line -param=aarch64-ldp-policy= parameter.  VALUE is
> > > > > > +   the value of the parameter.  */
> > > > > > +
> > > > > > +static void
> > > > > > +aarch64_parse_ldp_policy (const unsigned int value, struct tune_params* tune)
> > > > > +{
> > > > > +  if (value == 1)
> > > > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALWAYS;
> > > > > +  else if (value == 2)
> > > > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_NEVER;
> > > > > +  else if (value == 3)
> > > > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALIGNED;
> > > > > +}
> > > > > +
> > > > > > +/* Parse a command-line -param=aarch64-stp-policy= parameter.  VALUE is
> > > > > > +   the value of the parameter.  */
> > > > > > +
> > > > > > +static void
> > > > > > +aarch64_parse_stp_policy (const unsigned int value, struct tune_params* tune)
> > > > > +{
> > > > > +  if (value == 1)
> > > > > +    tune->stp_policy_model = tune_params::STP_POLICY_ALWAYS;
> > > > > +  else if (value == 2)
> > > > > +    tune->stp_policy_model = tune_params::STP_POLICY_NEVER;
> > > > > +  else if (value == 3)
> > > > > +    tune->stp_policy_model = tune_params::STP_POLICY_ALIGNED;
> > > > > +}
> > > > > +
> > > > >  /* Parse TOKEN, which has length LENGTH to see if it is an option
> > > > > >     described in FLAG.  If it is, return the index bit for that fusion type.
> > > > > >     If not, error (printing OPTION_NAME) and return zero.  */
> > > > > > @@ -18167,6 +18242,12 @@ aarch64_override_options_internal (struct gcc_options *opts)
> > > > > >      aarch64_parse_override_string (opts->x_aarch64_override_tune_string,
> > > > >                                    &aarch64_tune_params);
> > > > >
> > > > > +  aarch64_parse_ldp_policy (aarch64_ldp_policy,
> > > > > +                             &aarch64_tune_params);
> > > > > +
> > > > > +  aarch64_parse_stp_policy (aarch64_stp_policy,
> > > > > +                             &aarch64_tune_params);
> > > > > +
> > > > >    /* This target defaults to strict volatile bitfields.  */
> > > > > >    if (opts->x_flag_strict_volatile_bitfields < 0 && abi_version_at_least (2))
> > > > > >      opts->x_flag_strict_volatile_bitfields = 1;
> > > > > > @@ -26468,18 +26549,14 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
> > > > >    enum reg_class rclass_1, rclass_2;
> > > > >    rtx mem_1, mem_2, reg_1, reg_2;
> > > > >
> > > > > > -  /* Allow the tuning structure to disable LDP instruction formation
> > > > > -     from combining instructions (e.g., in peephole2).
> > > > > -     TODO: Implement fine-grained tuning control for LDP and STP:
> > > > > -          1. control policies for load and store separately;
> > > > > -          2. support the following policies:
> > > > > -             - default (use what is in the tuning structure)
> > > > > -             - always
> > > > > -             - never
> > > > > -             - aligned (only if the compiler can prove that the
> > > > > -               load will be aligned to 2 * element_size)  */
> > > > > -  if (load && (aarch64_tune_params.extra_tuning_flags
> > > > > -              & AARCH64_EXTRA_TUNE_NO_LDP_COMBINE))
> > > > > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> > > > > +  if (load
> > > > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> > > > > > +    return false;
> > > > > > +
> > > > > > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> > > > > > +  if (!load
> > > > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
> > > > >      return false;
> > > > >
> > > > >    if (load)
> > > > > @@ -26506,13 +26583,22 @@ aarch64_operands_ok_for_ldpstp (rtx
> *operands, bool load,
> > > > >    if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
> > > > >      return false;
> > > > >
> > > > > -  /* If we have SImode and slow unaligned ldp,
> > > > > -     check the alignment to be at least 8 byte. */
> > > > > -  if (mode == SImode
> > > > > -      && (aarch64_tune_params.extra_tuning_flags
> > > > > -          & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> > > > > -      && !optimize_size
> > > > > -      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
> > > > > +  /* If we have LDP_POLICY_ALIGNED,
> > > > > > +     do not emit the load pair unless the alignment is checked to be
> > > > > > +     at least double the alignment of the type.  */
> > > > > > +  if (load
> > > > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> > > > > +      && !optimize_function_for_size_p (cfun)
> > > > > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> > > > > +    return false;
> > > > > +
> > > > > +  /* If we have STP_POLICY_ALIGNED,
> > > > > > +     do not emit the store pair unless the alignment is checked to be
> > > > > > +     at least double the alignment of the type.  */
> > > > > > +  if (!load
> > > > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> > > > > +      && !optimize_function_for_size_p (cfun)
> > > > > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> > > > >      return false;
> > > > >
> > > > >    /* Check if the addresses are in the form of [base+offset].  */
> > > > > @@ -26640,6 +26726,16 @@ aarch64_operands_adjust_ok_for_ldpstp
> (rtx *operands, bool load,
> > > > >    HOST_WIDE_INT offvals[num_insns], msize;
> > > > >    rtx mem[num_insns], reg[num_insns], base[num_insns],
> offset[num_insns];
> > > > >
> > > > > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> > > > > +  if (load
> > > > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> > > > > > +    return false;
> > > > > > +
> > > > > > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> > > > > > +  if (!load
> > > > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
> > > > > +    return false;
> > > > > +
> > > > >    if (load)
> > > > >      {
> > > > >        for (int i = 0; i < num_insns; i++)
> > > > > @@ -26729,13 +26825,22 @@ aarch64_operands_adjust_ok_for_ldpstp
> (rtx *operands, bool load,
> > > > >    if (offvals[0] % msize != offvals[2] % msize)
> > > > >      return false;
> > > > >
> > > > > -  /* If we have SImode and slow unaligned ldp,
> > > > > -     check the alignment to be at least 8 byte. */
> > > > > -  if (mode == SImode
> > > > > -      && (aarch64_tune_params.extra_tuning_flags
> > > > > -         & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> > > > > -      && !optimize_size
> > > > > -      && MEM_ALIGN (mem[0]) < 8 * BITS_PER_UNIT)
> > > > > +  /* If we have LDP_POLICY_ALIGNED,
> > > > > > +     do not emit the load pair unless the alignment is checked to be
> > > > > > +     at least double the alignment of the type.  */
> > > > > > +  if (load
> > > > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> > > > > +      && !optimize_function_for_size_p (cfun)
> > > > > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> > > > > +    return false;
> > > > > +
> > > > > +  /* If we have STP_POLICY_ALIGNED,
> > > > > > +     do not emit the store pair unless the alignment is checked to be
> > > > > > +     at least double the alignment of the type.  */
> > > > > > +  if (!load
> > > > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> > > > > +      && !optimize_function_for_size_p (cfun)
> > > > > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> > > > >      return false;
> > > > >
> > > > >    return true;
> > > > > > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> > > > > > index 4a0580435a8..f61e3f968d4 100644
> > > > > > --- a/gcc/config/aarch64/aarch64.opt
> > > > > > +++ b/gcc/config/aarch64/aarch64.opt
> > > > > > @@ -337,3 +337,11 @@ Constant memset size in bytes from which to start using MOPS sequence.
> > > > > >  -param=aarch64-vect-unroll-limit=
> > > > > >  Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param
> > > > >  Limit how much the autovectorizer may unroll a loop.
> > > > > +
> > > > > +-param=aarch64-ldp-policy=
> > > > > > +Target Joined UInteger Var(aarch64_ldp_policy) Init(0) IntegerRange(0, 3) Param
> > > > > > +Fine-grained policy for load pairs.
> > > > > > +
> > > > > > +-param=aarch64-stp-policy=
> > > > > > +Target Joined UInteger Var(aarch64_stp_policy) Init(0) IntegerRange(0, 3) Param
> > > > > +Fine-grained policy for store pairs.
> > > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > > > index 146b40414b0..000dd5541f4 100644
> > > > > --- a/gcc/doc/invoke.texi
> > > > > +++ b/gcc/doc/invoke.texi
> > > > > @@ -16508,6 +16508,36 @@ Use both Advanced SIMD and SVE.  Prefer
> SVE when the costs are deemed equal.
> > > > >  @end table
> > > > >  The default value is 0.
> > > > >
> > > > > +@item aarch64-ldp-policy
> > > > > > +Fine-grained policy for load pairs.  Accepts values from 0 to 3, inclusive.
> > > > > +@table @samp
> > > > > +@item 0
> > > > > +Use the policy of the tuning structure.
> > > > > +@item 1
> > > > > +Emit ldp regardless of alignment.
> > > > > +@item 2
> > > > > +Do not emit ldp.
> > > > > +@item 3
> > > > > > +Emit ldp only if the source pointer is aligned to at least double the alignment
> > > > > +of the type.
> > > > > +@end table
> > > > > +The default value is 0.
> > > > > +
> > > > > +@item aarch64-stp-policy
> > > > > > +Fine-grained policy for store pairs.  Accepts values from 0 to 3, inclusive.
> > > > > +@table @samp
> > > > > +@item 0
> > > > > +Use the policy of the tuning structure.
> > > > > +@item 1
> > > > > +Emit stp regardless of alignment.
> > > > > +@item 2
> > > > > +Do not emit stp.
> > > > > +@item 3
> > > > > > +Emit stp only if the source pointer is aligned to at least double the alignment
> > > > > +of the type.
> > > > > +@end table
> > > > > +The default value is 0.
> > > > > +
> > > > >  @item aarch64-loop-vect-issue-rate-niters
> > > > > >  The tuning for some AArch64 CPUs tries to take both latencies and issue
> > > > > >  rates into account when deciding whether a loop should be vectorized
> > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c b/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > > > > deleted file mode 100644
> > > > > index bc871f4481d..00000000000
> > > > > --- a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > > > > +++ /dev/null
> > > > > @@ -1,11 +0,0 @@
> > > > > -/* { dg-options "-O3 -mtune=ampere1" } */
> > > > > -
> > > > > -long
> > > > > -foo (long a[])
> > > > > -{
> > > > > -  return a[0] + a[1];
> > > > > -}
> > > > > -
> > > > > -/* We should see two ldrs instead of one ldp. */
> > > > > -/* { dg-final { scan-assembler {\tldr\t} } } */
> > > > > -/* { dg-final { scan-assembler-not {\tldp\t} } } */
> > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > > > > new file mode 100644
> > > > > index 00000000000..8e43faab70d
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > > > > @@ -0,0 +1,66 @@
> > > > > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=3 -mcpu=generic" } */
> > > > > +
> > > > > +#include <stdlib.h>
> > > > > +#include <stdint.h>
> > > > > +
> > > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > > +
> > > > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1; \
> > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    a_0 = arr[0]; \
> > > > > +    a_1 = arr[1]; \
> > > > > +    return a_0 + a_1; \
> > > > > +}
> > > > > +
> > > > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1; \
> > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a_0 = a[0]; \
> > > > > +    a_1 = a[1]; \
> > > > > +    return a_0 + a_1; \
> > > > > +}
> > > > > +
> > > > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    a_0 = arr[100]; \
> > > > > +    a_1 = arr[101]; \
> > > > > +    a_2 = arr[102]; \
> > > > > +    a_3 = arr[103]; \
> > > > > +    a_4 = arr[110]; \
> > > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > > +}
> > > > > +
> > > > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a_0 = a[100]; \
> > > > > +    a_1 = a[101]; \
> > > > > +    a_2 = a[102]; \
> > > > > +    a_3 = a[103]; \
> > > > > +    a_4 = a[110]; \
> > > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > > +}
> > > > > +
> > > > > +LDP_TEST_ALIGNED(int32_t);
> > > > > +LDP_TEST_ALIGNED(int64_t);
> > > > > +LDP_TEST_ALIGNED(v4si);
> > > > > +LDP_TEST_UNALIGNED(int32_t);
> > > > > +LDP_TEST_UNALIGNED(int64_t);
> > > > > +LDP_TEST_UNALIGNED(v4si);
> > > > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > > +
> > > > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > > > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > > > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > > > > +
> > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_always.c b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > > > > new file mode 100644
> > > > > index 00000000000..532ca607565
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > > > > @@ -0,0 +1,66 @@
> > > > > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=1 -mcpu=generic" } */
> > > > > +
> > > > > +#include <stdlib.h>
> > > > > +#include <stdint.h>
> > > > > +
> > > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > > +
> > > > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1; \
> > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    a_0 = arr[0]; \
> > > > > +    a_1 = arr[1]; \
> > > > > +    return a_0 + a_1; \
> > > > > +}
> > > > > +
> > > > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1; \
> > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a_0 = a[0]; \
> > > > > +    a_1 = a[1]; \
> > > > > +    return a_0 + a_1; \
> > > > > +}
> > > > > +
> > > > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    a_0 = arr[100]; \
> > > > > +    a_1 = arr[101]; \
> > > > > +    a_2 = arr[102]; \
> > > > > +    a_3 = arr[103]; \
> > > > > +    a_4 = arr[110]; \
> > > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > > +}
> > > > > +
> > > > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a_0 = a[100]; \
> > > > > +    a_1 = a[101]; \
> > > > > +    a_2 = a[102]; \
> > > > > +    a_3 = a[103]; \
> > > > > +    a_4 = a[110]; \
> > > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > > +}
> > > > > +
> > > > > +LDP_TEST_ALIGNED(int32_t);
> > > > > +LDP_TEST_ALIGNED(int64_t);
> > > > > +LDP_TEST_ALIGNED(v4si);
> > > > > +LDP_TEST_UNALIGNED(int32_t);
> > > > > +LDP_TEST_UNALIGNED(int64_t);
> > > > > +LDP_TEST_UNALIGNED(v4si);
> > > > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > > +
> > > > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> > > > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> > > > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> > > > > +
> > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_never.c b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > > > > new file mode 100644
> > > > > index 00000000000..b39941c18d7
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > > > > @@ -0,0 +1,66 @@
> > > > > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=2 -mcpu=generic" } */
> > > > > +
> > > > > +#include <stdlib.h>
> > > > > +#include <stdint.h>
> > > > > +
> > > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > > +
> > > > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1; \
> > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    a_0 = arr[0]; \
> > > > > +    a_1 = arr[1]; \
> > > > > +    return a_0 + a_1; \
> > > > > +}
> > > > > +
> > > > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1; \
> > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a_0 = a[0]; \
> > > > > +    a_1 = a[1]; \
> > > > > +    return a_0 + a_1; \
> > > > > +}
> > > > > +
> > > > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    a_0 = arr[100]; \
> > > > > +    a_1 = arr[101]; \
> > > > > +    a_2 = arr[102]; \
> > > > > +    a_3 = arr[103]; \
> > > > > +    a_4 = arr[110]; \
> > > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > > +}
> > > > > +
> > > > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a_0 = a[100]; \
> > > > > +    a_1 = a[101]; \
> > > > > +    a_2 = a[102]; \
> > > > > +    a_3 = a[103]; \
> > > > > +    a_4 = a[110]; \
> > > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > > +}
> > > > > +
> > > > > +LDP_TEST_ALIGNED(int32_t);
> > > > > +LDP_TEST_ALIGNED(int64_t);
> > > > > +LDP_TEST_ALIGNED(v4si);
> > > > > +LDP_TEST_UNALIGNED(int32_t);
> > > > > +LDP_TEST_UNALIGNED(int64_t);
> > > > > +LDP_TEST_UNALIGNED(v4si);
> > > > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > > +
> > > > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > > > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > > > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > > > > +
> > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_aligned.c b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > > > > new file mode 100644
> > > > > index 00000000000..01f294bb090
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > > > > @@ -0,0 +1,60 @@
> > > > > > +/* { dg-options "-O2 --param=aarch64-stp-policy=3 -mcpu=generic" } */
> > > > > +
> > > > > +#include <stdlib.h>
> > > > > +#include <stdint.h>
> > > > > +
> > > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > > +
> > > > > +#define STP_TEST_ALIGNED(TYPE) \
> > > > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    arr[0] = x; \
> > > > > +    arr[1] = x; \
> > > > > +    return arr; \
> > > > > +}
> > > > > +
> > > > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a[0] = x; \
> > > > > +    a[1] = x; \
> > > > > +    return a; \
> > > > > +}
> > > > > +
> > > > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    arr[100] = x; \
> > > > > +    arr[101] = x; \
> > > > > +    arr[102] = x; \
> > > > > +    arr[103] = x; \
> > > > > +    return arr; \
> > > > > +}
> > > > > +
> > > > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a[100] = x; \
> > > > > +    a[101] = x; \
> > > > > +    a[102] = x; \
> > > > > +    a[103] = x; \
> > > > > +    return a; \
> > > > > +}
> > > > > +
> > > > > +STP_TEST_ALIGNED(int32_t);
> > > > > +STP_TEST_ALIGNED(int64_t);
> > > > > +STP_TEST_ALIGNED(v4si);
> > > > > +STP_TEST_UNALIGNED(int32_t);
> > > > > +STP_TEST_UNALIGNED(int64_t);
> > > > > +STP_TEST_UNALIGNED(v4si);
> > > > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > > +
> > > > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > > > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > > > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > > > > +
> > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_always.c b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> > > > > new file mode 100644
> > > > > index 00000000000..cedb461b5b2
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> > > > > @@ -0,0 +1,60 @@
> > > > > +/* { dg-options "-O2 --param=aarch64-stp-policy=1 -mcpu=generic" } */
> > > > > +
> > > > > +#include <stdlib.h>
> > > > > +#include <stdint.h>
> > > > > +
> > > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > > +
> > > > > +#define STP_TEST_ALIGNED(TYPE) \
> > > > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    arr[0] = x; \
> > > > > +    arr[1] = x; \
> > > > > +    return arr; \
> > > > > +}
> > > > > +
> > > > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a[0] = x; \
> > > > > +    a[1] = x; \
> > > > > +    return a; \
> > > > > +}
> > > > > +
> > > > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    arr[100] = x; \
> > > > > +    arr[101] = x; \
> > > > > +    arr[102] = x; \
> > > > > +    arr[103] = x; \
> > > > > +    return arr; \
> > > > > +}
> > > > > +
> > > > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a[100] = x; \
> > > > > +    a[101] = x; \
> > > > > +    a[102] = x; \
> > > > > +    a[103] = x; \
> > > > > +    return a; \
> > > > > +}
> > > > > +
> > > > > +STP_TEST_ALIGNED(int32_t);
> > > > > +STP_TEST_ALIGNED(int64_t);
> > > > > +STP_TEST_ALIGNED(v4si);
> > > > > +STP_TEST_UNALIGNED(int32_t);
> > > > > +STP_TEST_UNALIGNED(int64_t);
> > > > > +STP_TEST_UNALIGNED(v4si);
> > > > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > > +
> > > > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> > > > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> > > > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> > > > > +
> > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_never.c b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> > > > > new file mode 100644
> > > > > index 00000000000..ddde658f807
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> > > > > @@ -0,0 +1,60 @@
> > > > > +/* { dg-options "-O2 --param=aarch64-stp-policy=2 -mcpu=generic" } */
> > > > > +
> > > > > +#include <stdlib.h>
> > > > > +#include <stdint.h>
> > > > > +
> > > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > > +
> > > > > +#define STP_TEST_ALIGNED(TYPE) \
> > > > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    arr[0] = x; \
> > > > > +    arr[1] = x; \
> > > > > +    return arr; \
> > > > > +}
> > > > > +
> > > > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a[0] = x; \
> > > > > +    a[1] = x; \
> > > > > +    return a; \
> > > > > +}
> > > > > +
> > > > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    arr[100] = x; \
> > > > > +    arr[101] = x; \
> > > > > +    arr[102] = x; \
> > > > > +    arr[103] = x; \
> > > > > +    return arr; \
> > > > > +}
> > > > > +
> > > > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a[100] = x; \
> > > > > +    a[101] = x; \
> > > > > +    a[102] = x; \
> > > > > +    a[103] = x; \
> > > > > +    return a; \
> > > > > +}
> > > > > +
> > > > > +STP_TEST_ALIGNED(int32_t);
> > > > > +STP_TEST_ALIGNED(int64_t);
> > > > > +STP_TEST_ALIGNED(v4si);
> > > > > +STP_TEST_UNALIGNED(int32_t);
> > > > > +STP_TEST_UNALIGNED(int64_t);
> > > > > +STP_TEST_UNALIGNED(v4si);
> > > > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > > +
> > > > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > > > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > > > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > > > > +
> > > > > --
> > > > > 2.40.1
> > > > >
>


-- 
*Manos Anagnostakis | Compiler Engineer |*
E: manos.anagnostakis@vrull.eu <makeljana.shkurti@vrull.eu>

*VRULL GmbH *| Beatrixgasse 32 1030 Vienna |
 W: www.vrull.eu | LinkedIn <https://www.linkedin.com/company/vrull/>


* RE: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp formation.
  2023-09-26  6:06         ` Manos Anagnostakis
@ 2023-09-26  8:34           ` Tamar Christina
  2023-09-26  8:35           ` Kyrylo Tkachov
  1 sibling, 0 replies; 12+ messages in thread
From: Tamar Christina @ 2023-09-26  8:34 UTC (permalink / raw)
  To: Manos Anagnostakis, gcc-patches
  Cc: Philipp Tomsich, Kyrylo Tkachov, Andrew Pinski

[-- Attachment #1: Type: text/plain, Size: 52799 bytes --]

Hi Manos,

Could I ask that you extract the conditions you’re checking for in aarch64_operands_ok_for_ldpstp into their own function?

We are in the process of rewriting load/store pairs and the new pass does not use this helper function as it discovers
them differently. It would be easier for us if the check could be called on the mem rtx directly.
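
A standalone check of this kind might look roughly like the sketch below. It is plain C with hypothetical stand-ins (enum pair_policy, struct mem_ref, pair_allowed_by_policy) instead of GCC's rtx and tune_params types, and it assumes that "aligned" means the provable alignment is at least 2 * element_size, as the TODO in the patch description states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy enum; the patch keeps separate ldp/stp enums
   inside struct tune_params.  */
enum pair_policy
{
  PAIR_POLICY_ALWAYS,
  PAIR_POLICY_NEVER,
  PAIR_POLICY_ALIGNED
};

/* Hypothetical summary of a memory reference.  In GCC this information
   would come from the mem rtx itself.  */
struct mem_ref
{
  unsigned align_bytes; /* Alignment the compiler can prove, in bytes.  */
  unsigned elem_bytes;  /* Size of one element of the pair, in bytes.  */
};

/* Return true if POLICY permits forming a load/store pair on MEM.
   Only the policy and alignment conditions are modelled here; register
   class, offset range and similar checks are left to the caller.  */
bool
pair_allowed_by_policy (enum pair_policy policy, const struct mem_ref *mem)
{
  assert (mem != NULL && mem->elem_bytes > 0);

  switch (policy)
    {
    case PAIR_POLICY_ALWAYS:
      return true;
    case PAIR_POLICY_NEVER:
      return false;
    case PAIR_POLICY_ALIGNED:
      /* The pair must be aligned to 2 * element_size.  */
      return mem->align_bytes >= 2 * mem->elem_bytes;
    }
  return false;
}
```

A caller would fill in struct mem_ref from whatever it knows about the memory reference, so the same check could serve both the existing peephole helpers and a pass that discovers pairs differently.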

Kind Regards,
Tamar

From: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
Sent: Tuesday, September 26, 2023 7:06 AM
To: gcc-patches@gcc.gnu.org
Cc: Philipp Tomsich <philipp.tomsich@vrull.eu>; Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>; Andrew Pinski <pinskia@gmail.com>
Subject: Re: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp formation.

Thank you Andrew for the input.

I've prepared a patch using --param with an enum, which seems a more suitable approach to me, as strings are also more descriptive.
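
As a rough illustration of the string-valued variant, the sketch below maps policy names to an enum in plain C. The names parse_policy_name and enum named_policy are hypothetical, and the spellings "default", "always", "never" and "aligned" are assumed from the policy list in the TODO; in GCC itself the mapping would presumably be declared through the option machinery rather than hand-written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical enum mirroring the four policies named in the TODO.  */
enum named_policy
{
  NAMED_POLICY_DEFAULT,
  NAMED_POLICY_ALWAYS,
  NAMED_POLICY_NEVER,
  NAMED_POLICY_ALIGNED
};

/* Look NAME up in a small table.  On success store the policy in *OUT
   and return true; on an unknown name leave *OUT untouched and return
   false.  */
bool
parse_policy_name (const char *name, enum named_policy *out)
{
  static const struct
  {
    const char *name;
    enum named_policy value;
  } table[] = {
    { "default", NAMED_POLICY_DEFAULT },
    { "always", NAMED_POLICY_ALWAYS },
    { "never", NAMED_POLICY_NEVER },
    { "aligned", NAMED_POLICY_ALIGNED },
  };

  assert (name != NULL && out != NULL);

  for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
    if (strcmp (name, table[i].name) == 0)
      {
        *out = table[i].value;
        return true;
      }
  return false;
}
```

Compared with the numeric form, an unknown spelling is rejected explicitly instead of silently falling through to the tuning-structure default.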

The current patch needed an adjustment on how to call the parsing functions to match the compiler coding style.

Both are bootstrapped and regtested.

I can send a V4 of whichever is preferred.

Thanks!

Manos.

On Mon, Sep 25, 2023 at 11:57 PM Andrew Pinski <pinskia@gmail.com> wrote:
On Mon, Sep 25, 2023 at 1:04 PM Andrew Pinski <pinskia@gmail.com> wrote:
>
> On Mon, Sep 25, 2023 at 12:59 PM Philipp Tomsich
> <philipp.tomsich@vrull.eu> wrote:
> >
> > On Mon, 25 Sept 2023 at 21:54, Andrew Pinski <pinskia@gmail.com> wrote:
> > >
> > > On Mon, Sep 25, 2023 at 12:50 PM Manos Anagnostakis
> > > <manos.anagnostakis@vrull.eu> wrote:
> > > >
> > > > This patch implements the following TODO in gcc/config/aarch64/aarch64.cc
> > > > to provide the requested behaviour for handling ldp and stp:
> > > >
> > > >   /* Allow the tuning structure to disable LDP instruction formation
> > > >      from combining instructions (e.g., in peephole2).
> > > >      TODO: Implement fine-grained tuning control for LDP and STP:
> > > >            1. control policies for load and store separately;
> > > >            2. support the following policies:
> > > >               - default (use what is in the tuning structure)
> > > >               - always
> > > >               - never
> > > >               - aligned (only if the compiler can prove that the
> > > >                 load will be aligned to 2 * element_size)  */
> > > >
> > > > It provides two new and concrete target-specific command-line parameters
> > > > -param=aarch64-ldp-policy= and -param=aarch64-stp-policy=
> > > > to give the ability to control load and store policies separately as
> > > > stated in part 1 of the TODO.
> > > >
> > > > The accepted values for both parameters are:
> > > > - 0: Use the policy of the tuning structure (default).
> > > > - 1: Emit ldp/stp regardless of alignment.
> > > > - 2: Do not emit ldp/stp.
> > > > - 3: In order to emit ldp/stp, first check if the load/store will
> > > >   be aligned to 2 * element_size.
> > >
> > > Instead of a number, does it make sense to use a string
> > > (ENUM) for this param?
> > > Also I think using --param is a bad idea if it is going to be
> > > documented in the user manual.
> > > Maybe a -m option should be used instead.
> >
> > See https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631283.html
> > for the discussion triggering the change from -m... to --param and the
> > change to using a number instead of a string.
>
> That is the opposite of the current GCC practice across all targets.
> Things like this should be consistent and if one target decides to do
> it differently, then maybe it should NOT.
> Anyways we should document the correct coding style for options so we
> don't have these back and forths again.

Kyrylo:
>  It will have to take a number rather than a string but that should be okay, as long as the right values are documented in invoke.texi.

No, it does not need to be a number. --param=ranger-debug= does not
take a number; it takes an enum.
One of the benefits of moving --param support over to .opt is that it
allows more than just numbers.

Thanks,
Andrew


>
>
> Thanks,
> Andrew
>
> >
> > Thanks,
> > Philipp.
> >
> > >
> > > Thanks,
> > > Andrew
> > >
> > > >
> > > > gcc/ChangeLog:
> > > >         * config/aarch64/aarch64-protos.h (struct tune_params): Add
> > > >         appropriate enums for the policies.
> > > >         * config/aarch64/aarch64-tuning-flags.def
> > > >         (AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
> > > >         options.
> > > >         * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
> > > >         function to parse ldp-policy parameter.
> > > >         (aarch64_parse_stp_policy): New function to parse stp-policy parameter.
> > > >         (aarch64_override_options_internal): Call parsing functions.
> > > >         (aarch64_operands_ok_for_ldpstp): Add parameter-value check and
> > > >         alignment check and remove superseded ones.
> > > >         (aarch64_operands_adjust_ok_for_ldpstp): Add parameter-value check and
> > > >         alignment check and remove superseded ones.
> > > >         * config/aarch64/aarch64.opt: Add options.
> > > >         * doc/invoke.texi: Document the parameters accordingly.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >         * gcc.target/aarch64/ampere1-no_ldp_combine.c: Removed.
> > > >         * gcc.target/aarch64/ldp_aligned.c: New test.
> > > >         * gcc.target/aarch64/ldp_always.c: New test.
> > > >         * gcc.target/aarch64/ldp_never.c: New test.
> > > >         * gcc.target/aarch64/stp_aligned.c: New test.
> > > >         * gcc.target/aarch64/stp_always.c: New test.
> > > >         * gcc.target/aarch64/stp_never.c: New test.
> > > >
> > > > Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> > > > ---
> > > > Changes in v3:
> > > >         - Changed command-line options to target-specific parameters
> > > >           and documented them accordingly in doc/invoke.texi.
> > > >         - Removed ampere1-no_ldp_combine.c test as superseded.
> > > >
> > > >  gcc/config/aarch64/aarch64-protos.h           |  24 ++
> > > >  gcc/config/aarch64/aarch64-tuning-flags.def   |   8 -
> > > >  gcc/config/aarch64/aarch64.cc                 | 215 +++++++++++++-----
> > > >  gcc/config/aarch64/aarch64.opt                |   8 +
> > > >  gcc/doc/invoke.texi                           |  30 +++
> > > >  .../aarch64/ampere1-no_ldp_combine.c          |  11 -
> > > >  .../gcc.target/aarch64/ldp_aligned.c          |  66 ++++++
> > > >  gcc/testsuite/gcc.target/aarch64/ldp_always.c |  66 ++++++
> > > >  gcc/testsuite/gcc.target/aarch64/ldp_never.c  |  66 ++++++
> > > >  .../gcc.target/aarch64/stp_aligned.c          |  60 +++++
> > > >  gcc/testsuite/gcc.target/aarch64/stp_always.c |  60 +++++
> > > >  gcc/testsuite/gcc.target/aarch64/stp_never.c  |  60 +++++
> > > >  12 files changed, 600 insertions(+), 74 deletions(-)
> > > >  delete mode 100644 gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_always.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_never.c
> > > >
> > > > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> > > > index 70303d6fd95..be1d73490ed 100644
> > > > --- a/gcc/config/aarch64/aarch64-protos.h
> > > > +++ b/gcc/config/aarch64/aarch64-protos.h
> > > > @@ -568,6 +568,30 @@ struct tune_params
> > > >    /* Place prefetch struct pointer at the end to enable type checking
> > > >       errors when tune_params misses elements (e.g., from erroneous merges).  */
> > > >    const struct cpu_prefetch_tune *prefetch;
> > > > +/* An enum specifying how to handle load pairs using a fine-grained policy:
> > > > +   - LDP_POLICY_ALIGNED: Emit ldp if the source pointer is aligned
> > > > +   to at least double the alignment of the type.
> > > > +   - LDP_POLICY_ALWAYS: Emit ldp regardless of alignment.
> > > > +   - LDP_POLICY_NEVER: Do not emit ldp.  */
> > > > +
> > > > +  enum aarch64_ldp_policy_model
> > > > +  {
> > > > +    LDP_POLICY_ALIGNED,
> > > > +    LDP_POLICY_ALWAYS,
> > > > +    LDP_POLICY_NEVER
> > > > +  } ldp_policy_model;
> > > > +/* An enum specifying how to handle store pairs using a fine-grained policy:
> > > > +   - STP_POLICY_ALIGNED: Emit stp if the source pointer is aligned
> > > > +   to at least double the alignment of the type.
> > > > +   - STP_POLICY_ALWAYS: Emit stp regardless of alignment.
> > > > +   - STP_POLICY_NEVER: Do not emit stp.  */
> > > > +
> > > > +  enum aarch64_stp_policy_model
> > > > +  {
> > > > +    STP_POLICY_ALIGNED,
> > > > +    STP_POLICY_ALWAYS,
> > > > +    STP_POLICY_NEVER
> > > > +  } stp_policy_model;
> > > >  };
> > > >
> > > >  /* Classifies an address.
> > > > diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
> > > > index 52112ba7c48..774568e9106 100644
> > > > --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> > > > +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> > > > @@ -30,11 +30,6 @@
> > > >
> > > >  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
> > > >
> > > > -/* Don't create non-8 byte aligned load/store pair.  That is if the
> > > > -two load/stores are not at least 8 byte aligned don't create load/store
> > > > -pairs.   */
> > > > -AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)
> > > > -
> > > >  /* Some of the optional shift to some arthematic instructions are
> > > >     considered cheap.  Logical shift left <=4 with or without a
> > > >     zero extend are considered cheap.  Sign extend; non logical shift left
> > > > @@ -44,9 +39,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
> > > >  /* Disallow load/store pair instructions on Q-registers.  */
> > > >  AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)
> > > >
> > > > -/* Disallow load-pair instructions to be formed in combine/peephole.  */
> > > > -AARCH64_EXTRA_TUNING_OPTION ("no_ldp_combine", NO_LDP_COMBINE)
> > > > -
> > > >  AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
> > > >
> > > >  AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
> > > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > > > index 219c4ee6d4c..9eeb5469cf9 100644
> > > > --- a/gcc/config/aarch64/aarch64.cc
> > > > +++ b/gcc/config/aarch64/aarch64.cc
> > > > @@ -1357,7 +1357,9 @@ static const struct tune_params generic_tunings =
> > > >       Neoverse V1.  It does not have a noticeable effect on A64FX and should
> > > >       have at most a very minor effect on SVE2 cores.  */
> > > >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params cortexa35_tunings =
> > > > @@ -1391,7 +1393,9 @@ static const struct tune_params cortexa35_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params cortexa53_tunings =
> > > > @@ -1425,7 +1429,9 @@ static const struct tune_params cortexa53_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params cortexa57_tunings =
> > > > @@ -1459,7 +1465,9 @@ static const struct tune_params cortexa57_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),        /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params cortexa72_tunings =
> > > > @@ -1493,7 +1501,9 @@ static const struct tune_params cortexa72_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params cortexa73_tunings =
> > > > @@ -1527,7 +1537,9 @@ static const struct tune_params cortexa73_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >
> > > > @@ -1562,7 +1574,9 @@ static const struct tune_params exynosm1_tunings =
> > > >    48,  /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> > > > -  &exynosm1_prefetch_tune
> > > > +  &exynosm1_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params thunderxt88_tunings =
> > > > @@ -1594,8 +1608,10 @@ static const struct tune_params thunderxt88_tunings =
> > > >    2,   /* min_div_recip_mul_df.  */
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > > > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),    /* tune_flags.  */
> > > > -  &thunderxt88_prefetch_tune
> > > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > +  &thunderxt88_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params thunderx_tunings =
> > > > @@ -1627,9 +1643,10 @@ static const struct tune_params thunderx_tunings =
> > > >    2,   /* min_div_recip_mul_df.  */
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > > > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW
> > > > -   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> > > > -  &thunderx_prefetch_tune
> > > > +  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /* tune_flags.  */
> > > > +  &thunderx_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params tsv110_tunings =
> > > > @@ -1663,7 +1680,9 @@ static const struct tune_params tsv110_tunings =
> > > >    0,    /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,     /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),     /* tune_flags.  */
> > > > -  &tsv110_prefetch_tune
> > > > +  &tsv110_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params xgene1_tunings =
> > > > @@ -1696,7 +1715,9 @@ static const struct tune_params xgene1_tunings =
> > > >    17,  /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> > > > -  &xgene1_prefetch_tune
> > > > +  &xgene1_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params emag_tunings =
> > > > @@ -1729,7 +1750,9 @@ static const struct tune_params emag_tunings =
> > > >    17,  /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> > > > -  &xgene1_prefetch_tune
> > > > +  &xgene1_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params qdf24xx_tunings =
> > > > @@ -1763,7 +1786,9 @@ static const struct tune_params qdf24xx_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS, /* tune_flags.  */
> > > > -  &qdf24xx_prefetch_tune
> > > > +  &qdf24xx_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  /* Tuning structure for the Qualcomm Saphira core.  Default to falkor values
> > > > @@ -1799,7 +1824,9 @@ static const struct tune_params saphira_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),           /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params thunderx2t99_tunings =
> > > > @@ -1833,7 +1860,9 @@ static const struct tune_params thunderx2t99_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &thunderx2t99_prefetch_tune
> > > > +  &thunderx2t99_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params thunderx3t110_tunings =
> > > > @@ -1867,7 +1896,9 @@ static const struct tune_params thunderx3t110_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &thunderx3t110_prefetch_tune
> > > > +  &thunderx3t110_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params neoversen1_tunings =
> > > > @@ -1900,7 +1931,9 @@ static const struct tune_params neoversen1_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params ampere1_tunings =
> > > > @@ -1936,8 +1969,10 @@ static const struct tune_params ampere1_tunings =
> > > >    2,   /* min_div_recip_mul_df.  */
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> > > > -  &ampere1_prefetch_tune
> > > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > +  &ampere1_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params ampere1a_tunings =
> > > > @@ -1974,8 +2009,10 @@ static const struct tune_params ampere1a_tunings =
> > > >    2,   /* min_div_recip_mul_df.  */
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> > > > -  &ampere1_prefetch_tune
> > > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > +  &ampere1_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const advsimd_vec_cost neoversev1_advsimd_vector_cost =
> > > > @@ -2156,7 +2193,9 @@ static const struct tune_params neoversev1_tunings =
> > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> > > >     | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const sve_vec_cost neoverse512tvb_sve_vector_cost =
> > > > @@ -2293,7 +2332,9 @@ static const struct tune_params neoverse512tvb_tunings =
> > > >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const advsimd_vec_cost neoversen2_advsimd_vector_cost =
> > > > @@ -2483,7 +2524,9 @@ static const struct tune_params neoversen2_tunings =
> > > >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const advsimd_vec_cost neoversev2_advsimd_vector_cost =
> > > > @@ -2673,7 +2716,9 @@ static const struct tune_params neoversev2_tunings =
> > > >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params a64fx_tunings =
> > > > @@ -2706,7 +2751,9 @@ static const struct tune_params a64fx_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &a64fx_prefetch_tune
> > > > +  &a64fx_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  /* Support for fine-grained override of the tuning structures.  */
> > > > @@ -17819,6 +17866,34 @@ aarch64_parse_tune (const char *to_parse, const struct processor **res)
> > > >    return AARCH_PARSE_INVALID_ARG;
> > > >  }
> > > >
> > > > +/* Parse a command-line -param=aarch64-ldp-policy= parameter.  VALUE is
> > > > +   the value of the parameter.  */
> > > > +
> > > > +static void
> > > > +aarch64_parse_ldp_policy (const unsigned int value, struct tune_params* tune)
> > > > +{
> > > > +  if (value == 1)
> > > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALWAYS;
> > > > +  else if (value == 2)
> > > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_NEVER;
> > > > +  else if (value == 3)
> > > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALIGNED;
> > > > +}
> > > > +
> > > > +/* Parse a command-line -param=aarch64-stp-policy= parameter.  VALUE is
> > > > +   the value of the parameter.  */
> > > > +
> > > > +static void
> > > > +aarch64_parse_stp_policy (const unsigned int value, struct tune_params* tune)
> > > > +{
> > > > +  if (value == 1)
> > > > +    tune->stp_policy_model = tune_params::STP_POLICY_ALWAYS;
> > > > +  else if (value == 2)
> > > > +    tune->stp_policy_model = tune_params::STP_POLICY_NEVER;
> > > > +  else if (value == 3)
> > > > +    tune->stp_policy_model = tune_params::STP_POLICY_ALIGNED;
> > > > +}
> > > > +
> > > >  /* Parse TOKEN, which has length LENGTH to see if it is an option
> > > >     described in FLAG.  If it is, return the index bit for that fusion type.
> > > >     If not, error (printing OPTION_NAME) and return zero.  */
> > > > @@ -18167,6 +18242,12 @@ aarch64_override_options_internal (struct gcc_options *opts)
> > > >      aarch64_parse_override_string (opts->x_aarch64_override_tune_string,
> > > >                                    &aarch64_tune_params);
> > > >
> > > > +  aarch64_parse_ldp_policy (aarch64_ldp_policy,
> > > > +                             &aarch64_tune_params);
> > > > +
> > > > +  aarch64_parse_stp_policy (aarch64_stp_policy,
> > > > +                             &aarch64_tune_params);
> > > > +
> > > >    /* This target defaults to strict volatile bitfields.  */
> > > >    if (opts->x_flag_strict_volatile_bitfields < 0 && abi_version_at_least (2))
> > > >      opts->x_flag_strict_volatile_bitfields = 1;
> > > > @@ -26468,18 +26549,14 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
> > > >    enum reg_class rclass_1, rclass_2;
> > > >    rtx mem_1, mem_2, reg_1, reg_2;
> > > >
> > > > -  /* Allow the tuning structure to disable LDP instruction formation
> > > > -     from combining instructions (e.g., in peephole2).
> > > > -     TODO: Implement fine-grained tuning control for LDP and STP:
> > > > -          1. control policies for load and store separately;
> > > > -          2. support the following policies:
> > > > -             - default (use what is in the tuning structure)
> > > > -             - always
> > > > -             - never
> > > > -             - aligned (only if the compiler can prove that the
> > > > -               load will be aligned to 2 * element_size)  */
> > > > -  if (load && (aarch64_tune_params.extra_tuning_flags
> > > > -              & AARCH64_EXTRA_TUNE_NO_LDP_COMBINE))
> > > > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> > > > +  if (load
> > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> > > > +    return false;
> > > > +
> > > > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> > > > +  if (!load
> > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
> > > >      return false;
> > > >
> > > >    if (load)
> > > > @@ -26506,13 +26583,22 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
> > > >    if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
> > > >      return false;
> > > >
> > > > -  /* If we have SImode and slow unaligned ldp,
> > > > -     check the alignment to be at least 8 byte. */
> > > > -  if (mode == SImode
> > > > -      && (aarch64_tune_params.extra_tuning_flags
> > > > -          & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> > > > -      && !optimize_size
> > > > -      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
> > > > +  /* If we have LDP_POLICY_ALIGNED,
> > > > +     do not emit the load pair unless the alignment is checked to be
> > > > +     at least double the alignment of the type.  */
> > > > +  if (load
> > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> > > > +      && !optimize_function_for_size_p (cfun)
> > > > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> > > > +    return false;
> > > > +
> > > > +  /* If we have STP_POLICY_ALIGNED,
> > > > +     do not emit the store pair unless the alignment is checked to be
> > > > +     at least double the alignment of the type.  */
> > > > +  if (!load
> > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> > > > +      && !optimize_function_for_size_p (cfun)
> > > > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> > > >      return false;
> > > >
> > > >    /* Check if the addresses are in the form of [base+offset].  */
> > > > @@ -26640,6 +26726,16 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
> > > >    HOST_WIDE_INT offvals[num_insns], msize;
> > > >    rtx mem[num_insns], reg[num_insns], base[num_insns], offset[num_insns];
> > > >
> > > > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> > > > +  if (load
> > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> > > > +    return false;
> > > > +
> > > > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> > > > +  if (!load
> > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
> > > > +    return false;
> > > > +
> > > >    if (load)
> > > >      {
> > > >        for (int i = 0; i < num_insns; i++)
> > > > @@ -26729,13 +26825,22 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
> > > >    if (offvals[0] % msize != offvals[2] % msize)
> > > >      return false;
> > > >
> > > > -  /* If we have SImode and slow unaligned ldp,
> > > > -     check the alignment to be at least 8 byte. */
> > > > -  if (mode == SImode
> > > > -      && (aarch64_tune_params.extra_tuning_flags
> > > > -         & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> > > > -      && !optimize_size
> > > > -      && MEM_ALIGN (mem[0]) < 8 * BITS_PER_UNIT)
> > > > +  /* If we have LDP_POLICY_ALIGNED,
> > > > +     do not emit the load pair unless the alignment is checked to be
> > > > +     at least double the alignment of the type.  */
> > > > +  if (load
> > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> > > > +      && !optimize_function_for_size_p (cfun)
> > > > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> > > > +    return false;
> > > > +
> > > > +  /* If we have STP_POLICY_ALIGNED,
> > > > +     do not emit the store pair unless the alignment is checked to be
> > > > +     at least double the alignment of the type.  */
> > > > +  if (!load
> > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> > > > +      && !optimize_function_for_size_p (cfun)
> > > > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> > > >      return false;
> > > >
> > > >    return true;
> > > > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> > > > index 4a0580435a8..f61e3f968d4 100644
> > > > --- a/gcc/config/aarch64/aarch64.opt
> > > > +++ b/gcc/config/aarch64/aarch64.opt
> > > > @@ -337,3 +337,11 @@ Constant memset size in bytes from which to start using MOPS sequence.
> > > >  -param=aarch64-vect-unroll-limit=
> > > >  Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param
> > > >  Limit how much the autovectorizer may unroll a loop.
> > > > +
> > > > +-param=aarch64-ldp-policy=
> > > > +Target Joined UInteger Var(aarch64_ldp_policy) Init(0) IntegerRange(0, 3) Param
> > > > +Fine-grained policy for load pairs.
> > > > +
> > > > +-param=aarch64-stp-policy=
> > > > +Target Joined UInteger Var(aarch64_stp_policy) Init(0) IntegerRange(0, 3) Param
> > > > +Fine-grained policy for store pairs.
> > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > > index 146b40414b0..000dd5541f4 100644
> > > > --- a/gcc/doc/invoke.texi
> > > > +++ b/gcc/doc/invoke.texi
> > > > @@ -16508,6 +16508,36 @@ Use both Advanced SIMD and SVE.  Prefer SVE when the costs are deemed equal.
> > > >  @end table
> > > >  The default value is 0.
> > > >
> > > > +@item aarch64-ldp-policy
> > > > +Fine-grained policy for load pairs.  Accepts values from 0 to 3, inclusive.
> > > > +@table @samp
> > > > +@item 0
> > > > +Use the policy of the tuning structure.
> > > > +@item 1
> > > > +Emit ldp regardless of alignment.
> > > > +@item 2
> > > > +Do not emit ldp.
> > > > +@item 3
> > > > +Emit ldp only if the source pointer is aligned to at least double the alignment
> > > > +of the type.
> > > > +@end table
> > > > +The default value is 0.
> > > > +
> > > > +@item aarch64-stp-policy
> > > > +Fine-grained policy for store pairs.  Accepts values from 0 to 3, inclusive.
> > > > +@table @samp
> > > > +@item 0
> > > > +Use the policy of the tuning structure.
> > > > +@item 1
> > > > +Emit stp regardless of alignment.
> > > > +@item 2
> > > > +Do not emit stp.
> > > > +@item 3
> > > > +Emit stp only if the source pointer is aligned to at least double the alignment
> > > > +of the type.
> > > > +@end table
> > > > +The default value is 0.
> > > > +
> > > >  @item aarch64-loop-vect-issue-rate-niters
> > > >  The tuning for some AArch64 CPUs tries to take both latencies and issue
> > > >  rates into account when deciding whether a loop should be vectorized
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c b/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > > > deleted file mode 100644
> > > > index bc871f4481d..00000000000
> > > > --- a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > > > +++ /dev/null
> > > > @@ -1,11 +0,0 @@
> > > > -/* { dg-options "-O3 -mtune=ampere1" } */
> > > > -
> > > > -long
> > > > -foo (long a[])
> > > > -{
> > > > -  return a[0] + a[1];
> > > > -}
> > > > -
> > > > -/* We should see two ldrs instead of one ldp. */
> > > > -/* { dg-final { scan-assembler {\tldr\t} } } */
> > > > -/* { dg-final { scan-assembler-not {\tldp\t} } } */
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > > > new file mode 100644
> > > > index 00000000000..8e43faab70d
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > > > @@ -0,0 +1,66 @@
> > > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=3 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[0]; \
> > > > +    a_1 = arr[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[0]; \
> > > > +    a_1 = a[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[100]; \
> > > > +    a_1 = arr[101]; \
> > > > +    a_2 = arr[102]; \
> > > > +    a_3 = arr[103]; \
> > > > +    a_4 = arr[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[100]; \
> > > > +    a_1 = a[101]; \
> > > > +    a_2 = a[102]; \
> > > > +    a_3 = a[103]; \
> > > > +    a_4 = a[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +LDP_TEST_ALIGNED(int32_t);
> > > > +LDP_TEST_ALIGNED(int64_t);
> > > > +LDP_TEST_ALIGNED(v4si);
> > > > +LDP_TEST_UNALIGNED(int32_t);
> > > > +LDP_TEST_UNALIGNED(int64_t);
> > > > +LDP_TEST_UNALIGNED(v4si);
> > > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > > > +
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_always.c b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > > > new file mode 100644
> > > > index 00000000000..532ca607565
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > > > @@ -0,0 +1,66 @@
> > > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=1 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[0]; \
> > > > +    a_1 = arr[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[0]; \
> > > > +    a_1 = a[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[100]; \
> > > > +    a_1 = arr[101]; \
> > > > +    a_2 = arr[102]; \
> > > > +    a_3 = arr[103]; \
> > > > +    a_4 = arr[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[100]; \
> > > > +    a_1 = a[101]; \
> > > > +    a_2 = a[102]; \
> > > > +    a_3 = a[103]; \
> > > > +    a_4 = a[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +LDP_TEST_ALIGNED(int32_t);
> > > > +LDP_TEST_ALIGNED(int64_t);
> > > > +LDP_TEST_ALIGNED(v4si);
> > > > +LDP_TEST_UNALIGNED(int32_t);
> > > > +LDP_TEST_UNALIGNED(int64_t);
> > > > +LDP_TEST_UNALIGNED(v4si);
> > > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> > > > +
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_never.c b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > > > new file mode 100644
> > > > index 00000000000..b39941c18d7
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > > > @@ -0,0 +1,66 @@
> > > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=2 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[0]; \
> > > > +    a_1 = arr[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[0]; \
> > > > +    a_1 = a[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[100]; \
> > > > +    a_1 = arr[101]; \
> > > > +    a_2 = arr[102]; \
> > > > +    a_3 = arr[103]; \
> > > > +    a_4 = arr[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[100]; \
> > > > +    a_1 = a[101]; \
> > > > +    a_2 = a[102]; \
> > > > +    a_3 = a[103]; \
> > > > +    a_4 = a[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +LDP_TEST_ALIGNED(int32_t);
> > > > +LDP_TEST_ALIGNED(int64_t);
> > > > +LDP_TEST_ALIGNED(v4si);
> > > > +LDP_TEST_UNALIGNED(int32_t);
> > > > +LDP_TEST_UNALIGNED(int64_t);
> > > > +LDP_TEST_UNALIGNED(v4si);
> > > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > > > +
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_aligned.c b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > > > new file mode 100644
> > > > index 00000000000..01f294bb090
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > > > @@ -0,0 +1,60 @@
> > > > +/* { dg-options "-O2 --param=aarch64-stp-policy=3 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define STP_TEST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[0] = x; \
> > > > +    arr[1] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[0] = x; \
> > > > +    a[1] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[100] = x; \
> > > > +    arr[101] = x; \
> > > > +    arr[102] = x; \
> > > > +    arr[103] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[100] = x; \
> > > > +    a[101] = x; \
> > > > +    a[102] = x; \
> > > > +    a[103] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +STP_TEST_ALIGNED(int32_t);
> > > > +STP_TEST_ALIGNED(int64_t);
> > > > +STP_TEST_ALIGNED(v4si);
> > > > +STP_TEST_UNALIGNED(int32_t);
> > > > +STP_TEST_UNALIGNED(int64_t);
> > > > +STP_TEST_UNALIGNED(v4si);
> > > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > > > +
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_always.c b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> > > > new file mode 100644
> > > > index 00000000000..cedb461b5b2
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> > > > @@ -0,0 +1,60 @@
> > > > +/* { dg-options "-O2 --param=aarch64-stp-policy=1 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define STP_TEST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[0] = x; \
> > > > +    arr[1] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[0] = x; \
> > > > +    a[1] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[100] = x; \
> > > > +    arr[101] = x; \
> > > > +    arr[102] = x; \
> > > > +    arr[103] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[100] = x; \
> > > > +    a[101] = x; \
> > > > +    a[102] = x; \
> > > > +    a[103] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +STP_TEST_ALIGNED(int32_t);
> > > > +STP_TEST_ALIGNED(int64_t);
> > > > +STP_TEST_ALIGNED(v4si);
> > > > +STP_TEST_UNALIGNED(int32_t);
> > > > +STP_TEST_UNALIGNED(int64_t);
> > > > +STP_TEST_UNALIGNED(v4si);
> > > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> > > > +
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_never.c b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> > > > new file mode 100644
> > > > index 00000000000..ddde658f807
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> > > > @@ -0,0 +1,60 @@
> > > > +/* { dg-options "-O2 --param=aarch64-stp-policy=2 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define STP_TEST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[0] = x; \
> > > > +    arr[1] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[0] = x; \
> > > > +    a[1] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[100] = x; \
> > > > +    arr[101] = x; \
> > > > +    arr[102] = x; \
> > > > +    arr[103] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[100] = x; \
> > > > +    a[101] = x; \
> > > > +    a[102] = x; \
> > > > +    a[103] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +STP_TEST_ALIGNED(int32_t);
> > > > +STP_TEST_ALIGNED(int64_t);
> > > > +STP_TEST_ALIGNED(v4si);
> > > > +STP_TEST_UNALIGNED(int32_t);
> > > > +STP_TEST_UNALIGNED(int64_t);
> > > > +STP_TEST_UNALIGNED(v4si);
> > > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > > > +
> > > > --
> > > > 2.40.1
> > > >


--
Manos Anagnostakis | Compiler Engineer |
E: manos.anagnostakis@vrull.eu

VRULL GmbH | Beatrixgasse 32 1030 Vienna |
 W: www.vrull.eu | LinkedIn: https://www.linkedin.com/company/vrull/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp formation.
  2023-09-25 20:04     ` Andrew Pinski
  2023-09-25 20:57       ` Andrew Pinski
@ 2023-09-26  8:34       ` Kyrylo Tkachov
  1 sibling, 0 replies; 12+ messages in thread
From: Kyrylo Tkachov @ 2023-09-26  8:34 UTC (permalink / raw)
  To: Andrew Pinski, Philipp Tomsich; +Cc: Manos Anagnostakis, gcc-patches



> -----Original Message-----
> From: Andrew Pinski <pinskia@gmail.com>
> Sent: Monday, September 25, 2023 9:05 PM
> To: Philipp Tomsich <philipp.tomsich@vrull.eu>
> Cc: Manos Anagnostakis <manos.anagnostakis@vrull.eu>;
> gcc-patches@gcc.gnu.org; Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
> Subject: Re: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp
> formation.
> 
> On Mon, Sep 25, 2023 at 12:59 PM Philipp Tomsich
> <philipp.tomsich@vrull.eu> wrote:
> >
> > On Mon, 25 Sept 2023 at 21:54, Andrew Pinski <pinskia@gmail.com> wrote:
> > >
> > > On Mon, Sep 25, 2023 at 12:50 PM Manos Anagnostakis
> > > <manos.anagnostakis@vrull.eu> wrote:
> > > >
> > > > > This patch implements the following TODO in gcc/config/aarch64/aarch64.cc
> > > > to provide the requested behaviour for handling ldp and stp:
> > > >
> > > >   /* Allow the tuning structure to disable LDP instruction formation
> > > >      from combining instructions (e.g., in peephole2).
> > > >      TODO: Implement fine-grained tuning control for LDP and STP:
> > > >            1. control policies for load and store separately;
> > > >            2. support the following policies:
> > > >               - default (use what is in the tuning structure)
> > > >               - always
> > > >               - never
> > > >               - aligned (only if the compiler can prove that the
> > > >                 load will be aligned to 2 * element_size)  */
> > > >
> > > > > It provides two new and concrete target-specific command-line parameters
> > > > > -param=aarch64-ldp-policy= and -param=aarch64-stp-policy=
> > > > > to give the ability to control load and store policies separately as
> > > > > stated in part 1 of the TODO.
> > > >
> > > > The accepted values for both parameters are:
> > > > - 0: Use the policy of the tuning structure (default).
> > > > - 1: Emit ldp/stp regardless of alignment.
> > > > - 2: Do not emit ldp/stp.
> > > > - 3: In order to emit ldp/stp, first check if the load/store will
> > > >   be aligned to 2 * element_size.
> > >
> > > > Instead of a number, would it make sense to use a string
> > > > (enum) for this param?
> > > > Also, I think using --param is a bad idea if it is going to be
> > > > documented in the user manual.
> > > > Maybe a -m option should be used instead.
> >
> > > See https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631283.html
> > for the discussion triggering the change from -m... to --param and the
> > change to using a number instead of a string.
> 
> > That is the opposite of the current GCC practice across all targets.
> > Things like this should be consistent, and if one target decides to do
> > it differently, then maybe it should NOT.
> > Anyway, we should document the correct coding style for options so we
> > don't have these back-and-forths again.

My rationale for having this as a param rather than an -m* option is that
this is just an override for a codegen heuristic that the compiler should be
getting correct on its own when used by a normal user.
Having a way to force an explicit LDP/STP policy can be useful for testing
the compiler and for some power user experimentation, but I wouldn't
want to see it make its way into any user makefiles.

Good point on having it accept an enum; a string argument is definitely
more readable.
Thanks,
Kyrill
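
A string-valued parameter along those lines would need a name-to-policy
mapping.  A minimal sketch, assuming the four policy names from the TODO
(default, always, never, aligned); the enum, the function name and the
return convention are hypothetical and not taken from any posted version
of the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical policy kinds, named after the TODO's list.  */
enum policy_kind
{
  POLICY_KIND_DEFAULT,
  POLICY_KIND_ALWAYS,
  POLICY_KIND_NEVER,
  POLICY_KIND_ALIGNED
};

/* Map NAME to a policy kind.  On success store the kind in *OUT and
   return true; on an unknown or null NAME leave *OUT untouched and
   return false.  */
static bool
parse_policy_name (const char *name, enum policy_kind *out)
{
  static const struct
  {
    const char *name;
    enum policy_kind kind;
  } table[] = {
    { "default", POLICY_KIND_DEFAULT },
    { "always", POLICY_KIND_ALWAYS },
    { "never", POLICY_KIND_NEVER },
    { "aligned", POLICY_KIND_ALIGNED },
  };

  assert (out != NULL);

  if (name == NULL)
    return false;

  for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
    if (strcmp (name, table[i].name) == 0)
      {
        *out = table[i].kind;
        return true;
      }

  return false;
}
```

Keeping the names in one table means the accepted spellings and the enum
cannot drift apart, which is the readability gain over the 0-3 encoding.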

> 
> 
> Thanks,
> Andrew
> 
> >
> > Thanks,
> > Philipp.
> >
> > >
> > > Thanks,
> > > Andrew
> > >
> > > >
> > > > gcc/ChangeLog:
> > > >         * config/aarch64/aarch64-protos.h (struct tune_params): Add
> > > >         appropriate enums for the policies.
> > > >         * config/aarch64/aarch64-tuning-flags.def
> > > >         (AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
> > > >         options.
> > > >         * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
> > > >         function to parse ldp-policy parameter.
> > > > >         (aarch64_parse_stp_policy): New function to parse stp-policy parameter.
> > > > >         (aarch64_override_options_internal): Call parsing functions.
> > > > >         (aarch64_operands_ok_for_ldpstp): Add parameter-value check and
> > > > >         alignment check and remove superseded ones.
> > > > >         (aarch64_operands_adjust_ok_for_ldpstp): Add parameter-value check and
> > > > >         alignment check and remove superseded ones.
> > > >         * config/aarch64/aarch64.opt: Add options.
> > > >         * doc/invoke.texi: Document the parameters accordingly.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >         * gcc.target/aarch64/ampere1-no_ldp_combine.c: Removed.
> > > >         * gcc.target/aarch64/ldp_aligned.c: New test.
> > > >         * gcc.target/aarch64/ldp_always.c: New test.
> > > >         * gcc.target/aarch64/ldp_never.c: New test.
> > > >         * gcc.target/aarch64/stp_aligned.c: New test.
> > > >         * gcc.target/aarch64/stp_always.c: New test.
> > > >         * gcc.target/aarch64/stp_never.c: New test.
> > > >
> > > > Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> > > > ---
> > > > Changes in v3:
> > > >         - Changed command-line options to target-specific parameters
> > > >           and documented them accordingly in doc/invoke.texi.
> > > >         - Removed ampere1-no_ldp_combine.c test as superseded.
> > > >
> > > >  gcc/config/aarch64/aarch64-protos.h           |  24 ++
> > > >  gcc/config/aarch64/aarch64-tuning-flags.def   |   8 -
> > > >  gcc/config/aarch64/aarch64.cc                 | 215 +++++++++++++-----
> > > >  gcc/config/aarch64/aarch64.opt                |   8 +
> > > >  gcc/doc/invoke.texi                           |  30 +++
> > > >  .../aarch64/ampere1-no_ldp_combine.c          |  11 -
> > > >  .../gcc.target/aarch64/ldp_aligned.c          |  66 ++++++
> > > >  gcc/testsuite/gcc.target/aarch64/ldp_always.c |  66 ++++++
> > > >  gcc/testsuite/gcc.target/aarch64/ldp_never.c  |  66 ++++++
> > > >  .../gcc.target/aarch64/stp_aligned.c          |  60 +++++
> > > >  gcc/testsuite/gcc.target/aarch64/stp_always.c |  60 +++++
> > > >  gcc/testsuite/gcc.target/aarch64/stp_never.c  |  60 +++++
> > > >  12 files changed, 600 insertions(+), 74 deletions(-)
> > > >  delete mode 100644 gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_always.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_never.c
> > > >
> > > > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> > > > index 70303d6fd95..be1d73490ed 100644
> > > > --- a/gcc/config/aarch64/aarch64-protos.h
> > > > +++ b/gcc/config/aarch64/aarch64-protos.h
> > > > @@ -568,6 +568,30 @@ struct tune_params
> > > >    /* Place prefetch struct pointer at the end to enable type checking
> > > >       errors when tune_params misses elements (e.g., from erroneous merges).  */
> > > >    const struct cpu_prefetch_tune *prefetch;
> > > > +/* An enum specifying how to handle load pairs using a fine-grained policy:
> > > > +   - LDP_POLICY_ALIGNED: Emit ldp if the source pointer is aligned
> > > > +   to at least double the alignment of the type.
> > > > +   - LDP_POLICY_ALWAYS: Emit ldp regardless of alignment.
> > > > +   - LDP_POLICY_NEVER: Do not emit ldp.  */
> > > > +
> > > > +  enum aarch64_ldp_policy_model
> > > > +  {
> > > > +    LDP_POLICY_ALIGNED,
> > > > +    LDP_POLICY_ALWAYS,
> > > > +    LDP_POLICY_NEVER
> > > > +  } ldp_policy_model;
> > > > +/* An enum specifying how to handle store pairs using a fine-grained policy:
> > > > +   - STP_POLICY_ALIGNED: Emit stp if the source pointer is aligned
> > > > +   to at least double the alignment of the type.
> > > > +   - STP_POLICY_ALWAYS: Emit stp regardless of alignment.
> > > > +   - STP_POLICY_NEVER: Do not emit stp.  */
> > > > +
> > > > +  enum aarch64_stp_policy_model
> > > > +  {
> > > > +    STP_POLICY_ALIGNED,
> > > > +    STP_POLICY_ALWAYS,
> > > > +    STP_POLICY_NEVER
> > > > +  } stp_policy_model;
> > > >  };
> > > >
> > > >  /* Classifies an address.
> > > > diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
> > > > index 52112ba7c48..774568e9106 100644
> > > > --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> > > > +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> > > > @@ -30,11 +30,6 @@
> > > >
> > > >  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
> > > >
> > > > -/* Don't create non-8 byte aligned load/store pair.  That is if the
> > > > -two load/stores are not at least 8 byte aligned don't create load/store
> > > > -pairs.   */
> > > > -AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)
> > > > -
> > > >  /* Some of the optional shift to some arthematic instructions are
> > > >     considered cheap.  Logical shift left <=4 with or without a
> > > >     zero extend are considered cheap.  Sign extend; non logical shift left
> > > > @@ -44,9 +39,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
> > > >  /* Disallow load/store pair instructions on Q-registers.  */
> > > >  AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)
> > > >
> > > > -/* Disallow load-pair instructions to be formed in combine/peephole.  */
> > > > -AARCH64_EXTRA_TUNING_OPTION ("no_ldp_combine", NO_LDP_COMBINE)
> > > > -
> > > >  AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
> > > >
> > > >  AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
> > > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > > > index 219c4ee6d4c..9eeb5469cf9 100644
> > > > --- a/gcc/config/aarch64/aarch64.cc
> > > > +++ b/gcc/config/aarch64/aarch64.cc
> > > > @@ -1357,7 +1357,9 @@ static const struct tune_params generic_tunings =
> > > >       Neoverse V1.  It does not have a noticeable effect on A64FX and should
> > > >       have at most a very minor effect on SVE2 cores.  */
> > > >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params cortexa35_tunings =
> > > > @@ -1391,7 +1393,9 @@ static const struct tune_params cortexa35_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params cortexa53_tunings =
> > > > @@ -1425,7 +1429,9 @@ static const struct tune_params cortexa53_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params cortexa57_tunings =
> > > > @@ -1459,7 +1465,9 @@ static const struct tune_params cortexa57_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),        /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params cortexa72_tunings =
> > > > @@ -1493,7 +1501,9 @@ static const struct tune_params cortexa72_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params cortexa73_tunings =
> > > > @@ -1527,7 +1537,9 @@ static const struct tune_params cortexa73_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >
> > > > @@ -1562,7 +1574,9 @@ static const struct tune_params exynosm1_tunings =
> > > >    48,  /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> > > > -  &exynosm1_prefetch_tune
> > > > +  &exynosm1_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params thunderxt88_tunings =
> > > > @@ -1594,8 +1608,10 @@ static const struct tune_params thunderxt88_tunings =
> > > >    2,   /* min_div_recip_mul_df.  */
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > > > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),    /* tune_flags.  */
> > > > -  &thunderxt88_prefetch_tune
> > > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > +  &thunderxt88_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params thunderx_tunings =
> > > > @@ -1627,9 +1643,10 @@ static const struct tune_params thunderx_tunings =
> > > >    2,   /* min_div_recip_mul_df.  */
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > > > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW
> > > > -   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> > > > -  &thunderx_prefetch_tune
> > > > +  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /* tune_flags.  */
> > > > +  &thunderx_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params tsv110_tunings =
> > > > @@ -1663,7 +1680,9 @@ static const struct tune_params tsv110_tunings =
> > > >    0,    /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,     /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),     /* tune_flags.  */
> > > > -  &tsv110_prefetch_tune
> > > > +  &tsv110_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params xgene1_tunings =
> > > > @@ -1696,7 +1715,9 @@ static const struct tune_params xgene1_tunings =
> > > >    17,  /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> > > > -  &xgene1_prefetch_tune
> > > > +  &xgene1_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params emag_tunings =
> > > > @@ -1729,7 +1750,9 @@ static const struct tune_params emag_tunings =
> > > >    17,  /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> > > > -  &xgene1_prefetch_tune
> > > > +  &xgene1_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params qdf24xx_tunings =
> > > > @@ -1763,7 +1786,9 @@ static const struct tune_params qdf24xx_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS, /* tune_flags.  */
> > > > -  &qdf24xx_prefetch_tune
> > > > +  &qdf24xx_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  /* Tuning structure for the Qualcomm Saphira core.  Default to falkor values
> > > > @@ -1799,7 +1824,9 @@ static const struct tune_params saphira_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),           /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params thunderx2t99_tunings =
> > > > @@ -1833,7 +1860,9 @@ static const struct tune_params thunderx2t99_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &thunderx2t99_prefetch_tune
> > > > +  &thunderx2t99_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params thunderx3t110_tunings =
> > > > @@ -1867,7 +1896,9 @@ static const struct tune_params thunderx3t110_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &thunderx3t110_prefetch_tune
> > > > +  &thunderx3t110_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params neoversen1_tunings =
> > > > @@ -1900,7 +1931,9 @@ static const struct tune_params neoversen1_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params ampere1_tunings =
> > > > @@ -1936,8 +1969,10 @@ static const struct tune_params ampere1_tunings =
> > > >    2,   /* min_div_recip_mul_df.  */
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> > > > -  &ampere1_prefetch_tune
> > > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > +  &ampere1_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params ampere1a_tunings =
> > > > @@ -1974,8 +2009,10 @@ static const struct tune_params ampere1a_tunings =
> > > >    2,   /* min_div_recip_mul_df.  */
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> > > > -  &ampere1_prefetch_tune
> > > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > +  &ampere1_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const advsimd_vec_cost neoversev1_advsimd_vector_cost =
> > > > @@ -2156,7 +2193,9 @@ static const struct tune_params neoversev1_tunings =
> > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> > > >     | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const sve_vec_cost neoverse512tvb_sve_vector_cost =
> > > > @@ -2293,7 +2332,9 @@ static const struct tune_params neoverse512tvb_tunings =
> > > >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const advsimd_vec_cost neoversen2_advsimd_vector_cost =
> > > > @@ -2483,7 +2524,9 @@ static const struct tune_params neoversen2_tunings =
> > > >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const advsimd_vec_cost neoversev2_advsimd_vector_cost =
> > > > @@ -2673,7 +2716,9 @@ static const struct tune_params neoversev2_tunings =
> > > >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params a64fx_tunings =
> > > > @@ -2706,7 +2751,9 @@ static const struct tune_params a64fx_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &a64fx_prefetch_tune
> > > > +  &a64fx_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  /* Support for fine-grained override of the tuning structures.  */
> > > > @@ -17819,6 +17866,34 @@ aarch64_parse_tune (const char *to_parse, const struct processor **res)
> > > >    return AARCH_PARSE_INVALID_ARG;
> > > >  }
> > > >
> > > > +/* Parse a command-line -param=aarch64-ldp-policy= parameter.  VALUE is
> > > > +   the value of the parameter.  */
> > > > +
> > > > +static void
> > > > +aarch64_parse_ldp_policy (const unsigned int value, struct tune_params* tune)
> > > > +{
> > > > +  if (value == 1)
> > > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALWAYS;
> > > > +  else if (value == 2)
> > > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_NEVER;
> > > > +  else if (value == 3)
> > > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALIGNED;
> > > > +}
> > > > +
> > > > +/* Parse a command-line -param=aarch64-stp-policy= parameter.  VALUE is
> > > > +   the value of the parameter.  */
> > > > +
> > > > +static void
> > > > +aarch64_parse_stp_policy (const unsigned int value, struct tune_params* tune)
> > > > +{
> > > > +  if (value == 1)
> > > > +    tune->stp_policy_model = tune_params::STP_POLICY_ALWAYS;
> > > > +  else if (value == 2)
> > > > +    tune->stp_policy_model = tune_params::STP_POLICY_NEVER;
> > > > +  else if (value == 3)
> > > > +    tune->stp_policy_model = tune_params::STP_POLICY_ALIGNED;
> > > > +}
> > > > +
> > > >  /* Parse TOKEN, which has length LENGTH to see if it is an option
> > > >     described in FLAG.  If it is, return the index bit for that fusion type.
> > > >     If not, error (printing OPTION_NAME) and return zero.  */
> > > > @@ -18167,6 +18242,12 @@ aarch64_override_options_internal (struct gcc_options *opts)
> > > >      aarch64_parse_override_string (opts->x_aarch64_override_tune_string,
> > > >                                    &aarch64_tune_params);
> > > >
> > > > +  aarch64_parse_ldp_policy (aarch64_ldp_policy,
> > > > +                             &aarch64_tune_params);
> > > > +
> > > > +  aarch64_parse_stp_policy (aarch64_stp_policy,
> > > > +                             &aarch64_tune_params);
> > > > +
> > > >    /* This target defaults to strict volatile bitfields.  */
> > > >    if (opts->x_flag_strict_volatile_bitfields < 0 && abi_version_at_least (2))
> > > >      opts->x_flag_strict_volatile_bitfields = 1;
> > > > @@ -26468,18 +26549,14 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
> > > >    enum reg_class rclass_1, rclass_2;
> > > >    rtx mem_1, mem_2, reg_1, reg_2;
> > > >
> > > > -  /* Allow the tuning structure to disable LDP instruction formation
> > > > -     from combining instructions (e.g., in peephole2).
> > > > -     TODO: Implement fine-grained tuning control for LDP and STP:
> > > > -          1. control policies for load and store separately;
> > > > -          2. support the following policies:
> > > > -             - default (use what is in the tuning structure)
> > > > -             - always
> > > > -             - never
> > > > -             - aligned (only if the compiler can prove that the
> > > > -               load will be aligned to 2 * element_size)  */
> > > > -  if (load && (aarch64_tune_params.extra_tuning_flags
> > > > -              & AARCH64_EXTRA_TUNE_NO_LDP_COMBINE))
> > > > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> > > > +  if (load
> > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> > > > +    return false;
> > > > +
> > > > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> > > > +  if (!load
> > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
> > > >      return false;
> > > >
> > > >    if (load)
> > > > @@ -26506,13 +26583,22 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
> > > >    if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
> > > >      return false;
> > > >
> > > > -  /* If we have SImode and slow unaligned ldp,
> > > > -     check the alignment to be at least 8 byte. */
> > > > -  if (mode == SImode
> > > > -      && (aarch64_tune_params.extra_tuning_flags
> > > > -          & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> > > > -      && !optimize_size
> > > > -      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
> > > > +  /* If we have LDP_POLICY_ALIGNED,
> > > > +     do not emit the load pair unless the alignment is checked to be
> > > > +     at least double the alignment of the type.  */
> > > > +  if (load
> > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> > > > +      && !optimize_function_for_size_p (cfun)
> > > > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> > > > +    return false;
> > > > +
> > > > +  /* If we have STP_POLICY_ALIGNED,
> > > > +     do not emit the store pair unless the alignment is checked to be
> > > > +     at least double the alignment of the type.  */
> > > > +  if (!load
> > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> > > > +      && !optimize_function_for_size_p (cfun)
> > > > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> > > >      return false;
> > > >
> > > >    /* Check if the addresses are in the form of [base+offset].  */
> > > > @@ -26640,6 +26726,16 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
> > > >    HOST_WIDE_INT offvals[num_insns], msize;
> > > >    rtx mem[num_insns], reg[num_insns], base[num_insns], offset[num_insns];
> > > >
> > > > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> > > > +  if (load
> > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> > > > +    return false;
> > > > +
> > > > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> > > > +  if (!load
> > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
> > > > +    return false;
> > > > +
> > > >    if (load)
> > > >      {
> > > >        for (int i = 0; i < num_insns; i++)
> > > > @@ -26729,13 +26825,22 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
> > > >    if (offvals[0] % msize != offvals[2] % msize)
> > > >      return false;
> > > >
> > > > -  /* If we have SImode and slow unaligned ldp,
> > > > -     check the alignment to be at least 8 byte. */
> > > > -  if (mode == SImode
> > > > -      && (aarch64_tune_params.extra_tuning_flags
> > > > -         & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> > > > -      && !optimize_size
> > > > -      && MEM_ALIGN (mem[0]) < 8 * BITS_PER_UNIT)
> > > > +  /* If we have LDP_POLICY_ALIGNED,
> > > > +     do not emit the load pair unless the alignment is checked to be
> > > > +     at least double the alignment of the type.  */
> > > > +  if (load
> > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> > > > +      && !optimize_function_for_size_p (cfun)
> > > > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> > > > +    return false;
> > > > +
> > > > +  /* If we have STP_POLICY_ALIGNED,
> > > > +     do not emit the store pair unless the alignment is checked to be
> > > > +     at least double the alignment of the type.  */
> > > > +  if (!load
> > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> > > > +      && !optimize_function_for_size_p (cfun)
> > > > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> > > >      return false;
> > > >
> > > >    return true;
> > > > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> > > > index 4a0580435a8..f61e3f968d4 100644
> > > > --- a/gcc/config/aarch64/aarch64.opt
> > > > +++ b/gcc/config/aarch64/aarch64.opt
> > > > @@ -337,3 +337,11 @@ Constant memset size in bytes from which to start using MOPS sequence.
> > > >  -param=aarch64-vect-unroll-limit=
> > > >  Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param
> > > >  Limit how much the autovectorizer may unroll a loop.
> > > > +
> > > > +-param=aarch64-ldp-policy=
> > > > +Target Joined UInteger Var(aarch64_ldp_policy) Init(0) IntegerRange(0, 3) Param
> > > > +Fine-grained policy for load pairs.
> > > > +
> > > > +-param=aarch64-stp-policy=
> > > > +Target Joined UInteger Var(aarch64_stp_policy) Init(0) IntegerRange(0, 3) Param
> > > > +Fine-grained policy for store pairs.
> > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > > index 146b40414b0..000dd5541f4 100644
> > > > --- a/gcc/doc/invoke.texi
> > > > +++ b/gcc/doc/invoke.texi
> > > > @@ -16508,6 +16508,36 @@ Use both Advanced SIMD and SVE.  Prefer SVE when the costs are deemed equal.
> > > >  @end table
> > > >  The default value is 0.
> > > >
> > > > +@item aarch64-ldp-policy
> > > > +Fine-grained policy for load pairs.  Accepts values from 0 to 3, inclusive.
> > > > +@table @samp
> > > > +@item 0
> > > > +Use the policy of the tuning structure.
> > > > +@item 1
> > > > +Emit ldp regardless of alignment.
> > > > +@item 2
> > > > +Do not emit ldp.
> > > > +@item 3
> > > > +Emit ldp only if the source pointer is aligned to at least double the alignment
> > > > +of the type.
> > > > +@end table
> > > > +The default value is 0.
> > > > +
> > > > +@item aarch64-stp-policy
> > > > +Fine-grained policy for store pairs.  Accepts values from 0 to 3, inclusive.
> > > > +@table @samp
> > > > +@item 0
> > > > +Use the policy of the tuning structure.
> > > > +@item 1
> > > > +Emit stp regardless of alignment.
> > > > +@item 2
> > > > +Do not emit stp.
> > > > +@item 3
> > > > +Emit stp only if the source pointer is aligned to at least double the alignment
> > > > +of the type.
> > > > +@end table
> > > > +The default value is 0.
> > > > +
> > > >  @item aarch64-loop-vect-issue-rate-niters
> > > >  The tuning for some AArch64 CPUs tries to take both latencies and issue
> > > >  rates into account when deciding whether a loop should be vectorized
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c b/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > > > deleted file mode 100644
> > > > index bc871f4481d..00000000000
> > > > --- a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > > > +++ /dev/null
> > > > @@ -1,11 +0,0 @@
> > > > -/* { dg-options "-O3 -mtune=ampere1" } */
> > > > -
> > > > -long
> > > > -foo (long a[])
> > > > -{
> > > > -  return a[0] + a[1];
> > > > -}
> > > > -
> > > > -/* We should see two ldrs instead of one ldp. */
> > > > -/* { dg-final { scan-assembler {\tldr\t} } } */
> > > > -/* { dg-final { scan-assembler-not {\tldp\t} } } */
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > > > new file mode 100644
> > > > index 00000000000..8e43faab70d
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > > > @@ -0,0 +1,66 @@
> > > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=3 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[0]; \
> > > > +    a_1 = arr[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[0]; \
> > > > +    a_1 = a[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[100]; \
> > > > +    a_1 = arr[101]; \
> > > > +    a_2 = arr[102]; \
> > > > +    a_3 = arr[103]; \
> > > > +    a_4 = arr[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[100]; \
> > > > +    a_1 = a[101]; \
> > > > +    a_2 = a[102]; \
> > > > +    a_3 = a[103]; \
> > > > +    a_4 = a[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +LDP_TEST_ALIGNED(int32_t);
> > > > +LDP_TEST_ALIGNED(int64_t);
> > > > +LDP_TEST_ALIGNED(v4si);
> > > > +LDP_TEST_UNALIGNED(int32_t);
> > > > +LDP_TEST_UNALIGNED(int64_t);
> > > > +LDP_TEST_UNALIGNED(v4si);
> > > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > > > +
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_always.c b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > > > new file mode 100644
> > > > index 00000000000..532ca607565
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > > > @@ -0,0 +1,66 @@
> > > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=1 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[0]; \
> > > > +    a_1 = arr[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[0]; \
> > > > +    a_1 = a[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[100]; \
> > > > +    a_1 = arr[101]; \
> > > > +    a_2 = arr[102]; \
> > > > +    a_3 = arr[103]; \
> > > > +    a_4 = arr[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[100]; \
> > > > +    a_1 = a[101]; \
> > > > +    a_2 = a[102]; \
> > > > +    a_3 = a[103]; \
> > > > +    a_4 = a[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +LDP_TEST_ALIGNED(int32_t);
> > > > +LDP_TEST_ALIGNED(int64_t);
> > > > +LDP_TEST_ALIGNED(v4si);
> > > > +LDP_TEST_UNALIGNED(int32_t);
> > > > +LDP_TEST_UNALIGNED(int64_t);
> > > > +LDP_TEST_UNALIGNED(v4si);
> > > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> > > > +
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_never.c b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > > > new file mode 100644
> > > > index 00000000000..b39941c18d7
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > > > @@ -0,0 +1,66 @@
> > > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=2 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[0]; \
> > > > +    a_1 = arr[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[0]; \
> > > > +    a_1 = a[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[100]; \
> > > > +    a_1 = arr[101]; \
> > > > +    a_2 = arr[102]; \
> > > > +    a_3 = arr[103]; \
> > > > +    a_4 = arr[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[100]; \
> > > > +    a_1 = a[101]; \
> > > > +    a_2 = a[102]; \
> > > > +    a_3 = a[103]; \
> > > > +    a_4 = a[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +LDP_TEST_ALIGNED(int32_t);
> > > > +LDP_TEST_ALIGNED(int64_t);
> > > > +LDP_TEST_ALIGNED(v4si);
> > > > +LDP_TEST_UNALIGNED(int32_t);
> > > > +LDP_TEST_UNALIGNED(int64_t);
> > > > +LDP_TEST_UNALIGNED(v4si);
> > > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > > > +
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_aligned.c b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > > > new file mode 100644
> > > > index 00000000000..01f294bb090
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > > > @@ -0,0 +1,60 @@
> > > > +/* { dg-options "-O2 --param=aarch64-stp-policy=3 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define STP_TEST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[0] = x; \
> > > > +    arr[1] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[0] = x; \
> > > > +    a[1] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[100] = x; \
> > > > +    arr[101] = x; \
> > > > +    arr[102] = x; \
> > > > +    arr[103] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[100] = x; \
> > > > +    a[101] = x; \
> > > > +    a[102] = x; \
> > > > +    a[103] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +STP_TEST_ALIGNED(int32_t);
> > > > +STP_TEST_ALIGNED(int64_t);
> > > > +STP_TEST_ALIGNED(v4si);
> > > > +STP_TEST_UNALIGNED(int32_t);
> > > > +STP_TEST_UNALIGNED(int64_t);
> > > > +STP_TEST_UNALIGNED(v4si);
> > > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > > > +
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_always.c b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> > > > new file mode 100644
> > > > index 00000000000..cedb461b5b2
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> > > > @@ -0,0 +1,60 @@
> > > > +/* { dg-options "-O2 --param=aarch64-stp-policy=1 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define STP_TEST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[0] = x; \
> > > > +    arr[1] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[0] = x; \
> > > > +    a[1] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[100] = x; \
> > > > +    arr[101] = x; \
> > > > +    arr[102] = x; \
> > > > +    arr[103] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[100] = x; \
> > > > +    a[101] = x; \
> > > > +    a[102] = x; \
> > > > +    a[103] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +STP_TEST_ALIGNED(int32_t);
> > > > +STP_TEST_ALIGNED(int64_t);
> > > > +STP_TEST_ALIGNED(v4si);
> > > > +STP_TEST_UNALIGNED(int32_t);
> > > > +STP_TEST_UNALIGNED(int64_t);
> > > > +STP_TEST_UNALIGNED(v4si);
> > > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> > > > +
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_never.c b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> > > > new file mode 100644
> > > > index 00000000000..ddde658f807
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> > > > @@ -0,0 +1,60 @@
> > > > +/* { dg-options "-O2 --param=aarch64-stp-policy=2 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define STP_TEST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[0] = x; \
> > > > +    arr[1] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[0] = x; \
> > > > +    a[1] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[100] = x; \
> > > > +    arr[101] = x; \
> > > > +    arr[102] = x; \
> > > > +    arr[103] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[100] = x; \
> > > > +    a[101] = x; \
> > > > +    a[102] = x; \
> > > > +    a[103] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +STP_TEST_ALIGNED(int32_t);
> > > > +STP_TEST_ALIGNED(int64_t);
> > > > +STP_TEST_ALIGNED(v4si);
> > > > +STP_TEST_UNALIGNED(int32_t);
> > > > +STP_TEST_UNALIGNED(int64_t);
> > > > +STP_TEST_UNALIGNED(v4si);
> > > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > > > +
> > > > --
> > > > 2.40.1
> > > >
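
The test macros quoted above all rely on the same pointer-masking idiom: the incoming pointer is rounded down to a 2 * 8 * _Alignof(TYPE) boundary, so arr[0] is known to be aligned to twice the element size, while arr+1 is known not to be. A minimal sketch of that arithmetic follows. The helper names round_down, is_aligned and pair_is_aligned are hypothetical and do not appear in the patch, and the check models only the "aligned to 2 * element_size" condition from the cover letter, not GCC's actual implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Round ADDR down to a multiple of ALIGN, which must be a power of two.
   This mirrors the `ptr & ~(N - 1)` idiom used by the test macros.  */
static uintptr_t
round_down (uintptr_t addr, uintptr_t align)
{
  assert (align != 0 && (align & (align - 1)) == 0);
  return addr & ~(align - 1);
}

/* Return nonzero if ADDR is a multiple of ALIGN (a power of two).  */
static int
is_aligned (uintptr_t addr, uintptr_t align)
{
  assert (align != 0 && (align & (align - 1)) == 0);
  return (addr & (align - 1)) == 0;
}

/* Hypothetical helper: is the first of two consecutive ELEM_SIZE-byte
   accesses, starting at element INDEX of the array at BASE, aligned to
   2 * ELEM_SIZE?  ELEM_SIZE must be a power of two.  */
static int
pair_is_aligned (uintptr_t base, uintptr_t index, uintptr_t elem_size)
{
  return is_aligned (base + index * elem_size, 2 * elem_size);
}
```

With a 4-byte element and a base rounded down to 64 bytes, elements 0 and 100 start an aligned pair under this model, whereas elements 1 and 101 (the arr+1 cases) do not.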

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp formation.
  2023-09-26  6:06         ` Manos Anagnostakis
  2023-09-26  8:34           ` Tamar Christina
@ 2023-09-26  8:35           ` Kyrylo Tkachov
  2023-09-26  8:50             ` Kyrylo Tkachov
  1 sibling, 1 reply; 12+ messages in thread
From: Kyrylo Tkachov @ 2023-09-26  8:35 UTC (permalink / raw)
  To: Manos Anagnostakis, gcc-patches; +Cc: Philipp Tomsich, Andrew Pinski

Hi Manos,

Thank you for the quick turnaround. Please post the patch that uses a --param with an enum; I think that's the direction we should be going with this patch.
Thanks,
Kyrill

From: Manos Anagnostakis <manos.anagnostakis@vrull.eu> 
Sent: Tuesday, September 26, 2023 7:06 AM
To: gcc-patches@gcc.gnu.org
Cc: Philipp Tomsich <philipp.tomsich@vrull.eu>; Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>; Andrew Pinski <pinskia@gmail.com>
Subject: Re: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp formation.

Thank you Andrew for the input.

I've prepared a patch using --param with an enum, which seems a more suitable approach to me, as strings are also more descriptive.

The current patch needed an adjustment on how to call the parsing functions to match the compiler coding style.

Both are bootstrapped and regtested.

I can send a V4 of whichever is preferred.

Thanks!

Manos.

On Mon, Sep 25, 2023 at 11:57 PM Andrew Pinski <pinskia@gmail.com> wrote:
On Mon, Sep 25, 2023 at 1:04 PM Andrew Pinski <pinskia@gmail.com> wrote:
>
> On Mon, Sep 25, 2023 at 12:59 PM Philipp Tomsich
> <philipp.tomsich@vrull.eu> wrote:
> >
> > On Mon, 25 Sept 2023 at 21:54, Andrew Pinski <pinskia@gmail.com> wrote:
> > >
> > > On Mon, Sep 25, 2023 at 12:50 PM Manos Anagnostakis
> > > <manos.anagnostakis@vrull.eu> wrote:
> > > >
> > > > This patch implements the following TODO in gcc/config/aarch64/aarch64.cc
> > > > to provide the requested behaviour for handling ldp and stp:
> > > >
> > > >   /* Allow the tuning structure to disable LDP instruction formation
> > > >      from combining instructions (e.g., in peephole2).
> > > >      TODO: Implement fine-grained tuning control for LDP and STP:
> > > >            1. control policies for load and store separately;
> > > >            2. support the following policies:
> > > >               - default (use what is in the tuning structure)
> > > >               - always
> > > >               - never
> > > >               - aligned (only if the compiler can prove that the
> > > >                 load will be aligned to 2 * element_size)  */
> > > >
> > > > It provides two new and concrete target-specific command-line parameters
> > > > --param=aarch64-ldp-policy= and --param=aarch64-stp-policy=
> > > > to give the ability to control load and store policies separately as
> > > > stated in part 1 of the TODO.
> > > >
> > > > The accepted values for both parameters are:
> > > > - 0: Use the policy of the tuning structure (default).
> > > > - 1: Emit ldp/stp regardless of alignment.
> > > > - 2: Do not emit ldp/stp.
> > > > - 3: In order to emit ldp/stp, first check if the load/store will
> > > >   be aligned to 2 * element_size.
> > >
> > > Instead of a number, would it make sense to use a string
> > > (ENUM) for this param?
> > > Also I think using --param is a bad idea if it is going to be
> > > documented in the user manual.
> > > Maybe a -m option should be used instead.
> >
> > See https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631283.html
> > for the discussion triggering the change from -m... to --param and the
> > change to using a number instead of a string.
>
> That is the opposite of the current GCC practice across all targets.
> Things like this should be consistent and if one target decides to do
> it differently, then maybe it should NOT.
> Anyways we should document the correct coding style for options so we
> don't have these back and forths again.

Kyrylo:
>  It will have to take a number rather than a string but that should be okay, as long as the right values are documented in invoke.texi.

No, it does not need to be a number. --param=ranger-debug= does not
take a number; it takes an enum.
One of the benefits of moving --param support over to .opt is that it
allows more than just numbers.
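
To make the enum-valued variant concrete, here is a minimal standalone sketch of mapping policy strings to an enum and then resolving the four policies described in the cover letter. It is not the patch's code: the enum, the function names and the accepted strings ("default", "always", "never", "aligned") are assumptions for illustration only. In GCC itself an enum-valued --param would be declared in the .opt file rather than parsed by hand.

```c
#include <string.h>

/* Hypothetical policy enum; the names are illustrative, not GCC's.  */
enum pair_policy
{
  PAIR_POLICY_DEFAULT,
  PAIR_POLICY_ALWAYS,
  PAIR_POLICY_NEVER,
  PAIR_POLICY_ALIGNED,
  PAIR_POLICY_INVALID
};

/* Map a policy string to the enum.  Unknown strings yield
   PAIR_POLICY_INVALID.  The accepted spellings are assumed.  */
static enum pair_policy
parse_pair_policy (const char *s)
{
  static const struct { const char *name; enum pair_policy value; } table[] = {
    { "default", PAIR_POLICY_DEFAULT },
    { "always", PAIR_POLICY_ALWAYS },
    { "never", PAIR_POLICY_NEVER },
    { "aligned", PAIR_POLICY_ALIGNED }
  };

  for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
    if (strcmp (s, table[i].name) == 0)
      return table[i].value;
  return PAIR_POLICY_INVALID;
}

/* Decide whether a load/store pair may be formed.  REQUESTED is the
   user's choice, TUNING is the policy from the CPU tuning structure,
   and ALIGNED_P says whether the access is known to be aligned to
   2 * element_size.  */
static int
pair_allowed_p (enum pair_policy requested, enum pair_policy tuning,
		int aligned_p)
{
  /* "default" defers to whatever the tuning structure says.  */
  enum pair_policy p = requested == PAIR_POLICY_DEFAULT ? tuning : requested;

  switch (p)
    {
    case PAIR_POLICY_ALWAYS:
      return 1;
    case PAIR_POLICY_ALIGNED:
      return aligned_p != 0;
    default:
      return 0;
    }
}
```

Keeping the load and store policies as two independent values of this kind is what allows them to be controlled separately, as item 1 of the TODO asks.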

Thanks,
Andrew


>
>
> Thanks,
> Andrew
>
> >
> > Thanks,
> > Philipp.
> >
> > >
> > > Thanks,
> > > Andrew
> > >
> > > >
> > > > gcc/ChangeLog:
> > > >         * config/aarch64/aarch64-protos.h (struct tune_params): Add
> > > >         appropriate enums for the policies.
> > > >         * config/aarch64/aarch64-tuning-flags.def
> > > >         (AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
> > > >         options.
> > > >         * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
> > > >         function to parse ldp-policy parameter.
> > > >         (aarch64_parse_stp_policy): New function to parse stp-policy parameter.
> > > >         (aarch64_override_options_internal): Call parsing functions.
> > > >         (aarch64_operands_ok_for_ldpstp): Add parameter-value check and
> > > >         alignment check and remove superseded ones.
> > > >         (aarch64_operands_adjust_ok_for_ldpstp): Add parameter-value check and
> > > >         alignment check and remove superseded ones.
> > > >         * config/aarch64/aarch64.opt: Add options.
> > > >         * doc/invoke.texi: Document the parameters accordingly.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >         * gcc.target/aarch64/ampere1-no_ldp_combine.c: Removed.
> > > >         * gcc.target/aarch64/ldp_aligned.c: New test.
> > > >         * gcc.target/aarch64/ldp_always.c: New test.
> > > >         * gcc.target/aarch64/ldp_never.c: New test.
> > > >         * gcc.target/aarch64/stp_aligned.c: New test.
> > > >         * gcc.target/aarch64/stp_always.c: New test.
> > > >         * gcc.target/aarch64/stp_never.c: New test.
> > > >
> > > > Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> > > > ---
> > > > Changes in v3:
> > > >         - Changed command-line options to target-specific parameters
> > > >           and documented them accordingly in doc/invoke.texi.
> > > >         - Removed ampere1-no_ldp_combine.c test as superseded.
> > > >
> > > >  gcc/config/aarch64/aarch64-protos.h           |  24 ++
> > > >  gcc/config/aarch64/aarch64-tuning-flags.def   |   8 -
> > > >  gcc/config/aarch64/aarch64.cc                 | 215 +++++++++++++-----
> > > >  gcc/config/aarch64/aarch64.opt                |   8 +
> > > >  gcc/doc/invoke.texi                           |  30 +++
> > > >  .../aarch64/ampere1-no_ldp_combine.c          |  11 -
> > > >  .../gcc.target/aarch64/ldp_aligned.c          |  66 ++++++
> > > >  gcc/testsuite/gcc.target/aarch64/ldp_always.c |  66 ++++++
> > > >  gcc/testsuite/gcc.target/aarch64/ldp_never.c  |  66 ++++++
> > > >  .../gcc.target/aarch64/stp_aligned.c          |  60 +++++
> > > >  gcc/testsuite/gcc.target/aarch64/stp_always.c |  60 +++++
> > > >  gcc/testsuite/gcc.target/aarch64/stp_never.c  |  60 +++++
> > > >  12 files changed, 600 insertions(+), 74 deletions(-)
> > > >  delete mode 100644 gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_always.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_never.c
> > > >
> > > > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> > > > index 70303d6fd95..be1d73490ed 100644
> > > > --- a/gcc/config/aarch64/aarch64-protos.h
> > > > +++ b/gcc/config/aarch64/aarch64-protos.h
> > > > @@ -568,6 +568,30 @@ struct tune_params
> > > >    /* Place prefetch struct pointer at the end to enable type checking
> > > >       errors when tune_params misses elements (e.g., from erroneous merges).  */
> > > >    const struct cpu_prefetch_tune *prefetch;
> > > > +/* An enum specifying how to handle load pairs using a fine-grained policy:
> > > > +   - LDP_POLICY_ALIGNED: Emit ldp if the source pointer is aligned
> > > > +   to at least double the alignment of the type.
> > > > +   - LDP_POLICY_ALWAYS: Emit ldp regardless of alignment.
> > > > +   - LDP_POLICY_NEVER: Do not emit ldp.  */
> > > > +
> > > > +  enum aarch64_ldp_policy_model
> > > > +  {
> > > > +    LDP_POLICY_ALIGNED,
> > > > +    LDP_POLICY_ALWAYS,
> > > > +    LDP_POLICY_NEVER
> > > > +  } ldp_policy_model;
> > > > +/* An enum specifying how to handle store pairs using a fine-grained policy:
> > > > +   - STP_POLICY_ALIGNED: Emit stp if the source pointer is aligned
> > > > +   to at least double the alignment of the type.
> > > > +   - STP_POLICY_ALWAYS: Emit stp regardless of alignment.
> > > > +   - STP_POLICY_NEVER: Do not emit stp.  */
> > > > +
> > > > +  enum aarch64_stp_policy_model
> > > > +  {
> > > > +    STP_POLICY_ALIGNED,
> > > > +    STP_POLICY_ALWAYS,
> > > > +    STP_POLICY_NEVER
> > > > +  } stp_policy_model;
> > > >  };
> > > >
> > > >  /* Classifies an address.
> > > > diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
> > > > index 52112ba7c48..774568e9106 100644
> > > > --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> > > > +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> > > > @@ -30,11 +30,6 @@
> > > >
> > > >  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
> > > >
> > > > -/* Don't create non-8 byte aligned load/store pair.  That is if the
> > > > -two load/stores are not at least 8 byte aligned don't create load/store
> > > > -pairs.   */
> > > > -AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)
> > > > -
> > > >  /* Some of the optional shift to some arthematic instructions are
> > > >     considered cheap.  Logical shift left <=4 with or without a
> > > >     zero extend are considered cheap.  Sign extend; non logical shift left
> > > > @@ -44,9 +39,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
> > > >  /* Disallow load/store pair instructions on Q-registers.  */
> > > >  AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)
> > > >
> > > > -/* Disallow load-pair instructions to be formed in combine/peephole.  */
> > > > -AARCH64_EXTRA_TUNING_OPTION ("no_ldp_combine", NO_LDP_COMBINE)
> > > > -
> > > >  AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
> > > >
> > > >  AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
> > > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > > > index 219c4ee6d4c..9eeb5469cf9 100644
> > > > --- a/gcc/config/aarch64/aarch64.cc
> > > > +++ b/gcc/config/aarch64/aarch64.cc
> > > > @@ -1357,7 +1357,9 @@ static const struct tune_params generic_tunings =
> > > >       Neoverse V1.  It does not have a noticeable effect on A64FX and should
> > > >       have at most a very minor effect on SVE2 cores.  */
> > > >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params cortexa35_tunings =
> > > > @@ -1391,7 +1393,9 @@ static const struct tune_params cortexa35_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params cortexa53_tunings =
> > > > @@ -1425,7 +1429,9 @@ static const struct tune_params cortexa53_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params cortexa57_tunings =
> > > > @@ -1459,7 +1465,9 @@ static const struct tune_params cortexa57_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),        /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params cortexa72_tunings =
> > > > @@ -1493,7 +1501,9 @@ static const struct tune_params cortexa72_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params cortexa73_tunings =
> > > > @@ -1527,7 +1537,9 @@ static const struct tune_params cortexa73_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >
> > > > @@ -1562,7 +1574,9 @@ static const struct tune_params exynosm1_tunings =
> > > >    48,  /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> > > > -  &exynosm1_prefetch_tune
> > > > +  &exynosm1_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params thunderxt88_tunings =
> > > > @@ -1594,8 +1608,10 @@ static const struct tune_params thunderxt88_tunings =
> > > >    2,   /* min_div_recip_mul_df.  */
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > > > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),    /* tune_flags.  */
> > > > -  &thunderxt88_prefetch_tune
> > > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > +  &thunderxt88_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params thunderx_tunings =
> > > > @@ -1627,9 +1643,10 @@ static const struct tune_params thunderx_tunings =
> > > >    2,   /* min_div_recip_mul_df.  */
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > > > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW
> > > > -   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> > > > -  &thunderx_prefetch_tune
> > > > +  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /* tune_flags.  */
> > > > +  &thunderx_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params tsv110_tunings =
> > > > @@ -1663,7 +1680,9 @@ static const struct tune_params tsv110_tunings =
> > > >    0,    /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,     /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),     /* tune_flags.  */
> > > > -  &tsv110_prefetch_tune
> > > > +  &tsv110_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params xgene1_tunings =
> > > > @@ -1696,7 +1715,9 @@ static const struct tune_params xgene1_tunings =
> > > >    17,  /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> > > > -  &xgene1_prefetch_tune
> > > > +  &xgene1_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params emag_tunings =
> > > > @@ -1729,7 +1750,9 @@ static const struct tune_params emag_tunings =
> > > >    17,  /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> > > > -  &xgene1_prefetch_tune
> > > > +  &xgene1_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params qdf24xx_tunings =
> > > > @@ -1763,7 +1786,9 @@ static const struct tune_params qdf24xx_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS, /* tune_flags.  */
> > > > -  &qdf24xx_prefetch_tune
> > > > +  &qdf24xx_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  /* Tuning structure for the Qualcomm Saphira core.  Default to falkor values
> > > > @@ -1799,7 +1824,9 @@ static const struct tune_params saphira_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),           /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params thunderx2t99_tunings =
> > > > @@ -1833,7 +1860,9 @@ static const struct tune_params thunderx2t99_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &thunderx2t99_prefetch_tune
> > > > +  &thunderx2t99_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params thunderx3t110_tunings =
> > > > @@ -1867,7 +1896,9 @@ static const struct tune_params thunderx3t110_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &thunderx3t110_prefetch_tune
> > > > +  &thunderx3t110_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params neoversen1_tunings =
> > > > @@ -1900,7 +1931,9 @@ static const struct tune_params neoversen1_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params ampere1_tunings =
> > > > @@ -1936,8 +1969,10 @@ static const struct tune_params ampere1_tunings =
> > > >    2,   /* min_div_recip_mul_df.  */
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> > > > -  &ampere1_prefetch_tune
> > > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > +  &ampere1_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params ampere1a_tunings =
> > > > @@ -1974,8 +2009,10 @@ static const struct tune_params ampere1a_tunings =
> > > >    2,   /* min_div_recip_mul_df.  */
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> > > > -  &ampere1_prefetch_tune
> > > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > +  &ampere1_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const advsimd_vec_cost neoversev1_advsimd_vector_cost =
> > > > @@ -2156,7 +2193,9 @@ static const struct tune_params neoversev1_tunings =
> > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> > > >     | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const sve_vec_cost neoverse512tvb_sve_vector_cost =
> > > > @@ -2293,7 +2332,9 @@ static const struct tune_params neoverse512tvb_tunings =
> > > >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const advsimd_vec_cost neoversen2_advsimd_vector_cost =
> > > > @@ -2483,7 +2524,9 @@ static const struct tune_params neoversen2_tunings =
> > > >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const advsimd_vec_cost neoversev2_advsimd_vector_cost =
> > > > @@ -2673,7 +2716,9 @@ static const struct tune_params neoversev2_tunings =
> > > >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > > > -  &generic_prefetch_tune
> > > > +  &generic_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  static const struct tune_params a64fx_tunings =
> > > > @@ -2706,7 +2751,9 @@ static const struct tune_params a64fx_tunings =
> > > >    0,   /* max_case_values.  */
> > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > -  &a64fx_prefetch_tune
> > > > +  &a64fx_prefetch_tune,
> > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > >  };
> > > >
> > > >  /* Support for fine-grained override of the tuning structures.  */
> > > > @@ -17819,6 +17866,34 @@ aarch64_parse_tune (const char *to_parse, const struct processor **res)
> > > >    return AARCH_PARSE_INVALID_ARG;
> > > >  }
> > > >
> > > > +/* Parse a command-line -param=aarch64-ldp-policy= parameter.  VALUE is
> > > > +   the value of the parameter.  */
> > > > +
> > > > +static void
> > > > +aarch64_parse_ldp_policy (const unsigned int value, struct tune_params* tune)
> > > > +{
> > > > +  if (value == 1)
> > > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALWAYS;
> > > > +  else if (value == 2)
> > > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_NEVER;
> > > > +  else if (value == 3)
> > > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALIGNED;
> > > > +}
> > > > +
> > > > +/* Parse a command-line -param=aarch64-stp-policy= parameter.  VALUE is
> > > > +   the value of the parameter.  */
> > > > +
> > > > +static void
> > > > +aarch64_parse_stp_policy (const unsigned int value, struct tune_params* tune)
> > > > +{
> > > > +  if (value == 1)
> > > > +    tune->stp_policy_model = tune_params::STP_POLICY_ALWAYS;
> > > > +  else if (value == 2)
> > > > +    tune->stp_policy_model = tune_params::STP_POLICY_NEVER;
> > > > +  else if (value == 3)
> > > > +    tune->stp_policy_model = tune_params::STP_POLICY_ALIGNED;
> > > > +}
> > > > +
> > > >  /* Parse TOKEN, which has length LENGTH to see if it is an option
> > > >     described in FLAG.  If it is, return the index bit for that fusion type.
> > > >     If not, error (printing OPTION_NAME) and return zero.  */
> > > > @@ -18167,6 +18242,12 @@ aarch64_override_options_internal (struct gcc_options *opts)
> > > >      aarch64_parse_override_string (opts->x_aarch64_override_tune_string,
> > > >                                    &aarch64_tune_params);
> > > >
> > > > +  aarch64_parse_ldp_policy (aarch64_ldp_policy,
> > > > +                             &aarch64_tune_params);
> > > > +
> > > > +  aarch64_parse_stp_policy (aarch64_stp_policy,
> > > > +                             &aarch64_tune_params);
> > > > +
> > > >    /* This target defaults to strict volatile bitfields.  */
> > > >    if (opts->x_flag_strict_volatile_bitfields < 0 && abi_version_at_least (2))
> > > >      opts->x_flag_strict_volatile_bitfields = 1;
> > > > @@ -26468,18 +26549,14 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
> > > >    enum reg_class rclass_1, rclass_2;
> > > >    rtx mem_1, mem_2, reg_1, reg_2;
> > > >
> > > > -  /* Allow the tuning structure to disable LDP instruction formation
> > > > -     from combining instructions (e.g., in peephole2).
> > > > -     TODO: Implement fine-grained tuning control for LDP and STP:
> > > > -          1. control policies for load and store separately;
> > > > -          2. support the following policies:
> > > > -             - default (use what is in the tuning structure)
> > > > -             - always
> > > > -             - never
> > > > -             - aligned (only if the compiler can prove that the
> > > > -               load will be aligned to 2 * element_size)  */
> > > > -  if (load && (aarch64_tune_params.extra_tuning_flags
> > > > -              & AARCH64_EXTRA_TUNE_NO_LDP_COMBINE))
> > > > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> > > > +  if (load
> > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> > > > +    return false;
> > > > +
> > > > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> > > > +  if (!load
> > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
> > > >      return false;
> > > >
> > > >    if (load)
> > > > @@ -26506,13 +26583,22 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
> > > >    if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
> > > >      return false;
> > > >
> > > > -  /* If we have SImode and slow unaligned ldp,
> > > > -     check the alignment to be at least 8 byte. */
> > > > -  if (mode == SImode
> > > > -      && (aarch64_tune_params.extra_tuning_flags
> > > > -          & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> > > > -      && !optimize_size
> > > > -      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
> > > > +  /* If we have LDP_POLICY_ALIGNED,
> > > > +     do not emit the load pair unless the alignment is checked to be
> > > > +     at least double the alignment of the type.  */
> > > > +  if (load
> > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> > > > +      && !optimize_function_for_size_p (cfun)
> > > > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> > > > +    return false;
> > > > +
> > > > +  /* If we have STP_POLICY_ALIGNED,
> > > > +     do not emit the store pair unless the alignment is checked to be
> > > > +     at least double the alignment of the type.  */
> > > > +  if (!load
> > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> > > > +      && !optimize_function_for_size_p (cfun)
> > > > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> > > >      return false;
> > > >
> > > >    /* Check if the addresses are in the form of [base+offset].  */
> > > > @@ -26640,6 +26726,16 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
> > > >    HOST_WIDE_INT offvals[num_insns], msize;
> > > >    rtx mem[num_insns], reg[num_insns], base[num_insns], offset[num_insns];
> > > >
> > > > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> > > > +  if (load
> > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> > > > +    return false;
> > > > +
> > > > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> > > > +  if (!load
> > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
> > > > +    return false;
> > > > +
> > > >    if (load)
> > > >      {
> > > >        for (int i = 0; i < num_insns; i++)
> > > > @@ -26729,13 +26825,22 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
> > > >    if (offvals[0] % msize != offvals[2] % msize)
> > > >      return false;
> > > >
> > > > -  /* If we have SImode and slow unaligned ldp,
> > > > -     check the alignment to be at least 8 byte. */
> > > > -  if (mode == SImode
> > > > -      && (aarch64_tune_params.extra_tuning_flags
> > > > -         & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> > > > -      && !optimize_size
> > > > -      && MEM_ALIGN (mem[0]) < 8 * BITS_PER_UNIT)
> > > > +  /* If we have LDP_POLICY_ALIGNED,
> > > > +     do not emit the load pair unless the alignment is checked to be
> > > > +     at least double the alignment of the type.  */
> > > > +  if (load
> > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> > > > +      && !optimize_function_for_size_p (cfun)
> > > > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> > > > +    return false;
> > > > +
> > > > +  /* If we have STP_POLICY_ALIGNED,
> > > > +     do not emit the store pair unless the alignment is checked to be
> > > > +     at least double the alignment of the type.  */
> > > > +  if (!load
> > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> > > > +      && !optimize_function_for_size_p (cfun)
> > > > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> > > >      return false;
> > > >
> > > >    return true;
> > > > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> > > > index 4a0580435a8..f61e3f968d4 100644
> > > > --- a/gcc/config/aarch64/aarch64.opt
> > > > +++ b/gcc/config/aarch64/aarch64.opt
> > > > @@ -337,3 +337,11 @@ Constant memset size in bytes from which to start using MOPS sequence.
> > > >  -param=aarch64-vect-unroll-limit=
> > > >  Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param
> > > >  Limit how much the autovectorizer may unroll a loop.
> > > > +
> > > > +-param=aarch64-ldp-policy=
> > > > +Target Joined UInteger Var(aarch64_ldp_policy) Init(0) IntegerRange(0, 3) Param
> > > > +Fine-grained policy for load pairs.
> > > > +
> > > > +-param=aarch64-stp-policy=
> > > > +Target Joined UInteger Var(aarch64_stp_policy) Init(0) IntegerRange(0, 3) Param
> > > > +Fine-grained policy for store pairs.
> > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > > index 146b40414b0..000dd5541f4 100644
> > > > --- a/gcc/doc/invoke.texi
> > > > +++ b/gcc/doc/invoke.texi
> > > > @@ -16508,6 +16508,36 @@ Use both Advanced SIMD and SVE.  Prefer SVE when the costs are deemed equal.
> > > >  @end table
> > > >  The default value is 0.
> > > >
> > > > +@item aarch64-ldp-policy
> > > > +Fine-grained policy for load pairs.  Accepts values from 0 to 3, inclusive.
> > > > +@table @samp
> > > > +@item 0
> > > > +Use the policy of the tuning structure.
> > > > +@item 1
> > > > +Emit ldp regardless of alignment.
> > > > +@item 2
> > > > +Do not emit ldp.
> > > > +@item 3
> > > > +Emit ldp only if the source pointer is aligned to at least double the alignment
> > > > +of the type.
> > > > +@end table
> > > > +The default value is 0.
> > > > +
> > > > +@item aarch64-stp-policy
> > > > +Fine-grained policy for store pairs.  Accepts values from 0 to 3, inclusive.
> > > > +@table @samp
> > > > +@item 0
> > > > +Use the policy of the tuning structure.
> > > > +@item 1
> > > > +Emit stp regardless of alignment.
> > > > +@item 2
> > > > +Do not emit stp.
> > > > +@item 3
> > > > +Emit stp only if the source pointer is aligned to at least double the alignment
> > > > +of the type.
> > > > +@end table
> > > > +The default value is 0.
> > > > +
> > > >  @item aarch64-loop-vect-issue-rate-niters
> > > >  The tuning for some AArch64 CPUs tries to take both latencies and issue
> > > >  rates into account when deciding whether a loop should be vectorized
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c b/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > > > deleted file mode 100644
> > > > index bc871f4481d..00000000000
> > > > --- a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > > > +++ /dev/null
> > > > @@ -1,11 +0,0 @@
> > > > -/* { dg-options "-O3 -mtune=ampere1" } */
> > > > -
> > > > -long
> > > > -foo (long a[])
> > > > -{
> > > > -  return a[0] + a[1];
> > > > -}
> > > > -
> > > > -/* We should see two ldrs instead of one ldp. */
> > > > -/* { dg-final { scan-assembler {\tldr\t} } } */
> > > > -/* { dg-final { scan-assembler-not {\tldp\t} } } */
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > > > new file mode 100644
> > > > index 00000000000..8e43faab70d
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > > > @@ -0,0 +1,66 @@
> > > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=3 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[0]; \
> > > > +    a_1 = arr[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[0]; \
> > > > +    a_1 = a[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[100]; \
> > > > +    a_1 = arr[101]; \
> > > > +    a_2 = arr[102]; \
> > > > +    a_3 = arr[103]; \
> > > > +    a_4 = arr[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[100]; \
> > > > +    a_1 = a[101]; \
> > > > +    a_2 = a[102]; \
> > > > +    a_3 = a[103]; \
> > > > +    a_4 = a[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +LDP_TEST_ALIGNED(int32_t);
> > > > +LDP_TEST_ALIGNED(int64_t);
> > > > +LDP_TEST_ALIGNED(v4si);
> > > > +LDP_TEST_UNALIGNED(int32_t);
> > > > +LDP_TEST_UNALIGNED(int64_t);
> > > > +LDP_TEST_UNALIGNED(v4si);
> > > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > > > +
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_always.c b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > > > new file mode 100644
> > > > index 00000000000..532ca607565
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > > > @@ -0,0 +1,66 @@
> > > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=1 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[0]; \
> > > > +    a_1 = arr[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[0]; \
> > > > +    a_1 = a[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[100]; \
> > > > +    a_1 = arr[101]; \
> > > > +    a_2 = arr[102]; \
> > > > +    a_3 = arr[103]; \
> > > > +    a_4 = arr[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[100]; \
> > > > +    a_1 = a[101]; \
> > > > +    a_2 = a[102]; \
> > > > +    a_3 = a[103]; \
> > > > +    a_4 = a[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +LDP_TEST_ALIGNED(int32_t);
> > > > +LDP_TEST_ALIGNED(int64_t);
> > > > +LDP_TEST_ALIGNED(v4si);
> > > > +LDP_TEST_UNALIGNED(int32_t);
> > > > +LDP_TEST_UNALIGNED(int64_t);
> > > > +LDP_TEST_UNALIGNED(v4si);
> > > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> > > > +
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_never.c b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > > > new file mode 100644
> > > > index 00000000000..b39941c18d7
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > > > @@ -0,0 +1,66 @@
> > > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=2 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[0]; \
> > > > +    a_1 = arr[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[0]; \
> > > > +    a_1 = a[1]; \
> > > > +    return a_0 + a_1; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    a_0 = arr[100]; \
> > > > +    a_1 = arr[101]; \
> > > > +    a_2 = arr[102]; \
> > > > +    a_3 = arr[103]; \
> > > > +    a_4 = arr[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a_0 = a[100]; \
> > > > +    a_1 = a[101]; \
> > > > +    a_2 = a[102]; \
> > > > +    a_3 = a[103]; \
> > > > +    a_4 = a[110]; \
> > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > +}
> > > > +
> > > > +LDP_TEST_ALIGNED(int32_t);
> > > > +LDP_TEST_ALIGNED(int64_t);
> > > > +LDP_TEST_ALIGNED(v4si);
> > > > +LDP_TEST_UNALIGNED(int32_t);
> > > > +LDP_TEST_UNALIGNED(int64_t);
> > > > +LDP_TEST_UNALIGNED(v4si);
> > > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > > > +
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_aligned.c b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > > > new file mode 100644
> > > > index 00000000000..01f294bb090
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > > > @@ -0,0 +1,60 @@
> > > > +/* { dg-options "-O2 --param=aarch64-stp-policy=3 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define STP_TEST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[0] = x; \
> > > > +    arr[1] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[0] = x; \
> > > > +    a[1] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[100] = x; \
> > > > +    arr[101] = x; \
> > > > +    arr[102] = x; \
> > > > +    arr[103] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[100] = x; \
> > > > +    a[101] = x; \
> > > > +    a[102] = x; \
> > > > +    a[103] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +STP_TEST_ALIGNED(int32_t);
> > > > +STP_TEST_ALIGNED(int64_t);
> > > > +STP_TEST_ALIGNED(v4si);
> > > > +STP_TEST_UNALIGNED(int32_t);
> > > > +STP_TEST_UNALIGNED(int64_t);
> > > > +STP_TEST_UNALIGNED(v4si);
> > > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > > > +
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_always.c b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> > > > new file mode 100644
> > > > index 00000000000..cedb461b5b2
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> > > > @@ -0,0 +1,60 @@
> > > > +/* { dg-options "-O2 --param=aarch64-stp-policy=1 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define STP_TEST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[0] = x; \
> > > > +    arr[1] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[0] = x; \
> > > > +    a[1] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[100] = x; \
> > > > +    arr[101] = x; \
> > > > +    arr[102] = x; \
> > > > +    arr[103] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[100] = x; \
> > > > +    a[101] = x; \
> > > > +    a[102] = x; \
> > > > +    a[103] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +STP_TEST_ALIGNED(int32_t);
> > > > +STP_TEST_ALIGNED(int64_t);
> > > > +STP_TEST_ALIGNED(v4si);
> > > > +STP_TEST_UNALIGNED(int32_t);
> > > > +STP_TEST_UNALIGNED(int64_t);
> > > > +STP_TEST_UNALIGNED(v4si);
> > > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> > > > +
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_never.c b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> > > > new file mode 100644
> > > > index 00000000000..ddde658f807
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> > > > @@ -0,0 +1,60 @@
> > > > +/* { dg-options "-O2 --param=aarch64-stp-policy=2 -mcpu=generic" } */
> > > > +
> > > > +#include <stdlib.h>
> > > > +#include <stdint.h>
> > > > +
> > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > +
> > > > +#define STP_TEST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[0] = x; \
> > > > +    arr[1] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[0] = x; \
> > > > +    a[1] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    arr[100] = x; \
> > > > +    arr[101] = x; \
> > > > +    arr[102] = x; \
> > > > +    arr[103] = x; \
> > > > +    return arr; \
> > > > +}
> > > > +
> > > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > +    TYPE *a = arr+1; \
> > > > +    a[100] = x; \
> > > > +    a[101] = x; \
> > > > +    a[102] = x; \
> > > > +    a[103] = x; \
> > > > +    return a; \
> > > > +}
> > > > +
> > > > +STP_TEST_ALIGNED(int32_t);
> > > > +STP_TEST_ALIGNED(int64_t);
> > > > +STP_TEST_ALIGNED(v4si);
> > > > +STP_TEST_UNALIGNED(int32_t);
> > > > +STP_TEST_UNALIGNED(int64_t);
> > > > +STP_TEST_UNALIGNED(v4si);
> > > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > +
> > > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > > > +
> > > > --
> > > > 2.40.1
> > > >


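The policy checks that the quoted patch adds to
aarch64_operands_ok_for_ldpstp can be summarised in a small standalone
sketch. It is only an illustration under stated assumptions: the names
pair_policy and pair_allowed are hypothetical, the alignments are passed in
as plain bit counts instead of being read via MEM_ALIGN and
GET_MODE_ALIGNMENT, and the remaining (non-policy) conditions of the real
function are left out.

```cpp
// Hypothetical stand-in for the per-direction policy stored in the tuning
// structure (LDP_POLICY_* / STP_POLICY_* in the patch).
enum class pair_policy { always, never, aligned };

// Returns whether the policy alone permits forming a load/store pair.
// KNOWN_ALIGN_BITS is the alignment the compiler can prove for the first
// memory operand; MODE_ALIGN_BITS is the natural alignment of the mode.
constexpr bool
pair_allowed (pair_policy policy, unsigned known_align_bits,
              unsigned mode_align_bits, bool optimize_for_size)
{
  // "never": reject the pair outright.
  if (policy == pair_policy::never)
    return false;

  // "aligned": reject unless the access is known to be aligned to at least
  // twice the mode alignment.  As in the patch, the check is skipped when
  // optimizing for size.
  if (policy == pair_policy::aligned
      && !optimize_for_size
      && known_align_bits < 2 * mode_align_bits)
    return false;

  // "always", or "aligned" with sufficient alignment.
  return true;
}
```

Under this reading, the stp_never.c test expects zero stp matches because
every candidate is rejected by the first check, regardless of alignment.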

-- 
Manos Anagnostakis | Compiler Engineer |
E: mailto:makeljana.shkurti@vrull.eu 

VRULL GmbH | Beatrixgasse 32 1030 Vienna |
 W: http://www.vrull.eu/ | https://www.linkedin.com/company/vrull/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp formation.
  2023-09-26  8:35           ` Kyrylo Tkachov
@ 2023-09-26  8:50             ` Kyrylo Tkachov
  2023-09-26 11:51               ` Manos Anagnostakis
  0 siblings, 1 reply; 12+ messages in thread
From: Kyrylo Tkachov @ 2023-09-26  8:50 UTC (permalink / raw)
  To: Kyrylo Tkachov, Manos Anagnostakis, gcc-patches
  Cc: Philipp Tomsich, Andrew Pinski



> -----Original Message-----
> From: Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
> Sent: Tuesday, September 26, 2023 9:36 AM
> To: Manos Anagnostakis <manos.anagnostakis@vrull.eu>; gcc-
> patches@gcc.gnu.org
> Cc: Philipp Tomsich <philipp.tomsich@vrull.eu>; Andrew Pinski
> <pinskia@gmail.com>
> Subject: RE: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp
> formation.
> 
> Hi Manos,
> 
> Thank you for the quick turnaround. Please post the patch that uses a
> --param with an enum. I think that's the direction we should be going with
> this patch.

Ah, and please address Tamar's feedback from https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631343.html
Thanks,
Kyrill
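
For reference, a minimal sketch of the semantics a string-valued policy
parameter could have. All names here (ldp_policy, parse_ldp_policy,
effective_ldp_policy) are hypothetical and not taken from the patch. In GCC
itself the string-to-enum mapping would normally come from Enum/EnumValue
records in the .opt file and not from hand-written parsing; the standalone
code only illustrates that "default" defers to the tuning structure while
the other values override it.

```cpp
#include <optional>
#include <string_view>

// Hypothetical policy values, following the four policies listed in the
// TODO: default, always, never, aligned.
enum class ldp_policy { use_default, always, never, aligned };

// Map the textual parameter value to a policy; std::nullopt for anything
// unrecognised.  Illustrative only: GCC would derive this from .opt.
constexpr std::optional<ldp_policy>
parse_ldp_policy (std::string_view s)
{
  if (s == "default") return ldp_policy::use_default;
  if (s == "always")  return ldp_policy::always;
  if (s == "never")   return ldp_policy::never;
  if (s == "aligned") return ldp_policy::aligned;
  return std::nullopt;
}

// "default" keeps whatever the selected tuning structure specifies;
// any other value overrides it.
constexpr ldp_policy
effective_ldp_policy (ldp_policy param, ldp_policy tuning)
{
  return param == ldp_policy::use_default ? tuning : param;
}
```

With this shape, a numeric spelling such as "2" is simply rejected, and the
accepted spellings document themselves on the command line.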

> 
> From: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> Sent: Tuesday, September 26, 2023 7:06 AM
> To: gcc-patches@gcc.gnu.org
> Cc: Philipp Tomsich <philipp.tomsich@vrull.eu>; Kyrylo Tkachov
> <Kyrylo.Tkachov@arm.com>; Andrew Pinski <pinskia@gmail.com>
> Subject: Re: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp
> formation.
> 
> Thank you Andrew for the input.
> 
> I've prepared a patch using --param with enum, which seems a more suitable
> approach to me as strings are more descriptive as well.
> 
> The current patch needed an adjustment on how to call the parsing functions
> to match the compiler coding style.
> 
> Both are bootstrapped and regtested.
> 
> I can send a V4 of whichever is preferred.
> 
> Thanks!
> 
> Manos.
> 
> On Mon, Sep 25, 2023 at 11:57 PM Andrew Pinski
> <mailto:pinskia@gmail.com> wrote:
> On Mon, Sep 25, 2023 at 1:04 PM Andrew Pinski <mailto:pinskia@gmail.com>
> wrote:
> >
> > On Mon, Sep 25, 2023 at 12:59 PM Philipp Tomsich
> > <mailto:philipp.tomsich@vrull.eu> wrote:
> > >
> > > On Mon, 25 Sept 2023 at 21:54, Andrew Pinski
> <mailto:pinskia@gmail.com> wrote:
> > > >
> > > > On Mon, Sep 25, 2023 at 12:50 PM Manos Anagnostakis
> > > > <mailto:manos.anagnostakis@vrull.eu> wrote:
> > > > >
> > > > > > This patch implements the following TODO in
> > > > > > gcc/config/aarch64/aarch64.cc to provide the requested behaviour
> > > > > > for handling ldp and stp:
> > > > >
> > > > >   /* Allow the tuning structure to disable LDP instruction formation
> > > > >      from combining instructions (e.g., in peephole2).
> > > > >      TODO: Implement fine-grained tuning control for LDP and STP:
> > > > >            1. control policies for load and store separately;
> > > > >            2. support the following policies:
> > > > >               - default (use what is in the tuning structure)
> > > > >               - always
> > > > >               - never
> > > > >               - aligned (only if the compiler can prove that the
> > > > >                 load will be aligned to 2 * element_size)  */
> > > > >
> > > > > > It provides two new and concrete target-specific command-line
> > > > > > parameters, --param=aarch64-ldp-policy= and
> > > > > > --param=aarch64-stp-policy=, to control load and store policies
> > > > > > separately, as stated in part 1 of the TODO.
> > > > >
> > > > > The accepted values for both parameters are:
> > > > > - 0: Use the policy of the tuning structure (default).
> > > > > - 1: Emit ldp/stp regardless of alignment.
> > > > > - 2: Do not emit ldp/stp.
> > > > > - 3: In order to emit ldp/stp, first check if the load/store will
> > > > >   be aligned to 2 * element_size.
> > > >
> > > > > Instead of a number, would it make sense to use a string
> > > > > (ENUM) for this param?
> > > > Also I think using --param is a bad idea if it is going to be
> > > > documented in the user manual.
> > > > Maybe a -m option should be used instead.
> > >
> > > > See https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631283.html
> > > for the discussion triggering the change from -m... to --param and the
> > > change to using a number instead of a string.
> >
> > That is the opposite of the current GCC practice across all targets.
> > Things like this should be consistent, and if one target decides to do
> > it differently, then maybe it should NOT.
> > Anyway, we should document the correct coding style for options so we
> > don't have these back-and-forths again.
> 
> Kyrylo:
> > It will have to take a number rather than a string but that should be okay,
> > as long as the right values are documented in invoke.texi.
> 
> No, it does not need to be a number. --param=ranger-debug= does not
> take a number; it takes an enum.
> One of the benefits of moving --param support over to .opt is that it
> allows more than just numbers.
> 
> Thanks,
> Andrew
> 
> 
> >
> >
> > Thanks,
> > Andrew
> >
> > >
> > > Thanks,
> > > Philipp.
> > >
> > > >
> > > > Thanks,
> > > > Andrew
> > > >
> > > > >
> > > > > gcc/ChangeLog:
> > > > >         * config/aarch64/aarch64-protos.h (struct tune_params): Add
> > > > >         appropriate enums for the policies.
> > > > >         * config/aarch64/aarch64-tuning-flags.def
> > > > >         (AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
> > > > >         options.
> > > > >         * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
> > > > >         function to parse ldp-policy parameter.
> > > > >         (aarch64_parse_stp_policy): New function to parse stp-policy
> parameter.
> > > > >         (aarch64_override_options_internal): Call parsing functions.
> > > > >         (aarch64_operands_ok_for_ldpstp): Add parameter-value check
> and
> > > > >         alignment check and remove superseded ones.
> > > > >         (aarch64_operands_adjust_ok_for_ldpstp): Add parameter-value
> check and
> > > > >         alignment check and remove superseded ones.
> > > > >         * config/aarch64/aarch64.opt: Add options.
> > > > >         * doc/invoke.texi: Document the parameters accordingly.
> > > > >
> > > > > gcc/testsuite/ChangeLog:
> > > > >         * gcc.target/aarch64/ampere1-no_ldp_combine.c: Removed.
> > > > >         * gcc.target/aarch64/ldp_aligned.c: New test.
> > > > >         * gcc.target/aarch64/ldp_always.c: New test.
> > > > >         * gcc.target/aarch64/ldp_never.c: New test.
> > > > >         * gcc.target/aarch64/stp_aligned.c: New test.
> > > > >         * gcc.target/aarch64/stp_always.c: New test.
> > > > >         * gcc.target/aarch64/stp_never.c: New test.
> > > > >
> > > > > Signed-off-by: Manos Anagnostakis
> <mailto:manos.anagnostakis@vrull.eu>
> > > > > ---
> > > > > Changes in v3:
> > > > >         - Changed command-line options to target-specific parameters
> > > > >           and documented them accordingly in doc/invoke.texi.
> > > > >         - Removed ampere1-no_ldp_combine.c test as superseded.
> > > > >
> > > > >  gcc/config/aarch64/aarch64-protos.h           |  24 ++
> > > > >  gcc/config/aarch64/aarch64-tuning-flags.def   |   8 -
> > > > >  gcc/config/aarch64/aarch64.cc                 | 215 +++++++++++++-----
> > > > >  gcc/config/aarch64/aarch64.opt                |   8 +
> > > > >  gcc/doc/invoke.texi                           |  30 +++
> > > > >  .../aarch64/ampere1-no_ldp_combine.c          |  11 -
> > > > >  .../gcc.target/aarch64/ldp_aligned.c          |  66 ++++++
> > > > >  gcc/testsuite/gcc.target/aarch64/ldp_always.c |  66 ++++++
> > > > >  gcc/testsuite/gcc.target/aarch64/ldp_never.c  |  66 ++++++
> > > > >  .../gcc.target/aarch64/stp_aligned.c          |  60 +++++
> > > > >  gcc/testsuite/gcc.target/aarch64/stp_always.c |  60 +++++
> > > > >  gcc/testsuite/gcc.target/aarch64/stp_never.c  |  60 +++++
> > > > >  12 files changed, 600 insertions(+), 74 deletions(-)
> > > > >  delete mode 100644 gcc/testsuite/gcc.target/aarch64/ampere1-
> no_ldp_combine.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_always.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_never.c
> > > > >
> > > > > diff --git a/gcc/config/aarch64/aarch64-protos.h
> b/gcc/config/aarch64/aarch64-protos.h
> > > > > index 70303d6fd95..be1d73490ed 100644
> > > > > --- a/gcc/config/aarch64/aarch64-protos.h
> > > > > +++ b/gcc/config/aarch64/aarch64-protos.h
> > > > > @@ -568,6 +568,30 @@ struct tune_params
> > > > >    /* Place prefetch struct pointer at the end to enable type checking
> > > > >       errors when tune_params misses elements (e.g., from erroneous
> merges).  */
> > > > >    const struct cpu_prefetch_tune *prefetch;
> > > > > +/* An enum specifying how to handle load pairs using a fine-grained
> policy:
> > > > > +   - LDP_POLICY_ALIGNED: Emit ldp if the source pointer is aligned
> > > > > +   to at least double the alignment of the type.
> > > > > +   - LDP_POLICY_ALWAYS: Emit ldp regardless of alignment.
> > > > > +   - LDP_POLICY_NEVER: Do not emit ldp.  */
> > > > > +
> > > > > +  enum aarch64_ldp_policy_model
> > > > > +  {
> > > > > +    LDP_POLICY_ALIGNED,
> > > > > +    LDP_POLICY_ALWAYS,
> > > > > +    LDP_POLICY_NEVER
> > > > > +  } ldp_policy_model;
> > > > > +/* An enum specifying how to handle store pairs using a fine-grained
> policy:
> > > > > +   - STP_POLICY_ALIGNED: Emit stp if the source pointer is aligned
> > > > > +   to at least double the alignment of the type.
> > > > > +   - STP_POLICY_ALWAYS: Emit stp regardless of alignment.
> > > > > +   - STP_POLICY_NEVER: Do not emit stp.  */
> > > > > +
> > > > > +  enum aarch64_stp_policy_model
> > > > > +  {
> > > > > +    STP_POLICY_ALIGNED,
> > > > > +    STP_POLICY_ALWAYS,
> > > > > +    STP_POLICY_NEVER
> > > > > +  } stp_policy_model;
> > > > >  };
> > > > >
> > > > >  /* Classifies an address.
> > > > > diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def
> b/gcc/config/aarch64/aarch64-tuning-flags.def
> > > > > index 52112ba7c48..774568e9106 100644
> > > > > --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> > > > > +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> > > > > @@ -30,11 +30,6 @@
> > > > >
> > > > >  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs",
> RENAME_FMA_REGS)
> > > > >
> > > > > -/* Don't create non-8 byte aligned load/store pair.  That is if the
> > > > > -two load/stores are not at least 8 byte aligned don't create load/store
> > > > > -pairs.   */
> > > > > -AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw",
> SLOW_UNALIGNED_LDPW)
> > > > > -
> > > > >  /* Some of the optional shift to some arthematic instructions are
> > > > >     considered cheap.  Logical shift left <=4 with or without a
> > > > >     zero extend are considered cheap.  Sign extend; non logical shift left
> > > > > @@ -44,9 +39,6 @@ AARCH64_EXTRA_TUNING_OPTION
> ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
> > > > >  /* Disallow load/store pair instructions on Q-registers.  */
> > > > >  AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs",
> NO_LDP_STP_QREGS)
> > > > >
> > > > > -/* Disallow load-pair instructions to be formed in
> combine/peephole.  */
> > > > > -AARCH64_EXTRA_TUNING_OPTION ("no_ldp_combine",
> NO_LDP_COMBINE)
> > > > > -
> > > > >  AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs",
> RENAME_LOAD_REGS)
> > > > >
> > > > >  AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants",
> CSE_SVE_VL_CONSTANTS)
> > > > > diff --git a/gcc/config/aarch64/aarch64.cc
> b/gcc/config/aarch64/aarch64.cc
> > > > > index 219c4ee6d4c..9eeb5469cf9 100644
> > > > > --- a/gcc/config/aarch64/aarch64.cc
> > > > > +++ b/gcc/config/aarch64/aarch64.cc
> > > > > @@ -1357,7 +1357,9 @@ static const struct tune_params
> generic_tunings =
> > > > >       Neoverse V1.  It does not have a noticeable effect on A64FX and
> should
> > > > >       have at most a very minor effect on SVE2 cores.  */
> > > > >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS),   /*
> tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params cortexa35_tunings =
> > > > > @@ -1391,7 +1393,9 @@ static const struct tune_params
> cortexa35_tunings =
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /*
> autoprefetcher_model.  */
> > > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params cortexa53_tunings =
> > > > > @@ -1425,7 +1429,9 @@ static const struct tune_params
> cortexa53_tunings =
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /*
> autoprefetcher_model.  */
> > > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params cortexa57_tunings =
> > > > > @@ -1459,7 +1465,9 @@ static const struct tune_params
> cortexa57_tunings =
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /*
> autoprefetcher_model.  */
> > > > >    (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),        /*
> tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params cortexa72_tunings =
> > > > > @@ -1493,7 +1501,9 @@ static const struct tune_params
> cortexa72_tunings =
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /*
> autoprefetcher_model.  */
> > > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params cortexa73_tunings =
> > > > > @@ -1527,7 +1537,9 @@ static const struct tune_params
> cortexa73_tunings =
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /*
> autoprefetcher_model.  */
> > > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >
> > > > > @@ -1562,7 +1574,9 @@ static const struct tune_params
> exynosm1_tunings =
> > > > >    48,  /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK, /*
> autoprefetcher_model.  */
> > > > >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> > > > > -  &exynosm1_prefetch_tune
> > > > > +  &exynosm1_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params thunderxt88_tunings =
> > > > > @@ -1594,8 +1608,10 @@ static const struct tune_params
> thunderxt88_tunings =
> > > > >    2,   /* min_div_recip_mul_df.  */
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_OFF,     /*
> autoprefetcher_model.  */
> > > > > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),    /*
> tune_flags.  */
> > > > > -  &thunderxt88_prefetch_tune
> > > > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > +  &thunderxt88_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params thunderx_tunings =
> > > > > @@ -1627,9 +1643,10 @@ static const struct tune_params
> thunderx_tunings =
> > > > >    2,   /* min_div_recip_mul_df.  */
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_OFF,     /*
> autoprefetcher_model.  */
> > > > > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW
> > > > > -   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /*
> tune_flags.  */
> > > > > -  &thunderx_prefetch_tune
> > > > > +  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /*
> tune_flags.  */
> > > > > +  &thunderx_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params tsv110_tunings =
> > > > > @@ -1663,7 +1680,9 @@ static const struct tune_params
> tsv110_tunings =
> > > > >    0,    /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,     /*
> autoprefetcher_model.  */
> > > > >    (AARCH64_EXTRA_TUNE_NONE),     /* tune_flags.  */
> > > > > -  &tsv110_prefetch_tune
> > > > > +  &tsv110_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params xgene1_tunings =
> > > > > @@ -1696,7 +1715,9 @@ static const struct tune_params
> xgene1_tunings =
> > > > >    17,  /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_OFF,     /*
> autoprefetcher_model.  */
> > > > >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> > > > > -  &xgene1_prefetch_tune
> > > > > +  &xgene1_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params emag_tunings =
> > > > > @@ -1729,7 +1750,9 @@ static const struct tune_params
> emag_tunings =
> > > > >    17,  /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_OFF,     /*
> autoprefetcher_model.  */
> > > > >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> > > > > -  &xgene1_prefetch_tune
> > > > > +  &xgene1_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params qdf24xx_tunings =
> > > > > @@ -1763,7 +1786,9 @@ static const struct tune_params
> qdf24xx_tunings =
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /*
> autoprefetcher_model.  */
> > > > >    AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS, /* tune_flags.  */
> > > > > -  &qdf24xx_prefetch_tune
> > > > > +  &qdf24xx_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  /* Tuning structure for the Qualcomm Saphira core.  Default to falkor
> values
> > > > > @@ -1799,7 +1824,9 @@ static const struct tune_params
> saphira_tunings =
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /*
> autoprefetcher_model.  */
> > > > >    (AARCH64_EXTRA_TUNE_NONE),           /* tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params thunderx2t99_tunings =
> > > > > @@ -1833,7 +1860,9 @@ static const struct tune_params
> thunderx2t99_tunings =
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /*
> autoprefetcher_model.  */
> > > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > -  &thunderx2t99_prefetch_tune
> > > > > +  &thunderx2t99_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params thunderx3t110_tunings =
> > > > > @@ -1867,7 +1896,9 @@ static const struct tune_params
> thunderx3t110_tunings =
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /*
> autoprefetcher_model.  */
> > > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > -  &thunderx3t110_prefetch_tune
> > > > > +  &thunderx3t110_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params neoversen1_tunings =
> > > > > @@ -1900,7 +1931,9 @@ static const struct tune_params
> neoversen1_tunings =
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /*
> autoprefetcher_model.  */
> > > > >    (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /*
> tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params ampere1_tunings =
> > > > > @@ -1936,8 +1969,10 @@ static const struct tune_params
> ampere1_tunings =
> > > > >    2,   /* min_div_recip_mul_df.  */
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /*
> autoprefetcher_model.  */
> > > > > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> > > > > -  &ampere1_prefetch_tune
> > > > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > +  &ampere1_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params ampere1a_tunings =
> > > > > @@ -1974,8 +2009,10 @@ static const struct tune_params
> ampere1a_tunings =
> > > > >    2,   /* min_div_recip_mul_df.  */
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /*
> autoprefetcher_model.  */
> > > > > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> > > > > -  &ampere1_prefetch_tune
> > > > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > +  &ampere1_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const advsimd_vec_cost neoversev1_advsimd_vector_cost =
> > > > > @@ -2156,7 +2193,9 @@ static const struct tune_params
> neoversev1_tunings =
> > > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> > > > >     | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /*
> tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const sve_vec_cost neoverse512tvb_sve_vector_cost =
> > > > > @@ -2293,7 +2332,9 @@ static const struct tune_params
> neoverse512tvb_tunings =
> > > > >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /*
> tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const advsimd_vec_cost neoversen2_advsimd_vector_cost =
> > > > > @@ -2483,7 +2524,9 @@ static const struct tune_params
> neoversen2_tunings =
> > > > >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /*
> tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const advsimd_vec_cost neoversev2_advsimd_vector_cost =
> > > > > @@ -2673,7 +2716,9 @@ static const struct tune_params
> neoversev2_tunings =
> > > > >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /*
> tune_flags.  */
> > > > > -  &generic_prefetch_tune
> > > > > +  &generic_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  static const struct tune_params a64fx_tunings =
> > > > > @@ -2706,7 +2751,9 @@ static const struct tune_params
> a64fx_tunings =
> > > > >    0,   /* max_case_values.  */
> > > > >    tune_params::AUTOPREFETCHER_WEAK,    /*
> autoprefetcher_model.  */
> > > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > -  &a64fx_prefetch_tune
> > > > > +  &a64fx_prefetch_tune,
> > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > >  };
> > > > >
> > > > >  /* Support for fine-grained override of the tuning structures.  */
> > > > > @@ -17819,6 +17866,34 @@ aarch64_parse_tune (const char
> *to_parse, const struct processor **res)
> > > > >    return AARCH_PARSE_INVALID_ARG;
> > > > >  }
> > > > >
> > > > > +/* Parse a command-line -param=aarch64-ldp-policy=
> parameter.  VALUE is
> > > > > +   the value of the parameter.  */
> > > > > +
> > > > > +static void
> > > > > +aarch64_parse_ldp_policy (const unsigned int value, struct
> tune_params* tune)
> > > > > +{
> > > > > +  if (value == 1)
> > > > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALWAYS;
> > > > > +  else if (value == 2)
> > > > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_NEVER;
> > > > > +  else if (value == 3)
> > > > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALIGNED;
> > > > > +}
> > > > > +
> > > > > +/* Parse a command-line -param=aarch64-stp-policy=
> parameter.  VALUE is
> > > > > +   the value of the parameter.  */
> > > > > +
> > > > > +static void
> > > > > +aarch64_parse_stp_policy (const unsigned int value, struct
> tune_params* tune)
> > > > > +{
> > > > > +  if (value == 1)
> > > > > +    tune->stp_policy_model = tune_params::STP_POLICY_ALWAYS;
> > > > > +  else if (value == 2)
> > > > > +    tune->stp_policy_model = tune_params::STP_POLICY_NEVER;
> > > > > +  else if (value == 3)
> > > > > +    tune->stp_policy_model = tune_params::STP_POLICY_ALIGNED;
> > > > > +}
> > > > > +
> > > > >  /* Parse TOKEN, which has length LENGTH to see if it is an option
> > > > >     described in FLAG.  If it is, return the index bit for that fusion type.
> > > > >     If not, error (printing OPTION_NAME) and return zero.  */
> > > > > @@ -18167,6 +18242,12 @@ aarch64_override_options_internal
> (struct gcc_options *opts)
> > > > >      aarch64_parse_override_string (opts-
> >x_aarch64_override_tune_string,
> > > > >                                    &aarch64_tune_params);
> > > > >
> > > > > +  aarch64_parse_ldp_policy (aarch64_ldp_policy,
> > > > > +                             &aarch64_tune_params);
> > > > > +
> > > > > +  aarch64_parse_stp_policy (aarch64_stp_policy,
> > > > > +                             &aarch64_tune_params);
> > > > > +
> > > > >    /* This target defaults to strict volatile bitfields.  */
> > > > >    if (opts->x_flag_strict_volatile_bitfields < 0 && abi_version_at_least
> (2))
> > > > >      opts->x_flag_strict_volatile_bitfields = 1;
> > > > > @@ -26468,18 +26549,14 @@ aarch64_operands_ok_for_ldpstp (rtx
> *operands, bool load,
> > > > >    enum reg_class rclass_1, rclass_2;
> > > > >    rtx mem_1, mem_2, reg_1, reg_2;
> > > > >
> > > > > -  /* Allow the tuning structure to disable LDP instruction formation
> > > > > -     from combining instructions (e.g., in peephole2).
> > > > > -     TODO: Implement fine-grained tuning control for LDP and STP:
> > > > > -          1. control policies for load and store separately;
> > > > > -          2. support the following policies:
> > > > > -             - default (use what is in the tuning structure)
> > > > > -             - always
> > > > > -             - never
> > > > > -             - aligned (only if the compiler can prove that the
> > > > > -               load will be aligned to 2 * element_size)  */
> > > > > -  if (load && (aarch64_tune_params.extra_tuning_flags
> > > > > -              & AARCH64_EXTRA_TUNE_NO_LDP_COMBINE))
> > > > > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> > > > > +  if (load
> > > > > +      && aarch64_tune_params.ldp_policy_model ==
> tune_params::LDP_POLICY_NEVER)
> > > > > +    return false;
> > > > > +
> > > > > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> > > > > +  if (!load
> > > > > +      && aarch64_tune_params.stp_policy_model ==
> tune_params::STP_POLICY_NEVER)
> > > > >      return false;
> > > > >
> > > > >    if (load)
> > > > > @@ -26506,13 +26583,22 @@ aarch64_operands_ok_for_ldpstp (rtx
> *operands, bool load,
> > > > >    if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
> > > > >      return false;
> > > > >
> > > > > -  /* If we have SImode and slow unaligned ldp,
> > > > > -     check the alignment to be at least 8 byte. */
> > > > > -  if (mode == SImode
> > > > > -      && (aarch64_tune_params.extra_tuning_flags
> > > > > -          & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> > > > > -      && !optimize_size
> > > > > -      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
> > > > > +  /* If we have LDP_POLICY_ALIGNED,
> > > > > +     do not emit the load pair unless the alignment is checked to be
> > > > > +     at least double the alignment of the type.  */
> > > > > +  if (load
> > > > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> > > > > +      && !optimize_function_for_size_p (cfun)
> > > > > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> > > > > +    return false;
> > > > > +
> > > > > +  /* If we have STP_POLICY_ALIGNED,
> > > > > +     do not emit the store pair unless the alignment is checked to be
> > > > > +     at least double the alignment of the type.  */
> > > > > +  if (!load
> > > > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> > > > > +      && !optimize_function_for_size_p (cfun)
> > > > > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> > > > >      return false;
> > > > >
> > > > >    /* Check if the addresses are in the form of [base+offset].  */
> > > > > @@ -26640,6 +26726,16 @@
> aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
> > > > >    HOST_WIDE_INT offvals[num_insns], msize;
> > > > >    rtx mem[num_insns], reg[num_insns], base[num_insns],
> offset[num_insns];
> > > > >
> > > > > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> > > > > +  if (load
> > > > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> > > > > +    return false;
> > > > > +
> > > > > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> > > > > +  if (!load
> > > > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
> > > > > +    return false;
> > > > > +
> > > > >    if (load)
> > > > >      {
> > > > >        for (int i = 0; i < num_insns; i++)
> > > > > @@ -26729,13 +26825,22 @@
> aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
> > > > >    if (offvals[0] % msize != offvals[2] % msize)
> > > > >      return false;
> > > > >
> > > > > -  /* If we have SImode and slow unaligned ldp,
> > > > > -     check the alignment to be at least 8 byte. */
> > > > > -  if (mode == SImode
> > > > > -      && (aarch64_tune_params.extra_tuning_flags
> > > > > -         & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> > > > > -      && !optimize_size
> > > > > -      && MEM_ALIGN (mem[0]) < 8 * BITS_PER_UNIT)
> > > > > +  /* If we have LDP_POLICY_ALIGNED,
> > > > > +     do not emit the load pair unless the alignment is checked to be
> > > > > +     at least double the alignment of the type.  */
> > > > > +  if (load
> > > > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> > > > > > +      && !optimize_function_for_size_p (cfun)
> > > > > > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> > > > > +    return false;
> > > > > +
> > > > > +  /* If we have STP_POLICY_ALIGNED,
> > > > > +     do not emit the store pair unless the alignment is checked to be
> > > > > +     at least double the alignment of the type.  */
> > > > > +  if (!load
> > > > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> > > > > > +      && !optimize_function_for_size_p (cfun)
> > > > > > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> > > > >      return false;
> > > > >
> > > > >    return true;
> > > > > > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> > > > > index 4a0580435a8..f61e3f968d4 100644
> > > > > --- a/gcc/config/aarch64/aarch64.opt
> > > > > +++ b/gcc/config/aarch64/aarch64.opt
> > > > > @@ -337,3 +337,11 @@ Constant memset size in bytes from which to
> start using MOPS sequence.
> > > > >  -param=aarch64-vect-unroll-limit=
> > > > >  Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param
> > > > >  Limit how much the autovectorizer may unroll a loop.
> > > > > +
> > > > > +-param=aarch64-ldp-policy=
> > > > > > +Target Joined UInteger Var(aarch64_ldp_policy) Init(0) IntegerRange(0, 3) Param
> > > > > +Fine-grained policy for load pairs.
> > > > > +
> > > > > +-param=aarch64-stp-policy=
> > > > > > +Target Joined UInteger Var(aarch64_stp_policy) Init(0) IntegerRange(0, 3) Param
> > > > > +Fine-grained policy for store pairs.
> > > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > > > index 146b40414b0..000dd5541f4 100644
> > > > > --- a/gcc/doc/invoke.texi
> > > > > +++ b/gcc/doc/invoke.texi
> > > > > @@ -16508,6 +16508,36 @@ Use both Advanced SIMD and
> SVE.  Prefer SVE when the costs are deemed equal.
> > > > >  @end table
> > > > >  The default value is 0.
> > > > >
> > > > > +@item aarch64-ldp-policy
> > > > > > +Fine-grained policy for load pairs.  Accepts values from 0 to 3, inclusive.
> > > > > +@table @samp
> > > > > +@item 0
> > > > > +Use the policy of the tuning structure.
> > > > > +@item 1
> > > > > +Emit ldp regardless of alignment.
> > > > > +@item 2
> > > > > +Do not emit ldp.
> > > > > +@item 3
> > > > > > +Emit ldp only if the source pointer is aligned to at least double the alignment
> > > > > +of the type.
> > > > > +@end table
> > > > > +The default value is 0.
> > > > > +
> > > > > +@item aarch64-stp-policy
> > > > > > +Fine-grained policy for store pairs.  Accepts values from 0 to 3, inclusive.
> > > > > +@table @samp
> > > > > +@item 0
> > > > > +Use the policy of the tuning structure.
> > > > > +@item 1
> > > > > +Emit stp regardless of alignment.
> > > > > +@item 2
> > > > > +Do not emit stp.
> > > > > +@item 3
> > > > > > +Emit stp only if the source pointer is aligned to at least double the alignment
> > > > > +of the type.
> > > > > +@end table
> > > > > +The default value is 0.
> > > > > +
> > > > >  @item aarch64-loop-vect-issue-rate-niters
> > > > > >  The tuning for some AArch64 CPUs tries to take both latencies and issue
> > > > >  rates into account when deciding whether a loop should be vectorized
> > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c b/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > > > > deleted file mode 100644
> > > > > index bc871f4481d..00000000000
> > > > > --- a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > > > > +++ /dev/null
> > > > > @@ -1,11 +0,0 @@
> > > > > -/* { dg-options "-O3 -mtune=ampere1" } */
> > > > > -
> > > > > -long
> > > > > -foo (long a[])
> > > > > -{
> > > > > -  return a[0] + a[1];
> > > > > -}
> > > > > -
> > > > > -/* We should see two ldrs instead of one ldp. */
> > > > > -/* { dg-final { scan-assembler {\tldr\t} } } */
> > > > > -/* { dg-final { scan-assembler-not {\tldp\t} } } */
> > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > > > > new file mode 100644
> > > > > index 00000000000..8e43faab70d
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > > > > @@ -0,0 +1,66 @@
> > > > > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=3 -mcpu=generic" } */
> > > > > +
> > > > > +#include <stdlib.h>
> > > > > +#include <stdint.h>
> > > > > +
> > > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > > +
> > > > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1; \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    a_0 = arr[0]; \
> > > > > +    a_1 = arr[1]; \
> > > > > +    return a_0 + a_1; \
> > > > > +}
> > > > > +
> > > > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1; \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a_0 = a[0]; \
> > > > > +    a_1 = a[1]; \
> > > > > +    return a_0 + a_1; \
> > > > > +}
> > > > > +
> > > > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    a_0 = arr[100]; \
> > > > > +    a_1 = arr[101]; \
> > > > > +    a_2 = arr[102]; \
> > > > > +    a_3 = arr[103]; \
> > > > > +    a_4 = arr[110]; \
> > > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > > +}
> > > > > +
> > > > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a_0 = a[100]; \
> > > > > +    a_1 = a[101]; \
> > > > > +    a_2 = a[102]; \
> > > > > +    a_3 = a[103]; \
> > > > > +    a_4 = a[110]; \
> > > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > > +}
> > > > > +
> > > > > +LDP_TEST_ALIGNED(int32_t);
> > > > > +LDP_TEST_ALIGNED(int64_t);
> > > > > +LDP_TEST_ALIGNED(v4si);
> > > > > +LDP_TEST_UNALIGNED(int32_t);
> > > > > +LDP_TEST_UNALIGNED(int64_t);
> > > > > +LDP_TEST_UNALIGNED(v4si);
> > > > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > > +
> > > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > > > > +
> > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_always.c b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > > > > new file mode 100644
> > > > > index 00000000000..532ca607565
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > > > > @@ -0,0 +1,66 @@
> > > > > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=1 -mcpu=generic" } */
> > > > > +
> > > > > +#include <stdlib.h>
> > > > > +#include <stdint.h>
> > > > > +
> > > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > > +
> > > > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1; \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    a_0 = arr[0]; \
> > > > > +    a_1 = arr[1]; \
> > > > > +    return a_0 + a_1; \
> > > > > +}
> > > > > +
> > > > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1; \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a_0 = a[0]; \
> > > > > +    a_1 = a[1]; \
> > > > > +    return a_0 + a_1; \
> > > > > +}
> > > > > +
> > > > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    a_0 = arr[100]; \
> > > > > +    a_1 = arr[101]; \
> > > > > +    a_2 = arr[102]; \
> > > > > +    a_3 = arr[103]; \
> > > > > +    a_4 = arr[110]; \
> > > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > > +}
> > > > > +
> > > > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a_0 = a[100]; \
> > > > > +    a_1 = a[101]; \
> > > > > +    a_2 = a[102]; \
> > > > > +    a_3 = a[103]; \
> > > > > +    a_4 = a[110]; \
> > > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > > +}
> > > > > +
> > > > > +LDP_TEST_ALIGNED(int32_t);
> > > > > +LDP_TEST_ALIGNED(int64_t);
> > > > > +LDP_TEST_ALIGNED(v4si);
> > > > > +LDP_TEST_UNALIGNED(int32_t);
> > > > > +LDP_TEST_UNALIGNED(int64_t);
> > > > > +LDP_TEST_UNALIGNED(v4si);
> > > > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > > +
> > > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> > > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> > > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> > > > > +
> > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_never.c b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > > > > new file mode 100644
> > > > > index 00000000000..b39941c18d7
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > > > > @@ -0,0 +1,66 @@
> > > > > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=2 -mcpu=generic" } */
> > > > > +
> > > > > +#include <stdlib.h>
> > > > > +#include <stdint.h>
> > > > > +
> > > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > > +
> > > > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1; \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    a_0 = arr[0]; \
> > > > > +    a_1 = arr[1]; \
> > > > > +    return a_0 + a_1; \
> > > > > +}
> > > > > +
> > > > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1; \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a_0 = a[0]; \
> > > > > +    a_1 = a[1]; \
> > > > > +    return a_0 + a_1; \
> > > > > +}
> > > > > +
> > > > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    a_0 = arr[100]; \
> > > > > +    a_1 = arr[101]; \
> > > > > +    a_2 = arr[102]; \
> > > > > +    a_3 = arr[103]; \
> > > > > +    a_4 = arr[110]; \
> > > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > > +}
> > > > > +
> > > > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a_0 = a[100]; \
> > > > > +    a_1 = a[101]; \
> > > > > +    a_2 = a[102]; \
> > > > > +    a_3 = a[103]; \
> > > > > +    a_4 = a[110]; \
> > > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > > +}
> > > > > +
> > > > > +LDP_TEST_ALIGNED(int32_t);
> > > > > +LDP_TEST_ALIGNED(int64_t);
> > > > > +LDP_TEST_ALIGNED(v4si);
> > > > > +LDP_TEST_UNALIGNED(int32_t);
> > > > > +LDP_TEST_UNALIGNED(int64_t);
> > > > > +LDP_TEST_UNALIGNED(v4si);
> > > > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > > +
> > > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > > > > +
> > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_aligned.c b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > > > > new file mode 100644
> > > > > index 00000000000..01f294bb090
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > > > > @@ -0,0 +1,60 @@
> > > > > > +/* { dg-options "-O2 --param=aarch64-stp-policy=3 -mcpu=generic" } */
> > > > > +
> > > > > +#include <stdlib.h>
> > > > > +#include <stdint.h>
> > > > > +
> > > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > > +
> > > > > +#define STP_TEST_ALIGNED(TYPE) \
> > > > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    arr[0] = x; \
> > > > > +    arr[1] = x; \
> > > > > +    return arr; \
> > > > > +}
> > > > > +
> > > > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a[0] = x; \
> > > > > +    a[1] = x; \
> > > > > +    return a; \
> > > > > +}
> > > > > +
> > > > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    arr[100] = x; \
> > > > > +    arr[101] = x; \
> > > > > +    arr[102] = x; \
> > > > > +    arr[103] = x; \
> > > > > +    return arr; \
> > > > > +}
> > > > > +
> > > > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a[100] = x; \
> > > > > +    a[101] = x; \
> > > > > +    a[102] = x; \
> > > > > +    a[103] = x; \
> > > > > +    return a; \
> > > > > +}
> > > > > +
> > > > > +STP_TEST_ALIGNED(int32_t);
> > > > > +STP_TEST_ALIGNED(int64_t);
> > > > > +STP_TEST_ALIGNED(v4si);
> > > > > +STP_TEST_UNALIGNED(int32_t);
> > > > > +STP_TEST_UNALIGNED(int64_t);
> > > > > +STP_TEST_UNALIGNED(v4si);
> > > > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > > +
> > > > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > > > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > > > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > > > > +
> > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_always.c b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> > > > > new file mode 100644
> > > > > index 00000000000..cedb461b5b2
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> > > > > @@ -0,0 +1,60 @@
> > > > > > +/* { dg-options "-O2 --param=aarch64-stp-policy=1 -mcpu=generic" } */
> > > > > +
> > > > > +#include <stdlib.h>
> > > > > +#include <stdint.h>
> > > > > +
> > > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > > +
> > > > > +#define STP_TEST_ALIGNED(TYPE) \
> > > > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    arr[0] = x; \
> > > > > +    arr[1] = x; \
> > > > > +    return arr; \
> > > > > +}
> > > > > +
> > > > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a[0] = x; \
> > > > > +    a[1] = x; \
> > > > > +    return a; \
> > > > > +}
> > > > > +
> > > > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    arr[100] = x; \
> > > > > +    arr[101] = x; \
> > > > > +    arr[102] = x; \
> > > > > +    arr[103] = x; \
> > > > > +    return arr; \
> > > > > +}
> > > > > +
> > > > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a[100] = x; \
> > > > > +    a[101] = x; \
> > > > > +    a[102] = x; \
> > > > > +    a[103] = x; \
> > > > > +    return a; \
> > > > > +}
> > > > > +
> > > > > +STP_TEST_ALIGNED(int32_t);
> > > > > +STP_TEST_ALIGNED(int64_t);
> > > > > +STP_TEST_ALIGNED(v4si);
> > > > > +STP_TEST_UNALIGNED(int32_t);
> > > > > +STP_TEST_UNALIGNED(int64_t);
> > > > > +STP_TEST_UNALIGNED(v4si);
> > > > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > > +
> > > > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> > > > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> > > > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> > > > > +
> > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_never.c b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> > > > > new file mode 100644
> > > > > index 00000000000..ddde658f807
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> > > > > @@ -0,0 +1,60 @@
> > > > > > +/* { dg-options "-O2 --param=aarch64-stp-policy=2 -mcpu=generic" } */
> > > > > +
> > > > > +#include <stdlib.h>
> > > > > +#include <stdint.h>
> > > > > +
> > > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > > +
> > > > > +#define STP_TEST_ALIGNED(TYPE) \
> > > > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    arr[0] = x; \
> > > > > +    arr[1] = x; \
> > > > > +    return arr; \
> > > > > +}
> > > > > +
> > > > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a[0] = x; \
> > > > > +    a[1] = x; \
> > > > > +    return a; \
> > > > > +}
> > > > > +
> > > > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    arr[100] = x; \
> > > > > +    arr[101] = x; \
> > > > > +    arr[102] = x; \
> > > > > +    arr[103] = x; \
> > > > > +    return arr; \
> > > > > +}
> > > > > +
> > > > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > +    TYPE *a = arr+1; \
> > > > > +    a[100] = x; \
> > > > > +    a[101] = x; \
> > > > > +    a[102] = x; \
> > > > > +    a[103] = x; \
> > > > > +    return a; \
> > > > > +}
> > > > > +
> > > > > +STP_TEST_ALIGNED(int32_t);
> > > > > +STP_TEST_ALIGNED(int64_t);
> > > > > +STP_TEST_ALIGNED(v4si);
> > > > > +STP_TEST_UNALIGNED(int32_t);
> > > > > +STP_TEST_UNALIGNED(int64_t);
> > > > > +STP_TEST_UNALIGNED(v4si);
> > > > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > > +
> > > > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > > > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > > > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > > > > +
> > > > > --
> > > > > 2.40.1
> > > > >
> 
> 
> 
> --
> Manos Anagnostakis | Compiler Engineer |
> E: mailto:manos.anagnostakis@vrull.eu
> 
> VRULL GmbH | Beatrixgasse 32 1030 Vienna |
>  W: http://www.vrull.eu/ | https://www.linkedin.com/company/vrull/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp formation.
  2023-09-26  8:50             ` Kyrylo Tkachov
@ 2023-09-26 11:51               ` Manos Anagnostakis
  0 siblings, 0 replies; 12+ messages in thread
From: Manos Anagnostakis @ 2023-09-26 11:51 UTC (permalink / raw)
  To: gcc-patches; +Cc: Philipp Tomsich, Kyrylo Tkachov, tamar.christina

[-- Attachment #1: Type: text/plain, Size: 58936 bytes --]

Thank you for the feedback Kyrill and Tamar,

Soon, I'll follow up with a V4 containing the separate function.

Manos.

On Tue, Sep 26, 2023 at 11:50 AM Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
wrote:

>
>
> > -----Original Message-----
> > From: Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
> > Sent: Tuesday, September 26, 2023 9:36 AM
> > To: Manos Anagnostakis <manos.anagnostakis@vrull.eu>; gcc-patches@gcc.gnu.org
> > Cc: Philipp Tomsich <philipp.tomsich@vrull.eu>; Andrew Pinski
> > <pinskia@gmail.com>
> > Subject: RE: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp
> > formation.
> >
> > Hi Manos,
> >
> > Thank you for the quick turnaround, please post the patch that uses a
> > --param with an enum. I think that's the direction we should be going with
> > this patch.
>
> Ah, and please address Tamar's feedback from
> https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631343.html
> Thanks,
> Kyrill
>
> >
> > From: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> > Sent: Tuesday, September 26, 2023 7:06 AM
> > To: gcc-patches@gcc.gnu.org
> > Cc: Philipp Tomsich <philipp.tomsich@vrull.eu>; Kyrylo Tkachov
> > <Kyrylo.Tkachov@arm.com>; Andrew Pinski <pinskia@gmail.com>
> > Subject: Re: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp
> > formation.
> >
> > Thank you Andrew for the input.
> >
> > I've prepared a patch using --param with enum, which seems a more suitable
> > approach to me as strings are more descriptive as well.
> >
> > The current patch needed an adjustment on how to call the parsing functions
> > to match the compiler coding style.
> >
> > Both are bootstrapped and regtested.
> >
> > I can send a V4 of whichever is preferred.
> >
> > Thanks!
> >
> > Manos.
> >
> > On Mon, Sep 25, 2023 at 11:57 PM Andrew Pinski
> > <mailto:pinskia@gmail.com> wrote:
> > On Mon, Sep 25, 2023 at 1:04 PM Andrew Pinski <mailto:pinskia@gmail.com>
> > wrote:
> > >
> > > On Mon, Sep 25, 2023 at 12:59 PM Philipp Tomsich
> > > <mailto:philipp.tomsich@vrull.eu> wrote:
> > > >
> > > > On Mon, 25 Sept 2023 at 21:54, Andrew Pinski
> > <mailto:pinskia@gmail.com> wrote:
> > > > >
> > > > > On Mon, Sep 25, 2023 at 12:50 PM Manos Anagnostakis
> > > > > <mailto:manos.anagnostakis@vrull.eu> wrote:
> > > > > >
> > > > > > > This patch implements the following TODO in gcc/config/aarch64/aarch64.cc
> > > > > > > to provide the requested behaviour for handling ldp and stp:
> > > > > >
> > > > > > >   /* Allow the tuning structure to disable LDP instruction formation
> > > > > >      from combining instructions (e.g., in peephole2).
> > > > > >      TODO: Implement fine-grained tuning control for LDP and STP:
> > > > > >            1. control policies for load and store separately;
> > > > > >            2. support the following policies:
> > > > > >               - default (use what is in the tuning structure)
> > > > > >               - always
> > > > > >               - never
> > > > > >               - aligned (only if the compiler can prove that the
> > > > > >                 load will be aligned to 2 * element_size)  */
> > > > > >
> > > > > > > It provides two new and concrete target-specific command-line parameters
> > > > > > > -param=aarch64-ldp-policy= and -param=aarch64-stp-policy=
> > > > > > > to give the ability to control load and store policies separately as
> > > > > > > stated in part 1 of the TODO.
> > > > > >
> > > > > > The accepted values for both parameters are:
> > > > > > - 0: Use the policy of the tuning structure (default).
> > > > > > - 1: Emit ldp/stp regardless of alignment.
> > > > > > - 2: Do not emit ldp/stp.
> > > > > > - 3: In order to emit ldp/stp, first check if the load/store will
> > > > > >   be aligned to 2 * element_size.
> > > > >
> > > > > > Instead of a number, does it make sense to use a string
> > > > > > (ENUM) for this param?
> > > > > Also I think using --param is a bad idea if it is going to be
> > > > > documented in the user manual.
> > > > > Maybe a -m option should be used instead.
> > > >
> > > > > See https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631283.html
> > > > > for the discussion triggering the change from -m... to --param and the
> > >
> > > That is the opposite of the current GCC practice across all targets.
> > > Things like this should be consistent and if one target decides to do
> > > it different, then maybe it should NOT.
> > > Anyways we should document the correct coding style for options so we
> > > don't have these back and forths again.
> >
> > Kyrylo:
> > >  It will have to take a number rather than a string but that should be okay, as
> > >  long as the right values are documented in invoke.texi.
> >
> > No, it does not need to be a number. --param=ranger-debug= does not
> > take a number; it takes an enum.
> > One of the benefits of moving --param support over to .opt is that it
> > allows more than just numbers.
> >
> > Thanks,
> > Andrew
> >
> >
> > >
> > >
> > > Thanks,
> > > Andrew
> > >
> > > >
> > > > Thanks,
> > > > Philipp.
> > > >
> > > > >
> > > > > Thanks,
> > > > > Andrew
> > > > >
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > > >         * config/aarch64/aarch64-protos.h (struct tune_params): Add
> > > > > > >         appropriate enums for the policies.
> > > > > > >         * config/aarch64/aarch64-tuning-flags.def
> > > > > > >         (AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
> > > > > > >         options.
> > > > > > >         * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
> > > > > > >         function to parse ldp-policy parameter.
> > > > > > >         (aarch64_parse_stp_policy): New function to parse stp-policy parameter.
> > > > > > >         (aarch64_override_options_internal): Call parsing functions.
> > > > > > >         (aarch64_operands_ok_for_ldpstp): Add parameter-value check and
> > > > > > >         alignment check and remove superseded ones.
> > > > > > >         (aarch64_operands_adjust_ok_for_ldpstp): Add parameter-value check and
> > > > > > >         alignment check and remove superseded ones.
> > > > > > >         * config/aarch64/aarch64.opt: Add options.
> > > > > > >         * doc/invoke.texi: Document the parameters accordingly.
> > > > > >
> > > > > > gcc/testsuite/ChangeLog:
> > > > > >         * gcc.target/aarch64/ampere1-no_ldp_combine.c: Removed.
> > > > > >         * gcc.target/aarch64/ldp_aligned.c: New test.
> > > > > >         * gcc.target/aarch64/ldp_always.c: New test.
> > > > > >         * gcc.target/aarch64/ldp_never.c: New test.
> > > > > >         * gcc.target/aarch64/stp_aligned.c: New test.
> > > > > >         * gcc.target/aarch64/stp_always.c: New test.
> > > > > >         * gcc.target/aarch64/stp_never.c: New test.
> > > > > >
> > > > > > > Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> > > > > > ---
> > > > > > Changes in v3:
> > > > > > >         - Changed command-line options to target-specific parameters
> > > > > > >           and documented them accordingly in doc/invoke.texi.
> > > > > >         - Removed ampere1-no_ldp_combine.c test as superseded.
> > > > > >
> > > > > >  gcc/config/aarch64/aarch64-protos.h           |  24 ++
> > > > > >  gcc/config/aarch64/aarch64-tuning-flags.def   |   8 -
> > > > > > >  gcc/config/aarch64/aarch64.cc                 | 215 +++++++++++++-----
> > > > > >  gcc/config/aarch64/aarch64.opt                |   8 +
> > > > > >  gcc/doc/invoke.texi                           |  30 +++
> > > > > >  .../aarch64/ampere1-no_ldp_combine.c          |  11 -
> > > > > >  .../gcc.target/aarch64/ldp_aligned.c          |  66 ++++++
> > > > > >  gcc/testsuite/gcc.target/aarch64/ldp_always.c |  66 ++++++
> > > > > >  gcc/testsuite/gcc.target/aarch64/ldp_never.c  |  66 ++++++
> > > > > >  .../gcc.target/aarch64/stp_aligned.c          |  60 +++++
> > > > > >  gcc/testsuite/gcc.target/aarch64/stp_always.c |  60 +++++
> > > > > >  gcc/testsuite/gcc.target/aarch64/stp_never.c  |  60 +++++
> > > > > >  12 files changed, 600 insertions(+), 74 deletions(-)
> > > > > >  delete mode 100644 gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > > > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > > > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > > > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > > > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > > > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_always.c
> > > > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_never.c
> > > > > >
> > > > > > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> > > > > > index 70303d6fd95..be1d73490ed 100644
> > > > > > --- a/gcc/config/aarch64/aarch64-protos.h
> > > > > > +++ b/gcc/config/aarch64/aarch64-protos.h
> > > > > > @@ -568,6 +568,30 @@ struct tune_params
> > > > > >    /* Place prefetch struct pointer at the end to enable type checking
> > > > > >       errors when tune_params misses elements (e.g., from erroneous merges).  */
> > > > > >    const struct cpu_prefetch_tune *prefetch;
> > > > > > +/* An enum specifying how to handle load pairs using a fine-grained policy:
> > > > > > +   - LDP_POLICY_ALIGNED: Emit ldp if the source pointer is aligned
> > > > > > +   to at least double the alignment of the type.
> > > > > > +   - LDP_POLICY_ALWAYS: Emit ldp regardless of alignment.
> > > > > > +   - LDP_POLICY_NEVER: Do not emit ldp.  */
> > > > > > +
> > > > > > +  enum aarch64_ldp_policy_model
> > > > > > +  {
> > > > > > +    LDP_POLICY_ALIGNED,
> > > > > > +    LDP_POLICY_ALWAYS,
> > > > > > +    LDP_POLICY_NEVER
> > > > > > +  } ldp_policy_model;
> > > > > > +/* An enum specifying how to handle store pairs using a fine-grained policy:
> > > > > > +   - STP_POLICY_ALIGNED: Emit stp if the source pointer is aligned
> > > > > > +   to at least double the alignment of the type.
> > > > > > +   - STP_POLICY_ALWAYS: Emit stp regardless of alignment.
> > > > > > +   - STP_POLICY_NEVER: Do not emit stp.  */
> > > > > > +
> > > > > > +  enum aarch64_stp_policy_model
> > > > > > +  {
> > > > > > +    STP_POLICY_ALIGNED,
> > > > > > +    STP_POLICY_ALWAYS,
> > > > > > +    STP_POLICY_NEVER
> > > > > > +  } stp_policy_model;
> > > > > >  };
> > > > > >
> > > > > >  /* Classifies an address.
> > > > > > diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
> > > > > > index 52112ba7c48..774568e9106 100644
> > > > > > --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> > > > > > +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> > > > > > @@ -30,11 +30,6 @@
> > > > > >
> > > > > >  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
> > > > > >
> > > > > > -/* Don't create non-8 byte aligned load/store pair.  That is if the
> > > > > > -two load/stores are not at least 8 byte aligned don't create load/store
> > > > > > -pairs.   */
> > > > > > -AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)
> > > > > > -
> > > > > >  /* Some of the optional shift to some arthematic instructions are
> > > > > >     considered cheap.  Logical shift left <=4 with or without a
> > > > > >     zero extend are considered cheap.  Sign extend; non logical shift left
> > > > > > @@ -44,9 +39,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
> > > > > >  /* Disallow load/store pair instructions on Q-registers.  */
> > > > > >  AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)
> > > > > >
> > > > > > -/* Disallow load-pair instructions to be formed in combine/peephole.  */
> > > > > > -AARCH64_EXTRA_TUNING_OPTION ("no_ldp_combine", NO_LDP_COMBINE)
> > > > > > -
> > > > > >  AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
> > > > > >
> > > > > >  AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
> > > > > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > > > > > index 219c4ee6d4c..9eeb5469cf9 100644
> > > > > > --- a/gcc/config/aarch64/aarch64.cc
> > > > > > +++ b/gcc/config/aarch64/aarch64.cc
> > > > > > @@ -1357,7 +1357,9 @@ static const struct tune_params generic_tunings =
> > > > > >       Neoverse V1.  It does not have a noticeable effect on A64FX and should
> > > > > >       have at most a very minor effect on SVE2 cores.  */
> > > > > >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS),   /* tune_flags.  */
> > > > > > -  &generic_prefetch_tune
> > > > > > +  &generic_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >  static const struct tune_params cortexa35_tunings =
> > > > > > @@ -1391,7 +1393,9 @@ static const struct tune_params cortexa35_tunings =
> > > > > >    0,   /* max_case_values.  */
> > > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > > -  &generic_prefetch_tune
> > > > > > +  &generic_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >  static const struct tune_params cortexa53_tunings =
> > > > > > @@ -1425,7 +1429,9 @@ static const struct tune_params cortexa53_tunings =
> > > > > >    0,   /* max_case_values.  */
> > > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > > -  &generic_prefetch_tune
> > > > > > +  &generic_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >  static const struct tune_params cortexa57_tunings =
> > > > > > @@ -1459,7 +1465,9 @@ static const struct tune_params cortexa57_tunings =
> > > > > >    0,   /* max_case_values.  */
> > > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > > > >    (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),        /* tune_flags.  */
> > > > > > -  &generic_prefetch_tune
> > > > > > +  &generic_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >  static const struct tune_params cortexa72_tunings =
> > > > > > @@ -1493,7 +1501,9 @@ static const struct tune_params cortexa72_tunings =
> > > > > >    0,   /* max_case_values.  */
> > > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > > -  &generic_prefetch_tune
> > > > > > +  &generic_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >  static const struct tune_params cortexa73_tunings =
> > > > > > @@ -1527,7 +1537,9 @@ static const struct tune_params cortexa73_tunings =
> > > > > >    0,   /* max_case_values.  */
> > > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > > -  &generic_prefetch_tune
> > > > > > +  &generic_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >
> > > > > > @@ -1562,7 +1574,9 @@ static const struct tune_params exynosm1_tunings =
> > > > > >    48,  /* max_case_values.  */
> > > > > >    tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
> > > > > >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> > > > > > -  &exynosm1_prefetch_tune
> > > > > > +  &exynosm1_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >  static const struct tune_params thunderxt88_tunings =
> > > > > > @@ -1594,8 +1608,10 @@ static const struct tune_params thunderxt88_tunings =
> > > > > >    2,   /* min_div_recip_mul_df.  */
> > > > > >    0,   /* max_case_values.  */
> > > > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > > > > > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),    /* tune_flags.  */
> > > > > > -  &thunderxt88_prefetch_tune
> > > > > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > > +  &thunderxt88_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >  static const struct tune_params thunderx_tunings =
> > > > > > @@ -1627,9 +1643,10 @@ static const struct tune_params thunderx_tunings =
> > > > > >    2,   /* min_div_recip_mul_df.  */
> > > > > >    0,   /* max_case_values.  */
> > > > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > > > > > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW
> > > > > > -   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> > > > > > -  &thunderx_prefetch_tune
> > > > > > +  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /* tune_flags.  */
> > > > > > +  &thunderx_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >  static const struct tune_params tsv110_tunings =
> > > > > > @@ -1663,7 +1680,9 @@ static const struct tune_params tsv110_tunings =
> > > > > >    0,    /* max_case_values.  */
> > > > > >    tune_params::AUTOPREFETCHER_WEAK,     /* autoprefetcher_model.  */
> > > > > >    (AARCH64_EXTRA_TUNE_NONE),     /* tune_flags.  */
> > > > > > -  &tsv110_prefetch_tune
> > > > > > +  &tsv110_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >  static const struct tune_params xgene1_tunings =
> > > > > > @@ -1696,7 +1715,9 @@ static const struct tune_params xgene1_tunings =
> > > > > >    17,  /* max_case_values.  */
> > > > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > > > > >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> > > > > > -  &xgene1_prefetch_tune
> > > > > > +  &xgene1_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >  static const struct tune_params emag_tunings =
> > > > > > @@ -1729,7 +1750,9 @@ static const struct tune_params emag_tunings =
> > > > > >    17,  /* max_case_values.  */
> > > > > >    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> > > > > >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> > > > > > -  &xgene1_prefetch_tune
> > > > > > +  &xgene1_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >  static const struct tune_params qdf24xx_tunings =
> > > > > > @@ -1763,7 +1786,9 @@ static const struct tune_params qdf24xx_tunings =
> > > > > >    0,   /* max_case_values.  */
> > > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > > > >    AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS, /* tune_flags.  */
> > > > > > -  &qdf24xx_prefetch_tune
> > > > > > +  &qdf24xx_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >  /* Tuning structure for the Qualcomm Saphira core.  Default to falkor values
> > > > > > @@ -1799,7 +1824,9 @@ static const struct tune_params saphira_tunings =
> > > > > >    0,   /* max_case_values.  */
> > > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > > > >    (AARCH64_EXTRA_TUNE_NONE),           /* tune_flags.  */
> > > > > > -  &generic_prefetch_tune
> > > > > > +  &generic_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >  static const struct tune_params thunderx2t99_tunings =
> > > > > > @@ -1833,7 +1860,9 @@ static const struct tune_params thunderx2t99_tunings =
> > > > > >    0,   /* max_case_values.  */
> > > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > > -  &thunderx2t99_prefetch_tune
> > > > > > +  &thunderx2t99_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >  static const struct tune_params thunderx3t110_tunings =
> > > > > > @@ -1867,7 +1896,9 @@ static const struct tune_params thunderx3t110_tunings =
> > > > > >    0,   /* max_case_values.  */
> > > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > > -  &thunderx3t110_prefetch_tune
> > > > > > +  &thunderx3t110_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >  static const struct tune_params neoversen1_tunings =
> > > > > > @@ -1900,7 +1931,9 @@ static const struct tune_params neoversen1_tunings =
> > > > > >    0,   /* max_case_values.  */
> > > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > > > >    (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /* tune_flags.  */
> > > > > > -  &generic_prefetch_tune
> > > > > > +  &generic_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >  static const struct tune_params ampere1_tunings =
> > > > > > @@ -1936,8 +1969,10 @@ static const struct tune_params ampere1_tunings =
> > > > > >    2,   /* min_div_recip_mul_df.  */
> > > > > >    0,   /* max_case_values.  */
> > > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > > > > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> > > > > > -  &ampere1_prefetch_tune
> > > > > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > > +  &ampere1_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >  static const struct tune_params ampere1a_tunings =
> > > > > > @@ -1974,8 +2009,10 @@ static const struct tune_params ampere1a_tunings =
> > > > > >    2,   /* min_div_recip_mul_df.  */
> > > > > >    0,   /* max_case_values.  */
> > > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > > > > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> > > > > > -  &ampere1_prefetch_tune
> > > > > > +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > > +  &ampere1_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >  static const advsimd_vec_cost neoversev1_advsimd_vector_cost =
> > > > > > @@ -2156,7 +2193,9 @@ static const struct tune_params neoversev1_tunings =
> > > > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> > > > > >     | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> > > > > > -  &generic_prefetch_tune
> > > > > > +  &generic_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >  static const sve_vec_cost neoverse512tvb_sve_vector_cost =
> > > > > > @@ -2293,7 +2332,9 @@ static const struct tune_params neoverse512tvb_tunings =
> > > > > >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > > > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > > > > > -  &generic_prefetch_tune
> > > > > > +  &generic_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >  static const advsimd_vec_cost neoversen2_advsimd_vector_cost =
> > > > > > @@ -2483,7 +2524,9 @@ static const struct tune_params neoversen2_tunings =
> > > > > >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > > > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > > > > > -  &generic_prefetch_tune
> > > > > > +  &generic_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >  static const advsimd_vec_cost neoversev2_advsimd_vector_cost =
> > > > > > @@ -2673,7 +2716,9 @@ static const struct tune_params neoversev2_tunings =
> > > > > >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> > > > > >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> > > > > >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> > > > > > -  &generic_prefetch_tune
> > > > > > +  &generic_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >  static const struct tune_params a64fx_tunings =
> > > > > > @@ -2706,7 +2751,9 @@ static const struct tune_params a64fx_tunings =
> > > > > >    0,   /* max_case_values.  */
> > > > > >    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> > > > > >    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> > > > > > -  &a64fx_prefetch_tune
> > > > > > +  &a64fx_prefetch_tune,
> > > > > > +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> > > > > > +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
> > > > > >  };
> > > > > >
> > > > > >  /* Support for fine-grained override of the tuning structures.  */
> > > > > > @@ -17819,6 +17866,34 @@ aarch64_parse_tune (const char *to_parse, const struct processor **res)
> > > > > >    return AARCH_PARSE_INVALID_ARG;
> > > > > >  }
> > > > > >
> > > > > > +/* Parse a command-line -param=aarch64-ldp-policy= parameter.  VALUE is
> > > > > > +   the value of the parameter.  */
> > > > > > +
> > > > > > +static void
> > > > > > +aarch64_parse_ldp_policy (const unsigned int value, struct tune_params* tune)
> > > > > > +{
> > > > > > +  if (value == 1)
> > > > > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALWAYS;
> > > > > > +  else if (value == 2)
> > > > > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_NEVER;
> > > > > > +  else if (value == 3)
> > > > > > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALIGNED;
> > > > > > +}
> > > > > > +
> > > > > > +/* Parse a command-line -param=aarch64-stp-policy= parameter.  VALUE is
> > > > > > +   the value of the parameter.  */
> > > > > > +
> > > > > > +static void
> > > > > > +aarch64_parse_stp_policy (const unsigned int value, struct tune_params* tune)
> > > > > > +{
> > > > > > +  if (value == 1)
> > > > > > +    tune->stp_policy_model = tune_params::STP_POLICY_ALWAYS;
> > > > > > +  else if (value == 2)
> > > > > > +    tune->stp_policy_model = tune_params::STP_POLICY_NEVER;
> > > > > > +  else if (value == 3)
> > > > > > +    tune->stp_policy_model = tune_params::STP_POLICY_ALIGNED;
> > > > > > +}
> > > > > > +
> > > > > >  /* Parse TOKEN, which has length LENGTH to see if it is an option
> > > > > >     described in FLAG.  If it is, return the index bit for that fusion type.
> > > > > >     If not, error (printing OPTION_NAME) and return zero.  */
> > > > > > @@ -18167,6 +18242,12 @@ aarch64_override_options_internal (struct gcc_options *opts)
> > > > > >      aarch64_parse_override_string (opts->x_aarch64_override_tune_string,
> > > > > >                                    &aarch64_tune_params);
> > > > > >
> > > > > > +  aarch64_parse_ldp_policy (aarch64_ldp_policy,
> > > > > > +                             &aarch64_tune_params);
> > > > > > +
> > > > > > +  aarch64_parse_stp_policy (aarch64_stp_policy,
> > > > > > +                             &aarch64_tune_params);
> > > > > > +
> > > > > >    /* This target defaults to strict volatile bitfields.  */
> > > > > >    if (opts->x_flag_strict_volatile_bitfields < 0 && abi_version_at_least (2))
> > > > > >      opts->x_flag_strict_volatile_bitfields = 1;
> > > > > > @@ -26468,18 +26549,14 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
> > > > > >    enum reg_class rclass_1, rclass_2;
> > > > > >    rtx mem_1, mem_2, reg_1, reg_2;
> > > > > >
> > > > > > -  /* Allow the tuning structure to disable LDP instruction formation
> > > > > > -     from combining instructions (e.g., in peephole2).
> > > > > > -     TODO: Implement fine-grained tuning control for LDP and STP:
> > > > > > -          1. control policies for load and store separately;
> > > > > > -          2. support the following policies:
> > > > > > -             - default (use what is in the tuning structure)
> > > > > > -             - always
> > > > > > -             - never
> > > > > > -             - aligned (only if the compiler can prove that the
> > > > > > -               load will be aligned to 2 * element_size)  */
> > > > > > -  if (load && (aarch64_tune_params.extra_tuning_flags
> > > > > > -              & AARCH64_EXTRA_TUNE_NO_LDP_COMBINE))
> > > > > > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> > > > > > +  if (load
> > > > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> > > > > > +    return false;
> > > > > > +
> > > > > > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> > > > > > +  if (!load
> > > > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
> > > > > >      return false;
> > > > > >
> > > > > >    if (load)
> > > > > > @@ -26506,13 +26583,22 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
> > > > > >    if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
> > > > > >      return false;
> > > > > >
> > > > > > -  /* If we have SImode and slow unaligned ldp,
> > > > > > -     check the alignment to be at least 8 byte. */
> > > > > > -  if (mode == SImode
> > > > > > -      && (aarch64_tune_params.extra_tuning_flags
> > > > > > -          & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> > > > > > -      && !optimize_size
> > > > > > -      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
> > > > > > +  /* If we have LDP_POLICY_ALIGNED,
> > > > > > +     do not emit the load pair unless the alignment is checked to be
> > > > > > +     at least double the alignment of the type.  */
> > > > > > +  if (load
> > > > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> > > > > > +      && !optimize_function_for_size_p (cfun)
> > > > > > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> > > > > > +    return false;
> > > > > > +
> > > > > > +  /* If we have STP_POLICY_ALIGNED,
> > > > > > +     do not emit the store pair unless the alignment is checked to be
> > > > > > +     at least double the alignment of the type.  */
> > > > > > +  if (!load
> > > > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> > > > > > +      && !optimize_function_for_size_p (cfun)
> > > > > > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> > > > > >      return false;
> > > > > >
> > > > > >    /* Check if the addresses are in the form of [base+offset].  */
> > > > > > @@ -26640,6 +26726,16 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
> > > > > >    HOST_WIDE_INT offvals[num_insns], msize;
> > > > > >    rtx mem[num_insns], reg[num_insns], base[num_insns], offset[num_insns];
> > > > > >
> > > > > > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> > > > > > +  if (load
> > > > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> > > > > > +    return false;
> > > > > > +
> > > > > > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> > > > > > +  if (!load
> > > > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
> > > > > > +    return false;
> > > > > > +
> > > > > >    if (load)
> > > > > >      {
> > > > > >        for (int i = 0; i < num_insns; i++)
> > > > > > @@ -26729,13 +26825,22 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
> > > > > >    if (offvals[0] % msize != offvals[2] % msize)
> > > > > >      return false;
> > > > > >
> > > > > > -  /* If we have SImode and slow unaligned ldp,
> > > > > > -     check the alignment to be at least 8 byte. */
> > > > > > -  if (mode == SImode
> > > > > > -      && (aarch64_tune_params.extra_tuning_flags
> > > > > > -         & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> > > > > > -      && !optimize_size
> > > > > > -      && MEM_ALIGN (mem[0]) < 8 * BITS_PER_UNIT)
> > > > > > +  /* If we have LDP_POLICY_ALIGNED,
> > > > > > +     do not emit the load pair unless the alignment is checked to be
> > > > > > +     at least double the alignment of the type.  */
> > > > > > +  if (load
> > > > > > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> > > > > > +      && !optimize_function_for_size_p (cfun)
> > > > > > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> > > > > > +    return false;
> > > > > > +
> > > > > > +  /* If we have STP_POLICY_ALIGNED,
> > > > > > +     do not emit the store pair unless the alignment is checked to be
> > > > > > +     at least double the alignment of the type.  */
> > > > > > +  if (!load
> > > > > > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> > > > > > +      && !optimize_function_for_size_p (cfun)
> > > > > > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> > > > > >      return false;
> > > > > >
> > > > > >    return true;
> > > > > > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> > > > > > index 4a0580435a8..f61e3f968d4 100644
> > > > > > --- a/gcc/config/aarch64/aarch64.opt
> > > > > > +++ b/gcc/config/aarch64/aarch64.opt
> > > > > > @@ -337,3 +337,11 @@ Constant memset size in bytes from which to start using MOPS sequence.
> > > > > >  -param=aarch64-vect-unroll-limit=
> > > > > >  Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param
> > > > > >  Limit how much the autovectorizer may unroll a loop.
> > > > > > +
> > > > > > +-param=aarch64-ldp-policy=
> > > > > > +Target Joined UInteger Var(aarch64_ldp_policy) Init(0) IntegerRange(0, 3) Param
> > > > > > +Fine-grained policy for load pairs.
> > > > > > +
> > > > > > +-param=aarch64-stp-policy=
> > > > > > +Target Joined UInteger Var(aarch64_stp_policy) Init(0) IntegerRange(0, 3) Param
> > > > > > +Fine-grained policy for store pairs.
> > > > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > > > > index 146b40414b0..000dd5541f4 100644
> > > > > > --- a/gcc/doc/invoke.texi
> > > > > > +++ b/gcc/doc/invoke.texi
> > > > > > @@ -16508,6 +16508,36 @@ Use both Advanced SIMD and SVE.  Prefer SVE when the costs are deemed equal.
> > > > > >  @end table
> > > > > >  The default value is 0.
> > > > > >
> > > > > > +@item aarch64-ldp-policy
> > > > > > +Fine-grained policy for load pairs.  Accepts values from 0 to 3, inclusive.
> > > > > > +@table @samp
> > > > > > +@item 0
> > > > > > +Use the policy of the tuning structure.
> > > > > > +@item 1
> > > > > > +Emit ldp regardless of alignment.
> > > > > > +@item 2
> > > > > > +Do not emit ldp.
> > > > > > +@item 3
> > > > > > +Emit ldp only if the source pointer is aligned to at least double the alignment
> > > > > > +of the type.
> > > > > > +@end table
> > > > > > +The default value is 0.
> > > > > > +
> > > > > > +@item aarch64-stp-policy
> > > > > > +Fine-grained policy for store pairs.  Accepts values from 0 to 3, inclusive.
> > > > > > +@table @samp
> > > > > > +@item 0
> > > > > > +Use the policy of the tuning structure.
> > > > > > +@item 1
> > > > > > +Emit stp regardless of alignment.
> > > > > > +@item 2
> > > > > > +Do not emit stp.
> > > > > > +@item 3
> > > > > > +Emit stp only if the source pointer is aligned to at least double the alignment
> > > > > > +of the type.
> > > > > > +@end table
> > > > > > +The default value is 0.
> > > > > > +
> > > > > >  @item aarch64-loop-vect-issue-rate-niters
> > > > > >  The tuning for some AArch64 CPUs tries to take both latencies and issue
> > > > > >  rates into account when deciding whether a loop should be vectorized
> > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c b/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > > > > > deleted file mode 100644
> > > > > > index bc871f4481d..00000000000
> > > > > > --- a/gcc/testsuite/gcc.target/aarch64/ampere1-no_ldp_combine.c
> > > > > > +++ /dev/null
> > > > > > @@ -1,11 +0,0 @@
> > > > > > -/* { dg-options "-O3 -mtune=ampere1" } */
> > > > > > -
> > > > > > -long
> > > > > > -foo (long a[])
> > > > > > -{
> > > > > > -  return a[0] + a[1];
> > > > > > -}
> > > > > > -
> > > > > > -/* We should see two ldrs instead of one ldp. */
> > > > > > -/* { dg-final { scan-assembler {\tldr\t} } } */
> > > > > > -/* { dg-final { scan-assembler-not {\tldp\t} } } */
> > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > > > > > new file mode 100644
> > > > > > index 00000000000..8e43faab70d
> > > > > > --- /dev/null
> > > > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > > > > > @@ -0,0 +1,66 @@
> > > > > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=3 -mcpu=generic" } */
> > > > > > +
> > > > > > +#include <stdlib.h>
> > > > > > +#include <stdint.h>
> > > > > > +
> > > > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > > > +
> > > > > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > > > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > > > > +    TYPE a_0, a_1; \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    a_0 = arr[0]; \
> > > > > > +    a_1 = arr[1]; \
> > > > > > +    return a_0 + a_1; \
> > > > > > +}
> > > > > > +
> > > > > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > > > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > > > > +    TYPE a_0, a_1; \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    TYPE *a = arr+1; \
> > > > > > +    a_0 = a[0]; \
> > > > > > +    a_1 = a[1]; \
> > > > > > +    return a_0 + a_1; \
> > > > > > +}
> > > > > > +
> > > > > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    a_0 = arr[100]; \
> > > > > > +    a_1 = arr[101]; \
> > > > > > +    a_2 = arr[102]; \
> > > > > > +    a_3 = arr[103]; \
> > > > > > +    a_4 = arr[110]; \
> > > > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > > > +}
> > > > > > +
> > > > > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    TYPE *a = arr+1; \
> > > > > > +    a_0 = a[100]; \
> > > > > > +    a_1 = a[101]; \
> > > > > > +    a_2 = a[102]; \
> > > > > > +    a_3 = a[103]; \
> > > > > > +    a_4 = a[110]; \
> > > > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > > > +}
> > > > > > +
> > > > > > +LDP_TEST_ALIGNED(int32_t);
> > > > > > +LDP_TEST_ALIGNED(int64_t);
> > > > > > +LDP_TEST_ALIGNED(v4si);
> > > > > > +LDP_TEST_UNALIGNED(int32_t);
> > > > > > +LDP_TEST_UNALIGNED(int64_t);
> > > > > > +LDP_TEST_UNALIGNED(v4si);
> > > > > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > > > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > > > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > > > +
> > > > > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > > > > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > > > > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > > > > > +
> > > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_always.c b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > > > > > new file mode 100644
> > > > > > index 00000000000..532ca607565
> > > > > > --- /dev/null
> > > > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > > > > > @@ -0,0 +1,66 @@
> > > > > > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=1 -mcpu=generic" } */
> > > > > > +
> > > > > > +#include <stdlib.h>
> > > > > > +#include <stdint.h>
> > > > > > +
> > > > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > > > +
> > > > > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > > > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > > > > +    TYPE a_0, a_1; \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    a_0 = arr[0]; \
> > > > > > +    a_1 = arr[1]; \
> > > > > > +    return a_0 + a_1; \
> > > > > > +}
> > > > > > +
> > > > > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > > > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > > > > +    TYPE a_0, a_1; \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    TYPE *a = arr+1; \
> > > > > > +    a_0 = a[0]; \
> > > > > > +    a_1 = a[1]; \
> > > > > > +    return a_0 + a_1; \
> > > > > > +}
> > > > > > +
> > > > > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    a_0 = arr[100]; \
> > > > > > +    a_1 = arr[101]; \
> > > > > > +    a_2 = arr[102]; \
> > > > > > +    a_3 = arr[103]; \
> > > > > > +    a_4 = arr[110]; \
> > > > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > > > +}
> > > > > > +
> > > > > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    TYPE *a = arr+1; \
> > > > > > +    a_0 = a[100]; \
> > > > > > +    a_1 = a[101]; \
> > > > > > +    a_2 = a[102]; \
> > > > > > +    a_3 = a[103]; \
> > > > > > +    a_4 = a[110]; \
> > > > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > > > +}
> > > > > > +
> > > > > > +LDP_TEST_ALIGNED(int32_t);
> > > > > > +LDP_TEST_ALIGNED(int64_t);
> > > > > > +LDP_TEST_ALIGNED(v4si);
> > > > > > +LDP_TEST_UNALIGNED(int32_t);
> > > > > > +LDP_TEST_UNALIGNED(int64_t);
> > > > > > +LDP_TEST_UNALIGNED(v4si);
> > > > > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > > > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > > > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > > > +
> > > > > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> > > > > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> > > > > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> > > > > > +
> > > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_never.c b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > > > > > new file mode 100644
> > > > > > index 00000000000..b39941c18d7
> > > > > > --- /dev/null
> > > > > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > > > > > @@ -0,0 +1,66 @@
> > > > > > > +/* { dg-options "-O2 --param=aarch64-ldp-policy=2 -mcpu=generic" } */
> > > > > > +
> > > > > > +#include <stdlib.h>
> > > > > > +#include <stdint.h>
> > > > > > +
> > > > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > > > +
> > > > > > +#define LDP_TEST_ALIGNED(TYPE) \
> > > > > > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > > > > > +    TYPE a_0, a_1; \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    a_0 = arr[0]; \
> > > > > > +    a_1 = arr[1]; \
> > > > > > +    return a_0 + a_1; \
> > > > > > +}
> > > > > > +
> > > > > > +#define LDP_TEST_UNALIGNED(TYPE) \
> > > > > > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > > > > > +    TYPE a_0, a_1; \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    TYPE *a = arr+1; \
> > > > > > +    a_0 = a[0]; \
> > > > > > +    a_1 = a[1]; \
> > > > > > +    return a_0 + a_1; \
> > > > > > +}
> > > > > > +
> > > > > > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > > > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > > > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    a_0 = arr[100]; \
> > > > > > +    a_1 = arr[101]; \
> > > > > > +    a_2 = arr[102]; \
> > > > > > +    a_3 = arr[103]; \
> > > > > > +    a_4 = arr[110]; \
> > > > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > > > +}
> > > > > > +
> > > > > > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > > > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > > > > > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    TYPE *a = arr+1; \
> > > > > > +    a_0 = a[100]; \
> > > > > > +    a_1 = a[101]; \
> > > > > > +    a_2 = a[102]; \
> > > > > > +    a_3 = a[103]; \
> > > > > > +    a_4 = a[110]; \
> > > > > > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > > > > > +}
> > > > > > +
> > > > > > +LDP_TEST_ALIGNED(int32_t);
> > > > > > +LDP_TEST_ALIGNED(int64_t);
> > > > > > +LDP_TEST_ALIGNED(v4si);
> > > > > > +LDP_TEST_UNALIGNED(int32_t);
> > > > > > +LDP_TEST_UNALIGNED(int64_t);
> > > > > > +LDP_TEST_UNALIGNED(v4si);
> > > > > > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > > > > > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > > > > > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > > > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > > > +
> > > > > > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > > > > > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > > > > > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > > > > > +
> > > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_aligned.c b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > > > > > new file mode 100644
> > > > > > index 00000000000..01f294bb090
> > > > > > --- /dev/null
> > > > > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > > > > > @@ -0,0 +1,60 @@
> > > > > > > +/* { dg-options "-O2 --param=aarch64-stp-policy=3 -mcpu=generic" } */
> > > > > > +
> > > > > > +#include <stdlib.h>
> > > > > > +#include <stdint.h>
> > > > > > +
> > > > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > > > +
> > > > > > +#define STP_TEST_ALIGNED(TYPE) \
> > > > > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    arr[0] = x; \
> > > > > > +    arr[1] = x; \
> > > > > > +    return arr; \
> > > > > > +}
> > > > > > +
> > > > > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > > > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    TYPE *a = arr+1; \
> > > > > > +    a[0] = x; \
> > > > > > +    a[1] = x; \
> > > > > > +    return a; \
> > > > > > +}
> > > > > > +
> > > > > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    arr[100] = x; \
> > > > > > +    arr[101] = x; \
> > > > > > +    arr[102] = x; \
> > > > > > +    arr[103] = x; \
> > > > > > +    return arr; \
> > > > > > +}
> > > > > > +
> > > > > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    TYPE *a = arr+1; \
> > > > > > +    a[100] = x; \
> > > > > > +    a[101] = x; \
> > > > > > +    a[102] = x; \
> > > > > > +    a[103] = x; \
> > > > > > +    return a; \
> > > > > > +}
> > > > > > +
> > > > > > +STP_TEST_ALIGNED(int32_t);
> > > > > > +STP_TEST_ALIGNED(int64_t);
> > > > > > +STP_TEST_ALIGNED(v4si);
> > > > > > +STP_TEST_UNALIGNED(int32_t);
> > > > > > +STP_TEST_UNALIGNED(int64_t);
> > > > > > +STP_TEST_UNALIGNED(v4si);
> > > > > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > > > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > > > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > > > +
> > > > > > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > > > > > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > > > > > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > > > > > +
> > > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_always.c b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> > > > > > new file mode 100644
> > > > > > index 00000000000..cedb461b5b2
> > > > > > --- /dev/null
> > > > > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> > > > > > @@ -0,0 +1,60 @@
> > > > > > > +/* { dg-options "-O2 --param=aarch64-stp-policy=1 -mcpu=generic" } */
> > > > > > +
> > > > > > +#include <stdlib.h>
> > > > > > +#include <stdint.h>
> > > > > > +
> > > > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > > > +
> > > > > > +#define STP_TEST_ALIGNED(TYPE) \
> > > > > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    arr[0] = x; \
> > > > > > +    arr[1] = x; \
> > > > > > +    return arr; \
> > > > > > +}
> > > > > > +
> > > > > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > > > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    TYPE *a = arr+1; \
> > > > > > +    a[0] = x; \
> > > > > > +    a[1] = x; \
> > > > > > +    return a; \
> > > > > > +}
> > > > > > +
> > > > > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    arr[100] = x; \
> > > > > > +    arr[101] = x; \
> > > > > > +    arr[102] = x; \
> > > > > > +    arr[103] = x; \
> > > > > > +    return arr; \
> > > > > > +}
> > > > > > +
> > > > > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    TYPE *a = arr+1; \
> > > > > > +    a[100] = x; \
> > > > > > +    a[101] = x; \
> > > > > > +    a[102] = x; \
> > > > > > +    a[103] = x; \
> > > > > > +    return a; \
> > > > > > +}
> > > > > > +
> > > > > > +STP_TEST_ALIGNED(int32_t);
> > > > > > +STP_TEST_ALIGNED(int64_t);
> > > > > > +STP_TEST_ALIGNED(v4si);
> > > > > > +STP_TEST_UNALIGNED(int32_t);
> > > > > > +STP_TEST_UNALIGNED(int64_t);
> > > > > > +STP_TEST_UNALIGNED(v4si);
> > > > > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > > > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > > > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > > > +
> > > > > > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> > > > > > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> > > > > > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> > > > > > +
> > > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_never.c b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> > > > > > new file mode 100644
> > > > > > index 00000000000..ddde658f807
> > > > > > --- /dev/null
> > > > > > +++ b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> > > > > > @@ -0,0 +1,60 @@
> > > > > > > +/* { dg-options "-O2 --param=aarch64-stp-policy=2 -mcpu=generic" } */
> > > > > > +
> > > > > > +#include <stdlib.h>
> > > > > > +#include <stdint.h>
> > > > > > +
> > > > > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > > > > +
> > > > > > +#define STP_TEST_ALIGNED(TYPE) \
> > > > > > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    arr[0] = x; \
> > > > > > +    arr[1] = x; \
> > > > > > +    return arr; \
> > > > > > +}
> > > > > > +
> > > > > > +#define STP_TEST_UNALIGNED(TYPE) \
> > > > > > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    TYPE *a = arr+1; \
> > > > > > +    a[0] = x; \
> > > > > > +    a[1] = x; \
> > > > > > +    return a; \
> > > > > > +}
> > > > > > +
> > > > > > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > > > > > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    arr[100] = x; \
> > > > > > +    arr[101] = x; \
> > > > > > +    arr[102] = x; \
> > > > > > +    arr[103] = x; \
> > > > > > +    return arr; \
> > > > > > +}
> > > > > > +
> > > > > > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > > > > > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > > > > > > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> > > > > > +    TYPE *a = arr+1; \
> > > > > > +    a[100] = x; \
> > > > > > +    a[101] = x; \
> > > > > > +    a[102] = x; \
> > > > > > +    a[103] = x; \
> > > > > > +    return a; \
> > > > > > +}
> > > > > > +
> > > > > > +STP_TEST_ALIGNED(int32_t);
> > > > > > +STP_TEST_ALIGNED(int64_t);
> > > > > > +STP_TEST_ALIGNED(v4si);
> > > > > > +STP_TEST_UNALIGNED(int32_t);
> > > > > > +STP_TEST_UNALIGNED(int64_t);
> > > > > > +STP_TEST_UNALIGNED(v4si);
> > > > > > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > > > > > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > > > > > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > > > > > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > > > > > +
> > > > > > > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > > > > > > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > > > > > > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > > > > > +
> > > > > > --
> > > > > > 2.40.1
> > > > > >
> >
> >
> >
> > --
> > Manos Anagnostakis | Compiler Engineer |
> > E: mailto:makeljana.shkurti@vrull.eu
> >
> > VRULL GmbH | Beatrixgasse 32 1030 Vienna |
> >  W: http://www.vrull.eu/ | https://www.linkedin.com/company/vrull/
>
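
For readers skimming the thread, the four parameter values described in the quoted documentation can be summarised with a small sketch. This is only an illustration under stated assumptions, not the code in aarch64.cc: the enum, the function pair_allowed and its arguments are hypothetical names, and the "aligned" case uses the 2 * element_size wording from the patch description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the documented parameter values 0..3.  */
enum pair_policy {
    POLICY_DEFAULT = 0,  /* use the policy of the tuning structure */
    POLICY_ALWAYS  = 1,  /* emit ldp/stp regardless of alignment */
    POLICY_NEVER   = 2,  /* do not emit ldp/stp */
    POLICY_ALIGNED = 3   /* emit only if aligned to 2 * element_size */
};

/* Decide whether a pair may be formed.  `param` is the command-line
   value, `tuning` is what the tuning structure asks for, `known_align`
   is the alignment in bytes the compiler can prove for the access.  */
static bool pair_allowed(enum pair_policy param, enum pair_policy tuning,
                         size_t known_align, size_t elem_size)
{
    /* Value 0 defers to the tuning structure.  */
    enum pair_policy p = (param == POLICY_DEFAULT) ? tuning : param;

    /* Assumption: the tuning structure always names a concrete policy.  */
    assert(p != POLICY_DEFAULT);

    switch (p) {
    case POLICY_ALWAYS:
        return true;
    case POLICY_NEVER:
        return false;
    case POLICY_ALIGNED:
        return known_align >= 2 * elem_size;
    default:
        return false;
    }
}
```

For example, pair_allowed(POLICY_ALIGNED, POLICY_ALWAYS, 16, 8) allows the pair, while the same call with known_align of 8 rejects it. Load and store policies would each get their own `param` value, matching the separate -param=aarch64-ldp-policy= and -param=aarch64-stp-policy= options.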


-- 
*Manos Anagnostakis | Compiler Engineer |*
E: manos.anagnostakis@vrull.eu <makeljana.shkurti@vrull.eu>

*VRULL GmbH *| Beatrixgasse 32 1030 Vienna |
 W: www.vrull.eu | LinkedIn <https://www.linkedin.com/company/vrull/>


end of thread, other threads:[~2023-09-26 11:51 UTC | newest]

Thread overview: 12+ messages
2023-09-25 19:49 [PATCH v3] aarch64: Fine-grained policies to control ldp-stp formation Manos Anagnostakis
2023-09-25 19:54 ` Andrew Pinski
2023-09-25 19:59   ` Philipp Tomsich
2023-09-25 20:04     ` Andrew Pinski
2023-09-25 20:57       ` Andrew Pinski
2023-09-26  6:06         ` Manos Anagnostakis
2023-09-26  8:34           ` Tamar Christina
2023-09-26  8:35           ` Kyrylo Tkachov
2023-09-26  8:50             ` Kyrylo Tkachov
2023-09-26 11:51               ` Manos Anagnostakis
2023-09-26  8:34       ` Kyrylo Tkachov
2023-09-25 20:00   ` Manos Anagnostakis
