public inbox for gcc-patches@gcc.gnu.org
* [PATCH v2] aarch64: Fine-grained ldp and stp policies with test-cases.
@ 2023-08-28 14:37 Manos Anagnostakis
  2023-09-22  8:51 ` [PING] " Manos Anagnostakis
  2023-09-28 13:17 ` Richard Sandiford
  0 siblings, 2 replies; 8+ messages in thread
From: Manos Anagnostakis @ 2023-08-28 14:37 UTC (permalink / raw)
  To: gcc-patches
  Cc: Kyrylo Tkachov, Richard Sandiford, Philipp Tomsich, Manos Anagnostakis

This patch implements the following TODO in gcc/config/aarch64/aarch64.cc
to provide the requested behaviour for handling ldp and stp:

  /* Allow the tuning structure to disable LDP instruction formation
     from combining instructions (e.g., in peephole2).
     TODO: Implement fine-grained tuning control for LDP and STP:
           1. control policies for load and store separately;
           2. support the following policies:
              - default (use what is in the tuning structure)
              - always
              - never
              - aligned (only if the compiler can prove that the
                load will be aligned to 2 * element_size)  */

It provides two new command-line options, -mldp-policy and -mstp-policy,
to control the load and store policies separately, as requested in
part 1 of the TODO.

The accepted values for both options are:
- default: Use the ldp/stp policy defined in the corresponding tuning
  structure.
- always: Emit ldp/stp regardless of alignment.
- never: Do not emit ldp/stp.
- aligned: Emit ldp/stp only if the load/store can be proven to be
  aligned to 2 * element_size.
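
To illustrate the intended behaviour of the "aligned" policy, here is a
minimal sketch written in the style of the new tests below (the function
names and the masking expression are illustrative only; whether ldp is
actually emitted also depends on the target, the optimization level and
the other checks in aarch64_operands_ok_for_ldpstp):

  /* Hypothetical example; compile with e.g. -O2 -mldp-policy=aligned.  */
  #include <stdint.h>

  int64_t
  ldp_expected (char *ptr)
  {
    /* The mask makes the pointer provably aligned well beyond
       2 * _Alignof (int64_t), so the aligned policy should allow the
       two loads to be fused into an ldp.  */
    int64_t *p = (int64_t *) ((uintptr_t) ptr & ~(2 * 8 * _Alignof (int64_t) - 1));
    return p[0] + p[1];
  }

  int64_t
  ldp_rejected (int64_t *p)
  {
    /* Only the natural alignment of int64_t is known here, which is
       less than 2 * element_size, so the aligned policy keeps the two
       loads separate.  */
    return p[0] + p[1];
  }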

gcc/ChangeLog:
        * config/aarch64/aarch64-protos.h (struct tune_params): Add
	appropriate enums for the policies.
        * config/aarch64/aarch64-tuning-flags.def
	(AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
	options.
        * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
	function to parse ldp-policy option.
        (aarch64_parse_stp_policy): New function to parse stp-policy option.
        (aarch64_override_options_internal): Call parsing functions.
        (aarch64_operands_ok_for_ldpstp): Add option-value check and
	alignment check and remove superseded ones.
        (aarch64_operands_adjust_ok_for_ldpstp): Add option-value check and
	alignment check and remove superseded ones.
        * config/aarch64/aarch64.opt: Add options.

gcc/testsuite/ChangeLog:
        * gcc.target/aarch64/ldp_aligned.c: New test.
        * gcc.target/aarch64/ldp_always.c: New test.
        * gcc.target/aarch64/ldp_never.c: New test.
        * gcc.target/aarch64/stp_aligned.c: New test.
        * gcc.target/aarch64/stp_always.c: New test.
        * gcc.target/aarch64/stp_never.c: New test.

Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
---
Changes in v2:
        - Fixed committed ldp tests to correctly trigger
          and test aarch64_operands_adjust_ok_for_ldpstp in aarch64.cc.
        - Added "-mcpu=generic" to the committed tests to guarantee generic target code
          generation and avoid the regressions seen in v1.

 gcc/config/aarch64/aarch64-protos.h           |  24 ++
 gcc/config/aarch64/aarch64-tuning-flags.def   |   8 -
 gcc/config/aarch64/aarch64.cc                 | 229 ++++++++++++++----
 gcc/config/aarch64/aarch64.opt                |   8 +
 .../gcc.target/aarch64/ldp_aligned.c          |  66 +++++
 gcc/testsuite/gcc.target/aarch64/ldp_always.c |  66 +++++
 gcc/testsuite/gcc.target/aarch64/ldp_never.c  |  66 +++++
 .../gcc.target/aarch64/stp_aligned.c          |  60 +++++
 gcc/testsuite/gcc.target/aarch64/stp_always.c |  60 +++++
 gcc/testsuite/gcc.target/aarch64/stp_never.c  |  60 +++++
 10 files changed, 586 insertions(+), 61 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_always.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_never.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_aligned.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_always.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_never.c

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 70303d6fd95..be1d73490ed 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -568,6 +568,30 @@ struct tune_params
   /* Place prefetch struct pointer at the end to enable type checking
      errors when tune_params misses elements (e.g., from erroneous merges).  */
   const struct cpu_prefetch_tune *prefetch;
+/* An enum specifying how to handle load pairs using a fine-grained policy:
+   - LDP_POLICY_ALIGNED: Emit ldp if the source pointer is aligned
+   to at least double the alignment of the type.
+   - LDP_POLICY_ALWAYS: Emit ldp regardless of alignment.
+   - LDP_POLICY_NEVER: Do not emit ldp.  */
+
+  enum aarch64_ldp_policy_model
+  {
+    LDP_POLICY_ALIGNED,
+    LDP_POLICY_ALWAYS,
+    LDP_POLICY_NEVER
+  } ldp_policy_model;
+/* An enum specifying how to handle store pairs using a fine-grained policy:
+   - STP_POLICY_ALIGNED: Emit stp if the source pointer is aligned
+   to at least double the alignment of the type.
+   - STP_POLICY_ALWAYS: Emit stp regardless of alignment.
+   - STP_POLICY_NEVER: Do not emit stp.  */
+
+  enum aarch64_stp_policy_model
+  {
+    STP_POLICY_ALIGNED,
+    STP_POLICY_ALWAYS,
+    STP_POLICY_NEVER
+  } stp_policy_model;
 };
 
 /* Classifies an address.
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 52112ba7c48..774568e9106 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -30,11 +30,6 @@
 
 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
 
-/* Don't create non-8 byte aligned load/store pair.  That is if the
-two load/stores are not at least 8 byte aligned don't create load/store
-pairs.   */
-AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)
-
 /* Some of the optional shift to some arthematic instructions are
    considered cheap.  Logical shift left <=4 with or without a
    zero extend are considered cheap.  Sign extend; non logical shift left
@@ -44,9 +39,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
 /* Disallow load/store pair instructions on Q-registers.  */
 AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)
 
-/* Disallow load-pair instructions to be formed in combine/peephole.  */
-AARCH64_EXTRA_TUNING_OPTION ("no_ldp_combine", NO_LDP_COMBINE)
-
 AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
 
 AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index eba5d4a7e04..43d88c68647 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -1356,7 +1356,9 @@ static const struct tune_params generic_tunings =
      Neoverse V1.  It does not have a noticeable effect on A64FX and should
      have at most a very minor effect on SVE2 cores.  */
   (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS),	/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params cortexa35_tunings =
@@ -1390,7 +1392,9 @@ static const struct tune_params cortexa35_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params cortexa53_tunings =
@@ -1424,7 +1428,9 @@ static const struct tune_params cortexa53_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params cortexa57_tunings =
@@ -1458,7 +1464,9 @@ static const struct tune_params cortexa57_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),	/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params cortexa72_tunings =
@@ -1492,7 +1500,9 @@ static const struct tune_params cortexa72_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params cortexa73_tunings =
@@ -1526,7 +1536,9 @@ static const struct tune_params cortexa73_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 
@@ -1561,7 +1573,9 @@ static const struct tune_params exynosm1_tunings =
   48,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
-  &exynosm1_prefetch_tune
+  &exynosm1_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params thunderxt88_tunings =
@@ -1593,8 +1607,10 @@ static const struct tune_params thunderxt88_tunings =
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),	/* tune_flags.  */
-  &thunderxt88_prefetch_tune
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  &thunderxt88_prefetch_tune,
+  tune_params::LDP_POLICY_ALIGNED,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALIGNED	/* stp_policy_model.  */
 };
 
 static const struct tune_params thunderx_tunings =
@@ -1626,9 +1642,10 @@ static const struct tune_params thunderx_tunings =
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW
-   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),	/* tune_flags.  */
-  &thunderx_prefetch_tune
+  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),	/* tune_flags.  */
+  &thunderx_prefetch_tune,
+  tune_params::LDP_POLICY_ALIGNED,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALIGNED	/* stp_policy_model.  */
 };
 
 static const struct tune_params tsv110_tunings =
@@ -1662,7 +1679,9 @@ static const struct tune_params tsv110_tunings =
   0,    /* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,     /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),     /* tune_flags.  */
-  &tsv110_prefetch_tune
+  &tsv110_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params xgene1_tunings =
@@ -1695,7 +1714,9 @@ static const struct tune_params xgene1_tunings =
   17,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),	/* tune_flags.  */
-  &xgene1_prefetch_tune
+  &xgene1_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params emag_tunings =
@@ -1728,7 +1749,9 @@ static const struct tune_params emag_tunings =
   17,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),	/* tune_flags.  */
-  &xgene1_prefetch_tune
+  &xgene1_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params qdf24xx_tunings =
@@ -1762,7 +1785,9 @@ static const struct tune_params qdf24xx_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS, /* tune_flags.  */
-  &qdf24xx_prefetch_tune
+  &qdf24xx_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 /* Tuning structure for the Qualcomm Saphira core.  Default to falkor values
@@ -1798,7 +1823,9 @@ static const struct tune_params saphira_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),		/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params thunderx2t99_tunings =
@@ -1832,7 +1859,9 @@ static const struct tune_params thunderx2t99_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
-  &thunderx2t99_prefetch_tune
+  &thunderx2t99_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params thunderx3t110_tunings =
@@ -1866,7 +1895,9 @@ static const struct tune_params thunderx3t110_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
-  &thunderx3t110_prefetch_tune
+  &thunderx3t110_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params neoversen1_tunings =
@@ -1899,7 +1930,9 @@ static const struct tune_params neoversen1_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),	/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params ampere1_tunings =
@@ -1935,8 +1968,10 @@ static const struct tune_params ampere1_tunings =
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE),	/* tune_flags.  */
-  &ampere1_prefetch_tune
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  &ampere1_prefetch_tune,
+  tune_params::LDP_POLICY_ALIGNED,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALIGNED	/* stp_policy_model.  */
 };
 
 static const struct tune_params ampere1a_tunings =
@@ -1973,8 +2008,10 @@ static const struct tune_params ampere1a_tunings =
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE),	/* tune_flags.  */
-  &ampere1_prefetch_tune
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  &ampere1_prefetch_tune,
+  tune_params::LDP_POLICY_ALIGNED,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALIGNED	/* stp_policy_model.  */
 };
 
 static const advsimd_vec_cost neoversev1_advsimd_vector_cost =
@@ -2155,7 +2192,9 @@ static const struct tune_params neoversev1_tunings =
    | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
    | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),	/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const sve_vec_cost neoverse512tvb_sve_vector_cost =
@@ -2292,7 +2331,9 @@ static const struct tune_params neoverse512tvb_tunings =
   (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
    | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),	/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const advsimd_vec_cost neoversen2_advsimd_vector_cost =
@@ -2482,7 +2523,9 @@ static const struct tune_params neoversen2_tunings =
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
    | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),	/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const advsimd_vec_cost neoversev2_advsimd_vector_cost =
@@ -2672,7 +2715,9 @@ static const struct tune_params neoversev2_tunings =
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
    | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),	/* tune_flags.  */
-  &generic_prefetch_tune
+  &generic_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 static const struct tune_params a64fx_tunings =
@@ -2705,7 +2750,9 @@ static const struct tune_params a64fx_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
-  &a64fx_prefetch_tune
+  &a64fx_prefetch_tune,
+  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
+  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
 };
 
 /* Support for fine-grained override of the tuning structures.  */
@@ -17726,6 +17773,50 @@ aarch64_parse_tune (const char *to_parse, const struct processor **res)
   return AARCH_PARSE_INVALID_ARG;
 }
 
+/* Validate a command-line -mldp-policy option.  Parse the policy
+   specified in STR and throw errors if appropriate.  */
+
+static bool
+aarch64_parse_ldp_policy (const char *str, struct tune_params* tune)
+{
+  /* Check the value of the option to be one of the accepted.  */
+  if (strcmp (str, "always") == 0)
+    tune->ldp_policy_model = tune_params::LDP_POLICY_ALWAYS;
+  else if (strcmp (str, "never") == 0)
+    tune->ldp_policy_model = tune_params::LDP_POLICY_NEVER;
+  else if (strcmp (str, "aligned") == 0)
+    tune->ldp_policy_model = tune_params::LDP_POLICY_ALIGNED;
+  else if (strcmp (str, "default") != 0)
+    {
+      error ("unknown value %qs for %<-mldp-policy%>", str);
+      return false;
+    }
+
+  return true;
+}
+
+/* Validate a command-line -mstp-policy option.  Parse the policy
+   specified in STR and throw errors if appropriate.  */
+
+static bool
+aarch64_parse_stp_policy (const char *str, struct tune_params* tune)
+{
+  /* Check the value of the option to be one of the accepted.  */
+  if (strcmp (str, "always") == 0)
+    tune->stp_policy_model = tune_params::STP_POLICY_ALWAYS;
+  else if (strcmp (str, "never") == 0)
+    tune->stp_policy_model = tune_params::STP_POLICY_NEVER;
+  else if (strcmp (str, "aligned") == 0)
+    tune->stp_policy_model = tune_params::STP_POLICY_ALIGNED;
+  else if (strcmp (str, "default") != 0)
+    {
+      error ("unknown value %qs for %<-mstp-policy%>", str);
+      return false;
+    }
+
+  return true;
+}
+
 /* Parse TOKEN, which has length LENGTH to see if it is an option
    described in FLAG.  If it is, return the index bit for that fusion type.
    If not, error (printing OPTION_NAME) and return zero.  */
@@ -18074,6 +18165,14 @@ aarch64_override_options_internal (struct gcc_options *opts)
     aarch64_parse_override_string (opts->x_aarch64_override_tune_string,
 				   &aarch64_tune_params);
 
+  if (opts->x_aarch64_ldp_policy_string)
+    aarch64_parse_ldp_policy (opts->x_aarch64_ldp_policy_string,
+			      &aarch64_tune_params);
+
+  if (opts->x_aarch64_stp_policy_string)
+    aarch64_parse_stp_policy (opts->x_aarch64_stp_policy_string,
+			      &aarch64_tune_params);
+
   /* This target defaults to strict volatile bitfields.  */
   if (opts->x_flag_strict_volatile_bitfields < 0 && abi_version_at_least (2))
     opts->x_flag_strict_volatile_bitfields = 1;
@@ -26382,18 +26481,14 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
   enum reg_class rclass_1, rclass_2;
   rtx mem_1, mem_2, reg_1, reg_2;
 
-  /* Allow the tuning structure to disable LDP instruction formation
-     from combining instructions (e.g., in peephole2).
-     TODO: Implement fine-grained tuning control for LDP and STP:
-	   1. control policies for load and store separately;
-	   2. support the following policies:
-	      - default (use what is in the tuning structure)
-	      - always
-	      - never
-	      - aligned (only if the compiler can prove that the
-		load will be aligned to 2 * element_size)  */
-  if (load && (aarch64_tune_params.extra_tuning_flags
-	       & AARCH64_EXTRA_TUNE_NO_LDP_COMBINE))
+  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
+  if (load
+      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
+    return false;
+
+  /* If we have STP_POLICY_NEVER, reject the store pair.  */
+  if (!load
+      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
     return false;
 
   if (load)
@@ -26420,13 +26515,22 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
   if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
     return false;
 
-  /* If we have SImode and slow unaligned ldp,
-     check the alignment to be at least 8 byte. */
-  if (mode == SImode
-      && (aarch64_tune_params.extra_tuning_flags
-          & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
+  /* If we have LDP_POLICY_ALIGNED,
+     do not emit the load pair unless the alignment is checked to be
+     at least double the alignment of the type.  */
+  if (load
+      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
       && !optimize_size
-      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
+      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
+    return false;
+
+  /* If we have STP_POLICY_ALIGNED,
+     do not emit the store pair unless the alignment is checked to be
+     at least double the alignment of the type.  */
+  if (!load
+      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
+      && !optimize_size
+      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
     return false;
 
   /* Check if the addresses are in the form of [base+offset].  */
@@ -26556,6 +26660,16 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
   HOST_WIDE_INT offvals[num_insns], msize;
   rtx mem[num_insns], reg[num_insns], base[num_insns], offset[num_insns];
 
+  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
+  if (load
+      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
+    return false;
+
+  /* If we have STP_POLICY_NEVER, reject the store pair.  */
+  if (!load
+      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
+    return false;
+
   if (load)
     {
       for (int i = 0; i < num_insns; i++)
@@ -26645,13 +26759,22 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
   if (offvals[0] % msize != offvals[2] % msize)
     return false;
 
-  /* If we have SImode and slow unaligned ldp,
-     check the alignment to be at least 8 byte. */
-  if (mode == SImode
-      && (aarch64_tune_params.extra_tuning_flags
-	  & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
+  /* If we have LDP_POLICY_ALIGNED,
+     do not emit the load pair unless the alignment is checked to be
+     at least double the alignment of the type.  */
+  if (load
+      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
+      && !optimize_size
+      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
+    return false;
+
+  /* If we have STP_POLICY_ALIGNED,
+     do not emit the store pair unless the alignment is checked to be
+     at least double the alignment of the type.  */
+  if (!load
+      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
       && !optimize_size
-      && MEM_ALIGN (mem[0]) < 8 * BITS_PER_UNIT)
+      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
     return false;
 
   return true;
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index 4a0580435a8..e5302947ce7 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -205,6 +205,14 @@ msign-return-address=
 Target WarnRemoved RejectNegative Joined Enum(aarch_ra_sign_scope_t) Var(aarch_ra_sign_scope) Init(AARCH_FUNCTION_NONE) Save
 Select return address signing scope.
 
+mldp-policy=
+Target RejectNegative Joined Var(aarch64_ldp_policy_string) Save
+Fine-grained policy for load pairs.
+
+mstp-policy=
+Target RejectNegative Joined Var(aarch64_stp_policy_string) Save
+Fine-grained policy for store pairs.
+
 Enum
 Name(aarch_ra_sign_scope_t) Type(enum aarch_function_type)
 Supported AArch64 return address signing scope (for use with -msign-return-address= option):
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
new file mode 100644
index 00000000000..6e29b265168
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
@@ -0,0 +1,66 @@
+/* { dg-options "-O2 -mldp-policy=aligned -mcpu=generic" } */
+
+#include <stdlib.h>
+#include <stdint.h>
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+#define LDP_TEST_ALIGNED(TYPE) \
+TYPE ldp_aligned_##TYPE(char* ptr){ \
+    TYPE a_0, a_1; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    a_0 = arr[0]; \
+    a_1 = arr[1]; \
+    return a_0 + a_1; \
+}
+
+#define LDP_TEST_UNALIGNED(TYPE) \
+TYPE ldp_unaligned_##TYPE(char* ptr){ \
+    TYPE a_0, a_1; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a_0 = a[0]; \
+    a_1 = a[1]; \
+    return a_0 + a_1; \
+}
+
+#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
+TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
+    TYPE a_0, a_1, a_2, a_3, a_4; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    a_0 = arr[100]; \
+    a_1 = arr[101]; \
+    a_2 = arr[102]; \
+    a_3 = arr[103]; \
+    a_4 = arr[110]; \
+    return a_0 + a_1 + a_2 + a_3 + a_4; \
+}
+
+#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
+TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
+    TYPE a_0, a_1, a_2, a_3, a_4; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a_0 = a[100]; \
+    a_1 = a[101]; \
+    a_2 = a[102]; \
+    a_3 = a[103]; \
+    a_4 = a[110]; \
+    return a_0 + a_1 + a_2 + a_3 + a_4; \
+}
+
+LDP_TEST_ALIGNED(int32_t);
+LDP_TEST_ALIGNED(int64_t);
+LDP_TEST_ALIGNED(v4si);
+LDP_TEST_UNALIGNED(int32_t);
+LDP_TEST_UNALIGNED(int64_t);
+LDP_TEST_UNALIGNED(v4si);
+LDP_TEST_ADJUST_ALIGNED(int32_t);
+LDP_TEST_ADJUST_ALIGNED(int64_t);
+LDP_TEST_ADJUST_UNALIGNED(int32_t);
+LDP_TEST_ADJUST_UNALIGNED(int64_t);
+
+/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
+/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
+/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
+
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_always.c b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
new file mode 100644
index 00000000000..d2c4cf343e9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
@@ -0,0 +1,66 @@
+/* { dg-options "-O2 -mldp-policy=always -mcpu=generic" } */
+
+#include <stdlib.h>
+#include <stdint.h>
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+#define LDP_TEST_ALIGNED(TYPE) \
+TYPE ldp_aligned_##TYPE(char* ptr){ \
+    TYPE a_0, a_1; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    a_0 = arr[0]; \
+    a_1 = arr[1]; \
+    return a_0 + a_1; \
+}
+
+#define LDP_TEST_UNALIGNED(TYPE) \
+TYPE ldp_unaligned_##TYPE(char* ptr){ \
+    TYPE a_0, a_1; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a_0 = a[0]; \
+    a_1 = a[1]; \
+    return a_0 + a_1; \
+}
+
+#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
+TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
+    TYPE a_0, a_1, a_2, a_3, a_4; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    a_0 = arr[100]; \
+    a_1 = arr[101]; \
+    a_2 = arr[102]; \
+    a_3 = arr[103]; \
+    a_4 = arr[110]; \
+    return a_0 + a_1 + a_2 + a_3 + a_4; \
+}
+
+#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
+TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
+    TYPE a_0, a_1, a_2, a_3, a_4; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a_0 = a[100]; \
+    a_1 = a[101]; \
+    a_2 = a[102]; \
+    a_3 = a[103]; \
+    a_4 = a[110]; \
+    return a_0 + a_1 + a_2 + a_3 + a_4; \
+}
+
+LDP_TEST_ALIGNED(int32_t);
+LDP_TEST_ALIGNED(int64_t);
+LDP_TEST_ALIGNED(v4si);
+LDP_TEST_UNALIGNED(int32_t);
+LDP_TEST_UNALIGNED(int64_t);
+LDP_TEST_UNALIGNED(v4si);
+LDP_TEST_ADJUST_ALIGNED(int32_t);
+LDP_TEST_ADJUST_ALIGNED(int64_t);
+LDP_TEST_ADJUST_UNALIGNED(int32_t);
+LDP_TEST_ADJUST_UNALIGNED(int64_t);
+
+/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 6 } } */
+/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 6 } } */
+/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 2 } } */
+
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_never.c b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
new file mode 100644
index 00000000000..f8a45ee18be
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
@@ -0,0 +1,66 @@
+/* { dg-options "-O2 -mldp-policy=never -mcpu=generic" } */
+
+#include <stdlib.h>
+#include <stdint.h>
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+#define LDP_TEST_ALIGNED(TYPE) \
+TYPE ldp_aligned_##TYPE(char* ptr){ \
+    TYPE a_0, a_1; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    a_0 = arr[0]; \
+    a_1 = arr[1]; \
+    return a_0 + a_1; \
+}
+
+#define LDP_TEST_UNALIGNED(TYPE) \
+TYPE ldp_unaligned_##TYPE(char* ptr){ \
+    TYPE a_0, a_1; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a_0 = a[0]; \
+    a_1 = a[1]; \
+    return a_0 + a_1; \
+}
+
+#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
+TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
+    TYPE a_0, a_1, a_2, a_3, a_4; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    a_0 = arr[100]; \
+    a_1 = arr[101]; \
+    a_2 = arr[102]; \
+    a_3 = arr[103]; \
+    a_4 = arr[110]; \
+    return a_0 + a_1 + a_2 + a_3 + a_4; \
+}
+
+#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
+TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
+    TYPE a_0, a_1, a_2, a_3, a_4; \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a_0 = a[100]; \
+    a_1 = a[101]; \
+    a_2 = a[102]; \
+    a_3 = a[103]; \
+    a_4 = a[110]; \
+    return a_0 + a_1 + a_2 + a_3 + a_4; \
+}
+
+LDP_TEST_ALIGNED(int32_t);
+LDP_TEST_ALIGNED(int64_t);
+LDP_TEST_ALIGNED(v4si);
+LDP_TEST_UNALIGNED(int32_t);
+LDP_TEST_UNALIGNED(int64_t);
+LDP_TEST_UNALIGNED(v4si);
+LDP_TEST_ADJUST_ALIGNED(int32_t);
+LDP_TEST_ADJUST_ALIGNED(int64_t);
+LDP_TEST_ADJUST_UNALIGNED(int32_t);
+LDP_TEST_ADJUST_UNALIGNED(int64_t);
+
+/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
+/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
+/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
+
diff --git a/gcc/testsuite/gcc.target/aarch64/stp_aligned.c b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
new file mode 100644
index 00000000000..ae47b42efc4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
@@ -0,0 +1,60 @@
+/* { dg-options "-O2 -mstp-policy=aligned -mcpu=generic" } */
+
+#include <stdlib.h>
+#include <stdint.h>
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+#define STP_TEST_ALIGNED(TYPE) \
+TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    arr[0] = x; \
+    arr[1] = x; \
+    return arr; \
+}
+
+#define STP_TEST_UNALIGNED(TYPE) \
+TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a[0] = x; \
+    a[1] = x; \
+    return a; \
+}
+
+#define STP_TEST_ADJUST_ALIGNED(TYPE) \
+TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    arr[100] = x; \
+    arr[101] = x; \
+    arr[102] = x; \
+    arr[103] = x; \
+    return arr; \
+}
+
+#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
+TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a[100] = x; \
+    a[101] = x; \
+    a[102] = x; \
+    a[103] = x; \
+    return a; \
+}
+
+STP_TEST_ALIGNED(int32_t);
+STP_TEST_ALIGNED(int64_t);
+STP_TEST_ALIGNED(v4si);
+STP_TEST_UNALIGNED(int32_t);
+STP_TEST_UNALIGNED(int64_t);
+STP_TEST_UNALIGNED(v4si);
+STP_TEST_ADJUST_ALIGNED(int32_t);
+STP_TEST_ADJUST_ALIGNED(int64_t);
+STP_TEST_ADJUST_UNALIGNED(int32_t);
+STP_TEST_ADJUST_UNALIGNED(int64_t);
+
+/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 3 } } */
+/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 3 } } */
+/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 1 } } */
+
diff --git a/gcc/testsuite/gcc.target/aarch64/stp_always.c b/gcc/testsuite/gcc.target/aarch64/stp_always.c
new file mode 100644
index 00000000000..c1c51f9ae88
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stp_always.c
@@ -0,0 +1,60 @@
+/* { dg-options "-O2 -mstp-policy=always -mcpu=generic" } */
+
+#include <stdlib.h>
+#include <stdint.h>
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+#define STP_TEST_ALIGNED(TYPE) \
+TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    arr[0] = x; \
+    arr[1] = x; \
+    return arr; \
+}
+
+#define STP_TEST_UNALIGNED(TYPE) \
+TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a[0] = x; \
+    a[1] = x; \
+    return a; \
+}
+
+#define STP_TEST_ADJUST_ALIGNED(TYPE) \
+TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    arr[100] = x; \
+    arr[101] = x; \
+    arr[102] = x; \
+    arr[103] = x; \
+    return arr; \
+}
+
+#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
+TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a[100] = x; \
+    a[101] = x; \
+    a[102] = x; \
+    a[103] = x; \
+    return a; \
+}
+
+STP_TEST_ALIGNED(int32_t);
+STP_TEST_ALIGNED(int64_t);
+STP_TEST_ALIGNED(v4si);
+STP_TEST_UNALIGNED(int32_t);
+STP_TEST_UNALIGNED(int64_t);
+STP_TEST_UNALIGNED(v4si);
+STP_TEST_ADJUST_ALIGNED(int32_t);
+STP_TEST_ADJUST_ALIGNED(int64_t);
+STP_TEST_ADJUST_UNALIGNED(int32_t);
+STP_TEST_ADJUST_UNALIGNED(int64_t);
+
+/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 6 } } */
+/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 6 } } */
+/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 2 } } */
+
diff --git a/gcc/testsuite/gcc.target/aarch64/stp_never.c b/gcc/testsuite/gcc.target/aarch64/stp_never.c
new file mode 100644
index 00000000000..c28fcafa0ed
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stp_never.c
@@ -0,0 +1,60 @@
+/* { dg-options "-O2 -mstp-policy=never -mcpu=generic" } */
+
+#include <stdlib.h>
+#include <stdint.h>
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+#define STP_TEST_ALIGNED(TYPE) \
+TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    arr[0] = x; \
+    arr[1] = x; \
+    return arr; \
+}
+
+#define STP_TEST_UNALIGNED(TYPE) \
+TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a[0] = x; \
+    a[1] = x; \
+    return a; \
+}
+
+#define STP_TEST_ADJUST_ALIGNED(TYPE) \
+TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    arr[100] = x; \
+    arr[101] = x; \
+    arr[102] = x; \
+    arr[103] = x; \
+    return arr; \
+}
+
+#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
+TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
+    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
+    TYPE *a = arr+1; \
+    a[100] = x; \
+    a[101] = x; \
+    a[102] = x; \
+    a[103] = x; \
+    return a; \
+}
+
+STP_TEST_ALIGNED(int32_t);
+STP_TEST_ALIGNED(int64_t);
+STP_TEST_ALIGNED(v4si);
+STP_TEST_UNALIGNED(int32_t);
+STP_TEST_UNALIGNED(int64_t);
+STP_TEST_UNALIGNED(v4si);
+STP_TEST_ADJUST_ALIGNED(int32_t);
+STP_TEST_ADJUST_ALIGNED(int64_t);
+STP_TEST_ADJUST_UNALIGNED(int32_t);
+STP_TEST_ADJUST_UNALIGNED(int64_t);
+
+/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 0 } } */
+/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 0 } } */
+/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 0 } } */
+
-- 
2.40.1



* [PING] [PATCH v2] aarch64: Fine-grained ldp and stp policies with test-cases.
  2023-08-28 14:37 [PATCH v2] aarch64: Fine-grained ldp and stp policies with test-cases Manos Anagnostakis
@ 2023-09-22  8:51 ` Manos Anagnostakis
  2023-09-28 13:17 ` Richard Sandiford
  1 sibling, 0 replies; 8+ messages in thread
From: Manos Anagnostakis @ 2023-09-22  8:51 UTC (permalink / raw)
  To: gcc-patches
  Cc: Kyrylo Tkachov, Richard Sandiford, Philipp Tomsich, tamar.christina


Kind ping for a review of this patch. It has been tested and does not cause
regressions:
https://patchwork.sourceware.org/project/gcc/patch/20230828143744.7574-1-manos.anagnostakis@vrull.eu/

Thank you in advance!

On Mon, Aug 28, 2023 at 5:37 PM Manos Anagnostakis <
manos.anagnostakis@vrull.eu> wrote:

> This patch implements the following TODO in gcc/config/aarch64/aarch64.cc
> to provide the requested behaviour for handling ldp and stp:
>
>   /* Allow the tuning structure to disable LDP instruction formation
>      from combining instructions (e.g., in peephole2).
>      TODO: Implement fine-grained tuning control for LDP and STP:
>            1. control policies for load and store separately;
>            2. support the following policies:
>               - default (use what is in the tuning structure)
>               - always
>               - never
>               - aligned (only if the compiler can prove that the
>                 load will be aligned to 2 * element_size)  */
>
> It provides two new and concrete command-line options -mldp-policy and
> -mstp-policy
> to give the ability to control load and store policies seperately as
> stated in part 1 of the TODO.
>
> The accepted values for both options are:
> - default: Use the ldp/stp policy defined in the corresponding tuning
>   structure.
> - always: Emit ldp/stp regardless of alignment.
> - never: Do not emit ldp/stp.
> - aligned: In order to emit ldp/stp, first check if the load/store will
>   be aligned to 2 * element_size.
>
> gcc/ChangeLog:
>         * config/aarch64/aarch64-protos.h (struct tune_params): Add
>         appropriate enums for the policies.
>         * config/aarch64/aarch64-tuning-flags.def
>         (AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
>         options.
>         * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
>         function to parse ldp-policy option.
>         (aarch64_parse_stp_policy): New function to parse stp-policy
> option.
>         (aarch64_override_options_internal): Call parsing functions.
>         (aarch64_operands_ok_for_ldpstp): Add option-value check and
>         alignment check and remove superseded ones
>         (aarch64_operands_adjust_ok_for_ldpstp): Add option-value check and
>         alignment check and remove superseded ones.
>         * config/aarch64/aarch64.opt: Add options.
>
> gcc/testsuite/ChangeLog:
>         * gcc.target/aarch64/ldp_aligned.c: New test.
>         * gcc.target/aarch64/ldp_always.c: New test.
>         * gcc.target/aarch64/ldp_never.c: New test.
>         * gcc.target/aarch64/stp_aligned.c: New test.
>         * gcc.target/aarch64/stp_always.c: New test.
>         * gcc.target/aarch64/stp_never.c: New test.
>
> Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> ---
> Changes in v2:
>         - Fixed commited ldp tests to correctly trigger
>           and test aarch64_operands_adjust_ok_for_ldpstp in aarch64.cc.
>         - Added "-mcpu=generic" to commited tests to guarantee generic
> target code
>           generation and not cause the regressions of v1.
>
>  gcc/config/aarch64/aarch64-protos.h           |  24 ++
>  gcc/config/aarch64/aarch64-tuning-flags.def   |   8 -
>  gcc/config/aarch64/aarch64.cc                 | 229 ++++++++++++++----
>  gcc/config/aarch64/aarch64.opt                |   8 +
>  .../gcc.target/aarch64/ldp_aligned.c          |  66 +++++
>  gcc/testsuite/gcc.target/aarch64/ldp_always.c |  66 +++++
>  gcc/testsuite/gcc.target/aarch64/ldp_never.c  |  66 +++++
>  .../gcc.target/aarch64/stp_aligned.c          |  60 +++++
>  gcc/testsuite/gcc.target/aarch64/stp_always.c |  60 +++++
>  gcc/testsuite/gcc.target/aarch64/stp_never.c  |  60 +++++
>  10 files changed, 586 insertions(+), 61 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_always.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_never.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_aligned.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_always.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_never.c
>
> diff --git a/gcc/config/aarch64/aarch64-protos.h
> b/gcc/config/aarch64/aarch64-protos.h
> index 70303d6fd95..be1d73490ed 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -568,6 +568,30 @@ struct tune_params
>    /* Place prefetch struct pointer at the end to enable type checking
>       errors when tune_params misses elements (e.g., from erroneous
> merges).  */
>    const struct cpu_prefetch_tune *prefetch;
> +/* An enum specifying how to handle load pairs using a fine-grained
> policy:
> +   - LDP_POLICY_ALIGNED: Emit ldp if the source pointer is aligned
> +   to at least double the alignment of the type.
> +   - LDP_POLICY_ALWAYS: Emit ldp regardless of alignment.
> +   - LDP_POLICY_NEVER: Do not emit ldp.  */
> +
> +  enum aarch64_ldp_policy_model
> +  {
> +    LDP_POLICY_ALIGNED,
> +    LDP_POLICY_ALWAYS,
> +    LDP_POLICY_NEVER
> +  } ldp_policy_model;
> +/* An enum specifying how to handle store pairs using a fine-grained
> policy:
> +   - STP_POLICY_ALIGNED: Emit stp if the source pointer is aligned
> +   to at least double the alignment of the type.
> +   - STP_POLICY_ALWAYS: Emit stp regardless of alignment.
> +   - STP_POLICY_NEVER: Do not emit stp.  */
> +
> +  enum aarch64_stp_policy_model
> +  {
> +    STP_POLICY_ALIGNED,
> +    STP_POLICY_ALWAYS,
> +    STP_POLICY_NEVER
> +  } stp_policy_model;
>  };
>
>  /* Classifies an address.
> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def
> b/gcc/config/aarch64/aarch64-tuning-flags.def
> index 52112ba7c48..774568e9106 100644
> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> @@ -30,11 +30,6 @@
>
>  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
>
> -/* Don't create non-8 byte aligned load/store pair.  That is if the
> -two load/stores are not at least 8 byte aligned don't create load/store
> -pairs.   */
> -AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)
> -
>  /* Some of the optional shift to some arthematic instructions are
>     considered cheap.  Logical shift left <=4 with or without a
>     zero extend are considered cheap.  Sign extend; non logical shift left
> @@ -44,9 +39,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend",
> CHEAP_SHIFT_EXTEND)
>  /* Disallow load/store pair instructions on Q-registers.  */
>  AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)
>
> -/* Disallow load-pair instructions to be formed in combine/peephole.  */
> -AARCH64_EXTRA_TUNING_OPTION ("no_ldp_combine", NO_LDP_COMBINE)
> -
>  AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
>
>  AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index eba5d4a7e04..43d88c68647 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -1356,7 +1356,9 @@ static const struct tune_params generic_tunings =
>       Neoverse V1.  It does not have a noticeable effect on A64FX and
> should
>       have at most a very minor effect on SVE2 cores.  */
>    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS),   /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params cortexa35_tunings =
> @@ -1390,7 +1392,9 @@ static const struct tune_params cortexa35_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params cortexa53_tunings =
> @@ -1424,7 +1428,9 @@ static const struct tune_params cortexa53_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params cortexa57_tunings =
> @@ -1458,7 +1464,9 @@ static const struct tune_params cortexa57_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),        /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params cortexa72_tunings =
> @@ -1492,7 +1500,9 @@ static const struct tune_params cortexa72_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params cortexa73_tunings =
> @@ -1526,7 +1536,9 @@ static const struct tune_params cortexa73_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>
> @@ -1561,7 +1573,9 @@ static const struct tune_params exynosm1_tunings =
>    48,  /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> -  &exynosm1_prefetch_tune
> +  &exynosm1_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params thunderxt88_tunings =
> @@ -1593,8 +1607,10 @@ static const struct tune_params thunderxt88_tunings
> =
>    2,   /* min_div_recip_mul_df.  */
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),    /* tune_flags.  */
> -  &thunderxt88_prefetch_tune
> +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> +  &thunderxt88_prefetch_tune,
> +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
>  };
>
>  static const struct tune_params thunderx_tunings =
> @@ -1626,9 +1642,10 @@ static const struct tune_params thunderx_tunings =
>    2,   /* min_div_recip_mul_df.  */
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW
> -   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> -  &thunderx_prefetch_tune
> +  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /* tune_flags.  */
> +  &thunderx_prefetch_tune,
> +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
>  };
>
>  static const struct tune_params tsv110_tunings =
> @@ -1662,7 +1679,9 @@ static const struct tune_params tsv110_tunings =
>    0,    /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,     /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),     /* tune_flags.  */
> -  &tsv110_prefetch_tune
> +  &tsv110_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params xgene1_tunings =
> @@ -1695,7 +1714,9 @@ static const struct tune_params xgene1_tunings =
>    17,  /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> -  &xgene1_prefetch_tune
> +  &xgene1_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params emag_tunings =
> @@ -1728,7 +1749,9 @@ static const struct tune_params emag_tunings =
>    17,  /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_OFF,     /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),       /* tune_flags.  */
> -  &xgene1_prefetch_tune
> +  &xgene1_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params qdf24xx_tunings =
> @@ -1762,7 +1785,9 @@ static const struct tune_params qdf24xx_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS, /* tune_flags.  */
> -  &qdf24xx_prefetch_tune
> +  &qdf24xx_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  /* Tuning structure for the Qualcomm Saphira core.  Default to falkor
> values
> @@ -1798,7 +1823,9 @@ static const struct tune_params saphira_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),           /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params thunderx2t99_tunings =
> @@ -1832,7 +1859,9 @@ static const struct tune_params thunderx2t99_tunings
> =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> -  &thunderx2t99_prefetch_tune
> +  &thunderx2t99_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params thunderx3t110_tunings =
> @@ -1866,7 +1895,9 @@ static const struct tune_params
> thunderx3t110_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> -  &thunderx3t110_prefetch_tune
> +  &thunderx3t110_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params neoversen1_tunings =
> @@ -1899,7 +1930,9 @@ static const struct tune_params neoversen1_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),     /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params ampere1_tunings =
> @@ -1935,8 +1968,10 @@ static const struct tune_params ampere1_tunings =
>    2,   /* min_div_recip_mul_df.  */
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> -  &ampere1_prefetch_tune
> +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> +  &ampere1_prefetch_tune,
> +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
>  };
>
>  static const struct tune_params ampere1a_tunings =
> @@ -1973,8 +2008,10 @@ static const struct tune_params ampere1a_tunings =
>    2,   /* min_div_recip_mul_df.  */
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags.  */
> -  &ampere1_prefetch_tune
> +  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> +  &ampere1_prefetch_tune,
> +  tune_params::LDP_POLICY_ALIGNED,     /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALIGNED      /* stp_policy_model.  */
>  };
>
>  static const advsimd_vec_cost neoversev1_advsimd_vector_cost =
> @@ -2155,7 +2192,9 @@ static const struct tune_params neoversev1_tunings =
>     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>     | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const sve_vec_cost neoverse512tvb_sve_vector_cost =
> @@ -2292,7 +2331,9 @@ static const struct tune_params
> neoverse512tvb_tunings =
>    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const advsimd_vec_cost neoversen2_advsimd_vector_cost =
> @@ -2482,7 +2523,9 @@ static const struct tune_params neoversen2_tunings =
>     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const advsimd_vec_cost neoversev2_advsimd_vector_cost =
> @@ -2672,7 +2715,9 @@ static const struct tune_params neoversev2_tunings =
>     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  static const struct tune_params a64fx_tunings =
> @@ -2705,7 +2750,9 @@ static const struct tune_params a64fx_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
> -  &a64fx_prefetch_tune
> +  &a64fx_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,      /* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS       /* stp_policy_model.  */
>  };
>
>  /* Support for fine-grained override of the tuning structures.  */
> @@ -17726,6 +17773,50 @@ aarch64_parse_tune (const char *to_parse, const struct processor **res)
>    return AARCH_PARSE_INVALID_ARG;
>  }
>
> +/* Validate a command-line -mldp-policy option.  Parse the policy
> +   specified in STR and throw errors if appropriate.  */
> +
> +static bool
> +aarch64_parse_ldp_policy (const char *str, struct tune_params* tune)
> +{
> +  /* Check the value of the option to be one of the accepted.  */
> +  if (strcmp (str, "always") == 0)
> +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALWAYS;
> +  else if (strcmp (str, "never") == 0)
> +    tune->ldp_policy_model = tune_params::LDP_POLICY_NEVER;
> +  else if (strcmp (str, "aligned") == 0)
> +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALIGNED;
> +  else if (strcmp (str, "default") != 0)
> +    {
> +      error ("unknown value %qs for %<-mldp-policy%>", str);
> +      return false;
> +    }
> +
> +  return true;
> +}
> +
> +/* Validate a command-line -mstp-policy option.  Parse the policy
> +   specified in STR and throw errors if appropriate.  */
> +
> +static bool
> +aarch64_parse_stp_policy (const char *str, struct tune_params* tune)
> +{
> +  /* Check the value of the option to be one of the accepted.  */
> +  if (strcmp (str, "always") == 0)
> +      tune->stp_policy_model = tune_params::STP_POLICY_ALWAYS;
> +  else if (strcmp (str, "never") == 0)
> +      tune->stp_policy_model = tune_params::STP_POLICY_NEVER;
> +  else if (strcmp (str, "aligned") == 0)
> +      tune->stp_policy_model = tune_params::STP_POLICY_ALIGNED;
> +  else if (strcmp (str, "default") != 0)
> +    {
> +      error ("unknown value %qs for %<-mstp-policy%>", str);
> +      return false;
> +    }
> +
> +  return true;
> +}
> +
>  /* Parse TOKEN, which has length LENGTH to see if it is an option
>     described in FLAG.  If it is, return the index bit for that fusion type.
>     If not, error (printing OPTION_NAME) and return zero.  */
> @@ -18074,6 +18165,14 @@ aarch64_override_options_internal (struct gcc_options *opts)
>      aarch64_parse_override_string (opts->x_aarch64_override_tune_string,
>                                    &aarch64_tune_params);
>
> +  if (opts->x_aarch64_ldp_policy_string)
> +    aarch64_parse_ldp_policy (opts->x_aarch64_ldp_policy_string,
> +                             &aarch64_tune_params);
> +
> +  if (opts->x_aarch64_stp_policy_string)
> +    aarch64_parse_stp_policy (opts->x_aarch64_stp_policy_string,
> +                             &aarch64_tune_params);
> +
>    /* This target defaults to strict volatile bitfields.  */
>    if (opts->x_flag_strict_volatile_bitfields < 0 && abi_version_at_least (2))
>      opts->x_flag_strict_volatile_bitfields = 1;
> @@ -26382,18 +26481,14 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
>    enum reg_class rclass_1, rclass_2;
>    rtx mem_1, mem_2, reg_1, reg_2;
>
> -  /* Allow the tuning structure to disable LDP instruction formation
> -     from combining instructions (e.g., in peephole2).
> -     TODO: Implement fine-grained tuning control for LDP and STP:
> -          1. control policies for load and store separately;
> -          2. support the following policies:
> -             - default (use what is in the tuning structure)
> -             - always
> -             - never
> -             - aligned (only if the compiler can prove that the
> -               load will be aligned to 2 * element_size)  */
> -  if (load && (aarch64_tune_params.extra_tuning_flags
> -              & AARCH64_EXTRA_TUNE_NO_LDP_COMBINE))
> +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> +  if (load
> +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> +    return false;
> +
> +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> +  if (!load
> +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
>      return false;
>
>    if (load)
> @@ -26420,13 +26515,22 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
>    if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
>      return false;
>
> -  /* If we have SImode and slow unaligned ldp,
> -     check the alignment to be at least 8 byte. */
> -  if (mode == SImode
> -      && (aarch64_tune_params.extra_tuning_flags
> -          & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> +  /* If we have LDP_POLICY_ALIGNED,
> +     do not emit the load pair unless the alignment is checked to be
> +     at least double the alignment of the type.  */
> +  if (load
> +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
>        && !optimize_size
> -      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
> +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> +    return false;
> +
> +  /* If we have STP_POLICY_ALIGNED,
> +     do not emit the store pair unless the alignment is checked to be
> +     at least double the alignment of the type.  */
> +  if (!load
> +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> +      && !optimize_size
> +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
>      return false;
>
>    /* Check if the addresses are in the form of [base+offset].  */
> @@ -26556,6 +26660,16 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
>    HOST_WIDE_INT offvals[num_insns], msize;
>    rtx mem[num_insns], reg[num_insns], base[num_insns], offset[num_insns];
>
> +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> +  if (load
> +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> +    return false;
> +
> +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> +  if (!load
> +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
> +    return false;
> +
>    if (load)
>      {
>        for (int i = 0; i < num_insns; i++)
> @@ -26645,13 +26759,22 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
>    if (offvals[0] % msize != offvals[2] % msize)
>      return false;
>
> -  /* If we have SImode and slow unaligned ldp,
> -     check the alignment to be at least 8 byte. */
> -  if (mode == SImode
> -      && (aarch64_tune_params.extra_tuning_flags
> -         & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> +  /* If we have LDP_POLICY_ALIGNED,
> +     do not emit the load pair unless the alignment is checked to be
> +     at least double the alignment of the type.  */
> +  if (load
> +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> +      && !optimize_size
> +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> +    return false;
> +
> +  /* If we have STP_POLICY_ALIGNED,
> +     do not emit the store pair unless the alignment is checked to be
> +     at least double the alignment of the type.  */
> +  if (!load
> +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
>        && !optimize_size
> -      && MEM_ALIGN (mem[0]) < 8 * BITS_PER_UNIT)
> +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
>      return false;
>
>    return true;
> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> index 4a0580435a8..e5302947ce7 100644
> --- a/gcc/config/aarch64/aarch64.opt
> +++ b/gcc/config/aarch64/aarch64.opt
> @@ -205,6 +205,14 @@ msign-return-address=
>  Target WarnRemoved RejectNegative Joined Enum(aarch_ra_sign_scope_t) Var(aarch_ra_sign_scope) Init(AARCH_FUNCTION_NONE) Save
>  Select return address signing scope.
>
> +mldp-policy=
> +Target RejectNegative Joined Var(aarch64_ldp_policy_string) Save
> +Fine-grained policy for load pairs.
> +
> +mstp-policy=
> +Target RejectNegative Joined Var(aarch64_stp_policy_string) Save
> +Fine-grained policy for store pairs.
> +
>  Enum
>  Name(aarch_ra_sign_scope_t) Type(enum aarch_function_type)
>  Supported AArch64 return address signing scope (for use with -msign-return-address= option):
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> new file mode 100644
> index 00000000000..6e29b265168
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> @@ -0,0 +1,66 @@
> +/* { dg-options "-O2 -mldp-policy=aligned -mcpu=generic" } */
> +
> +#include <stdlib.h>
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +#define LDP_TEST_ALIGNED(TYPE) \
> +TYPE ldp_aligned_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    a_0 = arr[0]; \
> +    a_1 = arr[1]; \
> +    return a_0 + a_1; \
> +}
> +
> +#define LDP_TEST_UNALIGNED(TYPE) \
> +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a_0 = a[0]; \
> +    a_1 = a[1]; \
> +    return a_0 + a_1; \
> +}
> +
> +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1, a_2, a_3, a_4; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    a_0 = arr[100]; \
> +    a_1 = arr[101]; \
> +    a_2 = arr[102]; \
> +    a_3 = arr[103]; \
> +    a_4 = arr[110]; \
> +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> +}
> +
> +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1, a_2, a_3, a_4; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a_0 = a[100]; \
> +    a_1 = a[101]; \
> +    a_2 = a[102]; \
> +    a_3 = a[103]; \
> +    a_4 = a[110]; \
> +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> +}
> +
> +LDP_TEST_ALIGNED(int32_t);
> +LDP_TEST_ALIGNED(int64_t);
> +LDP_TEST_ALIGNED(v4si);
> +LDP_TEST_UNALIGNED(int32_t);
> +LDP_TEST_UNALIGNED(int64_t);
> +LDP_TEST_UNALIGNED(v4si);
> +LDP_TEST_ADJUST_ALIGNED(int32_t);
> +LDP_TEST_ADJUST_ALIGNED(int64_t);
> +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> +
> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_always.c b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> new file mode 100644
> index 00000000000..d2c4cf343e9
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> @@ -0,0 +1,66 @@
> +/* { dg-options "-O2 -mldp-policy=always -mcpu=generic" } */
> +
> +#include <stdlib.h>
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +#define LDP_TEST_ALIGNED(TYPE) \
> +TYPE ldp_aligned_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    a_0 = arr[0]; \
> +    a_1 = arr[1]; \
> +    return a_0 + a_1; \
> +}
> +
> +#define LDP_TEST_UNALIGNED(TYPE) \
> +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a_0 = a[0]; \
> +    a_1 = a[1]; \
> +    return a_0 + a_1; \
> +}
> +
> +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1, a_2, a_3, a_4; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    a_0 = arr[100]; \
> +    a_1 = arr[101]; \
> +    a_2 = arr[102]; \
> +    a_3 = arr[103]; \
> +    a_4 = arr[110]; \
> +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> +}
> +
> +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1, a_2, a_3, a_4; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a_0 = a[100]; \
> +    a_1 = a[101]; \
> +    a_2 = a[102]; \
> +    a_3 = a[103]; \
> +    a_4 = a[110]; \
> +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> +}
> +
> +LDP_TEST_ALIGNED(int32_t);
> +LDP_TEST_ALIGNED(int64_t);
> +LDP_TEST_ALIGNED(v4si);
> +LDP_TEST_UNALIGNED(int32_t);
> +LDP_TEST_UNALIGNED(int64_t);
> +LDP_TEST_UNALIGNED(v4si);
> +LDP_TEST_ADJUST_ALIGNED(int32_t);
> +LDP_TEST_ADJUST_ALIGNED(int64_t);
> +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> +
> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_never.c b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> new file mode 100644
> index 00000000000..f8a45ee18be
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> @@ -0,0 +1,66 @@
> +/* { dg-options "-O2 -mldp-policy=never -mcpu=generic" } */
> +
> +#include <stdlib.h>
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +#define LDP_TEST_ALIGNED(TYPE) \
> +TYPE ldp_aligned_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    a_0 = arr[0]; \
> +    a_1 = arr[1]; \
> +    return a_0 + a_1; \
> +}
> +
> +#define LDP_TEST_UNALIGNED(TYPE) \
> +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a_0 = a[0]; \
> +    a_1 = a[1]; \
> +    return a_0 + a_1; \
> +}
> +
> +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1, a_2, a_3, a_4; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    a_0 = arr[100]; \
> +    a_1 = arr[101]; \
> +    a_2 = arr[102]; \
> +    a_3 = arr[103]; \
> +    a_4 = arr[110]; \
> +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> +}
> +
> +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1, a_2, a_3, a_4; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a_0 = a[100]; \
> +    a_1 = a[101]; \
> +    a_2 = a[102]; \
> +    a_3 = a[103]; \
> +    a_4 = a[110]; \
> +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> +}
> +
> +LDP_TEST_ALIGNED(int32_t);
> +LDP_TEST_ALIGNED(int64_t);
> +LDP_TEST_ALIGNED(v4si);
> +LDP_TEST_UNALIGNED(int32_t);
> +LDP_TEST_UNALIGNED(int64_t);
> +LDP_TEST_UNALIGNED(v4si);
> +LDP_TEST_ADJUST_ALIGNED(int32_t);
> +LDP_TEST_ADJUST_ALIGNED(int64_t);
> +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> +
> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/stp_aligned.c b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> new file mode 100644
> index 00000000000..ae47b42efc4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> @@ -0,0 +1,60 @@
> +/* { dg-options "-O2 -mstp-policy=aligned -mcpu=generic" } */
> +
> +#include <stdlib.h>
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +#define STP_TEST_ALIGNED(TYPE) \
> +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    arr[0] = x; \
> +    arr[1] = x; \
> +    return arr; \
> +}
> +
> +#define STP_TEST_UNALIGNED(TYPE) \
> +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a[0] = x; \
> +    a[1] = x; \
> +    return a; \
> +}
> +
> +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    arr[100] = x; \
> +    arr[101] = x; \
> +    arr[102] = x; \
> +    arr[103] = x; \
> +    return arr; \
> +}
> +
> +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a[100] = x; \
> +    a[101] = x; \
> +    a[102] = x; \
> +    a[103] = x; \
> +    return a; \
> +}
> +
> +STP_TEST_ALIGNED(int32_t);
> +STP_TEST_ALIGNED(int64_t);
> +STP_TEST_ALIGNED(v4si);
> +STP_TEST_UNALIGNED(int32_t);
> +STP_TEST_UNALIGNED(int64_t);
> +STP_TEST_UNALIGNED(v4si);
> +STP_TEST_ADJUST_ALIGNED(int32_t);
> +STP_TEST_ADJUST_ALIGNED(int64_t);
> +STP_TEST_ADJUST_UNALIGNED(int32_t);
> +STP_TEST_ADJUST_UNALIGNED(int64_t);
> +
> +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/stp_always.c b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> new file mode 100644
> index 00000000000..c1c51f9ae88
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> @@ -0,0 +1,60 @@
> +/* { dg-options "-O2 -mstp-policy=always -mcpu=generic" } */
> +
> +#include <stdlib.h>
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +#define STP_TEST_ALIGNED(TYPE) \
> +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    arr[0] = x; \
> +    arr[1] = x; \
> +    return arr; \
> +}
> +
> +#define STP_TEST_UNALIGNED(TYPE) \
> +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a[0] = x; \
> +    a[1] = x; \
> +    return a; \
> +}
> +
> +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    arr[100] = x; \
> +    arr[101] = x; \
> +    arr[102] = x; \
> +    arr[103] = x; \
> +    return arr; \
> +}
> +
> +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a[100] = x; \
> +    a[101] = x; \
> +    a[102] = x; \
> +    a[103] = x; \
> +    return a; \
> +}
> +
> +STP_TEST_ALIGNED(int32_t);
> +STP_TEST_ALIGNED(int64_t);
> +STP_TEST_ALIGNED(v4si);
> +STP_TEST_UNALIGNED(int32_t);
> +STP_TEST_UNALIGNED(int64_t);
> +STP_TEST_UNALIGNED(v4si);
> +STP_TEST_ADJUST_ALIGNED(int32_t);
> +STP_TEST_ADJUST_ALIGNED(int64_t);
> +STP_TEST_ADJUST_UNALIGNED(int32_t);
> +STP_TEST_ADJUST_UNALIGNED(int64_t);
> +
> +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/stp_never.c b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> new file mode 100644
> index 00000000000..c28fcafa0ed
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> @@ -0,0 +1,60 @@
> +/* { dg-options "-O2 -mstp-policy=never -mcpu=generic" } */
> +
> +#include <stdlib.h>
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +#define STP_TEST_ALIGNED(TYPE) \
> +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    arr[0] = x; \
> +    arr[1] = x; \
> +    return arr; \
> +}
> +
> +#define STP_TEST_UNALIGNED(TYPE) \
> +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a[0] = x; \
> +    a[1] = x; \
> +    return a; \
> +}
> +
> +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    arr[100] = x; \
> +    arr[101] = x; \
> +    arr[102] = x; \
> +    arr[103] = x; \
> +    return arr; \
> +}
> +
> +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a[100] = x; \
> +    a[101] = x; \
> +    a[102] = x; \
> +    a[103] = x; \
> +    return a; \
> +}
> +
> +STP_TEST_ALIGNED(int32_t);
> +STP_TEST_ALIGNED(int64_t);
> +STP_TEST_ALIGNED(v4si);
> +STP_TEST_UNALIGNED(int32_t);
> +STP_TEST_UNALIGNED(int64_t);
> +STP_TEST_UNALIGNED(v4si);
> +STP_TEST_ADJUST_ALIGNED(int32_t);
> +STP_TEST_ADJUST_ALIGNED(int64_t);
> +STP_TEST_ADJUST_UNALIGNED(int32_t);
> +STP_TEST_ADJUST_UNALIGNED(int64_t);
> +
> +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> +
> --
> 2.40.1
>
>

-- 
Manos Anagnostakis | Compiler Engineer
E: manos.anagnostakis@vrull.eu

VRULL GmbH | Beatrixgasse 32 1030 Vienna
W: www.vrull.eu | LinkedIn <https://www.linkedin.com/company/vrull/>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] aarch64: Fine-grained ldp and stp policies with test-cases.
  2023-08-28 14:37 [PATCH v2] aarch64: Fine-grained ldp and stp policies with test-cases Manos Anagnostakis
  2023-09-22  8:51 ` [PING] " Manos Anagnostakis
@ 2023-09-28 13:17 ` Richard Sandiford
  2023-09-28 13:33   ` Manos Anagnostakis
  1 sibling, 1 reply; 8+ messages in thread
From: Richard Sandiford @ 2023-09-28 13:17 UTC (permalink / raw)
  To: Manos Anagnostakis; +Cc: gcc-patches, Kyrylo Tkachov, Philipp Tomsich

Thanks for the patch and sorry for the slow review.

Manos Anagnostakis <manos.anagnostakis@vrull.eu> writes:
> This patch implements the following TODO in gcc/config/aarch64/aarch64.cc
> to provide the requested behaviour for handling ldp and stp:
>
>   /* Allow the tuning structure to disable LDP instruction formation
>      from combining instructions (e.g., in peephole2).
>      TODO: Implement fine-grained tuning control for LDP and STP:
>            1. control policies for load and store separately;
>            2. support the following policies:
>               - default (use what is in the tuning structure)
>               - always
>               - never
>               - aligned (only if the compiler can prove that the
>                 load will be aligned to 2 * element_size)  */
>
> It provides two new and concrete command-line options -mldp-policy and -mstp-policy
> to give the ability to control load and store policies seperately as
> stated in part 1 of the TODO.
>
> The accepted values for both options are:
> - default: Use the ldp/stp policy defined in the corresponding tuning
>   structure.
> - always: Emit ldp/stp regardless of alignment.
> - never: Do not emit ldp/stp.
> - aligned: In order to emit ldp/stp, first check if the load/store will
>   be aligned to 2 * element_size.
>
> gcc/ChangeLog:
>         * config/aarch64/aarch64-protos.h (struct tune_params): Add
> 	appropriate enums for the policies.
>         * config/aarch64/aarch64-tuning-flags.def
> 	(AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
> 	options.
>         * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
> 	function to parse ldp-policy option.
>         (aarch64_parse_stp_policy): New function to parse stp-policy option.
>         (aarch64_override_options_internal): Call parsing functions.
>         (aarch64_operands_ok_for_ldpstp): Add option-value check and
> 	alignment check and remove superseded ones
>         (aarch64_operands_adjust_ok_for_ldpstp): Add option-value check and
> 	alignment check and remove superseded ones.
>         * config/aarch64/aarch64.opt: Add options.
>
> gcc/testsuite/ChangeLog:
>         * gcc.target/aarch64/ldp_aligned.c: New test.
>         * gcc.target/aarch64/ldp_always.c: New test.
>         * gcc.target/aarch64/ldp_never.c: New test.
>         * gcc.target/aarch64/stp_aligned.c: New test.
>         * gcc.target/aarch64/stp_always.c: New test.
>         * gcc.target/aarch64/stp_never.c: New test.
>
> Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> ---
> Changes in v2:
>         - Fixed commited ldp tests to correctly trigger
>           and test aarch64_operands_adjust_ok_for_ldpstp in aarch64.cc.
>         - Added "-mcpu=generic" to commited tests to guarantee generic target code
>           generation and not cause the regressions of v1.
>
>  gcc/config/aarch64/aarch64-protos.h           |  24 ++
>  gcc/config/aarch64/aarch64-tuning-flags.def   |   8 -
>  gcc/config/aarch64/aarch64.cc                 | 229 ++++++++++++++----
>  gcc/config/aarch64/aarch64.opt                |   8 +
>  .../gcc.target/aarch64/ldp_aligned.c          |  66 +++++
>  gcc/testsuite/gcc.target/aarch64/ldp_always.c |  66 +++++
>  gcc/testsuite/gcc.target/aarch64/ldp_never.c  |  66 +++++
>  .../gcc.target/aarch64/stp_aligned.c          |  60 +++++
>  gcc/testsuite/gcc.target/aarch64/stp_always.c |  60 +++++
>  gcc/testsuite/gcc.target/aarch64/stp_never.c  |  60 +++++
>  10 files changed, 586 insertions(+), 61 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_always.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_never.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_aligned.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_always.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_never.c
>
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index 70303d6fd95..be1d73490ed 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -568,6 +568,30 @@ struct tune_params
>    /* Place prefetch struct pointer at the end to enable type checking
>       errors when tune_params misses elements (e.g., from erroneous merges).  */
>    const struct cpu_prefetch_tune *prefetch;
> +/* An enum specifying how to handle load pairs using a fine-grained policy:
> +   - LDP_POLICY_ALIGNED: Emit ldp if the source pointer is aligned
> +   to at least double the alignment of the type.
> +   - LDP_POLICY_ALWAYS: Emit ldp regardless of alignment.
> +   - LDP_POLICY_NEVER: Do not emit ldp.  */
> +
> +  enum aarch64_ldp_policy_model
> +  {
> +    LDP_POLICY_ALIGNED,
> +    LDP_POLICY_ALWAYS,
> +    LDP_POLICY_NEVER
> +  } ldp_policy_model;
> +/* An enum specifying how to handle store pairs using a fine-grained policy:
> +   - STP_POLICY_ALIGNED: Emit stp if the source pointer is aligned
> +   to at least double the alignment of the type.
> +   - STP_POLICY_ALWAYS: Emit stp regardless of alignment.
> +   - STP_POLICY_NEVER: Do not emit stp.  */
> +
> +  enum aarch64_stp_policy_model
> +  {
> +    STP_POLICY_ALIGNED,
> +    STP_POLICY_ALWAYS,
> +    STP_POLICY_NEVER
> +  } stp_policy_model;
>  };

Generally the patch looks really good.  But I think we can use a single
enum type for both LDP and STP, with the values having the prefix
AARCH64_LDP_STP_POLICY.  That means that we only need one parser,
and that:

>  /* Classifies an address.
> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
> index 52112ba7c48..774568e9106 100644
> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> @@ -30,11 +30,6 @@
>  
>  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
>  
> -/* Don't create non-8 byte aligned load/store pair.  That is if the
> -two load/stores are not at least 8 byte aligned don't create load/store
> -pairs.   */
> -AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)
> -
>  /* Some of the optional shift to some arthematic instructions are
>     considered cheap.  Logical shift left <=4 with or without a
>     zero extend are considered cheap.  Sign extend; non logical shift left
> @@ -44,9 +39,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
>  /* Disallow load/store pair instructions on Q-registers.  */
>  AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)
>  
> -/* Disallow load-pair instructions to be formed in combine/peephole.  */
> -AARCH64_EXTRA_TUNING_OPTION ("no_ldp_combine", NO_LDP_COMBINE)
> -
>  AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
>  
>  AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index eba5d4a7e04..43d88c68647 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -1356,7 +1356,9 @@ static const struct tune_params generic_tunings =
>       Neoverse V1.  It does not have a noticeable effect on A64FX and should
>       have at most a very minor effect on SVE2 cores.  */
>    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS),	/* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
>  };
>  
>  static const struct tune_params cortexa35_tunings =
> @@ -1390,7 +1392,9 @@ static const struct tune_params cortexa35_tunings =
>    0,	/* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
>  };
>  
>  static const struct tune_params cortexa53_tunings =
> @@ -1424,7 +1428,9 @@ static const struct tune_params cortexa53_tunings =
>    0,	/* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
>  };
>  
>  static const struct tune_params cortexa57_tunings =
> @@ -1458,7 +1464,9 @@ static const struct tune_params cortexa57_tunings =
>    0,	/* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),	/* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
>  };
>  
>  static const struct tune_params cortexa72_tunings =
> @@ -1492,7 +1500,9 @@ static const struct tune_params cortexa72_tunings =
>    0,	/* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
>  };
>  
>  static const struct tune_params cortexa73_tunings =
> @@ -1526,7 +1536,9 @@ static const struct tune_params cortexa73_tunings =
>    0,	/* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
>  };
>  
>  
> @@ -1561,7 +1573,9 @@ static const struct tune_params exynosm1_tunings =
>    48,	/* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> -  &exynosm1_prefetch_tune
> +  &exynosm1_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
>  };
>  
>  static const struct tune_params thunderxt88_tunings =
> @@ -1593,8 +1607,10 @@ static const struct tune_params thunderxt88_tunings =
>    2,	/* min_div_recip_mul_df.  */
>    0,	/* max_case_values.  */
>    tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),	/* tune_flags.  */
> -  &thunderxt88_prefetch_tune
> +  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
> +  &thunderxt88_prefetch_tune,
> +  tune_params::LDP_POLICY_ALIGNED,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALIGNED	/* stp_policy_model.  */
>  };
>  
>  static const struct tune_params thunderx_tunings =
> @@ -1626,9 +1642,10 @@ static const struct tune_params thunderx_tunings =
>    2,	/* min_div_recip_mul_df.  */
>    0,	/* max_case_values.  */
>    tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW
> -   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),	/* tune_flags.  */
> -  &thunderx_prefetch_tune
> +  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),	/* tune_flags.  */
> +  &thunderx_prefetch_tune,
> +  tune_params::LDP_POLICY_ALIGNED,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALIGNED	/* stp_policy_model.  */
>  };
>  
>  static const struct tune_params tsv110_tunings =
> @@ -1662,7 +1679,9 @@ static const struct tune_params tsv110_tunings =
>    0,    /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,     /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),     /* tune_flags.  */
> -  &tsv110_prefetch_tune
> +  &tsv110_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
>  };
>  
>  static const struct tune_params xgene1_tunings =
> @@ -1695,7 +1714,9 @@ static const struct tune_params xgene1_tunings =
>    17,	/* max_case_values.  */
>    tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),	/* tune_flags.  */
> -  &xgene1_prefetch_tune
> +  &xgene1_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
>  };
>  
>  static const struct tune_params emag_tunings =
> @@ -1728,7 +1749,9 @@ static const struct tune_params emag_tunings =
>    17,	/* max_case_values.  */
>    tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),	/* tune_flags.  */
> -  &xgene1_prefetch_tune
> +  &xgene1_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
>  };
>  
>  static const struct tune_params qdf24xx_tunings =
> @@ -1762,7 +1785,9 @@ static const struct tune_params qdf24xx_tunings =
>    0,	/* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
>    AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS, /* tune_flags.  */
> -  &qdf24xx_prefetch_tune
> +  &qdf24xx_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
>  };
>  
>  /* Tuning structure for the Qualcomm Saphira core.  Default to falkor values
> @@ -1798,7 +1823,9 @@ static const struct tune_params saphira_tunings =
>    0,	/* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),		/* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
>  };
>  
>  static const struct tune_params thunderx2t99_tunings =
> @@ -1832,7 +1859,9 @@ static const struct tune_params thunderx2t99_tunings =
>    0,	/* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
> -  &thunderx2t99_prefetch_tune
> +  &thunderx2t99_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
>  };
>  
>  static const struct tune_params thunderx3t110_tunings =
> @@ -1866,7 +1895,9 @@ static const struct tune_params thunderx3t110_tunings =
>    0,	/* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
> -  &thunderx3t110_prefetch_tune
> +  &thunderx3t110_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
>  };
>  
>  static const struct tune_params neoversen1_tunings =
> @@ -1899,7 +1930,9 @@ static const struct tune_params neoversen1_tunings =
>    0,	/* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),	/* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
>  };
>  
>  static const struct tune_params ampere1_tunings =
> @@ -1935,8 +1968,10 @@ static const struct tune_params ampere1_tunings =
>    2,	/* min_div_recip_mul_df.  */
>    0,	/* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE),	/* tune_flags.  */
> -  &ampere1_prefetch_tune
> +  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
> +  &ampere1_prefetch_tune,
> +  tune_params::LDP_POLICY_ALIGNED,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALIGNED	/* stp_policy_model.  */
>  };
>  
>  static const struct tune_params ampere1a_tunings =
> @@ -1973,8 +2008,10 @@ static const struct tune_params ampere1a_tunings =
>    2,	/* min_div_recip_mul_df.  */
>    0,	/* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE),	/* tune_flags.  */
> -  &ampere1_prefetch_tune
> +  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
> +  &ampere1_prefetch_tune,
> +  tune_params::LDP_POLICY_ALIGNED,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALIGNED	/* stp_policy_model.  */
>  };
>  
>  static const advsimd_vec_cost neoversev1_advsimd_vector_cost =
> @@ -2155,7 +2192,9 @@ static const struct tune_params neoversev1_tunings =
>     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>     | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),	/* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
>  };
>  
>  static const sve_vec_cost neoverse512tvb_sve_vector_cost =
> @@ -2292,7 +2331,9 @@ static const struct tune_params neoverse512tvb_tunings =
>    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),	/* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
>  };
>  
>  static const advsimd_vec_cost neoversen2_advsimd_vector_cost =
> @@ -2482,7 +2523,9 @@ static const struct tune_params neoversen2_tunings =
>     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),	/* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
>  };
>  
>  static const advsimd_vec_cost neoversev2_advsimd_vector_cost =
> @@ -2672,7 +2715,9 @@ static const struct tune_params neoversev2_tunings =
>     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),	/* tune_flags.  */
> -  &generic_prefetch_tune
> +  &generic_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
>  };
>  
>  static const struct tune_params a64fx_tunings =
> @@ -2705,7 +2750,9 @@ static const struct tune_params a64fx_tunings =
>    0,	/* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
> -  &a64fx_prefetch_tune
> +  &a64fx_prefetch_tune,
> +  tune_params::LDP_POLICY_ALWAYS,	/* ldp_policy_model.  */
> +  tune_params::STP_POLICY_ALWAYS	/* stp_policy_model.  */
>  };
>  
>  /* Support for fine-grained override of the tuning structures.  */
> @@ -17726,6 +17773,50 @@ aarch64_parse_tune (const char *to_parse, const struct processor **res)
>    return AARCH_PARSE_INVALID_ARG;
>  }
>  
> +/* Validate a command-line -mldp-policy option.  Parse the policy
> +   specified in STR and throw errors if appropriate.  */
> +
> +static bool
> +aarch64_parse_ldp_policy (const char *str, struct tune_params* tune)
> +{
> +  /* Check the value of the option to be one of the accepted.  */
> +  if (strcmp (str, "always") == 0)
> +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALWAYS;
> +  else if (strcmp (str, "never") == 0)
> +    tune->ldp_policy_model = tune_params::LDP_POLICY_NEVER;
> +  else if (strcmp (str, "aligned") == 0)
> +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALIGNED;
> +  else if (strcmp (str, "default") != 0)
> +    {
> +      error ("unknown value %qs for %<-mldp-policy%>", str);
> +      return false;
> +    }
> +
> +  return true;
> +}
> +
> +/* Validate a command-line -mstp-policy option.  Parse the policy
> +   specified in STR and throw errors if appropriate.  */
> +
> +static bool
> +aarch64_parse_stp_policy (const char *str, struct tune_params* tune)
> +{
> +  /* Check the value of the option to be one of the accepted.  */
> +  if (strcmp (str, "always") == 0)
> +      tune->stp_policy_model = tune_params::STP_POLICY_ALWAYS;
> +  else if (strcmp (str, "never") == 0)
> +      tune->stp_policy_model = tune_params::STP_POLICY_NEVER;
> +  else if (strcmp (str, "aligned") == 0)
> +      tune->stp_policy_model = tune_params::STP_POLICY_ALIGNED;
> +  else if (strcmp (str, "default") != 0)
> +    {
> +      error ("unknown value %qs for %<-mstp-policy%>", str);
> +      return false;
> +    }
> +
> +  return true;
> +}
> +
>  /* Parse TOKEN, which has length LENGTH to see if it is an option
>     described in FLAG.  If it is, return the index bit for that fusion type.
>     If not, error (printing OPTION_NAME) and return zero.  */
> @@ -18074,6 +18165,14 @@ aarch64_override_options_internal (struct gcc_options *opts)
>      aarch64_parse_override_string (opts->x_aarch64_override_tune_string,
>  				   &aarch64_tune_params);
>  
> +  if (opts->x_aarch64_ldp_policy_string)
> +    aarch64_parse_ldp_policy (opts->x_aarch64_ldp_policy_string,
> +			      &aarch64_tune_params);
> +
> +  if (opts->x_aarch64_stp_policy_string)
> +    aarch64_parse_stp_policy (opts->x_aarch64_stp_policy_string,
> +			      &aarch64_tune_params);
> +
>    /* This target defaults to strict volatile bitfields.  */
>    if (opts->x_flag_strict_volatile_bitfields < 0 && abi_version_at_least (2))
>      opts->x_flag_strict_volatile_bitfields = 1;
> @@ -26382,18 +26481,14 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
>    enum reg_class rclass_1, rclass_2;
>    rtx mem_1, mem_2, reg_1, reg_2;
>  
> -  /* Allow the tuning structure to disable LDP instruction formation
> -     from combining instructions (e.g., in peephole2).
> -     TODO: Implement fine-grained tuning control for LDP and STP:
> -	   1. control policies for load and store separately;
> -	   2. support the following policies:
> -	      - default (use what is in the tuning structure)
> -	      - always
> -	      - never
> -	      - aligned (only if the compiler can prove that the
> -		load will be aligned to 2 * element_size)  */
> -  if (load && (aarch64_tune_params.extra_tuning_flags
> -	       & AARCH64_EXTRA_TUNE_NO_LDP_COMBINE))
> +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> +  if (load
> +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> +    return false;
> +
> +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> +  if (!load
> +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
>      return false;

...here we could do something like:

  auto policy = (load
		 ? aarch64_tune_params.ldp_policy_model
		 : aarch64_tune_params.stp_policy_model);

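Purely as an illustration of that direction (the enum name and values
below are placeholder spellings for this sketch, not something the
patch currently defines), the tuning structures could share a single:

  enum aarch64_ldp_stp_policy
  {
    AARCH64_LDP_STP_POLICY_DEFAULT,
    AARCH64_LDP_STP_POLICY_ALIGNED,
    AARCH64_LDP_STP_POLICY_ALWAYS,
    AARCH64_LDP_STP_POLICY_NEVER
  };

and, using the "policy" variable above, the two pairs of checks in
aarch64_operands_ok_for_ldpstp would collapse to:

  /* Reject the pair outright under a "never" policy.  */
  if (policy == AARCH64_LDP_STP_POLICY_NEVER)
    return false;

  /* Under an "aligned" policy, require twice the natural alignment.  */
  if (policy == AARCH64_LDP_STP_POLICY_ALIGNED
      && !optimize_size
      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
    return false;
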
Also:

>  
>    if (load)
> @@ -26420,13 +26515,22 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
>    if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
>      return false;
>  
> -  /* If we have SImode and slow unaligned ldp,
> -     check the alignment to be at least 8 byte. */
> -  if (mode == SImode
> -      && (aarch64_tune_params.extra_tuning_flags
> -          & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> +  /* If we have LDP_POLICY_ALIGNED,
> +     do not emit the load pair unless the alignment is checked to be
> +     at least double the alignment of the type.  */
> +  if (load
> +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
>        && !optimize_size
> -      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
> +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> +    return false;
> +
> +  /* If we have STP_POLICY_ALIGNED,
> +     do not emit the store pair unless the alignment is checked to be
> +     at least double the alignment of the type.  */
> +  if (!load
> +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
> +      && !optimize_size
> +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
>      return false;
>  
>    /* Check if the addresses are in the form of [base+offset].  */
> @@ -26556,6 +26660,16 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
>    HOST_WIDE_INT offvals[num_insns], msize;
>    rtx mem[num_insns], reg[num_insns], base[num_insns], offset[num_insns];
>  
> +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> +  if (load
> +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
> +    return false;
> +
> +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> +  if (!load
> +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
> +    return false;
> +
>    if (load)
>      {
>        for (int i = 0; i < num_insns; i++)
> @@ -26645,13 +26759,22 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
>    if (offvals[0] % msize != offvals[2] % msize)
>      return false;
>  
> -  /* If we have SImode and slow unaligned ldp,
> -     check the alignment to be at least 8 byte. */
> -  if (mode == SImode
> -      && (aarch64_tune_params.extra_tuning_flags
> -	  & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> +  /* If we have LDP_POLICY_ALIGNED,
> +     do not emit the load pair unless the alignment is checked to be
> +     at least double the alignment of the type.  */
> +  if (load
> +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
> +      && !optimize_size
> +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> +    return false;
> +
> +  /* If we have STP_POLICY_ALIGNED,
> +     do not emit the store pair unless the alignment is checked to be
> +     at least double the alignment of the type.  */
> +  if (!load
> +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
>        && !optimize_size
> -      && MEM_ALIGN (mem[0]) < 8 * BITS_PER_UNIT)
> +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
>      return false;
>  
>    return true;
> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> index 4a0580435a8..e5302947ce7 100644
> --- a/gcc/config/aarch64/aarch64.opt
> +++ b/gcc/config/aarch64/aarch64.opt
> @@ -205,6 +205,14 @@ msign-return-address=
>  Target WarnRemoved RejectNegative Joined Enum(aarch_ra_sign_scope_t) Var(aarch_ra_sign_scope) Init(AARCH_FUNCTION_NONE) Save
>  Select return address signing scope.
>  
> +mldp-policy=
> +Target RejectNegative Joined Var(aarch64_ldp_policy_string) Save
> +Fine-grained policy for load pairs.
> +
> +mstp-policy=
> +Target RejectNegative Joined Var(aarch64_stp_policy_string) Save
> +Fine-grained policy for store pairs.
> +
>  Enum
>  Name(aarch_ra_sign_scope_t) Type(enum aarch_function_type)
>  Supported AArch64 return address signing scope (for use with -msign-return-address= option):
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> new file mode 100644
> index 00000000000..6e29b265168
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> @@ -0,0 +1,66 @@
> +/* { dg-options "-O2 -mldp-policy=aligned -mcpu=generic" } */
> +
> +#include <stdlib.h>
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +#define LDP_TEST_ALIGNED(TYPE) \
> +TYPE ldp_aligned_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    a_0 = arr[0]; \
> +    a_1 = arr[1]; \
> +    return a_0 + a_1; \
> +}
> +
> +#define LDP_TEST_UNALIGNED(TYPE) \
> +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a_0 = a[0]; \
> +    a_1 = a[1]; \
> +    return a_0 + a_1; \
> +}
> +
> +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1, a_2, a_3, a_4; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    a_0 = arr[100]; \
> +    a_1 = arr[101]; \
> +    a_2 = arr[102]; \
> +    a_3 = arr[103]; \
> +    a_4 = arr[110]; \
> +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> +}
> +
> +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1, a_2, a_3, a_4; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a_0 = a[100]; \
> +    a_1 = a[101]; \
> +    a_2 = a[102]; \
> +    a_3 = a[103]; \
> +    a_4 = a[110]; \
> +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> +}
> +
> +LDP_TEST_ALIGNED(int32_t);
> +LDP_TEST_ALIGNED(int64_t);
> +LDP_TEST_ALIGNED(v4si);
> +LDP_TEST_UNALIGNED(int32_t);
> +LDP_TEST_UNALIGNED(int64_t);
> +LDP_TEST_UNALIGNED(v4si);
> +LDP_TEST_ADJUST_ALIGNED(int32_t);
> +LDP_TEST_ADJUST_ALIGNED(int64_t);
> +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> +
> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */

It might be better to split this into two tests, one for the aligned
accesses and one for the unaligned accesses.  Same for the store version.
(Splitting isn't necessary or useful for =always and =never though.)
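As a purely illustrative sketch (the file name is made up), the
unaligned half could move to e.g. ldp_unaligned.c, which would only
need to check that no pair is formed under -mldp-policy=aligned:

  /* { dg-final { scan-assembler-not "ldp\t" } } */

while ldp_aligned.c would keep the *_aligned functions and the
existing scan-assembler-times counts.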

Thanks,
Richard

> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_always.c b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> new file mode 100644
> index 00000000000..d2c4cf343e9
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> @@ -0,0 +1,66 @@
> +/* { dg-options "-O2 -mldp-policy=always -mcpu=generic" } */
> +
> +#include <stdlib.h>
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +#define LDP_TEST_ALIGNED(TYPE) \
> +TYPE ldp_aligned_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    a_0 = arr[0]; \
> +    a_1 = arr[1]; \
> +    return a_0 + a_1; \
> +}
> +
> +#define LDP_TEST_UNALIGNED(TYPE) \
> +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a_0 = a[0]; \
> +    a_1 = a[1]; \
> +    return a_0 + a_1; \
> +}
> +
> +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1, a_2, a_3, a_4; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    a_0 = arr[100]; \
> +    a_1 = arr[101]; \
> +    a_2 = arr[102]; \
> +    a_3 = arr[103]; \
> +    a_4 = arr[110]; \
> +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> +}
> +
> +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1, a_2, a_3, a_4; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a_0 = a[100]; \
> +    a_1 = a[101]; \
> +    a_2 = a[102]; \
> +    a_3 = a[103]; \
> +    a_4 = a[110]; \
> +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> +}
> +
> +LDP_TEST_ALIGNED(int32_t);
> +LDP_TEST_ALIGNED(int64_t);
> +LDP_TEST_ALIGNED(v4si);
> +LDP_TEST_UNALIGNED(int32_t);
> +LDP_TEST_UNALIGNED(int64_t);
> +LDP_TEST_UNALIGNED(v4si);
> +LDP_TEST_ADJUST_ALIGNED(int32_t);
> +LDP_TEST_ADJUST_ALIGNED(int64_t);
> +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> +
> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_never.c b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> new file mode 100644
> index 00000000000..f8a45ee18be
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> @@ -0,0 +1,66 @@
> +/* { dg-options "-O2 -mldp-policy=never -mcpu=generic" } */
> +
> +#include <stdlib.h>
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +#define LDP_TEST_ALIGNED(TYPE) \
> +TYPE ldp_aligned_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    a_0 = arr[0]; \
> +    a_1 = arr[1]; \
> +    return a_0 + a_1; \
> +}
> +
> +#define LDP_TEST_UNALIGNED(TYPE) \
> +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a_0 = a[0]; \
> +    a_1 = a[1]; \
> +    return a_0 + a_1; \
> +}
> +
> +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1, a_2, a_3, a_4; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    a_0 = arr[100]; \
> +    a_1 = arr[101]; \
> +    a_2 = arr[102]; \
> +    a_3 = arr[103]; \
> +    a_4 = arr[110]; \
> +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> +}
> +
> +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> +    TYPE a_0, a_1, a_2, a_3, a_4; \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a_0 = a[100]; \
> +    a_1 = a[101]; \
> +    a_2 = a[102]; \
> +    a_3 = a[103]; \
> +    a_4 = a[110]; \
> +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> +}
> +
> +LDP_TEST_ALIGNED(int32_t);
> +LDP_TEST_ALIGNED(int64_t);
> +LDP_TEST_ALIGNED(v4si);
> +LDP_TEST_UNALIGNED(int32_t);
> +LDP_TEST_UNALIGNED(int64_t);
> +LDP_TEST_UNALIGNED(v4si);
> +LDP_TEST_ADJUST_ALIGNED(int32_t);
> +LDP_TEST_ADJUST_ALIGNED(int64_t);
> +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> +
> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/stp_aligned.c b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> new file mode 100644
> index 00000000000..ae47b42efc4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> @@ -0,0 +1,60 @@
> +/* { dg-options "-O2 -mstp-policy=aligned -mcpu=generic" } */
> +
> +#include <stdlib.h>
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +#define STP_TEST_ALIGNED(TYPE) \
> +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    arr[0] = x; \
> +    arr[1] = x; \
> +    return arr; \
> +}
> +
> +#define STP_TEST_UNALIGNED(TYPE) \
> +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a[0] = x; \
> +    a[1] = x; \
> +    return a; \
> +}
> +
> +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    arr[100] = x; \
> +    arr[101] = x; \
> +    arr[102] = x; \
> +    arr[103] = x; \
> +    return arr; \
> +}
> +
> +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a[100] = x; \
> +    a[101] = x; \
> +    a[102] = x; \
> +    a[103] = x; \
> +    return a; \
> +}
> +
> +STP_TEST_ALIGNED(int32_t);
> +STP_TEST_ALIGNED(int64_t);
> +STP_TEST_ALIGNED(v4si);
> +STP_TEST_UNALIGNED(int32_t);
> +STP_TEST_UNALIGNED(int64_t);
> +STP_TEST_UNALIGNED(v4si);
> +STP_TEST_ADJUST_ALIGNED(int32_t);
> +STP_TEST_ADJUST_ALIGNED(int64_t);
> +STP_TEST_ADJUST_UNALIGNED(int32_t);
> +STP_TEST_ADJUST_UNALIGNED(int64_t);
> +
> +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/stp_always.c b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> new file mode 100644
> index 00000000000..c1c51f9ae88
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> @@ -0,0 +1,60 @@
> +/* { dg-options "-O2 -mstp-policy=always -mcpu=generic" } */
> +
> +#include <stdlib.h>
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +#define STP_TEST_ALIGNED(TYPE) \
> +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    arr[0] = x; \
> +    arr[1] = x; \
> +    return arr; \
> +}
> +
> +#define STP_TEST_UNALIGNED(TYPE) \
> +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a[0] = x; \
> +    a[1] = x; \
> +    return a; \
> +}
> +
> +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    arr[100] = x; \
> +    arr[101] = x; \
> +    arr[102] = x; \
> +    arr[103] = x; \
> +    return arr; \
> +}
> +
> +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a[100] = x; \
> +    a[101] = x; \
> +    a[102] = x; \
> +    a[103] = x; \
> +    return a; \
> +}
> +
> +STP_TEST_ALIGNED(int32_t);
> +STP_TEST_ALIGNED(int64_t);
> +STP_TEST_ALIGNED(v4si);
> +STP_TEST_UNALIGNED(int32_t);
> +STP_TEST_UNALIGNED(int64_t);
> +STP_TEST_UNALIGNED(v4si);
> +STP_TEST_ADJUST_ALIGNED(int32_t);
> +STP_TEST_ADJUST_ALIGNED(int64_t);
> +STP_TEST_ADJUST_UNALIGNED(int32_t);
> +STP_TEST_ADJUST_UNALIGNED(int64_t);
> +
> +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/stp_never.c b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> new file mode 100644
> index 00000000000..c28fcafa0ed
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> @@ -0,0 +1,60 @@
> +/* { dg-options "-O2 -mstp-policy=never -mcpu=generic" } */
> +
> +#include <stdlib.h>
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +#define STP_TEST_ALIGNED(TYPE) \
> +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    arr[0] = x; \
> +    arr[1] = x; \
> +    return arr; \
> +}
> +
> +#define STP_TEST_UNALIGNED(TYPE) \
> +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a[0] = x; \
> +    a[1] = x; \
> +    return a; \
> +}
> +
> +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    arr[100] = x; \
> +    arr[101] = x; \
> +    arr[102] = x; \
> +    arr[103] = x; \
> +    return arr; \
> +}
> +
> +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> +    TYPE *a = arr+1; \
> +    a[100] = x; \
> +    a[101] = x; \
> +    a[102] = x; \
> +    a[103] = x; \
> +    return a; \
> +}
> +
> +STP_TEST_ALIGNED(int32_t);
> +STP_TEST_ALIGNED(int64_t);
> +STP_TEST_ALIGNED(v4si);
> +STP_TEST_UNALIGNED(int32_t);
> +STP_TEST_UNALIGNED(int64_t);
> +STP_TEST_UNALIGNED(v4si);
> +STP_TEST_ADJUST_ALIGNED(int32_t);
> +STP_TEST_ADJUST_ALIGNED(int64_t);
> +STP_TEST_ADJUST_UNALIGNED(int32_t);
> +STP_TEST_ADJUST_UNALIGNED(int64_t);
> +
> +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> +

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] aarch64: Fine-grained ldp and stp policies with test-cases.
  2023-09-28 13:17 ` Richard Sandiford
@ 2023-09-28 13:33   ` Manos Anagnostakis
  2023-09-28 13:37     ` Philipp Tomsich
  2023-09-28 13:42     ` Richard Sandiford
  0 siblings, 2 replies; 8+ messages in thread
From: Manos Anagnostakis @ 2023-09-28 13:33 UTC (permalink / raw)
  To: richard.sandiford
  Cc: Kyrylo Tkachov, Philipp Tomsich, gcc-patches, Tamar Christina

Hey Richard,

Thanks for taking the time to review this, but it was committed yesterday
after getting reviewed by Kyrill and Tamar.

Discussions:
https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631285.html
https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631300.html
https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631389.html

Committed version:
https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631484.html

Manos.

On Thu, Sep 28, 2023 at 4:17 PM Richard Sandiford <richard.sandiford@arm.com>
wrote:

> Thanks for the patch and sorry for the slow review.
>
> Manos Anagnostakis <manos.anagnostakis@vrull.eu> writes:
> > This patch implements the following TODO in gcc/config/aarch64/aarch64.cc
> > to provide the requested behaviour for handling ldp and stp:
> >
> >   /* Allow the tuning structure to disable LDP instruction formation
> >      from combining instructions (e.g., in peephole2).
> >      TODO: Implement fine-grained tuning control for LDP and STP:
> >            1. control policies for load and store separately;
> >            2. support the following policies:
> >               - default (use what is in the tuning structure)
> >               - always
> >               - never
> >               - aligned (only if the compiler can prove that the
> >                 load will be aligned to 2 * element_size)  */
> >
> > It provides two new and concrete command-line options -mldp-policy and
> -mstp-policy
> > to give the ability to control load and store policies separately as
> > stated in part 1 of the TODO.
> >
> > The accepted values for both options are:
> > - default: Use the ldp/stp policy defined in the corresponding tuning
> >   structure.
> > - always: Emit ldp/stp regardless of alignment.
> > - never: Do not emit ldp/stp.
> > - aligned: In order to emit ldp/stp, first check if the load/store will
> >   be aligned to 2 * element_size.
> >
> > gcc/ChangeLog:
> >         * config/aarch64/aarch64-protos.h (struct tune_params): Add
> >       appropriate enums for the policies.
> >         * config/aarch64/aarch64-tuning-flags.def
> >       (AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
> >       options.
> >         * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
> >       function to parse ldp-policy option.
> >         (aarch64_parse_stp_policy): New function to parse stp-policy
> option.
> >         (aarch64_override_options_internal): Call parsing functions.
> >         (aarch64_operands_ok_for_ldpstp): Add option-value check and
> >       alignment check and remove superseded ones
> >         (aarch64_operands_adjust_ok_for_ldpstp): Add option-value check
> and
> >       alignment check and remove superseded ones.
> >         * config/aarch64/aarch64.opt: Add options.
> >
> > gcc/testsuite/ChangeLog:
> >         * gcc.target/aarch64/ldp_aligned.c: New test.
> >         * gcc.target/aarch64/ldp_always.c: New test.
> >         * gcc.target/aarch64/ldp_never.c: New test.
> >         * gcc.target/aarch64/stp_aligned.c: New test.
> >         * gcc.target/aarch64/stp_always.c: New test.
> >         * gcc.target/aarch64/stp_never.c: New test.
> >
> > Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> > ---
> > Changes in v2:
> >         - Fixed committed ldp tests to correctly trigger
> >           and test aarch64_operands_adjust_ok_for_ldpstp in aarch64.cc.
> >         - Added "-mcpu=generic" to committed tests to guarantee generic
> target code
> >           generation and not cause the regressions of v1.
> >
> >  gcc/config/aarch64/aarch64-protos.h           |  24 ++
> >  gcc/config/aarch64/aarch64-tuning-flags.def   |   8 -
> >  gcc/config/aarch64/aarch64.cc                 | 229 ++++++++++++++----
> >  gcc/config/aarch64/aarch64.opt                |   8 +
> >  .../gcc.target/aarch64/ldp_aligned.c          |  66 +++++
> >  gcc/testsuite/gcc.target/aarch64/ldp_always.c |  66 +++++
> >  gcc/testsuite/gcc.target/aarch64/ldp_never.c  |  66 +++++
> >  .../gcc.target/aarch64/stp_aligned.c          |  60 +++++
> >  gcc/testsuite/gcc.target/aarch64/stp_always.c |  60 +++++
> >  gcc/testsuite/gcc.target/aarch64/stp_never.c  |  60 +++++
> >  10 files changed, 586 insertions(+), 61 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_always.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_never.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_always.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_never.c
> >
> > diff --git a/gcc/config/aarch64/aarch64-protos.h
> b/gcc/config/aarch64/aarch64-protos.h
> > index 70303d6fd95..be1d73490ed 100644
> > --- a/gcc/config/aarch64/aarch64-protos.h
> > +++ b/gcc/config/aarch64/aarch64-protos.h
> > @@ -568,6 +568,30 @@ struct tune_params
> >    /* Place prefetch struct pointer at the end to enable type checking
> >       errors when tune_params misses elements (e.g., from erroneous
> merges).  */
> >    const struct cpu_prefetch_tune *prefetch;
> > +/* An enum specifying how to handle load pairs using a fine-grained
> policy:
> > +   - LDP_POLICY_ALIGNED: Emit ldp if the source pointer is aligned
> > +   to at least double the alignment of the type.
> > +   - LDP_POLICY_ALWAYS: Emit ldp regardless of alignment.
> > +   - LDP_POLICY_NEVER: Do not emit ldp.  */
> > +
> > +  enum aarch64_ldp_policy_model
> > +  {
> > +    LDP_POLICY_ALIGNED,
> > +    LDP_POLICY_ALWAYS,
> > +    LDP_POLICY_NEVER
> > +  } ldp_policy_model;
> > +/* An enum specifying how to handle store pairs using a fine-grained
> policy:
> > +   - STP_POLICY_ALIGNED: Emit stp if the source pointer is aligned
> > +   to at least double the alignment of the type.
> > +   - STP_POLICY_ALWAYS: Emit stp regardless of alignment.
> > +   - STP_POLICY_NEVER: Do not emit stp.  */
> > +
> > +  enum aarch64_stp_policy_model
> > +  {
> > +    STP_POLICY_ALIGNED,
> > +    STP_POLICY_ALWAYS,
> > +    STP_POLICY_NEVER
> > +  } stp_policy_model;
> >  };
>
> Generally the patch looks really good.  But I think we can use a single
> enum type for both LDP and STP, with the values having the prefix
> AARCH64_LDP_STP_POLICY.  That means that we only need one parser,
> and that:
>
> >  /* Classifies an address.
> > diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def
> b/gcc/config/aarch64/aarch64-tuning-flags.def
> > index 52112ba7c48..774568e9106 100644
> > --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> > +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> > @@ -30,11 +30,6 @@
> >
> >  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
> >
> > -/* Don't create non-8 byte aligned load/store pair.  That is if the
> > -two load/stores are not at least 8 byte aligned don't create load/store
> > -pairs.   */
> > -AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)
> > -
> >  /* Some of the optional shift to some arthematic instructions are
> >     considered cheap.  Logical shift left <=4 with or without a
> >     zero extend are considered cheap.  Sign extend; non logical shift
> left
> > @@ -44,9 +39,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend",
> CHEAP_SHIFT_EXTEND)
> >  /* Disallow load/store pair instructions on Q-registers.  */
> >  AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)
> >
> > -/* Disallow load-pair instructions to be formed in combine/peephole.  */
> > -AARCH64_EXTRA_TUNING_OPTION ("no_ldp_combine", NO_LDP_COMBINE)
> > -
> >  AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
> >
> >  AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants",
> CSE_SVE_VL_CONSTANTS)
> > diff --git a/gcc/config/aarch64/aarch64.cc
> b/gcc/config/aarch64/aarch64.cc
> > index eba5d4a7e04..43d88c68647 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -1356,7 +1356,9 @@ static const struct tune_params generic_tunings =
> >       Neoverse V1.  It does not have a noticeable effect on A64FX and
> should
> >       have at most a very minor effect on SVE2 cores.  */
> >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS), /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params cortexa35_tunings =
> > @@ -1390,7 +1392,9 @@ static const struct tune_params cortexa35_tunings =
> >    0, /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params cortexa53_tunings =
> > @@ -1424,7 +1428,9 @@ static const struct tune_params cortexa53_tunings =
> >    0, /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params cortexa57_tunings =
> > @@ -1458,7 +1464,9 @@ static const struct tune_params cortexa57_tunings =
> >    0, /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),      /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params cortexa72_tunings =
> > @@ -1492,7 +1500,9 @@ static const struct tune_params cortexa72_tunings =
> >    0, /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params cortexa73_tunings =
> > @@ -1526,7 +1536,9 @@ static const struct tune_params cortexa73_tunings =
> >    0, /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >  };
> >
> >
> > @@ -1561,7 +1573,9 @@ static const struct tune_params exynosm1_tunings =
> >    48,        /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> > -  &exynosm1_prefetch_tune
> > +  &exynosm1_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params thunderxt88_tunings =
> > @@ -1593,8 +1607,10 @@ static const struct tune_params
> thunderxt88_tunings =
> >    2, /* min_div_recip_mul_df.  */
> >    0, /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_OFF,   /* autoprefetcher_model.  */
> > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),  /* tune_flags.  */
> > -  &thunderxt88_prefetch_tune
> > +  (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> > +  &thunderxt88_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALIGNED,   /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALIGNED    /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params thunderx_tunings =
> > @@ -1626,9 +1642,10 @@ static const struct tune_params thunderx_tunings =
> >    2, /* min_div_recip_mul_df.  */
> >    0, /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_OFF,   /* autoprefetcher_model.  */
> > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW
> > -   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND), /* tune_flags.  */
> > -  &thunderx_prefetch_tune
> > +  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> > +  &thunderx_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALIGNED,   /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALIGNED    /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params tsv110_tunings =
> > @@ -1662,7 +1679,9 @@ static const struct tune_params tsv110_tunings =
> >    0,    /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,     /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE),     /* tune_flags.  */
> > -  &tsv110_prefetch_tune
> > +  &tsv110_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params xgene1_tunings =
> > @@ -1695,7 +1714,9 @@ static const struct tune_params xgene1_tunings =
> >    17,        /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_OFF,   /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),     /* tune_flags.  */
> > -  &xgene1_prefetch_tune
> > +  &xgene1_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params emag_tunings =
> > @@ -1728,7 +1749,9 @@ static const struct tune_params emag_tunings =
> >    17,        /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_OFF,   /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),     /* tune_flags.  */
> > -  &xgene1_prefetch_tune
> > +  &xgene1_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params qdf24xx_tunings =
> > @@ -1762,7 +1785,9 @@ static const struct tune_params qdf24xx_tunings =
> >    0, /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >    AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS, /* tune_flags.  */
> > -  &qdf24xx_prefetch_tune
> > +  &qdf24xx_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >  };
> >
> >  /* Tuning structure for the Qualcomm Saphira core.  Default to falkor
> values
> > @@ -1798,7 +1823,9 @@ static const struct tune_params saphira_tunings =
> >    0, /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE),         /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params thunderx2t99_tunings =
> > @@ -1832,7 +1859,9 @@ static const struct tune_params
> thunderx2t99_tunings =
> >    0, /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> > -  &thunderx2t99_prefetch_tune
> > +  &thunderx2t99_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params thunderx3t110_tunings =
> > @@ -1866,7 +1895,9 @@ static const struct tune_params
> thunderx3t110_tunings =
> >    0, /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> > -  &thunderx3t110_prefetch_tune
> > +  &thunderx3t110_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params neoversen1_tunings =
> > @@ -1899,7 +1930,9 @@ static const struct tune_params neoversen1_tunings
> =
> >    0, /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params ampere1_tunings =
> > @@ -1935,8 +1968,10 @@ static const struct tune_params ampere1_tunings =
> >    2, /* min_div_recip_mul_df.  */
> >    0, /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE),       /* tune_flags.  */
> > -  &ampere1_prefetch_tune
> > +  (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> > +  &ampere1_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALIGNED,   /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALIGNED    /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params ampere1a_tunings =
> > @@ -1973,8 +2008,10 @@ static const struct tune_params ampere1a_tunings =
> >    2, /* min_div_recip_mul_df.  */
> >    0, /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE),       /* tune_flags.  */
> > -  &ampere1_prefetch_tune
> > +  (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> > +  &ampere1_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALIGNED,   /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALIGNED    /* stp_policy_model.  */
> >  };
> >
> >  static const advsimd_vec_cost neoversev1_advsimd_vector_cost =
> > @@ -2155,7 +2192,9 @@ static const struct tune_params neoversev1_tunings
> =
> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >     | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND), /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >  };
> >
> >  static const sve_vec_cost neoverse512tvb_sve_vector_cost =
> > @@ -2292,7 +2331,9 @@ static const struct tune_params
> neoverse512tvb_tunings =
> >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >  };
> >
> >  static const advsimd_vec_cost neoversen2_advsimd_vector_cost =
> > @@ -2482,7 +2523,9 @@ static const struct tune_params neoversen2_tunings
> =
> >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >  };
> >
> >  static const advsimd_vec_cost neoversev2_advsimd_vector_cost =
> > @@ -2672,7 +2715,9 @@ static const struct tune_params neoversev2_tunings
> =
> >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
> > -  &generic_prefetch_tune
> > +  &generic_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >  };
> >
> >  static const struct tune_params a64fx_tunings =
> > @@ -2705,7 +2750,9 @@ static const struct tune_params a64fx_tunings =
> >    0, /* max_case_values.  */
> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> > -  &a64fx_prefetch_tune
> > +  &a64fx_prefetch_tune,
> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >  };
> >
> >  /* Support for fine-grained override of the tuning structures.  */
> > @@ -17726,6 +17773,50 @@ aarch64_parse_tune (const char *to_parse, const
> struct processor **res)
> >    return AARCH_PARSE_INVALID_ARG;
> >  }
> >
> > +/* Validate a command-line -mldp-policy option.  Parse the policy
> > +   specified in STR and throw errors if appropriate.  */
> > +
> > +static bool
> > +aarch64_parse_ldp_policy (const char *str, struct tune_params* tune)
> > +{
> > +  /* Check the value of the option to be one of the accepted.  */
> > +  if (strcmp (str, "always") == 0)
> > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALWAYS;
> > +  else if (strcmp (str, "never") == 0)
> > +    tune->ldp_policy_model = tune_params::LDP_POLICY_NEVER;
> > +  else if (strcmp (str, "aligned") == 0)
> > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALIGNED;
> > +  else if (strcmp (str, "default") != 0)
> > +    {
> > +      error ("unknown value %qs for %<-mldp-policy%>", str);
> > +      return false;
> > +    }
> > +
> > +  return true;
> > +}
> > +
> > +/* Validate a command-line -mstp-policy option.  Parse the policy
> > +   specified in STR and throw errors if appropriate.  */
> > +
> > +static bool
> > +aarch64_parse_stp_policy (const char *str, struct tune_params* tune)
> > +{
> > +  /* Check the value of the option to be one of the accepted.  */
> > +  if (strcmp (str, "always") == 0)
> > +      tune->stp_policy_model = tune_params::STP_POLICY_ALWAYS;
> > +  else if (strcmp (str, "never") == 0)
> > +      tune->stp_policy_model = tune_params::STP_POLICY_NEVER;
> > +  else if (strcmp (str, "aligned") == 0)
> > +      tune->stp_policy_model = tune_params::STP_POLICY_ALIGNED;
> > +  else if (strcmp (str, "default") != 0)
> > +    {
> > +      error ("unknown value %qs for %<-mstp-policy%>", str);
> > +      return false;
> > +    }
> > +
> > +  return true;
> > +}
> > +
> >  /* Parse TOKEN, which has length LENGTH to see if it is an option
> >     described in FLAG.  If it is, return the index bit for that fusion
> type.
> >     If not, error (printing OPTION_NAME) and return zero.  */
> > @@ -18074,6 +18165,14 @@ aarch64_override_options_internal (struct
> gcc_options *opts)
> >      aarch64_parse_override_string (opts->x_aarch64_override_tune_string,
> >                                  &aarch64_tune_params);
> >
> > +  if (opts->x_aarch64_ldp_policy_string)
> > +    aarch64_parse_ldp_policy (opts->x_aarch64_ldp_policy_string,
> > +                           &aarch64_tune_params);
> > +
> > +  if (opts->x_aarch64_stp_policy_string)
> > +    aarch64_parse_stp_policy (opts->x_aarch64_stp_policy_string,
> > +                           &aarch64_tune_params);
> > +
> >    /* This target defaults to strict volatile bitfields.  */
> >    if (opts->x_flag_strict_volatile_bitfields < 0 &&
> abi_version_at_least (2))
> >      opts->x_flag_strict_volatile_bitfields = 1;
> > @@ -26382,18 +26481,14 @@ aarch64_operands_ok_for_ldpstp (rtx *operands,
> bool load,
> >    enum reg_class rclass_1, rclass_2;
> >    rtx mem_1, mem_2, reg_1, reg_2;
> >
> > -  /* Allow the tuning structure to disable LDP instruction formation
> > -     from combining instructions (e.g., in peephole2).
> > -     TODO: Implement fine-grained tuning control for LDP and STP:
> > -        1. control policies for load and store separately;
> > -        2. support the following policies:
> > -           - default (use what is in the tuning structure)
> > -           - always
> > -           - never
> > -           - aligned (only if the compiler can prove that the
> > -             load will be aligned to 2 * element_size)  */
> > -  if (load && (aarch64_tune_params.extra_tuning_flags
> > -            & AARCH64_EXTRA_TUNE_NO_LDP_COMBINE))
> > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> > +  if (load
> > +      && aarch64_tune_params.ldp_policy_model ==
> tune_params::LDP_POLICY_NEVER)
> > +    return false;
> > +
> > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> > +  if (!load
> > +      && aarch64_tune_params.stp_policy_model ==
> tune_params::STP_POLICY_NEVER)
> >      return false;
>
> ...here we could do something like:
>
>   auto policy = (load
>                  ? aarch64_tune_params.ldp_policy_model
>                  : aarch64_tune_params.stp_policy_model);
>
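> (Purely as a rough sketch, with the exact value name still to be decided,
> the check that follows could then become:
>
>   if (policy == tune_params::AARCH64_LDP_STP_POLICY_NEVER)
>     return false;
>
> i.e. a single test covering both the load and the store cases.)
>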
> Also:
>
> >
> >    if (load)
> > @@ -26420,13 +26515,22 @@ aarch64_operands_ok_for_ldpstp (rtx *operands,
> bool load,
> >    if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
> >      return false;
> >
> > -  /* If we have SImode and slow unaligned ldp,
> > -     check the alignment to be at least 8 byte. */
> > -  if (mode == SImode
> > -      && (aarch64_tune_params.extra_tuning_flags
> > -          & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> > +  /* If we have LDP_POLICY_ALIGNED,
> > +     do not emit the load pair unless the alignment is checked to be
> > +     at least double the alignment of the type.  */
> > +  if (load
> > +      && aarch64_tune_params.ldp_policy_model ==
> tune_params::LDP_POLICY_ALIGNED
> >        && !optimize_size
> > -      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
> > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> > +    return false;
> > +
> > +  /* If we have STP_POLICY_ALIGNED,
> > +     do not emit the store pair unless the alignment is checked to be
> > +     at least double the alignment of the type.  */
> > +  if (!load
> > +      && aarch64_tune_params.stp_policy_model ==
> tune_params::STP_POLICY_ALIGNED
> > +      && !optimize_size
> > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> >      return false;
> >
> >    /* Check if the addresses are in the form of [base+offset].  */
> > @@ -26556,6 +26660,16 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx
> *operands, bool load,
> >    HOST_WIDE_INT offvals[num_insns], msize;
> >    rtx mem[num_insns], reg[num_insns], base[num_insns],
> offset[num_insns];
> >
> > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> > +  if (load
> > +      && aarch64_tune_params.ldp_policy_model ==
> tune_params::LDP_POLICY_NEVER)
> > +    return false;
> > +
> > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> > +  if (!load
> > +      && aarch64_tune_params.stp_policy_model ==
> tune_params::STP_POLICY_NEVER)
> > +    return false;
> > +
> >    if (load)
> >      {
> >        for (int i = 0; i < num_insns; i++)
> > @@ -26645,13 +26759,22 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx
> *operands, bool load,
> >    if (offvals[0] % msize != offvals[2] % msize)
> >      return false;
> >
> > -  /* If we have SImode and slow unaligned ldp,
> > -     check the alignment to be at least 8 byte. */
> > -  if (mode == SImode
> > -      && (aarch64_tune_params.extra_tuning_flags
> > -       & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> > +  /* If we have LDP_POLICY_ALIGNED,
> > +     do not emit the load pair unless the alignment is checked to be
> > +     at least double the alignment of the type.  */
> > +  if (load
> > +      && aarch64_tune_params.ldp_policy_model ==
> tune_params::LDP_POLICY_ALIGNED
> > +      && !optimize_size
> > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> > +    return false;
> > +
> > +  /* If we have STP_POLICY_ALIGNED,
> > +     do not emit the store pair unless the alignment is checked to be
> > +     at least double the alignment of the type.  */
> > +  if (!load
> > +      && aarch64_tune_params.stp_policy_model ==
> tune_params::STP_POLICY_ALIGNED
> >        && !optimize_size
> > -      && MEM_ALIGN (mem[0]) < 8 * BITS_PER_UNIT)
> > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> >      return false;
> >
> >    return true;
> > diff --git a/gcc/config/aarch64/aarch64.opt
> b/gcc/config/aarch64/aarch64.opt
> > index 4a0580435a8..e5302947ce7 100644
> > --- a/gcc/config/aarch64/aarch64.opt
> > +++ b/gcc/config/aarch64/aarch64.opt
> > @@ -205,6 +205,14 @@ msign-return-address=
> >  Target WarnRemoved RejectNegative Joined Enum(aarch_ra_sign_scope_t)
> Var(aarch_ra_sign_scope) Init(AARCH_FUNCTION_NONE) Save
> >  Select return address signing scope.
> >
> > +mldp-policy=
> > +Target RejectNegative Joined Var(aarch64_ldp_policy_string) Save
> > +Fine-grained policy for load pairs.
> > +
> > +mstp-policy=
> > +Target RejectNegative Joined Var(aarch64_stp_policy_string) Save
> > +Fine-grained policy for store pairs.
> > +
> >  Enum
> >  Name(aarch_ra_sign_scope_t) Type(enum aarch_function_type)
> >  Supported AArch64 return address signing scope (for use with
> -msign-return-address= option):
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > new file mode 100644
> > index 00000000000..6e29b265168
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> > @@ -0,0 +1,66 @@
> > +/* { dg-options "-O2 -mldp-policy=aligned -mcpu=generic" } */
> > +
> > +#include <stdlib.h>
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +#define LDP_TEST_ALIGNED(TYPE) \
> > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    a_0 = arr[0]; \
> > +    a_1 = arr[1]; \
> > +    return a_0 + a_1; \
> > +}
> > +
> > +#define LDP_TEST_UNALIGNED(TYPE) \
> > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    TYPE *a = arr+1; \
> > +    a_0 = a[0]; \
> > +    a_1 = a[1]; \
> > +    return a_0 + a_1; \
> > +}
> > +
> > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    a_0 = arr[100]; \
> > +    a_1 = arr[101]; \
> > +    a_2 = arr[102]; \
> > +    a_3 = arr[103]; \
> > +    a_4 = arr[110]; \
> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > +}
> > +
> > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    TYPE *a = arr+1; \
> > +    a_0 = a[100]; \
> > +    a_1 = a[101]; \
> > +    a_2 = a[102]; \
> > +    a_3 = a[103]; \
> > +    a_4 = a[110]; \
> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > +}
> > +
> > +LDP_TEST_ALIGNED(int32_t);
> > +LDP_TEST_ALIGNED(int64_t);
> > +LDP_TEST_ALIGNED(v4si);
> > +LDP_TEST_UNALIGNED(int32_t);
> > +LDP_TEST_UNALIGNED(int64_t);
> > +LDP_TEST_UNALIGNED(v4si);
> > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
>
> It might be better to split this into two tests, one for the aligned
> accesses and one for the unaligned accesses.  Same for the store version.
> (Splitting isn't necessary or useful for =always and =never though.)
>
> Thanks,
> Richard
>
> > +
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > new file mode 100644
> > index 00000000000..d2c4cf343e9
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> > @@ -0,0 +1,66 @@
> > +/* { dg-options "-O2 -mldp-policy=always -mcpu=generic" } */
> > +
> > +#include <stdlib.h>
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +#define LDP_TEST_ALIGNED(TYPE) \
> > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    a_0 = arr[0]; \
> > +    a_1 = arr[1]; \
> > +    return a_0 + a_1; \
> > +}
> > +
> > +#define LDP_TEST_UNALIGNED(TYPE) \
> > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    TYPE *a = arr+1; \
> > +    a_0 = a[0]; \
> > +    a_1 = a[1]; \
> > +    return a_0 + a_1; \
> > +}
> > +
> > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    a_0 = arr[100]; \
> > +    a_1 = arr[101]; \
> > +    a_2 = arr[102]; \
> > +    a_3 = arr[103]; \
> > +    a_4 = arr[110]; \
> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > +}
> > +
> > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    TYPE *a = arr+1; \
> > +    a_0 = a[100]; \
> > +    a_1 = a[101]; \
> > +    a_2 = a[102]; \
> > +    a_3 = a[103]; \
> > +    a_4 = a[110]; \
> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > +}
> > +
> > +LDP_TEST_ALIGNED(int32_t);
> > +LDP_TEST_ALIGNED(int64_t);
> > +LDP_TEST_ALIGNED(v4si);
> > +LDP_TEST_UNALIGNED(int32_t);
> > +LDP_TEST_UNALIGNED(int64_t);
> > +LDP_TEST_UNALIGNED(v4si);
> > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> > +
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > new file mode 100644
> > index 00000000000..f8a45ee18be
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> > @@ -0,0 +1,66 @@
> > +/* { dg-options "-O2 -mldp-policy=never -mcpu=generic" } */
> > +
> > +#include <stdlib.h>
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +#define LDP_TEST_ALIGNED(TYPE) \
> > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    a_0 = arr[0]; \
> > +    a_1 = arr[1]; \
> > +    return a_0 + a_1; \
> > +}
> > +
> > +#define LDP_TEST_UNALIGNED(TYPE) \
> > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    TYPE *a = arr+1; \
> > +    a_0 = a[0]; \
> > +    a_1 = a[1]; \
> > +    return a_0 + a_1; \
> > +}
> > +
> > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    a_0 = arr[100]; \
> > +    a_1 = arr[101]; \
> > +    a_2 = arr[102]; \
> > +    a_3 = arr[103]; \
> > +    a_4 = arr[110]; \
> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > +}
> > +
> > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    TYPE *a = arr+1; \
> > +    a_0 = a[100]; \
> > +    a_1 = a[101]; \
> > +    a_2 = a[102]; \
> > +    a_3 = a[103]; \
> > +    a_4 = a[110]; \
> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> > +}
> > +
> > +LDP_TEST_ALIGNED(int32_t);
> > +LDP_TEST_ALIGNED(int64_t);
> > +LDP_TEST_ALIGNED(v4si);
> > +LDP_TEST_UNALIGNED(int32_t);
> > +LDP_TEST_UNALIGNED(int64_t);
> > +LDP_TEST_UNALIGNED(v4si);
> > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > +
> > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > new file mode 100644
> > index 00000000000..ae47b42efc4
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> > @@ -0,0 +1,60 @@
> > +/* { dg-options "-O2 -mstp-policy=aligned -mcpu=generic" } */
> > +
> > +#include <stdlib.h>
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +#define STP_TEST_ALIGNED(TYPE) \
> > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    arr[0] = x; \
> > +    arr[1] = x; \
> > +    return arr; \
> > +}
> > +
> > +#define STP_TEST_UNALIGNED(TYPE) \
> > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    TYPE *a = arr+1; \
> > +    a[0] = x; \
> > +    a[1] = x; \
> > +    return a; \
> > +}
> > +
> > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    arr[100] = x; \
> > +    arr[101] = x; \
> > +    arr[102] = x; \
> > +    arr[103] = x; \
> > +    return arr; \
> > +}
> > +
> > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    TYPE *a = arr+1; \
> > +    a[100] = x; \
> > +    a[101] = x; \
> > +    a[102] = x; \
> > +    a[103] = x; \
> > +    return a; \
> > +}
> > +
> > +STP_TEST_ALIGNED(int32_t);
> > +STP_TEST_ALIGNED(int64_t);
> > +STP_TEST_ALIGNED(v4si);
> > +STP_TEST_UNALIGNED(int32_t);
> > +STP_TEST_UNALIGNED(int64_t);
> > +STP_TEST_UNALIGNED(v4si);
> > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > +
> > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > +
> > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_always.c
> b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> > new file mode 100644
> > index 00000000000..c1c51f9ae88
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> > @@ -0,0 +1,60 @@
> > +/* { dg-options "-O2 -mstp-policy=always -mcpu=generic" } */
> > +
> > +#include <stdlib.h>
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +#define STP_TEST_ALIGNED(TYPE) \
> > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    arr[0] = x; \
> > +    arr[1] = x; \
> > +    return arr; \
> > +}
> > +
> > +#define STP_TEST_UNALIGNED(TYPE) \
> > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    TYPE *a = arr+1; \
> > +    a[0] = x; \
> > +    a[1] = x; \
> > +    return a; \
> > +}
> > +
> > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    arr[100] = x; \
> > +    arr[101] = x; \
> > +    arr[102] = x; \
> > +    arr[103] = x; \
> > +    return arr; \
> > +}
> > +
> > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    TYPE *a = arr+1; \
> > +    a[100] = x; \
> > +    a[101] = x; \
> > +    a[102] = x; \
> > +    a[103] = x; \
> > +    return a; \
> > +}
> > +
> > +STP_TEST_ALIGNED(int32_t);
> > +STP_TEST_ALIGNED(int64_t);
> > +STP_TEST_ALIGNED(v4si);
> > +STP_TEST_UNALIGNED(int32_t);
> > +STP_TEST_UNALIGNED(int64_t);
> > +STP_TEST_UNALIGNED(v4si);
> > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > +
> > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> > +
> > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_never.c
> b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> > new file mode 100644
> > index 00000000000..c28fcafa0ed
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> > @@ -0,0 +1,60 @@
> > +/* { dg-options "-O2 -mstp-policy=never -mcpu=generic" } */
> > +
> > +#include <stdlib.h>
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +#define STP_TEST_ALIGNED(TYPE) \
> > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    arr[0] = x; \
> > +    arr[1] = x; \
> > +    return arr; \
> > +}
> > +
> > +#define STP_TEST_UNALIGNED(TYPE) \
> > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    TYPE *a = arr+1; \
> > +    a[0] = x; \
> > +    a[1] = x; \
> > +    return a; \
> > +}
> > +
> > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    arr[100] = x; \
> > +    arr[101] = x; \
> > +    arr[102] = x; \
> > +    arr[103] = x; \
> > +    return arr; \
> > +}
> > +
> > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> > +    TYPE *a = arr+1; \
> > +    a[100] = x; \
> > +    a[101] = x; \
> > +    a[102] = x; \
> > +    a[103] = x; \
> > +    return a; \
> > +}
> > +
> > +STP_TEST_ALIGNED(int32_t);
> > +STP_TEST_ALIGNED(int64_t);
> > +STP_TEST_ALIGNED(v4si);
> > +STP_TEST_UNALIGNED(int32_t);
> > +STP_TEST_UNALIGNED(int64_t);
> > +STP_TEST_UNALIGNED(v4si);
> > +STP_TEST_ADJUST_ALIGNED(int32_t);
> > +STP_TEST_ADJUST_ALIGNED(int64_t);
> > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> > +
> > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > +
>


-- 
Manos Anagnostakis | Compiler Engineer |
E: manos.anagnostakis@vrull.eu

VRULL GmbH | Beatrixgasse 32 1030 Vienna |
 W: www.vrull.eu | LinkedIn <https://www.linkedin.com/company/vrull/>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] aarch64: Fine-grained ldp and stp policies with test-cases.
  2023-09-28 13:33   ` Manos Anagnostakis
@ 2023-09-28 13:37     ` Philipp Tomsich
  2023-09-28 13:40       ` Manos Anagnostakis
  2023-09-28 13:42     ` Richard Sandiford
  1 sibling, 1 reply; 8+ messages in thread
From: Philipp Tomsich @ 2023-09-28 13:37 UTC (permalink / raw)
  To: Manos Anagnostakis
  Cc: richard.sandiford, Kyrylo Tkachov, gcc-patches, Tamar Christina

Manos,

Please submit a follow-on patch implementing the requested
improvements of the code structure (as this reduces the maintenance
burden).

Thanks,
Philipp.


On Thu, 28 Sept 2023 at 15:33, Manos Anagnostakis
<manos.anagnostakis@vrull.eu> wrote:
>
> Hey Richard,
>
> Thanks for taking the time to review this, but it was committed yesterday after getting reviewed by Kyrill and Tamar.
>
> Discussions:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631285.html
> https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631300.html
> https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631389.html
>
> Committed version:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631484.html
>
> Manos.
>
> On Thu, Sep 28, 2023 at 4:17 PM Richard Sandiford <richard.sandiford@arm.com> wrote:
>>
>> Thanks for the patch and sorry for the slow review.
>>
>> Manos Anagnostakis <manos.anagnostakis@vrull.eu> writes:
>> > This patch implements the following TODO in gcc/config/aarch64/aarch64.cc
>> > to provide the requested behaviour for handling ldp and stp:
>> >
>> >   /* Allow the tuning structure to disable LDP instruction formation
>> >      from combining instructions (e.g., in peephole2).
>> >      TODO: Implement fine-grained tuning control for LDP and STP:
>> >            1. control policies for load and store separately;
>> >            2. support the following policies:
>> >               - default (use what is in the tuning structure)
>> >               - always
>> >               - never
>> >               - aligned (only if the compiler can prove that the
>> >                 load will be aligned to 2 * element_size)  */
>> >
>> > It provides two new and concrete command-line options -mldp-policy and -mstp-policy
>> > to give the ability to control load and store policies separately as
>> > stated in part 1 of the TODO.
>> >
>> > The accepted values for both options are:
>> > - default: Use the ldp/stp policy defined in the corresponding tuning
>> >   structure.
>> > - always: Emit ldp/stp regardless of alignment.
>> > - never: Do not emit ldp/stp.
>> > - aligned: In order to emit ldp/stp, first check if the load/store will
>> >   be aligned to 2 * element_size.
>> >
>> > gcc/ChangeLog:
>> >         * config/aarch64/aarch64-protos.h (struct tune_params): Add
>> >       appropriate enums for the policies.
>> >         * config/aarch64/aarch64-tuning-flags.def
>> >       (AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
>> >       options.
>> >         * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
>> >       function to parse ldp-policy option.
>> >         (aarch64_parse_stp_policy): New function to parse stp-policy option.
>> >         (aarch64_override_options_internal): Call parsing functions.
>> >         (aarch64_operands_ok_for_ldpstp): Add option-value check and
>> >       alignment check and remove superseded ones
>> >         (aarch64_operands_adjust_ok_for_ldpstp): Add option-value check and
>> >       alignment check and remove superseded ones.
>> >         * config/aarch64/aarch64.opt: Add options.
>> >
>> > gcc/testsuite/ChangeLog:
>> >         * gcc.target/aarch64/ldp_aligned.c: New test.
>> >         * gcc.target/aarch64/ldp_always.c: New test.
>> >         * gcc.target/aarch64/ldp_never.c: New test.
>> >         * gcc.target/aarch64/stp_aligned.c: New test.
>> >         * gcc.target/aarch64/stp_always.c: New test.
>> >         * gcc.target/aarch64/stp_never.c: New test.
>> >
>> > Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
>> > ---
>> > Changes in v2:
>> >         - Fixed committed ldp tests to correctly trigger
>> >           and test aarch64_operands_adjust_ok_for_ldpstp in aarch64.cc.
>> >         - Added "-mcpu=generic" to committed tests to guarantee generic target code
>> >           generation and not cause the regressions of v1.
>> >
>> >  gcc/config/aarch64/aarch64-protos.h           |  24 ++
>> >  gcc/config/aarch64/aarch64-tuning-flags.def   |   8 -
>> >  gcc/config/aarch64/aarch64.cc                 | 229 ++++++++++++++----
>> >  gcc/config/aarch64/aarch64.opt                |   8 +
>> >  .../gcc.target/aarch64/ldp_aligned.c          |  66 +++++
>> >  gcc/testsuite/gcc.target/aarch64/ldp_always.c |  66 +++++
>> >  gcc/testsuite/gcc.target/aarch64/ldp_never.c  |  66 +++++
>> >  .../gcc.target/aarch64/stp_aligned.c          |  60 +++++
>> >  gcc/testsuite/gcc.target/aarch64/stp_always.c |  60 +++++
>> >  gcc/testsuite/gcc.target/aarch64/stp_never.c  |  60 +++++
>> >  10 files changed, 586 insertions(+), 61 deletions(-)
>> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
>> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_always.c
>> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_never.c
>> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_aligned.c
>> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_always.c
>> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_never.c
>> >
>> > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
>> > index 70303d6fd95..be1d73490ed 100644
>> > --- a/gcc/config/aarch64/aarch64-protos.h
>> > +++ b/gcc/config/aarch64/aarch64-protos.h
>> > @@ -568,6 +568,30 @@ struct tune_params
>> >    /* Place prefetch struct pointer at the end to enable type checking
>> >       errors when tune_params misses elements (e.g., from erroneous merges).  */
>> >    const struct cpu_prefetch_tune *prefetch;
>> > +/* An enum specifying how to handle load pairs using a fine-grained policy:
>> > +   - LDP_POLICY_ALIGNED: Emit ldp if the source pointer is aligned
>> > +   to at least double the alignment of the type.
>> > +   - LDP_POLICY_ALWAYS: Emit ldp regardless of alignment.
>> > +   - LDP_POLICY_NEVER: Do not emit ldp.  */
>> > +
>> > +  enum aarch64_ldp_policy_model
>> > +  {
>> > +    LDP_POLICY_ALIGNED,
>> > +    LDP_POLICY_ALWAYS,
>> > +    LDP_POLICY_NEVER
>> > +  } ldp_policy_model;
>> > +/* An enum specifying how to handle store pairs using a fine-grained policy:
>> > +   - STP_POLICY_ALIGNED: Emit stp if the source pointer is aligned
>> > +   to at least double the alignment of the type.
>> > +   - STP_POLICY_ALWAYS: Emit stp regardless of alignment.
>> > +   - STP_POLICY_NEVER: Do not emit stp.  */
>> > +
>> > +  enum aarch64_stp_policy_model
>> > +  {
>> > +    STP_POLICY_ALIGNED,
>> > +    STP_POLICY_ALWAYS,
>> > +    STP_POLICY_NEVER
>> > +  } stp_policy_model;
>> >  };
>>
>> Generally the patch looks really good.  But I think we can use a single
>> enum type for both LDP and STP, with the values having the prefix
>> AARCH64_LDP_STP_POLICY.  That means that we only need one parser,
>> and that:
>>
>> >  /* Classifies an address.
>> > diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
>> > index 52112ba7c48..774568e9106 100644
>> > --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>> > +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>> > @@ -30,11 +30,6 @@
>> >
>> >  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
>> >
>> > -/* Don't create non-8 byte aligned load/store pair.  That is if the
>> > -two load/stores are not at least 8 byte aligned don't create load/store
>> > -pairs.   */
>> > -AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)
>> > -
>> >  /* Some of the optional shift to some arthematic instructions are
>> >     considered cheap.  Logical shift left <=4 with or without a
>> >     zero extend are considered cheap.  Sign extend; non logical shift left
>> > @@ -44,9 +39,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
>> >  /* Disallow load/store pair instructions on Q-registers.  */
>> >  AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)
>> >
>> > -/* Disallow load-pair instructions to be formed in combine/peephole.  */
>> > -AARCH64_EXTRA_TUNING_OPTION ("no_ldp_combine", NO_LDP_COMBINE)
>> > -
>> >  AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
>> >
>> >  AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
>> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> > index eba5d4a7e04..43d88c68647 100644
>> > --- a/gcc/config/aarch64/aarch64.cc
>> > +++ b/gcc/config/aarch64/aarch64.cc
>> > @@ -1356,7 +1356,9 @@ static const struct tune_params generic_tunings =
>> >       Neoverse V1.  It does not have a noticeable effect on A64FX and should
>> >       have at most a very minor effect on SVE2 cores.  */
>> >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS), /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params cortexa35_tunings =
>> > @@ -1390,7 +1392,9 @@ static const struct tune_params cortexa35_tunings =
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params cortexa53_tunings =
>> > @@ -1424,7 +1428,9 @@ static const struct tune_params cortexa53_tunings =
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params cortexa57_tunings =
>> > @@ -1458,7 +1464,9 @@ static const struct tune_params cortexa57_tunings =
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),      /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params cortexa72_tunings =
>> > @@ -1492,7 +1500,9 @@ static const struct tune_params cortexa72_tunings =
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params cortexa73_tunings =
>> > @@ -1526,7 +1536,9 @@ static const struct tune_params cortexa73_tunings =
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >
>> > @@ -1561,7 +1573,9 @@ static const struct tune_params exynosm1_tunings =
>> >    48,        /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
>> > -  &exynosm1_prefetch_tune
>> > +  &exynosm1_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params thunderxt88_tunings =
>> > @@ -1593,8 +1607,10 @@ static const struct tune_params thunderxt88_tunings =
>> >    2, /* min_div_recip_mul_df.  */
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_OFF,   /* autoprefetcher_model.  */
>> > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),  /* tune_flags.  */
>> > -  &thunderxt88_prefetch_tune
>> > +  (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
>> > +  &thunderxt88_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALIGNED,   /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALIGNED    /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params thunderx_tunings =
>> > @@ -1626,9 +1642,10 @@ static const struct tune_params thunderx_tunings =
>> >    2, /* min_div_recip_mul_df.  */
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_OFF,   /* autoprefetcher_model.  */
>> > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW
>> > -   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND), /* tune_flags.  */
>> > -  &thunderx_prefetch_tune
>> > +  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
>> > +  &thunderx_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALIGNED,   /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALIGNED    /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params tsv110_tunings =
>> > @@ -1662,7 +1679,9 @@ static const struct tune_params tsv110_tunings =
>> >    0,    /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,     /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NONE),     /* tune_flags.  */
>> > -  &tsv110_prefetch_tune
>> > +  &tsv110_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params xgene1_tunings =
>> > @@ -1695,7 +1714,9 @@ static const struct tune_params xgene1_tunings =
>> >    17,        /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_OFF,   /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),     /* tune_flags.  */
>> > -  &xgene1_prefetch_tune
>> > +  &xgene1_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params emag_tunings =
>> > @@ -1728,7 +1749,9 @@ static const struct tune_params emag_tunings =
>> >    17,        /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_OFF,   /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),     /* tune_flags.  */
>> > -  &xgene1_prefetch_tune
>> > +  &xgene1_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params qdf24xx_tunings =
>> > @@ -1762,7 +1785,9 @@ static const struct tune_params qdf24xx_tunings =
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> >    AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS, /* tune_flags.  */
>> > -  &qdf24xx_prefetch_tune
>> > +  &qdf24xx_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  /* Tuning structure for the Qualcomm Saphira core.  Default to falkor values
>> > @@ -1798,7 +1823,9 @@ static const struct tune_params saphira_tunings =
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NONE),         /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params thunderx2t99_tunings =
>> > @@ -1832,7 +1859,9 @@ static const struct tune_params thunderx2t99_tunings =
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
>> > -  &thunderx2t99_prefetch_tune
>> > +  &thunderx2t99_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params thunderx3t110_tunings =
>> > @@ -1866,7 +1895,9 @@ static const struct tune_params thunderx3t110_tunings =
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
>> > -  &thunderx3t110_prefetch_tune
>> > +  &thunderx3t110_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params neoversen1_tunings =
>> > @@ -1899,7 +1930,9 @@ static const struct tune_params neoversen1_tunings =
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params ampere1_tunings =
>> > @@ -1935,8 +1968,10 @@ static const struct tune_params ampere1_tunings =
>> >    2, /* min_div_recip_mul_df.  */
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE),       /* tune_flags.  */
>> > -  &ampere1_prefetch_tune
>> > +  (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
>> > +  &ampere1_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALIGNED,   /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALIGNED    /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params ampere1a_tunings =
>> > @@ -1973,8 +2008,10 @@ static const struct tune_params ampere1a_tunings =
>> >    2, /* min_div_recip_mul_df.  */
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE),       /* tune_flags.  */
>> > -  &ampere1_prefetch_tune
>> > +  (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
>> > +  &ampere1_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALIGNED,   /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALIGNED    /* stp_policy_model.  */
>> >  };
>> >
>> >  static const advsimd_vec_cost neoversev1_advsimd_vector_cost =
>> > @@ -2155,7 +2192,9 @@ static const struct tune_params neoversev1_tunings =
>> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>> >     | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND), /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const sve_vec_cost neoverse512tvb_sve_vector_cost =
>> > @@ -2292,7 +2331,9 @@ static const struct tune_params neoverse512tvb_tunings =
>> >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const advsimd_vec_cost neoversen2_advsimd_vector_cost =
>> > @@ -2482,7 +2523,9 @@ static const struct tune_params neoversen2_tunings =
>> >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const advsimd_vec_cost neoversev2_advsimd_vector_cost =
>> > @@ -2672,7 +2715,9 @@ static const struct tune_params neoversev2_tunings =
>> >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params a64fx_tunings =
>> > @@ -2705,7 +2750,9 @@ static const struct tune_params a64fx_tunings =
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
>> > -  &a64fx_prefetch_tune
>> > +  &a64fx_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  /* Support for fine-grained override of the tuning structures.  */
>> > @@ -17726,6 +17773,50 @@ aarch64_parse_tune (const char *to_parse, const struct processor **res)
>> >    return AARCH_PARSE_INVALID_ARG;
>> >  }
>> >
>> > +/* Validate a command-line -mldp-policy option.  Parse the policy
>> > +   specified in STR and throw errors if appropriate.  */
>> > +
>> > +static bool
>> > +aarch64_parse_ldp_policy (const char *str, struct tune_params* tune)
>> > +{
>> > +  /* Check the value of the option to be one of the accepted.  */
>> > +  if (strcmp (str, "always") == 0)
>> > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALWAYS;
>> > +  else if (strcmp (str, "never") == 0)
>> > +    tune->ldp_policy_model = tune_params::LDP_POLICY_NEVER;
>> > +  else if (strcmp (str, "aligned") == 0)
>> > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALIGNED;
>> > +  else if (strcmp (str, "default") != 0)
>> > +    {
>> > +      error ("unknown value %qs for %<-mldp-policy%>", str);
>> > +      return false;
>> > +    }
>> > +
>> > +  return true;
>> > +}
>> > +
>> > +/* Validate a command-line -mstp-policy option.  Parse the policy
>> > +   specified in STR and throw errors if appropriate.  */
>> > +
>> > +static bool
>> > +aarch64_parse_stp_policy (const char *str, struct tune_params* tune)
>> > +{
>> > +  /* Check the value of the option to be one of the accepted.  */
>> > +  if (strcmp (str, "always") == 0)
>> > +      tune->stp_policy_model = tune_params::STP_POLICY_ALWAYS;
>> > +  else if (strcmp (str, "never") == 0)
>> > +      tune->stp_policy_model = tune_params::STP_POLICY_NEVER;
>> > +  else if (strcmp (str, "aligned") == 0)
>> > +      tune->stp_policy_model = tune_params::STP_POLICY_ALIGNED;
>> > +  else if (strcmp (str, "default") != 0)
>> > +    {
>> > +      error ("unknown value %qs for %<-mstp-policy%>", str);
>> > +      return false;
>> > +    }
>> > +
>> > +  return true;
>> > +}
>> > +
>> >  /* Parse TOKEN, which has length LENGTH to see if it is an option
>> >     described in FLAG.  If it is, return the index bit for that fusion type.
>> >     If not, error (printing OPTION_NAME) and return zero.  */
>> > @@ -18074,6 +18165,14 @@ aarch64_override_options_internal (struct gcc_options *opts)
>> >      aarch64_parse_override_string (opts->x_aarch64_override_tune_string,
>> >                                  &aarch64_tune_params);
>> >
>> > +  if (opts->x_aarch64_ldp_policy_string)
>> > +    aarch64_parse_ldp_policy (opts->x_aarch64_ldp_policy_string,
>> > +                           &aarch64_tune_params);
>> > +
>> > +  if (opts->x_aarch64_stp_policy_string)
>> > +    aarch64_parse_stp_policy (opts->x_aarch64_stp_policy_string,
>> > +                           &aarch64_tune_params);
>> > +
>> >    /* This target defaults to strict volatile bitfields.  */
>> >    if (opts->x_flag_strict_volatile_bitfields < 0 && abi_version_at_least (2))
>> >      opts->x_flag_strict_volatile_bitfields = 1;
>> > @@ -26382,18 +26481,14 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
>> >    enum reg_class rclass_1, rclass_2;
>> >    rtx mem_1, mem_2, reg_1, reg_2;
>> >
>> > -  /* Allow the tuning structure to disable LDP instruction formation
>> > -     from combining instructions (e.g., in peephole2).
>> > -     TODO: Implement fine-grained tuning control for LDP and STP:
>> > -        1. control policies for load and store separately;
>> > -        2. support the following policies:
>> > -           - default (use what is in the tuning structure)
>> > -           - always
>> > -           - never
>> > -           - aligned (only if the compiler can prove that the
>> > -             load will be aligned to 2 * element_size)  */
>> > -  if (load && (aarch64_tune_params.extra_tuning_flags
>> > -            & AARCH64_EXTRA_TUNE_NO_LDP_COMBINE))
>> > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
>> > +  if (load
>> > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
>> > +    return false;
>> > +
>> > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
>> > +  if (!load
>> > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
>> >      return false;
>>
>> ...here we could do something like:
>>
>>   auto policy = (load
>>                  ? aarch64_tune_params.ldp_policy_model
>>                  : aarch64_tune_params.stp_policy_model);
>>
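>> For instance (untested, and the exact name and placement are up to you),
>> something along the lines of:
>>
>>   /* Fine-grained load/store pair policy, shared by ldp and stp.  */
>>   enum aarch64_ldp_stp_policy
>>   {
>>     AARCH64_LDP_STP_POLICY_ALIGNED,
>>     AARCH64_LDP_STP_POLICY_ALWAYS,
>>     AARCH64_LDP_STP_POLICY_NEVER
>>   };
>>
>> and then the two tune_params fields and a single parsing routine (taking
>> a pointer to whichever field it should set) can share it:
>>
>>   enum aarch64_ldp_stp_policy ldp_policy_model;
>>   enum aarch64_ldp_stp_policy stp_policy_model;
>>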
>> Also:
>>
>> >
>> >    if (load)
>> > @@ -26420,13 +26515,22 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
>> >    if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
>> >      return false;
>> >
>> > -  /* If we have SImode and slow unaligned ldp,
>> > -     check the alignment to be at least 8 byte. */
>> > -  if (mode == SImode
>> > -      && (aarch64_tune_params.extra_tuning_flags
>> > -          & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
>> > +  /* If we have LDP_POLICY_ALIGNED,
>> > +     do not emit the load pair unless the alignment is checked to be
>> > +     at least double the alignment of the type.  */
>> > +  if (load
>> > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
>> >        && !optimize_size
>> > -      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
>> > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
>> > +    return false;
>> > +
>> > +  /* If we have STP_POLICY_ALIGNED,
>> > +     do not emit the store pair unless the alignment is checked to be
>> > +     at least double the alignment of the type.  */
>> > +  if (!load
>> > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
>> > +      && !optimize_size
>> > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
>> >      return false;
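>>
>> ...and likewise here: with the policy variable suggested above, the two
>> aligned checks can collapse into one (untested):
>>
>>   if (policy == AARCH64_LDP_STP_POLICY_ALIGNED
>>       && !optimize_size
>>       && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
>>     return false;
>>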
>> >
>> >    /* Check if the addresses are in the form of [base+offset].  */
>> > @@ -26556,6 +26660,16 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
>> >    HOST_WIDE_INT offvals[num_insns], msize;
>> >    rtx mem[num_insns], reg[num_insns], base[num_insns], offset[num_insns];
>> >
>> > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
>> > +  if (load
>> > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
>> > +    return false;
>> > +
>> > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
>> > +  if (!load
>> > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
>> > +    return false;
>> > +
>> >    if (load)
>> >      {
>> >        for (int i = 0; i < num_insns; i++)
>> > @@ -26645,13 +26759,22 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
>> >    if (offvals[0] % msize != offvals[2] % msize)
>> >      return false;
>> >
>> > -  /* If we have SImode and slow unaligned ldp,
>> > -     check the alignment to be at least 8 byte. */
>> > -  if (mode == SImode
>> > -      && (aarch64_tune_params.extra_tuning_flags
>> > -       & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
>> > +  /* If we have LDP_POLICY_ALIGNED,
>> > +     do not emit the load pair unless the alignment is checked to be
>> > +     at least double the alignment of the type.  */
>> > +  if (load
>> > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
>> > +      && !optimize_size
>> > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
>> > +    return false;
>> > +
>> > +  /* If we have STP_POLICY_ALIGNED,
>> > +     do not emit the store pair unless the alignment is checked to be
>> > +     at least double the alignment of the type.  */
>> > +  if (!load
>> > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
>> >        && !optimize_size
>> > -      && MEM_ALIGN (mem[0]) < 8 * BITS_PER_UNIT)
>> > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
>> >      return false;
>> >
>> >    return true;
>> > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
>> > index 4a0580435a8..e5302947ce7 100644
>> > --- a/gcc/config/aarch64/aarch64.opt
>> > +++ b/gcc/config/aarch64/aarch64.opt
>> > @@ -205,6 +205,14 @@ msign-return-address=
>> >  Target WarnRemoved RejectNegative Joined Enum(aarch_ra_sign_scope_t) Var(aarch_ra_sign_scope) Init(AARCH_FUNCTION_NONE) Save
>> >  Select return address signing scope.
>> >
>> > +mldp-policy=
>> > +Target RejectNegative Joined Var(aarch64_ldp_policy_string) Save
>> > +Fine-grained policy for load pairs.
>> > +
>> > +mstp-policy=
>> > +Target RejectNegative Joined Var(aarch64_stp_policy_string) Save
>> > +Fine-grained policy for store pairs.
>> > +
>> >  Enum
>> >  Name(aarch_ra_sign_scope_t) Type(enum aarch_function_type)
>> >  Supported AArch64 return address signing scope (for use with -msign-return-address= option):
>> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
>> > new file mode 100644
>> > index 00000000000..6e29b265168
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
>> > @@ -0,0 +1,66 @@
>> > +/* { dg-options "-O2 -mldp-policy=aligned -mcpu=generic" } */
>> > +
>> > +#include <stdlib.h>
>> > +#include <stdint.h>
>> > +
>> > +typedef int v4si __attribute__ ((vector_size (16)));
>> > +
>> > +#define LDP_TEST_ALIGNED(TYPE) \
>> > +TYPE ldp_aligned_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    a_0 = arr[0]; \
>> > +    a_1 = arr[1]; \
>> > +    return a_0 + a_1; \
>> > +}
>> > +
>> > +#define LDP_TEST_UNALIGNED(TYPE) \
>> > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a_0 = a[0]; \
>> > +    a_1 = a[1]; \
>> > +    return a_0 + a_1; \
>> > +}
>> > +
>> > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
>> > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    a_0 = arr[100]; \
>> > +    a_1 = arr[101]; \
>> > +    a_2 = arr[102]; \
>> > +    a_3 = arr[103]; \
>> > +    a_4 = arr[110]; \
>> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
>> > +}
>> > +
>> > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
>> > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a_0 = a[100]; \
>> > +    a_1 = a[101]; \
>> > +    a_2 = a[102]; \
>> > +    a_3 = a[103]; \
>> > +    a_4 = a[110]; \
>> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
>> > +}
>> > +
>> > +LDP_TEST_ALIGNED(int32_t);
>> > +LDP_TEST_ALIGNED(int64_t);
>> > +LDP_TEST_ALIGNED(v4si);
>> > +LDP_TEST_UNALIGNED(int32_t);
>> > +LDP_TEST_UNALIGNED(int64_t);
>> > +LDP_TEST_UNALIGNED(v4si);
>> > +LDP_TEST_ADJUST_ALIGNED(int32_t);
>> > +LDP_TEST_ADJUST_ALIGNED(int64_t);
>> > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
>> > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
>> > +
>> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
>> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
>> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
>>
>> It might be better to split this into two tests, one for the aligned
>> accesses and one for the unaligned accesses.  Same for the store version.
>> (Splitting isn't necessary or useful for =always and =never though.)
>>
>> Thanks,
>> Richard
>>
>> > +
>> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_always.c b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
>> > new file mode 100644
>> > index 00000000000..d2c4cf343e9
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
>> > @@ -0,0 +1,66 @@
>> > +/* { dg-options "-O2 -mldp-policy=always -mcpu=generic" } */
>> > +
>> > +#include <stdlib.h>
>> > +#include <stdint.h>
>> > +
>> > +typedef int v4si __attribute__ ((vector_size (16)));
>> > +
>> > +#define LDP_TEST_ALIGNED(TYPE) \
>> > +TYPE ldp_aligned_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    a_0 = arr[0]; \
>> > +    a_1 = arr[1]; \
>> > +    return a_0 + a_1; \
>> > +}
>> > +
>> > +#define LDP_TEST_UNALIGNED(TYPE) \
>> > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a_0 = a[0]; \
>> > +    a_1 = a[1]; \
>> > +    return a_0 + a_1; \
>> > +}
>> > +
>> > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
>> > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    a_0 = arr[100]; \
>> > +    a_1 = arr[101]; \
>> > +    a_2 = arr[102]; \
>> > +    a_3 = arr[103]; \
>> > +    a_4 = arr[110]; \
>> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
>> > +}
>> > +
>> > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
>> > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a_0 = a[100]; \
>> > +    a_1 = a[101]; \
>> > +    a_2 = a[102]; \
>> > +    a_3 = a[103]; \
>> > +    a_4 = a[110]; \
>> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
>> > +}
>> > +
>> > +LDP_TEST_ALIGNED(int32_t);
>> > +LDP_TEST_ALIGNED(int64_t);
>> > +LDP_TEST_ALIGNED(v4si);
>> > +LDP_TEST_UNALIGNED(int32_t);
>> > +LDP_TEST_UNALIGNED(int64_t);
>> > +LDP_TEST_UNALIGNED(v4si);
>> > +LDP_TEST_ADJUST_ALIGNED(int32_t);
>> > +LDP_TEST_ADJUST_ALIGNED(int64_t);
>> > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
>> > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
>> > +
>> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 6 } } */
>> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 6 } } */
>> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 2 } } */
>> > +
>> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_never.c b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
>> > new file mode 100644
>> > index 00000000000..f8a45ee18be
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
>> > @@ -0,0 +1,66 @@
>> > +/* { dg-options "-O2 -mldp-policy=never -mcpu=generic" } */
>> > +
>> > +#include <stdlib.h>
>> > +#include <stdint.h>
>> > +
>> > +typedef int v4si __attribute__ ((vector_size (16)));
>> > +
>> > +#define LDP_TEST_ALIGNED(TYPE) \
>> > +TYPE ldp_aligned_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    a_0 = arr[0]; \
>> > +    a_1 = arr[1]; \
>> > +    return a_0 + a_1; \
>> > +}
>> > +
>> > +#define LDP_TEST_UNALIGNED(TYPE) \
>> > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a_0 = a[0]; \
>> > +    a_1 = a[1]; \
>> > +    return a_0 + a_1; \
>> > +}
>> > +
>> > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
>> > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    a_0 = arr[100]; \
>> > +    a_1 = arr[101]; \
>> > +    a_2 = arr[102]; \
>> > +    a_3 = arr[103]; \
>> > +    a_4 = arr[110]; \
>> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
>> > +}
>> > +
>> > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
>> > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a_0 = a[100]; \
>> > +    a_1 = a[101]; \
>> > +    a_2 = a[102]; \
>> > +    a_3 = a[103]; \
>> > +    a_4 = a[110]; \
>> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
>> > +}
>> > +
>> > +LDP_TEST_ALIGNED(int32_t);
>> > +LDP_TEST_ALIGNED(int64_t);
>> > +LDP_TEST_ALIGNED(v4si);
>> > +LDP_TEST_UNALIGNED(int32_t);
>> > +LDP_TEST_UNALIGNED(int64_t);
>> > +LDP_TEST_UNALIGNED(v4si);
>> > +LDP_TEST_ADJUST_ALIGNED(int32_t);
>> > +LDP_TEST_ADJUST_ALIGNED(int64_t);
>> > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
>> > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
>> > +
>> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
>> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
>> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
>> > +
>> > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_aligned.c b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
>> > new file mode 100644
>> > index 00000000000..ae47b42efc4
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
>> > @@ -0,0 +1,60 @@
>> > +/* { dg-options "-O2 -mstp-policy=aligned -mcpu=generic" } */
>> > +
>> > +#include <stdlib.h>
>> > +#include <stdint.h>
>> > +
>> > +typedef int v4si __attribute__ ((vector_size (16)));
>> > +
>> > +#define STP_TEST_ALIGNED(TYPE) \
>> > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    arr[0] = x; \
>> > +    arr[1] = x; \
>> > +    return arr; \
>> > +}
>> > +
>> > +#define STP_TEST_UNALIGNED(TYPE) \
>> > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a[0] = x; \
>> > +    a[1] = x; \
>> > +    return a; \
>> > +}
>> > +
>> > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
>> > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    arr[100] = x; \
>> > +    arr[101] = x; \
>> > +    arr[102] = x; \
>> > +    arr[103] = x; \
>> > +    return arr; \
>> > +}
>> > +
>> > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
>> > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a[100] = x; \
>> > +    a[101] = x; \
>> > +    a[102] = x; \
>> > +    a[103] = x; \
>> > +    return a; \
>> > +}
>> > +
>> > +STP_TEST_ALIGNED(int32_t);
>> > +STP_TEST_ALIGNED(int64_t);
>> > +STP_TEST_ALIGNED(v4si);
>> > +STP_TEST_UNALIGNED(int32_t);
>> > +STP_TEST_UNALIGNED(int64_t);
>> > +STP_TEST_UNALIGNED(v4si);
>> > +STP_TEST_ADJUST_ALIGNED(int32_t);
>> > +STP_TEST_ADJUST_ALIGNED(int64_t);
>> > +STP_TEST_ADJUST_UNALIGNED(int32_t);
>> > +STP_TEST_ADJUST_UNALIGNED(int64_t);
>> > +
>> > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 3 } } */
>> > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 3 } } */
>> > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 1 } } */
>> > +
>> > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_always.c b/gcc/testsuite/gcc.target/aarch64/stp_always.c
>> > new file mode 100644
>> > index 00000000000..c1c51f9ae88
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/aarch64/stp_always.c
>> > @@ -0,0 +1,60 @@
>> > +/* { dg-options "-O2 -mstp-policy=always -mcpu=generic" } */
>> > +
>> > +#include <stdlib.h>
>> > +#include <stdint.h>
>> > +
>> > +typedef int v4si __attribute__ ((vector_size (16)));
>> > +
>> > +#define STP_TEST_ALIGNED(TYPE) \
>> > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    arr[0] = x; \
>> > +    arr[1] = x; \
>> > +    return arr; \
>> > +}
>> > +
>> > +#define STP_TEST_UNALIGNED(TYPE) \
>> > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a[0] = x; \
>> > +    a[1] = x; \
>> > +    return a; \
>> > +}
>> > +
>> > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
>> > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    arr[100] = x; \
>> > +    arr[101] = x; \
>> > +    arr[102] = x; \
>> > +    arr[103] = x; \
>> > +    return arr; \
>> > +}
>> > +
>> > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
>> > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a[100] = x; \
>> > +    a[101] = x; \
>> > +    a[102] = x; \
>> > +    a[103] = x; \
>> > +    return a; \
>> > +}
>> > +
>> > +STP_TEST_ALIGNED(int32_t);
>> > +STP_TEST_ALIGNED(int64_t);
>> > +STP_TEST_ALIGNED(v4si);
>> > +STP_TEST_UNALIGNED(int32_t);
>> > +STP_TEST_UNALIGNED(int64_t);
>> > +STP_TEST_UNALIGNED(v4si);
>> > +STP_TEST_ADJUST_ALIGNED(int32_t);
>> > +STP_TEST_ADJUST_ALIGNED(int64_t);
>> > +STP_TEST_ADJUST_UNALIGNED(int32_t);
>> > +STP_TEST_ADJUST_UNALIGNED(int64_t);
>> > +
>> > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 6 } } */
>> > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 6 } } */
>> > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 2 } } */
>> > +
>> > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_never.c b/gcc/testsuite/gcc.target/aarch64/stp_never.c
>> > new file mode 100644
>> > index 00000000000..c28fcafa0ed
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/aarch64/stp_never.c
>> > @@ -0,0 +1,60 @@
>> > +/* { dg-options "-O2 -mstp-policy=never -mcpu=generic" } */
>> > +
>> > +#include <stdlib.h>
>> > +#include <stdint.h>
>> > +
>> > +typedef int v4si __attribute__ ((vector_size (16)));
>> > +
>> > +#define STP_TEST_ALIGNED(TYPE) \
>> > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    arr[0] = x; \
>> > +    arr[1] = x; \
>> > +    return arr; \
>> > +}
>> > +
>> > +#define STP_TEST_UNALIGNED(TYPE) \
>> > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a[0] = x; \
>> > +    a[1] = x; \
>> > +    return a; \
>> > +}
>> > +
>> > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
>> > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    arr[100] = x; \
>> > +    arr[101] = x; \
>> > +    arr[102] = x; \
>> > +    arr[103] = x; \
>> > +    return arr; \
>> > +}
>> > +
>> > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
>> > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a[100] = x; \
>> > +    a[101] = x; \
>> > +    a[102] = x; \
>> > +    a[103] = x; \
>> > +    return a; \
>> > +}
>> > +
>> > +STP_TEST_ALIGNED(int32_t);
>> > +STP_TEST_ALIGNED(int64_t);
>> > +STP_TEST_ALIGNED(v4si);
>> > +STP_TEST_UNALIGNED(int32_t);
>> > +STP_TEST_UNALIGNED(int64_t);
>> > +STP_TEST_UNALIGNED(v4si);
>> > +STP_TEST_ADJUST_ALIGNED(int32_t);
>> > +STP_TEST_ADJUST_ALIGNED(int64_t);
>> > +STP_TEST_ADJUST_UNALIGNED(int32_t);
>> > +STP_TEST_ADJUST_UNALIGNED(int64_t);
>> > +
>> > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 0 } } */
>> > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 0 } } */
>> > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 0 } } */
>> > +
>
>
>
> --
> Manos Anagnostakis | Compiler Engineer |
> E: manos.anagnostakis@vrull.eu
>
> VRULL GmbH | Beatrixgasse 32 1030 Vienna |
>  W: www.vrull.eu | LinkedIn

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] aarch64: Fine-grained ldp and stp policies with test-cases.
  2023-09-28 13:37     ` Philipp Tomsich
@ 2023-09-28 13:40       ` Manos Anagnostakis
  0 siblings, 0 replies; 8+ messages in thread
From: Manos Anagnostakis @ 2023-09-28 13:40 UTC (permalink / raw)
  To: Philipp Tomsich
  Cc: richard.sandiford, Kyrylo Tkachov, gcc-patches, Tamar Christina

[-- Attachment #1: Type: text/plain, Size: 49084 bytes --]

Sure, I will attend to this.

Manos.

On Thu, Sep 28, 2023 at 4:37 PM Philipp Tomsich <philipp.tomsich@vrull.eu>
wrote:

> Manos,
>
> Please submit a follow-on patch implementing the requested
> improvements of the code structure (as this reduces the maintenance
> burden).
>
> Thanks,
> Philipp.
>
>
> On Thu, 28 Sept 2023 at 15:33, Manos Anagnostakis
> <manos.anagnostakis@vrull.eu> wrote:
> >
> > Hey Richard,
> >
> > Thanks for taking the time to review this, but it has been committed
> since yesterday after getting reviewed by Kyrill and Tamar.
> >
> > Discussions:
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631285.html
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631300.html
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631389.html
> >
> > Committed version:
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631484.html
> >
> > Manos.
> >
> > On Thu, Sep 28, 2023 at 4:17 PM Richard Sandiford <
> richard.sandiford@arm.com> wrote:
> >>
> >> Thanks for the patch and sorry for the slow review.
> >>
> >> Manos Anagnostakis <manos.anagnostakis@vrull.eu> writes:
> >> > This patch implements the following TODO in
> gcc/config/aarch64/aarch64.cc
> >> > to provide the requested behaviour for handling ldp and stp:
> >> >
> >> >   /* Allow the tuning structure to disable LDP instruction formation
> >> >      from combining instructions (e.g., in peephole2).
> >> >      TODO: Implement fine-grained tuning control for LDP and STP:
> >> >            1. control policies for load and store separately;
> >> >            2. support the following policies:
> >> >               - default (use what is in the tuning structure)
> >> >               - always
> >> >               - never
> >> >               - aligned (only if the compiler can prove that the
> >> >                 load will be aligned to 2 * element_size)  */
> >> >
> >> > It provides two new and concrete command-line options -mldp-policy
> and -mstp-policy
> >> > to give the ability to control load and store policies separately as
> >> > stated in part 1 of the TODO.
> >> >
> >> > The accepted values for both options are:
> >> > - default: Use the ldp/stp policy defined in the corresponding tuning
> >> >   structure.
> >> > - always: Emit ldp/stp regardless of alignment.
> >> > - never: Do not emit ldp/stp.
> >> > - aligned: In order to emit ldp/stp, first check if the load/store
> will
> >> >   be aligned to 2 * element_size.
> >> >
> >> > gcc/ChangeLog:
> >> >         * config/aarch64/aarch64-protos.h (struct tune_params): Add
> >> >       appropriate enums for the policies.
> >> >         * config/aarch64/aarch64-tuning-flags.def
> >> >       (AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
> >> >       options.
> >> >         * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
> >> >       function to parse ldp-policy option.
> >> >         (aarch64_parse_stp_policy): New function to parse stp-policy
> option.
> >> >         (aarch64_override_options_internal): Call parsing functions.
> >> >         (aarch64_operands_ok_for_ldpstp): Add option-value check and
> >> >       alignment check and remove superseded ones.
> >> >         (aarch64_operands_adjust_ok_for_ldpstp): Add option-value
> check and
> >> >       alignment check and remove superseded ones.
> >> >         * config/aarch64/aarch64.opt: Add options.
> >> >
> >> > gcc/testsuite/ChangeLog:
> >> >         * gcc.target/aarch64/ldp_aligned.c: New test.
> >> >         * gcc.target/aarch64/ldp_always.c: New test.
> >> >         * gcc.target/aarch64/ldp_never.c: New test.
> >> >         * gcc.target/aarch64/stp_aligned.c: New test.
> >> >         * gcc.target/aarch64/stp_always.c: New test.
> >> >         * gcc.target/aarch64/stp_never.c: New test.
> >> >
> >> > Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> >> > ---
> >> > Changes in v2:
> >> >         - Fixed committed ldp tests to correctly trigger
> >> >           and test aarch64_operands_adjust_ok_for_ldpstp in
> aarch64.cc.
> >> >         - Added "-mcpu=generic" to committed tests to guarantee
> generic target code
> >> >           generation and not cause the regressions of v1.
> >> >
> >> >  gcc/config/aarch64/aarch64-protos.h           |  24 ++
> >> >  gcc/config/aarch64/aarch64-tuning-flags.def   |   8 -
> >> >  gcc/config/aarch64/aarch64.cc                 | 229
> ++++++++++++++----
> >> >  gcc/config/aarch64/aarch64.opt                |   8 +
> >> >  .../gcc.target/aarch64/ldp_aligned.c          |  66 +++++
> >> >  gcc/testsuite/gcc.target/aarch64/ldp_always.c |  66 +++++
> >> >  gcc/testsuite/gcc.target/aarch64/ldp_never.c  |  66 +++++
> >> >  .../gcc.target/aarch64/stp_aligned.c          |  60 +++++
> >> >  gcc/testsuite/gcc.target/aarch64/stp_always.c |  60 +++++
> >> >  gcc/testsuite/gcc.target/aarch64/stp_never.c  |  60 +++++
> >> >  10 files changed, 586 insertions(+), 61 deletions(-)
> >> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> >> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_always.c
> >> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_never.c
> >> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> >> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_always.c
> >> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_never.c
> >> >
> >> > diff --git a/gcc/config/aarch64/aarch64-protos.h
> b/gcc/config/aarch64/aarch64-protos.h
> >> > index 70303d6fd95..be1d73490ed 100644
> >> > --- a/gcc/config/aarch64/aarch64-protos.h
> >> > +++ b/gcc/config/aarch64/aarch64-protos.h
> >> > @@ -568,6 +568,30 @@ struct tune_params
> >> >    /* Place prefetch struct pointer at the end to enable type checking
> >> >       errors when tune_params misses elements (e.g., from erroneous
> merges).  */
> >> >    const struct cpu_prefetch_tune *prefetch;
> >> > +/* An enum specifying how to handle load pairs using a fine-grained
> policy:
> >> > +   - LDP_POLICY_ALIGNED: Emit ldp if the source pointer is aligned
> >> > +   to at least double the alignment of the type.
> >> > +   - LDP_POLICY_ALWAYS: Emit ldp regardless of alignment.
> >> > +   - LDP_POLICY_NEVER: Do not emit ldp.  */
> >> > +
> >> > +  enum aarch64_ldp_policy_model
> >> > +  {
> >> > +    LDP_POLICY_ALIGNED,
> >> > +    LDP_POLICY_ALWAYS,
> >> > +    LDP_POLICY_NEVER
> >> > +  } ldp_policy_model;
> >> > +/* An enum specifying how to handle store pairs using a fine-grained
> policy:
> >> > +   - STP_POLICY_ALIGNED: Emit stp if the source pointer is aligned
> >> > +   to at least double the alignment of the type.
> >> > +   - STP_POLICY_ALWAYS: Emit stp regardless of alignment.
> >> > +   - STP_POLICY_NEVER: Do not emit stp.  */
> >> > +
> >> > +  enum aarch64_stp_policy_model
> >> > +  {
> >> > +    STP_POLICY_ALIGNED,
> >> > +    STP_POLICY_ALWAYS,
> >> > +    STP_POLICY_NEVER
> >> > +  } stp_policy_model;
> >> >  };
> >>
> >> Generally the patch looks really good.  But I think we can use a single
> >> enum type for both LDP and STP, with the values having the prefix
> >> AARCH64_LDP_STP_POLICY.  That means that we only need one parser,
> >> and that:
> >>
> >> >  /* Classifies an address.
> >> > diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def
> b/gcc/config/aarch64/aarch64-tuning-flags.def
> >> > index 52112ba7c48..774568e9106 100644
> >> > --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> >> > +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> >> > @@ -30,11 +30,6 @@
> >> >
> >> >  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
> >> >
> >> > -/* Don't create non-8 byte aligned load/store pair.  That is if the
> >> > -two load/stores are not at least 8 byte aligned don't create
> load/store
> >> > -pairs.   */
> >> > -AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw",
> SLOW_UNALIGNED_LDPW)
> >> > -
> >> >  /* Some of the optional shift to some arthematic instructions are
> >> >     considered cheap.  Logical shift left <=4 with or without a
> >> >     zero extend are considered cheap.  Sign extend; non logical shift
> left
> >> > @@ -44,9 +39,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend",
> CHEAP_SHIFT_EXTEND)
> >> >  /* Disallow load/store pair instructions on Q-registers.  */
> >> >  AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)
> >> >
> >> > -/* Disallow load-pair instructions to be formed in
> combine/peephole.  */
> >> > -AARCH64_EXTRA_TUNING_OPTION ("no_ldp_combine", NO_LDP_COMBINE)
> >> > -
> >> >  AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
> >> >
> >> >  AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants",
> CSE_SVE_VL_CONSTANTS)
> >> > diff --git a/gcc/config/aarch64/aarch64.cc
> b/gcc/config/aarch64/aarch64.cc
> >> > index eba5d4a7e04..43d88c68647 100644
> >> > --- a/gcc/config/aarch64/aarch64.cc
> >> > +++ b/gcc/config/aarch64/aarch64.cc
> >> > @@ -1356,7 +1356,9 @@ static const struct tune_params generic_tunings
> =
> >> >       Neoverse V1.  It does not have a noticeable effect on A64FX and
> should
> >> >       have at most a very minor effect on SVE2 cores.  */
> >> >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS), /* tune_flags.  */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params cortexa35_tunings =
> >> > @@ -1390,7 +1392,9 @@ static const struct tune_params
> cortexa35_tunings =
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params cortexa53_tunings =
> >> > @@ -1424,7 +1428,9 @@ static const struct tune_params
> cortexa53_tunings =
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params cortexa57_tunings =
> >> > @@ -1458,7 +1464,9 @@ static const struct tune_params
> cortexa57_tunings =
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),      /* tune_flags.  */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params cortexa72_tunings =
> >> > @@ -1492,7 +1500,9 @@ static const struct tune_params
> cortexa72_tunings =
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params cortexa73_tunings =
> >> > @@ -1526,7 +1536,9 @@ static const struct tune_params
> cortexa73_tunings =
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >
> >> > @@ -1561,7 +1573,9 @@ static const struct tune_params
> exynosm1_tunings =
> >> >    48,        /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> >> > -  &exynosm1_prefetch_tune
> >> > +  &exynosm1_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params thunderxt88_tunings =
> >> > @@ -1593,8 +1607,10 @@ static const struct tune_params
> thunderxt88_tunings =
> >> >    2, /* min_div_recip_mul_df.  */
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_OFF,   /* autoprefetcher_model.  */
> >> > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),  /* tune_flags.  */
> >> > -  &thunderxt88_prefetch_tune
> >> > +  (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> >> > +  &thunderxt88_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALIGNED,   /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALIGNED    /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params thunderx_tunings =
> >> > @@ -1626,9 +1642,10 @@ static const struct tune_params
> thunderx_tunings =
> >> >    2, /* min_div_recip_mul_df.  */
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_OFF,   /* autoprefetcher_model.  */
> >> > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW
> >> > -   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND), /* tune_flags.  */
> >> > -  &thunderx_prefetch_tune
> >> > +  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> >> > +  &thunderx_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALIGNED,   /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALIGNED    /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params tsv110_tunings =
> >> > @@ -1662,7 +1679,9 @@ static const struct tune_params tsv110_tunings =
> >> >    0,    /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,     /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NONE),     /* tune_flags.  */
> >> > -  &tsv110_prefetch_tune
> >> > +  &tsv110_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params xgene1_tunings =
> >> > @@ -1695,7 +1714,9 @@ static const struct tune_params xgene1_tunings =
> >> >    17,        /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_OFF,   /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),     /* tune_flags.  */
> >> > -  &xgene1_prefetch_tune
> >> > +  &xgene1_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params emag_tunings =
> >> > @@ -1728,7 +1749,9 @@ static const struct tune_params emag_tunings =
> >> >    17,        /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_OFF,   /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),     /* tune_flags.  */
> >> > -  &xgene1_prefetch_tune
> >> > +  &xgene1_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params qdf24xx_tunings =
> >> > @@ -1762,7 +1785,9 @@ static const struct tune_params qdf24xx_tunings
> =
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> >    AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS, /* tune_flags.  */
> >> > -  &qdf24xx_prefetch_tune
> >> > +  &qdf24xx_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  /* Tuning structure for the Qualcomm Saphira core.  Default to
> falkor values
> >> > @@ -1798,7 +1823,9 @@ static const struct tune_params saphira_tunings
> =
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NONE),         /* tune_flags.  */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params thunderx2t99_tunings =
> >> > @@ -1832,7 +1859,9 @@ static const struct tune_params
> thunderx2t99_tunings =
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> >> > -  &thunderx2t99_prefetch_tune
> >> > +  &thunderx2t99_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params thunderx3t110_tunings =
> >> > @@ -1866,7 +1895,9 @@ static const struct tune_params
> thunderx3t110_tunings =
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> >> > -  &thunderx3t110_prefetch_tune
> >> > +  &thunderx3t110_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params neoversen1_tunings =
> >> > @@ -1899,7 +1930,9 @@ static const struct tune_params
> neoversen1_tunings =
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params ampere1_tunings =
> >> > @@ -1935,8 +1968,10 @@ static const struct tune_params
> ampere1_tunings =
> >> >    2, /* min_div_recip_mul_df.  */
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE),       /* tune_flags.  */
> >> > -  &ampere1_prefetch_tune
> >> > +  (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> >> > +  &ampere1_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALIGNED,   /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALIGNED    /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params ampere1a_tunings =
> >> > @@ -1973,8 +2008,10 @@ static const struct tune_params
> ampere1a_tunings =
> >> >    2, /* min_div_recip_mul_df.  */
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE),       /* tune_flags.  */
> >> > -  &ampere1_prefetch_tune
> >> > +  (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> >> > +  &ampere1_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALIGNED,   /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALIGNED    /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const advsimd_vec_cost neoversev1_advsimd_vector_cost =
> >> > @@ -2155,7 +2192,9 @@ static const struct tune_params
> neoversev1_tunings =
> >> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >> >     | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND), /* tune_flags.  */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const sve_vec_cost neoverse512tvb_sve_vector_cost =
> >> > @@ -2292,7 +2331,9 @@ static const struct tune_params
> neoverse512tvb_tunings =
> >> >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.
> */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const advsimd_vec_cost neoversen2_advsimd_vector_cost =
> >> > @@ -2482,7 +2523,9 @@ static const struct tune_params
> neoversen2_tunings =
> >> >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.
> */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const advsimd_vec_cost neoversev2_advsimd_vector_cost =
> >> > @@ -2672,7 +2715,9 @@ static const struct tune_params
> neoversev2_tunings =
> >> >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.
> */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params a64fx_tunings =
> >> > @@ -2705,7 +2750,9 @@ static const struct tune_params a64fx_tunings =
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> >> > -  &a64fx_prefetch_tune
> >> > +  &a64fx_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  /* Support for fine-grained override of the tuning structures.  */
> >> > @@ -17726,6 +17773,50 @@ aarch64_parse_tune (const char *to_parse,
> const struct processor **res)
> >> >    return AARCH_PARSE_INVALID_ARG;
> >> >  }
> >> >
> >> > +/* Validate a command-line -mldp-policy option.  Parse the policy
> >> > +   specified in STR and throw errors if appropriate.  */
> >> > +
> >> > +static bool
> >> > +aarch64_parse_ldp_policy (const char *str, struct tune_params* tune)
> >> > +{
> >> > +  /* Check the value of the option to be one of the accepted.  */
> >> > +  if (strcmp (str, "always") == 0)
> >> > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALWAYS;
> >> > +  else if (strcmp (str, "never") == 0)
> >> > +    tune->ldp_policy_model = tune_params::LDP_POLICY_NEVER;
> >> > +  else if (strcmp (str, "aligned") == 0)
> >> > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALIGNED;
> >> > +  else if (strcmp (str, "default") != 0)
> >> > +    {
> >> > +      error ("unknown value %qs for %<-mldp-policy%>", str);
> >> > +      return false;
> >> > +    }
> >> > +
> >> > +  return true;
> >> > +}
> >> > +
> >> > +/* Validate a command-line -mstp-policy option.  Parse the policy
> >> > +   specified in STR and throw errors if appropriate.  */
> >> > +
> >> > +static bool
> >> > +aarch64_parse_stp_policy (const char *str, struct tune_params* tune)
> >> > +{
> >> > +  /* Check the value of the option to be one of the accepted.  */
> >> > +  if (strcmp (str, "always") == 0)
> >> > +      tune->stp_policy_model = tune_params::STP_POLICY_ALWAYS;
> >> > +  else if (strcmp (str, "never") == 0)
> >> > +      tune->stp_policy_model = tune_params::STP_POLICY_NEVER;
> >> > +  else if (strcmp (str, "aligned") == 0)
> >> > +      tune->stp_policy_model = tune_params::STP_POLICY_ALIGNED;
> >> > +  else if (strcmp (str, "default") != 0)
> >> > +    {
> >> > +      error ("unknown value %qs for %<-mstp-policy%>", str);
> >> > +      return false;
> >> > +    }
> >> > +
> >> > +  return true;
> >> > +}
> >> > +
> >> >  /* Parse TOKEN, which has length LENGTH to see if it is an option
> >> >     described in FLAG.  If it is, return the index bit for that
> fusion type.
> >> >     If not, error (printing OPTION_NAME) and return zero.  */
> >> > @@ -18074,6 +18165,14 @@ aarch64_override_options_internal (struct
> gcc_options *opts)
> >> >      aarch64_parse_override_string
> (opts->x_aarch64_override_tune_string,
> >> >                                  &aarch64_tune_params);
> >> >
> >> > +  if (opts->x_aarch64_ldp_policy_string)
> >> > +    aarch64_parse_ldp_policy (opts->x_aarch64_ldp_policy_string,
> >> > +                           &aarch64_tune_params);
> >> > +
> >> > +  if (opts->x_aarch64_stp_policy_string)
> >> > +    aarch64_parse_stp_policy (opts->x_aarch64_stp_policy_string,
> >> > +                           &aarch64_tune_params);
> >> > +
> >> >    /* This target defaults to strict volatile bitfields.  */
> >> >    if (opts->x_flag_strict_volatile_bitfields < 0 &&
> abi_version_at_least (2))
> >> >      opts->x_flag_strict_volatile_bitfields = 1;
> >> > @@ -26382,18 +26481,14 @@ aarch64_operands_ok_for_ldpstp (rtx
> *operands, bool load,
> >> >    enum reg_class rclass_1, rclass_2;
> >> >    rtx mem_1, mem_2, reg_1, reg_2;
> >> >
> >> > -  /* Allow the tuning structure to disable LDP instruction formation
> >> > -     from combining instructions (e.g., in peephole2).
> >> > -     TODO: Implement fine-grained tuning control for LDP and STP:
> >> > -        1. control policies for load and store separately;
> >> > -        2. support the following policies:
> >> > -           - default (use what is in the tuning structure)
> >> > -           - always
> >> > -           - never
> >> > -           - aligned (only if the compiler can prove that the
> >> > -             load will be aligned to 2 * element_size)  */
> >> > -  if (load && (aarch64_tune_params.extra_tuning_flags
> >> > -            & AARCH64_EXTRA_TUNE_NO_LDP_COMBINE))
> >> > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> >> > +  if (load
> >> > +      && aarch64_tune_params.ldp_policy_model ==
> tune_params::LDP_POLICY_NEVER)
> >> > +    return false;
> >> > +
> >> > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> >> > +  if (!load
> >> > +      && aarch64_tune_params.stp_policy_model ==
> tune_params::STP_POLICY_NEVER)
> >> >      return false;
> >>
> >> ...here we could do something like:
> >>
> >>   auto policy = (load
> >>                  ? aarch64_tune_params.ldp_policy_model
> >>                  : aarch64_tune_params.stp_policy_model);
> >>
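As an illustrative aside only: with a single policy variable like that, the two separate "never" checks (and the two "aligned" checks further down) could each collapse into one, assuming ldp_policy_model and stp_policy_model share one enum type so the conditional has a single type.  The AARCH64_LDP_STP_POLICY_* spellings below are placeholders for that shared enum, not necessarily the committed names; mem_1, mode and optimize_size are the locals already used in the quoted function.

  auto policy = (load
                 ? aarch64_tune_params.ldp_policy_model
                 : aarch64_tune_params.stp_policy_model);

  /* Reject the pair outright under a "never" policy.  */
  if (policy == AARCH64_LDP_STP_POLICY_NEVER)
    return false;

  /* Under an "aligned" policy, require the first access to be aligned
     to at least twice the mode alignment.  */
  if (policy == AARCH64_LDP_STP_POLICY_ALIGNED
      && !optimize_size
      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
    return false;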
> >> Also:
> >>
> >> >
> >> >    if (load)
> >> > @@ -26420,13 +26515,22 @@ aarch64_operands_ok_for_ldpstp (rtx
> *operands, bool load,
> >> >    if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
> >> >      return false;
> >> >
> >> > -  /* If we have SImode and slow unaligned ldp,
> >> > -     check the alignment to be at least 8 byte. */
> >> > -  if (mode == SImode
> >> > -      && (aarch64_tune_params.extra_tuning_flags
> >> > -          & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> >> > +  /* If we have LDP_POLICY_ALIGNED,
> >> > +     do not emit the load pair unless the alignment is checked to be
> >> > +     at least double the alignment of the type.  */
> >> > +  if (load
> >> > +      && aarch64_tune_params.ldp_policy_model ==
> tune_params::LDP_POLICY_ALIGNED
> >> >        && !optimize_size
> >> > -      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
> >> > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> >> > +    return false;
> >> > +
> >> > +  /* If we have STP_POLICY_ALIGNED,
> >> > +     do not emit the store pair unless the alignment is checked to be
> >> > +     at least double the alignment of the type.  */
> >> > +  if (!load
> >> > +      && aarch64_tune_params.stp_policy_model ==
> tune_params::STP_POLICY_ALIGNED
> >> > +      && !optimize_size
> >> > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> >> >      return false;
> >> >
> >> >    /* Check if the addresses are in the form of [base+offset].  */
> >> > @@ -26556,6 +26660,16 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx
> *operands, bool load,
> >> >    HOST_WIDE_INT offvals[num_insns], msize;
> >> >    rtx mem[num_insns], reg[num_insns], base[num_insns],
> offset[num_insns];
> >> >
> >> > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> >> > +  if (load
> >> > +      && aarch64_tune_params.ldp_policy_model ==
> tune_params::LDP_POLICY_NEVER)
> >> > +    return false;
> >> > +
> >> > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> >> > +  if (!load
> >> > +      && aarch64_tune_params.stp_policy_model ==
> tune_params::STP_POLICY_NEVER)
> >> > +    return false;
> >> > +
> >> >    if (load)
> >> >      {
> >> >        for (int i = 0; i < num_insns; i++)
> >> > @@ -26645,13 +26759,22 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx
> *operands, bool load,
> >> >    if (offvals[0] % msize != offvals[2] % msize)
> >> >      return false;
> >> >
> >> > -  /* If we have SImode and slow unaligned ldp,
> >> > -     check the alignment to be at least 8 byte. */
> >> > -  if (mode == SImode
> >> > -      && (aarch64_tune_params.extra_tuning_flags
> >> > -       & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> >> > +  /* If we have LDP_POLICY_ALIGNED,
> >> > +     do not emit the load pair unless the alignment is checked to be
> >> > +     at least double the alignment of the type.  */
> >> > +  if (load
> >> > +      && aarch64_tune_params.ldp_policy_model ==
> tune_params::LDP_POLICY_ALIGNED
> >> > +      && !optimize_size
> >> > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> >> > +    return false;
> >> > +
> >> > +  /* If we have STP_POLICY_ALIGNED,
> >> > +     do not emit the store pair unless the alignment is checked to be
> >> > +     at least double the alignment of the type.  */
> >> > +  if (!load
> >> > +      && aarch64_tune_params.stp_policy_model ==
> tune_params::STP_POLICY_ALIGNED
> >> >        && !optimize_size
> >> > -      && MEM_ALIGN (mem[0]) < 8 * BITS_PER_UNIT)
> >> > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> >> >      return false;
> >> >
> >> >    return true;
> >> > diff --git a/gcc/config/aarch64/aarch64.opt
> b/gcc/config/aarch64/aarch64.opt
> >> > index 4a0580435a8..e5302947ce7 100644
> >> > --- a/gcc/config/aarch64/aarch64.opt
> >> > +++ b/gcc/config/aarch64/aarch64.opt
> >> > @@ -205,6 +205,14 @@ msign-return-address=
> >> >  Target WarnRemoved RejectNegative Joined Enum(aarch_ra_sign_scope_t)
> Var(aarch_ra_sign_scope) Init(AARCH_FUNCTION_NONE) Save
> >> >  Select return address signing scope.
> >> >
> >> > +mldp-policy=
> >> > +Target RejectNegative Joined Var(aarch64_ldp_policy_string) Save
> >> > +Fine-grained policy for load pairs.
> >> > +
> >> > +mstp-policy=
> >> > +Target RejectNegative Joined Var(aarch64_stp_policy_string) Save
> >> > +Fine-grained policy for store pairs.
> >> > +
> >> >  Enum
> >> >  Name(aarch_ra_sign_scope_t) Type(enum aarch_function_type)
> >> >  Supported AArch64 return address signing scope (for use with
> -msign-return-address= option):
> >> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> >> > new file mode 100644
> >> > index 00000000000..6e29b265168
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> >> > @@ -0,0 +1,66 @@
> >> > +/* { dg-options "-O2 -mldp-policy=aligned -mcpu=generic" } */
> >> > +
> >> > +#include <stdlib.h>
> >> > +#include <stdint.h>
> >> > +
> >> > +typedef int v4si __attribute__ ((vector_size (16)));
> >> > +
> >> > +#define LDP_TEST_ALIGNED(TYPE) \
> >> > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    a_0 = arr[0]; \
> >> > +    a_1 = arr[1]; \
> >> > +    return a_0 + a_1; \
> >> > +}
> >> > +
> >> > +#define LDP_TEST_UNALIGNED(TYPE) \
> >> > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a_0 = a[0]; \
> >> > +    a_1 = a[1]; \
> >> > +    return a_0 + a_1; \
> >> > +}
> >> > +
> >> > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> >> > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    a_0 = arr[100]; \
> >> > +    a_1 = arr[101]; \
> >> > +    a_2 = arr[102]; \
> >> > +    a_3 = arr[103]; \
> >> > +    a_4 = arr[110]; \
> >> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> >> > +}
> >> > +
> >> > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> >> > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a_0 = a[100]; \
> >> > +    a_1 = a[101]; \
> >> > +    a_2 = a[102]; \
> >> > +    a_3 = a[103]; \
> >> > +    a_4 = a[110]; \
> >> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> >> > +}
> >> > +
> >> > +LDP_TEST_ALIGNED(int32_t);
> >> > +LDP_TEST_ALIGNED(int64_t);
> >> > +LDP_TEST_ALIGNED(v4si);
> >> > +LDP_TEST_UNALIGNED(int32_t);
> >> > +LDP_TEST_UNALIGNED(int64_t);
> >> > +LDP_TEST_UNALIGNED(v4si);
> >> > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> >> > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> >> > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> >> > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> >> > +
> >> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 }
> } */
> >> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 }
> } */
> >> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 }
> } */
> >>
> >> It might be better to split this into two tests, one for the aligned
> >> accesses and one for the unaligned accesses.  Same for the store version.
> >> (Splitting isn't necessary or useful for =always and =never though.)
> >>
> >> Thanks,
> >> Richard
> >>
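A rough sketch of what that split could look like for the =aligned run, based only on the expectations quoted above (file names and counts are a guess, not the committed tests): an ldp_aligned.c keeping the LDP_TEST_ALIGNED and LDP_TEST_ADJUST_ALIGNED instantiations with

  /* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
  /* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
  /* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */

and an ldp_unaligned.c keeping the *_UNALIGNED variants, where all three counts would be expected to be 0 under -mldp-policy=aligned.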
> >> > +
> >> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> >> > new file mode 100644
> >> > index 00000000000..d2c4cf343e9
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> >> > @@ -0,0 +1,66 @@
> >> > +/* { dg-options "-O2 -mldp-policy=always -mcpu=generic" } */
> >> > +
> >> > +#include <stdlib.h>
> >> > +#include <stdint.h>
> >> > +
> >> > +typedef int v4si __attribute__ ((vector_size (16)));
> >> > +
> >> > +#define LDP_TEST_ALIGNED(TYPE) \
> >> > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    a_0 = arr[0]; \
> >> > +    a_1 = arr[1]; \
> >> > +    return a_0 + a_1; \
> >> > +}
> >> > +
> >> > +#define LDP_TEST_UNALIGNED(TYPE) \
> >> > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a_0 = a[0]; \
> >> > +    a_1 = a[1]; \
> >> > +    return a_0 + a_1; \
> >> > +}
> >> > +
> >> > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> >> > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    a_0 = arr[100]; \
> >> > +    a_1 = arr[101]; \
> >> > +    a_2 = arr[102]; \
> >> > +    a_3 = arr[103]; \
> >> > +    a_4 = arr[110]; \
> >> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> >> > +}
> >> > +
> >> > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> >> > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a_0 = a[100]; \
> >> > +    a_1 = a[101]; \
> >> > +    a_2 = a[102]; \
> >> > +    a_3 = a[103]; \
> >> > +    a_4 = a[110]; \
> >> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> >> > +}
> >> > +
> >> > +LDP_TEST_ALIGNED(int32_t);
> >> > +LDP_TEST_ALIGNED(int64_t);
> >> > +LDP_TEST_ALIGNED(v4si);
> >> > +LDP_TEST_UNALIGNED(int32_t);
> >> > +LDP_TEST_UNALIGNED(int64_t);
> >> > +LDP_TEST_UNALIGNED(v4si);
> >> > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> >> > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> >> > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> >> > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> >> > +
> >> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 6 }
> } */
> >> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 6 }
> } */
> >> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 2 }
> } */
> >> > +
> >> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> >> > new file mode 100644
> >> > index 00000000000..f8a45ee18be
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> >> > @@ -0,0 +1,66 @@
> >> > +/* { dg-options "-O2 -mldp-policy=never -mcpu=generic" } */
> >> > +
> >> > +#include <stdlib.h>
> >> > +#include <stdint.h>
> >> > +
> >> > +typedef int v4si __attribute__ ((vector_size (16)));
> >> > +
> >> > +#define LDP_TEST_ALIGNED(TYPE) \
> >> > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    a_0 = arr[0]; \
> >> > +    a_1 = arr[1]; \
> >> > +    return a_0 + a_1; \
> >> > +}
> >> > +
> >> > +#define LDP_TEST_UNALIGNED(TYPE) \
> >> > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a_0 = a[0]; \
> >> > +    a_1 = a[1]; \
> >> > +    return a_0 + a_1; \
> >> > +}
> >> > +
> >> > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> >> > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    a_0 = arr[100]; \
> >> > +    a_1 = arr[101]; \
> >> > +    a_2 = arr[102]; \
> >> > +    a_3 = arr[103]; \
> >> > +    a_4 = arr[110]; \
> >> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> >> > +}
> >> > +
> >> > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> >> > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a_0 = a[100]; \
> >> > +    a_1 = a[101]; \
> >> > +    a_2 = a[102]; \
> >> > +    a_3 = a[103]; \
> >> > +    a_4 = a[110]; \
> >> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> >> > +}
> >> > +
> >> > +LDP_TEST_ALIGNED(int32_t);
> >> > +LDP_TEST_ALIGNED(int64_t);
> >> > +LDP_TEST_ALIGNED(v4si);
> >> > +LDP_TEST_UNALIGNED(int32_t);
> >> > +LDP_TEST_UNALIGNED(int64_t);
> >> > +LDP_TEST_UNALIGNED(v4si);
> >> > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> >> > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> >> > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> >> > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> >> > +
> >> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 }
> } */
> >> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 }
> } */
> >> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 }
> } */
> >> > +
> >> > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> >> > new file mode 100644
> >> > index 00000000000..ae47b42efc4
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> >> > @@ -0,0 +1,60 @@
> >> > +/* { dg-options "-O2 -mstp-policy=aligned -mcpu=generic" } */
> >> > +
> >> > +#include <stdlib.h>
> >> > +#include <stdint.h>
> >> > +
> >> > +typedef int v4si __attribute__ ((vector_size (16)));
> >> > +
> >> > +#define STP_TEST_ALIGNED(TYPE) \
> >> > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    arr[0] = x; \
> >> > +    arr[1] = x; \
> >> > +    return arr; \
> >> > +}
> >> > +
> >> > +#define STP_TEST_UNALIGNED(TYPE) \
> >> > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a[0] = x; \
> >> > +    a[1] = x; \
> >> > +    return a; \
> >> > +}
> >> > +
> >> > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> >> > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    arr[100] = x; \
> >> > +    arr[101] = x; \
> >> > +    arr[102] = x; \
> >> > +    arr[103] = x; \
> >> > +    return arr; \
> >> > +}
> >> > +
> >> > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> >> > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a[100] = x; \
> >> > +    a[101] = x; \
> >> > +    a[102] = x; \
> >> > +    a[103] = x; \
> >> > +    return a; \
> >> > +}
> >> > +
> >> > +STP_TEST_ALIGNED(int32_t);
> >> > +STP_TEST_ALIGNED(int64_t);
> >> > +STP_TEST_ALIGNED(v4si);
> >> > +STP_TEST_UNALIGNED(int32_t);
> >> > +STP_TEST_UNALIGNED(int64_t);
> >> > +STP_TEST_UNALIGNED(v4si);
> >> > +STP_TEST_ADJUST_ALIGNED(int32_t);
> >> > +STP_TEST_ADJUST_ALIGNED(int64_t);
> >> > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> >> > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> >> > +
> >> > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 3 }
> } */
> >> > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 3 }
> } */
> >> > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 1 }
> } */
> >> > +
> >> > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_always.c
> b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> >> > new file mode 100644
> >> > index 00000000000..c1c51f9ae88
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> >> > @@ -0,0 +1,60 @@
> >> > +/* { dg-options "-O2 -mstp-policy=always -mcpu=generic" } */
> >> > +
> >> > +#include <stdlib.h>
> >> > +#include <stdint.h>
> >> > +
> >> > +typedef int v4si __attribute__ ((vector_size (16)));
> >> > +
> >> > +#define STP_TEST_ALIGNED(TYPE) \
> >> > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    arr[0] = x; \
> >> > +    arr[1] = x; \
> >> > +    return arr; \
> >> > +}
> >> > +
> >> > +#define STP_TEST_UNALIGNED(TYPE) \
> >> > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a[0] = x; \
> >> > +    a[1] = x; \
> >> > +    return a; \
> >> > +}
> >> > +
> >> > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> >> > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    arr[100] = x; \
> >> > +    arr[101] = x; \
> >> > +    arr[102] = x; \
> >> > +    arr[103] = x; \
> >> > +    return arr; \
> >> > +}
> >> > +
> >> > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> >> > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a[100] = x; \
> >> > +    a[101] = x; \
> >> > +    a[102] = x; \
> >> > +    a[103] = x; \
> >> > +    return a; \
> >> > +}
> >> > +
> >> > +STP_TEST_ALIGNED(int32_t);
> >> > +STP_TEST_ALIGNED(int64_t);
> >> > +STP_TEST_ALIGNED(v4si);
> >> > +STP_TEST_UNALIGNED(int32_t);
> >> > +STP_TEST_UNALIGNED(int64_t);
> >> > +STP_TEST_UNALIGNED(v4si);
> >> > +STP_TEST_ADJUST_ALIGNED(int32_t);
> >> > +STP_TEST_ADJUST_ALIGNED(int64_t);
> >> > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> >> > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> >> > +
> >> > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 6 }
> } */
> >> > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 6 }
> } */
> >> > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 2 }
> } */
> >> > +
> >> > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_never.c
> b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> >> > new file mode 100644
> >> > index 00000000000..c28fcafa0ed
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> >> > @@ -0,0 +1,60 @@
> >> > +/* { dg-options "-O2 -mstp-policy=never -mcpu=generic" } */
> >> > +
> >> > +#include <stdlib.h>
> >> > +#include <stdint.h>
> >> > +
> >> > +typedef int v4si __attribute__ ((vector_size (16)));
> >> > +
> >> > +#define STP_TEST_ALIGNED(TYPE) \
> >> > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    arr[0] = x; \
> >> > +    arr[1] = x; \
> >> > +    return arr; \
> >> > +}
> >> > +
> >> > +#define STP_TEST_UNALIGNED(TYPE) \
> >> > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a[0] = x; \
> >> > +    a[1] = x; \
> >> > +    return a; \
> >> > +}
> >> > +
> >> > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> >> > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    arr[100] = x; \
> >> > +    arr[101] = x; \
> >> > +    arr[102] = x; \
> >> > +    arr[103] = x; \
> >> > +    return arr; \
> >> > +}
> >> > +
> >> > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> >> > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a[100] = x; \
> >> > +    a[101] = x; \
> >> > +    a[102] = x; \
> >> > +    a[103] = x; \
> >> > +    return a; \
> >> > +}
> >> > +
> >> > +STP_TEST_ALIGNED(int32_t);
> >> > +STP_TEST_ALIGNED(int64_t);
> >> > +STP_TEST_ALIGNED(v4si);
> >> > +STP_TEST_UNALIGNED(int32_t);
> >> > +STP_TEST_UNALIGNED(int64_t);
> >> > +STP_TEST_UNALIGNED(v4si);
> >> > +STP_TEST_ADJUST_ALIGNED(int32_t);
> >> > +STP_TEST_ADJUST_ALIGNED(int64_t);
> >> > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> >> > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> >> > +
> >> > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 0 }
> } */
> >> > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 0 }
> } */
> >> > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 0 }
> } */
> >> > +
> >
> >
> >
> > --
> > Manos Anagnostakis | Compiler Engineer |
> > E: manos.anagnostakis@vrull.eu
> >
> > VRULL GmbH | Beatrixgasse 32 1030 Vienna |
> >  W: www.vrull.eu | LinkedIn
>


-- 
Manos Anagnostakis | Compiler Engineer |
E: manos.anagnostakis@vrull.eu

VRULL GmbH | Beatrixgasse 32 1030 Vienna |
 W: www.vrull.eu | LinkedIn <https://www.linkedin.com/company/vrull/>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] aarch64: Fine-grained ldp and stp policies with test-cases.
  2023-09-28 13:33   ` Manos Anagnostakis
  2023-09-28 13:37     ` Philipp Tomsich
@ 2023-09-28 13:42     ` Richard Sandiford
  2023-09-28 13:50       ` Manos Anagnostakis
  1 sibling, 1 reply; 8+ messages in thread
From: Richard Sandiford @ 2023-09-28 13:42 UTC (permalink / raw)
  To: Manos Anagnostakis
  Cc: Kyrylo Tkachov, Philipp Tomsich, gcc-patches, Tamar Christina

Manos Anagnostakis <manos.anagnostakis@vrull.eu> writes:
> Hey Richard,
>
> Thanks for taking the time to review this, but it has been committed since
> yesterday after getting reviewed by Kyrill and Tamar.
>
> Discussions:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631285.html
> https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631300.html
> https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631389.html
>
> Commited version:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631484.html

Sorry about that.  I had v3 being filtered differently and so it went
into a different inbox.

Richard

>
> Manos.
>
> On Thu, Sep 28, 2023 at 4:17 PM Richard Sandiford <richard.sandiford@arm.com>
> wrote:
>
>> Thanks for the patch and sorry for the slow review.
>>
>> Manos Anagnostakis <manos.anagnostakis@vrull.eu> writes:
>> > This patch implements the following TODO in gcc/config/aarch64/aarch64.cc
>> > to provide the requested behaviour for handling ldp and stp:
>> >
>> >   /* Allow the tuning structure to disable LDP instruction formation
>> >      from combining instructions (e.g., in peephole2).
>> >      TODO: Implement fine-grained tuning control for LDP and STP:
>> >            1. control policies for load and store separately;
>> >            2. support the following policies:
>> >               - default (use what is in the tuning structure)
>> >               - always
>> >               - never
>> >               - aligned (only if the compiler can prove that the
>> >                 load will be aligned to 2 * element_size)  */
>> >
>> > It provides two new and concrete command-line options -mldp-policy and -mstp-policy
>> > to give the ability to control load and store policies separately as
>> > stated in part 1 of the TODO.
>> >
>> > The accepted values for both options are:
>> > - default: Use the ldp/stp policy defined in the corresponding tuning
>> >   structure.
>> > - always: Emit ldp/stp regardless of alignment.
>> > - never: Do not emit ldp/stp.
>> > - aligned: In order to emit ldp/stp, first check if the load/store will
>> >   be aligned to 2 * element_size.
>> >
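For concreteness, exercising both options together could look like the following (the source file name is arbitrary; the option names and values are the ones added by this patch):

  gcc -O2 -mcpu=generic -mldp-policy=aligned -mstp-policy=never -S foo.c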
>> > gcc/ChangeLog:
>> >         * config/aarch64/aarch64-protos.h (struct tune_params): Add
>> >       appropriate enums for the policies.
>> >         * config/aarch64/aarch64-tuning-flags.def
>> >       (AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
>> >       options.
>> >         * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
>> >       function to parse ldp-policy option.
>> >         (aarch64_parse_stp_policy): New function to parse stp-policy option.
>> >         (aarch64_override_options_internal): Call parsing functions.
>> >         (aarch64_operands_ok_for_ldpstp): Add option-value check and
>> >       alignment check and remove superseded ones.
>> >         (aarch64_operands_adjust_ok_for_ldpstp): Add option-value check and
>> >       alignment check and remove superseded ones.
>> >         * config/aarch64/aarch64.opt: Add options.
>> >
>> > gcc/testsuite/ChangeLog:
>> >         * gcc.target/aarch64/ldp_aligned.c: New test.
>> >         * gcc.target/aarch64/ldp_always.c: New test.
>> >         * gcc.target/aarch64/ldp_never.c: New test.
>> >         * gcc.target/aarch64/stp_aligned.c: New test.
>> >         * gcc.target/aarch64/stp_always.c: New test.
>> >         * gcc.target/aarch64/stp_never.c: New test.
>> >
>> > Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
>> > ---
>> > Changes in v2:
>> >         - Fixed committed ldp tests to correctly trigger
>> >           and test aarch64_operands_adjust_ok_for_ldpstp in aarch64.cc.
>> >         - Added "-mcpu=generic" to committed tests to guarantee generic target code
>> >           generation and not cause the regressions of v1.
>> >
>> >  gcc/config/aarch64/aarch64-protos.h           |  24 ++
>> >  gcc/config/aarch64/aarch64-tuning-flags.def   |   8 -
>> >  gcc/config/aarch64/aarch64.cc                 | 229 ++++++++++++++----
>> >  gcc/config/aarch64/aarch64.opt                |   8 +
>> >  .../gcc.target/aarch64/ldp_aligned.c          |  66 +++++
>> >  gcc/testsuite/gcc.target/aarch64/ldp_always.c |  66 +++++
>> >  gcc/testsuite/gcc.target/aarch64/ldp_never.c  |  66 +++++
>> >  .../gcc.target/aarch64/stp_aligned.c          |  60 +++++
>> >  gcc/testsuite/gcc.target/aarch64/stp_always.c |  60 +++++
>> >  gcc/testsuite/gcc.target/aarch64/stp_never.c  |  60 +++++
>> >  10 files changed, 586 insertions(+), 61 deletions(-)
>> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
>> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_always.c
>> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_never.c
>> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_aligned.c
>> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_always.c
>> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_never.c
>> >
>> > diff --git a/gcc/config/aarch64/aarch64-protos.h
>> b/gcc/config/aarch64/aarch64-protos.h
>> > index 70303d6fd95..be1d73490ed 100644
>> > --- a/gcc/config/aarch64/aarch64-protos.h
>> > +++ b/gcc/config/aarch64/aarch64-protos.h
>> > @@ -568,6 +568,30 @@ struct tune_params
>> >    /* Place prefetch struct pointer at the end to enable type checking
>> >       errors when tune_params misses elements (e.g., from erroneous
>> merges).  */
>> >    const struct cpu_prefetch_tune *prefetch;
>> > +/* An enum specifying how to handle load pairs using a fine-grained
>> policy:
>> > +   - LDP_POLICY_ALIGNED: Emit ldp if the source pointer is aligned
>> > +   to at least double the alignment of the type.
>> > +   - LDP_POLICY_ALWAYS: Emit ldp regardless of alignment.
>> > +   - LDP_POLICY_NEVER: Do not emit ldp.  */
>> > +
>> > +  enum aarch64_ldp_policy_model
>> > +  {
>> > +    LDP_POLICY_ALIGNED,
>> > +    LDP_POLICY_ALWAYS,
>> > +    LDP_POLICY_NEVER
>> > +  } ldp_policy_model;
>> > +/* An enum specifying how to handle store pairs using a fine-grained
>> policy:
>> > +   - STP_POLICY_ALIGNED: Emit stp if the source pointer is aligned
>> > +   to at least double the alignment of the type.
>> > +   - STP_POLICY_ALWAYS: Emit stp regardless of alignment.
>> > +   - STP_POLICY_NEVER: Do not emit stp.  */
>> > +
>> > +  enum aarch64_stp_policy_model
>> > +  {
>> > +    STP_POLICY_ALIGNED,
>> > +    STP_POLICY_ALWAYS,
>> > +    STP_POLICY_NEVER
>> > +  } stp_policy_model;
>> >  };
>>
>> Generally the patch looks really good.  But I think we can use a single
>> enum type for both LDP and STP, with the values having the prefix
>> AARCH64_LDP_STP_POLICY.  That means that we only need one parser,
>> and that:
>>
>> >  /* Classifies an address.
>> > diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def
>> b/gcc/config/aarch64/aarch64-tuning-flags.def
>> > index 52112ba7c48..774568e9106 100644
>> > --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>> > +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>> > @@ -30,11 +30,6 @@
>> >
>> >  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
>> >
>> > -/* Don't create non-8 byte aligned load/store pair.  That is if the
>> > -two load/stores are not at least 8 byte aligned don't create load/store
>> > -pairs.   */
>> > -AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)
>> > -
>> >  /* Some of the optional shift to some arthematic instructions are
>> >     considered cheap.  Logical shift left <=4 with or without a
>> >     zero extend are considered cheap.  Sign extend; non logical shift
>> left
>> > @@ -44,9 +39,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend",
>> CHEAP_SHIFT_EXTEND)
>> >  /* Disallow load/store pair instructions on Q-registers.  */
>> >  AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)
>> >
>> > -/* Disallow load-pair instructions to be formed in combine/peephole.  */
>> > -AARCH64_EXTRA_TUNING_OPTION ("no_ldp_combine", NO_LDP_COMBINE)
>> > -
>> >  AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
>> >
>> >  AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants",
>> CSE_SVE_VL_CONSTANTS)
>> > diff --git a/gcc/config/aarch64/aarch64.cc
>> b/gcc/config/aarch64/aarch64.cc
>> > index eba5d4a7e04..43d88c68647 100644
>> > --- a/gcc/config/aarch64/aarch64.cc
>> > +++ b/gcc/config/aarch64/aarch64.cc
>> > @@ -1356,7 +1356,9 @@ static const struct tune_params generic_tunings =
>> >       Neoverse V1.  It does not have a noticeable effect on A64FX and
>> should
>> >       have at most a very minor effect on SVE2 cores.  */
>> >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS), /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params cortexa35_tunings =
>> > @@ -1390,7 +1392,9 @@ static const struct tune_params cortexa35_tunings =
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params cortexa53_tunings =
>> > @@ -1424,7 +1428,9 @@ static const struct tune_params cortexa53_tunings =
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params cortexa57_tunings =
>> > @@ -1458,7 +1464,9 @@ static const struct tune_params cortexa57_tunings =
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),      /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params cortexa72_tunings =
>> > @@ -1492,7 +1500,9 @@ static const struct tune_params cortexa72_tunings =
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params cortexa73_tunings =
>> > @@ -1526,7 +1536,9 @@ static const struct tune_params cortexa73_tunings =
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >
>> > @@ -1561,7 +1573,9 @@ static const struct tune_params exynosm1_tunings =
>> >    48,        /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
>> > -  &exynosm1_prefetch_tune
>> > +  &exynosm1_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params thunderxt88_tunings =
>> > @@ -1593,8 +1607,10 @@ static const struct tune_params
>> thunderxt88_tunings =
>> >    2, /* min_div_recip_mul_df.  */
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_OFF,   /* autoprefetcher_model.  */
>> > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),  /* tune_flags.  */
>> > -  &thunderxt88_prefetch_tune
>> > +  (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
>> > +  &thunderxt88_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALIGNED,   /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALIGNED    /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params thunderx_tunings =
>> > @@ -1626,9 +1642,10 @@ static const struct tune_params thunderx_tunings =
>> >    2, /* min_div_recip_mul_df.  */
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_OFF,   /* autoprefetcher_model.  */
>> > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW
>> > -   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND), /* tune_flags.  */
>> > -  &thunderx_prefetch_tune
>> > +  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
>> > +  &thunderx_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALIGNED,   /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALIGNED    /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params tsv110_tunings =
>> > @@ -1662,7 +1679,9 @@ static const struct tune_params tsv110_tunings =
>> >    0,    /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,     /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NONE),     /* tune_flags.  */
>> > -  &tsv110_prefetch_tune
>> > +  &tsv110_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params xgene1_tunings =
>> > @@ -1695,7 +1714,9 @@ static const struct tune_params xgene1_tunings =
>> >    17,        /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_OFF,   /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),     /* tune_flags.  */
>> > -  &xgene1_prefetch_tune
>> > +  &xgene1_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params emag_tunings =
>> > @@ -1728,7 +1749,9 @@ static const struct tune_params emag_tunings =
>> >    17,        /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_OFF,   /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),     /* tune_flags.  */
>> > -  &xgene1_prefetch_tune
>> > +  &xgene1_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params qdf24xx_tunings =
>> > @@ -1762,7 +1785,9 @@ static const struct tune_params qdf24xx_tunings =
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> >    AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS, /* tune_flags.  */
>> > -  &qdf24xx_prefetch_tune
>> > +  &qdf24xx_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  /* Tuning structure for the Qualcomm Saphira core.  Default to falkor
>> values
>> > @@ -1798,7 +1823,9 @@ static const struct tune_params saphira_tunings =
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NONE),         /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params thunderx2t99_tunings =
>> > @@ -1832,7 +1859,9 @@ static const struct tune_params
>> thunderx2t99_tunings =
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
>> > -  &thunderx2t99_prefetch_tune
>> > +  &thunderx2t99_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params thunderx3t110_tunings =
>> > @@ -1866,7 +1895,9 @@ static const struct tune_params
>> thunderx3t110_tunings =
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
>> > -  &thunderx3t110_prefetch_tune
>> > +  &thunderx3t110_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params neoversen1_tunings =
>> > @@ -1899,7 +1930,9 @@ static const struct tune_params neoversen1_tunings
>> =
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params ampere1_tunings =
>> > @@ -1935,8 +1968,10 @@ static const struct tune_params ampere1_tunings =
>> >    2, /* min_div_recip_mul_df.  */
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE),       /* tune_flags.  */
>> > -  &ampere1_prefetch_tune
>> > +  (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
>> > +  &ampere1_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALIGNED,   /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALIGNED    /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params ampere1a_tunings =
>> > @@ -1973,8 +2008,10 @@ static const struct tune_params ampere1a_tunings =
>> >    2, /* min_div_recip_mul_df.  */
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE),       /* tune_flags.  */
>> > -  &ampere1_prefetch_tune
>> > +  (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
>> > +  &ampere1_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALIGNED,   /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALIGNED    /* stp_policy_model.  */
>> >  };
>> >
>> >  static const advsimd_vec_cost neoversev1_advsimd_vector_cost =
>> > @@ -2155,7 +2192,9 @@ static const struct tune_params neoversev1_tunings
>> =
>> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>> >     | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND), /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const sve_vec_cost neoverse512tvb_sve_vector_cost =
>> > @@ -2292,7 +2331,9 @@ static const struct tune_params
>> neoverse512tvb_tunings =
>> >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const advsimd_vec_cost neoversen2_advsimd_vector_cost =
>> > @@ -2482,7 +2523,9 @@ static const struct tune_params neoversen2_tunings
>> =
>> >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const advsimd_vec_cost neoversev2_advsimd_vector_cost =
>> > @@ -2672,7 +2715,9 @@ static const struct tune_params neoversev2_tunings
>> =
>> >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>> > -  &generic_prefetch_tune
>> > +  &generic_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  static const struct tune_params a64fx_tunings =
>> > @@ -2705,7 +2750,9 @@ static const struct tune_params a64fx_tunings =
>> >    0, /* max_case_values.  */
>> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
>> > -  &a64fx_prefetch_tune
>> > +  &a64fx_prefetch_tune,
>> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
>> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
>> >  };
>> >
>> >  /* Support for fine-grained override of the tuning structures.  */
>> > @@ -17726,6 +17773,50 @@ aarch64_parse_tune (const char *to_parse, const
>> struct processor **res)
>> >    return AARCH_PARSE_INVALID_ARG;
>> >  }
>> >
>> > +/* Validate a command-line -mldp-policy option.  Parse the policy
>> > +   specified in STR and throw errors if appropriate.  */
>> > +
>> > +static bool
>> > +aarch64_parse_ldp_policy (const char *str, struct tune_params* tune)
>> > +{
>> > +  /* Check the value of the option to be one of the accepted.  */
>> > +  if (strcmp (str, "always") == 0)
>> > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALWAYS;
>> > +  else if (strcmp (str, "never") == 0)
>> > +    tune->ldp_policy_model = tune_params::LDP_POLICY_NEVER;
>> > +  else if (strcmp (str, "aligned") == 0)
>> > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALIGNED;
>> > +  else if (strcmp (str, "default") != 0)
>> > +    {
>> > +      error ("unknown value %qs for %<-mldp-policy%>", str);
>> > +      return false;
>> > +    }
>> > +
>> > +  return true;
>> > +}
>> > +
>> > +/* Validate a command-line -mstp-policy option.  Parse the policy
>> > +   specified in STR and throw errors if appropriate.  */
>> > +
>> > +static bool
>> > +aarch64_parse_stp_policy (const char *str, struct tune_params* tune)
>> > +{
>> > +  /* Check the value of the option to be one of the accepted.  */
>> > +  if (strcmp (str, "always") == 0)
>> > +      tune->stp_policy_model = tune_params::STP_POLICY_ALWAYS;
>> > +  else if (strcmp (str, "never") == 0)
>> > +      tune->stp_policy_model = tune_params::STP_POLICY_NEVER;
>> > +  else if (strcmp (str, "aligned") == 0)
>> > +      tune->stp_policy_model = tune_params::STP_POLICY_ALIGNED;
>> > +  else if (strcmp (str, "default") != 0)
>> > +    {
>> > +      error ("unknown value %qs for %<-mstp-policy%>", str);
>> > +      return false;
>> > +    }
>> > +
>> > +  return true;
>> > +}
>> > +
>> >  /* Parse TOKEN, which has length LENGTH to see if it is an option
>> >     described in FLAG.  If it is, return the index bit for that fusion
>> type.
>> >     If not, error (printing OPTION_NAME) and return zero.  */
>> > @@ -18074,6 +18165,14 @@ aarch64_override_options_internal (struct
>> gcc_options *opts)
>> >      aarch64_parse_override_string (opts->x_aarch64_override_tune_string,
>> >                                  &aarch64_tune_params);
>> >
>> > +  if (opts->x_aarch64_ldp_policy_string)
>> > +    aarch64_parse_ldp_policy (opts->x_aarch64_ldp_policy_string,
>> > +                           &aarch64_tune_params);
>> > +
>> > +  if (opts->x_aarch64_stp_policy_string)
>> > +    aarch64_parse_stp_policy (opts->x_aarch64_stp_policy_string,
>> > +                           &aarch64_tune_params);
>> > +
>> >    /* This target defaults to strict volatile bitfields.  */
>> >    if (opts->x_flag_strict_volatile_bitfields < 0 && abi_version_at_least (2))
>> >      opts->x_flag_strict_volatile_bitfields = 1;
>> > @@ -26382,18 +26481,14 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
>> >    enum reg_class rclass_1, rclass_2;
>> >    rtx mem_1, mem_2, reg_1, reg_2;
>> >
>> > -  /* Allow the tuning structure to disable LDP instruction formation
>> > -     from combining instructions (e.g., in peephole2).
>> > -     TODO: Implement fine-grained tuning control for LDP and STP:
>> > -        1. control policies for load and store separately;
>> > -        2. support the following policies:
>> > -           - default (use what is in the tuning structure)
>> > -           - always
>> > -           - never
>> > -           - aligned (only if the compiler can prove that the
>> > -             load will be aligned to 2 * element_size)  */
>> > -  if (load && (aarch64_tune_params.extra_tuning_flags
>> > -            & AARCH64_EXTRA_TUNE_NO_LDP_COMBINE))
>> > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
>> > +  if (load
>> > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
>> > +    return false;
>> > +
>> > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
>> > +  if (!load
>> > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
>> >      return false;
>>
>> ...here we could do something like:
>>
>>   auto policy = (load
>>                  ? aarch64_tune_params.ldp_policy_model
>>                  : aarch64_tune_params.stp_policy_model);
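>> 
>> and then, as a rough untested sketch (AARCH64_LDP_STP_POLICY_NEVER here
>> only illustrates the single-enum prefix suggested earlier, not a final
>> name):
>> 
>>   /* One check now covers both the load and the store case.  */
>>   if (policy == AARCH64_LDP_STP_POLICY_NEVER)
>>     return false;
>> 
>> with the same POLICY variable feeding the later "aligned" test.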
>>
>> Also:
>>
>> >
>> >    if (load)
>> > @@ -26420,13 +26515,22 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
>> >    if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
>> >      return false;
>> >
>> > -  /* If we have SImode and slow unaligned ldp,
>> > -     check the alignment to be at least 8 byte. */
>> > -  if (mode == SImode
>> > -      && (aarch64_tune_params.extra_tuning_flags
>> > -          & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
>> > +  /* If we have LDP_POLICY_ALIGNED,
>> > +     do not emit the load pair unless the alignment is checked to be
>> > +     at least double the alignment of the type.  */
>> > +  if (load
>> > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
>> >        && !optimize_size
>> > -      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
>> > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
>> > +    return false;
>> > +
>> > +  /* If we have STP_POLICY_ALIGNED,
>> > +     do not emit the store pair unless the alignment is checked to be
>> > +     at least double the alignment of the type.  */
>> > +  if (!load
>> > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
>> > +      && !optimize_size
>> > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
>> >      return false;
>> >
>> >    /* Check if the addresses are in the form of [base+offset].  */
>> > @@ -26556,6 +26660,16 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
>> >    HOST_WIDE_INT offvals[num_insns], msize;
>> >    rtx mem[num_insns], reg[num_insns], base[num_insns], offset[num_insns];
>> >
>> > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
>> > +  if (load
>> > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_NEVER)
>> > +    return false;
>> > +
>> > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
>> > +  if (!load
>> > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_NEVER)
>> > +    return false;
>> > +
>> >    if (load)
>> >      {
>> >        for (int i = 0; i < num_insns; i++)
>> > @@ -26645,13 +26759,22 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, bool load,
>> >    if (offvals[0] % msize != offvals[2] % msize)
>> >      return false;
>> >
>> > -  /* If we have SImode and slow unaligned ldp,
>> > -     check the alignment to be at least 8 byte. */
>> > -  if (mode == SImode
>> > -      && (aarch64_tune_params.extra_tuning_flags
>> > -       & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
>> > +  /* If we have LDP_POLICY_ALIGNED,
>> > +     do not emit the load pair unless the alignment is checked to be
>> > +     at least double the alignment of the type.  */
>> > +  if (load
>> > +      && aarch64_tune_params.ldp_policy_model == tune_params::LDP_POLICY_ALIGNED
>> > +      && !optimize_size
>> > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
>> > +    return false;
>> > +
>> > +  /* If we have STP_POLICY_ALIGNED,
>> > +     do not emit the store pair unless the alignment is checked to be
>> > +     at least double the alignment of the type.  */
>> > +  if (!load
>> > +      && aarch64_tune_params.stp_policy_model == tune_params::STP_POLICY_ALIGNED
>> >        && !optimize_size
>> > -      && MEM_ALIGN (mem[0]) < 8 * BITS_PER_UNIT)
>> > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
>> >      return false;
>> >
>> >    return true;
>> > diff --git a/gcc/config/aarch64/aarch64.opt
>> b/gcc/config/aarch64/aarch64.opt
>> > index 4a0580435a8..e5302947ce7 100644
>> > --- a/gcc/config/aarch64/aarch64.opt
>> > +++ b/gcc/config/aarch64/aarch64.opt
>> > @@ -205,6 +205,14 @@ msign-return-address=
>> >  Target WarnRemoved RejectNegative Joined Enum(aarch_ra_sign_scope_t)
>> Var(aarch_ra_sign_scope) Init(AARCH_FUNCTION_NONE) Save
>> >  Select return address signing scope.
>> >
>> > +mldp-policy=
>> > +Target RejectNegative Joined Var(aarch64_ldp_policy_string) Save
>> > +Fine-grained policy for load pairs.
>> > +
>> > +mstp-policy=
>> > +Target RejectNegative Joined Var(aarch64_stp_policy_string) Save
>> > +Fine-grained policy for store pairs.
>> > +
>> >  Enum
>> >  Name(aarch_ra_sign_scope_t) Type(enum aarch_function_type)
>> >  Supported AArch64 return address signing scope (for use with
>> -msign-return-address= option):
>> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
>> b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
>> > new file mode 100644
>> > index 00000000000..6e29b265168
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
>> > @@ -0,0 +1,66 @@
>> > +/* { dg-options "-O2 -mldp-policy=aligned -mcpu=generic" } */
>> > +
>> > +#include <stdlib.h>
>> > +#include <stdint.h>
>> > +
>> > +typedef int v4si __attribute__ ((vector_size (16)));
>> > +
>> > +#define LDP_TEST_ALIGNED(TYPE) \
>> > +TYPE ldp_aligned_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    a_0 = arr[0]; \
>> > +    a_1 = arr[1]; \
>> > +    return a_0 + a_1; \
>> > +}
>> > +
>> > +#define LDP_TEST_UNALIGNED(TYPE) \
>> > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a_0 = a[0]; \
>> > +    a_1 = a[1]; \
>> > +    return a_0 + a_1; \
>> > +}
>> > +
>> > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
>> > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    a_0 = arr[100]; \
>> > +    a_1 = arr[101]; \
>> > +    a_2 = arr[102]; \
>> > +    a_3 = arr[103]; \
>> > +    a_4 = arr[110]; \
>> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
>> > +}
>> > +
>> > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
>> > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a_0 = a[100]; \
>> > +    a_1 = a[101]; \
>> > +    a_2 = a[102]; \
>> > +    a_3 = a[103]; \
>> > +    a_4 = a[110]; \
>> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
>> > +}
>> > +
>> > +LDP_TEST_ALIGNED(int32_t);
>> > +LDP_TEST_ALIGNED(int64_t);
>> > +LDP_TEST_ALIGNED(v4si);
>> > +LDP_TEST_UNALIGNED(int32_t);
>> > +LDP_TEST_UNALIGNED(int64_t);
>> > +LDP_TEST_UNALIGNED(v4si);
>> > +LDP_TEST_ADJUST_ALIGNED(int32_t);
>> > +LDP_TEST_ADJUST_ALIGNED(int64_t);
>> > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
>> > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
>> > +
>> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
>> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
>> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
>>
>> It might be better to split this into two tests, one for the aligned
>> accesses and one for the unaligned accesses.  Same for the store version.
>> (Splitting isn't necessary or useful for =always and =never though.)
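>> 
>> As a sketch of what the split could look like (file and function names
>> are placeholders and the expected counts would need re-checking), the
>> aligned half might reduce to something like:
>> 
>>   /* ldp_aligned.c */
>>   /* { dg-options "-O2 -mldp-policy=aligned -mcpu=generic" } */
>>   #include <stdint.h>
>>   int64_t f (char *ptr)
>>   {
>>     /* Mask the pointer so the compiler can prove ample alignment.  */
>>     int64_t *arr = (int64_t *) ((uintptr_t) ptr & ~(2 * 8 * _Alignof (int64_t) - 1));
>>     return arr[0] + arr[1];
>>   }
>>   /* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 1 } } */
>> 
>> while the unaligned half keeps the arr+1 variants and expects a count of
>> 0 for the same scan patterns.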
>>
>> Thanks,
>> Richard
>>
>> > +
>> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_always.c
>> b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
>> > new file mode 100644
>> > index 00000000000..d2c4cf343e9
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
>> > @@ -0,0 +1,66 @@
>> > +/* { dg-options "-O2 -mldp-policy=always -mcpu=generic" } */
>> > +
>> > +#include <stdlib.h>
>> > +#include <stdint.h>
>> > +
>> > +typedef int v4si __attribute__ ((vector_size (16)));
>> > +
>> > +#define LDP_TEST_ALIGNED(TYPE) \
>> > +TYPE ldp_aligned_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    a_0 = arr[0]; \
>> > +    a_1 = arr[1]; \
>> > +    return a_0 + a_1; \
>> > +}
>> > +
>> > +#define LDP_TEST_UNALIGNED(TYPE) \
>> > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a_0 = a[0]; \
>> > +    a_1 = a[1]; \
>> > +    return a_0 + a_1; \
>> > +}
>> > +
>> > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
>> > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    a_0 = arr[100]; \
>> > +    a_1 = arr[101]; \
>> > +    a_2 = arr[102]; \
>> > +    a_3 = arr[103]; \
>> > +    a_4 = arr[110]; \
>> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
>> > +}
>> > +
>> > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
>> > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a_0 = a[100]; \
>> > +    a_1 = a[101]; \
>> > +    a_2 = a[102]; \
>> > +    a_3 = a[103]; \
>> > +    a_4 = a[110]; \
>> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
>> > +}
>> > +
>> > +LDP_TEST_ALIGNED(int32_t);
>> > +LDP_TEST_ALIGNED(int64_t);
>> > +LDP_TEST_ALIGNED(v4si);
>> > +LDP_TEST_UNALIGNED(int32_t);
>> > +LDP_TEST_UNALIGNED(int64_t);
>> > +LDP_TEST_UNALIGNED(v4si);
>> > +LDP_TEST_ADJUST_ALIGNED(int32_t);
>> > +LDP_TEST_ADJUST_ALIGNED(int64_t);
>> > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
>> > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
>> > +
>> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 6 } } */
>> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 6 } } */
>> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 2 } } */
>> > +
>> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_never.c
>> b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
>> > new file mode 100644
>> > index 00000000000..f8a45ee18be
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
>> > @@ -0,0 +1,66 @@
>> > +/* { dg-options "-O2 -mldp-policy=never -mcpu=generic" } */
>> > +
>> > +#include <stdlib.h>
>> > +#include <stdint.h>
>> > +
>> > +typedef int v4si __attribute__ ((vector_size (16)));
>> > +
>> > +#define LDP_TEST_ALIGNED(TYPE) \
>> > +TYPE ldp_aligned_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    a_0 = arr[0]; \
>> > +    a_1 = arr[1]; \
>> > +    return a_0 + a_1; \
>> > +}
>> > +
>> > +#define LDP_TEST_UNALIGNED(TYPE) \
>> > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a_0 = a[0]; \
>> > +    a_1 = a[1]; \
>> > +    return a_0 + a_1; \
>> > +}
>> > +
>> > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
>> > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    a_0 = arr[100]; \
>> > +    a_1 = arr[101]; \
>> > +    a_2 = arr[102]; \
>> > +    a_3 = arr[103]; \
>> > +    a_4 = arr[110]; \
>> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
>> > +}
>> > +
>> > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
>> > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
>> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a_0 = a[100]; \
>> > +    a_1 = a[101]; \
>> > +    a_2 = a[102]; \
>> > +    a_3 = a[103]; \
>> > +    a_4 = a[110]; \
>> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
>> > +}
>> > +
>> > +LDP_TEST_ALIGNED(int32_t);
>> > +LDP_TEST_ALIGNED(int64_t);
>> > +LDP_TEST_ALIGNED(v4si);
>> > +LDP_TEST_UNALIGNED(int32_t);
>> > +LDP_TEST_UNALIGNED(int64_t);
>> > +LDP_TEST_UNALIGNED(v4si);
>> > +LDP_TEST_ADJUST_ALIGNED(int32_t);
>> > +LDP_TEST_ADJUST_ALIGNED(int64_t);
>> > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
>> > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
>> > +
>> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
>> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
>> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
>> > +
>> > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
>> b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
>> > new file mode 100644
>> > index 00000000000..ae47b42efc4
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
>> > @@ -0,0 +1,60 @@
>> > +/* { dg-options "-O2 -mstp-policy=aligned -mcpu=generic" } */
>> > +
>> > +#include <stdlib.h>
>> > +#include <stdint.h>
>> > +
>> > +typedef int v4si __attribute__ ((vector_size (16)));
>> > +
>> > +#define STP_TEST_ALIGNED(TYPE) \
>> > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    arr[0] = x; \
>> > +    arr[1] = x; \
>> > +    return arr; \
>> > +}
>> > +
>> > +#define STP_TEST_UNALIGNED(TYPE) \
>> > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a[0] = x; \
>> > +    a[1] = x; \
>> > +    return a; \
>> > +}
>> > +
>> > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
>> > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    arr[100] = x; \
>> > +    arr[101] = x; \
>> > +    arr[102] = x; \
>> > +    arr[103] = x; \
>> > +    return arr; \
>> > +}
>> > +
>> > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
>> > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a[100] = x; \
>> > +    a[101] = x; \
>> > +    a[102] = x; \
>> > +    a[103] = x; \
>> > +    return a; \
>> > +}
>> > +
>> > +STP_TEST_ALIGNED(int32_t);
>> > +STP_TEST_ALIGNED(int64_t);
>> > +STP_TEST_ALIGNED(v4si);
>> > +STP_TEST_UNALIGNED(int32_t);
>> > +STP_TEST_UNALIGNED(int64_t);
>> > +STP_TEST_UNALIGNED(v4si);
>> > +STP_TEST_ADJUST_ALIGNED(int32_t);
>> > +STP_TEST_ADJUST_ALIGNED(int64_t);
>> > +STP_TEST_ADJUST_UNALIGNED(int32_t);
>> > +STP_TEST_ADJUST_UNALIGNED(int64_t);
>> > +
>> > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 3 } } */
>> > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 3 } } */
>> > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 1 } } */
>> > +
>> > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_always.c
>> b/gcc/testsuite/gcc.target/aarch64/stp_always.c
>> > new file mode 100644
>> > index 00000000000..c1c51f9ae88
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/aarch64/stp_always.c
>> > @@ -0,0 +1,60 @@
>> > +/* { dg-options "-O2 -mstp-policy=always -mcpu=generic" } */
>> > +
>> > +#include <stdlib.h>
>> > +#include <stdint.h>
>> > +
>> > +typedef int v4si __attribute__ ((vector_size (16)));
>> > +
>> > +#define STP_TEST_ALIGNED(TYPE) \
>> > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    arr[0] = x; \
>> > +    arr[1] = x; \
>> > +    return arr; \
>> > +}
>> > +
>> > +#define STP_TEST_UNALIGNED(TYPE) \
>> > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a[0] = x; \
>> > +    a[1] = x; \
>> > +    return a; \
>> > +}
>> > +
>> > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
>> > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    arr[100] = x; \
>> > +    arr[101] = x; \
>> > +    arr[102] = x; \
>> > +    arr[103] = x; \
>> > +    return arr; \
>> > +}
>> > +
>> > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
>> > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a[100] = x; \
>> > +    a[101] = x; \
>> > +    a[102] = x; \
>> > +    a[103] = x; \
>> > +    return a; \
>> > +}
>> > +
>> > +STP_TEST_ALIGNED(int32_t);
>> > +STP_TEST_ALIGNED(int64_t);
>> > +STP_TEST_ALIGNED(v4si);
>> > +STP_TEST_UNALIGNED(int32_t);
>> > +STP_TEST_UNALIGNED(int64_t);
>> > +STP_TEST_UNALIGNED(v4si);
>> > +STP_TEST_ADJUST_ALIGNED(int32_t);
>> > +STP_TEST_ADJUST_ALIGNED(int64_t);
>> > +STP_TEST_ADJUST_UNALIGNED(int32_t);
>> > +STP_TEST_ADJUST_UNALIGNED(int64_t);
>> > +
>> > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 6 } } */
>> > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 6 } } */
>> > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 2 } } */
>> > +
>> > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_never.c
>> b/gcc/testsuite/gcc.target/aarch64/stp_never.c
>> > new file mode 100644
>> > index 00000000000..c28fcafa0ed
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/aarch64/stp_never.c
>> > @@ -0,0 +1,60 @@
>> > +/* { dg-options "-O2 -mstp-policy=never -mcpu=generic" } */
>> > +
>> > +#include <stdlib.h>
>> > +#include <stdint.h>
>> > +
>> > +typedef int v4si __attribute__ ((vector_size (16)));
>> > +
>> > +#define STP_TEST_ALIGNED(TYPE) \
>> > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    arr[0] = x; \
>> > +    arr[1] = x; \
>> > +    return arr; \
>> > +}
>> > +
>> > +#define STP_TEST_UNALIGNED(TYPE) \
>> > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a[0] = x; \
>> > +    a[1] = x; \
>> > +    return a; \
>> > +}
>> > +
>> > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
>> > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    arr[100] = x; \
>> > +    arr[101] = x; \
>> > +    arr[102] = x; \
>> > +    arr[103] = x; \
>> > +    return arr; \
>> > +}
>> > +
>> > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
>> > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
>> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
>> 1)); \
>> > +    TYPE *a = arr+1; \
>> > +    a[100] = x; \
>> > +    a[101] = x; \
>> > +    a[102] = x; \
>> > +    a[103] = x; \
>> > +    return a; \
>> > +}
>> > +
>> > +STP_TEST_ALIGNED(int32_t);
>> > +STP_TEST_ALIGNED(int64_t);
>> > +STP_TEST_ALIGNED(v4si);
>> > +STP_TEST_UNALIGNED(int32_t);
>> > +STP_TEST_UNALIGNED(int64_t);
>> > +STP_TEST_UNALIGNED(v4si);
>> > +STP_TEST_ADJUST_ALIGNED(int32_t);
>> > +STP_TEST_ADJUST_ALIGNED(int64_t);
>> > +STP_TEST_ADJUST_UNALIGNED(int32_t);
>> > +STP_TEST_ADJUST_UNALIGNED(int64_t);
>> > +
>> > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 0 } } */
>> > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 0 } } */
>> > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 0 } } */
>> > +
>>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] aarch64: Fine-grained ldp and stp policies with test-cases.
  2023-09-28 13:42     ` Richard Sandiford
@ 2023-09-28 13:50       ` Manos Anagnostakis
  0 siblings, 0 replies; 8+ messages in thread
From: Manos Anagnostakis @ 2023-09-28 13:50 UTC (permalink / raw)
  To: Manos Anagnostakis, Kyrylo Tkachov, Philipp Tomsich, gcc-patches,
	Tamar Christina, richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 49104 bytes --]

No problem!

I'll send a follow-up with the requested changes.

Thanks for the input!

Manos.

On Thu, Sep 28, 2023 at 4:42 PM Richard Sandiford <richard.sandiford@arm.com>
wrote:

> Manos Anagnostakis <manos.anagnostakis@vrull.eu> writes:
> > Hey Richard,
> >
> > Thanks for taking the time to review this, but it has been committed since
> > yesterday after getting reviewed by Kyrill and Tamar.
> >
> > Discussions:
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631285.html
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631300.html
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631389.html
> >
> > Committed version:
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631484.html
>
> Sorry about that.  I had v3 filtered differently and so it went
> into a different inbox.
>
> Richard
>
> >
> > Manos.
> >
> > On Thu, Sep 28, 2023 at 4:17 PM Richard Sandiford <richard.sandiford@arm.com>
> > wrote:
> >
> >> Thanks for the patch and sorry for the slow review.
> >>
> >> Manos Anagnostakis <manos.anagnostakis@vrull.eu> writes:
> >> > This patch implements the following TODO in gcc/config/aarch64/aarch64.cc
> >> > to provide the requested behaviour for handling ldp and stp:
> >> >
> >> >   /* Allow the tuning structure to disable LDP instruction formation
> >> >      from combining instructions (e.g., in peephole2).
> >> >      TODO: Implement fine-grained tuning control for LDP and STP:
> >> >            1. control policies for load and store separately;
> >> >            2. support the following policies:
> >> >               - default (use what is in the tuning structure)
> >> >               - always
> >> >               - never
> >> >               - aligned (only if the compiler can prove that the
> >> >                 load will be aligned to 2 * element_size)  */
> >> >
> >> > It provides two new and concrete command-line options -mldp-policy and -mstp-policy
> >> > to give the ability to control load and store policies separately as
> >> > stated in part 1 of the TODO.
> >> >
> >> > The accepted values for both options are:
> >> > - default: Use the ldp/stp policy defined in the corresponding tuning
> >> >   structure.
> >> > - always: Emit ldp/stp regardless of alignment.
> >> > - never: Do not emit ldp/stp.
> >> >   - aligned: In order to emit ldp/stp, first check if the load/store will
> >> >   be aligned to 2 * element_size.
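> >> >
> >> > For example (an illustrative invocation only, with test.c as a placeholder
> >> > file name), a build that wants load pairs only when alignment can be
> >> > proven, and no store pairs at all, could use:
> >> >
> >> >   gcc -O2 -mcpu=generic -mldp-policy=aligned -mstp-policy=never test.c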
> >> >
> >> > gcc/ChangeLog:
> >> >         * config/aarch64/aarch64-protos.h (struct tune_params): Add
> >> >       appropriate enums for the policies.
> >> >         * config/aarch64/aarch64-tuning-flags.def
> >> >       (AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
> >> >       options.
> >> >         * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
> >> >       function to parse ldp-policy option.
> >> >         (aarch64_parse_stp_policy): New function to parse stp-policy option.
> >> >         (aarch64_override_options_internal): Call parsing functions.
> >> >         (aarch64_operands_ok_for_ldpstp): Add option-value check and
> >> >       alignment check and remove superseded ones.
> >> >         (aarch64_operands_adjust_ok_for_ldpstp): Add option-value check and
> >> >       alignment check and remove superseded ones.
> >> >         * config/aarch64/aarch64.opt: Add options.
> >> >
> >> > gcc/testsuite/ChangeLog:
> >> >         * gcc.target/aarch64/ldp_aligned.c: New test.
> >> >         * gcc.target/aarch64/ldp_always.c: New test.
> >> >         * gcc.target/aarch64/ldp_never.c: New test.
> >> >         * gcc.target/aarch64/stp_aligned.c: New test.
> >> >         * gcc.target/aarch64/stp_always.c: New test.
> >> >         * gcc.target/aarch64/stp_never.c: New test.
> >> >
> >> > Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> >> > ---
> >> > Changes in v2:
> >> >         - Fixed committed ldp tests to correctly trigger
> >> >           and test aarch64_operands_adjust_ok_for_ldpstp in aarch64.cc.
> >> >         - Added "-mcpu=generic" to committed tests to guarantee generic target code
> >> >           generation and not cause the regressions of v1.
> >> >
> >> >  gcc/config/aarch64/aarch64-protos.h           |  24 ++
> >> >  gcc/config/aarch64/aarch64-tuning-flags.def   |   8 -
> >> >  gcc/config/aarch64/aarch64.cc                 | 229 ++++++++++++++----
> >> >  gcc/config/aarch64/aarch64.opt                |   8 +
> >> >  .../gcc.target/aarch64/ldp_aligned.c          |  66 +++++
> >> >  gcc/testsuite/gcc.target/aarch64/ldp_always.c |  66 +++++
> >> >  gcc/testsuite/gcc.target/aarch64/ldp_never.c  |  66 +++++
> >> >  .../gcc.target/aarch64/stp_aligned.c          |  60 +++++
> >> >  gcc/testsuite/gcc.target/aarch64/stp_always.c |  60 +++++
> >> >  gcc/testsuite/gcc.target/aarch64/stp_never.c  |  60 +++++
> >> >  10 files changed, 586 insertions(+), 61 deletions(-)
> >> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> >> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_always.c
> >> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_never.c
> >> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> >> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_always.c
> >> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_never.c
> >> >
> >> > diff --git a/gcc/config/aarch64/aarch64-protos.h
> >> b/gcc/config/aarch64/aarch64-protos.h
> >> > index 70303d6fd95..be1d73490ed 100644
> >> > --- a/gcc/config/aarch64/aarch64-protos.h
> >> > +++ b/gcc/config/aarch64/aarch64-protos.h
> >> > @@ -568,6 +568,30 @@ struct tune_params
> >> >    /* Place prefetch struct pointer at the end to enable type checking
> >> >       errors when tune_params misses elements (e.g., from erroneous
> >> merges).  */
> >> >    const struct cpu_prefetch_tune *prefetch;
> >> > +/* An enum specifying how to handle load pairs using a fine-grained
> >> policy:
> >> > +   - LDP_POLICY_ALIGNED: Emit ldp if the source pointer is aligned
> >> > +   to at least double the alignment of the type.
> >> > +   - LDP_POLICY_ALWAYS: Emit ldp regardless of alignment.
> >> > +   - LDP_POLICY_NEVER: Do not emit ldp.  */
> >> > +
> >> > +  enum aarch64_ldp_policy_model
> >> > +  {
> >> > +    LDP_POLICY_ALIGNED,
> >> > +    LDP_POLICY_ALWAYS,
> >> > +    LDP_POLICY_NEVER
> >> > +  } ldp_policy_model;
> >> > +/* An enum specifying how to handle store pairs using a fine-grained
> >> policy:
> >> > +   - STP_POLICY_ALIGNED: Emit stp if the source pointer is aligned
> >> > +   to at least double the alignment of the type.
> >> > +   - STP_POLICY_ALWAYS: Emit stp regardless of alignment.
> >> > +   - STP_POLICY_NEVER: Do not emit stp.  */
> >> > +
> >> > +  enum aarch64_stp_policy_model
> >> > +  {
> >> > +    STP_POLICY_ALIGNED,
> >> > +    STP_POLICY_ALWAYS,
> >> > +    STP_POLICY_NEVER
> >> > +  } stp_policy_model;
> >> >  };
> >>
> >> Generally the patch looks really good.  But I think we can use a single
> >> enum type for both LDP and STP, with the values having the prefix
> >> AARCH64_LDP_STP_POLICY.  That means that we only need one parser,
> >> and that:
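> >>
> >> (For illustration, such a shared enum might look roughly like the
> >> following; this is a sketch only, not a prescription for the final
> >> spelling:
> >>
> >>   /* One policy enum serving both load pairs and store pairs.  */
> >>   enum aarch64_ldp_stp_policy
> >>   {
> >>     AARCH64_LDP_STP_POLICY_DEFAULT,
> >>     AARCH64_LDP_STP_POLICY_ALIGNED,
> >>     AARCH64_LDP_STP_POLICY_ALWAYS,
> >>     AARCH64_LDP_STP_POLICY_NEVER
> >>   };
> >>
> >> with tune_params then carrying two members of this one type.)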
> >>
> >> >  /* Classifies an address.
> >> > diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def
> >> b/gcc/config/aarch64/aarch64-tuning-flags.def
> >> > index 52112ba7c48..774568e9106 100644
> >> > --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> >> > +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> >> > @@ -30,11 +30,6 @@
> >> >
> >> >  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
> >> >
> >> > -/* Don't create non-8 byte aligned load/store pair.  That is if the
> >> > -two load/stores are not at least 8 byte aligned don't create
> load/store
> >> > -pairs.   */
> >> > -AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw",
> SLOW_UNALIGNED_LDPW)
> >> > -
> >> >  /* Some of the optional shift to some arthematic instructions are
> >> >     considered cheap.  Logical shift left <=4 with or without a
> >> >     zero extend are considered cheap.  Sign extend; non logical shift
> >> left
> >> > @@ -44,9 +39,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend",
> >> CHEAP_SHIFT_EXTEND)
> >> >  /* Disallow load/store pair instructions on Q-registers.  */
> >> >  AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)
> >> >
> >> > -/* Disallow load-pair instructions to be formed in
> combine/peephole.  */
> >> > -AARCH64_EXTRA_TUNING_OPTION ("no_ldp_combine", NO_LDP_COMBINE)
> >> > -
> >> >  AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
> >> >
> >> >  AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants",
> >> CSE_SVE_VL_CONSTANTS)
> >> > diff --git a/gcc/config/aarch64/aarch64.cc
> >> b/gcc/config/aarch64/aarch64.cc
> >> > index eba5d4a7e04..43d88c68647 100644
> >> > --- a/gcc/config/aarch64/aarch64.cc
> >> > +++ b/gcc/config/aarch64/aarch64.cc
> >> > @@ -1356,7 +1356,9 @@ static const struct tune_params generic_tunings
> =
> >> >       Neoverse V1.  It does not have a noticeable effect on A64FX and
> >> should
> >> >       have at most a very minor effect on SVE2 cores.  */
> >> >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS), /* tune_flags.  */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params cortexa35_tunings =
> >> > @@ -1390,7 +1392,9 @@ static const struct tune_params
> cortexa35_tunings =
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params cortexa53_tunings =
> >> > @@ -1424,7 +1428,9 @@ static const struct tune_params
> cortexa53_tunings =
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params cortexa57_tunings =
> >> > @@ -1458,7 +1464,9 @@ static const struct tune_params
> cortexa57_tunings =
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),      /* tune_flags.  */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params cortexa72_tunings =
> >> > @@ -1492,7 +1500,9 @@ static const struct tune_params
> cortexa72_tunings =
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params cortexa73_tunings =
> >> > @@ -1526,7 +1536,9 @@ static const struct tune_params
> cortexa73_tunings =
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >
> >> > @@ -1561,7 +1573,9 @@ static const struct tune_params
> exynosm1_tunings =
> >> >    48,        /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> >> > -  &exynosm1_prefetch_tune
> >> > +  &exynosm1_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params thunderxt88_tunings =
> >> > @@ -1593,8 +1607,10 @@ static const struct tune_params
> >> thunderxt88_tunings =
> >> >    2, /* min_div_recip_mul_df.  */
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_OFF,   /* autoprefetcher_model.  */
> >> > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW),  /* tune_flags.  */
> >> > -  &thunderxt88_prefetch_tune
> >> > +  (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> >> > +  &thunderxt88_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALIGNED,   /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALIGNED    /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params thunderx_tunings =
> >> > @@ -1626,9 +1642,10 @@ static const struct tune_params
> thunderx_tunings =
> >> >    2, /* min_div_recip_mul_df.  */
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_OFF,   /* autoprefetcher_model.  */
> >> > -  (AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW
> >> > -   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND), /* tune_flags.  */
> >> > -  &thunderx_prefetch_tune
> >> > +  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> >> > +  &thunderx_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALIGNED,   /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALIGNED    /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params tsv110_tunings =
> >> > @@ -1662,7 +1679,9 @@ static const struct tune_params tsv110_tunings =
> >> >    0,    /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,     /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NONE),     /* tune_flags.  */
> >> > -  &tsv110_prefetch_tune
> >> > +  &tsv110_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params xgene1_tunings =
> >> > @@ -1695,7 +1714,9 @@ static const struct tune_params xgene1_tunings =
> >> >    17,        /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_OFF,   /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),     /* tune_flags.  */
> >> > -  &xgene1_prefetch_tune
> >> > +  &xgene1_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params emag_tunings =
> >> > @@ -1728,7 +1749,9 @@ static const struct tune_params emag_tunings =
> >> >    17,        /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_OFF,   /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS),     /* tune_flags.  */
> >> > -  &xgene1_prefetch_tune
> >> > +  &xgene1_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params qdf24xx_tunings =
> >> > @@ -1762,7 +1785,9 @@ static const struct tune_params qdf24xx_tunings
> =
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> >    AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS, /* tune_flags.  */
> >> > -  &qdf24xx_prefetch_tune
> >> > +  &qdf24xx_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  /* Tuning structure for the Qualcomm Saphira core.  Default to falkor
> >> values
> >> > @@ -1798,7 +1823,9 @@ static const struct tune_params saphira_tunings
> =
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NONE),         /* tune_flags.  */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params thunderx2t99_tunings =
> >> > @@ -1832,7 +1859,9 @@ static const struct tune_params
> >> thunderx2t99_tunings =
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> >> > -  &thunderx2t99_prefetch_tune
> >> > +  &thunderx2t99_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params thunderx3t110_tunings =
> >> > @@ -1866,7 +1895,9 @@ static const struct tune_params
> >> thunderx3t110_tunings =
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> >> > -  &thunderx3t110_prefetch_tune
> >> > +  &thunderx3t110_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params neoversen1_tunings =
> >> > @@ -1899,7 +1930,9 @@ static const struct tune_params
> neoversen1_tunings
> >> =
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params ampere1_tunings =
> >> > @@ -1935,8 +1968,10 @@ static const struct tune_params
> ampere1_tunings =
> >> >    2, /* min_div_recip_mul_df.  */
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE),       /* tune_flags.  */
> >> > -  &ampere1_prefetch_tune
> >> > +  (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> >> > +  &ampere1_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALIGNED,   /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALIGNED    /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params ampere1a_tunings =
> >> > @@ -1973,8 +2008,10 @@ static const struct tune_params
> ampere1a_tunings =
> >> >    2, /* min_div_recip_mul_df.  */
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> > -  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE),       /* tune_flags.  */
> >> > -  &ampere1_prefetch_tune
> >> > +  (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> >> > +  &ampere1_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALIGNED,   /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALIGNED    /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const advsimd_vec_cost neoversev1_advsimd_vector_cost =
> >> > @@ -2155,7 +2192,9 @@ static const struct tune_params
> neoversev1_tunings
> >> =
> >> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >> >     | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND), /* tune_flags.  */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const sve_vec_cost neoverse512tvb_sve_vector_cost =
> >> > @@ -2292,7 +2331,9 @@ static const struct tune_params
> >> neoverse512tvb_tunings =
> >> >    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.
> */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const advsimd_vec_cost neoversen2_advsimd_vector_cost =
> >> > @@ -2482,7 +2523,9 @@ static const struct tune_params
> neoversen2_tunings
> >> =
> >> >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.
> */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const advsimd_vec_cost neoversev2_advsimd_vector_cost =
> >> > @@ -2672,7 +2715,9 @@ static const struct tune_params
> neoversev2_tunings
> >> =
> >> >     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >> >     | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >> >     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.
> */
> >> > -  &generic_prefetch_tune
> >> > +  &generic_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  static const struct tune_params a64fx_tunings =
> >> > @@ -2705,7 +2750,9 @@ static const struct tune_params a64fx_tunings =
> >> >    0, /* max_case_values.  */
> >> >    tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >> >    (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
> >> > -  &a64fx_prefetch_tune
> >> > +  &a64fx_prefetch_tune,
> >> > +  tune_params::LDP_POLICY_ALWAYS,    /* ldp_policy_model.  */
> >> > +  tune_params::STP_POLICY_ALWAYS     /* stp_policy_model.  */
> >> >  };
> >> >
> >> >  /* Support for fine-grained override of the tuning structures.  */
> >> > @@ -17726,6 +17773,50 @@ aarch64_parse_tune (const char *to_parse,
> const
> >> struct processor **res)
> >> >    return AARCH_PARSE_INVALID_ARG;
> >> >  }
> >> >
> >> > +/* Validate a command-line -mldp-policy option.  Parse the policy
> >> > +   specified in STR and throw errors if appropriate.  */
> >> > +
> >> > +static bool
> >> > +aarch64_parse_ldp_policy (const char *str, struct tune_params* tune)
> >> > +{
> >> > +  /* Check the value of the option to be one of the accepted.  */
> >> > +  if (strcmp (str, "always") == 0)
> >> > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALWAYS;
> >> > +  else if (strcmp (str, "never") == 0)
> >> > +    tune->ldp_policy_model = tune_params::LDP_POLICY_NEVER;
> >> > +  else if (strcmp (str, "aligned") == 0)
> >> > +    tune->ldp_policy_model = tune_params::LDP_POLICY_ALIGNED;
> >> > +  else if (strcmp (str, "default") != 0)
> >> > +    {
> >> > +      error ("unknown value %qs for %<-mldp-policy%>", str);
> >> > +      return false;
> >> > +    }
> >> > +
> >> > +  return true;
> >> > +}
> >> > +
> >> > +/* Validate a command-line -mstp-policy option.  Parse the policy
> >> > +   specified in STR and throw errors if appropriate.  */
> >> > +
> >> > +static bool
> >> > +aarch64_parse_stp_policy (const char *str, struct tune_params* tune)
> >> > +{
> >> > +  /* Check the value of the option to be one of the accepted.  */
> >> > +  if (strcmp (str, "always") == 0)
> >> > +      tune->stp_policy_model = tune_params::STP_POLICY_ALWAYS;
> >> > +  else if (strcmp (str, "never") == 0)
> >> > +      tune->stp_policy_model = tune_params::STP_POLICY_NEVER;
> >> > +  else if (strcmp (str, "aligned") == 0)
> >> > +      tune->stp_policy_model = tune_params::STP_POLICY_ALIGNED;
> >> > +  else if (strcmp (str, "default") != 0)
> >> > +    {
> >> > +      error ("unknown value %qs for %<-mstp-policy%>", str);
> >> > +      return false;
> >> > +    }
> >> > +
> >> > +  return true;
> >> > +}
> >> > +
> >> >  /* Parse TOKEN, which has length LENGTH to see if it is an option
> >> >     described in FLAG.  If it is, return the index bit for that fusion
> >> type.
> >> >     If not, error (printing OPTION_NAME) and return zero.  */
> >> > @@ -18074,6 +18165,14 @@ aarch64_override_options_internal (struct
> >> gcc_options *opts)
> >> >      aarch64_parse_override_string
> (opts->x_aarch64_override_tune_string,
> >> >                                  &aarch64_tune_params);
> >> >
> >> > +  if (opts->x_aarch64_ldp_policy_string)
> >> > +    aarch64_parse_ldp_policy (opts->x_aarch64_ldp_policy_string,
> >> > +                           &aarch64_tune_params);
> >> > +
> >> > +  if (opts->x_aarch64_stp_policy_string)
> >> > +    aarch64_parse_stp_policy (opts->x_aarch64_stp_policy_string,
> >> > +                           &aarch64_tune_params);
> >> > +
> >> >    /* This target defaults to strict volatile bitfields.  */
> >> >    if (opts->x_flag_strict_volatile_bitfields < 0 &&
> >> abi_version_at_least (2))
> >> >      opts->x_flag_strict_volatile_bitfields = 1;
> >> > @@ -26382,18 +26481,14 @@ aarch64_operands_ok_for_ldpstp (rtx
> *operands,
> >> bool load,
> >> >    enum reg_class rclass_1, rclass_2;
> >> >    rtx mem_1, mem_2, reg_1, reg_2;
> >> >
> >> > -  /* Allow the tuning structure to disable LDP instruction formation
> >> > -     from combining instructions (e.g., in peephole2).
> >> > -     TODO: Implement fine-grained tuning control for LDP and STP:
> >> > -        1. control policies for load and store separately;
> >> > -        2. support the following policies:
> >> > -           - default (use what is in the tuning structure)
> >> > -           - always
> >> > -           - never
> >> > -           - aligned (only if the compiler can prove that the
> >> > -             load will be aligned to 2 * element_size)  */
> >> > -  if (load && (aarch64_tune_params.extra_tuning_flags
> >> > -            & AARCH64_EXTRA_TUNE_NO_LDP_COMBINE))
> >> > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> >> > +  if (load
> >> > +      && aarch64_tune_params.ldp_policy_model ==
> >> tune_params::LDP_POLICY_NEVER)
> >> > +    return false;
> >> > +
> >> > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> >> > +  if (!load
> >> > +      && aarch64_tune_params.stp_policy_model ==
> >> tune_params::STP_POLICY_NEVER)
> >> >      return false;
> >>
> >> ...here we could do something like:
> >>
> >>   auto policy = (load
> >>                  ? aarch64_tune_params.ldp_policy_model
> >>                  : aarch64_tune_params.stp_policy_model);
> >>
> >> Also:
> >>
> >> >
> >> >    if (load)
> >> > @@ -26420,13 +26515,22 @@ aarch64_operands_ok_for_ldpstp (rtx
> *operands,
> >> bool load,
> >> >    if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
> >> >      return false;
> >> >
> >> > -  /* If we have SImode and slow unaligned ldp,
> >> > -     check the alignment to be at least 8 byte. */
> >> > -  if (mode == SImode
> >> > -      && (aarch64_tune_params.extra_tuning_flags
> >> > -          & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> >> > +  /* If we have LDP_POLICY_ALIGNED,
> >> > +     do not emit the load pair unless the alignment is checked to be
> >> > +     at least double the alignment of the type.  */
> >> > +  if (load
> >> > +      && aarch64_tune_params.ldp_policy_model ==
> >> tune_params::LDP_POLICY_ALIGNED
> >> >        && !optimize_size
> >> > -      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
> >> > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> >> > +    return false;
> >> > +
> >> > +  /* If we have STP_POLICY_ALIGNED,
> >> > +     do not emit the store pair unless the alignment is checked to be
> >> > +     at least double the alignment of the type.  */
> >> > +  if (!load
> >> > +      && aarch64_tune_params.stp_policy_model ==
> >> tune_params::STP_POLICY_ALIGNED
> >> > +      && !optimize_size
> >> > +      && MEM_ALIGN (mem_1) < 2 * GET_MODE_ALIGNMENT (mode))
> >> >      return false;
> >> >
> >> >    /* Check if the addresses are in the form of [base+offset].  */
> >> > @@ -26556,6 +26660,16 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx
> >> *operands, bool load,
> >> >    HOST_WIDE_INT offvals[num_insns], msize;
> >> >    rtx mem[num_insns], reg[num_insns], base[num_insns],
> >> offset[num_insns];
> >> >
> >> > +  /* If we have LDP_POLICY_NEVER, reject the load pair.  */
> >> > +  if (load
> >> > +      && aarch64_tune_params.ldp_policy_model ==
> >> tune_params::LDP_POLICY_NEVER)
> >> > +    return false;
> >> > +
> >> > +  /* If we have STP_POLICY_NEVER, reject the store pair.  */
> >> > +  if (!load
> >> > +      && aarch64_tune_params.stp_policy_model ==
> >> tune_params::STP_POLICY_NEVER)
> >> > +    return false;
> >> > +
> >> >    if (load)
> >> >      {
> >> >        for (int i = 0; i < num_insns; i++)
> >> > @@ -26645,13 +26759,22 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx
> >> *operands, bool load,
> >> >    if (offvals[0] % msize != offvals[2] % msize)
> >> >      return false;
> >> >
> >> > -  /* If we have SImode and slow unaligned ldp,
> >> > -     check the alignment to be at least 8 byte. */
> >> > -  if (mode == SImode
> >> > -      && (aarch64_tune_params.extra_tuning_flags
> >> > -       & AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW)
> >> > +  /* If we have LDP_POLICY_ALIGNED,
> >> > +     do not emit the load pair unless the alignment is checked to be
> >> > +     at least double the alignment of the type.  */
> >> > +  if (load
> >> > +      && aarch64_tune_params.ldp_policy_model ==
> >> tune_params::LDP_POLICY_ALIGNED
> >> > +      && !optimize_size
> >> > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> >> > +    return false;
> >> > +
> >> > +  /* If we have STP_POLICY_ALIGNED,
> >> > +     do not emit the store pair unless the alignment is checked to be
> >> > +     at least double the alignment of the type.  */
> >> > +  if (!load
> >> > +      && aarch64_tune_params.stp_policy_model ==
> >> tune_params::STP_POLICY_ALIGNED
> >> >        && !optimize_size
> >> > -      && MEM_ALIGN (mem[0]) < 8 * BITS_PER_UNIT)
> >> > +      && MEM_ALIGN (mem[0]) < 2 * GET_MODE_ALIGNMENT (mode))
> >> >      return false;
> >> >
> >> >    return true;
> >> > diff --git a/gcc/config/aarch64/aarch64.opt
> >> b/gcc/config/aarch64/aarch64.opt
> >> > index 4a0580435a8..e5302947ce7 100644
> >> > --- a/gcc/config/aarch64/aarch64.opt
> >> > +++ b/gcc/config/aarch64/aarch64.opt
> >> > @@ -205,6 +205,14 @@ msign-return-address=
> >> >  Target WarnRemoved RejectNegative Joined Enum(aarch_ra_sign_scope_t)
> >> Var(aarch_ra_sign_scope) Init(AARCH_FUNCTION_NONE) Save
> >> >  Select return address signing scope.
> >> >
> >> > +mldp-policy=
> >> > +Target RejectNegative Joined Var(aarch64_ldp_policy_string) Save
> >> > +Fine-grained policy for load pairs.
> >> > +
> >> > +mstp-policy=
> >> > +Target RejectNegative Joined Var(aarch64_stp_policy_string) Save
> >> > +Fine-grained policy for store pairs.
> >> > +
> >> >  Enum
> >> >  Name(aarch_ra_sign_scope_t) Type(enum aarch_function_type)
> >> >  Supported AArch64 return address signing scope (for use with
> >> -msign-return-address= option):
> >> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> >> b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> >> > new file mode 100644
> >> > index 00000000000..6e29b265168
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
> >> > @@ -0,0 +1,66 @@
> >> > +/* { dg-options "-O2 -mldp-policy=aligned -mcpu=generic" } */
> >> > +
> >> > +#include <stdlib.h>
> >> > +#include <stdint.h>
> >> > +
> >> > +typedef int v4si __attribute__ ((vector_size (16)));
> >> > +
> >> > +#define LDP_TEST_ALIGNED(TYPE) \
> >> > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> >> 1)); \
> >> > +    a_0 = arr[0]; \
> >> > +    a_1 = arr[1]; \
> >> > +    return a_0 + a_1; \
> >> > +}
> >> > +
> >> > +#define LDP_TEST_UNALIGNED(TYPE) \
> >> > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> >> 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a_0 = a[0]; \
> >> > +    a_1 = a[1]; \
> >> > +    return a_0 + a_1; \
> >> > +}
> >> > +
> >> > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> >> > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> >> 1)); \
> >> > +    a_0 = arr[100]; \
> >> > +    a_1 = arr[101]; \
> >> > +    a_2 = arr[102]; \
> >> > +    a_3 = arr[103]; \
> >> > +    a_4 = arr[110]; \
> >> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> >> > +}
> >> > +
> >> > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> >> > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) -
> >> 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a_0 = a[100]; \
> >> > +    a_1 = a[101]; \
> >> > +    a_2 = a[102]; \
> >> > +    a_3 = a[103]; \
> >> > +    a_4 = a[110]; \
> >> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> >> > +}
> >> > +
> >> > +LDP_TEST_ALIGNED(int32_t);
> >> > +LDP_TEST_ALIGNED(int64_t);
> >> > +LDP_TEST_ALIGNED(v4si);
> >> > +LDP_TEST_UNALIGNED(int32_t);
> >> > +LDP_TEST_UNALIGNED(int64_t);
> >> > +LDP_TEST_UNALIGNED(v4si);
> >> > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> >> > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> >> > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> >> > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> >> > +
> >> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 }
> } */
> >> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 }
> } */
> >> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 }
> } */
> >>
> >> It might be better to split this into two tests, one for the aligned
> >> accesses and one for the unaligned accesses.  Same for the store version.
> >> (Splitting isn't necessary or useful for =always and =never though.)
> >>
> >> Thanks,
> >> Richard
> >>
> >> > +
> >> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> >> b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> >> > new file mode 100644
> >> > index 00000000000..d2c4cf343e9
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_always.c
> >> > @@ -0,0 +1,66 @@
> >> > +/* { dg-options "-O2 -mldp-policy=always -mcpu=generic" } */
> >> > +
> >> > +#include <stdlib.h>
> >> > +#include <stdint.h>
> >> > +
> >> > +typedef int v4si __attribute__ ((vector_size (16)));
> >> > +
> >> > +#define LDP_TEST_ALIGNED(TYPE) \
> >> > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> >> > +    a_0 = arr[0]; \
> >> > +    a_1 = arr[1]; \
> >> > +    return a_0 + a_1; \
> >> > +}
> >> > +
> >> > +#define LDP_TEST_UNALIGNED(TYPE) \
> >> > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a_0 = a[0]; \
> >> > +    a_1 = a[1]; \
> >> > +    return a_0 + a_1; \
> >> > +}
> >> > +
> >> > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> >> > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> >> > +    a_0 = arr[100]; \
> >> > +    a_1 = arr[101]; \
> >> > +    a_2 = arr[102]; \
> >> > +    a_3 = arr[103]; \
> >> > +    a_4 = arr[110]; \
> >> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> >> > +}
> >> > +
> >> > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> >> > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a_0 = a[100]; \
> >> > +    a_1 = a[101]; \
> >> > +    a_2 = a[102]; \
> >> > +    a_3 = a[103]; \
> >> > +    a_4 = a[110]; \
> >> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> >> > +}
> >> > +
> >> > +LDP_TEST_ALIGNED(int32_t);
> >> > +LDP_TEST_ALIGNED(int64_t);
> >> > +LDP_TEST_ALIGNED(v4si);
> >> > +LDP_TEST_UNALIGNED(int32_t);
> >> > +LDP_TEST_UNALIGNED(int64_t);
> >> > +LDP_TEST_UNALIGNED(v4si);
> >> > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> >> > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> >> > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> >> > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> >> > +
> >> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> >> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> >> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> >> > +
> >> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_never.c b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> >> > new file mode 100644
> >> > index 00000000000..f8a45ee18be
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_never.c
> >> > @@ -0,0 +1,66 @@
> >> > +/* { dg-options "-O2 -mldp-policy=never -mcpu=generic" } */
> >> > +
> >> > +#include <stdlib.h>
> >> > +#include <stdint.h>
> >> > +
> >> > +typedef int v4si __attribute__ ((vector_size (16)));
> >> > +
> >> > +#define LDP_TEST_ALIGNED(TYPE) \
> >> > +TYPE ldp_aligned_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> >> > +    a_0 = arr[0]; \
> >> > +    a_1 = arr[1]; \
> >> > +    return a_0 + a_1; \
> >> > +}
> >> > +
> >> > +#define LDP_TEST_UNALIGNED(TYPE) \
> >> > +TYPE ldp_unaligned_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a_0 = a[0]; \
> >> > +    a_1 = a[1]; \
> >> > +    return a_0 + a_1; \
> >> > +}
> >> > +
> >> > +#define LDP_TEST_ADJUST_ALIGNED(TYPE) \
> >> > +TYPE ldp_aligned_adjust_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> >> > +    a_0 = arr[100]; \
> >> > +    a_1 = arr[101]; \
> >> > +    a_2 = arr[102]; \
> >> > +    a_3 = arr[103]; \
> >> > +    a_4 = arr[110]; \
> >> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> >> > +}
> >> > +
> >> > +#define LDP_TEST_ADJUST_UNALIGNED(TYPE) \
> >> > +TYPE ldp_unaligned_adjust_##TYPE(char* ptr){ \
> >> > +    TYPE a_0, a_1, a_2, a_3, a_4; \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a_0 = a[100]; \
> >> > +    a_1 = a[101]; \
> >> > +    a_2 = a[102]; \
> >> > +    a_3 = a[103]; \
> >> > +    a_4 = a[110]; \
> >> > +    return a_0 + a_1 + a_2 + a_3 + a_4; \
> >> > +}
> >> > +
> >> > +LDP_TEST_ALIGNED(int32_t);
> >> > +LDP_TEST_ALIGNED(int64_t);
> >> > +LDP_TEST_ALIGNED(v4si);
> >> > +LDP_TEST_UNALIGNED(int32_t);
> >> > +LDP_TEST_UNALIGNED(int64_t);
> >> > +LDP_TEST_UNALIGNED(v4si);
> >> > +LDP_TEST_ADJUST_ALIGNED(int32_t);
> >> > +LDP_TEST_ADJUST_ALIGNED(int64_t);
> >> > +LDP_TEST_ADJUST_UNALIGNED(int32_t);
> >> > +LDP_TEST_ADJUST_UNALIGNED(int64_t);
> >> > +
> >> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> >> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> >> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> >> > +
> >> > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_aligned.c b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> >> > new file mode 100644
> >> > index 00000000000..ae47b42efc4
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.target/aarch64/stp_aligned.c
> >> > @@ -0,0 +1,60 @@
> >> > +/* { dg-options "-O2 -mstp-policy=aligned -mcpu=generic" } */
> >> > +
> >> > +#include <stdlib.h>
> >> > +#include <stdint.h>
> >> > +
> >> > +typedef int v4si __attribute__ ((vector_size (16)));
> >> > +
> >> > +#define STP_TEST_ALIGNED(TYPE) \
> >> > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> >> > +    arr[0] = x; \
> >> > +    arr[1] = x; \
> >> > +    return arr; \
> >> > +}
> >> > +
> >> > +#define STP_TEST_UNALIGNED(TYPE) \
> >> > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a[0] = x; \
> >> > +    a[1] = x; \
> >> > +    return a; \
> >> > +}
> >> > +
> >> > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> >> > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> >> > +    arr[100] = x; \
> >> > +    arr[101] = x; \
> >> > +    arr[102] = x; \
> >> > +    arr[103] = x; \
> >> > +    return arr; \
> >> > +}
> >> > +
> >> > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> >> > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a[100] = x; \
> >> > +    a[101] = x; \
> >> > +    a[102] = x; \
> >> > +    a[103] = x; \
> >> > +    return a; \
> >> > +}
> >> > +
> >> > +STP_TEST_ALIGNED(int32_t);
> >> > +STP_TEST_ALIGNED(int64_t);
> >> > +STP_TEST_ALIGNED(v4si);
> >> > +STP_TEST_UNALIGNED(int32_t);
> >> > +STP_TEST_UNALIGNED(int64_t);
> >> > +STP_TEST_UNALIGNED(v4si);
> >> > +STP_TEST_ADJUST_ALIGNED(int32_t);
> >> > +STP_TEST_ADJUST_ALIGNED(int64_t);
> >> > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> >> > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> >> > +
> >> > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> >> > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> >> > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> >> > +
> >> > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_always.c b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> >> > new file mode 100644
> >> > index 00000000000..c1c51f9ae88
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.target/aarch64/stp_always.c
> >> > @@ -0,0 +1,60 @@
> >> > +/* { dg-options "-O2 -mstp-policy=always -mcpu=generic" } */
> >> > +
> >> > +#include <stdlib.h>
> >> > +#include <stdint.h>
> >> > +
> >> > +typedef int v4si __attribute__ ((vector_size (16)));
> >> > +
> >> > +#define STP_TEST_ALIGNED(TYPE) \
> >> > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> >> > +    arr[0] = x; \
> >> > +    arr[1] = x; \
> >> > +    return arr; \
> >> > +}
> >> > +
> >> > +#define STP_TEST_UNALIGNED(TYPE) \
> >> > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a[0] = x; \
> >> > +    a[1] = x; \
> >> > +    return a; \
> >> > +}
> >> > +
> >> > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> >> > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> >> > +    arr[100] = x; \
> >> > +    arr[101] = x; \
> >> > +    arr[102] = x; \
> >> > +    arr[103] = x; \
> >> > +    return arr; \
> >> > +}
> >> > +
> >> > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> >> > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a[100] = x; \
> >> > +    a[101] = x; \
> >> > +    a[102] = x; \
> >> > +    a[103] = x; \
> >> > +    return a; \
> >> > +}
> >> > +
> >> > +STP_TEST_ALIGNED(int32_t);
> >> > +STP_TEST_ALIGNED(int64_t);
> >> > +STP_TEST_ALIGNED(v4si);
> >> > +STP_TEST_UNALIGNED(int32_t);
> >> > +STP_TEST_UNALIGNED(int64_t);
> >> > +STP_TEST_UNALIGNED(v4si);
> >> > +STP_TEST_ADJUST_ALIGNED(int32_t);
> >> > +STP_TEST_ADJUST_ALIGNED(int64_t);
> >> > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> >> > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> >> > +
> >> > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 6 } } */
> >> > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 6 } } */
> >> > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 2 } } */
> >> > +
> >> > diff --git a/gcc/testsuite/gcc.target/aarch64/stp_never.c b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> >> > new file mode 100644
> >> > index 00000000000..c28fcafa0ed
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.target/aarch64/stp_never.c
> >> > @@ -0,0 +1,60 @@
> >> > +/* { dg-options "-O2 -mstp-policy=never -mcpu=generic" } */
> >> > +
> >> > +#include <stdlib.h>
> >> > +#include <stdint.h>
> >> > +
> >> > +typedef int v4si __attribute__ ((vector_size (16)));
> >> > +
> >> > +#define STP_TEST_ALIGNED(TYPE) \
> >> > +TYPE *stp_aligned_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> >> > +    arr[0] = x; \
> >> > +    arr[1] = x; \
> >> > +    return arr; \
> >> > +}
> >> > +
> >> > +#define STP_TEST_UNALIGNED(TYPE) \
> >> > +TYPE *stp_unaligned_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a[0] = x; \
> >> > +    a[1] = x; \
> >> > +    return a; \
> >> > +}
> >> > +
> >> > +#define STP_TEST_ADJUST_ALIGNED(TYPE) \
> >> > +TYPE *stp_aligned_adjust_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> >> > +    arr[100] = x; \
> >> > +    arr[101] = x; \
> >> > +    arr[102] = x; \
> >> > +    arr[103] = x; \
> >> > +    return arr; \
> >> > +}
> >> > +
> >> > +#define STP_TEST_ADJUST_UNALIGNED(TYPE) \
> >> > +TYPE *stp_unaligned_adjust_##TYPE(char* ptr, TYPE x){ \
> >> > +    TYPE *arr = (TYPE*) ((uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)); \
> >> > +    TYPE *a = arr+1; \
> >> > +    a[100] = x; \
> >> > +    a[101] = x; \
> >> > +    a[102] = x; \
> >> > +    a[103] = x; \
> >> > +    return a; \
> >> > +}
> >> > +
> >> > +STP_TEST_ALIGNED(int32_t);
> >> > +STP_TEST_ALIGNED(int64_t);
> >> > +STP_TEST_ALIGNED(v4si);
> >> > +STP_TEST_UNALIGNED(int32_t);
> >> > +STP_TEST_UNALIGNED(int64_t);
> >> > +STP_TEST_UNALIGNED(v4si);
> >> > +STP_TEST_ADJUST_ALIGNED(int32_t);
> >> > +STP_TEST_ADJUST_ALIGNED(int64_t);
> >> > +STP_TEST_ADJUST_UNALIGNED(int32_t);
> >> > +STP_TEST_ADJUST_UNALIGNED(int64_t);
> >> > +
> >> > +/* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> >> > +/* { dg-final { scan-assembler-times "stp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> >> > +/* { dg-final { scan-assembler-times "stp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> >> > +
> >>
>
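One note on the helper used throughout the tests (an explanatory aside,
not part of the patch): the mask (uintptr_t)ptr & ~(2 * 8 * _Alignof(TYPE) - 1)
rounds the incoming pointer down to a multiple of 16 * _Alignof(TYPE), which
comfortably covers the 2 * element_size alignment that the =aligned policy
requires.  A standalone sketch for one assumed type:

  #include <stdint.h>

  /* Hypothetical helper, not in the patch: rounds ptr down to a
     2 * 8 * _Alignof (int64_t) = 128-byte boundary, so any pair of
     consecutive int64_t accesses through the result is (at least)
     16-byte aligned.  */
  static int64_t *
  align_for_pair (char *ptr)
  {
    return (int64_t *) ((uintptr_t) ptr & ~(2 * 8 * _Alignof (int64_t) - 1));
  }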


-- 
Manos Anagnostakis | Compiler Engineer |
E: manos.anagnostakis@vrull.eu

VRULL GmbH | Beatrixgasse 32, 1030 Vienna |
W: www.vrull.eu | LinkedIn <https://www.linkedin.com/company/vrull/>


