* Allow the number of iterations to be smaller than VF
@ 2017-11-17 15:15 Richard Sandiford
2017-11-20 3:11 ` Jeff Law
0 siblings, 1 reply; 4+ messages in thread
From: Richard Sandiford @ 2017-11-17 15:15 UTC (permalink / raw)
To: gcc-patches
Fully-masked loops can be profitable even if the iteration
count is smaller than the vectorisation factor. In this case
we're effectively doing a complete unroll followed by SLP.
The documentation for min-vect-loop-bound says that the
default value is 0, but actually the default and minimum
were 1. We need it to be 0 for this case since the parameter
counts a whole number of vector iterations.
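To illustrate the kind of loop this enables (a hypothetical example, not taken from the patch): with a trip count of 3 and, say, 256-bit SVE vectors of int32_t (VF of 8), the whole loop fits in a single predicated vector iteration, with the five inactive lanes masked off.

```c
#include <stdint.h>

/* A loop whose trip count (3) is smaller than a typical SVE
   vectorization factor.  With full masking the vector body can still
   execute exactly once, predicating out the inactive lanes --
   effectively a complete unroll followed by SLP.  */
void
add_small (int32_t *restrict a, const int32_t *restrict b)
{
  for (int i = 0; i < 3; ++i)
    a[i] += b[i];
}
```

This is the same shape as the sve_miniloop tests below, which check that such short loops are vectorized with predicated ld1w/st1w rather than left scalar.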
Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
and powerpc64le-linux-gnu. OK to install?
Richard
2017-11-17 Richard Sandiford <richard.sandiford@linaro.org>
Alan Hayward <alan.hayward@arm.com>
David Sherwood <david.sherwood@arm.com>
gcc/
* doc/sourcebuild.texi (vect_fully_masked): Document.
* params.def (PARAM_MIN_VECT_LOOP_BOUND): Change minimum and
default value to 0.
* tree-vect-loop.c (vect_analyze_loop_costing): New function,
split out from...
(vect_analyze_loop_2): ...here. Don't check the vectorization
factor against the number of loop iterations if the loop is
fully-masked.
gcc/testsuite/
* lib/target-supports.exp (check_effective_target_vect_fully_masked):
New proc.
* gcc.dg/vect/slp-3.c: Expect all loops to be vectorized if
vect_fully_masked.
* gcc.target/aarch64/sve_loop_add_4.c: New test.
* gcc.target/aarch64/sve_loop_add_4_run.c: Likewise.
* gcc.target/aarch64/sve_loop_add_5.c: Likewise.
* gcc.target/aarch64/sve_loop_add_5_run.c: Likewise.
* gcc.target/aarch64/sve_miniloop_1.c: Likewise.
* gcc.target/aarch64/sve_miniloop_2.c: Likewise.
Index: gcc/doc/sourcebuild.texi
===================================================================
--- gcc/doc/sourcebuild.texi 2017-11-17 15:09:28.740330131 +0000
+++ gcc/doc/sourcebuild.texi 2017-11-17 15:09:28.967330125 +0000
@@ -1403,6 +1403,10 @@ Target supports hardware vectors of @cod
@item vect_long_long
Target supports hardware vectors of @code{long long}.
+@item vect_fully_masked
+Target supports fully-masked (also known as fully-predicated) loops,
+so that vector loops can handle partial as well as full vectors.
+
@item vect_masked_store
Target supports vector masked stores.
Index: gcc/params.def
===================================================================
--- gcc/params.def 2017-11-17 15:09:28.740330131 +0000
+++ gcc/params.def 2017-11-17 15:09:28.967330125 +0000
@@ -139,7 +139,7 @@ DEFPARAM (PARAM_MAX_VARIABLE_EXPANSIONS,
DEFPARAM (PARAM_MIN_VECT_LOOP_BOUND,
"min-vect-loop-bound",
"If -ftree-vectorize is used, the minimal loop bound of a loop to be considered for vectorization.",
- 1, 1, 0)
+ 0, 0, 0)
/* The maximum number of instructions to consider when looking for an
instruction to fill a delay slot. If more than this arbitrary
Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c 2017-11-17 15:09:28.740330131 +0000
+++ gcc/tree-vect-loop.c 2017-11-17 15:09:28.969330125 +0000
@@ -1893,6 +1893,101 @@ vect_analyze_loop_operations (loop_vec_i
return true;
}
+/* Analyze the cost of the loop described by LOOP_VINFO. Decide if it
+ is worthwhile to vectorize. Return 1 if definitely yes, 0 if
+ definitely no, or -1 if it's worth retrying. */
+
+static int
+vect_analyze_loop_costing (loop_vec_info loop_vinfo)
+{
+ struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+ unsigned int assumed_vf = vect_vf_for_cost (loop_vinfo);
+
+ /* Only fully-masked loops can have iteration counts less than the
+ vectorization factor. */
+ if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+ {
+ HOST_WIDE_INT max_niter;
+
+ if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+ max_niter = LOOP_VINFO_INT_NITERS (loop_vinfo);
+ else
+ max_niter = max_stmt_executions_int (loop);
+
+ if (max_niter != -1
+ && (unsigned HOST_WIDE_INT) max_niter < assumed_vf)
+ {
+ if (dump_enabled_p ())
+ dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+ "not vectorized: iteration count smaller than "
+ "vectorization factor.\n");
+ return 0;
+ }
+ }
+
+ int min_profitable_iters, min_profitable_estimate;
+ vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters,
+ &min_profitable_estimate);
+
+ if (min_profitable_iters < 0)
+ {
+ if (dump_enabled_p ())
+ dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+ "not vectorized: vectorization not profitable.\n");
+ if (dump_enabled_p ())
+ dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+ "not vectorized: vector version will never be "
+ "profitable.\n");
+ return -1;
+ }
+
+ int min_scalar_loop_bound = (PARAM_VALUE (PARAM_MIN_VECT_LOOP_BOUND)
+ * assumed_vf);
+
+ /* Use the cost model only if it is more conservative than user specified
+ threshold. */
+ unsigned int th = (unsigned) MAX (min_scalar_loop_bound,
+ min_profitable_iters);
+
+ LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) = th;
+
+ if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+ && LOOP_VINFO_INT_NITERS (loop_vinfo) < th)
+ {
+ if (dump_enabled_p ())
+ dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+ "not vectorized: vectorization not profitable.\n");
+ if (dump_enabled_p ())
+ dump_printf_loc (MSG_NOTE, vect_location,
+ "not vectorized: iteration count smaller than user "
+ "specified loop bound parameter or minimum profitable "
+ "iterations (whichever is more conservative).\n");
+ return 0;
+ }
+
+ HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop);
+ if (estimated_niter == -1)
+ estimated_niter = likely_max_stmt_executions_int (loop);
+ if (estimated_niter != -1
+ && ((unsigned HOST_WIDE_INT) estimated_niter
+ < MAX (th, (unsigned) min_profitable_estimate)))
+ {
+ if (dump_enabled_p ())
+ dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+ "not vectorized: estimated iteration count too "
+ "small.\n");
+ if (dump_enabled_p ())
+ dump_printf_loc (MSG_NOTE, vect_location,
+ "not vectorized: estimated iteration count smaller "
+ "than specified loop bound parameter or minimum "
+ "profitable iterations (whichever is more "
+ "conservative).\n");
+ return -1;
+ }
+
+ return 1;
+}
+
/* Function vect_analyze_loop_2.
@@ -1903,6 +1998,7 @@ vect_analyze_loop_operations (loop_vec_i
vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal)
{
bool ok;
+ int res;
unsigned int max_vf = MAX_VECTORIZATION_FACTOR;
poly_uint64 min_vf = 2;
unsigned int n_stmts = 0;
@@ -2060,9 +2156,7 @@ vect_analyze_loop_2 (loop_vec_info loop_
vect_compute_single_scalar_iteration_cost (loop_vinfo);
poly_uint64 saved_vectorization_factor = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
- HOST_WIDE_INT estimated_niter;
unsigned th;
- int min_scalar_loop_bound;
/* Check the SLP opportunities in the loop, analyze and build SLP trees. */
ok = vect_analyze_slp (loop_vinfo, n_stmts);
@@ -2092,7 +2186,6 @@ vect_analyze_loop_2 (loop_vec_info loop_
/* Now the vectorization factor is final. */
poly_uint64 vectorization_factor = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
gcc_assert (must_ne (vectorization_factor, 0U));
- unsigned int assumed_vf = vect_vf_for_cost (loop_vinfo);
if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) && dump_enabled_p ())
{
@@ -2105,17 +2198,6 @@ vect_analyze_loop_2 (loop_vec_info loop_
HOST_WIDE_INT max_niter
= likely_max_stmt_executions_int (LOOP_VINFO_LOOP (loop_vinfo));
- if ((LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
- && (LOOP_VINFO_INT_NITERS (loop_vinfo) < assumed_vf))
- || (max_niter != -1
- && (unsigned HOST_WIDE_INT) max_niter < assumed_vf))
- {
- if (dump_enabled_p ())
- dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
- "not vectorized: iteration count smaller than "
- "vectorization factor.\n");
- return false;
- }
/* Analyze the alignment of the data-refs in the loop.
Fail if a data reference is found that cannot be vectorized. */
@@ -2229,65 +2311,16 @@ vect_analyze_loop_2 (loop_vec_info loop_
}
}
- /* Analyze cost. Decide if worth while to vectorize. */
- int min_profitable_estimate, min_profitable_iters;
- vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters,
- &min_profitable_estimate);
-
- if (min_profitable_iters < 0)
- {
- if (dump_enabled_p ())
- dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
- "not vectorized: vectorization not profitable.\n");
- if (dump_enabled_p ())
- dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
- "not vectorized: vector version will never be "
- "profitable.\n");
- goto again;
- }
-
- min_scalar_loop_bound = (PARAM_VALUE (PARAM_MIN_VECT_LOOP_BOUND)
- * assumed_vf);
-
- /* Use the cost model only if it is more conservative than user specified
- threshold. */
- th = (unsigned) MAX (min_scalar_loop_bound, min_profitable_iters);
-
- LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) = th;
-
- if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
- && LOOP_VINFO_INT_NITERS (loop_vinfo) < th)
- {
- if (dump_enabled_p ())
- dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
- "not vectorized: vectorization not profitable.\n");
- if (dump_enabled_p ())
- dump_printf_loc (MSG_NOTE, vect_location,
- "not vectorized: iteration count smaller than user "
- "specified loop bound parameter or minimum profitable "
- "iterations (whichever is more conservative).\n");
- goto again;
- }
-
- estimated_niter
- = estimated_stmt_executions_int (LOOP_VINFO_LOOP (loop_vinfo));
- if (estimated_niter == -1)
- estimated_niter = max_niter;
- if (estimated_niter != -1
- && ((unsigned HOST_WIDE_INT) estimated_niter
- < MAX (th, (unsigned) min_profitable_estimate)))
+ /* Check the costings of the loop make vectorizing worthwhile. */
+ res = vect_analyze_loop_costing (loop_vinfo);
+ if (res < 0)
+ goto again;
+ if (!res)
{
if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
- "not vectorized: estimated iteration count too "
- "small.\n");
- if (dump_enabled_p ())
- dump_printf_loc (MSG_NOTE, vect_location,
- "not vectorized: estimated iteration count smaller "
- "than specified loop bound parameter or minimum "
- "profitable iterations (whichever is more "
- "conservative).\n");
- goto again;
+ "Loop costings not worthwhile.\n");
+ return false;
}
/* Decide whether we need to create an epilogue loop to handle
@@ -3869,7 +3902,6 @@ vect_estimate_min_profitable_iters (loop
* assumed_vf
- vec_inside_cost * peel_iters_prologue
- vec_inside_cost * peel_iters_epilogue);
-
if (min_profitable_iters <= 0)
min_profitable_iters = 0;
else
Index: gcc/testsuite/lib/target-supports.exp
===================================================================
--- gcc/testsuite/lib/target-supports.exp 2017-11-17 15:09:28.740330131 +0000
+++ gcc/testsuite/lib/target-supports.exp 2017-11-17 15:09:28.968330125 +0000
@@ -6434,6 +6434,12 @@ proc check_effective_target_vect_natural
return $et_vect_natural_alignment
}
+# Return true if fully-masked loops are supported.
+
+proc check_effective_target_vect_fully_masked { } {
+ return [check_effective_target_aarch64_sve]
+}
+
# Return 1 if the target doesn't prefer any alignment beyond element
# alignment during vectorization.
Index: gcc/testsuite/gcc.dg/vect/slp-3.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-3.c 2017-11-17 15:09:28.740330131 +0000
+++ gcc/testsuite/gcc.dg/vect/slp-3.c 2017-11-17 15:09:28.967330125 +0000
@@ -141,6 +141,8 @@ int main (void)
return 0;
}
-/* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" { target { ! vect_fully_masked } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 4 loops" 1 "vect" { target vect_fully_masked } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { target { ! vect_fully_masked } } } }*/
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect" { target vect_fully_masked } } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_loop_add_4.c
===================================================================
--- /dev/null 2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_loop_add_4.c 2017-11-17 15:09:28.967330125 +0000
@@ -0,0 +1,96 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */
+
+#include <stdint.h>
+
+#define LOOP(TYPE, NAME, STEP) \
+ __attribute__((noinline, noclone)) \
+ void \
+ test_##TYPE##_##NAME (TYPE *dst, TYPE base, int count) \
+ { \
+ for (int i = 0; i < count; ++i, base += STEP) \
+ dst[i] += base; \
+ }
+
+#define TEST_TYPE(T, TYPE) \
+ T (TYPE, m17, -17) \
+ T (TYPE, m16, -16) \
+ T (TYPE, m15, -15) \
+ T (TYPE, m1, -1) \
+ T (TYPE, 1, 1) \
+ T (TYPE, 15, 15) \
+ T (TYPE, 16, 16) \
+ T (TYPE, 17, 17)
+
+#define TEST_ALL(T) \
+ TEST_TYPE (T, int8_t) \
+ TEST_TYPE (T, int16_t) \
+ TEST_TYPE (T, int32_t) \
+ TEST_TYPE (T, int64_t)
+
+TEST_ALL (LOOP)
+
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.b, w[0-9]+, #-16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.b, w[0-9]+, #-15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.b, w[0-9]+, #1\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.b, w[0-9]+, #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.b, w[0-9]+, w[0-9]+\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1b\tz[0-9]+\.b, p[0-7]+/z, \[x[0-9]+, x[0-9]+\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tst1b\tz[0-9]+\.b, p[0-7]+, \[x[0-9]+, x[0-9]+\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tincb\tx[0-9]+\n} 8 } } */
+
+/* { dg-final { scan-assembler-not {\tdecb\tz[0-9]+\.b} } } */
+/* We don't need to increment the vector IV for steps -16 and 16, since the
+ increment is always a multiple of 256. */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 14 } } */
+
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.h, w[0-9]+, #-16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.h, w[0-9]+, #-15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.h, w[0-9]+, #1\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.h, w[0-9]+, #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.h, w[0-9]+, w[0-9]+\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1h\tz[0-9]+\.h, p[0-7]+/z, \[x[0-9]+, x[0-9]+, lsl 1\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tst1h\tz[0-9]+\.h, p[0-7]+, \[x[0-9]+, x[0-9]+, lsl 1\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tincb\tx[0-9]+\n} 8 } } */
+
+/* { dg-final { scan-assembler-times {\tdech\tz[0-9]+\.h, all, mul #16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tdech\tz[0-9]+\.h, all, mul #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tdech\tz[0-9]+\.h\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tinch\tz[0-9]+\.h\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tinch\tz[0-9]+\.h, all, mul #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tinch\tz[0-9]+\.h, all, mul #16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 10 } } */
+
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.s, w[0-9]+, #-16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.s, w[0-9]+, #-15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.s, w[0-9]+, #1\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.s, w[0-9]+, #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.s, w[0-9]+, w[0-9]+\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]+/z, \[x[0-9]+, x[0-9]+, lsl 2\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7]+, \[x[0-9]+, x[0-9]+, lsl 2\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tincw\tx[0-9]+\n} 8 } } */
+
+/* { dg-final { scan-assembler-times {\tdecw\tz[0-9]+\.s, all, mul #16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tdecw\tz[0-9]+\.s, all, mul #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tdecw\tz[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tincw\tz[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tincw\tz[0-9]+\.s, all, mul #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tincw\tz[0-9]+\.s, all, mul #16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 10 } } */
+
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.d, x[0-9]+, #-16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.d, x[0-9]+, #-15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.d, x[0-9]+, #1\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.d, x[0-9]+, #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.d, x[0-9]+, x[0-9]+\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]+/z, \[x[0-9]+, x[0-9]+, lsl 3\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7]+, \[x[0-9]+, x[0-9]+, lsl 3\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tincd\tx[0-9]+\n} 8 } } */
+
+/* { dg-final { scan-assembler-times {\tdecd\tz[0-9]+\.d, all, mul #16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tdecd\tz[0-9]+\.d, all, mul #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tdecd\tz[0-9]+\.d\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tincd\tz[0-9]+\.d\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tincd\tz[0-9]+\.d, all, mul #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tincd\tz[0-9]+\.d, all, mul #16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 10 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_loop_add_4_run.c
===================================================================
--- /dev/null 2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_loop_add_4_run.c 2017-11-17 15:09:28.967330125 +0000
@@ -0,0 +1,30 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include "sve_loop_add_4.c"
+
+#define N 131
+#define BASE 41
+
+#define TEST_LOOP(TYPE, NAME, STEP) \
+ { \
+ TYPE a[N]; \
+ for (int i = 0; i < N; ++i) \
+ { \
+ a[i] = i * i + i % 5; \
+ asm volatile ("" ::: "memory"); \
+ } \
+ test_##TYPE##_##NAME (a, BASE, N); \
+ for (int i = 0; i < N; ++i) \
+ { \
+ TYPE expected = i * i + i % 5 + BASE + i * STEP; \
+ if (a[i] != expected) \
+ __builtin_abort (); \
+ } \
+ }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+ TEST_ALL (TEST_LOOP)
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_loop_add_5.c
===================================================================
--- /dev/null 2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_loop_add_5.c 2017-11-17 15:09:28.967330125 +0000
@@ -0,0 +1,54 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=256" } */
+
+#include "sve_loop_add_4.c"
+
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.b, w[0-9]+, #-16\n} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.b, w[0-9]+, #-15\n} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.b, w[0-9]+, #1\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.b, w[0-9]+, #15\n} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.b, w[0-9]+, w[0-9]+\n} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tld1b\tz[0-9]+\.b, p[0-7]+/z, \[x[0-9]+, x[0-9]+\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tst1b\tz[0-9]+\.b, p[0-7]+, \[x[0-9]+, x[0-9]+\]} 8 } } */
+
+/* The induction vector is invariant for steps of -16 and 16. */
+/* { dg-final { scan-assembler-not {\tsub\tz[0-9]+\.b, z[0-9]+\.b, #} } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.b, z[0-9]+\.b, #} 6 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 8 } } */
+
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.h, w[0-9]+, #-16\n} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.h, w[0-9]+, #-15\n} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.h, w[0-9]+, #1\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.h, w[0-9]+, #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.h, w[0-9]+, w[0-9]+\n} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tld1h\tz[0-9]+\.h, p[0-7]+/z, \[x[0-9]+, x[0-9]+, lsl 1\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tst1h\tz[0-9]+\.h, p[0-7]+, \[x[0-9]+, x[0-9]+, lsl 1\]} 8 } } */
+
+/* The (-)17 * 16 is out of range. */
+/* { dg-final { scan-assembler-times {\tsub\tz[0-9]+\.h, z[0-9]+\.h, #} 2 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, z[0-9]+\.h, #} 4 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 10 } } */
+
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.s, w[0-9]+, #-16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.s, w[0-9]+, #-15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.s, w[0-9]+, #1\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.s, w[0-9]+, #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.s, w[0-9]+, w[0-9]+\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]+/z, \[x[0-9]+, x[0-9]+, lsl 2\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7]+, \[x[0-9]+, x[0-9]+, lsl 2\]} 8 } } */
+
+/* { dg-final { scan-assembler-times {\tsub\tz[0-9]+\.s, z[0-9]+\.s, #} 4 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.s, z[0-9]+\.s, #} 4 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 8 } } */
+
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.d, x[0-9]+, #-16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.d, x[0-9]+, #-15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.d, x[0-9]+, #1\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.d, x[0-9]+, #15\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tindex\tz[0-9]+\.d, x[0-9]+, x[0-9]+\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]+/z, \[x[0-9]+, x[0-9]+, lsl 3\]} 8 } } */
+/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7]+, \[x[0-9]+, x[0-9]+, lsl 3\]} 8 } } */
+
+/* { dg-final { scan-assembler-times {\tsub\tz[0-9]+\.d, z[0-9]+\.d, #} 4 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.d, z[0-9]+\.d, #} 4 } } */
+/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 8 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_loop_add_5_run.c
===================================================================
--- /dev/null 2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_loop_add_5_run.c 2017-11-17 15:09:28.967330125 +0000
@@ -0,0 +1,5 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=256" { target aarch64_sve256_hw } } */
+
+#include "sve_loop_add_4_run.c"
Index: gcc/testsuite/gcc.target/aarch64/sve_miniloop_1.c
===================================================================
--- /dev/null 2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_miniloop_1.c 2017-11-17 15:09:28.967330125 +0000
@@ -0,0 +1,23 @@
+/* { dg-do assemble } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve --save-temps" } */
+
+void loop (int * __restrict__ a, int * __restrict__ b, int * __restrict__ c,
+ int * __restrict__ d, int * __restrict__ e, int * __restrict__ f,
+ int * __restrict__ g, int * __restrict__ h)
+{
+ int i = 0;
+ for (i = 0; i < 3; i++)
+ {
+ a[i] += i;
+ b[i] += i;
+ c[i] += i;
+ d[i] += i;
+ e[i] += i;
+ f[i] += a[i] + 7;
+ g[i] += b[i] - 3;
+ h[i] += c[i] + 3;
+ }
+}
+
+/* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, } 8 } } */
+/* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, } 8 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_miniloop_2.c
===================================================================
--- /dev/null 2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_miniloop_2.c 2017-11-17 15:09:28.967330125 +0000
@@ -0,0 +1,7 @@
+/* { dg-do assemble } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve --save-temps -msve-vector-bits=256" } */
+
+#include "sve_miniloop_1.c"
+
+/* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, } 8 } } */
+/* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, } 8 } } */
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: Allow the number of iterations to be smaller than VF
2017-11-17 15:15 Allow the number of iterations to be smaller than VF Richard Sandiford
@ 2017-11-20 3:11 ` Jeff Law
2018-01-07 20:52 ` James Greenhalgh
0 siblings, 1 reply; 4+ messages in thread
From: Jeff Law @ 2017-11-20 3:11 UTC (permalink / raw)
To: gcc-patches, richard.sandiford
On 11/17/2017 08:11 AM, Richard Sandiford wrote:
> Fully-masked loops can be profitable even if the iteration
> count is smaller than the vectorisation factor. In this case
> we're effectively doing a complete unroll followed by SLP.
>
> The documentation for min-vect-loop-bound says that the
> default value is 0, but actually the default and minimum
> were 1. We need it to be 0 for this case since the parameter
> counts a whole number of vector iterations.
>
> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
> and powerpc64le-linux-gnu. OK to install?
>
> Richard
>
>
> 2017-11-17 Richard Sandiford <richard.sandiford@linaro.org>
> Alan Hayward <alan.hayward@arm.com>
> David Sherwood <david.sherwood@arm.com>
>
> gcc/
> * doc/sourcebuild.texi (vect_fully_masked): Document.
> * params.def (PARAM_MIN_VECT_LOOP_BOUND): Change minimum and
> default value to 0.
> * tree-vect-loop.c (vect_analyze_loop_costing): New function,
> split out from...
> (vect_analyze_loop_2): ...here. Don't check the vectorization
> factor against the number of loop iterations if the loop is
> fully-masked.
>
> gcc/testsuite/
> * lib/target-supports.exp (check_effective_target_vect_fully_masked):
> New proc.
> * gcc.dg/vect/slp-3.c: Expect all loops to be vectorized if
> vect_fully_masked.
> * gcc.target/aarch64/sve_loop_add_4.c: New test.
> * gcc.target/aarch64/sve_loop_add_4_run.c: Likewise.
> * gcc.target/aarch64/sve_loop_add_5.c: Likewise.
> * gcc.target/aarch64/sve_loop_add_5_run.c: Likewise.
> * gcc.target/aarch64/sve_miniloop_1.c: Likewise.
> * gcc.target/aarch64/sve_miniloop_2.c: Likewise.
OK.
Jeff
* Re: Allow the number of iterations to be smaller than VF
2017-11-20 3:11 ` Jeff Law
@ 2018-01-07 20:52 ` James Greenhalgh
2018-01-15 10:20 ` Christophe Lyon
0 siblings, 1 reply; 4+ messages in thread
From: James Greenhalgh @ 2018-01-07 20:52 UTC (permalink / raw)
To: Jeff Law; +Cc: gcc-patches, richard.sandiford, nd
On Mon, Nov 20, 2017 at 12:12:38AM +0000, Jeff Law wrote:
> On 11/17/2017 08:11 AM, Richard Sandiford wrote:
> > Fully-masked loops can be profitable even if the iteration
> > count is smaller than the vectorisation factor. In this case
> > we're effectively doing a complete unroll followed by SLP.
> >
> > The documentation for min-vect-loop-bound says that the
> > default value is 0, but actually the default and minimum
> > were 1. We need it to be 0 for this case since the parameter
> > counts a whole number of vector iterations.
> >
> > Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
> > and powerpc64le-linux-gnu. OK to install?
> >
> > Richard
> >
> >
> > 2017-11-17 Richard Sandiford <richard.sandiford@linaro.org>
> > Alan Hayward <alan.hayward@arm.com>
> > David Sherwood <david.sherwood@arm.com>
> >
> > gcc/
> > * doc/sourcebuild.texi (vect_fully_masked): Document.
> > * params.def (PARAM_MIN_VECT_LOOP_BOUND): Change minimum and
> > default value to 0.
> > * tree-vect-loop.c (vect_analyze_loop_costing): New function,
> > split out from...
> > (vect_analyze_loop_2): ...here. Don't check the vectorization
> > factor against the number of loop iterations if the loop is
> > fully-masked.
> >
> > gcc/testsuite/
> > * lib/target-supports.exp (check_effective_target_vect_fully_masked):
> > New proc.
> > * gcc.dg/vect/slp-3.c: Expect all loops to be vectorized if
> > vect_fully_masked.
> > * gcc.target/aarch64/sve_loop_add_4.c: New test.
> > * gcc.target/aarch64/sve_loop_add_4_run.c: Likewise.
> > * gcc.target/aarch64/sve_loop_add_5.c: Likewise.
> > * gcc.target/aarch64/sve_loop_add_5_run.c: Likewise.
> > * gcc.target/aarch64/sve_miniloop_1.c: Likewise.
> > * gcc.target/aarch64/sve_miniloop_2.c: Likewise.
> OK.
> Jeff
The AArch64 tests are OK.
James
* Re: Allow the number of iterations to be smaller than VF
2018-01-07 20:52 ` James Greenhalgh
@ 2018-01-15 10:20 ` Christophe Lyon
0 siblings, 0 replies; 4+ messages in thread
From: Christophe Lyon @ 2018-01-15 10:20 UTC (permalink / raw)
To: James Greenhalgh; +Cc: Jeff Law, gcc-patches, richard.sandiford, nd
On 7 January 2018 at 21:51, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> On Mon, Nov 20, 2017 at 12:12:38AM +0000, Jeff Law wrote:
>> On 11/17/2017 08:11 AM, Richard Sandiford wrote:
>> > Fully-masked loops can be profitable even if the iteration
>> > count is smaller than the vectorisation factor. In this case
>> > we're effectively doing a complete unroll followed by SLP.
>> >
>> > The documentation for min-vect-loop-bound says that the
>> > default value is 0, but actually the default and minimum
>> > were 1. We need it to be 0 for this case since the parameter
>> > counts a whole number of vector iterations.
>> >
>> > Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>> > and powerpc64le-linux-gnu. OK to install?
>> >
>> > Richard
>> >
>> >
>> > 2017-11-17 Richard Sandiford <richard.sandiford@linaro.org>
>> > Alan Hayward <alan.hayward@arm.com>
>> > David Sherwood <david.sherwood@arm.com>
>> >
>> > gcc/
>> > * doc/sourcebuild.texi (vect_fully_masked): Document.
>> > * params.def (PARAM_MIN_VECT_LOOP_BOUND): Change minimum and
>> > default value to 0.
>> > * tree-vect-loop.c (vect_analyze_loop_costing): New function,
>> > split out from...
>> > (vect_analyze_loop_2): ...here. Don't check the vectorization
>> > factor against the number of loop iterations if the loop is
>> > fully-masked.
>> >
>> > gcc/testsuite/
>> > * lib/target-supports.exp (check_effective_target_vect_fully_masked):
>> > New proc.
>> > * gcc.dg/vect/slp-3.c: Expect all loops to be vectorized if
>> > vect_fully_masked.
>> > * gcc.target/aarch64/sve_loop_add_4.c: New test.
>> > * gcc.target/aarch64/sve_loop_add_4_run.c: Likewise.
>> > * gcc.target/aarch64/sve_loop_add_5.c: Likewise.
>> > * gcc.target/aarch64/sve_loop_add_5_run.c: Likewise.
>> > * gcc.target/aarch64/sve_miniloop_1.c: Likewise.
>> > * gcc.target/aarch64/sve_miniloop_2.c: Likewise.
>> OK.
>> Jeff
>
> The AArch64 tests are OK.
>
I've reported the failures on aarch64-none-elf -mabi=ilp32 in:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83849
Christophe
> James
>