public inbox for gcc-patches@gcc.gnu.org
* [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
@ 2024-06-16  7:31 Feng Xue OS
  2024-06-20  5:59 ` Feng Xue OS
  2024-06-20 12:26 ` Richard Biener
  0 siblings, 2 replies; 9+ messages in thread
From: Feng Xue OS @ 2024-06-16  7:31 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

For a lane-reducing operation (dot-prod/widen-sum/sad) in a loop reduction,
the current vectorizer can only handle the pattern if the reduction chain
contains no other operation, whether normal or lane-reducing.

To allow multiple arbitrary lane-reducing operations, we need to support
vectorization of a loop reduction chain with mixed input vectypes. Since the
number of lanes in a vectype varies with the operation, the effective ncopies
of the vectorized statements may also differ between operations, which causes
a mismatch among the vectorized def-use cycles. A simple way to handle this is
to align all operations with the one that has the most ncopies; the gap can be
filled by generating extra trivial pass-through copies. For example:

   int sum = 0;
   for (i)
     {
       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
       sum += w[i];               // widen-sum <vector(16) char>
       sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
       sum += n[i];               // normal <vector(4) int>
     }

The vector size is 128-bit, so the vectorization factor is 16, determined by
the operation with the most input lanes (vector(16) char). Reduction
statements would be transformed as:

   vector<4> int sum_v0 = { 0, 0, 0, 0 };
   vector<4> int sum_v1 = { 0, 0, 0, 0 };
   vector<4> int sum_v2 = { 0, 0, 0, 0 };
   vector<4> int sum_v3 = { 0, 0, 0, 0 };

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 += n_v0[i: 0  ~ 3 ];
       sum_v1 += n_v1[i: 4  ~ 7 ];
       sum_v2 += n_v2[i: 8  ~ 11];
       sum_v3 += n_v3[i: 12 ~ 15];
     }
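
After the loop, the four partial accumulators would be combined by the usual
reduction epilogue. A sketch of that final step, in the same illustrative
notation (REDUC_PLUS here stands for the assumed lane-wise reduction):

   vector<4> int sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;
   int sum = REDUC_PLUS (sum_v);  // sum the 4 lanes into a scalar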

Thanks,
Feng

---
gcc/
	PR tree-optimization/114440
	* tree-vectorizer.h (vectorizable_lane_reducing): New function
	declaration.
	* tree-vect-stmts.cc (vect_analyze_stmt): Call new function
	vectorizable_lane_reducing to analyze lane-reducing operation.
	* tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation
	code related to emulated_mixed_dot_prod.
	(vect_reduction_update_partial_vector_usage): Compute ncopies in the
	original way for a single-lane slp node.
	(vectorizable_lane_reducing): New function.
	(vectorizable_reduction): Allow multiple lane-reducing operations in
	loop reduction. Move some original lane-reducing related code to
	vectorizable_lane_reducing.
	(vect_transform_reduction): Extend transformation to support reduction
	statements with mixed input vectypes.

gcc/testsuite/
	PR tree-optimization/114440
	* gcc.dg/vect/vect-reduc-chain-1.c: New test.
	* gcc.dg/vect/vect-reduc-chain-2.c: New test.
	* gcc.dg/vect/vect-reduc-chain-3.c: New test.
	* gcc.dg/vect/vect-reduc-chain-dot-slp-1.c: New test.
	* gcc.dg/vect/vect-reduc-chain-dot-slp-2.c: New test.
	* gcc.dg/vect/vect-reduc-chain-dot-slp-3.c: New test.
	* gcc.dg/vect/vect-reduc-chain-dot-slp-4.c: New test.
	* gcc.dg/vect/vect-reduc-dot-slp-1.c: New test.
---
 .../gcc.dg/vect/vect-reduc-chain-1.c          |  62 ++++
 .../gcc.dg/vect/vect-reduc-chain-2.c          |  77 +++++
 .../gcc.dg/vect/vect-reduc-chain-3.c          |  66 ++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c  |  95 +++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c  |  67 ++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c  |  79 +++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c  |  63 ++++
 .../gcc.dg/vect/vect-reduc-dot-slp-1.c        |  35 ++
 gcc/tree-vect-loop.cc                         | 324 ++++++++++++++----
 gcc/tree-vect-stmts.cc                        |   2 +
 gcc/tree-vectorizer.h                         |   2 +
 11 files changed, 802 insertions(+), 70 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c

diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
new file mode 100644
index 00000000000..04bfc419dbd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
@@ -0,0 +1,62 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_2 char *restrict c,
+   SIGNEDNESS_2 char *restrict d,
+   SIGNEDNESS_1 int *restrict e)
+{
+  for (int i = 0; i < N; ++i)
+    {
+      res += a[i] * b[i];
+      res += c[i] * d[i];
+      res += e[i];
+    }
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[N], b[N];
+  SIGNEDNESS_2 char c[N], d[N];
+  SIGNEDNESS_1 int e[N];
+  int expected = 0x12345;
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      c[i] = BASE + i * 2;
+      d[i] = BASE + OFFSET + i * 3;
+      e[i] = i;
+      asm volatile ("" ::: "memory");
+      expected += a[i] * b[i];
+      expected += c[i] * d[i];
+      expected += e[i];
+    }
+  if (f (0x12345, a, b, c, d, e) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 2 "vect" { target vect_sdot_qi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
new file mode 100644
index 00000000000..6c803b80120
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
@@ -0,0 +1,77 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 unsigned
+#define SIGNEDNESS_3 signed
+#define SIGNEDNESS_4 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+fn (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_3 char *restrict c,
+   SIGNEDNESS_3 char *restrict d,
+   SIGNEDNESS_4 short *restrict e,
+   SIGNEDNESS_4 short *restrict f,
+   SIGNEDNESS_1 int *restrict g)
+{
+  for (int i = 0; i < N; ++i)
+    {
+      res += a[i] * b[i];
+      res += i + 1;
+      res += c[i] * d[i];
+      res += e[i] * f[i];
+      res += g[i];
+    }
+  return res;
+}
+
+#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -126 : 4)
+#define BASE4 ((SIGNEDNESS_4 int) -1 < 0 ? -1026 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[N], b[N];
+  SIGNEDNESS_3 char c[N], d[N];
+  SIGNEDNESS_4 short e[N], f[N];
+  SIGNEDNESS_1 int g[N];
+  int expected = 0x12345;
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE2 + i * 5;
+      b[i] = BASE2 + OFFSET + i * 4;
+      c[i] = BASE3 + i * 2;
+      d[i] = BASE3 + OFFSET + i * 3;
+      e[i] = BASE4 + i * 6;
+      f[i] = BASE4 + OFFSET + i * 5;
+      g[i] = i;
+      asm volatile ("" ::: "memory");
+      expected += a[i] * b[i];
+      expected += i + 1;
+      expected += c[i] * d[i];
+      expected += e[i] * f[i];
+      expected += g[i];
+    }
+  if (fn (0x12345, a, b, c, d, e, f, g) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_sdot_qi } } } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_udot_qi } } } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_sdot_hi } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
new file mode 100644
index 00000000000..a41e4b176c4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
@@ -0,0 +1,66 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 unsigned
+#define SIGNEDNESS_3 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_3 short *restrict c,
+   SIGNEDNESS_3 short *restrict d,
+   SIGNEDNESS_1 int *restrict e)
+{
+  for (int i = 0; i < N; ++i)
+    {
+      short diff = a[i] - b[i];
+      SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff;
+      res += abs;
+      res += c[i] * d[i];
+      res += e[i];
+    }
+  return res;
+}
+
+#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -1236 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[N], b[N];
+  SIGNEDNESS_3 short c[N], d[N];
+  SIGNEDNESS_1 int e[N];
+  int expected = 0x12345;
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE2 + i * 5;
+      b[i] = BASE2 - i * 4;
+      c[i] = BASE3 + i * 2;
+      d[i] = BASE3 + OFFSET + i * 3;
+      e[i] = i;
+      asm volatile ("" ::: "memory");
+      short diff = a[i] - b[i];
+      SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff;
+      expected += abs;
+      expected += c[i] * d[i];
+      expected += e[i];
+    }
+  if (f (0x12345, a, b, c, d, e) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = SAD_EXPR" "vect" { target vect_udot_qi } } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target vect_sdot_hi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
new file mode 100644
index 00000000000..c2831fbcc8e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
@@ -0,0 +1,95 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *a,
+   SIGNEDNESS_2 char *b,
+   int step, int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[0] * b[0];
+      res += a[1] * b[1];
+      res += a[2] * b[2];
+      res += a[3] * b[3];
+      res += a[4] * b[4];
+      res += a[5] * b[5];
+      res += a[6] * b[6];
+      res += a[7] * b[7];
+      res += a[8] * b[8];
+      res += a[9] * b[9];
+      res += a[10] * b[10];
+      res += a[11] * b[11];
+      res += a[12] * b[12];
+      res += a[13] * b[13];
+      res += a[14] * b[14];
+      res += a[15] * b[15];
+
+      a += step;
+      b += step;
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[100], b[100];
+  int expected = 0x12345;
+  int step = 16;
+  int n = 2;
+  int t = 0;
+
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      asm volatile ("" ::: "memory");
+    }
+
+  for (int i = 0; i < n; i++)
+    {
+      asm volatile ("" ::: "memory");
+      expected += a[t + 0] * b[t + 0];
+      expected += a[t + 1] * b[t + 1];
+      expected += a[t + 2] * b[t + 2];
+      expected += a[t + 3] * b[t + 3];
+      expected += a[t + 4] * b[t + 4];
+      expected += a[t + 5] * b[t + 5];
+      expected += a[t + 6] * b[t + 6];
+      expected += a[t + 7] * b[t + 7];
+      expected += a[t + 8] * b[t + 8];
+      expected += a[t + 9] * b[t + 9];
+      expected += a[t + 10] * b[t + 10];
+      expected += a[t + 11] * b[t + 11];
+      expected += a[t + 12] * b[t + 12];
+      expected += a[t + 13] * b[t + 13];
+      expected += a[t + 14] * b[t + 14];
+      expected += a[t + 15] * b[t + 15];
+      t += step;
+    }
+
+  if (f (0x12345, a, b, step, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 16 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
new file mode 100644
index 00000000000..4114264a364
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
@@ -0,0 +1,67 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *a,
+   SIGNEDNESS_2 char *b,
+   int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[5 * i + 0] * b[5 * i + 0];
+      res += a[5 * i + 1] * b[5 * i + 1];
+      res += a[5 * i + 2] * b[5 * i + 2];
+      res += a[5 * i + 3] * b[5 * i + 3];
+      res += a[5 * i + 4] * b[5 * i + 4];
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[100], b[100];
+  int expected = 0x12345;
+  int n = 18;
+
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      asm volatile ("" ::: "memory");
+    }
+
+  for (int i = 0; i < n; i++)
+    {
+      asm volatile ("" ::: "memory");
+      expected += a[5 * i + 0] * b[5 * i + 0];
+      expected += a[5 * i + 1] * b[5 * i + 1];
+      expected += a[5 * i + 2] * b[5 * i + 2];
+      expected += a[5 * i + 3] * b[5 * i + 3];
+      expected += a[5 * i + 4] * b[5 * i + 4];
+    }
+
+  if (f (0x12345, a, b, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 5 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
new file mode 100644
index 00000000000..2cdecc36d16
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
@@ -0,0 +1,79 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 short *a,
+   SIGNEDNESS_2 short *b,
+   int step, int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[0] * b[0];
+      res += a[1] * b[1];
+      res += a[2] * b[2];
+      res += a[3] * b[3];
+      res += a[4] * b[4];
+      res += a[5] * b[5];
+      res += a[6] * b[6];
+      res += a[7] * b[7];
+
+      a += step;
+      b += step;
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 short a[100], b[100];
+  int expected = 0x12345;
+  int step = 8;
+  int n = 2;
+  int t = 0;
+
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      asm volatile ("" ::: "memory");
+    }
+
+  for (int i = 0; i < n; i++)
+    {
+      asm volatile ("" ::: "memory");
+      expected += a[t + 0] * b[t + 0];
+      expected += a[t + 1] * b[t + 1];
+      expected += a[t + 2] * b[t + 2];
+      expected += a[t + 3] * b[t + 3];
+      expected += a[t + 4] * b[t + 4];
+      expected += a[t + 5] * b[t + 5];
+      expected += a[t + 6] * b[t + 6];
+      expected += a[t + 7] * b[t + 7];
+      t += step;
+    }
+
+  if (f (0x12345, a, b, step, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 8 "vect"  { target vect_sdot_hi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
new file mode 100644
index 00000000000..32c0f30c77b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
@@ -0,0 +1,63 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 short *a,
+   SIGNEDNESS_2 short *b,
+   int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[3 * i + 0] * b[3 * i + 0];
+      res += a[3 * i + 1] * b[3 * i + 1];
+      res += a[3 * i + 2] * b[3 * i + 2];
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 short a[100], b[100];
+  int expected = 0x12345;
+  int n = 18;
+
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      asm volatile ("" ::: "memory");
+    }
+
+  for (int i = 0; i < n; i++)
+    {
+      asm volatile ("" ::: "memory");
+      expected += a[3 * i + 0] * b[3 * i + 0];
+      expected += a[3 * i + 1] * b[3 * i + 1];
+      expected += a[3 * i + 2] * b[3 * i + 2];
+    }
+
+  if (f (0x12345, a, b, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 3 "vect"  { target vect_sdot_hi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
new file mode 100644
index 00000000000..e17d6291f75
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
@@ -0,0 +1,35 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-do compile } */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res0,
+   SIGNEDNESS_1 int res1,
+   SIGNEDNESS_1 int res2,
+   SIGNEDNESS_1 int res3,
+   SIGNEDNESS_2 short *a,
+   SIGNEDNESS_2 short *b)
+{
+  for (int i = 0; i < 64; i += 4)
+    {
+      res0 += a[i + 0] * b[i + 0];
+      res1 += a[i + 1] * b[i + 1];
+      res2 += a[i + 2] * b[i + 2];
+      res3 += a[i + 3] * b[i + 3];
+    }
+
+  return res0 ^ res1 ^ res2 ^ res3;
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump-not "vectorizing stmts using SLP" "vect" } } */
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index e0561feddce..6d91665a341 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -5324,8 +5324,6 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
   if (!gimple_extract_op (orig_stmt_info->stmt, &op))
     gcc_unreachable ();
 
-  bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
-
   if (reduction_type == EXTRACT_LAST_REDUCTION)
     /* No extra instructions are needed in the prologue.  The loop body
        operations are costed in vectorizable_condition.  */
@@ -5360,12 +5358,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
 	   initial result of the data reduction, initial value of the index
 	   reduction.  */
 	prologue_stmts = 4;
-      else if (emulated_mixed_dot_prod)
-	/* We need the initial reduction value and two invariants:
-	   one that contains the minimum signed value and one that
-	   contains half of its negative.  */
-	prologue_stmts = 3;
       else
+	/* We need the initial reduction value.  */
 	prologue_stmts = 1;
       prologue_cost += record_stmt_cost (cost_vec, prologue_stmts,
 					 scalar_to_vec, stmt_info, 0,
@@ -7466,7 +7460,7 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
       vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
       unsigned nvectors;
 
-      if (slp_node)
+      if (slp_node && SLP_TREE_LANES (slp_node) > 1)
 	nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
       else
 	nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
@@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
     }
 }
 
+/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
+   the context of LOOP_VINFO; the vectorization cost will be recorded in
+   COST_VEC.  There are currently three such kinds of operations:
+   dot-prod/widen-sum/sad (sum-of-absolute-differences).
+
+   The loop reduction path that a lane-reducing operation lies in may
+   contain normal operations, or other lane-reducing operations of
+   different input type size, for example:
+
+     int sum = 0;
+     for (i)
+       {
+         ...
+         sum += d0[i] * d1[i];       // dot-prod <vector(16) char>
+         sum += w[i];                // widen-sum <vector(16) char>
+         sum += abs(s0[i] - s1[i]);  // sad <vector(8) short>
+         sum += n[i];                // normal <vector(4) int>
+         ...
+       }
+
+   The vectorization factor is essentially determined by the operation whose
+   input vectype has the most lanes ("vector(16) char" in the example), while
+   we need to choose the input vectype with the least lanes ("vector(4) int"
+   in the example) for the reduction PHI statement.  */
+
+bool
+vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
+			    slp_tree slp_node, stmt_vector_for_cost *cost_vec)
+{
+  gimple *stmt = stmt_info->stmt;
+
+  if (!lane_reducing_stmt_p (stmt))
+    return false;
+
+  tree type = TREE_TYPE (gimple_assign_lhs (stmt));
+
+  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
+    return false;
+
+  /* Do not try to vectorize bit-precision reductions.  */
+  if (!type_has_mode_precision_p (type))
+    return false;
+
+  if (!slp_node)
+    return false;
+
+  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
+    {
+      stmt_vec_info def_stmt_info;
+      slp_tree slp_op;
+      tree op;
+      tree vectype;
+      enum vect_def_type dt;
+
+      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
+			       &slp_op, &dt, &vectype, &def_stmt_info))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "use not simple.\n");
+	  return false;
+	}
+
+      if (!vectype)
+	{
+	  vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op),
+						 slp_op);
+	  if (!vectype)
+	    return false;
+	}
+
+      if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "incompatible vector types for invariants\n");
+	  return false;
+	}
+
+      if (i == STMT_VINFO_REDUC_IDX (stmt_info))
+	continue;
+
+      /* There should be at most one cycle def in the stmt.  */
+      if (VECTORIZABLE_CYCLE_DEF (dt))
+	return false;
+    }
+
+  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
+
+  /* TODO: Support lane-reducing operations that do not directly participate
+     in loop reduction.  */
+  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
+    return false;
+
+  /* A lane-reducing pattern inside any inner loop of LOOP_VINFO is not
+     recognized.  */
+  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
+  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
+
+  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
+  int ncopies_for_cost;
+
+  if (SLP_TREE_LANES (slp_node) > 1)
+    {
+      /* Now lane-reducing operations in a non-single-lane slp node should only
+	 come from the same loop reduction path.  */
+      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
+      ncopies_for_cost = 1;
+    }
+  else
+    {
+      ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);
+      gcc_assert (ncopies_for_cost >= 1);
+    }
+
+  if (vect_is_emulated_mixed_dot_prod (stmt_info))
+    {
+      /* We need two extra invariants: one that contains the minimum signed
+	 value and one that contains half of its negative.  */
+      int prologue_stmts = 2;
+      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
+					scalar_to_vec, stmt_info, 0,
+					vect_prologue);
+      if (dump_enabled_p ())
+	dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
+		     "extra prologue_cost = %d .\n", cost);
+
+      /* Three dot-products and a subtraction.  */
+      ncopies_for_cost *= 4;
+    }
+
+  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
+		    vect_body);
+
+  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+    {
+      enum tree_code code = gimple_assign_rhs_code (stmt);
+      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
+						  slp_node, code, type,
+						  vectype_in);
+    }
+
+  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
+  return true;
+}
+
 /* Function vectorizable_reduction.
 
    Check if STMT_INFO performs a reduction operation that can be vectorized.
@@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   if (!type_has_mode_precision_p (op.type))
     return false;
 
-  /* For lane-reducing ops we're reducing the number of reduction PHIs
-     which means the only use of that may be in the lane-reducing operation.  */
-  if (lane_reducing
-      && reduc_chain_length != 1
-      && !only_slp_reduc_chain)
-    {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "lane-reducing reduction with extra stmts.\n");
-      return false;
-    }
-
   /* Lane-reducing ops also never can be used in a SLP reduction group
      since we'll mix lanes belonging to different reductions.  But it's
      OK to use them in a reduction chain or when the reduction group
@@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
       && loop_vinfo->suggested_unroll_factor == 1)
     single_defuse_cycle = true;
 
-  if (single_defuse_cycle || lane_reducing)
+  if (single_defuse_cycle && !lane_reducing)
     {
       gcc_assert (op.code != COND_EXPR);
 
-      /* 4. Supportable by target?  */
-      bool ok = true;
-
-      /* 4.1. check support for the operation in the loop
+      /* 4. check support for the operation in the loop
 
 	 This isn't necessary for the lane reduction codes, since they
 	 can only be produced by pattern matching, and it's up to the
@@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 	 mixed-sign dot-products can be implemented using signed
 	 dot-products.  */
       machine_mode vec_mode = TYPE_MODE (vectype_in);
-      if (!lane_reducing
-	  && !directly_supported_p (op.code, vectype_in, optab_vector))
+      if (!directly_supported_p (op.code, vectype_in, optab_vector))
         {
           if (dump_enabled_p ())
             dump_printf (MSG_NOTE, "op not supported by target.\n");
 	  if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
 	      || !vect_can_vectorize_without_simd_p (op.code))
-	    ok = false;
+	    single_defuse_cycle = false;
 	  else
 	    if (dump_enabled_p ())
 	      dump_printf (MSG_NOTE, "proceeding using word mode.\n");
@@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 	    dump_printf (MSG_NOTE, "using word mode not possible.\n");
 	  return false;
 	}
-
-      /* lane-reducing operations have to go through vect_transform_reduction.
-         For the other cases try without the single cycle optimization.  */
-      if (!ok)
-	{
-	  if (lane_reducing)
-	    return false;
-	  else
-	    single_defuse_cycle = false;
-	}
     }
   if (dump_enabled_p () && single_defuse_cycle)
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 		     "multiple vectors to one in the loop body\n");
   STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;
 
-  /* If the reduction stmt is one of the patterns that have lane
-     reduction embedded we cannot handle the case of ! single_defuse_cycle.  */
-  if ((ncopies > 1 && ! single_defuse_cycle)
-      && lane_reducing)
-    {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "multi def-use cycle not possible for lane-reducing "
-			 "reduction operation\n");
-      return false;
-    }
+  /* For a lane-reducing operation, the processing below related to the
+     single defuse-cycle will be done in its own vectorizable function.
+     One more thing to note is that the operation must not be involved
+     in a fold-left reduction.  */
+  single_defuse_cycle &= !lane_reducing;
 
   if (slp_node
-      && !(!single_defuse_cycle
-	   && !lane_reducing
-	   && reduction_type != FOLD_LEFT_REDUCTION))
+      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
     for (i = 0; i < (int) op.num_ops; i++)
       if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
 	{
@@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
 			     reduction_type, ncopies, cost_vec);
   /* Cost the reduction op inside the loop if transformed via
-     vect_transform_reduction.  Otherwise this is costed by the
-     separate vectorizable_* routines.  */
-  if (single_defuse_cycle || lane_reducing)
-    {
-      int factor = 1;
-      if (vect_is_emulated_mixed_dot_prod (stmt_info))
-	/* Three dot-products and a subtraction.  */
-	factor = 4;
-      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
-			stmt_info, 0, vect_body);
-    }
+     vect_transform_reduction for a non-lane-reducing operation.  Otherwise
+     this is costed by the separate vectorizable_* routines.  */
+  if (single_defuse_cycle)
+    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
 
   if (dump_enabled_p ()
       && reduction_type == FOLD_LEFT_REDUCTION)
     dump_printf_loc (MSG_NOTE, vect_location,
 		     "using an in-order (fold-left) reduction.\n");
   STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
-  /* All but single defuse-cycle optimized, lane-reducing and fold-left
-     reductions go through their own vectorizable_* routines.  */
-  if (!single_defuse_cycle
-      && !lane_reducing
-      && reduction_type != FOLD_LEFT_REDUCTION)
+
+  /* All but single defuse-cycle optimized and fold-left reductions go
+     through their own vectorizable_* routines.  */
+  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
     {
       stmt_vec_info tem
 	= vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
@@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   bool lane_reducing = lane_reducing_op_p (code);
   gcc_assert (single_defuse_cycle || lane_reducing);
 
+  if (lane_reducing)
+    {
+      /* The last operand of lane-reducing op is for reduction.  */
+      gcc_assert (reduc_index == (int) op.num_ops - 1);
+
+      /* Now all lane-reducing ops are covered by some slp node.  */
+      gcc_assert (slp_node);
+    }
+
   /* Create the destination vector  */
   tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
   tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
@@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 			 reduc_index == 2 ? op.ops[2] : NULL_TREE,
 			 &vec_oprnds[2]);
     }
+  else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1
+	   && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ())
+    {
+      /* For a lane-reducing op covered by a single-lane slp node, the input
+	 vectype of the reduction PHI determines the number of copies of the
+	 vectorized def-use cycles, which might exceed the effective copies of
+	 the vectorized lane-reducing reduction statements.  The gap can be
+	 filled by generating extra trivial pass-through copies.  For example:
+
+	   int sum = 0;
+	   for (i)
+	     {
+	       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
+	       sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
+	       sum += n[i];               // normal <vector(4) int>
+	     }
+
+	 The vector size is 128-bit, the vectorization factor is 16.  Reduction
+	 statements would be transformed as:
+
+	   vector<4> int sum_v0 = { 0, 0, 0, 0 };
+	   vector<4> int sum_v1 = { 0, 0, 0, 0 };
+	   vector<4> int sum_v2 = { 0, 0, 0, 0 };
+	   vector<4> int sum_v3 = { 0, 0, 0, 0 };
+
+	   for (i / 16)
+	     {
+	       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
+	       sum_v1 = sum_v1;  // copy
+	       sum_v2 = sum_v2;  // copy
+	       sum_v3 = sum_v3;  // copy
+
+	       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
+	       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
+	       sum_v2 = sum_v2;  // copy
+	       sum_v3 = sum_v3;  // copy
+
+	       sum_v0 += n_v0[i: 0  ~ 3 ];
+	       sum_v1 += n_v1[i: 4  ~ 7 ];
+	       sum_v2 += n_v2[i: 8  ~ 11];
+	       sum_v3 += n_v3[i: 12 ~ 15];
+	     }
+	*/
+      unsigned using_ncopies = vec_oprnds[0].length ();
+      unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
+
+      for (unsigned i = 0; i < op.num_ops - 1; i++)
+	{
+	  gcc_assert (vec_oprnds[i].length () == using_ncopies);
+	  vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
+	}
+    }
 
   bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
   unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
@@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
     {
       gimple *new_stmt;
       tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
-      if (masked_loop_p && !mask_by_cond_expr)
+
+      if (!vop[0] || !vop[1])
+	{
+	  tree reduc_vop = vec_oprnds[reduc_index][i];
+
+	  /* Insert a trivial copy when there is no need to generate a
+	     vectorized statement.  */
+	  gcc_assert (reduc_vop);
+
+	  new_stmt = gimple_build_assign (vec_dest, reduc_vop);
+	  new_temp = make_ssa_name (vec_dest, new_stmt);
+	  gimple_set_lhs (new_stmt, new_temp);
+	  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
+	}
+      else if (masked_loop_p && !mask_by_cond_expr)
 	{
 	  /* No conditional ifns have been defined for lane-reducing op
 	     yet.  */
@@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 
 	  if (masked_loop_p && mask_by_cond_expr)
 	    {
+	      tree stmt_vectype_in = vectype_in;
+	      unsigned nvectors = vec_num * ncopies;
+
+	      if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
+		{
+		  /* The input vectype of the reduction PHI may be different
+		     from that of the lane-reducing operation.  */
+		  stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
+		  nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);
+		}
+
 	      tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks,
-					      vec_num * ncopies, vectype_in, i);
+					      nvectors, stmt_vectype_in, i);
 	      build_vect_cond_expr (code, vop, mask, gsi);
 	    }
 
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index ca6052662a3..1b73ef01ade 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -13350,6 +13350,8 @@ vect_analyze_stmt (vec_info *vinfo,
 				      NULL, NULL, node, cost_vec)
 	  || vectorizable_load (vinfo, stmt_info, NULL, NULL, node, cost_vec)
 	  || vectorizable_store (vinfo, stmt_info, NULL, NULL, node, cost_vec)
+	  || vectorizable_lane_reducing (as_a <loop_vec_info> (vinfo),
+					 stmt_info, node, cost_vec)
 	  || vectorizable_reduction (as_a <loop_vec_info> (vinfo), stmt_info,
 				     node, node_instance, cost_vec)
 	  || vectorizable_induction (as_a <loop_vec_info> (vinfo), stmt_info,
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 60224f4e284..94736736dcc 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2455,6 +2455,8 @@ extern loop_vec_info vect_create_loop_vinfo (class loop *, vec_info_shared *,
 extern bool vectorizable_live_operation (vec_info *, stmt_vec_info,
 					 slp_tree, slp_instance, int,
 					 bool, stmt_vector_for_cost *);
+extern bool vectorizable_lane_reducing (loop_vec_info, stmt_vec_info,
+					slp_tree, stmt_vector_for_cost *);
 extern bool vectorizable_reduction (loop_vec_info, stmt_vec_info,
 				    slp_tree, slp_instance,
 				    stmt_vector_for_cost *);
-- 
2.17.1

+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 short a[100], b[100];
+  int expected = 0x12345;
+  int n = 18;
+
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      asm volatile ("" ::: "memory");
+    }
+
+  for (int i = 0; i < n; i++)
+    {
+      asm volatile ("" ::: "memory");
+      expected += a[3 * i + 0] * b[3 * i + 0];
+      expected += a[3 * i + 1] * b[3 * i + 1];
+      expected += a[3 * i + 2] * b[3 * i + 2];
+    }
+
+  if (f (0x12345, a, b, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 3 "vect"  { target vect_sdot_hi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
new file mode 100644
index 00000000000..e17d6291f75
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
@@ -0,0 +1,35 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-do compile } */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res0,
+   SIGNEDNESS_1 int res1,
+   SIGNEDNESS_1 int res2,
+   SIGNEDNESS_1 int res3,
+   SIGNEDNESS_2 short *a,
+   SIGNEDNESS_2 short *b)
+{
+  for (int i = 0; i < 64; i += 4)
+    {
+      res0 += a[i + 0] * b[i + 0];
+      res1 += a[i + 1] * b[i + 1];
+      res2 += a[i + 2] * b[i + 2];
+      res3 += a[i + 3] * b[i + 3];
+    }
+
+  return res0 ^ res1 ^ res2 ^ res3;
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump-not "vectorizing stmts using SLP" "vect" } } */
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index e0561feddce..6d91665a341 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -5324,8 +5324,6 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
   if (!gimple_extract_op (orig_stmt_info->stmt, &op))
     gcc_unreachable ();
 
-  bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
-
   if (reduction_type == EXTRACT_LAST_REDUCTION)
     /* No extra instructions are needed in the prologue.  The loop body
        operations are costed in vectorizable_condition.  */
@@ -5360,12 +5358,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
 	   initial result of the data reduction, initial value of the index
 	   reduction.  */
 	prologue_stmts = 4;
-      else if (emulated_mixed_dot_prod)
-	/* We need the initial reduction value and two invariants:
-	   one that contains the minimum signed value and one that
-	   contains half of its negative.  */
-	prologue_stmts = 3;
       else
+	/* We need the initial reduction value.  */
 	prologue_stmts = 1;
       prologue_cost += record_stmt_cost (cost_vec, prologue_stmts,
 					 scalar_to_vec, stmt_info, 0,
@@ -7466,7 +7460,7 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
       vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
       unsigned nvectors;
 
-      if (slp_node)
+      if (slp_node && SLP_TREE_LANES (slp_node) > 1)
 	nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
       else
 	nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
@@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
     }
 }
 
+/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
+   the context of LOOP_VINFO, with the vector cost recorded in COST_VEC.
+   Currently there are three kinds of such operations: dot-prod/widen-sum/sad
+   (sum-of-absolute-differences).
+
+   For a lane-reducing operation, the loop reduction path that it lies in
+   may contain normal operations, or other lane-reducing operations with
+   different input type sizes, for example:
+
+     int sum = 0;
+     for (i)
+       {
+         ...
+         sum += d0[i] * d1[i];       // dot-prod <vector(16) char>
+         sum += w[i];                // widen-sum <vector(16) char>
+         sum += abs(s0[i] - s1[i]);  // sad <vector(8) short>
+         sum += n[i];                // normal <vector(4) int>
+         ...
+       }
+
+   The vectorization factor is essentially determined by the operation whose
+   input vectype has the most lanes ("vector(16) char" in the example), while
+   we need to choose the input vectype with the fewest lanes ("vector(4) int"
+   in the example) for the reduction PHI statement.  */
+
+bool
+vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
+			    slp_tree slp_node, stmt_vector_for_cost *cost_vec)
+{
+  gimple *stmt = stmt_info->stmt;
+
+  if (!lane_reducing_stmt_p (stmt))
+    return false;
+
+  tree type = TREE_TYPE (gimple_assign_lhs (stmt));
+
+  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
+    return false;
+
+  /* Do not try to vectorize bit-precision reductions.  */
+  if (!type_has_mode_precision_p (type))
+    return false;
+
+  if (!slp_node)
+    return false;
+
+  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
+    {
+      stmt_vec_info def_stmt_info;
+      slp_tree slp_op;
+      tree op;
+      tree vectype;
+      enum vect_def_type dt;
+
+      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
+			       &slp_op, &dt, &vectype, &def_stmt_info))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "use not simple.\n");
+	  return false;
+	}
+
+      if (!vectype)
+	{
+	  vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op),
+						 slp_op);
+	  if (!vectype)
+	    return false;
+	}
+
+      if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "incompatible vector types for invariants\n");
+	  return false;
+	}
+
+      if (i == STMT_VINFO_REDUC_IDX (stmt_info))
+	continue;
+
+      /* There should be at most one cycle def in the stmt.  */
+      if (VECTORIZABLE_CYCLE_DEF (dt))
+	return false;
+    }
+
+  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
+
+  /* TODO: Support lane-reducing operations that do not directly participate
+     in loop reduction.  */
+  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
+    return false;
+
+  /* A lane-reducing pattern inside any inner loop of LOOP_VINFO is not
+     recognized.  */
+  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
+  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
+
+  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
+  int ncopies_for_cost;
+
+  if (SLP_TREE_LANES (slp_node) > 1)
+    {
+      /* Now lane-reducing operations in a non-single-lane SLP node should
+	 only come from the same loop reduction path.  */
+      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
+      ncopies_for_cost = 1;
+    }
+  else
+    {
+      ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);
+      gcc_assert (ncopies_for_cost >= 1);
+    }
+
+  if (vect_is_emulated_mixed_dot_prod (stmt_info))
+    {
+      /* We need two extra invariants: one that contains the minimum signed
+	 value and one that contains half of its negative.  */
+      int prologue_stmts = 2;
+      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
+					scalar_to_vec, stmt_info, 0,
+					vect_prologue);
+      if (dump_enabled_p ())
+	dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
+		     "extra prologue_cost = %d .\n", cost);
+
+      /* Three dot-products and a subtraction.  */
+      ncopies_for_cost *= 4;
+    }
+
+  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
+		    vect_body);
+
+  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+    {
+      enum tree_code code = gimple_assign_rhs_code (stmt);
+      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
+						  slp_node, code, type,
+						  vectype_in);
+    }
+
+  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
+  return true;
+}
+
 /* Function vectorizable_reduction.
 
    Check if STMT_INFO performs a reduction operation that can be vectorized.
@@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   if (!type_has_mode_precision_p (op.type))
     return false;
 
-  /* For lane-reducing ops we're reducing the number of reduction PHIs
-     which means the only use of that may be in the lane-reducing operation.  */
-  if (lane_reducing
-      && reduc_chain_length != 1
-      && !only_slp_reduc_chain)
-    {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "lane-reducing reduction with extra stmts.\n");
-      return false;
-    }
-
   /* Lane-reducing ops also never can be used in a SLP reduction group
      since we'll mix lanes belonging to different reductions.  But it's
      OK to use them in a reduction chain or when the reduction group
@@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
       && loop_vinfo->suggested_unroll_factor == 1)
     single_defuse_cycle = true;
 
-  if (single_defuse_cycle || lane_reducing)
+  if (single_defuse_cycle && !lane_reducing)
     {
       gcc_assert (op.code != COND_EXPR);
 
-      /* 4. Supportable by target?  */
-      bool ok = true;
-
-      /* 4.1. check support for the operation in the loop
+      /* 4. check support for the operation in the loop
 
 	 This isn't necessary for the lane reduction codes, since they
 	 can only be produced by pattern matching, and it's up to the
@@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 	 mixed-sign dot-products can be implemented using signed
 	 dot-products.  */
       machine_mode vec_mode = TYPE_MODE (vectype_in);
-      if (!lane_reducing
-	  && !directly_supported_p (op.code, vectype_in, optab_vector))
+      if (!directly_supported_p (op.code, vectype_in, optab_vector))
         {
           if (dump_enabled_p ())
             dump_printf (MSG_NOTE, "op not supported by target.\n");
 	  if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
 	      || !vect_can_vectorize_without_simd_p (op.code))
-	    ok = false;
+	    single_defuse_cycle = false;
 	  else
 	    if (dump_enabled_p ())
 	      dump_printf (MSG_NOTE, "proceeding using word mode.\n");
@@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 	    dump_printf (MSG_NOTE, "using word mode not possible.\n");
 	  return false;
 	}
-
-      /* lane-reducing operations have to go through vect_transform_reduction.
-         For the other cases try without the single cycle optimization.  */
-      if (!ok)
-	{
-	  if (lane_reducing)
-	    return false;
-	  else
-	    single_defuse_cycle = false;
-	}
     }
   if (dump_enabled_p () && single_defuse_cycle)
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 		     "multiple vectors to one in the loop body\n");
   STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;
 
-  /* If the reduction stmt is one of the patterns that have lane
-     reduction embedded we cannot handle the case of ! single_defuse_cycle.  */
-  if ((ncopies > 1 && ! single_defuse_cycle)
-      && lane_reducing)
-    {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "multi def-use cycle not possible for lane-reducing "
-			 "reduction operation\n");
-      return false;
-    }
+  /* For a lane-reducing operation, the processing below related to the
+     single defuse-cycle is done in its own vectorizable function.  Note
+     also that such an operation must not be involved in a fold-left
+     reduction.  */
+  single_defuse_cycle &= !lane_reducing;
 
   if (slp_node
-      && !(!single_defuse_cycle
-	   && !lane_reducing
-	   && reduction_type != FOLD_LEFT_REDUCTION))
+      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
     for (i = 0; i < (int) op.num_ops; i++)
       if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
 	{
@@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
 			     reduction_type, ncopies, cost_vec);
   /* Cost the reduction op inside the loop if transformed via
-     vect_transform_reduction.  Otherwise this is costed by the
-     separate vectorizable_* routines.  */
-  if (single_defuse_cycle || lane_reducing)
-    {
-      int factor = 1;
-      if (vect_is_emulated_mixed_dot_prod (stmt_info))
-	/* Three dot-products and a subtraction.  */
-	factor = 4;
-      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
-			stmt_info, 0, vect_body);
-    }
+     vect_transform_reduction for non-lane-reducing operations.  Otherwise
+     this is costed by the separate vectorizable_* routines.  */
+  if (single_defuse_cycle)
+    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
 
   if (dump_enabled_p ()
       && reduction_type == FOLD_LEFT_REDUCTION)
     dump_printf_loc (MSG_NOTE, vect_location,
 		     "using an in-order (fold-left) reduction.\n");
   STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
-  /* All but single defuse-cycle optimized, lane-reducing and fold-left
-     reductions go through their own vectorizable_* routines.  */
-  if (!single_defuse_cycle
-      && !lane_reducing
-      && reduction_type != FOLD_LEFT_REDUCTION)
+
+  /* All but single defuse-cycle optimized and fold-left reductions go
+     through their own vectorizable_* routines.  */
+  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
     {
       stmt_vec_info tem
 	= vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
@@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   bool lane_reducing = lane_reducing_op_p (code);
   gcc_assert (single_defuse_cycle || lane_reducing);
 
+  if (lane_reducing)
+    {
+      /* The last operand of a lane-reducing op is for reduction.  */
+      gcc_assert (reduc_index == (int) op.num_ops - 1);
+
+      /* Now all lane-reducing ops are covered by some SLP node.  */
+      gcc_assert (slp_node);
+    }
+
   /* Create the destination vector  */
   tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
   tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
@@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 			 reduc_index == 2 ? op.ops[2] : NULL_TREE,
 			 &vec_oprnds[2]);
     }
+  else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1
+	   && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ())
+    {
+      /* For a lane-reducing op covered by a single-lane SLP node, the input
+	 vectype of the reduction PHI determines the number of vectorized
+	 def-use cycles, which may exceed the effective number of vectorized
+	 lane-reducing reduction statements.  The gap is filled by generating
+	 extra trivial pass-through copies.  For example:
+
+	   int sum = 0;
+	   for (i)
+	     {
+	       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
+	       sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
+	       sum += n[i];               // normal <vector(4) int>
+	     }
+
+	 The vector size is 128-bit and the vectorization factor is 16.  Reduction
+	 statements would be transformed as:
+
+	   vector<4> int sum_v0 = { 0, 0, 0, 0 };
+	   vector<4> int sum_v1 = { 0, 0, 0, 0 };
+	   vector<4> int sum_v2 = { 0, 0, 0, 0 };
+	   vector<4> int sum_v3 = { 0, 0, 0, 0 };
+
+	   for (i / 16)
+	     {
+	       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
+	       sum_v1 = sum_v1;  // copy
+	       sum_v2 = sum_v2;  // copy
+	       sum_v3 = sum_v3;  // copy
+
+	       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
+	       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
+	       sum_v2 = sum_v2;  // copy
+	       sum_v3 = sum_v3;  // copy
+
+	       sum_v0 += n_v0[i: 0  ~ 3 ];
+	       sum_v1 += n_v1[i: 4  ~ 7 ];
+	       sum_v2 += n_v2[i: 8  ~ 11];
+	       sum_v3 += n_v3[i: 12 ~ 15];
+	     }
+	*/
+      unsigned using_ncopies = vec_oprnds[0].length ();
+      unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
+
+      for (unsigned i = 0; i < op.num_ops - 1; i++)
+	{
+	  gcc_assert (vec_oprnds[i].length () == using_ncopies);
+	  vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
+	}
+    }
 
   bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
   unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
@@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
     {
       gimple *new_stmt;
       tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
-      if (masked_loop_p && !mask_by_cond_expr)
+
+      if (!vop[0] || !vop[1])
+	{
+	  tree reduc_vop = vec_oprnds[reduc_index][i];
+
+	  /* Insert a trivial copy when there is no need to generate a
+	     vectorized statement.  */
+	  gcc_assert (reduc_vop);
+
+	  new_stmt = gimple_build_assign (vec_dest, reduc_vop);
+	  new_temp = make_ssa_name (vec_dest, new_stmt);
+	  gimple_set_lhs (new_stmt, new_temp);
+	  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
+	}
+      else if (masked_loop_p && !mask_by_cond_expr)
 	{
 	  /* No conditional ifns have been defined for lane-reducing op
 	     yet.  */
@@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 
 	  if (masked_loop_p && mask_by_cond_expr)
 	    {
+	      tree stmt_vectype_in = vectype_in;
+	      unsigned nvectors = vec_num * ncopies;
+
+	      if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
+		{
+		  /* Input vectype of the reduction PHI may be different from
+		     that of the lane-reducing operation.  */
+		  stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
+		  nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);
+		}
+
 	      tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks,
-					      vec_num * ncopies, vectype_in, i);
+					      nvectors, stmt_vectype_in, i);
 	      build_vect_cond_expr (code, vop, mask, gsi);
 	    }
 
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index ca6052662a3..1b73ef01ade 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -13350,6 +13350,8 @@ vect_analyze_stmt (vec_info *vinfo,
 				      NULL, NULL, node, cost_vec)
 	  || vectorizable_load (vinfo, stmt_info, NULL, NULL, node, cost_vec)
 	  || vectorizable_store (vinfo, stmt_info, NULL, NULL, node, cost_vec)
+	  || vectorizable_lane_reducing (as_a <loop_vec_info> (vinfo),
+					 stmt_info, node, cost_vec)
 	  || vectorizable_reduction (as_a <loop_vec_info> (vinfo), stmt_info,
 				     node, node_instance, cost_vec)
 	  || vectorizable_induction (as_a <loop_vec_info> (vinfo), stmt_info,
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 60224f4e284..94736736dcc 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2455,6 +2455,8 @@ extern loop_vec_info vect_create_loop_vinfo (class loop *, vec_info_shared *,
 extern bool vectorizable_live_operation (vec_info *, stmt_vec_info,
 					 slp_tree, slp_instance, int,
 					 bool, stmt_vector_for_cost *);
+extern bool vectorizable_lane_reducing (loop_vec_info, stmt_vec_info,
+					slp_tree, stmt_vector_for_cost *);
 extern bool vectorizable_reduction (loop_vec_info, stmt_vec_info,
 				    slp_tree, slp_instance,
 				    stmt_vector_for_cost *);
-- 
2.17.1



* Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
  2024-06-16  7:31 [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440] Feng Xue OS
@ 2024-06-20  5:59 ` Feng Xue OS
  2024-06-20 12:26 ` Richard Biener
  1 sibling, 0 replies; 9+ messages in thread
From: Feng Xue OS @ 2024-06-20  5:59 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Updated the patch to reflect some new changes.


For lane-reducing operations (dot-prod/widen-sum/sad) in loop reduction, the
current vectorizer can only handle the pattern if the reduction chain contains
no other operation, whether normal or lane-reducing.

To allow multiple arbitrary lane-reducing operations, we need to support
vectorization of a loop reduction chain with mixed input vectypes. Since the
number of lanes in a vectype may vary with the operation, the effective
ncopies of the vectorized statements may also differ between operations, which
causes a mismatch among the vectorized def-use cycles. A simple way is to
align all operations with the one that has the most ncopies; the gap can be
filled by generating extra trivial pass-through copies. For example:

   int sum = 0;
   for (i)
     {
       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
       sum += w[i];               // widen-sum <vector(16) char>
       sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
       sum += n[i];               // normal <vector(4) int>
     }

The vector size is 128-bit and the vectorization factor is 16. Reduction
statements would be transformed as:

   vector<4> int sum_v0 = { 0, 0, 0, 0 };
   vector<4> int sum_v1 = { 0, 0, 0, 0 };
   vector<4> int sum_v2 = { 0, 0, 0, 0 };
   vector<4> int sum_v3 = { 0, 0, 0, 0 };

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 += n_v0[i: 0  ~ 3 ];
       sum_v1 += n_v1[i: 4  ~ 7 ];
       sum_v2 += n_v2[i: 8  ~ 11];
       sum_v3 += n_v3[i: 12 ~ 15];
     }
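
To make the example concrete, below is a minimal compilable sketch of the
scalar loop above, written in the style of the new tests; the function name
and the use of __builtin_abs are illustrative only and not part of the patch:

   int __attribute__ ((noipa))
   mixed_reduc (signed char *d0, signed char *d1, signed char *w,
                short *s0, short *s1, int *n, int cnt)
   {
     int sum = 0;
     for (int i = 0; i < cnt; i++)
       {
         sum += d0[i] * d1[i];                  /* dot-prod  */
         sum += w[i];                           /* widen-sum */
         sum += __builtin_abs (s0[i] - s1[i]);  /* sad       */
         sum += n[i];                           /* normal    */
       }
     return sum;
   }

On a suitable target, the vect dump for this function would be expected to
show DOT_PROD_EXPR, WIDEN_SUM_EXPR and SAD_EXPR statements all feeding the
same reduction, interleaved with the pass-through copies described above.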

2024-03-22  Feng Xue  <fxue@os.amperecomputing.com>

gcc/
        PR tree-optimization/114440
        * tree-vectorizer.h (vectorizable_lane_reducing): New function
        declaration.
        * tree-vect-stmts.cc (vect_analyze_stmt): Call new function
        vectorizable_lane_reducing to analyze lane-reducing operation.
        * tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation
        code related to emulated_mixed_dot_prod.
        (vect_reduction_update_partial_vector_usage): Compute ncopies in the
        original way for a single-lane slp node.
        (vectorizable_lane_reducing): New function.
        (vectorizable_reduction): Allow multiple lane-reducing operations in
        loop reduction. Move some original lane-reducing related code to
        vectorizable_lane_reducing.
        (vect_transform_reduction): Extend transformation to support reduction
        statements with mixed input vectypes.

gcc/testsuite/
        PR tree-optimization/114440
        * gcc.dg/vect/vect-reduc-chain-1.c
        * gcc.dg/vect/vect-reduc-chain-2.c
        * gcc.dg/vect/vect-reduc-chain-3.c
        * gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
        * gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
        * gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
        * gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
        * gcc.dg/vect/vect-reduc-dot-slp-1.c
---
 .../gcc.dg/vect/vect-reduc-chain-1.c          |  62 ++++
 .../gcc.dg/vect/vect-reduc-chain-2.c          |  77 ++++
 .../gcc.dg/vect/vect-reduc-chain-3.c          |  66 ++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c  |  95 +++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c  |  67 ++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c  |  79 +++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c  |  63 ++++
 .../gcc.dg/vect/vect-reduc-dot-slp-1.c        |  60 ++++
 gcc/tree-vect-loop.cc                         | 334 ++++++++++++++----
 gcc/tree-vect-stmts.cc                        |   2 +
 gcc/tree-vectorizer.h                         |   2 +
 11 files changed, 834 insertions(+), 73 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c

diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
new file mode 100644
index 00000000000..04bfc419dbd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
@@ -0,0 +1,62 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_2 char *restrict c,
+   SIGNEDNESS_2 char *restrict d,
+   SIGNEDNESS_1 int *restrict e)
+{
+  for (int i = 0; i < N; ++i)
+    {
+      res += a[i] * b[i];
+      res += c[i] * d[i];
+      res += e[i];
+    }
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[N], b[N];
+  SIGNEDNESS_2 char c[N], d[N];
+  SIGNEDNESS_1 int e[N];
+  int expected = 0x12345;
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      c[i] = BASE + i * 2;
+      d[i] = BASE + OFFSET + i * 3;
+      e[i] = i;
+      asm volatile ("" ::: "memory");
+      expected += a[i] * b[i];
+      expected += c[i] * d[i];
+      expected += e[i];
+    }
+  if (f (0x12345, a, b, c, d, e) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 2 "vect" { target vect_sdot_qi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
new file mode 100644
index 00000000000..6c803b80120
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
@@ -0,0 +1,77 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 unsigned
+#define SIGNEDNESS_3 signed
+#define SIGNEDNESS_4 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+fn (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_3 char *restrict c,
+   SIGNEDNESS_3 char *restrict d,
+   SIGNEDNESS_4 short *restrict e,
+   SIGNEDNESS_4 short *restrict f,
+   SIGNEDNESS_1 int *restrict g)
+{
+  for (int i = 0; i < N; ++i)
+    {
+      res += a[i] * b[i];
+      res += i + 1;
+      res += c[i] * d[i];
+      res += e[i] * f[i];
+      res += g[i];
+    }
+  return res;
+}
+
+#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -126 : 4)
+#define BASE4 ((SIGNEDNESS_4 int) -1 < 0 ? -1026 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[N], b[N];
+  SIGNEDNESS_3 char c[N], d[N];
+  SIGNEDNESS_4 short e[N], f[N];
+  SIGNEDNESS_1 int g[N];
+  int expected = 0x12345;
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE2 + i * 5;
+      b[i] = BASE2 + OFFSET + i * 4;
+      c[i] = BASE3 + i * 2;
+      d[i] = BASE3 + OFFSET + i * 3;
+      e[i] = BASE4 + i * 6;
+      f[i] = BASE4 + OFFSET + i * 5;
+      g[i] = i;
+      asm volatile ("" ::: "memory");
+      expected += a[i] * b[i];
+      expected += i + 1;
+      expected += c[i] * d[i];
+      expected += e[i] * f[i];
+      expected += g[i];
+    }
+  if (fn (0x12345, a, b, c, d, e, f, g) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_sdot_qi } } } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_udot_qi } } } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_sdot_hi } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
new file mode 100644
index 00000000000..a41e4b176c4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
@@ -0,0 +1,66 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 unsigned
+#define SIGNEDNESS_3 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_3 short *restrict c,
+   SIGNEDNESS_3 short *restrict d,
+   SIGNEDNESS_1 int *restrict e)
+{
+  for (int i = 0; i < N; ++i)
+    {
+      short diff = a[i] - b[i];
+      SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff;
+      res += abs;
+      res += c[i] * d[i];
+      res += e[i];
+    }
+  return res;
+}
+
+#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -1236 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[N], b[N];
+  SIGNEDNESS_3 short c[N], d[N];
+  SIGNEDNESS_1 int e[N];
+  int expected = 0x12345;
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE2 + i * 5;
+      b[i] = BASE2 - i * 4;
+      c[i] = BASE3 + i * 2;
+      d[i] = BASE3 + OFFSET + i * 3;
+      e[i] = i;
+      asm volatile ("" ::: "memory");
+      short diff = a[i] - b[i];
+      SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff;
+      expected += abs;
+      expected += c[i] * d[i];
+      expected += e[i];
+    }
+  if (f (0x12345, a, b, c, d, e) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = SAD_EXPR" "vect" { target vect_udot_qi } } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target vect_sdot_hi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
new file mode 100644
index 00000000000..c2831fbcc8e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
@@ -0,0 +1,95 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *a,
+   SIGNEDNESS_2 char *b,
+   int step, int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[0] * b[0];
+      res += a[1] * b[1];
+      res += a[2] * b[2];
+      res += a[3] * b[3];
+      res += a[4] * b[4];
+      res += a[5] * b[5];
+      res += a[6] * b[6];
+      res += a[7] * b[7];
+      res += a[8] * b[8];
+      res += a[9] * b[9];
+      res += a[10] * b[10];
+      res += a[11] * b[11];
+      res += a[12] * b[12];
+      res += a[13] * b[13];
+      res += a[14] * b[14];
+      res += a[15] * b[15];
+
+      a += step;
+      b += step;
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[100], b[100];
+  int expected = 0x12345;
+  int step = 16;
+  int n = 2;
+  int t = 0;
+
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      asm volatile ("" ::: "memory");
+    }
+
+  for (int i = 0; i < n; i++)
+    {
+      asm volatile ("" ::: "memory");
+      expected += a[t + 0] * b[t + 0];
+      expected += a[t + 1] * b[t + 1];
+      expected += a[t + 2] * b[t + 2];
+      expected += a[t + 3] * b[t + 3];
+      expected += a[t + 4] * b[t + 4];
+      expected += a[t + 5] * b[t + 5];
+      expected += a[t + 6] * b[t + 6];
+      expected += a[t + 7] * b[t + 7];
+      expected += a[t + 8] * b[t + 8];
+      expected += a[t + 9] * b[t + 9];
+      expected += a[t + 10] * b[t + 10];
+      expected += a[t + 11] * b[t + 11];
+      expected += a[t + 12] * b[t + 12];
+      expected += a[t + 13] * b[t + 13];
+      expected += a[t + 14] * b[t + 14];
+      expected += a[t + 15] * b[t + 15];
+      t += step;
+    }
+
+  if (f (0x12345, a, b, step, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 16 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
new file mode 100644
index 00000000000..4114264a364
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
@@ -0,0 +1,67 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *a,
+   SIGNEDNESS_2 char *b,
+   int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[5 * i + 0] * b[5 * i + 0];
+      res += a[5 * i + 1] * b[5 * i + 1];
+      res += a[5 * i + 2] * b[5 * i + 2];
+      res += a[5 * i + 3] * b[5 * i + 3];
+      res += a[5 * i + 4] * b[5 * i + 4];
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[100], b[100];
+  int expected = 0x12345;
+  int n = 18;
+
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      asm volatile ("" ::: "memory");
+    }
+
+  for (int i = 0; i < n; i++)
+    {
+      asm volatile ("" ::: "memory");
+      expected += a[5 * i + 0] * b[5 * i + 0];
+      expected += a[5 * i + 1] * b[5 * i + 1];
+      expected += a[5 * i + 2] * b[5 * i + 2];
+      expected += a[5 * i + 3] * b[5 * i + 3];
+      expected += a[5 * i + 4] * b[5 * i + 4];
+    }
+
+  if (f (0x12345, a, b, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 5 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
new file mode 100644
index 00000000000..2cdecc36d16
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
@@ -0,0 +1,79 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 short *a,
+   SIGNEDNESS_2 short *b,
+   int step, int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[0] * b[0];
+      res += a[1] * b[1];
+      res += a[2] * b[2];
+      res += a[3] * b[3];
+      res += a[4] * b[4];
+      res += a[5] * b[5];
+      res += a[6] * b[6];
+      res += a[7] * b[7];
+
+      a += step;
+      b += step;
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 short a[100], b[100];
+  int expected = 0x12345;
+  int step = 8;
+  int n = 2;
+  int t = 0;
+
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      asm volatile ("" ::: "memory");
+    }
+
+  for (int i = 0; i < n; i++)
+    {
+      asm volatile ("" ::: "memory");
+      expected += a[t + 0] * b[t + 0];
+      expected += a[t + 1] * b[t + 1];
+      expected += a[t + 2] * b[t + 2];
+      expected += a[t + 3] * b[t + 3];
+      expected += a[t + 4] * b[t + 4];
+      expected += a[t + 5] * b[t + 5];
+      expected += a[t + 6] * b[t + 6];
+      expected += a[t + 7] * b[t + 7];
+      t += step;
+    }
+
+  if (f (0x12345, a, b, step, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 8 "vect"  { target vect_sdot_hi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
new file mode 100644
index 00000000000..32c0f30c77b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
@@ -0,0 +1,63 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 short *a,
+   SIGNEDNESS_2 short *b,
+   int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[3 * i + 0] * b[3 * i + 0];
+      res += a[3 * i + 1] * b[3 * i + 1];
+      res += a[3 * i + 2] * b[3 * i + 2];
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 short a[100], b[100];
+  int expected = 0x12345;
+  int n = 18;
+
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      asm volatile ("" ::: "memory");
+    }
+
+  for (int i = 0; i < n; i++)
+    {
+      asm volatile ("" ::: "memory");
+      expected += a[3 * i + 0] * b[3 * i + 0];
+      expected += a[3 * i + 1] * b[3 * i + 1];
+      expected += a[3 * i + 2] * b[3 * i + 2];
+    }
+
+  if (f (0x12345, a, b, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 3 "vect"  { target vect_sdot_hi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
new file mode 100644
index 00000000000..84c82b023d4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
@@ -0,0 +1,60 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-do compile } */
+/* { dg-additional-options "--param vect-epilogues-nomask=0 -fdump-tree-optimized" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res0,
+   SIGNEDNESS_1 int res1,
+   SIGNEDNESS_1 int res2,
+   SIGNEDNESS_1 int res3,
+   SIGNEDNESS_1 int res4,
+   SIGNEDNESS_1 int res5,
+   SIGNEDNESS_1 int res6,
+   SIGNEDNESS_1 int res7,
+   SIGNEDNESS_1 int res8,
+   SIGNEDNESS_1 int res9,
+   SIGNEDNESS_1 int resA,
+   SIGNEDNESS_1 int resB,
+   SIGNEDNESS_1 int resC,
+   SIGNEDNESS_1 int resD,
+   SIGNEDNESS_1 int resE,
+   SIGNEDNESS_1 int resF,
+   SIGNEDNESS_2 char *a,
+   SIGNEDNESS_2 char *b)
+{
+  for (int i = 0; i < 64; i += 16)
+    {
+      res0 += a[i + 0x00] * b[i + 0x00];
+      res1 += a[i + 0x01] * b[i + 0x01];
+      res2 += a[i + 0x02] * b[i + 0x02];
+      res3 += a[i + 0x03] * b[i + 0x03];
+      res4 += a[i + 0x04] * b[i + 0x04];
+      res5 += a[i + 0x05] * b[i + 0x05];
+      res6 += a[i + 0x06] * b[i + 0x06];
+      res7 += a[i + 0x07] * b[i + 0x07];
+      res8 += a[i + 0x08] * b[i + 0x08];
+      res9 += a[i + 0x09] * b[i + 0x09];
+      resA += a[i + 0x0A] * b[i + 0x0A];
+      resB += a[i + 0x0B] * b[i + 0x0B];
+      resC += a[i + 0x0C] * b[i + 0x0C];
+      resD += a[i + 0x0D] * b[i + 0x0D];
+      resE += a[i + 0x0E] * b[i + 0x0E];
+      resF += a[i + 0x0F] * b[i + 0x0F];
+    }
+
+  return res0 ^ res1 ^ res2 ^ res3 ^ res4 ^ res5 ^ res6 ^ res7 ^
+         res8 ^ res9 ^ resA ^ resB ^ resC ^ resD ^ resE ^ resF;
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump-not "DOT_PROD_EXPR" "optimized" } } */
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 6b9ca7a4df5..5a27a2c3d9c 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -5324,8 +5324,6 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
   if (!gimple_extract_op (orig_stmt_info->stmt, &op))
     gcc_unreachable ();

-  bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
-
   if (reduction_type == EXTRACT_LAST_REDUCTION)
     /* No extra instructions are needed in the prologue.  The loop body
        operations are costed in vectorizable_condition.  */
@@ -5360,12 +5358,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
           initial result of the data reduction, initial value of the index
           reduction.  */
        prologue_stmts = 4;
-      else if (emulated_mixed_dot_prod)
-       /* We need the initial reduction value and two invariants:
-          one that contains the minimum signed value and one that
-          contains half of its negative.  */
-       prologue_stmts = 3;
       else
+       /* We need the initial reduction value.  */
        prologue_stmts = 1;
       prologue_cost += record_stmt_cost (cost_vec, prologue_stmts,
                                         scalar_to_vec, stmt_info, 0,
@@ -7466,7 +7460,7 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
       vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
       unsigned nvectors;

-      if (slp_node)
+      if (slp_node && SLP_TREE_LANES (slp_node) > 1)
        nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
       else
        nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
@@ -7478,6 +7472,149 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
     }
 }

+/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
+   the context of LOOP_VINFO, with the vector cost recorded in COST_VEC.
+   Currently there are three kinds of such operations: dot-prod/widen-sum/sad
+   (sum-of-absolute-differences).
+
+   For a lane-reducing operation, the loop reduction path that it lies in
+   may contain normal operations, or other lane-reducing operations with
+   different input type sizes, for example:
+
+     int sum = 0;
+     for (i)
+       {
+         ...
+         sum += d0[i] * d1[i];       // dot-prod <vector(16) char>
+         sum += w[i];                // widen-sum <vector(16) char>
+         sum += abs(s0[i] - s1[i]);  // sad <vector(8) short>
+         sum += n[i];                // normal <vector(4) int>
+         ...
+       }
+
+   The vectorization factor is essentially determined by the operation whose
+   input vectype has the most lanes ("vector(16) char" in the example), while
+   we need to choose the input vectype with the fewest lanes ("vector(4) int"
+   in the example) for the reduction PHI statement.  */
+
+bool
+vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
+                           slp_tree slp_node, stmt_vector_for_cost *cost_vec)
+{
+  gimple *stmt = stmt_info->stmt;
+
+  if (!lane_reducing_stmt_p (stmt))
+    return false;
+
+  tree type = TREE_TYPE (gimple_assign_lhs (stmt));
+
+  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
+    return false;
+
+  /* Do not try to vectorize bit-precision reductions.  */
+  if (!type_has_mode_precision_p (type))
+    return false;
+
+  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
+    {
+      stmt_vec_info def_stmt_info;
+      slp_tree slp_op;
+      tree op;
+      tree vectype;
+      enum vect_def_type dt;
+
+      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
+                              &slp_op, &dt, &vectype, &def_stmt_info))
+       {
+         if (dump_enabled_p ())
+           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                            "use not simple.\n");
+         return false;
+       }
+
+      if (!vectype)
+       {
+         vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op),
+                                                slp_op);
+         if (!vectype)
+           return false;
+       }
+
+      if (slp_node && !vect_maybe_update_slp_op_vectype (slp_op, vectype))
+       {
+         if (dump_enabled_p ())
+           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                            "incompatible vector types for invariants\n");
+         return false;
+       }
+
+      if (i == STMT_VINFO_REDUC_IDX (stmt_info))
+       continue;
+
+      /* There should be at most one cycle def in the stmt.  */
+      if (VECTORIZABLE_CYCLE_DEF (dt))
+       return false;
+    }
+
+  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
+
+  /* TODO: Support lane-reducing operations that do not directly participate
+     in loop reduction.  */
+  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
+    return false;
+
+  /* A lane-reducing pattern inside any inner loop of LOOP_VINFO is not
+     recognized.  */
+  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
+  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
+
+  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
+  int ncopies_for_cost;
+
+  if (slp_node && SLP_TREE_LANES (slp_node) > 1)
+    {
+      /* Now lane-reducing operations in a non-single-lane SLP node should
+        only come from the same loop reduction path.  */
+      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
+      ncopies_for_cost = 1;
+    }
+  else
+    {
+      ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);
+      gcc_assert (ncopies_for_cost >= 1);
+    }
+
+  if (vect_is_emulated_mixed_dot_prod (stmt_info))
+    {
+      /* We need two extra invariants: one that contains the minimum signed
+        value and one that contains half of its negative.  */
+      int prologue_stmts = 2;
+      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
+                                       scalar_to_vec, stmt_info, 0,
+                                       vect_prologue);
+      if (dump_enabled_p ())
+       dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
+                    "extra prologue_cost = %d .\n", cost);
+
+      /* Three dot-products and a subtraction.  */
+      ncopies_for_cost *= 4;
+    }
+
+  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
+                   vect_body);
+
+  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+    {
+      enum tree_code code = gimple_assign_rhs_code (stmt);
+      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
+                                                 slp_node, code, type,
+                                                 vectype_in);
+    }
+
+  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
+  return true;
+}
+
 /* Function vectorizable_reduction.

    Check if STMT_INFO performs a reduction operation that can be vectorized.
@@ -7804,18 +7941,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   if (!type_has_mode_precision_p (op.type))
     return false;

-  /* For lane-reducing ops we're reducing the number of reduction PHIs
-     which means the only use of that may be in the lane-reducing operation.  */
-  if (lane_reducing
-      && reduc_chain_length != 1
-      && !only_slp_reduc_chain)
-    {
-      if (dump_enabled_p ())
-       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                        "lane-reducing reduction with extra stmts.\n");
-      return false;
-    }
-
   /* Lane-reducing ops also never can be used in a SLP reduction group
      since we'll mix lanes belonging to different reductions.  But it's
      OK to use them in a reduction chain or when the reduction group
@@ -8355,14 +8480,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
       && loop_vinfo->suggested_unroll_factor == 1)
     single_defuse_cycle = true;

-  if (single_defuse_cycle || lane_reducing)
+  if (single_defuse_cycle && !lane_reducing)
     {
       gcc_assert (op.code != COND_EXPR);

-      /* 4. Supportable by target?  */
-      bool ok = true;
-
-      /* 4.1. check support for the operation in the loop
+      /* 4. check support for the operation in the loop

         This isn't necessary for the lane reduction codes, since they
         can only be produced by pattern matching, and it's up to the
@@ -8371,14 +8493,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
         mixed-sign dot-products can be implemented using signed
         dot-products.  */
       machine_mode vec_mode = TYPE_MODE (vectype_in);
-      if (!lane_reducing
-         && !directly_supported_p (op.code, vectype_in, optab_vector))
+      if (!directly_supported_p (op.code, vectype_in, optab_vector))
         {
           if (dump_enabled_p ())
             dump_printf (MSG_NOTE, "op not supported by target.\n");
          if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
              || !vect_can_vectorize_without_simd_p (op.code))
-           ok = false;
+           single_defuse_cycle = false;
          else
            if (dump_enabled_p ())
              dump_printf (MSG_NOTE, "proceeding using word mode.\n");
@@ -8391,16 +8512,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
            dump_printf (MSG_NOTE, "using word mode not possible.\n");
          return false;
        }
-
-      /* lane-reducing operations have to go through vect_transform_reduction.
-         For the other cases try without the single cycle optimization.  */
-      if (!ok)
-       {
-         if (lane_reducing)
-           return false;
-         else
-           single_defuse_cycle = false;
-       }
     }
   if (dump_enabled_p () && single_defuse_cycle)
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -8408,22 +8519,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
                     "multiple vectors to one in the loop body\n");
   STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;

-  /* If the reduction stmt is one of the patterns that have lane
-     reduction embedded we cannot handle the case of ! single_defuse_cycle.  */
-  if ((ncopies > 1 && ! single_defuse_cycle)
-      && lane_reducing)
-    {
-      if (dump_enabled_p ())
-       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                        "multi def-use cycle not possible for lane-reducing "
-                        "reduction operation\n");
-      return false;
-    }
+  /* For a lane-reducing operation, the processing below related to the
+     single defuse-cycle will be done in its own vectorizable function.
+     One more thing to note is that the operation must not be involved in
+     a fold-left reduction.  */
+  single_defuse_cycle &= !lane_reducing;

   if (slp_node
-      && !(!single_defuse_cycle
-          && !lane_reducing
-          && reduction_type != FOLD_LEFT_REDUCTION))
+      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
     for (i = 0; i < (int) op.num_ops; i++)
       if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
        {
@@ -8436,28 +8539,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
                             reduction_type, ncopies, cost_vec);
   /* Cost the reduction op inside the loop if transformed via
-     vect_transform_reduction.  Otherwise this is costed by the
-     separate vectorizable_* routines.  */
-  if (single_defuse_cycle || lane_reducing)
-    {
-      int factor = 1;
-      if (vect_is_emulated_mixed_dot_prod (stmt_info))
-       /* Three dot-products and a subtraction.  */
-       factor = 4;
-      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
-                       stmt_info, 0, vect_body);
-    }
+     vect_transform_reduction for a non-lane-reducing operation.  Otherwise
+     this is costed by the separate vectorizable_* routines.  */
+  if (single_defuse_cycle)
+    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);

   if (dump_enabled_p ()
       && reduction_type == FOLD_LEFT_REDUCTION)
     dump_printf_loc (MSG_NOTE, vect_location,
                     "using an in-order (fold-left) reduction.\n");
   STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
-  /* All but single defuse-cycle optimized, lane-reducing and fold-left
-     reductions go through their own vectorizable_* routines.  */
-  if (!single_defuse_cycle
-      && !lane_reducing
-      && reduction_type != FOLD_LEFT_REDUCTION)
+
+  /* All but single defuse-cycle optimized and fold-left reductions go
+     through their own vectorizable_* routines.  */
+  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
     {
       stmt_vec_info tem
        = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
@@ -8585,6 +8680,13 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   gphi *reduc_def_phi = as_a <gphi *> (phi_info->stmt);
   int reduc_index = STMT_VINFO_REDUC_IDX (stmt_info);
   tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
+  code_helper code = canonicalize_code (op.code, op.type);
+  bool lane_reducing = lane_reducing_op_p (code);
+
+  /* Each lane-reducing operation has its own input vectype, which might be
+     different from that of the reduction PHI.  */
+  if (lane_reducing)
+    vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);

   if (slp_node)
     {
@@ -8597,9 +8699,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
       vec_num = 1;
     }

-  code_helper code = canonicalize_code (op.code, op.type);
   internal_fn cond_fn = get_conditional_internal_fn (code, op.type);
-
   vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
   vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
   bool mask_by_cond_expr = use_mask_by_cond_expr_p (code, cond_fn, vectype_in);
@@ -8644,7 +8744,6 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
     }

   bool single_defuse_cycle = STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info);
-  bool lane_reducing = lane_reducing_op_p (code);
   gcc_assert (single_defuse_cycle || lane_reducing);

   /* Create the destination vector  */
@@ -8691,6 +8790,73 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
                         reduc_index == 2 ? op.ops[2] : NULL_TREE,
                         &vec_oprnds[2]);
     }
+  else if (lane_reducing && (!slp_node || SLP_TREE_LANES (slp_node) == 1))
+    {
+      /* For a lane-reducing op covered by a single-lane slp node, the input
+        vectype of the reduction PHI determines the number of copies of the
+        vectorized def-use cycle, which might be more than the effective
+        copies of the vectorized lane-reducing statements.  The gap could be
+        filled by generating extra trivial pass-through copies.  For example:
+
+          int sum = 0;
+          for (i)
+            {
+              sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
+              sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
+              sum += n[i];               // normal <vector(4) int>
+            }
+
+        The vector size is 128-bit and the vectorization factor is 16.
+        Reduction statements would be transformed as:
+
+          vector<4> int sum_v0 = { 0, 0, 0, 0 };
+          vector<4> int sum_v1 = { 0, 0, 0, 0 };
+          vector<4> int sum_v2 = { 0, 0, 0, 0 };
+          vector<4> int sum_v3 = { 0, 0, 0, 0 };
+
+          for (i / 16)
+            {
+              sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
+              sum_v1 = sum_v1;  // copy
+              sum_v2 = sum_v2;  // copy
+              sum_v3 = sum_v3;  // copy
+
+              sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
+              sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
+              sum_v2 = sum_v2;  // copy
+              sum_v3 = sum_v3;  // copy
+
+              sum_v0 += n_v0[i: 0  ~ 3 ];
+              sum_v1 += n_v1[i: 4  ~ 7 ];
+              sum_v2 += n_v2[i: 8  ~ 11];
+              sum_v3 += n_v3[i: 12 ~ 15];
+            }
+       */
+      tree phi_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
+      unsigned all_ncopies = vect_get_num_copies (loop_vinfo, phi_vectype_in);
+      unsigned use_ncopies = vec_oprnds[0].length ();
+
+      if (use_ncopies < all_ncopies)
+       {
+         if (!slp_node)
+           {
+             tree reduc_oprnd = op.ops[reduc_index];
+
+             vec_oprnds[reduc_index].truncate (0);
+             vect_get_vec_defs_for_operand (loop_vinfo, stmt_info,
+                                            all_ncopies, reduc_oprnd,
+                                            &vec_oprnds[reduc_index]);
+           }
+         else
+           gcc_assert (all_ncopies == vec_oprnds[reduc_index].length ());
+
+         for (unsigned i = 0; i < op.num_ops - 1; i++)
+           {
+             gcc_assert (vec_oprnds[i].length () == use_ncopies);
+             vec_oprnds[i].safe_grow_cleared (all_ncopies);
+           }
+       }
+    }

   bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
   unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
@@ -8699,7 +8865,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
     {
       gimple *new_stmt;
       tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
-      if (masked_loop_p && !mask_by_cond_expr)
+
+      if (!vop[0] || !vop[1])
+       {
+         tree reduc_vop = vec_oprnds[reduc_index][i];
+
+         /* Insert a trivial copy if there is no need to generate a
+            vectorized statement.  */
+         gcc_assert (reduc_vop);
+
+         new_stmt = gimple_build_assign (vec_dest, reduc_vop);
+         new_temp = make_ssa_name (vec_dest, new_stmt);
+         gimple_set_lhs (new_stmt, new_temp);
+         vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
+       }
+      else if (masked_loop_p && !mask_by_cond_expr)
        {
          /* No conditional ifns have been defined for lane-reducing op
             yet.  */
@@ -8728,8 +8908,16 @@ vect_transform_reduction (loop_vec_info loop_vinfo,

          if (masked_loop_p && mask_by_cond_expr)
            {
+             unsigned nvectors = vec_num * ncopies;
+
+             /* For a single-lane slp node on a lane-reducing op, we need to
+                compute the exact number of vector stmts from its input
+                vectype, since the value obtained from the slp node is an
+                over-estimate.  */
+             if (lane_reducing && slp_node && SLP_TREE_LANES (slp_node) == 1)
+               nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
+
              tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks,
-                                             vec_num * ncopies, vectype_in, i);
+                                             nvectors, vectype_in, i);
              build_vect_cond_expr (code, vop, mask, gsi);
            }

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index ca6052662a3..1b73ef01ade 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -13350,6 +13350,8 @@ vect_analyze_stmt (vec_info *vinfo,
                                      NULL, NULL, node, cost_vec)
          || vectorizable_load (vinfo, stmt_info, NULL, NULL, node, cost_vec)
          || vectorizable_store (vinfo, stmt_info, NULL, NULL, node, cost_vec)
+         || vectorizable_lane_reducing (as_a <loop_vec_info> (vinfo),
+                                        stmt_info, node, cost_vec)
          || vectorizable_reduction (as_a <loop_vec_info> (vinfo), stmt_info,
                                     node, node_instance, cost_vec)
          || vectorizable_induction (as_a <loop_vec_info> (vinfo), stmt_info,
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 60224f4e284..94736736dcc 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2455,6 +2455,8 @@ extern loop_vec_info vect_create_loop_vinfo (class loop *, vec_info_shared *,
 extern bool vectorizable_live_operation (vec_info *, stmt_vec_info,
                                         slp_tree, slp_instance, int,
                                         bool, stmt_vector_for_cost *);
+extern bool vectorizable_lane_reducing (loop_vec_info, stmt_vec_info,
+                                       slp_tree, stmt_vector_for_cost *);
 extern bool vectorizable_reduction (loop_vec_info, stmt_vec_info,
                                    slp_tree, slp_instance,
                                    stmt_vector_for_cost *);
--
2.17.1
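
The effect of the vec_oprnds handling in the vect_transform_reduction hunk
above, growing the shorter operand vectors with safe_grow_cleared and later
turning the NULL slots into trivial copies, can be modeled by the following
minimal standalone C sketch (illustrative only, not part of the patch; the
vector names are hypothetical and mirror the SAD group of the example):

  #include <stdio.h>

  #define REDUC_NCOPIES 4  /* copies dictated by the PHI's input vectype */

  int
  main (void)
  {
    /* Operands 0/1 of a SAD op that only needs two effective copies;
       partial initialization zero-fills the tail with NULL, mimicking
       safe_grow_cleared.  Operand 2 is the reduction input, which always
       carries REDUC_NCOPIES defs.  */
    const char *op0[REDUC_NCOPIES] = { "s0_v0", "s0_v1" };
    const char *op1[REDUC_NCOPIES] = { "s1_v0", "s1_v1" };
    const char *sum[REDUC_NCOPIES] = { "sum_v0", "sum_v1",
				       "sum_v2", "sum_v3" };

    for (int i = 0; i < REDUC_NCOPIES; i++)
      {
	if (op0[i] && op1[i])
	  printf ("%s = SAD (%s, %s, %s);\n",
		  sum[i], op0[i], op1[i], sum[i]);
	else
	  /* NULL slot: no vectorized statement, emit a trivial copy.  */
	  printf ("%s = %s;  // copy\n", sum[i], sum[i]);
      }
    return 0;
  }

Running it prints the same shape as the SAD group in the transformed loop
of the commit message: two real SAD statements followed by two trivial
copies.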

________________________________________
From: Feng Xue OS <fxue@os.amperecomputing.com>
Sent: Sunday, June 16, 2024 3:31 PM
To: Richard Biener
Cc: gcc-patches@gcc.gnu.org
Subject: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]

For a lane-reducing operation (dot-prod/widen-sum/sad) in a loop reduction, the
current vectorizer can only handle the pattern if the reduction chain contains
no other operation, no matter whether that other is normal or lane-reducing.

Actually, to allow multiple arbitrary lane-reducing operations, we need to
support vectorization of a loop reduction chain with mixed input vectypes.
Since the number of lanes of a vectype may vary with the operation, the
effective ncopies of the vectorized statements may also differ from one
operation to another, which causes a mismatch between the vectorized def-use
cycles. A simple way is to align all operations with the one that has the
most ncopies; the gap can be filled by generating extra trivial pass-through
copies. For example:

   int sum = 0;
   for (i)
     {
       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
       sum += w[i];               // widen-sum <vector(16) char>
       sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
       sum += n[i];               // normal <vector(4) int>
     }

The vector size is 128-bit and the vectorization factor is 16. Reduction statements
would be transformed as:

   vector<4> int sum_v0 = { 0, 0, 0, 0 };
   vector<4> int sum_v1 = { 0, 0, 0, 0 };
   vector<4> int sum_v2 = { 0, 0, 0, 0 };
   vector<4> int sum_v3 = { 0, 0, 0, 0 };

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 += n_v0[i: 0  ~ 3 ];
       sum_v1 += n_v1[i: 4  ~ 7 ];
       sum_v2 += n_v2[i: 8  ~ 11];
       sum_v3 += n_v3[i: 12 ~ 15];
     }
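
To make the copy counts in this example concrete, here is a minimal
standalone C sketch (illustrative only, not part of the patch): each
operation needs VF / nunits vector statements, and the reduction PHI's
input vectype (vector(4) int) fixes the def-use cycle at VF / 4 = 4
copies, so the difference is filled with pass-through copies:

   #include <stdio.h>

   int
   main (void)
   {
     const int vf = 16;  /* vectorization factor */
     /* nunits of each operation's input vectype in the example.  */
     const char *ops[] = { "dot-prod", "widen-sum", "sad", "normal" };
     const int nunits[] = { 16, 16, 8, 4 };
     const int phi_ncopies = vf / 4;  /* reduction PHI uses vector(4) int */

     for (int i = 0; i < 4; i++)
       {
	 int ncopies = vf / nunits[i];  /* effective vector statements */
	 printf ("%-9s: %d vector stmt(s), %d pass-through copy(ies)\n",
		 ops[i], ncopies, phi_ncopies - ncopies);
       }
     return 0;
   }

This prints one statement and three copies for dot-prod and widen-sum, two
and two for sad, and four and none for the normal add, matching the
transformed loop above.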

Thanks,
Feng

---
gcc/
        PR tree-optimization/114440
        * tree-vectorizer.h (vectorizable_lane_reducing): New function
        declaration.
        * tree-vect-stmts.cc (vect_analyze_stmt): Call new function
        vectorizable_lane_reducing to analyze lane-reducing operation.
        * tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation
        code related to emulated_mixed_dot_prod.
        (vect_reduction_update_partial_vector_usage): Compute ncopies by the
        original means for a single-lane slp node.
        (vectorizable_lane_reducing): New function.
        (vectorizable_reduction): Allow multiple lane-reducing operations in
        loop reduction. Move some original lane-reducing related code to
        vectorizable_lane_reducing.
        (vect_transform_reduction): Extend transformation to support reduction
        statements with mixed input vectypes.

gcc/testsuite/
        PR tree-optimization/114440
        * gcc.dg/vect/vect-reduc-chain-1.c
        * gcc.dg/vect/vect-reduc-chain-2.c
        * gcc.dg/vect/vect-reduc-chain-3.c
        * gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
        * gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
        * gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
        * gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
        * gcc.dg/vect/vect-reduc-dot-slp-1.c
---
 .../gcc.dg/vect/vect-reduc-chain-1.c          |  62 ++++
 .../gcc.dg/vect/vect-reduc-chain-2.c          |  77 +++++
 .../gcc.dg/vect/vect-reduc-chain-3.c          |  66 ++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c  |  95 +++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c  |  67 ++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c  |  79 +++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c  |  63 ++++
 .../gcc.dg/vect/vect-reduc-dot-slp-1.c        |  35 ++
 gcc/tree-vect-loop.cc                         | 324 ++++++++++++++----
 gcc/tree-vect-stmts.cc                        |   2 +
 gcc/tree-vectorizer.h                         |   2 +
 11 files changed, 802 insertions(+), 70 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c

diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
new file mode 100644
index 00000000000..04bfc419dbd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
@@ -0,0 +1,62 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_2 char *restrict c,
+   SIGNEDNESS_2 char *restrict d,
+   SIGNEDNESS_1 int *restrict e)
+{
+  for (int i = 0; i < N; ++i)
+    {
+      res += a[i] * b[i];
+      res += c[i] * d[i];
+      res += e[i];
+    }
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[N], b[N];
+  SIGNEDNESS_2 char c[N], d[N];
+  SIGNEDNESS_1 int e[N];
+  int expected = 0x12345;
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      c[i] = BASE + i * 2;
+      d[i] = BASE + OFFSET + i * 3;
+      e[i] = i;
+      asm volatile ("" ::: "memory");
+      expected += a[i] * b[i];
+      expected += c[i] * d[i];
+      expected += e[i];
+    }
+  if (f (0x12345, a, b, c, d, e) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 2 "vect" { target vect_sdot_qi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
new file mode 100644
index 00000000000..6c803b80120
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
@@ -0,0 +1,77 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 unsigned
+#define SIGNEDNESS_3 signed
+#define SIGNEDNESS_4 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+fn (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_3 char *restrict c,
+   SIGNEDNESS_3 char *restrict d,
+   SIGNEDNESS_4 short *restrict e,
+   SIGNEDNESS_4 short *restrict f,
+   SIGNEDNESS_1 int *restrict g)
+{
+  for (int i = 0; i < N; ++i)
+    {
+      res += a[i] * b[i];
+      res += i + 1;
+      res += c[i] * d[i];
+      res += e[i] * f[i];
+      res += g[i];
+    }
+  return res;
+}
+
+#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -126 : 4)
+#define BASE4 ((SIGNEDNESS_4 int) -1 < 0 ? -1026 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[N], b[N];
+  SIGNEDNESS_3 char c[N], d[N];
+  SIGNEDNESS_4 short e[N], f[N];
+  SIGNEDNESS_1 int g[N];
+  int expected = 0x12345;
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE2 + i * 5;
+      b[i] = BASE2 + OFFSET + i * 4;
+      c[i] = BASE3 + i * 2;
+      d[i] = BASE3 + OFFSET + i * 3;
+      e[i] = BASE4 + i * 6;
+      f[i] = BASE4 + OFFSET + i * 5;
+      g[i] = i;
+      asm volatile ("" ::: "memory");
+      expected += a[i] * b[i];
+      expected += i + 1;
+      expected += c[i] * d[i];
+      expected += e[i] * f[i];
+      expected += g[i];
+    }
+  if (fn (0x12345, a, b, c, d, e, f, g) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_sdot_qi } } } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_udot_qi } } } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_sdot_hi } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
new file mode 100644
index 00000000000..a41e4b176c4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
@@ -0,0 +1,66 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 unsigned
+#define SIGNEDNESS_3 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_3 short *restrict c,
+   SIGNEDNESS_3 short *restrict d,
+   SIGNEDNESS_1 int *restrict e)
+{
+  for (int i = 0; i < N; ++i)
+    {
+      short diff = a[i] - b[i];
+      SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff;
+      res += abs;
+      res += c[i] * d[i];
+      res += e[i];
+    }
+  return res;
+}
+
+#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -1236 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[N], b[N];
+  SIGNEDNESS_3 short c[N], d[N];
+  SIGNEDNESS_1 int e[N];
+  int expected = 0x12345;
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE2 + i * 5;
+      b[i] = BASE2 - i * 4;
+      c[i] = BASE3 + i * 2;
+      d[i] = BASE3 + OFFSET + i * 3;
+      e[i] = i;
+      asm volatile ("" ::: "memory");
+      short diff = a[i] - b[i];
+      SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff;
+      expected += abs;
+      expected += c[i] * d[i];
+      expected += e[i];
+    }
+  if (f (0x12345, a, b, c, d, e) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = SAD_EXPR" "vect" { target vect_udot_qi } } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target vect_sdot_hi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
new file mode 100644
index 00000000000..c2831fbcc8e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
@@ -0,0 +1,95 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *a,
+   SIGNEDNESS_2 char *b,
+   int step, int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[0] * b[0];
+      res += a[1] * b[1];
+      res += a[2] * b[2];
+      res += a[3] * b[3];
+      res += a[4] * b[4];
+      res += a[5] * b[5];
+      res += a[6] * b[6];
+      res += a[7] * b[7];
+      res += a[8] * b[8];
+      res += a[9] * b[9];
+      res += a[10] * b[10];
+      res += a[11] * b[11];
+      res += a[12] * b[12];
+      res += a[13] * b[13];
+      res += a[14] * b[14];
+      res += a[15] * b[15];
+
+      a += step;
+      b += step;
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[100], b[100];
+  int expected = 0x12345;
+  int step = 16;
+  int n = 2;
+  int t = 0;
+
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      asm volatile ("" ::: "memory");
+    }
+
+  for (int i = 0; i < n; i++)
+    {
+      asm volatile ("" ::: "memory");
+      expected += a[t + 0] * b[t + 0];
+      expected += a[t + 1] * b[t + 1];
+      expected += a[t + 2] * b[t + 2];
+      expected += a[t + 3] * b[t + 3];
+      expected += a[t + 4] * b[t + 4];
+      expected += a[t + 5] * b[t + 5];
+      expected += a[t + 6] * b[t + 6];
+      expected += a[t + 7] * b[t + 7];
+      expected += a[t + 8] * b[t + 8];
+      expected += a[t + 9] * b[t + 9];
+      expected += a[t + 10] * b[t + 10];
+      expected += a[t + 11] * b[t + 11];
+      expected += a[t + 12] * b[t + 12];
+      expected += a[t + 13] * b[t + 13];
+      expected += a[t + 14] * b[t + 14];
+      expected += a[t + 15] * b[t + 15];
+      t += step;
+    }
+
+  if (f (0x12345, a, b, step, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 16 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
new file mode 100644
index 00000000000..4114264a364
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
@@ -0,0 +1,67 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *a,
+   SIGNEDNESS_2 char *b,
+   int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[5 * i + 0] * b[5 * i + 0];
+      res += a[5 * i + 1] * b[5 * i + 1];
+      res += a[5 * i + 2] * b[5 * i + 2];
+      res += a[5 * i + 3] * b[5 * i + 3];
+      res += a[5 * i + 4] * b[5 * i + 4];
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[100], b[100];
+  int expected = 0x12345;
+  int n = 18;
+
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      asm volatile ("" ::: "memory");
+    }
+
+  for (int i = 0; i < n; i++)
+    {
+      asm volatile ("" ::: "memory");
+      expected += a[5 * i + 0] * b[5 * i + 0];
+      expected += a[5 * i + 1] * b[5 * i + 1];
+      expected += a[5 * i + 2] * b[5 * i + 2];
+      expected += a[5 * i + 3] * b[5 * i + 3];
+      expected += a[5 * i + 4] * b[5 * i + 4];
+    }
+
+  if (f (0x12345, a, b, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 5 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
new file mode 100644
index 00000000000..2cdecc36d16
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
@@ -0,0 +1,79 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 short *a,
+   SIGNEDNESS_2 short *b,
+   int step, int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[0] * b[0];
+      res += a[1] * b[1];
+      res += a[2] * b[2];
+      res += a[3] * b[3];
+      res += a[4] * b[4];
+      res += a[5] * b[5];
+      res += a[6] * b[6];
+      res += a[7] * b[7];
+
+      a += step;
+      b += step;
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 short a[100], b[100];
+  int expected = 0x12345;
+  int step = 8;
+  int n = 2;
+  int t = 0;
+
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      asm volatile ("" ::: "memory");
+    }
+
+  for (int i = 0; i < n; i++)
+    {
+      asm volatile ("" ::: "memory");
+      expected += a[t + 0] * b[t + 0];
+      expected += a[t + 1] * b[t + 1];
+      expected += a[t + 2] * b[t + 2];
+      expected += a[t + 3] * b[t + 3];
+      expected += a[t + 4] * b[t + 4];
+      expected += a[t + 5] * b[t + 5];
+      expected += a[t + 6] * b[t + 6];
+      expected += a[t + 7] * b[t + 7];
+      t += step;
+    }
+
+  if (f (0x12345, a, b, step, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 8 "vect"  { target vect_sdot_hi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
new file mode 100644
index 00000000000..32c0f30c77b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
@@ -0,0 +1,63 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 short *a,
+   SIGNEDNESS_2 short *b,
+   int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[3 * i + 0] * b[3 * i + 0];
+      res += a[3 * i + 1] * b[3 * i + 1];
+      res += a[3 * i + 2] * b[3 * i + 2];
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 short a[100], b[100];
+  int expected = 0x12345;
+  int n = 18;
+
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      asm volatile ("" ::: "memory");
+    }
+
+  for (int i = 0; i < n; i++)
+    {
+      asm volatile ("" ::: "memory");
+      expected += a[3 * i + 0] * b[3 * i + 0];
+      expected += a[3 * i + 1] * b[3 * i + 1];
+      expected += a[3 * i + 2] * b[3 * i + 2];
+    }
+
+  if (f (0x12345, a, b, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 3 "vect"  { target vect_sdot_hi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
new file mode 100644
index 00000000000..e17d6291f75
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
@@ -0,0 +1,35 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-do compile } */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res0,
+   SIGNEDNESS_1 int res1,
+   SIGNEDNESS_1 int res2,
+   SIGNEDNESS_1 int res3,
+   SIGNEDNESS_2 short *a,
+   SIGNEDNESS_2 short *b)
+{
+  for (int i = 0; i < 64; i += 4)
+    {
+      res0 += a[i + 0] * b[i + 0];
+      res1 += a[i + 1] * b[i + 1];
+      res2 += a[i + 2] * b[i + 2];
+      res3 += a[i + 3] * b[i + 3];
+    }
+
+  return res0 ^ res1 ^ res2 ^ res3;
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump-not "vectorizing stmts using SLP" "vect" } } */
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index e0561feddce..6d91665a341 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -5324,8 +5324,6 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
   if (!gimple_extract_op (orig_stmt_info->stmt, &op))
     gcc_unreachable ();

-  bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
-
   if (reduction_type == EXTRACT_LAST_REDUCTION)
     /* No extra instructions are needed in the prologue.  The loop body
        operations are costed in vectorizable_condition.  */
@@ -5360,12 +5358,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
           initial result of the data reduction, initial value of the index
           reduction.  */
        prologue_stmts = 4;
-      else if (emulated_mixed_dot_prod)
-       /* We need the initial reduction value and two invariants:
-          one that contains the minimum signed value and one that
-          contains half of its negative.  */
-       prologue_stmts = 3;
       else
+       /* We need the initial reduction value.  */
        prologue_stmts = 1;
       prologue_cost += record_stmt_cost (cost_vec, prologue_stmts,
                                         scalar_to_vec, stmt_info, 0,
@@ -7466,7 +7460,7 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
       vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
       unsigned nvectors;

-      if (slp_node)
+      if (slp_node && SLP_TREE_LANES (slp_node) > 1)
        nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
       else
        nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
@@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
     }
 }

+/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
+   the context of LOOP_VINFO; the vector cost will be recorded in COST_VEC.
+   There are currently three such kinds of operations: dot-prod/widen-sum/sad
+   (sum-of-absolute-differences).
+
+   For a lane-reducing operation, the loop reduction path that it lies in
+   may contain a normal operation, or another lane-reducing operation with a
+   different input type size, as in the following example:
+
+     int sum = 0;
+     for (i)
+       {
+         ...
+         sum += d0[i] * d1[i];       // dot-prod <vector(16) char>
+         sum += w[i];                // widen-sum <vector(16) char>
+         sum += abs(s0[i] - s1[i]);  // sad <vector(8) short>
+         sum += n[i];                // normal <vector(4) int>
+         ...
+       }
+
+   The vectorization factor is essentially determined by the operation whose
+   input vectype has the most lanes ("vector(16) char" in the example), while
+   we need to choose the input vectype with the least lanes ("vector(4) int"
+   in the example) for the reduction PHI statement.  */
+
+bool
+vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
+                           slp_tree slp_node, stmt_vector_for_cost *cost_vec)
+{
+  gimple *stmt = stmt_info->stmt;
+
+  if (!lane_reducing_stmt_p (stmt))
+    return false;
+
+  tree type = TREE_TYPE (gimple_assign_lhs (stmt));
+
+  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
+    return false;
+
+  /* Do not try to vectorize bit-precision reductions.  */
+  if (!type_has_mode_precision_p (type))
+    return false;
+
+  if (!slp_node)
+    return false;
+
+  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
+    {
+      stmt_vec_info def_stmt_info;
+      slp_tree slp_op;
+      tree op;
+      tree vectype;
+      enum vect_def_type dt;
+
+      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
+                              &slp_op, &dt, &vectype, &def_stmt_info))
+       {
+         if (dump_enabled_p ())
+           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                            "use not simple.\n");
+         return false;
+       }
+
+      if (!vectype)
+       {
+         vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op),
+                                                slp_op);
+         if (!vectype)
+           return false;
+       }
+
+      if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))
+       {
+         if (dump_enabled_p ())
+           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                            "incompatible vector types for invariants\n");
+         return false;
+       }
+
+      if (i == STMT_VINFO_REDUC_IDX (stmt_info))
+       continue;
+
+      /* There should be at most one cycle def in the stmt.  */
+      if (VECTORIZABLE_CYCLE_DEF (dt))
+       return false;
+    }
+
+  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
+
+  /* TODO: Support lane-reducing operations that do not directly participate
+     in loop reduction.  */
+  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
+    return false;
+
+  /* A lane-reducing pattern inside any inner loop of LOOP_VINFO is not
+     recognized.  */
+  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
+  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
+
+  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
+  int ncopies_for_cost;
+
+  if (SLP_TREE_LANES (slp_node) > 1)
+    {
+      /* For now, lane-reducing operations in a non-single-lane slp node
+        should only come from the same loop reduction path.  */
+      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
+      ncopies_for_cost = 1;
+    }
+  else
+    {
+      ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);
+      gcc_assert (ncopies_for_cost >= 1);
+    }
+
+  if (vect_is_emulated_mixed_dot_prod (stmt_info))
+    {
+      /* We need two extra invariants: one that contains the minimum signed
+        value and one that contains half of its negative.  */
+      int prologue_stmts = 2;
+      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
+                                       scalar_to_vec, stmt_info, 0,
+                                       vect_prologue);
+      if (dump_enabled_p ())
+       dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
+                    "extra prologue_cost = %d .\n", cost);
+
+      /* Three dot-products and a subtraction.  */
+      ncopies_for_cost *= 4;
+    }
+
+  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
+                   vect_body);
+
+  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+    {
+      enum tree_code code = gimple_assign_rhs_code (stmt);
+      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
+                                                 slp_node, code, type,
+                                                 vectype_in);
+    }
+
+  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
+  return true;
+}
+
 /* Function vectorizable_reduction.

    Check if STMT_INFO performs a reduction operation that can be vectorized.
@@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   if (!type_has_mode_precision_p (op.type))
     return false;

-  /* For lane-reducing ops we're reducing the number of reduction PHIs
-     which means the only use of that may be in the lane-reducing operation.  */
-  if (lane_reducing
-      && reduc_chain_length != 1
-      && !only_slp_reduc_chain)
-    {
-      if (dump_enabled_p ())
-       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                        "lane-reducing reduction with extra stmts.\n");
-      return false;
-    }
-
   /* Lane-reducing ops also never can be used in a SLP reduction group
      since we'll mix lanes belonging to different reductions.  But it's
      OK to use them in a reduction chain or when the reduction group
@@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
       && loop_vinfo->suggested_unroll_factor == 1)
     single_defuse_cycle = true;

-  if (single_defuse_cycle || lane_reducing)
+  if (single_defuse_cycle && !lane_reducing)
     {
       gcc_assert (op.code != COND_EXPR);

-      /* 4. Supportable by target?  */
-      bool ok = true;
-
-      /* 4.1. check support for the operation in the loop
+      /* 4. check support for the operation in the loop

         This isn't necessary for the lane reduction codes, since they
         can only be produced by pattern matching, and it's up to the
@@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
         mixed-sign dot-products can be implemented using signed
         dot-products.  */
       machine_mode vec_mode = TYPE_MODE (vectype_in);
-      if (!lane_reducing
-         && !directly_supported_p (op.code, vectype_in, optab_vector))
+      if (!directly_supported_p (op.code, vectype_in, optab_vector))
         {
           if (dump_enabled_p ())
             dump_printf (MSG_NOTE, "op not supported by target.\n");
          if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
              || !vect_can_vectorize_without_simd_p (op.code))
-           ok = false;
+           single_defuse_cycle = false;
          else
            if (dump_enabled_p ())
              dump_printf (MSG_NOTE, "proceeding using word mode.\n");
@@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
            dump_printf (MSG_NOTE, "using word mode not possible.\n");
          return false;
        }
-
-      /* lane-reducing operations have to go through vect_transform_reduction.
-         For the other cases try without the single cycle optimization.  */
-      if (!ok)
-       {
-         if (lane_reducing)
-           return false;
-         else
-           single_defuse_cycle = false;
-       }
     }
   if (dump_enabled_p () && single_defuse_cycle)
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
                     "multiple vectors to one in the loop body\n");
   STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;

-  /* If the reduction stmt is one of the patterns that have lane
-     reduction embedded we cannot handle the case of ! single_defuse_cycle.  */
-  if ((ncopies > 1 && ! single_defuse_cycle)
-      && lane_reducing)
-    {
-      if (dump_enabled_p ())
-       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                        "multi def-use cycle not possible for lane-reducing "
-                        "reduction operation\n");
-      return false;
-    }
+  /* For a lane-reducing operation, the processing below related to the
+     single defuse-cycle will be done in its own vectorizable function.
+     One more thing to note is that the operation must not be involved in
+     a fold-left reduction.  */
+  single_defuse_cycle &= !lane_reducing;

   if (slp_node
-      && !(!single_defuse_cycle
-          && !lane_reducing
-          && reduction_type != FOLD_LEFT_REDUCTION))
+      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
     for (i = 0; i < (int) op.num_ops; i++)
       if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
        {
@@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
                             reduction_type, ncopies, cost_vec);
   /* Cost the reduction op inside the loop if transformed via
-     vect_transform_reduction.  Otherwise this is costed by the
-     separate vectorizable_* routines.  */
-  if (single_defuse_cycle || lane_reducing)
-    {
-      int factor = 1;
-      if (vect_is_emulated_mixed_dot_prod (stmt_info))
-       /* Three dot-products and a subtraction.  */
-       factor = 4;
-      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
-                       stmt_info, 0, vect_body);
-    }
+     vect_transform_reduction for a non-lane-reducing operation.  Otherwise
+     this is costed by the separate vectorizable_* routines.  */
+  if (single_defuse_cycle)
+    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);

   if (dump_enabled_p ()
       && reduction_type == FOLD_LEFT_REDUCTION)
     dump_printf_loc (MSG_NOTE, vect_location,
                     "using an in-order (fold-left) reduction.\n");
   STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
-  /* All but single defuse-cycle optimized, lane-reducing and fold-left
-     reductions go through their own vectorizable_* routines.  */
-  if (!single_defuse_cycle
-      && !lane_reducing
-      && reduction_type != FOLD_LEFT_REDUCTION)
+
+  /* All but single defuse-cycle optimized and fold-left reductions go
+     through their own vectorizable_* routines.  */
+  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
     {
       stmt_vec_info tem
        = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
@@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   bool lane_reducing = lane_reducing_op_p (code);
   gcc_assert (single_defuse_cycle || lane_reducing);

+  if (lane_reducing)
+    {
+      /* The last operand of lane-reducing op is for reduction.  */
+      gcc_assert (reduc_index == (int) op.num_ops - 1);
+
+      /* Now all lane-reducing ops are covered by some slp node.  */
+      gcc_assert (slp_node);
+    }
+
   /* Create the destination vector  */
   tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
   tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
@@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
                         reduc_index == 2 ? op.ops[2] : NULL_TREE,
                         &vec_oprnds[2]);
     }
+  else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1
+          && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ())
+    {
+      /* For a lane-reducing op covered by a single-lane slp node, the input
+        vectype of the reduction PHI determines the number of copies of the
+        vectorized def-use cycle, which might be more than the effective
+        copies of the vectorized lane-reducing statements.  The gap could be
+        filled by generating extra trivial pass-through copies.  For example:
+
+          int sum = 0;
+          for (i)
+            {
+              sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
+              sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
+              sum += n[i];               // normal <vector(4) int>
+            }
+
+        The vector size is 128-bit and the vectorization factor is 16.
+        Reduction statements would be transformed as:
+
+          vector<4> int sum_v0 = { 0, 0, 0, 0 };
+          vector<4> int sum_v1 = { 0, 0, 0, 0 };
+          vector<4> int sum_v2 = { 0, 0, 0, 0 };
+          vector<4> int sum_v3 = { 0, 0, 0, 0 };
+
+          for (i / 16)
+            {
+              sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
+              sum_v1 = sum_v1;  // copy
+              sum_v2 = sum_v2;  // copy
+              sum_v3 = sum_v3;  // copy
+
+              sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
+              sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
+              sum_v2 = sum_v2;  // copy
+              sum_v3 = sum_v3;  // copy
+
+              sum_v0 += n_v0[i: 0  ~ 3 ];
+              sum_v1 += n_v1[i: 4  ~ 7 ];
+              sum_v2 += n_v2[i: 8  ~ 11];
+              sum_v3 += n_v3[i: 12 ~ 15];
+            }
+       */
+      unsigned using_ncopies = vec_oprnds[0].length ();
+      unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
+
+      for (unsigned i = 0; i < op.num_ops - 1; i++)
+       {
+         gcc_assert (vec_oprnds[i].length () == using_ncopies);
+         vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
+       }
+    }

   bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
   unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
@@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
     {
       gimple *new_stmt;
       tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
-      if (masked_loop_p && !mask_by_cond_expr)
+
+      if (!vop[0] || !vop[1])
+       {
+         tree reduc_vop = vec_oprnds[reduc_index][i];
+
+         /* Insert a trivial copy if there is no need to generate a
+            vectorized statement.  */
+         gcc_assert (reduc_vop);
+
+         new_stmt = gimple_build_assign (vec_dest, reduc_vop);
+         new_temp = make_ssa_name (vec_dest, new_stmt);
+         gimple_set_lhs (new_stmt, new_temp);
+         vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
+       }
+      else if (masked_loop_p && !mask_by_cond_expr)
        {
          /* No conditional ifns have been defined for lane-reducing op
             yet.  */
@@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo,

          if (masked_loop_p && mask_by_cond_expr)
            {
+             tree stmt_vectype_in = vectype_in;
+             unsigned nvectors = vec_num * ncopies;
+
+             if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
+               {
+                 /* Input vectype of the reduction PHI may be different from
+                    that of the lane-reducing operation.  */
+                 stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
+                 nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);
+               }
+
              tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks,
-                                             vec_num * ncopies, vectype_in, i);
+                                             nvectors, stmt_vectype_in, i);
              build_vect_cond_expr (code, vop, mask, gsi);
            }

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index ca6052662a3..1b73ef01ade 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -13350,6 +13350,8 @@ vect_analyze_stmt (vec_info *vinfo,
                                      NULL, NULL, node, cost_vec)
          || vectorizable_load (vinfo, stmt_info, NULL, NULL, node, cost_vec)
          || vectorizable_store (vinfo, stmt_info, NULL, NULL, node, cost_vec)
+         || vectorizable_lane_reducing (as_a <loop_vec_info> (vinfo),
+                                        stmt_info, node, cost_vec)
          || vectorizable_reduction (as_a <loop_vec_info> (vinfo), stmt_info,
                                     node, node_instance, cost_vec)
          || vectorizable_induction (as_a <loop_vec_info> (vinfo), stmt_info,
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 60224f4e284..94736736dcc 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2455,6 +2455,8 @@ extern loop_vec_info vect_create_loop_vinfo (class loop *, vec_info_shared *,
 extern bool vectorizable_live_operation (vec_info *, stmt_vec_info,
                                         slp_tree, slp_instance, int,
                                         bool, stmt_vector_for_cost *);
+extern bool vectorizable_lane_reducing (loop_vec_info, stmt_vec_info,
+                                       slp_tree, stmt_vector_for_cost *);
 extern bool vectorizable_reduction (loop_vec_info, stmt_vec_info,
                                    slp_tree, slp_instance,
                                    stmt_vector_for_cost *);
--
2.17.1


* Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
  2024-06-16  7:31 [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440] Feng Xue OS
  2024-06-20  5:59 ` Feng Xue OS
@ 2024-06-20 12:26 ` Richard Biener
  2024-06-23 15:10   ` Feng Xue OS
  1 sibling, 1 reply; 9+ messages in thread
From: Richard Biener @ 2024-06-20 12:26 UTC (permalink / raw)
  To: Feng Xue OS; +Cc: gcc-patches

On Sun, Jun 16, 2024 at 9:31 AM Feng Xue OS <fxue@os.amperecomputing.com> wrote:
>
> For a lane-reducing operation (dot-prod/widen-sum/sad) in a loop reduction, the
> current vectorizer can only handle the pattern if the reduction chain contains
> no other operation, no matter whether that other is normal or lane-reducing.
>
> Actually, to allow multiple arbitrary lane-reducing operations, we need to
> support vectorization of a loop reduction chain with mixed input vectypes.
> Since the number of lanes of a vectype may vary with the operation, the
> effective ncopies of the vectorized statements may also differ from one
> operation to another, which causes a mismatch between the vectorized def-use
> cycles. A simple way is to align all operations with the one that has the
> most ncopies; the gap can be filled by generating extra trivial pass-through
> copies. For example:
>
>    int sum = 0;
>    for (i)
>      {
>        sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
>        sum += w[i];               // widen-sum <vector(16) char>
>        sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
>        sum += n[i];               // normal <vector(4) int>
>      }
>
> The vector size is 128-bit and the vectorization factor is 16. Reduction statements
> would be transformed as:
>
>    vector<4> int sum_v0 = { 0, 0, 0, 0 };
>    vector<4> int sum_v1 = { 0, 0, 0, 0 };
>    vector<4> int sum_v2 = { 0, 0, 0, 0 };
>    vector<4> int sum_v3 = { 0, 0, 0, 0 };
>
>    for (i / 16)
>      {
>        sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
>        sum_v1 = sum_v1;  // copy
>        sum_v2 = sum_v2;  // copy
>        sum_v3 = sum_v3;  // copy
>
>        sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
>        sum_v1 = sum_v1;  // copy
>        sum_v2 = sum_v2;  // copy
>        sum_v3 = sum_v3;  // copy
>
>        sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
>        sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
>        sum_v2 = sum_v2;  // copy
>        sum_v3 = sum_v3;  // copy
>
>        sum_v0 += n_v0[i: 0  ~ 3 ];
>        sum_v1 += n_v1[i: 4  ~ 7 ];
>        sum_v2 += n_v2[i: 8  ~ 11];
>        sum_v3 += n_v3[i: 12 ~ 15];
>      }
>
> Thanks,
> Feng
>
> ---
> gcc/
>         PR tree-optimization/114440
>         * tree-vectorizer.h (vectorizable_lane_reducing): New function
>         declaration.
>         * tree-vect-stmts.cc (vect_analyze_stmt): Call new function
>         vectorizable_lane_reducing to analyze lane-reducing operation.
>         * tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation
>         code related to emulated_mixed_dot_prod.
>         (vect_reduction_update_partial_vector_usage): Compute ncopies by the
>         original means for a single-lane slp node.
>         (vectorizable_lane_reducing): New function.
>         (vectorizable_reduction): Allow multiple lane-reducing operations in
>         loop reduction. Move some original lane-reducing related code to
>         vectorizable_lane_reducing.
>         (vect_transform_reduction): Extend transformation to support reduction
>         statements with mixed input vectypes.
>
> gcc/testsuite/
>         PR tree-optimization/114440
>         * gcc.dg/vect/vect-reduc-chain-1.c
>         * gcc.dg/vect/vect-reduc-chain-2.c
>         * gcc.dg/vect/vect-reduc-chain-3.c
>         * gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
>         * gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
>         * gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
>         * gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
>         * gcc.dg/vect/vect-reduc-dot-slp-1.c
> ---
>  .../gcc.dg/vect/vect-reduc-chain-1.c          |  62 ++++
>  .../gcc.dg/vect/vect-reduc-chain-2.c          |  77 +++++
>  .../gcc.dg/vect/vect-reduc-chain-3.c          |  66 ++++
>  .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c  |  95 +++++
>  .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c  |  67 ++++
>  .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c  |  79 +++++
>  .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c  |  63 ++++
>  .../gcc.dg/vect/vect-reduc-dot-slp-1.c        |  35 ++
>  gcc/tree-vect-loop.cc                         | 324 ++++++++++++++----
>  gcc/tree-vect-stmts.cc                        |   2 +
>  gcc/tree-vectorizer.h                         |   2 +
>  11 files changed, 802 insertions(+), 70 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
> new file mode 100644
> index 00000000000..04bfc419dbd
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
> @@ -0,0 +1,62 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> +
> +#include "tree-vect.h"
> +
> +#define N 50
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +f (SIGNEDNESS_1 int res,
> +   SIGNEDNESS_2 char *restrict a,
> +   SIGNEDNESS_2 char *restrict b,
> +   SIGNEDNESS_2 char *restrict c,
> +   SIGNEDNESS_2 char *restrict d,
> +   SIGNEDNESS_1 int *restrict e)
> +{
> +  for (int i = 0; i < N; ++i)
> +    {
> +      res += a[i] * b[i];
> +      res += c[i] * d[i];
> +      res += e[i];
> +    }
> +  return res;
> +}
> +
> +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
> +#define OFFSET 20
> +
> +int
> +main (void)
> +{
> +  check_vect ();
> +
> +  SIGNEDNESS_2 char a[N], b[N];
> +  SIGNEDNESS_2 char c[N], d[N];
> +  SIGNEDNESS_1 int e[N];
> +  int expected = 0x12345;
> +  for (int i = 0; i < N; ++i)
> +    {
> +      a[i] = BASE + i * 5;
> +      b[i] = BASE + OFFSET + i * 4;
> +      c[i] = BASE + i * 2;
> +      d[i] = BASE + OFFSET + i * 3;
> +      e[i] = i;
> +      asm volatile ("" ::: "memory");
> +      expected += a[i] * b[i];
> +      expected += c[i] * d[i];
> +      expected += e[i];
> +    }
> +  if (f (0x12345, a, b, c, d, e) != expected)
> +    __builtin_abort ();
> +}
> +
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 2 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
> new file mode 100644
> index 00000000000..6c803b80120
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
> @@ -0,0 +1,77 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> +
> +#include "tree-vect.h"
> +
> +#define N 50
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 unsigned
> +#define SIGNEDNESS_3 signed
> +#define SIGNEDNESS_4 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +fn (SIGNEDNESS_1 int res,
> +   SIGNEDNESS_2 char *restrict a,
> +   SIGNEDNESS_2 char *restrict b,
> +   SIGNEDNESS_3 char *restrict c,
> +   SIGNEDNESS_3 char *restrict d,
> +   SIGNEDNESS_4 short *restrict e,
> +   SIGNEDNESS_4 short *restrict f,
> +   SIGNEDNESS_1 int *restrict g)
> +{
> +  for (int i = 0; i < N; ++i)
> +    {
> +      res += a[i] * b[i];
> +      res += i + 1;
> +      res += c[i] * d[i];
> +      res += e[i] * f[i];
> +      res += g[i];
> +    }
> +  return res;
> +}
> +
> +#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
> +#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -126 : 4)
> +#define BASE4 ((SIGNEDNESS_4 int) -1 < 0 ? -1026 : 373)
> +#define OFFSET 20
> +
> +int
> +main (void)
> +{
> +  check_vect ();
> +
> +  SIGNEDNESS_2 char a[N], b[N];
> +  SIGNEDNESS_3 char c[N], d[N];
> +  SIGNEDNESS_4 short e[N], f[N];
> +  SIGNEDNESS_1 int g[N];
> +  int expected = 0x12345;
> +  for (int i = 0; i < N; ++i)
> +    {
> +      a[i] = BASE2 + i * 5;
> +      b[i] = BASE2 + OFFSET + i * 4;
> +      c[i] = BASE3 + i * 2;
> +      d[i] = BASE3 + OFFSET + i * 3;
> +      e[i] = BASE4 + i * 6;
> +      f[i] = BASE4 + OFFSET + i * 5;
> +      g[i] = i;
> +      asm volatile ("" ::: "memory");
> +      expected += a[i] * b[i];
> +      expected += i + 1;
> +      expected += c[i] * d[i];
> +      expected += e[i] * f[i];
> +      expected += g[i];
> +    }
> +  if (fn (0x12345, a, b, c, d, e, f, g) != expected)
> +    __builtin_abort ();
> +}
> +
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_sdot_qi } } } } */
> +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_udot_qi } } } } */
> +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_sdot_hi } } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
> new file mode 100644
> index 00000000000..a41e4b176c4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
> @@ -0,0 +1,66 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> +/* { dg-require-effective-target vect_int } */
> +
> +#include "tree-vect.h"
> +
> +#define N 50
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 unsigned
> +#define SIGNEDNESS_3 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +f (SIGNEDNESS_1 int res,
> +   SIGNEDNESS_2 char *restrict a,
> +   SIGNEDNESS_2 char *restrict b,
> +   SIGNEDNESS_3 short *restrict c,
> +   SIGNEDNESS_3 short *restrict d,
> +   SIGNEDNESS_1 int *restrict e)
> +{
> +  for (int i = 0; i < N; ++i)
> +    {
> +      short diff = a[i] - b[i];
> +      SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff;
> +      res += abs;
> +      res += c[i] * d[i];
> +      res += e[i];
> +    }
> +  return res;
> +}
> +
> +#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
> +#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -1236 : 373)
> +#define OFFSET 20
> +
> +int
> +main (void)
> +{
> +  check_vect ();
> +
> +  SIGNEDNESS_2 char a[N], b[N];
> +  SIGNEDNESS_3 short c[N], d[N];
> +  SIGNEDNESS_1 int e[N];
> +  int expected = 0x12345;
> +  for (int i = 0; i < N; ++i)
> +    {
> +      a[i] = BASE2 + i * 5;
> +      b[i] = BASE2 - i * 4;
> +      c[i] = BASE3 + i * 2;
> +      d[i] = BASE3 + OFFSET + i * 3;
> +      e[i] = i;
> +      asm volatile ("" ::: "memory");
> +      short diff = a[i] - b[i];
> +      SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff;
> +      expected += abs;
> +      expected += c[i] * d[i];
> +      expected += e[i];
> +    }
> +  if (f (0x12345, a, b, c, d, e) != expected)
> +    __builtin_abort ();
> +}
> +
> +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = SAD_EXPR" "vect" { target vect_udot_qi } } } */
> +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target vect_sdot_hi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
> new file mode 100644
> index 00000000000..c2831fbcc8e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
> @@ -0,0 +1,95 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> +
> +#include "tree-vect.h"
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +f (SIGNEDNESS_1 int res,
> +   SIGNEDNESS_2 char *a,
> +   SIGNEDNESS_2 char *b,
> +   int step, int n)
> +{
> +  for (int i = 0; i < n; i++)
> +    {
> +      res += a[0] * b[0];
> +      res += a[1] * b[1];
> +      res += a[2] * b[2];
> +      res += a[3] * b[3];
> +      res += a[4] * b[4];
> +      res += a[5] * b[5];
> +      res += a[6] * b[6];
> +      res += a[7] * b[7];
> +      res += a[8] * b[8];
> +      res += a[9] * b[9];
> +      res += a[10] * b[10];
> +      res += a[11] * b[11];
> +      res += a[12] * b[12];
> +      res += a[13] * b[13];
> +      res += a[14] * b[14];
> +      res += a[15] * b[15];
> +
> +      a += step;
> +      b += step;
> +    }
> +
> +  return res;
> +}
> +
> +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
> +#define OFFSET 20
> +
> +int
> +main (void)
> +{
> +  check_vect ();
> +
> +  SIGNEDNESS_2 char a[100], b[100];
> +  int expected = 0x12345;
> +  int step = 16;
> +  int n = 2;
> +  int t = 0;
> +
> +  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
> +    {
> +      a[i] = BASE + i * 5;
> +      b[i] = BASE + OFFSET + i * 4;
> +      asm volatile ("" ::: "memory");
> +    }
> +
> +  for (int i = 0; i < n; i++)
> +    {
> +      asm volatile ("" ::: "memory");
> +      expected += a[t + 0] * b[t + 0];
> +      expected += a[t + 1] * b[t + 1];
> +      expected += a[t + 2] * b[t + 2];
> +      expected += a[t + 3] * b[t + 3];
> +      expected += a[t + 4] * b[t + 4];
> +      expected += a[t + 5] * b[t + 5];
> +      expected += a[t + 6] * b[t + 6];
> +      expected += a[t + 7] * b[t + 7];
> +      expected += a[t + 8] * b[t + 8];
> +      expected += a[t + 9] * b[t + 9];
> +      expected += a[t + 10] * b[t + 10];
> +      expected += a[t + 11] * b[t + 11];
> +      expected += a[t + 12] * b[t + 12];
> +      expected += a[t + 13] * b[t + 13];
> +      expected += a[t + 14] * b[t + 14];
> +      expected += a[t + 15] * b[t + 15];
> +      t += step;
> +    }
> +
> +  if (f (0x12345, a, b, step, n) != expected)
> +    __builtin_abort ();
> +}
> +
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 16 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
> new file mode 100644
> index 00000000000..4114264a364
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
> @@ -0,0 +1,67 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> +
> +#include "tree-vect.h"
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +f (SIGNEDNESS_1 int res,
> +   SIGNEDNESS_2 char *a,
> +   SIGNEDNESS_2 char *b,
> +   int n)
> +{
> +  for (int i = 0; i < n; i++)
> +    {
> +      res += a[5 * i + 0] * b[5 * i + 0];
> +      res += a[5 * i + 1] * b[5 * i + 1];
> +      res += a[5 * i + 2] * b[5 * i + 2];
> +      res += a[5 * i + 3] * b[5 * i + 3];
> +      res += a[5 * i + 4] * b[5 * i + 4];
> +    }
> +
> +  return res;
> +}
> +
> +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
> +#define OFFSET 20
> +
> +int
> +main (void)
> +{
> +  check_vect ();
> +
> +  SIGNEDNESS_2 char a[100], b[100];
> +  int expected = 0x12345;
> +  int n = 18;
> +
> +  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
> +    {
> +      a[i] = BASE + i * 5;
> +      b[i] = BASE + OFFSET + i * 4;
> +      asm volatile ("" ::: "memory");
> +    }
> +
> +  for (int i = 0; i < n; i++)
> +    {
> +      asm volatile ("" ::: "memory");
> +      expected += a[5 * i + 0] * b[5 * i + 0];
> +      expected += a[5 * i + 1] * b[5 * i + 1];
> +      expected += a[5 * i + 2] * b[5 * i + 2];
> +      expected += a[5 * i + 3] * b[5 * i + 3];
> +      expected += a[5 * i + 4] * b[5 * i + 4];
> +    }
> +
> +  if (f (0x12345, a, b, n) != expected)
> +    __builtin_abort ();
> +}
> +
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 5 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
> new file mode 100644
> index 00000000000..2cdecc36d16
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
> @@ -0,0 +1,79 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> +
> +#include "tree-vect.h"
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +f (SIGNEDNESS_1 int res,
> +   SIGNEDNESS_2 short *a,
> +   SIGNEDNESS_2 short *b,
> +   int step, int n)
> +{
> +  for (int i = 0; i < n; i++)
> +    {
> +      res += a[0] * b[0];
> +      res += a[1] * b[1];
> +      res += a[2] * b[2];
> +      res += a[3] * b[3];
> +      res += a[4] * b[4];
> +      res += a[5] * b[5];
> +      res += a[6] * b[6];
> +      res += a[7] * b[7];
> +
> +      a += step;
> +      b += step;
> +    }
> +
> +  return res;
> +}
> +
> +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)
> +#define OFFSET 20
> +
> +int
> +main (void)
> +{
> +  check_vect ();
> +
> +  SIGNEDNESS_2 short a[100], b[100];
> +  int expected = 0x12345;
> +  int step = 8;
> +  int n = 2;
> +  int t = 0;
> +
> +  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
> +    {
> +      a[i] = BASE + i * 5;
> +      b[i] = BASE + OFFSET + i * 4;
> +      asm volatile ("" ::: "memory");
> +    }
> +
> +  for (int i = 0; i < n; i++)
> +    {
> +      asm volatile ("" ::: "memory");
> +      expected += a[t + 0] * b[t + 0];
> +      expected += a[t + 1] * b[t + 1];
> +      expected += a[t + 2] * b[t + 2];
> +      expected += a[t + 3] * b[t + 3];
> +      expected += a[t + 4] * b[t + 4];
> +      expected += a[t + 5] * b[t + 5];
> +      expected += a[t + 6] * b[t + 6];
> +      expected += a[t + 7] * b[t + 7];
> +      t += step;
> +    }
> +
> +  if (f (0x12345, a, b, step, n) != expected)
> +    __builtin_abort ();
> +}
> +
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 8 "vect"  { target vect_sdot_hi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
> new file mode 100644
> index 00000000000..32c0f30c77b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
> @@ -0,0 +1,63 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> +
> +#include "tree-vect.h"
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +f (SIGNEDNESS_1 int res,
> +   SIGNEDNESS_2 short *a,
> +   SIGNEDNESS_2 short *b,
> +   int n)
> +{
> +  for (int i = 0; i < n; i++)
> +    {
> +      res += a[3 * i + 0] * b[3 * i + 0];
> +      res += a[3 * i + 1] * b[3 * i + 1];
> +      res += a[3 * i + 2] * b[3 * i + 2];
> +    }
> +
> +  return res;
> +}
> +
> +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)
> +#define OFFSET 20
> +
> +int
> +main (void)
> +{
> +  check_vect ();
> +
> +  SIGNEDNESS_2 short a[100], b[100];
> +  int expected = 0x12345;
> +  int n = 18;
> +
> +  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
> +    {
> +      a[i] = BASE + i * 5;
> +      b[i] = BASE + OFFSET + i * 4;
> +      asm volatile ("" ::: "memory");
> +    }
> +
> +  for (int i = 0; i < n; i++)
> +    {
> +      asm volatile ("" ::: "memory");
> +      expected += a[3 * i + 0] * b[3 * i + 0];
> +      expected += a[3 * i + 1] * b[3 * i + 1];
> +      expected += a[3 * i + 2] * b[3 * i + 2];
> +    }
> +
> +  if (f (0x12345, a, b, n) != expected)
> +    __builtin_abort ();
> +}
> +
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 3 "vect"  { target vect_sdot_hi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
> new file mode 100644
> index 00000000000..e17d6291f75
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
> @@ -0,0 +1,35 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-do compile } */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> +
> +#include "tree-vect.h"
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +f (SIGNEDNESS_1 int res0,
> +   SIGNEDNESS_1 int res1,
> +   SIGNEDNESS_1 int res2,
> +   SIGNEDNESS_1 int res3,
> +   SIGNEDNESS_2 short *a,
> +   SIGNEDNESS_2 short *b)
> +{
> +  for (int i = 0; i < 64; i += 4)
> +    {
> +      res0 += a[i + 0] * b[i + 0];
> +      res1 += a[i + 1] * b[i + 1];
> +      res2 += a[i + 2] * b[i + 2];
> +      res3 += a[i + 3] * b[i + 3];
> +    }
> +
> +  return res0 ^ res1 ^ res2 ^ res3;
> +}
> +
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> +/* { dg-final { scan-tree-dump-not "vectorizing stmts using SLP" "vect" } } */
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index e0561feddce..6d91665a341 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -5324,8 +5324,6 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
>    if (!gimple_extract_op (orig_stmt_info->stmt, &op))
>      gcc_unreachable ();
>
> -  bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
> -
>    if (reduction_type == EXTRACT_LAST_REDUCTION)
>      /* No extra instructions are needed in the prologue.  The loop body
>         operations are costed in vectorizable_condition.  */
> @@ -5360,12 +5358,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
>            initial result of the data reduction, initial value of the index
>            reduction.  */
>         prologue_stmts = 4;
> -      else if (emulated_mixed_dot_prod)
> -       /* We need the initial reduction value and two invariants:
> -          one that contains the minimum signed value and one that
> -          contains half of its negative.  */
> -       prologue_stmts = 3;
>        else
> +       /* We need the initial reduction value.  */
>         prologue_stmts = 1;
>        prologue_cost += record_stmt_cost (cost_vec, prologue_stmts,
>                                          scalar_to_vec, stmt_info, 0,
> @@ -7466,7 +7460,7 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
>        vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
>        unsigned nvectors;
>
> -      if (slp_node)
> +      if (slp_node && SLP_TREE_LANES (slp_node) > 1)

Hmm, that looks wrong.  It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off
instead, which is bad.

>         nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
>        else
>         nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
> @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
>      }
>  }
>
> +/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
> +   the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC.
> +   Now there are three such kinds of operations: dot-prod/widen-sum/sad
> +   (sum-of-absolute-differences).
> +
> +   For a lane-reducing operation, the loop reduction path that it lies in
> +   may contain normal operations, or other lane-reducing operations of
> +   different input type size, as in this example:
> +
> +     int sum = 0;
> +     for (i)
> +       {
> +         ...
> +         sum += d0[i] * d1[i];       // dot-prod <vector(16) char>
> +         sum += w[i];                // widen-sum <vector(16) char>
> +         sum += abs(s0[i] - s1[i]);  // sad <vector(8) short>
> +         sum += n[i];                // normal <vector(4) int>
> +         ...
> +       }
> +
> +   The vectorization factor is essentially determined by the operation whose
> +   input vectype has the most lanes ("vector(16) char" in the example), while
> +   we need to choose the input vectype with the least lanes ("vector(4) int"
> +   in the example) for the reduction PHI statement.  */
> +
> +bool
> +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> +                           slp_tree slp_node, stmt_vector_for_cost *cost_vec)
> +{
> +  gimple *stmt = stmt_info->stmt;
> +
> +  if (!lane_reducing_stmt_p (stmt))
> +    return false;
> +
> +  tree type = TREE_TYPE (gimple_assign_lhs (stmt));
> +
> +  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
> +    return false;
> +
> +  /* Do not try to vectorize bit-precision reductions.  */
> +  if (!type_has_mode_precision_p (type))
> +    return false;
> +
> +  if (!slp_node)
> +    return false;
> +
> +  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
> +    {
> +      stmt_vec_info def_stmt_info;
> +      slp_tree slp_op;
> +      tree op;
> +      tree vectype;
> +      enum vect_def_type dt;
> +
> +      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
> +                              &slp_op, &dt, &vectype, &def_stmt_info))
> +       {
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                            "use not simple.\n");
> +         return false;
> +       }
> +
> +      if (!vectype)
> +       {
> +         vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op),
> +                                                slp_op);
> +         if (!vectype)
> +           return false;
> +       }
> +
> +      if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))
> +       {
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                            "incompatible vector types for invariants\n");
> +         return false;
> +       }
> +
> +      if (i == STMT_VINFO_REDUC_IDX (stmt_info))
> +       continue;
> +
> +      /* There should be at most one cycle def in the stmt.  */
> +      if (VECTORIZABLE_CYCLE_DEF (dt))
> +       return false;
> +    }
> +
> +  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
> +
> +  /* TODO: Support lane-reducing operation that does not directly participate
> +     in loop reduction. */
> +  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
> +    return false;
> +
> +  /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not
> +     recognized.  */
> +  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
> +  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
> +
> +  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> +  int ncopies_for_cost;
> +
> +  if (SLP_TREE_LANES (slp_node) > 1)
> +    {
> +      /* Now lane-reducing operations in a non-single-lane slp node should only
> +        come from the same loop reduction path.  */
> +      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
> +      ncopies_for_cost = 1;
> +    }
> +  else
> +    {
> +      ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);

OK, so the fact that the ops are lane-reducing means they effectively
change the VF for the result.  That's only possible as we tightly control
code generation and "adjust" to the expected VF (by inserting the copies
you mentioned above), but only up to the highest number of outputs
created in the reduction chain.  In that sense, instead of talking about and
recording "input vector types", wouldn't it make more sense to record the
effective vectorization factor for the reduction instance?  That VF would be
at most the loop's VF but could be as low as 1.  Once we have a
non-lane-reducing operation in the reduction chain it would always be equal
to the loop's VF.

ncopies would then always be determined by that reduction instance VF and
the accumulator vector type (STMT_VINFO_VECTYPE).  This reduction
instance VF would also trivially indicate the force-single-def-use-cycle
case, possibly simplifying code?
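
A minimal sketch of that scheme, assuming a hypothetical reduc_vf field
recorded on the reduc_info (exact_div and TYPE_VECTOR_SUBPARTS are the
existing poly-int helpers; the helper itself does not exist in GCC today):

    /* Hypothetical helper: derive ncopies from a recorded reduction-instance
       VF and the accumulator vector type, instead of from vectype_in.  */
    static unsigned
    reduc_ncopies_from_vf (poly_uint64 reduc_vf, tree accum_vectype)
    {
      /* ncopies = reduction-instance VF / lanes of the accumulator type.  */
      return exact_div (reduc_vf,
			TYPE_VECTOR_SUBPARTS (accum_vectype)).to_constant ();
    }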

> +      gcc_assert (ncopies_for_cost >= 1);
> +    }
> +
> +  if (vect_is_emulated_mixed_dot_prod (stmt_info))
> +    {
> +      /* We need extra two invariants: one that contains the minimum signed
> +        value and one that contains half of its negative.  */
> +      int prologue_stmts = 2;
> +      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
> +                                       scalar_to_vec, stmt_info, 0,
> +                                       vect_prologue);
> +      if (dump_enabled_p ())
> +       dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
> +                    "extra prologue_cost = %d .\n", cost);
> +
> +      /* Three dot-products and a subtraction.  */
> +      ncopies_for_cost *= 4;
> +    }
> +
> +  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
> +                   vect_body);
> +
> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> +    {
> +      enum tree_code code = gimple_assign_rhs_code (stmt);
> +      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
> +                                                 slp_node, code, type,
> +                                                 vectype_in);
> +    }
> +

Add a comment:

    /* Transform via vect_transform_reduction.  */

> +  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
> +  return true;
> +}
> +
>  /* Function vectorizable_reduction.
>
>     Check if STMT_INFO performs a reduction operation that can be vectorized.
> @@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>    if (!type_has_mode_precision_p (op.type))
>      return false;
>
> -  /* For lane-reducing ops we're reducing the number of reduction PHIs
> -     which means the only use of that may be in the lane-reducing operation.  */
> -  if (lane_reducing
> -      && reduc_chain_length != 1
> -      && !only_slp_reduc_chain)
> -    {
> -      if (dump_enabled_p ())
> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> -                        "lane-reducing reduction with extra stmts.\n");
> -      return false;
> -    }
> -
>    /* Lane-reducing ops also never can be used in a SLP reduction group
>       since we'll mix lanes belonging to different reductions.  But it's
>       OK to use them in a reduction chain or when the reduction group
> @@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>        && loop_vinfo->suggested_unroll_factor == 1)
>      single_defuse_cycle = true;
>
> -  if (single_defuse_cycle || lane_reducing)
> +  if (single_defuse_cycle && !lane_reducing)

If there's also a non-lane-reducing plus in the chain don't we have to
check for that reduction op?  So shouldn't it be
single_defuse_cycle && ... fact that we don't record
(non-lane-reducing op there) ...

>      {
>        gcc_assert (op.code != COND_EXPR);
>
> -      /* 4. Supportable by target?  */
> -      bool ok = true;
> -
> -      /* 4.1. check support for the operation in the loop
> +      /* 4. check support for the operation in the loop
>
>          This isn't necessary for the lane reduction codes, since they
>          can only be produced by pattern matching, and it's up to the
> @@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>          mixed-sign dot-products can be implemented using signed
>          dot-products.  */
>        machine_mode vec_mode = TYPE_MODE (vectype_in);
> -      if (!lane_reducing
> -         && !directly_supported_p (op.code, vectype_in, optab_vector))
> +      if (!directly_supported_p (op.code, vectype_in, optab_vector))
>          {
>            if (dump_enabled_p ())
>              dump_printf (MSG_NOTE, "op not supported by target.\n");
>           if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
>               || !vect_can_vectorize_without_simd_p (op.code))
> -           ok = false;
> +           single_defuse_cycle = false;
>           else
>             if (dump_enabled_p ())
>               dump_printf (MSG_NOTE, "proceeding using word mode.\n");
> @@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>             dump_printf (MSG_NOTE, "using word mode not possible.\n");
>           return false;
>         }
> -
> -      /* lane-reducing operations have to go through vect_transform_reduction.
> -         For the other cases try without the single cycle optimization.  */
> -      if (!ok)
> -       {
> -         if (lane_reducing)
> -           return false;
> -         else
> -           single_defuse_cycle = false;
> -       }
>      }
>    if (dump_enabled_p () && single_defuse_cycle)
>      dump_printf_loc (MSG_NOTE, vect_location,
> @@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>                      "multiple vectors to one in the loop body\n");
>    STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;
>
> -  /* If the reduction stmt is one of the patterns that have lane
> -     reduction embedded we cannot handle the case of ! single_defuse_cycle.  */
> -  if ((ncopies > 1 && ! single_defuse_cycle)
> -      && lane_reducing)
> -    {
> -      if (dump_enabled_p ())
> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> -                        "multi def-use cycle not possible for lane-reducing "
> -                        "reduction operation\n");
> -      return false;
> -    }
> +  /* For lane-reducing operation, the below processing related to single
> +     defuse-cycle will be done in its own vectorizable function.  One more
> +     thing to note is that the operation must not be involved in fold-left
> +     reduction.  */
> +  single_defuse_cycle &= !lane_reducing;
>
>    if (slp_node
> -      && !(!single_defuse_cycle
> -          && !lane_reducing
> -          && reduction_type != FOLD_LEFT_REDUCTION))
> +      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
>      for (i = 0; i < (int) op.num_ops; i++)
>        if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
>         {
> @@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>    vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
>                              reduction_type, ncopies, cost_vec);
>    /* Cost the reduction op inside the loop if transformed via
> -     vect_transform_reduction.  Otherwise this is costed by the
> -     separate vectorizable_* routines.  */
> -  if (single_defuse_cycle || lane_reducing)
> -    {
> -      int factor = 1;
> -      if (vect_is_emulated_mixed_dot_prod (stmt_info))
> -       /* Three dot-products and a subtraction.  */
> -       factor = 4;
> -      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
> -                       stmt_info, 0, vect_body);
> -    }
> +     vect_transform_reduction for non-lane-reducing operation.  Otherwise
> +     this is costed by the separate vectorizable_* routines.  */
> +  if (single_defuse_cycle)
> +    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
>
>    if (dump_enabled_p ()
>        && reduction_type == FOLD_LEFT_REDUCTION)
>      dump_printf_loc (MSG_NOTE, vect_location,
>                      "using an in-order (fold-left) reduction.\n");
>    STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
> -  /* All but single defuse-cycle optimized, lane-reducing and fold-left
> -     reductions go through their own vectorizable_* routines.  */
> -  if (!single_defuse_cycle
> -      && !lane_reducing
> -      && reduction_type != FOLD_LEFT_REDUCTION)
> +
> +  /* All but single defuse-cycle optimized and fold-left reductions go
> +     through their own vectorizable_* routines.  */
> +  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
>      {
>        stmt_vec_info tem
>         = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
> @@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>    bool lane_reducing = lane_reducing_op_p (code);
>    gcc_assert (single_defuse_cycle || lane_reducing);
>
> +  if (lane_reducing)
> +    {
> +      /* The last operand of lane-reducing op is for reduction.  */
> +      gcc_assert (reduc_index == (int) op.num_ops - 1);
> +
> +      /* Now all lane-reducing ops are covered by some slp node.  */
> +      gcc_assert (slp_node);
> +    }
> +
>    /* Create the destination vector  */
>    tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
>    tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
> @@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>                          reduc_index == 2 ? op.ops[2] : NULL_TREE,
>                          &vec_oprnds[2]);
>      }
> +  else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1
> +          && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ())
> +    {
> +      /* For a lane-reducing op covered by a single-lane slp node, the
> +        input vectype of the reduction PHI determines the number of copies
> +        of the vectorized def-use cycle, which might exceed the effective
> +        number of copies of the vectorized lane-reducing statement.  The gap
> +        could be complemented by extra trivial pass-through copies.  For example:
> +
> +          int sum = 0;
> +          for (i)
> +            {
> +              sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
> +              sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
> +              sum += n[i];               // normal <vector(4) int>
> +            }
> +
> +        The vector size is 128-bit and the vectorization factor is 16.  Reduction
> +        statements would be transformed as:
> +
> +          vector<4> int sum_v0 = { 0, 0, 0, 0 };
> +          vector<4> int sum_v1 = { 0, 0, 0, 0 };
> +          vector<4> int sum_v2 = { 0, 0, 0, 0 };
> +          vector<4> int sum_v3 = { 0, 0, 0, 0 };
> +
> +          for (i / 16)
> +            {
> +              sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
> +              sum_v1 = sum_v1;  // copy
> +              sum_v2 = sum_v2;  // copy
> +              sum_v3 = sum_v3;  // copy
> +
> +              sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
> +              sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
> +              sum_v2 = sum_v2;  // copy
> +              sum_v3 = sum_v3;  // copy
> +
> +              sum_v0 += n_v0[i: 0  ~ 3 ];
> +              sum_v1 += n_v1[i: 4  ~ 7 ];
> +              sum_v2 += n_v2[i: 8  ~ 11];
> +              sum_v3 += n_v3[i: 12 ~ 15];
> +            }
> +       */
> +      unsigned using_ncopies = vec_oprnds[0].length ();
> +      unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
> +

assert reduc_ncopies >= using_ncopies?  Maybe assert
reduc_index == op.num_ops - 1 given you use one above
and the other below?  Or simply iterate till op.num_ops
and skip i == reduc_index.
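
That last variant would look something like this sketch (names taken from
the hunk above):

      /* Grow every non-reduction operand vector to reduc_ncopies,
	 skipping the reduction carry itself.  */
      for (unsigned i = 0; i < op.num_ops; i++)
	{
	  if ((int) i == reduc_index)
	    continue;
	  gcc_assert (vec_oprnds[i].length () == using_ncopies);
	  vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
	}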

> +      for (unsigned i = 0; i < op.num_ops - 1; i++)
> +       {
> +         gcc_assert (vec_oprnds[i].length () == using_ncopies);
> +         vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
> +       }
> +    }
>
>    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
>    unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
> @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>      {
>        gimple *new_stmt;
>        tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
> -      if (masked_loop_p && !mask_by_cond_expr)
> +
> +      if (!vop[0] || !vop[1])
> +       {
> +         tree reduc_vop = vec_oprnds[reduc_index][i];
> +
> +         /* Insert trivial copy if no need to generate vectorized
> +            statement.  */
> +         gcc_assert (reduc_vop);
> +
> +         new_stmt = gimple_build_assign (vec_dest, reduc_vop);
> +         new_temp = make_ssa_name (vec_dest, new_stmt);
> +         gimple_set_lhs (new_stmt, new_temp);
> +         vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);

I think you could simply do

               slp_node->push_vec_def (reduc_vop);
               continue;

without any code generation.
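
i.e. the whole branch could shrink to something like this sketch, reusing
the names from the hunk above:

      if (!vop[0] || !vop[1])
	{
	  tree reduc_vop = vec_oprnds[reduc_index][i];

	  /* Reuse the reduction carry directly as this copy's def;
	     no pass-through assignment is emitted.  */
	  gcc_assert (reduc_vop);
	  slp_node->push_vec_def (reduc_vop);
	  continue;
	}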

> +       }
> +      else if (masked_loop_p && !mask_by_cond_expr)
>         {
>           /* No conditional ifns have been defined for lane-reducing op
>              yet.  */
> @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>
>           if (masked_loop_p && mask_by_cond_expr)
>             {
> +             tree stmt_vectype_in = vectype_in;
> +             unsigned nvectors = vec_num * ncopies;
> +
> +             if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
> +               {
> +                 /* Input vectype of the reduction PHI may be defferent from

different

> +                    that of lane-reducing operation.  */
> +                 stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> +                 nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);

I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS.

Otherwise the patch looks good to me.

Richard.

> +               }
> +
>               tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks,
> -                                             vec_num * ncopies, vectype_in, i);
> +                                             nvectors, stmt_vectype_in, i);
>               build_vect_cond_expr (code, vop, mask, gsi);
>             }
>
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index ca6052662a3..1b73ef01ade 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -13350,6 +13350,8 @@ vect_analyze_stmt (vec_info *vinfo,
>                                       NULL, NULL, node, cost_vec)
>           || vectorizable_load (vinfo, stmt_info, NULL, NULL, node, cost_vec)
>           || vectorizable_store (vinfo, stmt_info, NULL, NULL, node, cost_vec)
> +         || vectorizable_lane_reducing (as_a <loop_vec_info> (vinfo),
> +                                        stmt_info, node, cost_vec)
>           || vectorizable_reduction (as_a <loop_vec_info> (vinfo), stmt_info,
>                                      node, node_instance, cost_vec)
>           || vectorizable_induction (as_a <loop_vec_info> (vinfo), stmt_info,
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 60224f4e284..94736736dcc 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -2455,6 +2455,8 @@ extern loop_vec_info vect_create_loop_vinfo (class loop *, vec_info_shared *,
>  extern bool vectorizable_live_operation (vec_info *, stmt_vec_info,
>                                          slp_tree, slp_instance, int,
>                                          bool, stmt_vector_for_cost *);
> +extern bool vectorizable_lane_reducing (loop_vec_info, stmt_vec_info,
> +                                       slp_tree, stmt_vector_for_cost *);
>  extern bool vectorizable_reduction (loop_vec_info, stmt_vec_info,
>                                     slp_tree, slp_instance,
>                                     stmt_vector_for_cost *);
> --
> 2.17.1

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
  2024-06-20 12:26 ` Richard Biener
@ 2024-06-23 15:10   ` Feng Xue OS
  2024-06-24 12:58     ` Richard Biener
  0 siblings, 1 reply; 9+ messages in thread
From: Feng Xue OS @ 2024-06-23 15:10 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

>> -      if (slp_node)
>> +      if (slp_node && SLP_TREE_LANES (slp_node) > 1)
> 
> Hmm, that looks wrong.  It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off
> instead, which is bad.
> 
>>         nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
>>        else
>>         nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
>> @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
>>      }
>>  }
>>
>> +/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
>> +   the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC.
>> +   Now there are three such kinds of operations: dot-prod/widen-sum/sad
>> +   (sum-of-absolute-differences).
>> +
>> +   For a lane-reducing operation, the loop reduction path that it lies in
>> +   may contain normal operations, or other lane-reducing operations of
>> +   different input type size, as in this example:
>> +
>> +     int sum = 0;
>> +     for (i)
>> +       {
>> +         ...
>> +         sum += d0[i] * d1[i];       // dot-prod <vector(16) char>
>> +         sum += w[i];                // widen-sum <vector(16) char>
>> +         sum += abs(s0[i] - s1[i]);  // sad <vector(8) short>
>> +         sum += n[i];                // normal <vector(4) int>
>> +         ...
>> +       }
>> +
>> +   The vectorization factor is essentially determined by the operation whose
>> +   input vectype has the most lanes ("vector(16) char" in the example), while
>> +   we need to choose the input vectype with the least lanes ("vector(4) int"
>> +   in the example) for the reduction PHI statement.  */
>> +
>> +bool
>> +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
>> +                           slp_tree slp_node, stmt_vector_for_cost *cost_vec)
>> +{
>> +  gimple *stmt = stmt_info->stmt;
>> +
>> +  if (!lane_reducing_stmt_p (stmt))
>> +    return false;
>> +
>> +  tree type = TREE_TYPE (gimple_assign_lhs (stmt));
>> +
>> +  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
>> +    return false;
>> +
>> +  /* Do not try to vectorize bit-precision reductions.  */
>> +  if (!type_has_mode_precision_p (type))
>> +    return false;
>> +
>> +  if (!slp_node)
>> +    return false;
>> +
>> +  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
>> +    {
>> +      stmt_vec_info def_stmt_info;
>> +      slp_tree slp_op;
>> +      tree op;
>> +      tree vectype;
>> +      enum vect_def_type dt;
>> +
>> +      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
>> +                              &slp_op, &dt, &vectype, &def_stmt_info))
>> +       {
>> +         if (dump_enabled_p ())
>> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> +                            "use not simple.\n");
>> +         return false;
>> +       }
>> +
>> +      if (!vectype)
>> +       {
>> +         vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op),
>> +                                                slp_op);
>> +         if (!vectype)
>> +           return false;
>> +       }
>> +
>> +      if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))
>> +       {
>> +         if (dump_enabled_p ())
>> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> +                            "incompatible vector types for invariants\n");
>> +         return false;
>> +       }
>> +
>> +      if (i == STMT_VINFO_REDUC_IDX (stmt_info))
>> +       continue;
>> +
>> +      /* There should be at most one cycle def in the stmt.  */
>> +      if (VECTORIZABLE_CYCLE_DEF (dt))
>> +       return false;
>> +    }
>> +
>> +  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
>> +
>> +  /* TODO: Support lane-reducing operation that does not directly participate
>> +     in loop reduction. */
>> +  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
>> +    return false;
>> +
>> +  /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not
>> +     recognized.  */
>> +  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
>> +  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
>> +
>> +  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
>> +  int ncopies_for_cost;
>> +
>> +  if (SLP_TREE_LANES (slp_node) > 1)
>> +    {
>> +      /* Now lane-reducing operations in a non-single-lane slp node should only
>> +        come from the same loop reduction path.  */
>> +      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
>> +      ncopies_for_cost = 1;
>> +    }
>> +  else
>> +    {
>> +      ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);
> 
> OK, so the fact that the ops are lane-reducing means they effectively
> change the VF for the result.  That's only possible as we tightly control
> code generation and "adjust" to the expected VF (by inserting the copies
> you mentioned above), but only up to the highest number of outputs
> created in the reduction chain.  In that sense, instead of talking about and
> recording "input vector types", wouldn't it make more sense to record the
> effective vectorization factor for the reduction instance?  That VF would be
> at most the loop's VF but could be as low as 1.  Once we have a
> non-lane-reducing operation in the reduction chain it would always be equal
> to the loop's VF.
> 
> ncopies would then always be determined by that reduction instance VF and
> the accumulator vector type (STMT_VINFO_VECTYPE).  This reduction
> instance VF would also trivially indicate the force-single-def-use-cycle
> case, possibly simplifying code?

I tried to add such an effective VF, but the vectype_in is still needed in some
scenarios, such as when checking whether a dot-prod stmt is emulated or not.
The former could be deduced from the latter, so recording both things seems
redundant. Another consideration is that for a normal op, ncopies is
determined from the type (STMT_VINFO_VECTYPE), but for a lane-reducing op,
it is determined from the VF. So, is there a better way to unify them?
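
For concreteness, the two computations being contrasted look roughly like
this sketch (vect_get_num_copies and STMT_VINFO_VECTYPE are existing
vectorizer APIs; STMT_VINFO_REDUC_VECTYPE_IN is the field this series
records):

  /* Normal op: ncopies follows the statement's own vector type.  */
  unsigned ncopies
    = vect_get_num_copies (loop_vinfo, STMT_VINFO_VECTYPE (stmt_info));

  /* Lane-reducing op: ncopies follows the recorded input vector type,
     i.e. effectively the VF divided by that type's number of lanes.  */
  unsigned lr_ncopies
    = vect_get_num_copies (loop_vinfo,
			   STMT_VINFO_REDUC_VECTYPE_IN (stmt_info));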
 
>> +      gcc_assert (ncopies_for_cost >= 1);
>> +    }
>> +
>> +  if (vect_is_emulated_mixed_dot_prod (stmt_info))
>> +    {
>> +      /* We need extra two invariants: one that contains the minimum signed
>> +        value and one that contains half of its negative.  */
>> +      int prologue_stmts = 2;
>> +      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
>> +                                       scalar_to_vec, stmt_info, 0,
>> +                                       vect_prologue);
>> +      if (dump_enabled_p ())
>> +       dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
>> +                    "extra prologue_cost = %d .\n", cost);
>> +
>> +      /* Three dot-products and a subtraction.  */
>> +      ncopies_for_cost *= 4;
>> +    }
>> +
>> +  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
>> +                   vect_body);
>> +
>> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
>> +    {
>> +      enum tree_code code = gimple_assign_rhs_code (stmt);
>> +      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
>> +                                                 slp_node, code, type,
>> +                                                 vectype_in);
>> +    }
>> +
> 
> Add a comment:
> 
>     /* Transform via vect_transform_reduction.  */
> 
>> +  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
>> +  return true;
>> +}
>> +
>>  /* Function vectorizable_reduction.
>>
>>     Check if STMT_INFO performs a reduction operation that can be vectorized.
>> @@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>>    if (!type_has_mode_precision_p (op.type))
>>      return false;
>>
>> -  /* For lane-reducing ops we're reducing the number of reduction PHIs
>> -     which means the only use of that may be in the lane-reducing operation.  */
>> -  if (lane_reducing
>> -      && reduc_chain_length != 1
>> -      && !only_slp_reduc_chain)
>> -    {
>> -      if (dump_enabled_p ())
>> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> -                        "lane-reducing reduction with extra stmts.\n");
>> -      return false;
>> -    }
>> -
>>    /* Lane-reducing ops also never can be used in a SLP reduction group
>>       since we'll mix lanes belonging to different reductions.  But it's
>>       OK to use them in a reduction chain or when the reduction group
>> @@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>>        && loop_vinfo->suggested_unroll_factor == 1)
>>      single_defuse_cycle = true;
>>
>> -  if (single_defuse_cycle || lane_reducing)
>> +  if (single_defuse_cycle && !lane_reducing)
> 
> If there's also a non-lane-reducing plus in the chain don't we have to
> check for that reduction op?  So shouldn't it be
> single_defuse_cycle && ... fact that we don't record
> (non-lane-reducing op there) ...

I don't quite understand this point.  For a non-lane-reducing op in the chain,
shouldn't it be handled in its own vectorizable_xxx function? The below check
is only for the first statement (vect_reduction_def) in the reduction.

> 
>>      {
>>        gcc_assert (op.code != COND_EXPR);
>>
>> -      /* 4. Supportable by target?  */
>> -      bool ok = true;
>> -
>> -      /* 4.1. check support for the operation in the loop
>> +      /* 4. check support for the operation in the loop
>>
>>          This isn't necessary for the lane reduction codes, since they
>>          can only be produced by pattern matching, and it's up to the
>> @@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>>          mixed-sign dot-products can be implemented using signed
>>          dot-products.  */
>>        machine_mode vec_mode = TYPE_MODE (vectype_in);
>> -      if (!lane_reducing
>> -         && !directly_supported_p (op.code, vectype_in, optab_vector))
>> +      if (!directly_supported_p (op.code, vectype_in, optab_vector))
>>          {
>>            if (dump_enabled_p ())
>>              dump_printf (MSG_NOTE, "op not supported by target.\n");
>>           if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
>>               || !vect_can_vectorize_without_simd_p (op.code))
>> -           ok = false;
>> +           single_defuse_cycle = false;
>>           else
>>             if (dump_enabled_p ())
>>               dump_printf (MSG_NOTE, "proceeding using word mode.\n");
>> @@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>>             dump_printf (MSG_NOTE, "using word mode not possible.\n");
>>           return false;
>>         }
>> -
>> -      /* lane-reducing operations have to go through vect_transform_reduction.
>> -         For the other cases try without the single cycle optimization.  */
>> -      if (!ok)
>> -       {
>> -         if (lane_reducing)
>> -           return false;
>> -         else
>> -           single_defuse_cycle = false;
>> -       }
>>      }
>>    if (dump_enabled_p () && single_defuse_cycle)
>>      dump_printf_loc (MSG_NOTE, vect_location,
>> @@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>>                      "multiple vectors to one in the loop body\n");
>>    STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;
>>
>> -  /* If the reduction stmt is one of the patterns that have lane
>> -     reduction embedded we cannot handle the case of ! single_defuse_cycle.  */
>> -  if ((ncopies > 1 && ! single_defuse_cycle)
>> -      && lane_reducing)
>> -    {
>> -      if (dump_enabled_p ())
>> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> -                        "multi def-use cycle not possible for lane-reducing "
>> -                        "reduction operation\n");
>> -      return false;
>> -    }
>> +  /* For lane-reducing operation, the below processing related to single
>> +     defuse-cycle will be done in its own vectorizable function.  One more
>> +     thing to note is that the operation must not be involved in fold-left
>> +     reduction.  */
>> +  single_defuse_cycle &= !lane_reducing;
>>
>>    if (slp_node
>> -      && !(!single_defuse_cycle
>> -          && !lane_reducing
>> -          && reduction_type != FOLD_LEFT_REDUCTION))
>> +      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
>>      for (i = 0; i < (int) op.num_ops; i++)
>>        if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
>>         {
>> @@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>>    vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
>>                              reduction_type, ncopies, cost_vec);
>>    /* Cost the reduction op inside the loop if transformed via
>> -     vect_transform_reduction.  Otherwise this is costed by the
>> -     separate vectorizable_* routines.  */
>> -  if (single_defuse_cycle || lane_reducing)
>> -    {
>> -      int factor = 1;
>> -      if (vect_is_emulated_mixed_dot_prod (stmt_info))
>> -       /* Three dot-products and a subtraction.  */
>> -       factor = 4;
>> -      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
>> -                       stmt_info, 0, vect_body);
>> -    }
>> +     vect_transform_reduction for non-lane-reducing operation.  Otherwise
>> +     this is costed by the separate vectorizable_* routines.  */
>> +  if (single_defuse_cycle)
>> +    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
>>
>>    if (dump_enabled_p ()
>>        && reduction_type == FOLD_LEFT_REDUCTION)
>>      dump_printf_loc (MSG_NOTE, vect_location,
>>                      "using an in-order (fold-left) reduction.\n");
>>    STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
>> -  /* All but single defuse-cycle optimized, lane-reducing and fold-left
>> -     reductions go through their own vectorizable_* routines.  */
>> -  if (!single_defuse_cycle
>> -      && !lane_reducing
>> -      && reduction_type != FOLD_LEFT_REDUCTION)
>> +
>> +  /* All but single defuse-cycle optimized and fold-left reductions go
>> +     through their own vectorizable_* routines.  */
>> +  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
>>      {
>>        stmt_vec_info tem
>>         = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
>> @@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>>    bool lane_reducing = lane_reducing_op_p (code);
>>    gcc_assert (single_defuse_cycle || lane_reducing);
>>
>> +  if (lane_reducing)
>> +    {
>> +      /* The last operand of lane-reducing op is for reduction.  */
>> +      gcc_assert (reduc_index == (int) op.num_ops - 1);
>> +
>> +      /* Now all lane-reducing ops are covered by some slp node.  */
>> +      gcc_assert (slp_node);
>> +    }
>> +
>>    /* Create the destination vector  */
>>    tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
>>    tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
>> @@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>>                          reduc_index == 2 ? op.ops[2] : NULL_TREE,
>>                          &vec_oprnds[2]);
>>      }
>> +  else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1
>> +          && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ())
>> +    {
>> +      /* For a lane-reducing op covered by a single-lane slp node, the input
>> +        vectype of the reduction PHI determines the number of copies of the
>> +        vectorized def-use cycle, which might be more than the effective
>> +        number of copies of the vectorized lane-reducing statement.  The gap
>> +        could be complemented by generating extra trivial pass-through
>> +        copies.  For example:
>> +
>> +          int sum = 0;
>> +          for (i)
>> +            {
>> +              sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
>> +              sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
>> +              sum += n[i];               // normal <vector(4) int>
>> +            }
>> +
>> +        The vector size is 128-bit, vectorization factor is 16.  Reduction
>> +        statements would be transformed as:
>> +
>> +          vector<4> int sum_v0 = { 0, 0, 0, 0 };
>> +          vector<4> int sum_v1 = { 0, 0, 0, 0 };
>> +          vector<4> int sum_v2 = { 0, 0, 0, 0 };
>> +          vector<4> int sum_v3 = { 0, 0, 0, 0 };
>> +
>> +          for (i / 16)
>> +            {
>> +              sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
>> +              sum_v1 = sum_v1;  // copy
>> +              sum_v2 = sum_v2;  // copy
>> +              sum_v3 = sum_v3;  // copy
>> +
>> +              sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
>> +              sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
>> +              sum_v2 = sum_v2;  // copy
>> +              sum_v3 = sum_v3;  // copy
>> +
>> +              sum_v0 += n_v0[i: 0  ~ 3 ];
>> +              sum_v1 += n_v1[i: 4  ~ 7 ];
>> +              sum_v2 += n_v2[i: 8  ~ 11];
>> +              sum_v3 += n_v3[i: 12 ~ 15];
>> +            }
>> +       */
>> +      unsigned using_ncopies = vec_oprnds[0].length ();
>> +      unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
>> +
> 
> assert reduc_ncopies >= using_ncopies?  Maybe assert
> reduc_index == op.num_ops - 1 given you use one above
> and the other below?  Or simply iterate till op.num_ops
> and skip i == reduc_index.
> 
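Spelled out, that last alternative would be something like the following
sketch (just the suggested shape, not committed code):

      for (unsigned i = 0; i < op.num_ops; i++)
        {
          if ((int) i == reduc_index)
            continue;
          /* All non-reduction operands share the same effective ncopies.  */
          gcc_assert (vec_oprnds[i].length () == using_ncopies);
          vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
        }
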
>> +      for (unsigned i = 0; i < op.num_ops - 1; i++)
>> +       {
>> +         gcc_assert (vec_oprnds[i].length () == using_ncopies);
>> +         vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
>> +       }
>> +    }
>>
>>    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
>>    unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
>> @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>>      {
>>        gimple *new_stmt;
>>        tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
>> -      if (masked_loop_p && !mask_by_cond_expr)
>> +
>> +      if (!vop[0] || !vop[1])
>> +       {
>> +         tree reduc_vop = vec_oprnds[reduc_index][i];
>> +
>> +         /* Insert trivial copy if no need to generate vectorized
>> +            statement.  */
>> +         gcc_assert (reduc_vop);
>> +
>> +         new_stmt = gimple_build_assign (vec_dest, reduc_vop);
>> +         new_temp = make_ssa_name (vec_dest, new_stmt);
>> +         gimple_set_lhs (new_stmt, new_temp);
>> +         vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> 
> I think you could simply do
> 
>                slp_node->push_vec_def (reduc_vop);
>                continue;
> 
> without any code generation.
> 
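Combined with the quoted hunk above, the whole branch would then shrink to
something like this sketch:

      if (!vop[0] || !vop[1])
        {
          tree reduc_vop = vec_oprnds[reduc_index][i];

          /* No vectorized statement is needed; just forward the reduction
             carry as this copy's vector def.  */
          gcc_assert (reduc_vop);
          slp_node->push_vec_def (reduc_vop);
          continue;
        }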

OK, that would be easy. Here comes another question: this patch assumes a
lane-reducing op would always be contained in a slp node, since the
single-lane slp node feature has been enabled. But I got some regressions
when I enforced such a constraint in the lane-reducing op check. Those
cases were found to be unvectorizable with single-lane slp, so is this not
what we want, and does it need to be fixed?

>> +       }
>> +      else if (masked_loop_p && !mask_by_cond_expr)
>>         {
>>           /* No conditional ifns have been defined for lane-reducing op
>>              yet.  */
>> @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>>
>>           if (masked_loop_p && mask_by_cond_expr)
>>             {
>> +             tree stmt_vectype_in = vectype_in;
>> +             unsigned nvectors = vec_num * ncopies;
>> +
>> +             if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
>> +               {
>> +                 /* Input vectype of the reduction PHI may be defferent from
> 
> different
> 
>> +                    that of lane-reducing operation.  */
>> +                 stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
>> +                 nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);
> 
> I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS.

To partially vectorize a dot_prod<16 * char> with 128-bit vector width,
should we pass (nvector=4, vectype=<4 *int>) instead of (nvector=1,
vectype=<16 *char>) to vect_get_loop_mask?
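
Spelled out for that example (int_vectype and char_vectype are only
placeholders here for vector(4) int and vector(16) char), the two candidate
calls would be roughly:

  /* One mask per def-use cycle copy: 4 masks of vector(4) int shape.  */
  mask = vect_get_loop_mask (loop_vinfo, gsi, masks, 4, int_vectype, i);

  /* versus one mask per dot-prod input vector: a single mask of
     vector(16) char shape.  */
  mask = vect_get_loop_mask (loop_vinfo, gsi, masks, 1, char_vectype, i);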

Thanks,
Feng


________________________________________
From: Richard Biener <richard.guenther@gmail.com>
Sent: Thursday, June 20, 2024 8:26 PM
To: Feng Xue OS
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]

On Sun, Jun 16, 2024 at 9:31 AM Feng Xue OS <fxue@os.amperecomputing.com> wrote:
>
> For lane-reducing operation(dot-prod/widen-sum/sad) in loop reduction, current
> vectorizer could only handle the pattern if the reduction chain does not
> contain other operation, no matter the other is normal or lane-reducing.
>
> Actually, to allow multiple arbitrary lane-reducing operations, we need to
> support vectorization of loop reduction chain with mixed input vectypes. Since
> lanes of vectype may vary with operation, the effective ncopies of vectorized
> statements for operation also may not be same to each other, this causes
> mismatch on vectorized def-use cycles. A simple way is to align all operations
> with the one that has the most ncopies, the gap could be complemented by
> generating extra trivial pass-through copies. For example:
>
>    int sum = 0;
>    for (i)
>      {
>        sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
>        sum += w[i];               // widen-sum <vector(16) char>
>        sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
>        sum += n[i];               // normal <vector(4) int>
>      }
>
> The vector size is 128-bit vectorization factor is 16. Reduction statements
> would be transformed as:
>
>    vector<4> int sum_v0 = { 0, 0, 0, 0 };
>    vector<4> int sum_v1 = { 0, 0, 0, 0 };
>    vector<4> int sum_v2 = { 0, 0, 0, 0 };
>    vector<4> int sum_v3 = { 0, 0, 0, 0 };
>
>    for (i / 16)
>      {
>        sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
>        sum_v1 = sum_v1;  // copy
>        sum_v2 = sum_v2;  // copy
>        sum_v3 = sum_v3;  // copy
>
>        sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
>        sum_v1 = sum_v1;  // copy
>        sum_v2 = sum_v2;  // copy
>        sum_v3 = sum_v3;  // copy
>
>        sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
>        sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
>        sum_v2 = sum_v2;  // copy
>        sum_v3 = sum_v3;  // copy
>
>        sum_v0 += n_v0[i: 0  ~ 3 ];
>        sum_v1 += n_v1[i: 4  ~ 7 ];
>        sum_v2 += n_v2[i: 8  ~ 11];
>        sum_v3 += n_v3[i: 12 ~ 15];
>      }
>
> Thanks,
> Feng
>
> ---
> gcc/
>         PR tree-optimization/114440
>         * tree-vectorizer.h (vectorizable_lane_reducing): New function
>         declaration.
>         * tree-vect-stmts.cc (vect_analyze_stmt): Call new function
>         vectorizable_lane_reducing to analyze lane-reducing operation.
>         * tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation
>         code related to emulated_mixed_dot_prod.
>         (vect_reduction_update_partial_vector_usage): Compute ncopies as the
>         original means for single-lane slp node.
>         (vectorizable_lane_reducing): New function.
>         (vectorizable_reduction): Allow multiple lane-reducing operations in
>         loop reduction. Move some original lane-reducing related code to
>         vectorizable_lane_reducing.
>         (vect_transform_reduction): Extend transformation to support reduction
>         statements with mixed input vectypes.
>
> gcc/testsuite/
>         PR tree-optimization/114440
>         * gcc.dg/vect/vect-reduc-chain-1.c
>         * gcc.dg/vect/vect-reduc-chain-2.c
>         * gcc.dg/vect/vect-reduc-chain-3.c
>         * gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
>         * gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
>         * gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
>         * gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
>         * gcc.dg/vect/vect-reduc-dot-slp-1.c
> ---
>  .../gcc.dg/vect/vect-reduc-chain-1.c          |  62 ++++
>  .../gcc.dg/vect/vect-reduc-chain-2.c          |  77 +++++
>  .../gcc.dg/vect/vect-reduc-chain-3.c          |  66 ++++
>  .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c  |  95 +++++
>  .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c  |  67 ++++
>  .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c  |  79 +++++
>  .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c  |  63 ++++
>  .../gcc.dg/vect/vect-reduc-dot-slp-1.c        |  35 ++
>  gcc/tree-vect-loop.cc                         | 324 ++++++++++++++----
>  gcc/tree-vect-stmts.cc                        |   2 +
>  gcc/tree-vectorizer.h                         |   2 +
>  11 files changed, 802 insertions(+), 70 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
> new file mode 100644
> index 00000000000..04bfc419dbd
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
> @@ -0,0 +1,62 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> +
> +#include "tree-vect.h"
> +
> +#define N 50
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +f (SIGNEDNESS_1 int res,
> +   SIGNEDNESS_2 char *restrict a,
> +   SIGNEDNESS_2 char *restrict b,
> +   SIGNEDNESS_2 char *restrict c,
> +   SIGNEDNESS_2 char *restrict d,
> +   SIGNEDNESS_1 int *restrict e)
> +{
> +  for (int i = 0; i < N; ++i)
> +    {
> +      res += a[i] * b[i];
> +      res += c[i] * d[i];
> +      res += e[i];
> +    }
> +  return res;
> +}
> +
> +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
> +#define OFFSET 20
> +
> +int
> +main (void)
> +{
> +  check_vect ();
> +
> +  SIGNEDNESS_2 char a[N], b[N];
> +  SIGNEDNESS_2 char c[N], d[N];
> +  SIGNEDNESS_1 int e[N];
> +  int expected = 0x12345;
> +  for (int i = 0; i < N; ++i)
> +    {
> +      a[i] = BASE + i * 5;
> +      b[i] = BASE + OFFSET + i * 4;
> +      c[i] = BASE + i * 2;
> +      d[i] = BASE + OFFSET + i * 3;
> +      e[i] = i;
> +      asm volatile ("" ::: "memory");
> +      expected += a[i] * b[i];
> +      expected += c[i] * d[i];
> +      expected += e[i];
> +    }
> +  if (f (0x12345, a, b, c, d, e) != expected)
> +    __builtin_abort ();
> +}
> +
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 2 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
> new file mode 100644
> index 00000000000..6c803b80120
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
> @@ -0,0 +1,77 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> +
> +#include "tree-vect.h"
> +
> +#define N 50
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 unsigned
> +#define SIGNEDNESS_3 signed
> +#define SIGNEDNESS_4 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +fn (SIGNEDNESS_1 int res,
> +   SIGNEDNESS_2 char *restrict a,
> +   SIGNEDNESS_2 char *restrict b,
> +   SIGNEDNESS_3 char *restrict c,
> +   SIGNEDNESS_3 char *restrict d,
> +   SIGNEDNESS_4 short *restrict e,
> +   SIGNEDNESS_4 short *restrict f,
> +   SIGNEDNESS_1 int *restrict g)
> +{
> +  for (int i = 0; i < N; ++i)
> +    {
> +      res += a[i] * b[i];
> +      res += i + 1;
> +      res += c[i] * d[i];
> +      res += e[i] * f[i];
> +      res += g[i];
> +    }
> +  return res;
> +}
> +
> +#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
> +#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -126 : 4)
> +#define BASE4 ((SIGNEDNESS_4 int) -1 < 0 ? -1026 : 373)
> +#define OFFSET 20
> +
> +int
> +main (void)
> +{
> +  check_vect ();
> +
> +  SIGNEDNESS_2 char a[N], b[N];
> +  SIGNEDNESS_3 char c[N], d[N];
> +  SIGNEDNESS_4 short e[N], f[N];
> +  SIGNEDNESS_1 int g[N];
> +  int expected = 0x12345;
> +  for (int i = 0; i < N; ++i)
> +    {
> +      a[i] = BASE2 + i * 5;
> +      b[i] = BASE2 + OFFSET + i * 4;
> +      c[i] = BASE3 + i * 2;
> +      d[i] = BASE3 + OFFSET + i * 3;
> +      e[i] = BASE4 + i * 6;
> +      f[i] = BASE4 + OFFSET + i * 5;
> +      g[i] = i;
> +      asm volatile ("" ::: "memory");
> +      expected += a[i] * b[i];
> +      expected += i + 1;
> +      expected += c[i] * d[i];
> +      expected += e[i] * f[i];
> +      expected += g[i];
> +    }
> +  if (fn (0x12345, a, b, c, d, e, f, g) != expected)
> +    __builtin_abort ();
> +}
> +
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_sdot_qi } } } } */
> +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_udot_qi } } } } */
> +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_sdot_hi } } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
> new file mode 100644
> index 00000000000..a41e4b176c4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
> @@ -0,0 +1,66 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> +/* { dg-require-effective-target vect_int } */
> +
> +#include "tree-vect.h"
> +
> +#define N 50
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 unsigned
> +#define SIGNEDNESS_3 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +f (SIGNEDNESS_1 int res,
> +   SIGNEDNESS_2 char *restrict a,
> +   SIGNEDNESS_2 char *restrict b,
> +   SIGNEDNESS_3 short *restrict c,
> +   SIGNEDNESS_3 short *restrict d,
> +   SIGNEDNESS_1 int *restrict e)
> +{
> +  for (int i = 0; i < N; ++i)
> +    {
> +      short diff = a[i] - b[i];
> +      SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff;
> +      res += abs;
> +      res += c[i] * d[i];
> +      res += e[i];
> +    }
> +  return res;
> +}
> +
> +#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
> +#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -1236 : 373)
> +#define OFFSET 20
> +
> +int
> +main (void)
> +{
> +  check_vect ();
> +
> +  SIGNEDNESS_2 char a[N], b[N];
> +  SIGNEDNESS_3 short c[N], d[N];
> +  SIGNEDNESS_1 int e[N];
> +  int expected = 0x12345;
> +  for (int i = 0; i < N; ++i)
> +    {
> +      a[i] = BASE2 + i * 5;
> +      b[i] = BASE2 - i * 4;
> +      c[i] = BASE3 + i * 2;
> +      d[i] = BASE3 + OFFSET + i * 3;
> +      e[i] = i;
> +      asm volatile ("" ::: "memory");
> +      short diff = a[i] - b[i];
> +      SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff;
> +      expected += abs;
> +      expected += c[i] * d[i];
> +      expected += e[i];
> +    }
> +  if (f (0x12345, a, b, c, d, e) != expected)
> +    __builtin_abort ();
> +}
> +
> +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = SAD_EXPR" "vect" { target vect_udot_qi } } } */
> +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target vect_sdot_hi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
> new file mode 100644
> index 00000000000..c2831fbcc8e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
> @@ -0,0 +1,95 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> +
> +#include "tree-vect.h"
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +f (SIGNEDNESS_1 int res,
> +   SIGNEDNESS_2 char *a,
> +   SIGNEDNESS_2 char *b,
> +   int step, int n)
> +{
> +  for (int i = 0; i < n; i++)
> +    {
> +      res += a[0] * b[0];
> +      res += a[1] * b[1];
> +      res += a[2] * b[2];
> +      res += a[3] * b[3];
> +      res += a[4] * b[4];
> +      res += a[5] * b[5];
> +      res += a[6] * b[6];
> +      res += a[7] * b[7];
> +      res += a[8] * b[8];
> +      res += a[9] * b[9];
> +      res += a[10] * b[10];
> +      res += a[11] * b[11];
> +      res += a[12] * b[12];
> +      res += a[13] * b[13];
> +      res += a[14] * b[14];
> +      res += a[15] * b[15];
> +
> +      a += step;
> +      b += step;
> +    }
> +
> +  return res;
> +}
> +
> +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
> +#define OFFSET 20
> +
> +int
> +main (void)
> +{
> +  check_vect ();
> +
> +  SIGNEDNESS_2 char a[100], b[100];
> +  int expected = 0x12345;
> +  int step = 16;
> +  int n = 2;
> +  int t = 0;
> +
> +  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
> +    {
> +      a[i] = BASE + i * 5;
> +      b[i] = BASE + OFFSET + i * 4;
> +      asm volatile ("" ::: "memory");
> +    }
> +
> +  for (int i = 0; i < n; i++)
> +    {
> +      asm volatile ("" ::: "memory");
> +      expected += a[t + 0] * b[t + 0];
> +      expected += a[t + 1] * b[t + 1];
> +      expected += a[t + 2] * b[t + 2];
> +      expected += a[t + 3] * b[t + 3];
> +      expected += a[t + 4] * b[t + 4];
> +      expected += a[t + 5] * b[t + 5];
> +      expected += a[t + 6] * b[t + 6];
> +      expected += a[t + 7] * b[t + 7];
> +      expected += a[t + 8] * b[t + 8];
> +      expected += a[t + 9] * b[t + 9];
> +      expected += a[t + 10] * b[t + 10];
> +      expected += a[t + 11] * b[t + 11];
> +      expected += a[t + 12] * b[t + 12];
> +      expected += a[t + 13] * b[t + 13];
> +      expected += a[t + 14] * b[t + 14];
> +      expected += a[t + 15] * b[t + 15];
> +      t += step;
> +    }
> +
> +  if (f (0x12345, a, b, step, n) != expected)
> +    __builtin_abort ();
> +}
> +
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 16 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
> new file mode 100644
> index 00000000000..4114264a364
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
> @@ -0,0 +1,67 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> +
> +#include "tree-vect.h"
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +f (SIGNEDNESS_1 int res,
> +   SIGNEDNESS_2 char *a,
> +   SIGNEDNESS_2 char *b,
> +   int n)
> +{
> +  for (int i = 0; i < n; i++)
> +    {
> +      res += a[5 * i + 0] * b[5 * i + 0];
> +      res += a[5 * i + 1] * b[5 * i + 1];
> +      res += a[5 * i + 2] * b[5 * i + 2];
> +      res += a[5 * i + 3] * b[5 * i + 3];
> +      res += a[5 * i + 4] * b[5 * i + 4];
> +    }
> +
> +  return res;
> +}
> +
> +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
> +#define OFFSET 20
> +
> +int
> +main (void)
> +{
> +  check_vect ();
> +
> +  SIGNEDNESS_2 char a[100], b[100];
> +  int expected = 0x12345;
> +  int n = 18;
> +
> +  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
> +    {
> +      a[i] = BASE + i * 5;
> +      b[i] = BASE + OFFSET + i * 4;
> +      asm volatile ("" ::: "memory");
> +    }
> +
> +  for (int i = 0; i < n; i++)
> +    {
> +      asm volatile ("" ::: "memory");
> +      expected += a[5 * i + 0] * b[5 * i + 0];
> +      expected += a[5 * i + 1] * b[5 * i + 1];
> +      expected += a[5 * i + 2] * b[5 * i + 2];
> +      expected += a[5 * i + 3] * b[5 * i + 3];
> +      expected += a[5 * i + 4] * b[5 * i + 4];
> +    }
> +
> +  if (f (0x12345, a, b, n) != expected)
> +    __builtin_abort ();
> +}
> +
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 5 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
> new file mode 100644
> index 00000000000..2cdecc36d16
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
> @@ -0,0 +1,79 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> +
> +#include "tree-vect.h"
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +f (SIGNEDNESS_1 int res,
> +   SIGNEDNESS_2 short *a,
> +   SIGNEDNESS_2 short *b,
> +   int step, int n)
> +{
> +  for (int i = 0; i < n; i++)
> +    {
> +      res += a[0] * b[0];
> +      res += a[1] * b[1];
> +      res += a[2] * b[2];
> +      res += a[3] * b[3];
> +      res += a[4] * b[4];
> +      res += a[5] * b[5];
> +      res += a[6] * b[6];
> +      res += a[7] * b[7];
> +
> +      a += step;
> +      b += step;
> +    }
> +
> +  return res;
> +}
> +
> +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)
> +#define OFFSET 20
> +
> +int
> +main (void)
> +{
> +  check_vect ();
> +
> +  SIGNEDNESS_2 short a[100], b[100];
> +  int expected = 0x12345;
> +  int step = 8;
> +  int n = 2;
> +  int t = 0;
> +
> +  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
> +    {
> +      a[i] = BASE + i * 5;
> +      b[i] = BASE + OFFSET + i * 4;
> +      asm volatile ("" ::: "memory");
> +    }
> +
> +  for (int i = 0; i < n; i++)
> +    {
> +      asm volatile ("" ::: "memory");
> +      expected += a[t + 0] * b[t + 0];
> +      expected += a[t + 1] * b[t + 1];
> +      expected += a[t + 2] * b[t + 2];
> +      expected += a[t + 3] * b[t + 3];
> +      expected += a[t + 4] * b[t + 4];
> +      expected += a[t + 5] * b[t + 5];
> +      expected += a[t + 6] * b[t + 6];
> +      expected += a[t + 7] * b[t + 7];
> +      t += step;
> +    }
> +
> +  if (f (0x12345, a, b, step, n) != expected)
> +    __builtin_abort ();
> +}
> +
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 8 "vect"  { target vect_sdot_hi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
> new file mode 100644
> index 00000000000..32c0f30c77b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
> @@ -0,0 +1,63 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> +
> +#include "tree-vect.h"
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +f (SIGNEDNESS_1 int res,
> +   SIGNEDNESS_2 short *a,
> +   SIGNEDNESS_2 short *b,
> +   int n)
> +{
> +  for (int i = 0; i < n; i++)
> +    {
> +      res += a[3 * i + 0] * b[3 * i + 0];
> +      res += a[3 * i + 1] * b[3 * i + 1];
> +      res += a[3 * i + 2] * b[3 * i + 2];
> +    }
> +
> +  return res;
> +}
> +
> +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)
> +#define OFFSET 20
> +
> +int
> +main (void)
> +{
> +  check_vect ();
> +
> +  SIGNEDNESS_2 short a[100], b[100];
> +  int expected = 0x12345;
> +  int n = 18;
> +
> +  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
> +    {
> +      a[i] = BASE + i * 5;
> +      b[i] = BASE + OFFSET + i * 4;
> +      asm volatile ("" ::: "memory");
> +    }
> +
> +  for (int i = 0; i < n; i++)
> +    {
> +      asm volatile ("" ::: "memory");
> +      expected += a[3 * i + 0] * b[3 * i + 0];
> +      expected += a[3 * i + 1] * b[3 * i + 1];
> +      expected += a[3 * i + 2] * b[3 * i + 2];
> +    }
> +
> +  if (f (0x12345, a, b, n) != expected)
> +    __builtin_abort ();
> +}
> +
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 3 "vect"  { target vect_sdot_hi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
> new file mode 100644
> index 00000000000..e17d6291f75
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
> @@ -0,0 +1,35 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-do compile } */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> +
> +#include "tree-vect.h"
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +f (SIGNEDNESS_1 int res0,
> +   SIGNEDNESS_1 int res1,
> +   SIGNEDNESS_1 int res2,
> +   SIGNEDNESS_1 int res3,
> +   SIGNEDNESS_2 short *a,
> +   SIGNEDNESS_2 short *b)
> +{
> +  for (int i = 0; i < 64; i += 4)
> +    {
> +      res0 += a[i + 0] * b[i + 0];
> +      res1 += a[i + 1] * b[i + 1];
> +      res2 += a[i + 2] * b[i + 2];
> +      res3 += a[i + 3] * b[i + 3];
> +    }
> +
> +  return res0 ^ res1 ^ res2 ^ res3;
> +}
> +
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> +/* { dg-final { scan-tree-dump-not "vectorizing stmts using SLP" "vect" } } */
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index e0561feddce..6d91665a341 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -5324,8 +5324,6 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
>    if (!gimple_extract_op (orig_stmt_info->stmt, &op))
>      gcc_unreachable ();
>
> -  bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
> -
>    if (reduction_type == EXTRACT_LAST_REDUCTION)
>      /* No extra instructions are needed in the prologue.  The loop body
>         operations are costed in vectorizable_condition.  */
> @@ -5360,12 +5358,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
>            initial result of the data reduction, initial value of the index
>            reduction.  */
>         prologue_stmts = 4;
> -      else if (emulated_mixed_dot_prod)
> -       /* We need the initial reduction value and two invariants:
> -          one that contains the minimum signed value and one that
> -          contains half of its negative.  */
> -       prologue_stmts = 3;
>        else
> +       /* We need the initial reduction value.  */
>         prologue_stmts = 1;
>        prologue_cost += record_stmt_cost (cost_vec, prologue_stmts,
>                                          scalar_to_vec, stmt_info, 0,
> @@ -7466,7 +7460,7 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
>        vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
>        unsigned nvectors;
>
> -      if (slp_node)
> +      if (slp_node && SLP_TREE_LANES (slp_node) > 1)

Hmm, that looks wrong.  It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off
instead, which is bad.

>         nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
>        else
>         nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
> @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
>      }
>  }
>
> +/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
> +   the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC.
> +   Now there are three such kinds of operations: dot-prod/widen-sum/sad
> +   (sum-of-absolute-differences).
> +
> +   For a lane-reducing operation, the loop reduction path that it lies in
> +   may contain a normal operation, or another lane-reducing operation of
> +   different input type size, for example:
> +
> +     int sum = 0;
> +     for (i)
> +       {
> +         ...
> +         sum += d0[i] * d1[i];       // dot-prod <vector(16) char>
> +         sum += w[i];                // widen-sum <vector(16) char>
> +         sum += abs(s0[i] - s1[i]);  // sad <vector(8) short>
> +         sum += n[i];                // normal <vector(4) int>
> +         ...
> +       }
> +
> +   Vectorization factor is essentially determined by the operation whose
> +   input vectype has the most lanes ("vector(16) char" in the example),
> +   while we need to choose the input vectype with the least lanes
> +   ("vector(4) int" in the example) for the reduction PHI statement.  */
> +
> +bool
> +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> +                           slp_tree slp_node, stmt_vector_for_cost *cost_vec)
> +{
> +  gimple *stmt = stmt_info->stmt;
> +
> +  if (!lane_reducing_stmt_p (stmt))
> +    return false;
> +
> +  tree type = TREE_TYPE (gimple_assign_lhs (stmt));
> +
> +  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
> +    return false;
> +
> +  /* Do not try to vectorize bit-precision reductions.  */
> +  if (!type_has_mode_precision_p (type))
> +    return false;
> +
> +  if (!slp_node)
> +    return false;
> +
> +  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
> +    {
> +      stmt_vec_info def_stmt_info;
> +      slp_tree slp_op;
> +      tree op;
> +      tree vectype;
> +      enum vect_def_type dt;
> +
> +      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
> +                              &slp_op, &dt, &vectype, &def_stmt_info))
> +       {
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                            "use not simple.\n");
> +         return false;
> +       }
> +
> +      if (!vectype)
> +       {
> +         vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op),
> +                                                slp_op);
> +         if (!vectype)
> +           return false;
> +       }
> +
> +      if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))
> +       {
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                            "incompatible vector types for invariants\n");
> +         return false;
> +       }
> +
> +      if (i == STMT_VINFO_REDUC_IDX (stmt_info))
> +       continue;
> +
> +      /* There should be at most one cycle def in the stmt.  */
> +      if (VECTORIZABLE_CYCLE_DEF (dt))
> +       return false;
> +    }
> +
> +  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
> +
> +  /* TODO: Support lane-reducing operation that does not directly participate
> +     in loop reduction. */
> +  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
> +    return false;
> +
> +  /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not
> +     recognized.  */
> +  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
> +  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
> +
> +  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> +  int ncopies_for_cost;
> +
> +  if (SLP_TREE_LANES (slp_node) > 1)
> +    {
> +      /* Now lane-reducing operations in a non-single-lane slp node should only
> +        come from the same loop reduction path.  */
> +      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
> +      ncopies_for_cost = 1;
> +    }
> +  else
> +    {
> +      ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);

OK, so the fact that the ops are lane-reducing means they effectively
change the VF for the result.  That's only possible as we tightly control
code generation and "adjust" to the expected VF (by inserting the copies
you mentioned above), but only up to the highest number of outputs
created in the reduction chain.  In that sense, instead of talking about and
recording "input vector types", wouldn't it make more sense to record the
effective vectorization factor for the reduction instance?  That VF would be
at most the loop's VF but could be as low as 1.  Once we have a
non-lane-reducing operation in the reduction chain it would always be equal
to the loop's VF.

ncopies would then always be determined by that reduction-instance VF and
the accumulator vector type (STMT_VINFO_VECTYPE).  This reduction-instance
VF would also trivially indicate the force-single-def-use-cycle case,
possibly simplifying the code?

> +      gcc_assert (ncopies_for_cost >= 1);
> +    }
> +
> +  if (vect_is_emulated_mixed_dot_prod (stmt_info))
> +    {
> +      /* We need extra two invariants: one that contains the minimum signed
> +        value and one that contains half of its negative.  */
> +      int prologue_stmts = 2;
> +      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
> +                                       scalar_to_vec, stmt_info, 0,
> +                                       vect_prologue);
> +      if (dump_enabled_p ())
> +       dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
> +                    "extra prologue_cost = %d .\n", cost);
> +
> +      /* Three dot-products and a subtraction.  */
> +      ncopies_for_cost *= 4;
> +    }
> +
> +  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
> +                   vect_body);
> +
> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> +    {
> +      enum tree_code code = gimple_assign_rhs_code (stmt);
> +      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
> +                                                 slp_node, code, type,
> +                                                 vectype_in);
> +    }
> +

Add a comment:

    /* Transform via vect_transform_reduction.  */

> +  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
> +  return true;
> +}
> +
>  /* Function vectorizable_reduction.
>
>     Check if STMT_INFO performs a reduction operation that can be vectorized.
> @@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>    if (!type_has_mode_precision_p (op.type))
>      return false;
>
> -  /* For lane-reducing ops we're reducing the number of reduction PHIs
> -     which means the only use of that may be in the lane-reducing operation.  */
> -  if (lane_reducing
> -      && reduc_chain_length != 1
> -      && !only_slp_reduc_chain)
> -    {
> -      if (dump_enabled_p ())
> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> -                        "lane-reducing reduction with extra stmts.\n");
> -      return false;
> -    }
> -
>    /* Lane-reducing ops also never can be used in a SLP reduction group
>       since we'll mix lanes belonging to different reductions.  But it's
>       OK to use them in a reduction chain or when the reduction group
> @@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>        && loop_vinfo->suggested_unroll_factor == 1)
>      single_defuse_cycle = true;
>
> -  if (single_defuse_cycle || lane_reducing)
> +  if (single_defuse_cycle && !lane_reducing)

If there's also a non-lane-reducing plus in the chain don't we have to
check for that reduction op?  So shouldn't it be
single_defuse_cycle && ... fact that we don't record
(non-lane-reducing op there) ...

>      {
>        gcc_assert (op.code != COND_EXPR);
>
> -      /* 4. Supportable by target?  */
> -      bool ok = true;
> -
> -      /* 4.1. check support for the operation in the loop
> +      /* 4. check support for the operation in the loop
>
>          This isn't necessary for the lane reduction codes, since they
>          can only be produced by pattern matching, and it's up to the
> @@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>          mixed-sign dot-products can be implemented using signed
>          dot-products.  */
>        machine_mode vec_mode = TYPE_MODE (vectype_in);
> -      if (!lane_reducing
> -         && !directly_supported_p (op.code, vectype_in, optab_vector))
> +      if (!directly_supported_p (op.code, vectype_in, optab_vector))
>          {
>            if (dump_enabled_p ())
>              dump_printf (MSG_NOTE, "op not supported by target.\n");
>           if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
>               || !vect_can_vectorize_without_simd_p (op.code))
> -           ok = false;
> +           single_defuse_cycle = false;
>           else
>             if (dump_enabled_p ())
>               dump_printf (MSG_NOTE, "proceeding using word mode.\n");
> @@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>             dump_printf (MSG_NOTE, "using word mode not possible.\n");
>           return false;
>         }
> -
> -      /* lane-reducing operations have to go through vect_transform_reduction.
> -         For the other cases try without the single cycle optimization.  */
> -      if (!ok)
> -       {
> -         if (lane_reducing)
> -           return false;
> -         else
> -           single_defuse_cycle = false;
> -       }
>      }
>    if (dump_enabled_p () && single_defuse_cycle)
>      dump_printf_loc (MSG_NOTE, vect_location,
> @@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>                      "multiple vectors to one in the loop body\n");
>    STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;
>
> -  /* If the reduction stmt is one of the patterns that have lane
> -     reduction embedded we cannot handle the case of ! single_defuse_cycle.  */
> -  if ((ncopies > 1 && ! single_defuse_cycle)
> -      && lane_reducing)
> -    {
> -      if (dump_enabled_p ())
> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> -                        "multi def-use cycle not possible for lane-reducing "
> -                        "reduction operation\n");
> -      return false;
> -    }
> +  /* For lane-reducing operation, the below processing related to single
> +     defuse-cycle will be done in its own vectorizable function.  One more
> +     thing to note is that the operation must not be involved in fold-left
> +     reduction.  */
> +  single_defuse_cycle &= !lane_reducing;
>
>    if (slp_node
> -      && !(!single_defuse_cycle
> -          && !lane_reducing
> -          && reduction_type != FOLD_LEFT_REDUCTION))
> +      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
>      for (i = 0; i < (int) op.num_ops; i++)
>        if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
>         {
> @@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>    vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
>                              reduction_type, ncopies, cost_vec);
>    /* Cost the reduction op inside the loop if transformed via
> -     vect_transform_reduction.  Otherwise this is costed by the
> -     separate vectorizable_* routines.  */
> -  if (single_defuse_cycle || lane_reducing)
> -    {
> -      int factor = 1;
> -      if (vect_is_emulated_mixed_dot_prod (stmt_info))
> -       /* Three dot-products and a subtraction.  */
> -       factor = 4;
> -      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
> -                       stmt_info, 0, vect_body);
> -    }
> +     vect_transform_reduction for non-lane-reducing operation.  Otherwise
> +     this is costed by the separate vectorizable_* routines.  */
> +  if (single_defuse_cycle)
> +    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
>
>    if (dump_enabled_p ()
>        && reduction_type == FOLD_LEFT_REDUCTION)
>      dump_printf_loc (MSG_NOTE, vect_location,
>                      "using an in-order (fold-left) reduction.\n");
>    STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
> -  /* All but single defuse-cycle optimized, lane-reducing and fold-left
> -     reductions go through their own vectorizable_* routines.  */
> -  if (!single_defuse_cycle
> -      && !lane_reducing
> -      && reduction_type != FOLD_LEFT_REDUCTION)
> +
> +  /* All but single defuse-cycle optimized and fold-left reductions go
> +     through their own vectorizable_* routines.  */
> +  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
>      {
>        stmt_vec_info tem
>         = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
> @@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>    bool lane_reducing = lane_reducing_op_p (code);
>    gcc_assert (single_defuse_cycle || lane_reducing);
>
> +  if (lane_reducing)
> +    {
> +      /* The last operand of lane-reducing op is for reduction.  */
> +      gcc_assert (reduc_index == (int) op.num_ops - 1);
> +
> +      /* Now all lane-reducing ops are covered by some slp node.  */
> +      gcc_assert (slp_node);
> +    }
> +
>    /* Create the destination vector  */
>    tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
>    tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
> @@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>                          reduc_index == 2 ? op.ops[2] : NULL_TREE,
>                          &vec_oprnds[2]);
>      }
> +  else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1
> +          && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ())
> +    {
> +      /* For a lane-reducing op covered by a single-lane slp node, the input
> +        vectype of the reduction PHI determines the number of copies of the
> +        vectorized def-use cycle, which might be more than the effective
> +        number of copies of the vectorized lane-reducing statement.  The gap
> +        could be complemented by generating extra trivial pass-through
> +        copies.  For example:
> +
> +          int sum = 0;
> +          for (i)
> +            {
> +              sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
> +              sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
> +              sum += n[i];               // normal <vector(4) int>
> +            }
> +
> +        The vector size is 128-bit, vectorization factor is 16.  Reduction
> +        statements would be transformed as:
> +
> +          vector<4> int sum_v0 = { 0, 0, 0, 0 };
> +          vector<4> int sum_v1 = { 0, 0, 0, 0 };
> +          vector<4> int sum_v2 = { 0, 0, 0, 0 };
> +          vector<4> int sum_v3 = { 0, 0, 0, 0 };
> +
> +          for (i / 16)
> +            {
> +              sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
> +              sum_v1 = sum_v1;  // copy
> +              sum_v2 = sum_v2;  // copy
> +              sum_v3 = sum_v3;  // copy
> +
> +              sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
> +              sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
> +              sum_v2 = sum_v2;  // copy
> +              sum_v3 = sum_v3;  // copy
> +
> +              sum_v0 += n_v0[i: 0  ~ 3 ];
> +              sum_v1 += n_v1[i: 4  ~ 7 ];
> +              sum_v2 += n_v2[i: 8  ~ 11];
> +              sum_v3 += n_v3[i: 12 ~ 15];
> +            }
> +       */
> +      unsigned using_ncopies = vec_oprnds[0].length ();
> +      unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
> +

assert reduc_ncopies >= using_ncopies?  Maybe assert
reduc_index == op.num_ops - 1 given you use one above
and the other below?  Or simply iterate till op.num_ops
and skip i == reduc_index.

> +      for (unsigned i = 0; i < op.num_ops - 1; i++)
> +       {
> +         gcc_assert (vec_oprnds[i].length () == using_ncopies);
> +         vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
> +       }
> +    }
>
>    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
>    unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
> @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>      {
>        gimple *new_stmt;
>        tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
> -      if (masked_loop_p && !mask_by_cond_expr)
> +
> +      if (!vop[0] || !vop[1])
> +       {
> +         tree reduc_vop = vec_oprnds[reduc_index][i];
> +
> +         /* Insert trivial copy if no need to generate vectorized
> +            statement.  */
> +         gcc_assert (reduc_vop);
> +
> +         new_stmt = gimple_build_assign (vec_dest, reduc_vop);
> +         new_temp = make_ssa_name (vec_dest, new_stmt);
> +         gimple_set_lhs (new_stmt, new_temp);
> +         vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);

I think you could simply do

               slp_node->push_vec_def (reduc_vop);
               continue;

without any code generation.

> +       }
> +      else if (masked_loop_p && !mask_by_cond_expr)
>         {
>           /* No conditional ifns have been defined for lane-reducing op
>              yet.  */
> @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>
>           if (masked_loop_p && mask_by_cond_expr)
>             {
> +             tree stmt_vectype_in = vectype_in;
> +             unsigned nvectors = vec_num * ncopies;
> +
> +             if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
> +               {
> +                 /* Input vectype of the reduction PHI may be defferent from

different

> +                    that of lane-reducing operation.  */
> +                 stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> +                 nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);

I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS.

Otherwise the patch looks good to me.

Richard.

> +               }
> +
>               tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks,
> -                                             vec_num * ncopies, vectype_in, i);
> +                                             nvectors, stmt_vectype_in, i);
>               build_vect_cond_expr (code, vop, mask, gsi);
>             }
>
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index ca6052662a3..1b73ef01ade 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -13350,6 +13350,8 @@ vect_analyze_stmt (vec_info *vinfo,
>                                       NULL, NULL, node, cost_vec)
>           || vectorizable_load (vinfo, stmt_info, NULL, NULL, node, cost_vec)
>           || vectorizable_store (vinfo, stmt_info, NULL, NULL, node, cost_vec)
> +         || vectorizable_lane_reducing (as_a <loop_vec_info> (vinfo),
> +                                        stmt_info, node, cost_vec)
>           || vectorizable_reduction (as_a <loop_vec_info> (vinfo), stmt_info,
>                                      node, node_instance, cost_vec)
>           || vectorizable_induction (as_a <loop_vec_info> (vinfo), stmt_info,
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 60224f4e284..94736736dcc 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -2455,6 +2455,8 @@ extern loop_vec_info vect_create_loop_vinfo (class loop *, vec_info_shared *,
>  extern bool vectorizable_live_operation (vec_info *, stmt_vec_info,
>                                          slp_tree, slp_instance, int,
>                                          bool, stmt_vector_for_cost *);
> +extern bool vectorizable_lane_reducing (loop_vec_info, stmt_vec_info,
> +                                       slp_tree, stmt_vector_for_cost *);
>  extern bool vectorizable_reduction (loop_vec_info, stmt_vec_info,
>                                     slp_tree, slp_instance,
>                                     stmt_vector_for_cost *);
> --
> 2.17.1

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
  2024-06-23 15:10   ` Feng Xue OS
@ 2024-06-24 12:58     ` Richard Biener
  2024-06-25  9:32       ` Feng Xue OS
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Biener @ 2024-06-24 12:58 UTC (permalink / raw)
  To: Feng Xue OS; +Cc: gcc-patches

On Sun, Jun 23, 2024 at 5:10 PM Feng Xue OS <fxue@os.amperecomputing.com> wrote:
>
> >> -      if (slp_node)
> >> +      if (slp_node && SLP_TREE_LANES (slp_node) > 1)
> >
> > Hmm, that looks wrong.  It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off
> > instead, which is bad.
> >
> >>         nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
> >>        else
> >>         nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
> >> @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
> >>      }
> >>  }
> >>
> >> +/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
> >> +   the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC.
> >> +   Now there are three such kinds of operations: dot-prod/widen-sum/sad
> >> +   (sum-of-absolute-differences).
> >> +
> >> +   For a lane-reducing operation, the loop reduction path that it lies in
> >> +   may contain a normal operation, or another lane-reducing operation of
> >> +   different input type size, for example:
> >> +
> >> +     int sum = 0;
> >> +     for (i)
> >> +       {
> >> +         ...
> >> +         sum += d0[i] * d1[i];       // dot-prod <vector(16) char>
> >> +         sum += w[i];                // widen-sum <vector(16) char>
> >> +         sum += abs(s0[i] - s1[i]);  // sad <vector(8) short>
> >> +         sum += n[i];                // normal <vector(4) int>
> >> +         ...
> >> +       }
> >> +
> >> +   Vectorization factor is essentially determined by the operation whose
> >> +   input vectype has the most lanes ("vector(16) char" in the example),
> >> +   while we need to choose the input vectype with the least lanes
> >> +   ("vector(4) int" in the example) for the reduction PHI statement.  */
> >> +
> >> +bool
> >> +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> >> +                           slp_tree slp_node, stmt_vector_for_cost *cost_vec)
> >> +{
> >> +  gimple *stmt = stmt_info->stmt;
> >> +
> >> +  if (!lane_reducing_stmt_p (stmt))
> >> +    return false;
> >> +
> >> +  tree type = TREE_TYPE (gimple_assign_lhs (stmt));
> >> +
> >> +  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
> >> +    return false;
> >> +
> >> +  /* Do not try to vectorize bit-precision reductions.  */
> >> +  if (!type_has_mode_precision_p (type))
> >> +    return false;
> >> +
> >> +  if (!slp_node)
> >> +    return false;
> >> +
> >> +  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
> >> +    {
> >> +      stmt_vec_info def_stmt_info;
> >> +      slp_tree slp_op;
> >> +      tree op;
> >> +      tree vectype;
> >> +      enum vect_def_type dt;
> >> +
> >> +      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
> >> +                              &slp_op, &dt, &vectype, &def_stmt_info))
> >> +       {
> >> +         if (dump_enabled_p ())
> >> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> +                            "use not simple.\n");
> >> +         return false;
> >> +       }
> >> +
> >> +      if (!vectype)
> >> +       {
> >> +         vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op),
> >> +                                                slp_op);
> >> +         if (!vectype)
> >> +           return false;
> >> +       }
> >> +
> >> +      if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))
> >> +       {
> >> +         if (dump_enabled_p ())
> >> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> +                            "incompatible vector types for invariants\n");
> >> +         return false;
> >> +       }
> >> +
> >> +      if (i == STMT_VINFO_REDUC_IDX (stmt_info))
> >> +       continue;
> >> +
> >> +      /* There should be at most one cycle def in the stmt.  */
> >> +      if (VECTORIZABLE_CYCLE_DEF (dt))
> >> +       return false;
> >> +    }
> >> +
> >> +  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
> >> +
> >> +  /* TODO: Support lane-reducing operation that does not directly participate
> >> +     in loop reduction. */
> >> +  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
> >> +    return false;
> >> +
> >> +  /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not
> >> +     recognized.  */
> >> +  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
> >> +  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
> >> +
> >> +  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> >> +  int ncopies_for_cost;
> >> +
> >> +  if (SLP_TREE_LANES (slp_node) > 1)
> >> +    {
> >> +      /* Now lane-reducing operations in a non-single-lane slp node should only
> >> +        come from the same loop reduction path.  */
> >> +      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
> >> +      ncopies_for_cost = 1;
> >> +    }
> >> +  else
> >> +    {
> >> +      ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);
> >
> > OK, so the fact that the ops are lane-reducing means they effectively
> > change the VF for the result.  That's only possible as we tightly control
> > code generation and "adjust" to the expected VF (by inserting the copies
> > you mentioned above), but only up to the highest number of outputs
> > created in the reduction chain.  In that sense, instead of talking about and
> > recording "input vector types", wouldn't it make more sense to record the
> > effective vectorization factor for the reduction instance?  That VF would be
> > at most the loop's VF but could be as low as 1.  Once we have a
> > non-lane-reducing operation in the reduction chain it would always be equal
> > to the loop's VF.
> >
> > ncopies would then always be determined by that reduction instance VF and
> > the accumulator vector type (STMT_VINFO_VECTYPE).  This reduction
> > instance VF would also trivially indicate the force-single-def-use-cycle
> > case, possibly simplifying code?
>
> I tried to add such an effective VF, but the vectype_in is still needed in some
> scenarios, such as when checking whether a dot-prod stmt is emulated or not.
> The former can be deduced from the latter, so recording both seems
> redundant. Another consideration is that for a normal op, ncopies is
> determined from the type (STMT_VINFO_VECTYPE), but for a lane-reducing op
> it comes from the VF. So is there a better means to unify them?

AFAICS reductions are special in that, for the accumulation SSA cycle,
they do not adhere to the loop's VF but, as an optimization, can choose a
smaller one.  OTOH STMT_VINFO_VECTYPE is the vector type used for the
individual operations, which is adhered to even for lane-reducing ops -
those just may use a smaller VF, that of the reduction SSA cycle.

So what's redundant is STMT_VINFO_REDUC_VECTYPE_IN - or rather it's not
fully redundant, but needlessly replicated over all stmts participating
in the reduction.  Better to record the reduction VF once in the
reduc_info and use that (plus STMT_VINFO_VECTYPE) to compute the
effective ncopies for stmts in the reduction cycle.

At least that was my idea ...
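
Concretely, with a (hypothetical) per-reduction VF recorded once on the
reduc_info, the effective ncopies for any stmt in the reduction cycle
would be computable as in this sketch (REDUC_VF is a made-up accessor
for illustration, the other names exist today):

      poly_uint64 reduc_vf = REDUC_VF (reduc_info);    /* <= loop VF */
      /* The accumulator vector type of the stmt.  */
      tree vectype = STMT_VINFO_VECTYPE (stmt_info);
      unsigned ncopies = vect_get_num_vectors (reduc_vf, vectype);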

> >> +      gcc_assert (ncopies_for_cost >= 1);
> >> +    }
> >> +
> >> +  if (vect_is_emulated_mixed_dot_prod (stmt_info))
> >> +    {
> >> +      /* We need two extra invariants: one that contains the minimum signed
> >> +        value and one that contains half of its negative.  */
> >> +      int prologue_stmts = 2;
> >> +      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
> >> +                                       scalar_to_vec, stmt_info, 0,
> >> +                                       vect_prologue);
> >> +      if (dump_enabled_p ())
> >> +       dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
> >> +                    "extra prologue_cost = %d .\n", cost);
> >> +
> >> +      /* Three dot-products and a subtraction.  */
> >> +      ncopies_for_cost *= 4;
> >> +    }
> >> +
> >> +  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
> >> +                   vect_body);
> >> +
> >> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> >> +    {
> >> +      enum tree_code code = gimple_assign_rhs_code (stmt);
> >> +      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
> >> +                                                 slp_node, code, type,
> >> +                                                 vectype_in);
> >> +    }
> >> +
> >
> > Add a comment:
> >
> >     /* Transform via vect_transform_reduction.  */
> >
> >> +  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
> >> +  return true;
> >> +}
> >> +
> >>  /* Function vectorizable_reduction.
> >>
> >>     Check if STMT_INFO performs a reduction operation that can be vectorized.
> >> @@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >>    if (!type_has_mode_precision_p (op.type))
> >>      return false;
> >>
> >> -  /* For lane-reducing ops we're reducing the number of reduction PHIs
> >> -     which means the only use of that may be in the lane-reducing operation.  */
> >> -  if (lane_reducing
> >> -      && reduc_chain_length != 1
> >> -      && !only_slp_reduc_chain)
> >> -    {
> >> -      if (dump_enabled_p ())
> >> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> -                        "lane-reducing reduction with extra stmts.\n");
> >> -      return false;
> >> -    }
> >> -
> >>    /* Lane-reducing ops also never can be used in a SLP reduction group
> >>       since we'll mix lanes belonging to different reductions.  But it's
> >>       OK to use them in a reduction chain or when the reduction group
> >> @@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >>        && loop_vinfo->suggested_unroll_factor == 1)
> >>      single_defuse_cycle = true;
> >>
> >> -  if (single_defuse_cycle || lane_reducing)
> >> +  if (single_defuse_cycle && !lane_reducing)
> >
> > If there's also a non-lane-reducing plus in the chain don't we have to
> > check for that reduction op?  So shouldn't it be
> > single_defuse_cycle && ... fact that we don't record
> > (non-lane-reducing op there) ...
>
> I don't quite understand this point.  A non-lane-reducing op in the chain
> should be handled in its own vectorizable_xxx function, no? The below check
> is only for the first statement (vect_reduction_def) in the reduction.

Hmm.  So we have vectorizable_lane_reducing_* for the check on the
lane-reducing stmts and vectorizable_* for the !single-def-use stmts.  And
the following is then just for the case where there's a single def that's
not lane-reducing and we're forcing a single-def-use cycle and thus go via
vect_transform_reduction?

> >
> >>      {
> >>        gcc_assert (op.code != COND_EXPR);
> >>
> >> -      /* 4. Supportable by target?  */
> >> -      bool ok = true;
> >> -
> >> -      /* 4.1. check support for the operation in the loop
> >> +      /* 4. check support for the operation in the loop
> >>
> >>          This isn't necessary for the lane reduction codes, since they
> >>          can only be produced by pattern matching, and it's up to the
> >> @@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >>          mixed-sign dot-products can be implemented using signed
> >>          dot-products.  */
> >>        machine_mode vec_mode = TYPE_MODE (vectype_in);
> >> -      if (!lane_reducing
> >> -         && !directly_supported_p (op.code, vectype_in, optab_vector))
> >> +      if (!directly_supported_p (op.code, vectype_in, optab_vector))
> >>          {
> >>            if (dump_enabled_p ())
> >>              dump_printf (MSG_NOTE, "op not supported by target.\n");
> >>           if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
> >>               || !vect_can_vectorize_without_simd_p (op.code))
> >> -           ok = false;
> >> +           single_defuse_cycle = false;
> >>           else
> >>             if (dump_enabled_p ())
> >>               dump_printf (MSG_NOTE, "proceeding using word mode.\n");
> >> @@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >>             dump_printf (MSG_NOTE, "using word mode not possible.\n");
> >>           return false;
> >>         }
> >> -
> >> -      /* lane-reducing operations have to go through vect_transform_reduction.
> >> -         For the other cases try without the single cycle optimization.  */
> >> -      if (!ok)
> >> -       {
> >> -         if (lane_reducing)
> >> -           return false;
> >> -         else
> >> -           single_defuse_cycle = false;
> >> -       }
> >>      }
> >>    if (dump_enabled_p () && single_defuse_cycle)
> >>      dump_printf_loc (MSG_NOTE, vect_location,
> >> @@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >>                      "multiple vectors to one in the loop body\n");
> >>    STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;
> >>
> >> -  /* If the reduction stmt is one of the patterns that have lane
> >> -     reduction embedded we cannot handle the case of ! single_defuse_cycle.  */
> >> -  if ((ncopies > 1 && ! single_defuse_cycle)
> >> -      && lane_reducing)
> >> -    {
> >> -      if (dump_enabled_p ())
> >> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> -                        "multi def-use cycle not possible for lane-reducing "
> >> -                        "reduction operation\n");
> >> -      return false;
> >> -    }
> >> +  /* For a lane-reducing operation, the processing below related to the
> >> +     single defuse-cycle will be done in its own vectorizable function.
> >> +     One more thing to note is that the operation must not be involved
> >> +     in a fold-left reduction.  */
> >> +  single_defuse_cycle &= !lane_reducing;
> >>
> >>    if (slp_node
> >> -      && !(!single_defuse_cycle
> >> -          && !lane_reducing
> >> -          && reduction_type != FOLD_LEFT_REDUCTION))
> >> +      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
> >>      for (i = 0; i < (int) op.num_ops; i++)
> >>        if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
> >>         {
> >> @@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >>    vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
> >>                              reduction_type, ncopies, cost_vec);
> >>    /* Cost the reduction op inside the loop if transformed via
> >> -     vect_transform_reduction.  Otherwise this is costed by the
> >> -     separate vectorizable_* routines.  */
> >> -  if (single_defuse_cycle || lane_reducing)
> >> -    {
> >> -      int factor = 1;
> >> -      if (vect_is_emulated_mixed_dot_prod (stmt_info))
> >> -       /* Three dot-products and a subtraction.  */
> >> -       factor = 4;
> >> -      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
> >> -                       stmt_info, 0, vect_body);
> >> -    }
> >> +     vect_transform_reduction for non-lane-reducing operation.  Otherwise
> >> +     this is costed by the separate vectorizable_* routines.  */
> >> +  if (single_defuse_cycle)
> >> +    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
> >>
> >>    if (dump_enabled_p ()
> >>        && reduction_type == FOLD_LEFT_REDUCTION)
> >>      dump_printf_loc (MSG_NOTE, vect_location,
> >>                      "using an in-order (fold-left) reduction.\n");
> >>    STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
> >> -  /* All but single defuse-cycle optimized, lane-reducing and fold-left
> >> -     reductions go through their own vectorizable_* routines.  */
> >> -  if (!single_defuse_cycle
> >> -      && !lane_reducing
> >> -      && reduction_type != FOLD_LEFT_REDUCTION)
> >> +
> >> +  /* All but single defuse-cycle optimized and fold-left reductions go
> >> +     through their own vectorizable_* routines.  */
> >> +  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
> >>      {
> >>        stmt_vec_info tem
> >>         = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
> >> @@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >>    bool lane_reducing = lane_reducing_op_p (code);
> >>    gcc_assert (single_defuse_cycle || lane_reducing);
> >>
> >> +  if (lane_reducing)
> >> +    {
> >> +      /* The last operand of lane-reducing op is for reduction.  */
> >> +      gcc_assert (reduc_index == (int) op.num_ops - 1);
> >> +
> >> +      /* Now all lane-reducing ops are covered by some slp node.  */
> >> +      gcc_assert (slp_node);
> >> +    }
> >> +
> >>    /* Create the destination vector  */
> >>    tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
> >>    tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
> >> @@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >>                          reduc_index == 2 ? op.ops[2] : NULL_TREE,
> >>                          &vec_oprnds[2]);
> >>      }
> >> +  else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1
> >> +          && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ())
> >> +    {
> >> +      /* For a lane-reducing op covered by a single-lane slp node, the input
> >> +        vectype of the reduction PHI determines the number of vectorized
> >> +        def-use cycles, which may be more than the effective number of copies
> >> +        of the vectorized lane-reducing statement.  The gap could be filled
> >> +        by generating extra trivial pass-through copies.  For example:
> >> +
> >> +          int sum = 0;
> >> +          for (i)
> >> +            {
> >> +              sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
> >> +              sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
> >> +              sum += n[i];               // normal <vector(4) int>
> >> +            }
> >> +
> >> +        The vector size is 128-bit and the vectorization factor is 16.
> >> +        Reduction statements would be transformed as:
> >> +
> >> +          vector<4> int sum_v0 = { 0, 0, 0, 0 };
> >> +          vector<4> int sum_v1 = { 0, 0, 0, 0 };
> >> +          vector<4> int sum_v2 = { 0, 0, 0, 0 };
> >> +          vector<4> int sum_v3 = { 0, 0, 0, 0 };
> >> +
> >> +          for (i / 16)
> >> +            {
> >> +              sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
> >> +              sum_v1 = sum_v1;  // copy
> >> +              sum_v2 = sum_v2;  // copy
> >> +              sum_v3 = sum_v3;  // copy
> >> +
> >> +              sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
> >> +              sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
> >> +              sum_v2 = sum_v2;  // copy
> >> +              sum_v3 = sum_v3;  // copy
> >> +
> >> +              sum_v0 += n_v0[i: 0  ~ 3 ];
> >> +              sum_v1 += n_v1[i: 4  ~ 7 ];
> >> +              sum_v2 += n_v2[i: 8  ~ 11];
> >> +              sum_v3 += n_v3[i: 12 ~ 15];
> >> +            }
> >> +       */
> >> +      unsigned using_ncopies = vec_oprnds[0].length ();
> >> +      unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
> >> +
> >
> > assert reduc_ncopies >= using_ncopies?  Maybe assert
> > reduc_index == op.num_ops - 1 given you use one above
> > and the other below?  Or simply iterate till op.num_ops
> > and skip i == reduc_index.
> >
> >> +      for (unsigned i = 0; i < op.num_ops - 1; i++)
> >> +       {
> >> +         gcc_assert (vec_oprnds[i].length () == using_ncopies);
> >> +         vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
> >> +       }
> >> +    }
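
I.e. the padding loop could then skip the reduction operand explicitly,
along the lines of this sketch:

      for (unsigned i = 0; i < op.num_ops; i++)
       {
         if ((int) i == reduc_index)
           continue;
         gcc_assert (vec_oprnds[i].length () == using_ncopies);
         vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
       }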
> >>
> >>    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
> >>    unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
> >> @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >>      {
> >>        gimple *new_stmt;
> >>        tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
> >> -      if (masked_loop_p && !mask_by_cond_expr)
> >> +
> >> +      if (!vop[0] || !vop[1])
> >> +       {
> >> +         tree reduc_vop = vec_oprnds[reduc_index][i];
> >> +
> >> +         /* Insert trivial copy if no need to generate vectorized
> >> +            statement.  */
> >> +         gcc_assert (reduc_vop);
> >> +
> >> +         new_stmt = gimple_build_assign (vec_dest, reduc_vop);
> >> +         new_temp = make_ssa_name (vec_dest, new_stmt);
> >> +         gimple_set_lhs (new_stmt, new_temp);
> >> +         vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> >
> > I think you could simply do
> >
> >                slp_node->push_vec_def (reduc_vop);
> >                continue;
> >
> > without any code generation.
> >
>
> OK, that would be easy. Here comes another question: this patch assumes a
> lane-reducing op would always be contained in a slp node, since the
> single-lane slp node feature has been enabled. But I got some regressions
> when I enforced that constraint in the lane-reducing op check. Those cases
> were found to be unvectorizable with single-lane slp, so is this not what
> we want, and does it need to be fixed?

Yes, in the end we need to chase down all unsupported cases and fix them
(there are known issues with load permutes; I'm working on that - hopefully
when I find a continuous stretch of time...).
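
FWIW, with the push_vec_def suggestion the pass-through case in the
transform loop would reduce to something like this (untested sketch):

      if (!vop[0] || !vop[1])
       {
         /* No vectorized stmt is needed; forward the reduction input.  */
         tree reduc_vop = vec_oprnds[reduc_index][i];
         gcc_assert (reduc_vop);
         slp_node->push_vec_def (reduc_vop);
         continue;
       }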

>
> >> +       }
> >> +      else if (masked_loop_p && !mask_by_cond_expr)
> >>         {
> >>           /* No conditional ifns have been defined for lane-reducing op
> >>              yet.  */
> >> @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >>
> >>           if (masked_loop_p && mask_by_cond_expr)
> >>             {
> >> +             tree stmt_vectype_in = vectype_in;
> >> +             unsigned nvectors = vec_num * ncopies;
> >> +
> >> +             if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
> >> +               {
> >> +                 /* Input vectype of the reduction PHI may be defferent from
> >
> > different
> >
> >> +                    that of lane-reducing operation.  */
> >> +                 stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> >> +                 nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);
> >
> > I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS.
>
> To partially vectorize a dot_prod<16 x char> with 128-bit vector width,
> should we pass (nvectors=4, vectype=<4 x int>) instead of (nvectors=1,
> vectype=<16 x char>) to vect_get_loop_mask?

Probably - it depends on the vectorization factor.  What I wanted to
point out is that vec_num (likely from SLP_TREE_NUMBER_OF_VEC_STMTS) is
wrong.  The place setting SLP_TREE_NUMBER_OF_VEC_STMTS needs to be
adjusted, or we should forgo it entirely (but that's possibly a
post-only-SLP cleanup to be done).
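
For the dot-product example that would mean shaping the masks by the
accumulator vectype, along the lines of (sketch only, assuming the
reduc_info is at hand):

      /* VF = 16, accumulator vector(4) int: four masks shaped for
        vector(4) int rather than one mask shaped for vector(16) char.  */
      tree phi_vectype = STMT_VINFO_VECTYPE (reduc_info);
      unsigned nvectors = vect_get_num_copies (loop_vinfo, phi_vectype);
      tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks, nvectors,
                                     phi_vectype, i);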

See vect_slp_analyze_node_operations_1 where that's computed.  For
reductions it's probably not quite right (and we might have latent issues
like those you are "fixing" with code like the above).  The order we
analyze stmts might also not be optimal for reductions with SLP - in fact,
given that stmt analysis relies on a fixed VF, it would probably make
sense to determine the reduction VF in advance as well.  But again this
sounds like post-only-SLP cleanup opportunities.

In the end I might suggest to always use the reduction VF and vectype to
determine the number of vector stmts rather than computing ncopies/vec_num
separately.

Richard.

> Thanks,
> Feng
>
>
> ________________________________________
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Thursday, June 20, 2024 8:26 PM
> To: Feng Xue OS
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
>
> On Sun, Jun 16, 2024 at 9:31 AM Feng Xue OS <fxue@os.amperecomputing.com> wrote:
> >
> > For lane-reducing operation(dot-prod/widen-sum/sad) in loop reduction, current
> > vectorizer could only handle the pattern if the reduction chain does not
> > contain other operation, no matter the other is normal or lane-reducing.
> >
> > Actually, to allow multiple arbitrary lane-reducing operations, we need to
> > support vectorization of loop reduction chain with mixed input vectypes. Since
> > lanes of vectype may vary with operation, the effective ncopies of vectorized
> > statements for operation also may not be same to each other, this causes
> > mismatch on vectorized def-use cycles. A simple way is to align all operations
> > with the one that has the most ncopies, the gap could be complemented by
> > generating extra trivial pass-through copies. For example:
> >
> >    int sum = 0;
> >    for (i)
> >      {
> >        sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
> >        sum += w[i];               // widen-sum <vector(16) char>
> >        sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
> >        sum += n[i];               // normal <vector(4) int>
> >      }
> >
> > The vector size is 128-bit and the vectorization factor is 16. Reduction statements
> > would be transformed as:
> >
> >    vector<4> int sum_v0 = { 0, 0, 0, 0 };
> >    vector<4> int sum_v1 = { 0, 0, 0, 0 };
> >    vector<4> int sum_v2 = { 0, 0, 0, 0 };
> >    vector<4> int sum_v3 = { 0, 0, 0, 0 };
> >
> >    for (i / 16)
> >      {
> >        sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
> >        sum_v1 = sum_v1;  // copy
> >        sum_v2 = sum_v2;  // copy
> >        sum_v3 = sum_v3;  // copy
> >
> >        sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
> >        sum_v1 = sum_v1;  // copy
> >        sum_v2 = sum_v2;  // copy
> >        sum_v3 = sum_v3;  // copy
> >
> >        sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
> >        sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
> >        sum_v2 = sum_v2;  // copy
> >        sum_v3 = sum_v3;  // copy
> >
> >        sum_v0 += n_v0[i: 0  ~ 3 ];
> >        sum_v1 += n_v1[i: 4  ~ 7 ];
> >        sum_v2 += n_v2[i: 8  ~ 11];
> >        sum_v3 += n_v3[i: 12 ~ 15];
> >      }
> >
> > Thanks,
> > Feng
> >
> > ---
> > gcc/
> >         PR tree-optimization/114440
> >         * tree-vectorizer.h (vectorizable_lane_reducing): New function
> >         declaration.
> >         * tree-vect-stmts.cc (vect_analyze_stmt): Call new function
> >         vectorizable_lane_reducing to analyze lane-reducing operation.
> >         * tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation
> >         code related to emulated_mixed_dot_prod.
> >         (vect_reduction_update_partial_vector_usage): Compute ncopies as the
> >         original means for single-lane slp node.
> >         (vectorizable_lane_reducing): New function.
> >         (vectorizable_reduction): Allow multiple lane-reducing operations in
> >         loop reduction. Move some original lane-reducing related code to
> >         vectorizable_lane_reducing.
> >         (vect_transform_reduction): Extend transformation to support reduction
> >         statements with mixed input vectypes.
> >
> > gcc/testsuite/
> >         PR tree-optimization/114440
> >         * gcc.dg/vect/vect-reduc-chain-1.c
> >         * gcc.dg/vect/vect-reduc-chain-2.c
> >         * gcc.dg/vect/vect-reduc-chain-3.c
> >         * gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
> >         * gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
> >         * gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
> >         * gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
> >         * gcc.dg/vect/vect-reduc-dot-slp-1.c
> > ---
> >  .../gcc.dg/vect/vect-reduc-chain-1.c          |  62 ++++
> >  .../gcc.dg/vect/vect-reduc-chain-2.c          |  77 +++++
> >  .../gcc.dg/vect/vect-reduc-chain-3.c          |  66 ++++
> >  .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c  |  95 +++++
> >  .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c  |  67 ++++
> >  .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c  |  79 +++++
> >  .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c  |  63 ++++
> >  .../gcc.dg/vect/vect-reduc-dot-slp-1.c        |  35 ++
> >  gcc/tree-vect-loop.cc                         | 324 ++++++++++++++----
> >  gcc/tree-vect-stmts.cc                        |   2 +
> >  gcc/tree-vectorizer.h                         |   2 +
> >  11 files changed, 802 insertions(+), 70 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
> >
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
> > new file mode 100644
> > index 00000000000..04bfc419dbd
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
> > @@ -0,0 +1,62 @@
> > +/* Disabling epilogues until we find a better way to deal with scans.  */
> > +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> > +/* { dg-require-effective-target vect_int } */
> > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> > +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> > +
> > +#include "tree-vect.h"
> > +
> > +#define N 50
> > +
> > +#ifndef SIGNEDNESS_1
> > +#define SIGNEDNESS_1 signed
> > +#define SIGNEDNESS_2 signed
> > +#endif
> > +
> > +SIGNEDNESS_1 int __attribute__ ((noipa))
> > +f (SIGNEDNESS_1 int res,
> > +   SIGNEDNESS_2 char *restrict a,
> > +   SIGNEDNESS_2 char *restrict b,
> > +   SIGNEDNESS_2 char *restrict c,
> > +   SIGNEDNESS_2 char *restrict d,
> > +   SIGNEDNESS_1 int *restrict e)
> > +{
> > +  for (int i = 0; i < N; ++i)
> > +    {
> > +      res += a[i] * b[i];
> > +      res += c[i] * d[i];
> > +      res += e[i];
> > +    }
> > +  return res;
> > +}
> > +
> > +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
> > +#define OFFSET 20
> > +
> > +int
> > +main (void)
> > +{
> > +  check_vect ();
> > +
> > +  SIGNEDNESS_2 char a[N], b[N];
> > +  SIGNEDNESS_2 char c[N], d[N];
> > +  SIGNEDNESS_1 int e[N];
> > +  int expected = 0x12345;
> > +  for (int i = 0; i < N; ++i)
> > +    {
> > +      a[i] = BASE + i * 5;
> > +      b[i] = BASE + OFFSET + i * 4;
> > +      c[i] = BASE + i * 2;
> > +      d[i] = BASE + OFFSET + i * 3;
> > +      e[i] = i;
> > +      asm volatile ("" ::: "memory");
> > +      expected += a[i] * b[i];
> > +      expected += c[i] * d[i];
> > +      expected += e[i];
> > +    }
> > +  if (f (0x12345, a, b, c, d, e) != expected)
> > +    __builtin_abort ();
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> > +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 2 "vect" { target vect_sdot_qi } } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
> > new file mode 100644
> > index 00000000000..6c803b80120
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
> > @@ -0,0 +1,77 @@
> > +/* Disabling epilogues until we find a better way to deal with scans.  */
> > +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> > +/* { dg-require-effective-target vect_int } */
> > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> > +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> > +
> > +#include "tree-vect.h"
> > +
> > +#define N 50
> > +
> > +#ifndef SIGNEDNESS_1
> > +#define SIGNEDNESS_1 signed
> > +#define SIGNEDNESS_2 unsigned
> > +#define SIGNEDNESS_3 signed
> > +#define SIGNEDNESS_4 signed
> > +#endif
> > +
> > +SIGNEDNESS_1 int __attribute__ ((noipa))
> > +fn (SIGNEDNESS_1 int res,
> > +   SIGNEDNESS_2 char *restrict a,
> > +   SIGNEDNESS_2 char *restrict b,
> > +   SIGNEDNESS_3 char *restrict c,
> > +   SIGNEDNESS_3 char *restrict d,
> > +   SIGNEDNESS_4 short *restrict e,
> > +   SIGNEDNESS_4 short *restrict f,
> > +   SIGNEDNESS_1 int *restrict g)
> > +{
> > +  for (int i = 0; i < N; ++i)
> > +    {
> > +      res += a[i] * b[i];
> > +      res += i + 1;
> > +      res += c[i] * d[i];
> > +      res += e[i] * f[i];
> > +      res += g[i];
> > +    }
> > +  return res;
> > +}
> > +
> > +#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
> > +#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -126 : 4)
> > +#define BASE4 ((SIGNEDNESS_4 int) -1 < 0 ? -1026 : 373)
> > +#define OFFSET 20
> > +
> > +int
> > +main (void)
> > +{
> > +  check_vect ();
> > +
> > +  SIGNEDNESS_2 char a[N], b[N];
> > +  SIGNEDNESS_3 char c[N], d[N];
> > +  SIGNEDNESS_4 short e[N], f[N];
> > +  SIGNEDNESS_1 int g[N];
> > +  int expected = 0x12345;
> > +  for (int i = 0; i < N; ++i)
> > +    {
> > +      a[i] = BASE2 + i * 5;
> > +      b[i] = BASE2 + OFFSET + i * 4;
> > +      c[i] = BASE3 + i * 2;
> > +      d[i] = BASE3 + OFFSET + i * 3;
> > +      e[i] = BASE4 + i * 6;
> > +      f[i] = BASE4 + OFFSET + i * 5;
> > +      g[i] = i;
> > +      asm volatile ("" ::: "memory");
> > +      expected += a[i] * b[i];
> > +      expected += i + 1;
> > +      expected += c[i] * d[i];
> > +      expected += e[i] * f[i];
> > +      expected += g[i];
> > +    }
> > +  if (fn (0x12345, a, b, c, d, e, f, g) != expected)
> > +    __builtin_abort ();
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_sdot_qi } } } } */
> > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_udot_qi } } } } */
> > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_sdot_hi } } } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
> > new file mode 100644
> > index 00000000000..a41e4b176c4
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
> > @@ -0,0 +1,66 @@
> > +/* Disabling epilogues until we find a better way to deal with scans.  */
> > +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> > +/* { dg-require-effective-target vect_int } */
> > +
> > +#include "tree-vect.h"
> > +
> > +#define N 50
> > +
> > +#ifndef SIGNEDNESS_1
> > +#define SIGNEDNESS_1 signed
> > +#define SIGNEDNESS_2 unsigned
> > +#define SIGNEDNESS_3 signed
> > +#endif
> > +
> > +SIGNEDNESS_1 int __attribute__ ((noipa))
> > +f (SIGNEDNESS_1 int res,
> > +   SIGNEDNESS_2 char *restrict a,
> > +   SIGNEDNESS_2 char *restrict b,
> > +   SIGNEDNESS_3 short *restrict c,
> > +   SIGNEDNESS_3 short *restrict d,
> > +   SIGNEDNESS_1 int *restrict e)
> > +{
> > +  for (int i = 0; i < N; ++i)
> > +    {
> > +      short diff = a[i] - b[i];
> > +      SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff;
> > +      res += abs;
> > +      res += c[i] * d[i];
> > +      res += e[i];
> > +    }
> > +  return res;
> > +}
> > +
> > +#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
> > +#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -1236 : 373)
> > +#define OFFSET 20
> > +
> > +int
> > +main (void)
> > +{
> > +  check_vect ();
> > +
> > +  SIGNEDNESS_2 char a[N], b[N];
> > +  SIGNEDNESS_3 short c[N], d[N];
> > +  SIGNEDNESS_1 int e[N];
> > +  int expected = 0x12345;
> > +  for (int i = 0; i < N; ++i)
> > +    {
> > +      a[i] = BASE2 + i * 5;
> > +      b[i] = BASE2 - i * 4;
> > +      c[i] = BASE3 + i * 2;
> > +      d[i] = BASE3 + OFFSET + i * 3;
> > +      e[i] = i;
> > +      asm volatile ("" ::: "memory");
> > +      short diff = a[i] - b[i];
> > +      SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff;
> > +      expected += abs;
> > +      expected += c[i] * d[i];
> > +      expected += e[i];
> > +    }
> > +  if (f (0x12345, a, b, c, d, e) != expected)
> > +    __builtin_abort ();
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = SAD_EXPR" "vect" { target vect_udot_qi } } } */
> > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target vect_sdot_hi } } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
> > new file mode 100644
> > index 00000000000..c2831fbcc8e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
> > @@ -0,0 +1,95 @@
> > +/* Disabling epilogues until we find a better way to deal with scans.  */
> > +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> > +/* { dg-require-effective-target vect_int } */
> > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> > +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> > +
> > +#include "tree-vect.h"
> > +
> > +#ifndef SIGNEDNESS_1
> > +#define SIGNEDNESS_1 signed
> > +#define SIGNEDNESS_2 signed
> > +#endif
> > +
> > +SIGNEDNESS_1 int __attribute__ ((noipa))
> > +f (SIGNEDNESS_1 int res,
> > +   SIGNEDNESS_2 char *a,
> > +   SIGNEDNESS_2 char *b,
> > +   int step, int n)
> > +{
> > +  for (int i = 0; i < n; i++)
> > +    {
> > +      res += a[0] * b[0];
> > +      res += a[1] * b[1];
> > +      res += a[2] * b[2];
> > +      res += a[3] * b[3];
> > +      res += a[4] * b[4];
> > +      res += a[5] * b[5];
> > +      res += a[6] * b[6];
> > +      res += a[7] * b[7];
> > +      res += a[8] * b[8];
> > +      res += a[9] * b[9];
> > +      res += a[10] * b[10];
> > +      res += a[11] * b[11];
> > +      res += a[12] * b[12];
> > +      res += a[13] * b[13];
> > +      res += a[14] * b[14];
> > +      res += a[15] * b[15];
> > +
> > +      a += step;
> > +      b += step;
> > +    }
> > +
> > +  return res;
> > +}
> > +
> > +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
> > +#define OFFSET 20
> > +
> > +int
> > +main (void)
> > +{
> > +  check_vect ();
> > +
> > +  SIGNEDNESS_2 char a[100], b[100];
> > +  int expected = 0x12345;
> > +  int step = 16;
> > +  int n = 2;
> > +  int t = 0;
> > +
> > +  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
> > +    {
> > +      a[i] = BASE + i * 5;
> > +      b[i] = BASE + OFFSET + i * 4;
> > +      asm volatile ("" ::: "memory");
> > +    }
> > +
> > +  for (int i = 0; i < n; i++)
> > +    {
> > +      asm volatile ("" ::: "memory");
> > +      expected += a[t + 0] * b[t + 0];
> > +      expected += a[t + 1] * b[t + 1];
> > +      expected += a[t + 2] * b[t + 2];
> > +      expected += a[t + 3] * b[t + 3];
> > +      expected += a[t + 4] * b[t + 4];
> > +      expected += a[t + 5] * b[t + 5];
> > +      expected += a[t + 6] * b[t + 6];
> > +      expected += a[t + 7] * b[t + 7];
> > +      expected += a[t + 8] * b[t + 8];
> > +      expected += a[t + 9] * b[t + 9];
> > +      expected += a[t + 10] * b[t + 10];
> > +      expected += a[t + 11] * b[t + 11];
> > +      expected += a[t + 12] * b[t + 12];
> > +      expected += a[t + 13] * b[t + 13];
> > +      expected += a[t + 14] * b[t + 14];
> > +      expected += a[t + 15] * b[t + 15];
> > +      t += step;
> > +    }
> > +
> > +  if (f (0x12345, a, b, step, n) != expected)
> > +    __builtin_abort ();
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
> > +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 16 "vect" } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
> > new file mode 100644
> > index 00000000000..4114264a364
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
> > @@ -0,0 +1,67 @@
> > +/* Disabling epilogues until we find a better way to deal with scans.  */
> > +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> > +/* { dg-require-effective-target vect_int } */
> > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> > +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> > +
> > +#include "tree-vect.h"
> > +
> > +#ifndef SIGNEDNESS_1
> > +#define SIGNEDNESS_1 signed
> > +#define SIGNEDNESS_2 signed
> > +#endif
> > +
> > +SIGNEDNESS_1 int __attribute__ ((noipa))
> > +f (SIGNEDNESS_1 int res,
> > +   SIGNEDNESS_2 char *a,
> > +   SIGNEDNESS_2 char *b,
> > +   int n)
> > +{
> > +  for (int i = 0; i < n; i++)
> > +    {
> > +      res += a[5 * i + 0] * b[5 * i + 0];
> > +      res += a[5 * i + 1] * b[5 * i + 1];
> > +      res += a[5 * i + 2] * b[5 * i + 2];
> > +      res += a[5 * i + 3] * b[5 * i + 3];
> > +      res += a[5 * i + 4] * b[5 * i + 4];
> > +    }
> > +
> > +  return res;
> > +}
> > +
> > +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
> > +#define OFFSET 20
> > +
> > +int
> > +main (void)
> > +{
> > +  check_vect ();
> > +
> > +  SIGNEDNESS_2 char a[100], b[100];
> > +  int expected = 0x12345;
> > +  int n = 18;
> > +
> > +  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
> > +    {
> > +      a[i] = BASE + i * 5;
> > +      b[i] = BASE + OFFSET + i * 4;
> > +      asm volatile ("" ::: "memory");
> > +    }
> > +
> > +  for (int i = 0; i < n; i++)
> > +    {
> > +      asm volatile ("" ::: "memory");
> > +      expected += a[5 * i + 0] * b[5 * i + 0];
> > +      expected += a[5 * i + 1] * b[5 * i + 1];
> > +      expected += a[5 * i + 2] * b[5 * i + 2];
> > +      expected += a[5 * i + 3] * b[5 * i + 3];
> > +      expected += a[5 * i + 4] * b[5 * i + 4];
> > +    }
> > +
> > +  if (f (0x12345, a, b, n) != expected)
> > +    __builtin_abort ();
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
> > +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 5 "vect" } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
> > new file mode 100644
> > index 00000000000..2cdecc36d16
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
> > @@ -0,0 +1,79 @@
> > +/* Disabling epilogues until we find a better way to deal with scans.  */
> > +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> > +/* { dg-require-effective-target vect_int } */
> > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> > +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> > +
> > +#include "tree-vect.h"
> > +
> > +#ifndef SIGNEDNESS_1
> > +#define SIGNEDNESS_1 signed
> > +#define SIGNEDNESS_2 signed
> > +#endif
> > +
> > +SIGNEDNESS_1 int __attribute__ ((noipa))
> > +f (SIGNEDNESS_1 int res,
> > +   SIGNEDNESS_2 short *a,
> > +   SIGNEDNESS_2 short *b,
> > +   int step, int n)
> > +{
> > +  for (int i = 0; i < n; i++)
> > +    {
> > +      res += a[0] * b[0];
> > +      res += a[1] * b[1];
> > +      res += a[2] * b[2];
> > +      res += a[3] * b[3];
> > +      res += a[4] * b[4];
> > +      res += a[5] * b[5];
> > +      res += a[6] * b[6];
> > +      res += a[7] * b[7];
> > +
> > +      a += step;
> > +      b += step;
> > +    }
> > +
> > +  return res;
> > +}
> > +
> > +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)
> > +#define OFFSET 20
> > +
> > +int
> > +main (void)
> > +{
> > +  check_vect ();
> > +
> > +  SIGNEDNESS_2 short a[100], b[100];
> > +  int expected = 0x12345;
> > +  int step = 8;
> > +  int n = 2;
> > +  int t = 0;
> > +
> > +  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
> > +    {
> > +      a[i] = BASE + i * 5;
> > +      b[i] = BASE + OFFSET + i * 4;
> > +      asm volatile ("" ::: "memory");
> > +    }
> > +
> > +  for (int i = 0; i < n; i++)
> > +    {
> > +      asm volatile ("" ::: "memory");
> > +      expected += a[t + 0] * b[t + 0];
> > +      expected += a[t + 1] * b[t + 1];
> > +      expected += a[t + 2] * b[t + 2];
> > +      expected += a[t + 3] * b[t + 3];
> > +      expected += a[t + 4] * b[t + 4];
> > +      expected += a[t + 5] * b[t + 5];
> > +      expected += a[t + 6] * b[t + 6];
> > +      expected += a[t + 7] * b[t + 7];
> > +      t += step;
> > +    }
> > +
> > +  if (f (0x12345, a, b, step, n) != expected)
> > +    __builtin_abort ();
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
> > +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 8 "vect"  { target vect_sdot_hi } } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
> > new file mode 100644
> > index 00000000000..32c0f30c77b
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
> > @@ -0,0 +1,63 @@
> > +/* Disabling epilogues until we find a better way to deal with scans.  */
> > +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> > +/* { dg-require-effective-target vect_int } */
> > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> > +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> > +
> > +#include "tree-vect.h"
> > +
> > +#ifndef SIGNEDNESS_1
> > +#define SIGNEDNESS_1 signed
> > +#define SIGNEDNESS_2 signed
> > +#endif
> > +
> > +SIGNEDNESS_1 int __attribute__ ((noipa))
> > +f (SIGNEDNESS_1 int res,
> > +   SIGNEDNESS_2 short *a,
> > +   SIGNEDNESS_2 short *b,
> > +   int n)
> > +{
> > +  for (int i = 0; i < n; i++)
> > +    {
> > +      res += a[3 * i + 0] * b[3 * i + 0];
> > +      res += a[3 * i + 1] * b[3 * i + 1];
> > +      res += a[3 * i + 2] * b[3 * i + 2];
> > +    }
> > +
> > +  return res;
> > +}
> > +
> > +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)
> > +#define OFFSET 20
> > +
> > +int
> > +main (void)
> > +{
> > +  check_vect ();
> > +
> > +  SIGNEDNESS_2 short a[100], b[100];
> > +  int expected = 0x12345;
> > +  int n = 18;
> > +
> > +  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
> > +    {
> > +      a[i] = BASE + i * 5;
> > +      b[i] = BASE + OFFSET + i * 4;
> > +      asm volatile ("" ::: "memory");
> > +    }
> > +
> > +  for (int i = 0; i < n; i++)
> > +    {
> > +      asm volatile ("" ::: "memory");
> > +      expected += a[3 * i + 0] * b[3 * i + 0];
> > +      expected += a[3 * i + 1] * b[3 * i + 1];
> > +      expected += a[3 * i + 2] * b[3 * i + 2];
> > +    }
> > +
> > +  if (f (0x12345, a, b, n) != expected)
> > +    __builtin_abort ();
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
> > +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 3 "vect"  { target vect_sdot_hi } } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
> > new file mode 100644
> > index 00000000000..e17d6291f75
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
> > @@ -0,0 +1,35 @@
> > +/* Disabling epilogues until we find a better way to deal with scans.  */
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> > +/* { dg-require-effective-target vect_int } */
> > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> > +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> > +
> > +#include "tree-vect.h"
> > +
> > +#ifndef SIGNEDNESS_1
> > +#define SIGNEDNESS_1 signed
> > +#define SIGNEDNESS_2 signed
> > +#endif
> > +
> > +SIGNEDNESS_1 int __attribute__ ((noipa))
> > +f (SIGNEDNESS_1 int res0,
> > +   SIGNEDNESS_1 int res1,
> > +   SIGNEDNESS_1 int res2,
> > +   SIGNEDNESS_1 int res3,
> > +   SIGNEDNESS_2 short *a,
> > +   SIGNEDNESS_2 short *b)
> > +{
> > +  for (int i = 0; i < 64; i += 4)
> > +    {
> > +      res0 += a[i + 0] * b[i + 0];
> > +      res1 += a[i + 1] * b[i + 1];
> > +      res2 += a[i + 2] * b[i + 2];
> > +      res3 += a[i + 3] * b[i + 3];
> > +    }
> > +
> > +  return res0 ^ res1 ^ res2 ^ res3;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> > +/* { dg-final { scan-tree-dump-not "vectorizing stmts using SLP" "vect" } } */
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index e0561feddce..6d91665a341 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -5324,8 +5324,6 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
> >    if (!gimple_extract_op (orig_stmt_info->stmt, &op))
> >      gcc_unreachable ();
> >
> > -  bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
> > -
> >    if (reduction_type == EXTRACT_LAST_REDUCTION)
> >      /* No extra instructions are needed in the prologue.  The loop body
> >         operations are costed in vectorizable_condition.  */
> > @@ -5360,12 +5358,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
> >            initial result of the data reduction, initial value of the index
> >            reduction.  */
> >         prologue_stmts = 4;
> > -      else if (emulated_mixed_dot_prod)
> > -       /* We need the initial reduction value and two invariants:
> > -          one that contains the minimum signed value and one that
> > -          contains half of its negative.  */
> > -       prologue_stmts = 3;
> >        else
> > +       /* We need the initial reduction value.  */
> >         prologue_stmts = 1;
> >        prologue_cost += record_stmt_cost (cost_vec, prologue_stmts,
> >                                          scalar_to_vec, stmt_info, 0,
> > @@ -7466,7 +7460,7 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
> >        vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
> >        unsigned nvectors;
> >
> > -      if (slp_node)
> > +      if (slp_node && SLP_TREE_LANES (slp_node) > 1)
>
> Hmm, that looks wrong.  It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off
> instead, which is bad.
>
> >         nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
> >        else
> >         nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
> > @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
> >      }
> >  }
> >
> > +/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
> > +   the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC.
> > +   Now there are three such kinds of operations: dot-prod/widen-sum/sad
> > +   (sum-of-absolute-differences).
> > +
> > +   For a lane-reducing operation, the loop reduction path that it lies in
> > +   may contain a normal operation, or another lane-reducing operation of
> > +   different input type size.  An example:
> > +
> > +     int sum = 0;
> > +     for (i)
> > +       {
> > +         ...
> > +         sum += d0[i] * d1[i];       // dot-prod <vector(16) char>
> > +         sum += w[i];                // widen-sum <vector(16) char>
> > +         sum += abs(s0[i] - s1[i]);  // sad <vector(8) short>
> > +         sum += n[i];                // normal <vector(4) int>
> > +         ...
> > +       }
> > +
> > +   The vectorization factor is essentially determined by the operation whose
> > +   input vectype has the most lanes ("vector(16) char" in the example), while
> > +   we need to choose the input vectype with the least lanes ("vector(4) int"
> > +   in the example) for the reduction PHI statement.  */
> > +
> > +bool
> > +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> > +                           slp_tree slp_node, stmt_vector_for_cost *cost_vec)
> > +{
> > +  gimple *stmt = stmt_info->stmt;
> > +
> > +  if (!lane_reducing_stmt_p (stmt))
> > +    return false;
> > +
> > +  tree type = TREE_TYPE (gimple_assign_lhs (stmt));
> > +
> > +  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
> > +    return false;
> > +
> > +  /* Do not try to vectorize bit-precision reductions.  */
> > +  if (!type_has_mode_precision_p (type))
> > +    return false;
> > +
> > +  if (!slp_node)
> > +    return false;
> > +
> > +  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
> > +    {
> > +      stmt_vec_info def_stmt_info;
> > +      slp_tree slp_op;
> > +      tree op;
> > +      tree vectype;
> > +      enum vect_def_type dt;
> > +
> > +      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
> > +                              &slp_op, &dt, &vectype, &def_stmt_info))
> > +       {
> > +         if (dump_enabled_p ())
> > +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > +                            "use not simple.\n");
> > +         return false;
> > +       }
> > +
> > +      if (!vectype)
> > +       {
> > +         vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op),
> > +                                                slp_op);
> > +         if (!vectype)
> > +           return false;
> > +       }
> > +
> > +      if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))
> > +       {
> > +         if (dump_enabled_p ())
> > +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > +                            "incompatible vector types for invariants\n");
> > +         return false;
> > +       }
> > +
> > +      if (i == STMT_VINFO_REDUC_IDX (stmt_info))
> > +       continue;
> > +
> > +      /* There should be at most one cycle def in the stmt.  */
> > +      if (VECTORIZABLE_CYCLE_DEF (dt))
> > +       return false;
> > +    }
> > +
> > +  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
> > +
> > +  /* TODO: Support lane-reducing operation that does not directly participate
> > +     in loop reduction. */
> > +  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
> > +    return false;
> > +
> > +  /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not
> > +     recognized.  */
> > +  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
> > +  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
> > +
> > +  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> > +  int ncopies_for_cost;
> > +
> > +  if (SLP_TREE_LANES (slp_node) > 1)
> > +    {
> > +      /* Now lane-reducing operations in a non-single-lane slp node should only
> > +        come from the same loop reduction path.  */
> > +      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
> > +      ncopies_for_cost = 1;
> > +    }
> > +  else
> > +    {
> > +      ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);
>
> OK, so the fact that the ops are lane-reducing means they effectively
> change the VF for the result.  That's only possible as we tightly control
> code generation and "adjust" to the expected VF (by inserting the copies
> you mentioned above), but only up to the highest number of outputs
> created in the reduction chain.  In that sense, instead of talking about and
> recording "input vector types", wouldn't it make more sense to record the
> effective vectorization factor for the reduction instance?  That VF would be
> at most the loop's VF but could be as low as 1.  Once we have a
> non-lane-reducing operation in the reduction chain it would always be equal
> to the loop's VF.
>
> ncopies would then always be determined by that reduction instance VF and
> the accumulator vector type (STMT_VINFO_VECTYPE).  This reduction
> instance VF would also trivially indicate the force-single-def-use-cycle
> case, possibly simplifying code?
>
> > +      gcc_assert (ncopies_for_cost >= 1);
> > +    }
> > +
> > +  if (vect_is_emulated_mixed_dot_prod (stmt_info))
> > +    {
> > +      /* We need two extra invariants: one that contains the minimum signed
> > +        value and one that contains half of its negative.  */
> > +      int prologue_stmts = 2;
> > +      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
> > +                                       scalar_to_vec, stmt_info, 0,
> > +                                       vect_prologue);
> > +      if (dump_enabled_p ())
> > +       dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
> > +                    "extra prologue_cost = %d .\n", cost);
> > +
> > +      /* Three dot-products and a subtraction.  */
> > +      ncopies_for_cost *= 4;
> > +    }
> > +
> > +  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
> > +                   vect_body);
> > +
> > +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> > +    {
> > +      enum tree_code code = gimple_assign_rhs_code (stmt);
> > +      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
> > +                                                 slp_node, code, type,
> > +                                                 vectype_in);
> > +    }
> > +
>
> Add a comment:
>
>     /* Transform via vect_transform_reduction.  */
>
> > +  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
> > +  return true;
> > +}
> > +
> >  /* Function vectorizable_reduction.
> >
> >     Check if STMT_INFO performs a reduction operation that can be vectorized.
> > @@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >    if (!type_has_mode_precision_p (op.type))
> >      return false;
> >
> > -  /* For lane-reducing ops we're reducing the number of reduction PHIs
> > -     which means the only use of that may be in the lane-reducing operation.  */
> > -  if (lane_reducing
> > -      && reduc_chain_length != 1
> > -      && !only_slp_reduc_chain)
> > -    {
> > -      if (dump_enabled_p ())
> > -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > -                        "lane-reducing reduction with extra stmts.\n");
> > -      return false;
> > -    }
> > -
> >    /* Lane-reducing ops also never can be used in a SLP reduction group
> >       since we'll mix lanes belonging to different reductions.  But it's
> >       OK to use them in a reduction chain or when the reduction group
> > @@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >        && loop_vinfo->suggested_unroll_factor == 1)
> >      single_defuse_cycle = true;
> >
> > -  if (single_defuse_cycle || lane_reducing)
> > +  if (single_defuse_cycle && !lane_reducing)
>
> If there's also a non-lane-reducing plus in the chain don't we have to
> check for that reduction op?  So shouldn't it be
> single_defuse_cycle && ... fact that we don't record
> (non-lane-reducing op there) ...
>
> >      {
> >        gcc_assert (op.code != COND_EXPR);
> >
> > -      /* 4. Supportable by target?  */
> > -      bool ok = true;
> > -
> > -      /* 4.1. check support for the operation in the loop
> > +      /* 4. check support for the operation in the loop
> >
> >          This isn't necessary for the lane reduction codes, since they
> >          can only be produced by pattern matching, and it's up to the
> > @@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >          mixed-sign dot-products can be implemented using signed
> >          dot-products.  */
> >        machine_mode vec_mode = TYPE_MODE (vectype_in);
> > -      if (!lane_reducing
> > -         && !directly_supported_p (op.code, vectype_in, optab_vector))
> > +      if (!directly_supported_p (op.code, vectype_in, optab_vector))
> >          {
> >            if (dump_enabled_p ())
> >              dump_printf (MSG_NOTE, "op not supported by target.\n");
> >           if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
> >               || !vect_can_vectorize_without_simd_p (op.code))
> > -           ok = false;
> > +           single_defuse_cycle = false;
> >           else
> >             if (dump_enabled_p ())
> >               dump_printf (MSG_NOTE, "proceeding using word mode.\n");
> > @@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >             dump_printf (MSG_NOTE, "using word mode not possible.\n");
> >           return false;
> >         }
> > -
> > -      /* lane-reducing operations have to go through vect_transform_reduction.
> > -         For the other cases try without the single cycle optimization.  */
> > -      if (!ok)
> > -       {
> > -         if (lane_reducing)
> > -           return false;
> > -         else
> > -           single_defuse_cycle = false;
> > -       }
> >      }
> >    if (dump_enabled_p () && single_defuse_cycle)
> >      dump_printf_loc (MSG_NOTE, vect_location,
> > @@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >                      "multiple vectors to one in the loop body\n");
> >    STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;
> >
> > -  /* If the reduction stmt is one of the patterns that have lane
> > -     reduction embedded we cannot handle the case of ! single_defuse_cycle.  */
> > -  if ((ncopies > 1 && ! single_defuse_cycle)
> > -      && lane_reducing)
> > -    {
> > -      if (dump_enabled_p ())
> > -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > -                        "multi def-use cycle not possible for lane-reducing "
> > -                        "reduction operation\n");
> > -      return false;
> > -    }
> > +  /* For a lane-reducing operation, the processing below related to the
> > +     single defuse-cycle will be done in its own vectorizable function.
> > +     One more thing to note is that the operation must not be involved
> > +     in a fold-left reduction.  */
> > +  single_defuse_cycle &= !lane_reducing;
> >
> >    if (slp_node
> > -      && !(!single_defuse_cycle
> > -          && !lane_reducing
> > -          && reduction_type != FOLD_LEFT_REDUCTION))
> > +      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
> >      for (i = 0; i < (int) op.num_ops; i++)
> >        if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
> >         {
> > @@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >    vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
> >                              reduction_type, ncopies, cost_vec);
> >    /* Cost the reduction op inside the loop if transformed via
> > -     vect_transform_reduction.  Otherwise this is costed by the
> > -     separate vectorizable_* routines.  */
> > -  if (single_defuse_cycle || lane_reducing)
> > -    {
> > -      int factor = 1;
> > -      if (vect_is_emulated_mixed_dot_prod (stmt_info))
> > -       /* Three dot-products and a subtraction.  */
> > -       factor = 4;
> > -      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
> > -                       stmt_info, 0, vect_body);
> > -    }
> > +     vect_transform_reduction for non-lane-reducing operation.  Otherwise
> > +     this is costed by the separate vectorizable_* routines.  */
> > +  if (single_defuse_cycle)
> > +    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
> >
> >    if (dump_enabled_p ()
> >        && reduction_type == FOLD_LEFT_REDUCTION)
> >      dump_printf_loc (MSG_NOTE, vect_location,
> >                      "using an in-order (fold-left) reduction.\n");
> >    STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
> > -  /* All but single defuse-cycle optimized, lane-reducing and fold-left
> > -     reductions go through their own vectorizable_* routines.  */
> > -  if (!single_defuse_cycle
> > -      && !lane_reducing
> > -      && reduction_type != FOLD_LEFT_REDUCTION)
> > +
> > +  /* All but single defuse-cycle optimized and fold-left reductions go
> > +     through their own vectorizable_* routines.  */
> > +  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
> >      {
> >        stmt_vec_info tem
> >         = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
> > @@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >    bool lane_reducing = lane_reducing_op_p (code);
> >    gcc_assert (single_defuse_cycle || lane_reducing);
> >
> > +  if (lane_reducing)
> > +    {
> > +      /* The last operand of lane-reducing op is for reduction.  */
> > +      gcc_assert (reduc_index == (int) op.num_ops - 1);
> > +
> > +      /* Now all lane-reducing ops are covered by some slp node.  */
> > +      gcc_assert (slp_node);
> > +    }
> > +
> >    /* Create the destination vector  */
> >    tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
> >    tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
> > @@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >                          reduc_index == 2 ? op.ops[2] : NULL_TREE,
> >                          &vec_oprnds[2]);
> >      }
> > +  else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1
> > +          && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ())
> > +    {
> > +      /* For lane-reducing op covered by single-lane slp node, the input
> > +        vectype of the reduction PHI determines copies of vectorized def-use
> > +        cycles, which might be more than effective copies of vectorized lane-
> > +        reducing reduction statements.  This could be complemented by
> > +        generating extra trivial pass-through copies.  For example:
> > +
> > +          int sum = 0;
> > +          for (i)
> > +            {
> > +              sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
> > +              sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
> > +              sum += n[i];               // normal <vector(4) int>
> > +            }
> > +
> > +        The vector size is 128-bit, and the vectorization factor is 16.
> > +        Reduction statements would be transformed as:
> > +
> > +          vector<4> int sum_v0 = { 0, 0, 0, 0 };
> > +          vector<4> int sum_v1 = { 0, 0, 0, 0 };
> > +          vector<4> int sum_v2 = { 0, 0, 0, 0 };
> > +          vector<4> int sum_v3 = { 0, 0, 0, 0 };
> > +
> > +          for (i / 16)
> > +            {
> > +              sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
> > +              sum_v1 = sum_v1;  // copy
> > +              sum_v2 = sum_v2;  // copy
> > +              sum_v3 = sum_v3;  // copy
> > +
> > +              sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
> > +              sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
> > +              sum_v2 = sum_v2;  // copy
> > +              sum_v3 = sum_v3;  // copy
> > +
> > +              sum_v0 += n_v0[i: 0  ~ 3 ];
> > +              sum_v1 += n_v1[i: 4  ~ 7 ];
> > +              sum_v2 += n_v2[i: 8  ~ 11];
> > +              sum_v3 += n_v3[i: 12 ~ 15];
> > +            }
> > +       */
> > +      unsigned using_ncopies = vec_oprnds[0].length ();
> > +      unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
> > +
>
> assert reduc_ncopies >= using_ncopies?  Maybe assert
> reduc_index == op.num_ops - 1 given you use one above
> and the other below?  Or simply iterate till op.num_ops
> and skip i == reduc_index.
>
> > +      for (unsigned i = 0; i < op.num_ops - 1; i++)
> > +       {
> > +         gcc_assert (vec_oprnds[i].length () == using_ncopies);
> > +         vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
> > +       }
> > +    }
> >
> >    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
> >    unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
> > @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >      {
> >        gimple *new_stmt;
> >        tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
> > -      if (masked_loop_p && !mask_by_cond_expr)
> > +
> > +      if (!vop[0] || !vop[1])
> > +       {
> > +         tree reduc_vop = vec_oprnds[reduc_index][i];
> > +
> > +         /* Insert trivial copy if no need to generate vectorized
> > +            statement.  */
> > +         gcc_assert (reduc_vop);
> > +
> > +         new_stmt = gimple_build_assign (vec_dest, reduc_vop);
> > +         new_temp = make_ssa_name (vec_dest, new_stmt);
> > +         gimple_set_lhs (new_stmt, new_temp);
> > +         vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
>
> I think you could simply do
>
>                slp_node->push_vec_def (reduc_vop);
>                continue;
>
> without any code generation.
>
> > +       }
> > +      else if (masked_loop_p && !mask_by_cond_expr)
> >         {
> >           /* No conditional ifns have been defined for lane-reducing op
> >              yet.  */
> > @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >
> >           if (masked_loop_p && mask_by_cond_expr)
> >             {
> > +             tree stmt_vectype_in = vectype_in;
> > +             unsigned nvectors = vec_num * ncopies;
> > +
> > +             if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
> > +               {
> > +                 /* Input vectype of the reduction PHI may be defferent from
>
> different
>
> > +                    that of lane-reducing operation.  */
> > +                 stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> > +                 nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);
>
> I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS.
>
> Otherwise the patch looks good to me.
>
> Richard.
>
> > +               }
> > +
> >               tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks,
> > -                                             vec_num * ncopies, vectype_in, i);
> > +                                             nvectors, stmt_vectype_in, i);
> >               build_vect_cond_expr (code, vop, mask, gsi);
> >             }
> >
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index ca6052662a3..1b73ef01ade 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -13350,6 +13350,8 @@ vect_analyze_stmt (vec_info *vinfo,
> >                                       NULL, NULL, node, cost_vec)
> >           || vectorizable_load (vinfo, stmt_info, NULL, NULL, node, cost_vec)
> >           || vectorizable_store (vinfo, stmt_info, NULL, NULL, node, cost_vec)
> > +         || vectorizable_lane_reducing (as_a <loop_vec_info> (vinfo),
> > +                                        stmt_info, node, cost_vec)
> >           || vectorizable_reduction (as_a <loop_vec_info> (vinfo), stmt_info,
> >                                      node, node_instance, cost_vec)
> >           || vectorizable_induction (as_a <loop_vec_info> (vinfo), stmt_info,
> > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > index 60224f4e284..94736736dcc 100644
> > --- a/gcc/tree-vectorizer.h
> > +++ b/gcc/tree-vectorizer.h
> > @@ -2455,6 +2455,8 @@ extern loop_vec_info vect_create_loop_vinfo (class loop *, vec_info_shared *,
> >  extern bool vectorizable_live_operation (vec_info *, stmt_vec_info,
> >                                          slp_tree, slp_instance, int,
> >                                          bool, stmt_vector_for_cost *);
> > +extern bool vectorizable_lane_reducing (loop_vec_info, stmt_vec_info,
> > +                                       slp_tree, stmt_vector_for_cost *);
> >  extern bool vectorizable_reduction (loop_vec_info, stmt_vec_info,
> >                                     slp_tree, slp_instance,
> >                                     stmt_vector_for_cost *);
> > --
> > 2.17.1

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
  2024-06-24 12:58     ` Richard Biener
@ 2024-06-25  9:32       ` Feng Xue OS
  2024-06-25 10:26         ` Richard Biener
  2024-06-26 14:50         ` Feng Xue OS
  0 siblings, 2 replies; 9+ messages in thread
From: Feng Xue OS @ 2024-06-25  9:32 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

>>
>> >> -      if (slp_node)
>> >> +      if (slp_node && SLP_TREE_LANES (slp_node) > 1)
>> >
>> > Hmm, that looks wrong.  It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off
>> > instead, which is bad.
>> >
>> >>         nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
>> >>        else
>> >>         nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
>> >> @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
>> >>      }
>> >>  }
>> >>
>> >> +/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
>> >> +   the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC.
>> >> +   Now there are three such kinds of operations: dot-prod/widen-sum/sad
>> >> +   (sum-of-absolute-differences).
>> >> +
>> >> +   For a lane-reducing operation, the loop reduction path that it lies in
>> >> +   may contain a normal operation, or another lane-reducing operation with
>> >> +   a different input type size, for example:
>> >> +
>> >> +     int sum = 0;
>> >> +     for (i)
>> >> +       {
>> >> +         ...
>> >> +         sum += d0[i] * d1[i];       // dot-prod <vector(16) char>
>> >> +         sum += w[i];                // widen-sum <vector(16) char>
>> >> +         sum += abs(s0[i] - s1[i]);  // sad <vector(8) short>
>> >> +         sum += n[i];                // normal <vector(4) int>
>> >> +         ...
>> >> +       }
>> >> +
>> >> +   The vectorization factor is essentially determined by the operation
>> >> +   whose input vectype has the most lanes ("vector(16) char" in the
>> >> +   example), while we need to choose the input vectype with the least
>> >> +   lanes ("vector(4) int" in the example) for the reduction PHI
>> >> +   statement.  */
>> >> +
>> >> +bool
>> >> +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
>> >> +                           slp_tree slp_node, stmt_vector_for_cost *cost_vec)
>> >> +{
>> >> +  gimple *stmt = stmt_info->stmt;
>> >> +
>> >> +  if (!lane_reducing_stmt_p (stmt))
>> >> +    return false;
>> >> +
>> >> +  tree type = TREE_TYPE (gimple_assign_lhs (stmt));
>> >> +
>> >> +  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
>> >> +    return false;
>> >> +
>> >> +  /* Do not try to vectorize bit-precision reductions.  */
>> >> +  if (!type_has_mode_precision_p (type))
>> >> +    return false;
>> >> +
>> >> +  if (!slp_node)
>> >> +    return false;
>> >> +
>> >> +  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
>> >> +    {
>> >> +      stmt_vec_info def_stmt_info;
>> >> +      slp_tree slp_op;
>> >> +      tree op;
>> >> +      tree vectype;
>> >> +      enum vect_def_type dt;
>> >> +
>> >> +      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
>> >> +                              &slp_op, &dt, &vectype, &def_stmt_info))
>> >> +       {
>> >> +         if (dump_enabled_p ())
>> >> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> >> +                            "use not simple.\n");
>> >> +         return false;
>> >> +       }
>> >> +
>> >> +      if (!vectype)
>> >> +       {
>> >> +         vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op),
>> >> +                                                slp_op);
>> >> +         if (!vectype)
>> >> +           return false;
>> >> +       }
>> >> +
>> >> +      if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))
>> >> +       {
>> >> +         if (dump_enabled_p ())
>> >> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> >> +                            "incompatible vector types for invariants\n");
>> >> +         return false;
>> >> +       }
>> >> +
>> >> +      if (i == STMT_VINFO_REDUC_IDX (stmt_info))
>> >> +       continue;
>> >> +
>> >> +      /* There should be at most one cycle def in the stmt.  */
>> >> +      if (VECTORIZABLE_CYCLE_DEF (dt))
>> >> +       return false;
>> >> +    }
>> >> +
>> >> +  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
>> >> +
>> >> +  /* TODO: Support lane-reducing operation that does not directly participate
>> >> +     in loop reduction. */
>> >> +  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
>> >> +    return false;
>> >> +
>> >> +  /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not
>> >> +     recognized.  */
>> >> +  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
>> >> +  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
>> >> +
>> >> +  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
>> >> +  int ncopies_for_cost;
>> >> +
>> >> +  if (SLP_TREE_LANES (slp_node) > 1)
>> >> +    {
>> >> +      /* Now lane-reducing operations in a non-single-lane slp node should only
>> >> +        come from the same loop reduction path.  */
>> >> +      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
>> >> +      ncopies_for_cost = 1;
>> >> +    }
>> >> +  else
>> >> +    {
>> >> +      ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);
>> >
>> > OK, so the fact that the ops are lane-reducing means they effectively
>> > change the VF for the result.  That's only possible as we tightly control
>> > code generation and "adjust" to the expected VF (by inserting the copies
>> > you mentioned above), but only up to the highest number of outputs
>> > created in the reduction chain.  In that sense, instead of talking about and
>> > recording "input vector types", wouldn't it make more sense to record the effective
>> > vectorization factor for the reduction instance?  That VF would be at most
>> > the loop's VF but could be as low as 1.  Once we have a non-lane-reducing
>> > operation in the reduction chain it would always be equal to the loop's VF.
>> >
>> > ncopies would then always be determined by that reduction instance VF and
>> > the accumulator vector type (STMT_VINFO_VECTYPE).  This reduction
>> > instance VF would also trivially indicate the force-single-def-use-cycle
>> > case, possibly simplifying code?
>>
>> I tried to add such an effective VF, while the vectype_in is still needed in some
>> scenarios, such as when checking whether a dot-prod stmt is emulated or not.
>> The former could be deduced from the latter, so recording both things seems
>> redundant.  Another consideration is that for a normal op, ncopies is
>> determined from the type (STMT_VINFO_VECTYPE), but for a lane-reducing op,
>> it comes from the VF.  So is there a better means to unify them?
> 
> AFAICS reductions are special in that they, for the accumulation SSA cycle,
> do not adhere to the loop's VF but as an optimization can choose a smaller one.
> OTOH STMT_VINFO_VECTYPE is for the vector type used for individual
> operations which even for lane-reducing ops is adhered to - those just
> may use a smaller VF, that of the reduction SSA cycle.
> 
> So what's redundant is STMT_VINFO_REDUC_VECTYPE_IN - or rather
> it's not fully redundant but needlessly replicated over all stmts participating
> in the reduction instead of recording the reduction VF in the reduc_info and
> using that (plus STMT_VINFO_VECTYPE) to compute the effective ncopies
> for stmts in the reduction cycle.
> 
> At least that was my idea ...
> 

For lane-reducing ops and the single-defuse-cycle optimization, we could assume
that no lane is reduced, and always generate vectorized statements according to
the normal VF; if a placeholder is needed, just insert some trivial statement
such as a zero-initialization or a pass-through copy.  We could then define an
"effective VF or ncopies" to control the lane-reducing related aspects in
analysis and codegen (such as vect_get_loop_mask below).  Since everything will
become SLP-based eventually, I think a suitable place to add such a field might
be in slp_node, as a supplement to "vect_stmts_size", and it would be adjusted
in vectorizable_reduction.  So could we do this refinement in separate patches,
once the non-SLP code path is removed?
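
To make that concrete, here is a rough standalone model of the pass-through
padding (an illustration only, not actual GCC code; reduc_ncopies and
using_ncopies just mirror the names in the quoted hunk, and std::string
stands in for a vector def):

#include <cassert>
#include <string>
#include <vector>

int main ()
{
  unsigned reduc_ncopies = 4;  /* copies dictated by the reduction PHI  */
  unsigned using_ncopies = 2;  /* effective copies of e.g. the SAD stmt  */

  /* Vectorized defs of one input operand, fewer than reduc_ncopies.  */
  std::vector<std::string> oprnd0 (using_ncopies, "sad_in");
  /* Accumulator defs coming from the reduction PHI.  */
  std::vector<std::string> accum = { "sum_v0", "sum_v1", "sum_v2", "sum_v3" };

  /* Like safe_grow_cleared: pad with empty placeholder slots.  */
  oprnd0.resize (reduc_ncopies);

  for (unsigned i = 0; i < reduc_ncopies; i++)
    if (oprnd0[i].empty ())
      /* No real input for this copy; the result is just the incoming
         accumulator, i.e. the "sum_vi = sum_vi" copies above.  */
      assert (!accum[i].empty ());
  return 0;
}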

>> >> +      gcc_assert (ncopies_for_cost >= 1);
>> >> +    }
>> >> +
>> >> +  if (vect_is_emulated_mixed_dot_prod (stmt_info))
>> >> +    {
>> >> +      /* We need two extra invariants: one that contains the minimum signed
>> >> +        value and one that contains half of its negative.  */
>> >> +      int prologue_stmts = 2;
>> >> +      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
>> >> +                                       scalar_to_vec, stmt_info, 0,
>> >> +                                       vect_prologue);
>> >> +      if (dump_enabled_p ())
>> >> +       dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
>> >> +                    "extra prologue_cost = %d .\n", cost);
>> >> +
>> >> +      /* Three dot-products and a subtraction.  */
>> >> +      ncopies_for_cost *= 4;
>> >> +    }
>> >> +
>> >> +  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
>> >> +                   vect_body);
>> >> +
>> >> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
>> >> +    {
>> >> +      enum tree_code code = gimple_assign_rhs_code (stmt);
>> >> +      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
>> >> +                                                 slp_node, code, type,
>> >> +                                                 vectype_in);
>> >> +    }
>> >> +
>> >
>> > Add a comment:
>> >
>> >     /* Transform via vect_transform_reduction.  */
>> >
>> >> +  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
>> >> +  return true;
>> >> +}
>> >> +
>> >>  /* Function vectorizable_reduction.
>> >>
>> >>     Check if STMT_INFO performs a reduction operation that can be vectorized.
>> >> @@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>> >>    if (!type_has_mode_precision_p (op.type))
>> >>      return false;
>> >>
>> >> -  /* For lane-reducing ops we're reducing the number of reduction PHIs
>> >> -     which means the only use of that may be in the lane-reducing operation.  */
>> >> -  if (lane_reducing
>> >> -      && reduc_chain_length != 1
>> >> -      && !only_slp_reduc_chain)
>> >> -    {
>> >> -      if (dump_enabled_p ())
>> >> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> >> -                        "lane-reducing reduction with extra stmts.\n");
>> >> -      return false;
>> >> -    }
>> >> -
>> >>    /* Lane-reducing ops also never can be used in a SLP reduction group
>> >>       since we'll mix lanes belonging to different reductions.  But it's
>> >>       OK to use them in a reduction chain or when the reduction group
>> >> @@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>> >>        && loop_vinfo->suggested_unroll_factor == 1)
>> >>      single_defuse_cycle = true;
>> >>
>> >> -  if (single_defuse_cycle || lane_reducing)
>> >> +  if (single_defuse_cycle && !lane_reducing)
>> >
>> > If there's also a non-lane-reducing plus in the chain don't we have to
>> > check for that reduction op?  So shouldn't it be
>> > single_defuse_cycle && ... fact that we don't record
>> > (non-lane-reducing op there) ...
>>
>> I don't quite understand this point.  For a non-lane-reducing op in the chain,
>> shouldn't it be handled in its own vectorizable_xxx function?  The check below
>> applies only to the first statement (vect_reduction_def) in the reduction.
> 
> Hmm.  So we have vectorizable_lane_reducing_* for the check on the
> lane-reducing stmts, vectorizable_* for !single-def-use stmts.  And the
> following is then just for the case there's a single def that's not
> lane-reducing
> and we're forcing a single-def-use and thus go via vect_transform_reduction?

Yes.  A non-lane-reducing op with single-defuse-cycle is handled in this
function.  This logic is the same as the original.
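
To restate the dispatch as I understand it, a minimal sketch (simplified; the
predicate name is made up, and it only mirrors the gcc_assert in
vect_transform_reduction quoted above):

#include <cassert>

/* Hypothetical predicate: which reduction stmts get transformed via
   vect_transform_reduction after this patch.  */
static bool
uses_vect_transform_reduction (bool lane_reducing, bool single_defuse_cycle)
{
  return lane_reducing || single_defuse_cycle;
}

int main ()
{
  assert (uses_vect_transform_reduction (true, false));   /* dot-prod/sad  */
  assert (uses_vect_transform_reduction (false, true));   /* forced single cycle  */
  assert (!uses_vect_transform_reduction (false, false)); /* separate vectorizable_*  */
  return 0;
}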

>> >
>> >>      {
>> >>        gcc_assert (op.code != COND_EXPR);
>> >>
>> >> -      /* 4. Supportable by target?  */
>> >> -      bool ok = true;
>> >> -
>> >> -      /* 4.1. check support for the operation in the loop
>> >> +      /* 4. check support for the operation in the loop
>> >>
>> >>          This isn't necessary for the lane reduction codes, since they
>> >>          can only be produced by pattern matching, and it's up to the
>> >> @@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>> >>          mixed-sign dot-products can be implemented using signed
>> >>          dot-products.  */
>> >>        machine_mode vec_mode = TYPE_MODE (vectype_in);
>> >> -      if (!lane_reducing
>> >> -         && !directly_supported_p (op.code, vectype_in, optab_vector))
>> >> +      if (!directly_supported_p (op.code, vectype_in, optab_vector))
>> >>          {
>> >>            if (dump_enabled_p ())
>> >>              dump_printf (MSG_NOTE, "op not supported by target.\n");
>> >>           if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
>> >>               || !vect_can_vectorize_without_simd_p (op.code))
>> >> -           ok = false;
>> >> +           single_defuse_cycle = false;
>> >>           else
>> >>             if (dump_enabled_p ())
>> >>               dump_printf (MSG_NOTE, "proceeding using word mode.\n");
>> >> @@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>> >>             dump_printf (MSG_NOTE, "using word mode not possible.\n");
>> >>           return false;
>> >>         }
>> >> -
>> >> -      /* lane-reducing operations have to go through vect_transform_reduction.
>> >> -         For the other cases try without the single cycle optimization.  */
>> >> -      if (!ok)
>> >> -       {
>> >> -         if (lane_reducing)
>> >> -           return false;
>> >> -         else
>> >> -           single_defuse_cycle = false;
>> >> -       }
>> >>      }
>> >>    if (dump_enabled_p () && single_defuse_cycle)
>> >>      dump_printf_loc (MSG_NOTE, vect_location,
>> >> @@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>> >>                      "multiple vectors to one in the loop body\n");
>> >>    STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;
>> >>
>> >> -  /* If the reduction stmt is one of the patterns that have lane
>> >> -     reduction embedded we cannot handle the case of ! single_defuse_cycle.  */
>> >> -  if ((ncopies > 1 && ! single_defuse_cycle)
>> >> -      && lane_reducing)
>> >> -    {
>> >> -      if (dump_enabled_p ())
>> >> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> >> -                        "multi def-use cycle not possible for lane-reducing "
>> >> -                        "reduction operation\n");
>> >> -      return false;
>> >> -    }
>> >> +  /* For a lane-reducing operation, the processing below related to the
>> >> +     single defuse-cycle will be done in its own vectorizable function.
>> >> +     One more thing to note is that the operation must not be involved
>> >> +     in a fold-left reduction.  */
>> >> +  single_defuse_cycle &= !lane_reducing;
>> >>
>> >>    if (slp_node
>> >> -      && !(!single_defuse_cycle
>> >> -          && !lane_reducing
>> >> -          && reduction_type != FOLD_LEFT_REDUCTION))
>> >> +      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
>> >>      for (i = 0; i < (int) op.num_ops; i++)
>> >>        if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
>> >>         {
>> >> @@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>> >>    vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
>> >>                              reduction_type, ncopies, cost_vec);
>> >>    /* Cost the reduction op inside the loop if transformed via
>> >> -     vect_transform_reduction.  Otherwise this is costed by the
>> >> -     separate vectorizable_* routines.  */
>> >> -  if (single_defuse_cycle || lane_reducing)
>> >> -    {
>> >> -      int factor = 1;
>> >> -      if (vect_is_emulated_mixed_dot_prod (stmt_info))
>> >> -       /* Three dot-products and a subtraction.  */
>> >> -       factor = 4;
>> >> -      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
>> >> -                       stmt_info, 0, vect_body);
>> >> -    }
>> >> +     vect_transform_reduction for non-lane-reducing operation.  Otherwise
>> >> +     this is costed by the separate vectorizable_* routines.  */
>> >> +  if (single_defuse_cycle)
>> >> +    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
>> >>
>> >>    if (dump_enabled_p ()
>> >>        && reduction_type == FOLD_LEFT_REDUCTION)
>> >>      dump_printf_loc (MSG_NOTE, vect_location,
>> >>                      "using an in-order (fold-left) reduction.\n");
>> >>    STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
>> >> -  /* All but single defuse-cycle optimized, lane-reducing and fold-left
>> >> -     reductions go through their own vectorizable_* routines.  */
>> >> -  if (!single_defuse_cycle
>> >> -      && !lane_reducing
>> >> -      && reduction_type != FOLD_LEFT_REDUCTION)
>> >> +
>> >> +  /* All but single defuse-cycle optimized and fold-left reductions go
>> >> +     through their own vectorizable_* routines.  */
>> >> +  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
>> >>      {
>> >>        stmt_vec_info tem
>> >>         = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
>> >> @@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>> >>    bool lane_reducing = lane_reducing_op_p (code);
>> >>    gcc_assert (single_defuse_cycle || lane_reducing);
>> >>
>> >> +  if (lane_reducing)
>> >> +    {
>> >> +      /* The last operand of lane-reducing op is for reduction.  */
>> >> +      gcc_assert (reduc_index == (int) op.num_ops - 1);
>> >> +
>> >> +      /* Now all lane-reducing ops are covered by some slp node.  */
>> >> +      gcc_assert (slp_node);
>> >> +    }
>> >> +
>> >>    /* Create the destination vector  */
>> >>    tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
>> >>    tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
>> >> @@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>> >>                          reduc_index == 2 ? op.ops[2] : NULL_TREE,
>> >>                          &vec_oprnds[2]);
>> >>      }
>> >> +  else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1
>> >> +          && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ())
>> >> +    {
>> >> +      /* For lane-reducing op covered by single-lane slp node, the input
>> >> +        vectype of the reduction PHI determines copies of vectorized def-use
>> >> +        cycles, which might be more than effective copies of vectorized lane-
>> >> +        reducing reduction statements.  This could be complemented by
>> >> +        generating extra trivial pass-through copies.  For example:
>> >> +
>> >> +          int sum = 0;
>> >> +          for (i)
>> >> +            {
>> >> +              sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
>> >> +              sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
>> >> +              sum += n[i];               // normal <vector(4) int>
>> >> +            }
>> >> +
>> >> +        The vector size is 128-bit, and the vectorization factor is 16.
>> >> +        Reduction statements would be transformed as:
>> >> +
>> >> +          vector<4> int sum_v0 = { 0, 0, 0, 0 };
>> >> +          vector<4> int sum_v1 = { 0, 0, 0, 0 };
>> >> +          vector<4> int sum_v2 = { 0, 0, 0, 0 };
>> >> +          vector<4> int sum_v3 = { 0, 0, 0, 0 };
>> >> +
>> >> +          for (i / 16)
>> >> +            {
>> >> +              sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
>> >> +              sum_v1 = sum_v1;  // copy
>> >> +              sum_v2 = sum_v2;  // copy
>> >> +              sum_v3 = sum_v3;  // copy
>> >> +
>> >> +              sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
>> >> +              sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
>> >> +              sum_v2 = sum_v2;  // copy
>> >> +              sum_v3 = sum_v3;  // copy
>> >> +
>> >> +              sum_v0 += n_v0[i: 0  ~ 3 ];
>> >> +              sum_v1 += n_v1[i: 4  ~ 7 ];
>> >> +              sum_v2 += n_v2[i: 8  ~ 11];
>> >> +              sum_v3 += n_v3[i: 12 ~ 15];
>> >> +            }
>> >> +       */
>> >> +      unsigned using_ncopies = vec_oprnds[0].length ();
>> >> +      unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
>> >> +
>> >
>> > assert reduc_ncopies >= using_ncopies?  Maybe assert
>> > reduc_index == op.num_ops - 1 given you use one above
>> > and the other below?  Or simply iterate till op.num_ops
>> > and skip i == reduc_index.
>> >
>> >> +      for (unsigned i = 0; i < op.num_ops - 1; i++)
>> >> +       {
>> >> +         gcc_assert (vec_oprnds[i].length () == using_ncopies);
>> >> +         vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
>> >> +       }
>> >> +    }
>> >>
>> >>    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
>> >>    unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
>> >> @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>> >>      {
>> >>        gimple *new_stmt;
>> >>        tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
>> >> -      if (masked_loop_p && !mask_by_cond_expr)
>> >> +
>> >> +      if (!vop[0] || !vop[1])
>> >> +       {
>> >> +         tree reduc_vop = vec_oprnds[reduc_index][i];
>> >> +
>> >> +         /* Insert trivial copy if no need to generate vectorized
>> >> +            statement.  */
>> >> +         gcc_assert (reduc_vop);
>> >> +
>> >> +         new_stmt = gimple_build_assign (vec_dest, reduc_vop);
>> >> +         new_temp = make_ssa_name (vec_dest, new_stmt);
>> >> +         gimple_set_lhs (new_stmt, new_temp);
>> >> +         vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
>> >
>> > I think you could simply do
>> >
>> >                slp_node->push_vec_def (reduc_vop);
>> >                continue;
>> >
>> > without any code generation.
>> >
>>
>> OK, that would be easy.  Here comes another question: this patch assumes a
>> lane-reducing op would always be contained in an slp node, since the
>> single-lane slp node feature has been enabled.  But I got some regressions
>> when I enforced such a constraint in the lane-reducing op check.  Those cases
>> were found to be unvectorizable with single-lane slp, so is this not what we
>> want, and does it need to be fixed?
> 
> Yes, in the end we need to chase down all unsupported cases and fix them
> (there are known issues with load permutes, I'm working on that - hopefully
> when finding a continuous stretch of time...).
> 
>>
>> >> +       }
>> >> +      else if (masked_loop_p && !mask_by_cond_expr)
>> >>         {
>> >>           /* No conditional ifns have been defined for lane-reducing op
>> >>              yet.  */
>> >> @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>> >>
>> >>           if (masked_loop_p && mask_by_cond_expr)
>> >>             {
>> >> +             tree stmt_vectype_in = vectype_in;
>> >> +             unsigned nvectors = vec_num * ncopies;
>> >> +
>> >> +             if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
>> >> +               {
>> >> +                 /* Input vectype of the reduction PHI may be defferent from
>> >
>> > different
>> >
>> >> +                    that of lane-reducing operation.  */
>> >> +                 stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
>> >> +                 nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);
>> >
>> > I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS.
>>
>> To partially vectorize a dot_prod<16 * char> with 128-bit vector width,
>> should we pass (nvectors=4, vectype=<4 * int>) instead of (nvectors=1,
>> vectype=<16 * char>) to vect_get_loop_mask?
> 
> Probably - it depends on the vectorization factor.  What I wanted to
> point out is that
> vec_num (likely from SLP_TREE_NUMBER_OF_VEC_STMTS) is wrong.  The
> place setting SLP_TREE_NUMBER_OF_VEC_STMTS needs to be adjusted,
> or we should forgo it (but that's possibly a post-only-SLP
> cleanup to be done).
> 
> See vect_slp_analyze_node_operations_1 where that's computed.  For reductions
> it's probably not quite right (and we might have latent issues like those
> you are "fixing" with code like above).  The order in which we analyze stmts
> might also not be optimal for reductions with SLP - in fact, given that stmt
> analysis relies on a fixed VF, it would probably make sense to determine the
> reduction VF in advance as well.
> But again this sounds like post-only-SLP cleanup opportunities.
> 
> In the end I might suggest always using the reduction VF and vectype to
> determine the number of vector stmts rather than computing ncopies/vec_num
> separately.
> 
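
If I follow, a standalone sketch of that direction would be something like
the below (my assumption, not the actual computation; the real code would
use poly_uint64 arithmetic and handle partial vectors):

#include <cassert>

/* Hypothetical helper: number of vector stmts for a stmt in the reduction
   cycle, derived from the reduction VF and the stmt's vectype lanes.  */
static unsigned
nvectors_for (unsigned reduc_vf, unsigned vectype_lanes)
{
  assert (reduc_vf % vectype_lanes == 0);
  return reduc_vf / vectype_lanes;
}

int main ()
{
  /* The dot-prod example with VF 16: the vector(4) int accumulator needs
     4 copies, while the vector(16) char input needs only 1.  */
  assert (nvectors_for (16, 4) == 4);
  assert (nvectors_for (16, 16) == 1);
  return 0;
}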

Thanks,
Feng

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
  2024-06-25  9:32       ` Feng Xue OS
@ 2024-06-25 10:26         ` Richard Biener
  2024-06-26 14:50         ` Feng Xue OS
  1 sibling, 0 replies; 9+ messages in thread
From: Richard Biener @ 2024-06-25 10:26 UTC (permalink / raw)
  To: Feng Xue OS; +Cc: gcc-patches

On Tue, Jun 25, 2024 at 11:32 AM Feng Xue OS
<fxue@os.amperecomputing.com> wrote:
>
> >>
> >> >> -      if (slp_node)
> >> >> +      if (slp_node && SLP_TREE_LANES (slp_node) > 1)
> >> >
> >> > Hmm, that looks wrong.  It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off
> >> > instead, which is bad.
> >> >
> >> >>         nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
> >> >>        else
> >> >>         nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
> >> >> @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
> >> >>      }
> >> >>  }
> >> >>
> >> >> +/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
> >> >> +   the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC.
> >> >> +   Now there are three such kinds of operations: dot-prod/widen-sum/sad
> >> >> +   (sum-of-absolute-differences).
> >> >> +
> >> >> +   For a lane-reducing operation, the loop reduction path that it lies in
> >> >> +   may contain a normal operation, or another lane-reducing operation with
> >> >> +   a different input type size, for example:
> >> >> +
> >> >> +     int sum = 0;
> >> >> +     for (i)
> >> >> +       {
> >> >> +         ...
> >> >> +         sum += d0[i] * d1[i];       // dot-prod <vector(16) char>
> >> >> +         sum += w[i];                // widen-sum <vector(16) char>
> >> >> +         sum += abs(s0[i] - s1[i]);  // sad <vector(8) short>
> >> >> +         sum += n[i];                // normal <vector(4) int>
> >> >> +         ...
> >> >> +       }
> >> >> +
> >> >> +   The vectorization factor is essentially determined by the operation
> >> >> +   whose input vectype has the most lanes ("vector(16) char" in the
> >> >> +   example), while we need to choose the input vectype with the least
> >> >> +   lanes ("vector(4) int" in the example) for the reduction PHI
> >> >> +   statement.  */
> >> >> +
> >> >> +bool
> >> >> +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> >> >> +                           slp_tree slp_node, stmt_vector_for_cost *cost_vec)
> >> >> +{
> >> >> +  gimple *stmt = stmt_info->stmt;
> >> >> +
> >> >> +  if (!lane_reducing_stmt_p (stmt))
> >> >> +    return false;
> >> >> +
> >> >> +  tree type = TREE_TYPE (gimple_assign_lhs (stmt));
> >> >> +
> >> >> +  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
> >> >> +    return false;
> >> >> +
> >> >> +  /* Do not try to vectorize bit-precision reductions.  */
> >> >> +  if (!type_has_mode_precision_p (type))
> >> >> +    return false;
> >> >> +
> >> >> +  if (!slp_node)
> >> >> +    return false;
> >> >> +
> >> >> +  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
> >> >> +    {
> >> >> +      stmt_vec_info def_stmt_info;
> >> >> +      slp_tree slp_op;
> >> >> +      tree op;
> >> >> +      tree vectype;
> >> >> +      enum vect_def_type dt;
> >> >> +
> >> >> +      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
> >> >> +                              &slp_op, &dt, &vectype, &def_stmt_info))
> >> >> +       {
> >> >> +         if (dump_enabled_p ())
> >> >> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> +                            "use not simple.\n");
> >> >> +         return false;
> >> >> +       }
> >> >> +
> >> >> +      if (!vectype)
> >> >> +       {
> >> >> +         vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op),
> >> >> +                                                slp_op);
> >> >> +         if (!vectype)
> >> >> +           return false;
> >> >> +       }
> >> >> +
> >> >> +      if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))
> >> >> +       {
> >> >> +         if (dump_enabled_p ())
> >> >> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> +                            "incompatible vector types for invariants\n");
> >> >> +         return false;
> >> >> +       }
> >> >> +
> >> >> +      if (i == STMT_VINFO_REDUC_IDX (stmt_info))
> >> >> +       continue;
> >> >> +
> >> >> +      /* There should be at most one cycle def in the stmt.  */
> >> >> +      if (VECTORIZABLE_CYCLE_DEF (dt))
> >> >> +       return false;
> >> >> +    }
> >> >> +
> >> >> +  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
> >> >> +
> >> >> +  /* TODO: Support lane-reducing operation that does not directly participate
> >> >> +     in loop reduction. */
> >> >> +  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
> >> >> +    return false;
> >> >> +
> >> >> +  /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not
> >> >> +     recoginized.  */
> >> >> +  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
> >> >> +  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
> >> >> +
> >> >> +  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> >> >> +  int ncopies_for_cost;
> >> >> +
> >> >> +  if (SLP_TREE_LANES (slp_node) > 1)
> >> >> +    {
> >> >> +      /* Now lane-reducing operations in a non-single-lane slp node should only
> >> >> +        come from the same loop reduction path.  */
> >> >> +      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
> >> >> +      ncopies_for_cost = 1;
> >> >> +    }
> >> >> +  else
> >> >> +    {
> >> >> +      ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);
> >> >
> >> > OK, so the fact that the ops are lane-reducing means they effectively
> >> > change the VF for the result.  That's only possible as we tightly control
> >> > code generation and "adjust" to the expected VF (by inserting the copies
> >> > you mentioned above), but only up to the highest number of outputs
> >> > created in the reduction chain.  In that sense, instead of talking about and
> >> > recording "input vector types", wouldn't it make more sense to record the effective
> >> > vectorization factor for the reduction instance?  That VF would be at most
> >> > the loop's VF but could be as low as 1.  Once we have a non-lane-reducing
> >> > operation in the reduction chain it would always be equal to the loop's VF.
> >> >
> >> > ncopies would then always be determined by that reduction instance VF and
> >> > the accumulator vector type (STMT_VINFO_VECTYPE).  This reduction
> >> > instance VF would also trivially indicate the force-single-def-use-cycle
> >> > case, possibly simplifying code?
> >>
> >> I tried to add such an effective VF, while the vectype_in is still needed in some
> >> scenarios, such as when checking whether a dot-prod stmt is emulated or not.
> >> The former could be deduced from the latter, so recording both things seems
> >> redundant.  Another consideration is that for a normal op, ncopies is
> >> determined from the type (STMT_VINFO_VECTYPE), but for a lane-reducing op,
> >> it comes from the VF.  So is there a better means to unify them?
> >
> > AFAICS reductions are special in that they, for the accumulation SSA cycle,
> > do not adhere to the loop's VF but as an optimization can choose a smaller one.
> > OTOH STMT_VINFO_VECTYPE is for the vector type used for individual
> > operations which even for lane-reducing ops is adhered to - those just
> > may use a smaller VF, that of the reduction SSA cycle.
> >
> > So what's redundant is STMT_VINFO_REDUC_VECTYPE_IN - or rather
> > it's not fully redundant but needlessly replicated over all stmts participating
> > in the reduction instead of recording the reduction VF in the reduc_info and
> > using that (plus STMT_VINFO_VECTYPE) to compute the effective ncopies
> > for stmts in the reduction cycle.
> >
> > At least that was my idea ...
> >
>
> For lane-reducing ops and the single-defuse-cycle optimization, we could assume
> that no lane is reduced, and always generate vectorized statements according to
> the normal VF; if a placeholder is needed, just insert some trivial statement
> such as a zero-initialization or a pass-through copy.  We could then define an
> "effective VF or ncopies" to control the lane-reducing related aspects in
> analysis and codegen (such as vect_get_loop_mask below).  Since everything will
> become SLP-based eventually, I think a suitable place to add such a field might
> be in slp_node, as a supplement to "vect_stmts_size", and it would be adjusted
> in vectorizable_reduction.  So could we do this refinement in separate patches,
> once the non-SLP code path is removed?

I suppose so.

Thanks,
Richard.

> >> >> +      gcc_assert (ncopies_for_cost >= 1);
> >> >> +    }
> >> >> +
> >> >> +  if (vect_is_emulated_mixed_dot_prod (stmt_info))
> >> >> +    {
> >> >> +      /* We need two extra invariants: one that contains the minimum signed
> >> >> +        value and one that contains half of its negative.  */
> >> >> +      int prologue_stmts = 2;
> >> >> +      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
> >> >> +                                       scalar_to_vec, stmt_info, 0,
> >> >> +                                       vect_prologue);
> >> >> +      if (dump_enabled_p ())
> >> >> +       dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
> >> >> +                    "extra prologue_cost = %d .\n", cost);
> >> >> +
> >> >> +      /* Three dot-products and a subtraction.  */
> >> >> +      ncopies_for_cost *= 4;
> >> >> +    }
> >> >> +
> >> >> +  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
> >> >> +                   vect_body);
> >> >> +
> >> >> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> >> >> +    {
> >> >> +      enum tree_code code = gimple_assign_rhs_code (stmt);
> >> >> +      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
> >> >> +                                                 slp_node, code, type,
> >> >> +                                                 vectype_in);
> >> >> +    }
> >> >> +
> >> >
> >> > Add a comment:
> >> >
> >> >     /* Transform via vect_transform_reduction.  */
> >> >
> >> >> +  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
> >> >> +  return true;
> >> >> +}
> >> >> +
> >> >>  /* Function vectorizable_reduction.
> >> >>
> >> >>     Check if STMT_INFO performs a reduction operation that can be vectorized.
> >> >> @@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>    if (!type_has_mode_precision_p (op.type))
> >> >>      return false;
> >> >>
> >> >> -  /* For lane-reducing ops we're reducing the number of reduction PHIs
> >> >> -     which means the only use of that may be in the lane-reducing operation.  */
> >> >> -  if (lane_reducing
> >> >> -      && reduc_chain_length != 1
> >> >> -      && !only_slp_reduc_chain)
> >> >> -    {
> >> >> -      if (dump_enabled_p ())
> >> >> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> -                        "lane-reducing reduction with extra stmts.\n");
> >> >> -      return false;
> >> >> -    }
> >> >> -
> >> >>    /* Lane-reducing ops also never can be used in a SLP reduction group
> >> >>       since we'll mix lanes belonging to different reductions.  But it's
> >> >>       OK to use them in a reduction chain or when the reduction group
> >> >> @@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>        && loop_vinfo->suggested_unroll_factor == 1)
> >> >>      single_defuse_cycle = true;
> >> >>
> >> >> -  if (single_defuse_cycle || lane_reducing)
> >> >> +  if (single_defuse_cycle && !lane_reducing)
> >> >
> >> > If there's also a non-lane-reducing plus in the chain don't we have to
> >> > check for that reduction op?  So shouldn't it be
> >> > single_defuse_cycle && ... fact that we don't record
> >> > (non-lane-reducing op there) ...
> >>
> >> I don't quite understand this point.  For a non-lane-reducing op in the chain,
> >> shouldn't it be handled in its own vectorizable_xxx function?  The check below
> >> applies only to the first statement (vect_reduction_def) in the reduction.
> >
> > Hmm.  So we have vectorizable_lane_reducing_* for the check on the
> > lane-reducing stmts, vectorizable_* for !single-def-use stmts.  And the
> > following is then just for the case there's a single def that's not
> > lane-reducing
> > and we're forcing a single-def-use and thus go via vect_transform_reduction?
>
> Yes.  A non-lane-reducing op with single-defuse-cycle is handled in this
> function.  This logic is the same as the original.
>
> >> >
> >> >>      {
> >> >>        gcc_assert (op.code != COND_EXPR);
> >> >>
> >> >> -      /* 4. Supportable by target?  */
> >> >> -      bool ok = true;
> >> >> -
> >> >> -      /* 4.1. check support for the operation in the loop
> >> >> +      /* 4. check support for the operation in the loop
> >> >>
> >> >>          This isn't necessary for the lane reduction codes, since they
> >> >>          can only be produced by pattern matching, and it's up to the
> >> >> @@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>          mixed-sign dot-products can be implemented using signed
> >> >>          dot-products.  */
> >> >>        machine_mode vec_mode = TYPE_MODE (vectype_in);
> >> >> -      if (!lane_reducing
> >> >> -         && !directly_supported_p (op.code, vectype_in, optab_vector))
> >> >> +      if (!directly_supported_p (op.code, vectype_in, optab_vector))
> >> >>          {
> >> >>            if (dump_enabled_p ())
> >> >>              dump_printf (MSG_NOTE, "op not supported by target.\n");
> >> >>           if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
> >> >>               || !vect_can_vectorize_without_simd_p (op.code))
> >> >> -           ok = false;
> >> >> +           single_defuse_cycle = false;
> >> >>           else
> >> >>             if (dump_enabled_p ())
> >> >>               dump_printf (MSG_NOTE, "proceeding using word mode.\n");
> >> >> @@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>             dump_printf (MSG_NOTE, "using word mode not possible.\n");
> >> >>           return false;
> >> >>         }
> >> >> -
> >> >> -      /* lane-reducing operations have to go through vect_transform_reduction.
> >> >> -         For the other cases try without the single cycle optimization.  */
> >> >> -      if (!ok)
> >> >> -       {
> >> >> -         if (lane_reducing)
> >> >> -           return false;
> >> >> -         else
> >> >> -           single_defuse_cycle = false;
> >> >> -       }
> >> >>      }
> >> >>    if (dump_enabled_p () && single_defuse_cycle)
> >> >>      dump_printf_loc (MSG_NOTE, vect_location,
> >> >> @@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>                      "multiple vectors to one in the loop body\n");
> >> >>    STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;
> >> >>
> >> >> -  /* If the reduction stmt is one of the patterns that have lane
> >> >> -     reduction embedded we cannot handle the case of ! single_defuse_cycle.  */
> >> >> -  if ((ncopies > 1 && ! single_defuse_cycle)
> >> >> -      && lane_reducing)
> >> >> -    {
> >> >> -      if (dump_enabled_p ())
> >> >> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> -                        "multi def-use cycle not possible for lane-reducing "
> >> >> -                        "reduction operation\n");
> >> >> -      return false;
> >> >> -    }
> >> >> +  /* For a lane-reducing operation, the processing below related to the
> >> >> +     single defuse-cycle will be done in its own vectorizable function.
> >> >> +     One more thing to note is that the operation must not be involved
> >> >> +     in a fold-left reduction.  */
> >> >> +  single_defuse_cycle &= !lane_reducing;
> >> >>
> >> >>    if (slp_node
> >> >> -      && !(!single_defuse_cycle
> >> >> -          && !lane_reducing
> >> >> -          && reduction_type != FOLD_LEFT_REDUCTION))
> >> >> +      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
> >> >>      for (i = 0; i < (int) op.num_ops; i++)
> >> >>        if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
> >> >>         {
> >> >> @@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>    vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
> >> >>                              reduction_type, ncopies, cost_vec);
> >> >>    /* Cost the reduction op inside the loop if transformed via
> >> >> -     vect_transform_reduction.  Otherwise this is costed by the
> >> >> -     separate vectorizable_* routines.  */
> >> >> -  if (single_defuse_cycle || lane_reducing)
> >> >> -    {
> >> >> -      int factor = 1;
> >> >> -      if (vect_is_emulated_mixed_dot_prod (stmt_info))
> >> >> -       /* Three dot-products and a subtraction.  */
> >> >> -       factor = 4;
> >> >> -      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
> >> >> -                       stmt_info, 0, vect_body);
> >> >> -    }
> >> >> +     vect_transform_reduction for non-lane-reducing operation.  Otherwise
> >> >> +     this is costed by the separate vectorizable_* routines.  */
> >> >> +  if (single_defuse_cycle)
> >> >> +    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
> >> >>
> >> >>    if (dump_enabled_p ()
> >> >>        && reduction_type == FOLD_LEFT_REDUCTION)
> >> >>      dump_printf_loc (MSG_NOTE, vect_location,
> >> >>                      "using an in-order (fold-left) reduction.\n");
> >> >>    STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
> >> >> -  /* All but single defuse-cycle optimized, lane-reducing and fold-left
> >> >> -     reductions go through their own vectorizable_* routines.  */
> >> >> -  if (!single_defuse_cycle
> >> >> -      && !lane_reducing
> >> >> -      && reduction_type != FOLD_LEFT_REDUCTION)
> >> >> +
> >> >> +  /* All but single defuse-cycle optimized and fold-left reductions go
> >> >> +     through their own vectorizable_* routines.  */
> >> >> +  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
> >> >>      {
> >> >>        stmt_vec_info tem
> >> >>         = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
> >> >> @@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>    bool lane_reducing = lane_reducing_op_p (code);
> >> >>    gcc_assert (single_defuse_cycle || lane_reducing);
> >> >>
> >> >> +  if (lane_reducing)
> >> >> +    {
> >> >> +      /* The last operand of lane-reducing op is for reduction.  */
> >> >> +      gcc_assert (reduc_index == (int) op.num_ops - 1);
> >> >> +
> >> >> +      /* Now all lane-reducing ops are covered by some slp node.  */
> >> >> +      gcc_assert (slp_node);
> >> >> +    }
> >> >> +
> >> >>    /* Create the destination vector  */
> >> >>    tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
> >> >>    tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
> >> >> @@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>                          reduc_index == 2 ? op.ops[2] : NULL_TREE,
> >> >>                          &vec_oprnds[2]);
> >> >>      }
> >> >> +  else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1
> >> >> +          && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ())
> >> >> +    {
> >> >> +      /* For lane-reducing op covered by single-lane slp node, the input
> >> >> +        vectype of the reduction PHI determines copies of vectorized def-use
> >> >> +        cycles, which might be more than effective copies of vectorized lane-
> >> >> +        reducing reduction statements.  This could be complemented by
> >> >> +        generating extra trivial pass-through copies.  For example:
> >> >> +
> >> >> +          int sum = 0;
> >> >> +          for (i)
> >> >> +            {
> >> >> +              sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
> >> >> +              sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
> >> >> +              sum += n[i];               // normal <vector(4) int>
> >> >> +            }
> >> >> +
> >> >> +        The vector size is 128-bit, and the vectorization factor is 16.
> >> >> +        Reduction statements would be transformed as:
> >> >> +
> >> >> +          vector<4> int sum_v0 = { 0, 0, 0, 0 };
> >> >> +          vector<4> int sum_v1 = { 0, 0, 0, 0 };
> >> >> +          vector<4> int sum_v2 = { 0, 0, 0, 0 };
> >> >> +          vector<4> int sum_v3 = { 0, 0, 0, 0 };
> >> >> +
> >> >> +          for (i / 16)
> >> >> +            {
> >> >> +              sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
> >> >> +              sum_v1 = sum_v1;  // copy
> >> >> +              sum_v2 = sum_v2;  // copy
> >> >> +              sum_v3 = sum_v3;  // copy
> >> >> +
> >> >> +              sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
> >> >> +              sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
> >> >> +              sum_v2 = sum_v2;  // copy
> >> >> +              sum_v3 = sum_v3;  // copy
> >> >> +
> >> >> +              sum_v0 += n_v0[i: 0  ~ 3 ];
> >> >> +              sum_v1 += n_v1[i: 4  ~ 7 ];
> >> >> +              sum_v2 += n_v2[i: 8  ~ 11];
> >> >> +              sum_v3 += n_v3[i: 12 ~ 15];
> >> >> +            }
> >> >> +       */
> >> >> +      unsigned using_ncopies = vec_oprnds[0].length ();
> >> >> +      unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
> >> >> +
> >> >
> >> > assert reduc_ncopies >= using_ncopies?  Maybe assert
> >> > reduc_index == op.num_ops - 1 given you use one above
> >> > and the other below?  Or simply iterate till op.num_ops
> >> > and sip i == reduc_index.
> >> >
> >> >> +      for (unsigned i = 0; i < op.num_ops - 1; i++)
> >> >> +       {
> >> >> +         gcc_assert (vec_oprnds[i].length () == using_ncopies);
> >> >> +         vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
> >> >> +       }
> >> >> +    }
> >> >>
> >> >>    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
> >> >>    unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
> >> >> @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>      {
> >> >>        gimple *new_stmt;
> >> >>        tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
> >> >> -      if (masked_loop_p && !mask_by_cond_expr)
> >> >> +
> >> >> +      if (!vop[0] || !vop[1])
> >> >> +       {
> >> >> +         tree reduc_vop = vec_oprnds[reduc_index][i];
> >> >> +
> >> >> +         /* Insert trivial copy if no need to generate vectorized
> >> >> +            statement.  */
> >> >> +         gcc_assert (reduc_vop);
> >> >> +
> >> >> +         new_stmt = gimple_build_assign (vec_dest, reduc_vop);
> >> >> +         new_temp = make_ssa_name (vec_dest, new_stmt);
> >> >> +         gimple_set_lhs (new_stmt, new_temp);
> >> >> +         vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> >> >
> >> > I think you could simply do
> >> >
> >> >                slp_node->push_vec_def (reduc_vop);
> >> >                continue;
> >> >
> >> > without any code generation.
> >> >
> >>
> >> OK, that would be easy. Here comes another question, this patch assumes
> >> lane-reducing op would always be contained in a slp node, since single-lane
> >> slp node feature has been enabled. But I got some regression if I enforced
> >> such constraint on lane-reducing op check. Those cases are founded to
> >> be unvectorizable with single-lane slp, so this should not be what we want?
> >> and need to be fixed?
> >
> > Yes, in the end we need to chase down all unsupported cases and fix them
> > (there's known issues with load permutes, I'm working on that - hopefully
> > when finding a continuous stretch of time...).
> >
> >>
> >> >> +       }
> >> >> +      else if (masked_loop_p && !mask_by_cond_expr)
> >> >>         {
> >> >>           /* No conditional ifns have been defined for lane-reducing op
> >> >>              yet.  */
> >> >> @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>
> >> >>           if (masked_loop_p && mask_by_cond_expr)
> >> >>             {
> >> >> +             tree stmt_vectype_in = vectype_in;
> >> >> +             unsigned nvectors = vec_num * ncopies;
> >> >> +
> >> >> +             if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
> >> >> +               {
> >> >> +                 /* Input vectype of the reduction PHI may be defferent from
> >> >
> >> > different
> >> >
> >> >> +                    that of lane-reducing operation.  */
> >> >> +                 stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> >> >> +                 nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);
> >> >
> >> > I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS.
> >>
> >> To partially vectorizing a dot_prod<16 * char> with 128-bit vector width,
> >> we should pass (nvector=4, vectype=<4 *int>) instead of (nvector=1, vectype=<16 *char>)
> >> to vect_get_loop_mask?
> >
> > Probably - it depends on the vectorization factor.  What I wanted to
> > point out is that
> > vec_num (likely from SLP_TREE_NUMBER_OF_VEC_STMTS) is wrong.  The
> > place setting SLP_TREE_NUMBER_OF_VEC_STMTS needs to be adjusted,
> > or we should forgo with it (but that's possibly a post-only-SLP
> > cleanup to be done).
> >
> > See vect_slp_analyze_node_operations_1 where that's computed.  For reductions
> > it's probably not quite right (and we might have latent issues like
> > those you are
> > "fixing" with code like above).  The order we analyze stmts might also be not
> > optimal for reductions with SLP - in fact given that stmt analysis
> > relies on a fixed VF
> > it would probably make sense to determine the reduction VF in advance as well.
> > But again this sounds like post-only-SLP cleanup opportunities.
> >
> > In the end I might suggest to always use reduct-VF and vectype to determine
> > the number of vector stmts rather than computing ncopies/vec_num separately.
> >
>
> Thanks,
> Feng


* Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
  2024-06-25  9:32       ` Feng Xue OS
  2024-06-25 10:26         ` Richard Biener
@ 2024-06-26 14:50         ` Feng Xue OS
  2024-06-28 13:06           ` Richard Biener
  1 sibling, 1 reply; 9+ messages in thread
From: Feng Xue OS @ 2024-06-26 14:50 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 69721 bytes --]

Updated the patch.

For a lane-reducing operation (dot-prod/widen-sum/sad) in a loop reduction, the
current vectorizer can only handle the pattern if the reduction chain contains
no other operation, whether normal or lane-reducing.

Actually, to allow multiple arbitrary lane-reducing operations, we need to
support vectorization of a loop reduction chain with mixed input vectypes.
Since the number of lanes in a vectype may vary with the operation, the
effective ncopies of the vectorized statements may also differ from operation
to operation, which causes a mismatch among the vectorized def-use cycles. A
simple way is to align all operations with the one that has the most ncopies;
the gap can be filled by generating extra trivial pass-through copies. For
example:

   int sum = 0;
   for (i)
     {
       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
       sum += w[i];               // widen-sum <vector(16) char>
       sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
       sum += n[i];               // normal <vector(4) int>
     }

The vector size is 128-bit and the vectorization factor is 16. Reduction
statements would be transformed as:

   vector<4> int sum_v0 = { 0, 0, 0, 0 };
   vector<4> int sum_v1 = { 0, 0, 0, 0 };
   vector<4> int sum_v2 = { 0, 0, 0, 0 };
   vector<4> int sum_v3 = { 0, 0, 0, 0 };

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 += n_v0[i: 0  ~ 3 ];
       sum_v1 += n_v1[i: 4  ~ 7 ];
       sum_v2 += n_v2[i: 8  ~ 11];
       sum_v3 += n_v3[i: 12 ~ 15];
     }
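
Concretely, the padding is nearly free at transform time: the shorter operand
vectors are grown to the length dictated by the reduction PHI, and a NULL slot
then means "pass-through", for which the def of the incoming reduction vector
can be reused instead of emitting a real copy statement. A minimal sketch of
the idea (simplified from the vect_transform_reduction hunk below):

   unsigned using_ncopies = vec_oprnds[0].length ();
   unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();

   if (using_ncopies < reduc_ncopies)
     for (unsigned i = 0; i < op.num_ops - 1; i++)
       /* NULL-fill the tail of each non-reduction operand vector.  */
       vec_oprnds[i].safe_grow_cleared (reduc_ncopies);

   ...

   if (!vop[0] || !vop[1])  /* A padded slot: reuse the reduction def.  */
     new_stmt = SSA_NAME_DEF_STMT (vec_oprnds[reduc_index][i]);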

2024-03-22 Feng Xue <fxue@os.amperecomputing.com>

gcc/
        PR tree-optimization/114440
        * tree-vectorizer.h (vectorizable_lane_reducing): New function
        declaration.
        * tree-vect-stmts.cc (vect_analyze_stmt): Call new function
        vectorizable_lane_reducing to analyze lane-reducing operation.
        * tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation
        code related to emulated_mixed_dot_prod.
        (vect_reduction_update_partial_vector_usage): Compute ncopies in the
        original way for a single-lane slp node.
        (vectorizable_lane_reducing): New function.
        (vectorizable_reduction): Allow multiple lane-reducing operations in
        loop reduction. Move some original lane-reducing related code to
        vectorizable_lane_reducing.
        (vect_transform_reduction): Extend transformation to support reduction
        statements with mixed input vectypes.

gcc/testsuite/
        PR tree-optimization/114440
        * gcc.dg/vect/vect-reduc-chain-1.c: New test.
        * gcc.dg/vect/vect-reduc-chain-2.c: New test.
        * gcc.dg/vect/vect-reduc-chain-3.c: New test.
        * gcc.dg/vect/vect-reduc-chain-dot-slp-1.c: New test.
        * gcc.dg/vect/vect-reduc-chain-dot-slp-2.c: New test.
        * gcc.dg/vect/vect-reduc-chain-dot-slp-3.c: New test.
        * gcc.dg/vect/vect-reduc-chain-dot-slp-4.c: New test.
        * gcc.dg/vect/vect-reduc-dot-slp-1.c: New test.
---
 .../gcc.dg/vect/vect-reduc-chain-1.c          |  62 ++++
 .../gcc.dg/vect/vect-reduc-chain-2.c          |  77 ++++
 .../gcc.dg/vect/vect-reduc-chain-3.c          |  66 ++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c  |  95 +++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c  |  67 ++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c  |  79 +++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c  |  63 ++++
 .../gcc.dg/vect/vect-reduc-dot-slp-1.c        |  60 ++++
 gcc/tree-vect-loop.cc                         | 333 ++++++++++++++----
 gcc/tree-vect-stmts.cc                        |   2 +
 gcc/tree-vectorizer.h                         |   2 +
 11 files changed, 836 insertions(+), 70 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c

diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
new file mode 100644
index 00000000000..04bfc419dbd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
@@ -0,0 +1,62 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_2 char *restrict c,
+   SIGNEDNESS_2 char *restrict d,
+   SIGNEDNESS_1 int *restrict e)
+{
+  for (int i = 0; i < N; ++i)
+    {
+      res += a[i] * b[i];
+      res += c[i] * d[i];
+      res += e[i];
+    }
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[N], b[N];
+  SIGNEDNESS_2 char c[N], d[N];
+  SIGNEDNESS_1 int e[N];
+  int expected = 0x12345;
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      c[i] = BASE + i * 2;
+      d[i] = BASE + OFFSET + i * 3;
+      e[i] = i;
+      asm volatile ("" ::: "memory");
+      expected += a[i] * b[i];
+      expected += c[i] * d[i];
+      expected += e[i];
+    }
+  if (f (0x12345, a, b, c, d, e) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 2 "vect" { target vect_sdot_qi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
new file mode 100644
index 00000000000..6c803b80120
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
@@ -0,0 +1,77 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 unsigned
+#define SIGNEDNESS_3 signed
+#define SIGNEDNESS_4 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+fn (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_3 char *restrict c,
+   SIGNEDNESS_3 char *restrict d,
+   SIGNEDNESS_4 short *restrict e,
+   SIGNEDNESS_4 short *restrict f,
+   SIGNEDNESS_1 int *restrict g)
+{
+  for (int i = 0; i < N; ++i)
+    {
+      res += a[i] * b[i];
+      res += i + 1;
+      res += c[i] * d[i];
+      res += e[i] * f[i];
+      res += g[i];
+    }
+  return res;
+}
+
+#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -126 : 4)
+#define BASE4 ((SIGNEDNESS_4 int) -1 < 0 ? -1026 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[N], b[N];
+  SIGNEDNESS_3 char c[N], d[N];
+  SIGNEDNESS_4 short e[N], f[N];
+  SIGNEDNESS_1 int g[N];
+  int expected = 0x12345;
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE2 + i * 5;
+      b[i] = BASE2 + OFFSET + i * 4;
+      c[i] = BASE3 + i * 2;
+      d[i] = BASE3 + OFFSET + i * 3;
+      e[i] = BASE4 + i * 6;
+      f[i] = BASE4 + OFFSET + i * 5;
+      g[i] = i;
+      asm volatile ("" ::: "memory");
+      expected += a[i] * b[i];
+      expected += i + 1;
+      expected += c[i] * d[i];
+      expected += e[i] * f[i];
+      expected += g[i];
+    }
+  if (fn (0x12345, a, b, c, d, e, f, g) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_sdot_qi } } } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_udot_qi } } } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_sdot_hi } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
new file mode 100644
index 00000000000..a41e4b176c4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
@@ -0,0 +1,66 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 unsigned
+#define SIGNEDNESS_3 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_3 short *restrict c,
+   SIGNEDNESS_3 short *restrict d,
+   SIGNEDNESS_1 int *restrict e)
+{
+  for (int i = 0; i < N; ++i)
+    {
+      short diff = a[i] - b[i];
+      SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff;
+      res += abs;
+      res += c[i] * d[i];
+      res += e[i];
+    }
+  return res;
+}
+
+#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -1236 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[N], b[N];
+  SIGNEDNESS_3 short c[N], d[N];
+  SIGNEDNESS_1 int e[N];
+  int expected = 0x12345;
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE2 + i * 5;
+      b[i] = BASE2 - i * 4;
+      c[i] = BASE3 + i * 2;
+      d[i] = BASE3 + OFFSET + i * 3;
+      e[i] = i;
+      asm volatile ("" ::: "memory");
+      short diff = a[i] - b[i];
+      SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff;
+      expected += abs;
+      expected += c[i] * d[i];
+      expected += e[i];
+    }
+  if (f (0x12345, a, b, c, d, e) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = SAD_EXPR" "vect" { target vect_udot_qi } } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target vect_sdot_hi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
new file mode 100644
index 00000000000..c2831fbcc8e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
@@ -0,0 +1,95 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *a,
+   SIGNEDNESS_2 char *b,
+   int step, int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[0] * b[0];
+      res += a[1] * b[1];
+      res += a[2] * b[2];
+      res += a[3] * b[3];
+      res += a[4] * b[4];
+      res += a[5] * b[5];
+      res += a[6] * b[6];
+      res += a[7] * b[7];
+      res += a[8] * b[8];
+      res += a[9] * b[9];
+      res += a[10] * b[10];
+      res += a[11] * b[11];
+      res += a[12] * b[12];
+      res += a[13] * b[13];
+      res += a[14] * b[14];
+      res += a[15] * b[15];
+
+      a += step;
+      b += step;
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[100], b[100];
+  int expected = 0x12345;
+  int step = 16;
+  int n = 2;
+  int t = 0;
+
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      asm volatile ("" ::: "memory");
+    }
+
+  for (int i = 0; i < n; i++)
+    {
+      asm volatile ("" ::: "memory");
+      expected += a[t + 0] * b[t + 0];
+      expected += a[t + 1] * b[t + 1];
+      expected += a[t + 2] * b[t + 2];
+      expected += a[t + 3] * b[t + 3];
+      expected += a[t + 4] * b[t + 4];
+      expected += a[t + 5] * b[t + 5];
+      expected += a[t + 6] * b[t + 6];
+      expected += a[t + 7] * b[t + 7];
+      expected += a[t + 8] * b[t + 8];
+      expected += a[t + 9] * b[t + 9];
+      expected += a[t + 10] * b[t + 10];
+      expected += a[t + 11] * b[t + 11];
+      expected += a[t + 12] * b[t + 12];
+      expected += a[t + 13] * b[t + 13];
+      expected += a[t + 14] * b[t + 14];
+      expected += a[t + 15] * b[t + 15];
+      t += step;
+    }
+
+  if (f (0x12345, a, b, step, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 16 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
new file mode 100644
index 00000000000..4114264a364
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
@@ -0,0 +1,67 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *a,
+   SIGNEDNESS_2 char *b,
+   int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[5 * i + 0] * b[5 * i + 0];
+      res += a[5 * i + 1] * b[5 * i + 1];
+      res += a[5 * i + 2] * b[5 * i + 2];
+      res += a[5 * i + 3] * b[5 * i + 3];
+      res += a[5 * i + 4] * b[5 * i + 4];
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[100], b[100];
+  int expected = 0x12345;
+  int n = 18;
+
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      asm volatile ("" ::: "memory");
+    }
+
+  for (int i = 0; i < n; i++)
+    {
+      asm volatile ("" ::: "memory");
+      expected += a[5 * i + 0] * b[5 * i + 0];
+      expected += a[5 * i + 1] * b[5 * i + 1];
+      expected += a[5 * i + 2] * b[5 * i + 2];
+      expected += a[5 * i + 3] * b[5 * i + 3];
+      expected += a[5 * i + 4] * b[5 * i + 4];
+    }
+
+  if (f (0x12345, a, b, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 5 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
new file mode 100644
index 00000000000..2cdecc36d16
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
@@ -0,0 +1,79 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 short *a,
+   SIGNEDNESS_2 short *b,
+   int step, int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[0] * b[0];
+      res += a[1] * b[1];
+      res += a[2] * b[2];
+      res += a[3] * b[3];
+      res += a[4] * b[4];
+      res += a[5] * b[5];
+      res += a[6] * b[6];
+      res += a[7] * b[7];
+
+      a += step;
+      b += step;
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 short a[100], b[100];
+  int expected = 0x12345;
+  int step = 8;
+  int n = 2;
+  int t = 0;
+
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      asm volatile ("" ::: "memory");
+    }
+
+  for (int i = 0; i < n; i++)
+    {
+      asm volatile ("" ::: "memory");
+      expected += a[t + 0] * b[t + 0];
+      expected += a[t + 1] * b[t + 1];
+      expected += a[t + 2] * b[t + 2];
+      expected += a[t + 3] * b[t + 3];
+      expected += a[t + 4] * b[t + 4];
+      expected += a[t + 5] * b[t + 5];
+      expected += a[t + 6] * b[t + 6];
+      expected += a[t + 7] * b[t + 7];
+      t += step;
+    }
+
+  if (f (0x12345, a, b, step, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 8 "vect"  { target vect_sdot_hi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
new file mode 100644
index 00000000000..32c0f30c77b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
@@ -0,0 +1,63 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 short *a,
+   SIGNEDNESS_2 short *b,
+   int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[3 * i + 0] * b[3 * i + 0];
+      res += a[3 * i + 1] * b[3 * i + 1];
+      res += a[3 * i + 2] * b[3 * i + 2];
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 short a[100], b[100];
+  int expected = 0x12345;
+  int n = 18;
+
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      asm volatile ("" ::: "memory");
+    }
+
+  for (int i = 0; i < n; i++)
+    {
+      asm volatile ("" ::: "memory");
+      expected += a[3 * i + 0] * b[3 * i + 0];
+      expected += a[3 * i + 1] * b[3 * i + 1];
+      expected += a[3 * i + 2] * b[3 * i + 2];
+    }
+
+  if (f (0x12345, a, b, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 3 "vect"  { target vect_sdot_hi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
new file mode 100644
index 00000000000..84c82b023d4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
@@ -0,0 +1,60 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-do compile } */
+/* { dg-additional-options "--param vect-epilogues-nomask=0 -fdump-tree-optimized" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res0,
+   SIGNEDNESS_1 int res1,
+   SIGNEDNESS_1 int res2,
+   SIGNEDNESS_1 int res3,
+   SIGNEDNESS_1 int res4,
+   SIGNEDNESS_1 int res5,
+   SIGNEDNESS_1 int res6,
+   SIGNEDNESS_1 int res7,
+   SIGNEDNESS_1 int res8,
+   SIGNEDNESS_1 int res9,
+   SIGNEDNESS_1 int resA,
+   SIGNEDNESS_1 int resB,
+   SIGNEDNESS_1 int resC,
+   SIGNEDNESS_1 int resD,
+   SIGNEDNESS_1 int resE,
+   SIGNEDNESS_1 int resF,
+   SIGNEDNESS_2 char *a,
+   SIGNEDNESS_2 char *b)
+{
+  for (int i = 0; i < 64; i += 16)
+    {
+      res0 += a[i + 0x00] * b[i + 0x00];
+      res1 += a[i + 0x01] * b[i + 0x01];
+      res2 += a[i + 0x02] * b[i + 0x02];
+      res3 += a[i + 0x03] * b[i + 0x03];
+      res4 += a[i + 0x04] * b[i + 0x04];
+      res5 += a[i + 0x05] * b[i + 0x05];
+      res6 += a[i + 0x06] * b[i + 0x06];
+      res7 += a[i + 0x07] * b[i + 0x07];
+      res8 += a[i + 0x08] * b[i + 0x08];
+      res9 += a[i + 0x09] * b[i + 0x09];
+      resA += a[i + 0x0A] * b[i + 0x0A];
+      resB += a[i + 0x0B] * b[i + 0x0B];
+      resC += a[i + 0x0C] * b[i + 0x0C];
+      resD += a[i + 0x0D] * b[i + 0x0D];
+      resE += a[i + 0x0E] * b[i + 0x0E];
+      resF += a[i + 0x0F] * b[i + 0x0F];
+    }
+
+  return res0 ^ res1 ^ res2 ^ res3 ^ res4 ^ res5 ^ res6 ^ res7 ^
+         res8 ^ res9 ^ resA ^ resB ^ resC ^ resD ^ resE ^ resF;
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump-not "DOT_PROD_EXPR" "optimized" } } */
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 419f4b08d2b..6bfb0e72905 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -5324,8 +5324,6 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
   if (!gimple_extract_op (orig_stmt_info->stmt, &op))
     gcc_unreachable ();

-  bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
-
   if (reduction_type == EXTRACT_LAST_REDUCTION)
     /* No extra instructions are needed in the prologue.  The loop body
        operations are costed in vectorizable_condition.  */
@@ -5360,12 +5358,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
           initial result of the data reduction, initial value of the index
           reduction.  */
        prologue_stmts = 4;
-      else if (emulated_mixed_dot_prod)
-       /* We need the initial reduction value and two invariants:
-          one that contains the minimum signed value and one that
-          contains half of its negative.  */
-       prologue_stmts = 3;
       else
+       /* We need the initial reduction value.  */
        prologue_stmts = 1;
       prologue_cost += record_stmt_cost (cost_vec, prologue_stmts,
                                         scalar_to_vec, stmt_info, 0,
@@ -7466,7 +7460,10 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
       vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
       unsigned nvectors;

-      if (slp_node)
+      /* TODO: The number of vector statements for a lane-reducing op is
+        over-estimated; we have to recompute it when the containing slp node
+        is single-lane.  Need a general means to correct this value.  */
+      if (slp_node && SLP_TREE_LANES (slp_node) > 1)
        nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
       else
        nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
@@ -7478,6 +7475,154 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
     }
 }

+/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
+   the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC.
+   Now there are three such kinds of operations: dot-prod/widen-sum/sad
+   (sum-of-absolute-differences).
+
+   For a lane-reducing operation, the loop reduction path that it lies in,
+   may contain normal operation, or other lane-reducing operation of different
+   input type size, an example as:
+
+     int sum = 0;
+     for (i)
+       {
+         ...
+         sum += d0[i] * d1[i];       // dot-prod <vector(16) char>
+         sum += w[i];                // widen-sum <vector(16) char>
+         sum += abs(s0[i] - s1[i]);  // sad <vector(8) short>
+         sum += n[i];                // normal <vector(4) int>
+         ...
+       }
+
+   Vectorization factor is essentially determined by operation whose input
+   vectype has the most lanes ("vector(16) char" in the example), while we
+   need to choose input vectype with the least lanes ("vector(4) int" in the
+   example) for the reduction PHI statement.  */
+
+bool
+vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
+                           slp_tree slp_node, stmt_vector_for_cost *cost_vec)
+{
+  gimple *stmt = stmt_info->stmt;
+
+  if (!lane_reducing_stmt_p (stmt))
+    return false;
+
+  tree type = TREE_TYPE (gimple_assign_lhs (stmt));
+
+  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
+    return false;
+
+  /* Do not try to vectorize bit-precision reductions.  */
+  if (!type_has_mode_precision_p (type))
+    return false;
+
+  /* A lane-reducing op should be contained in some slp node.  */
+  if (!slp_node)
+    return false;
+
+  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
+    {
+      stmt_vec_info def_stmt_info;
+      slp_tree slp_op;
+      tree op;
+      tree vectype;
+      enum vect_def_type dt;
+
+      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
+                              &slp_op, &dt, &vectype, &def_stmt_info))
+       {
+         if (dump_enabled_p ())
+           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                            "use not simple.\n");
+         return false;
+       }
+
+      if (!vectype)
+       {
+         vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op),
+                                                slp_op);
+         if (!vectype)
+           return false;
+       }
+
+      if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))
+       {
+         if (dump_enabled_p ())
+           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                            "incompatible vector types for invariants\n");
+         return false;
+       }
+
+      if (i == STMT_VINFO_REDUC_IDX (stmt_info))
+       continue;
+
+      /* There should be at most one cycle def in the stmt.  */
+      if (VECTORIZABLE_CYCLE_DEF (dt))
+       return false;
+    }
+
+  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
+
+  /* TODO: Support lane-reducing operation that does not directly participate
+     in loop reduction.  */
+  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
+    return false;
+
+  /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not
+     recognized.  */
+  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
+  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
+
+  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
+  int ncopies_for_cost;
+
+  if (SLP_TREE_LANES (slp_node) > 1)
+    {
+      /* Now lane-reducing operations in a non-single-lane slp node should only
+        come from the same loop reduction path.  */
+      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
+      ncopies_for_cost = 1;
+    }
+  else
+    {
+      ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);
+      gcc_assert (ncopies_for_cost >= 1);
+    }
+
+  if (vect_is_emulated_mixed_dot_prod (stmt_info))
+    {
+      /* We need extra two invariants: one that contains the minimum signed
+        value and one that contains half of its negative.  */
+      int prologue_stmts = 2;
+      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
+                                       scalar_to_vec, stmt_info, 0,
+                                       vect_prologue);
+      if (dump_enabled_p ())
+       dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
+                    "extra prologue_cost = %d .\n", cost);
+
+      /* Three dot-products and a subtraction.  */
+      ncopies_for_cost *= 4;
+    }
+
+  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
+                   vect_body);
+
+  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+    {
+      enum tree_code code = gimple_assign_rhs_code (stmt);
+      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
+                                                 slp_node, code, type,
+                                                 vectype_in);
+    }
+
+  /* Transform via vect_transform_reduction.  */
+  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
+  return true;
+}
+
 /* Function vectorizable_reduction.

    Check if STMT_INFO performs a reduction operation that can be vectorized.
@@ -7811,18 +7956,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   if (!type_has_mode_precision_p (op.type))
     return false;

-  /* For lane-reducing ops we're reducing the number of reduction PHIs
-     which means the only use of that may be in the lane-reducing operation.  */
-  if (lane_reducing
-      && reduc_chain_length != 1
-      && !only_slp_reduc_chain)
-    {
-      if (dump_enabled_p ())
-       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                        "lane-reducing reduction with extra stmts.\n");
-      return false;
-    }
-
   /* Lane-reducing ops also never can be used in a SLP reduction group
      since we'll mix lanes belonging to different reductions.  But it's
      OK to use them in a reduction chain or when the reduction group
@@ -8362,14 +8495,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
       && loop_vinfo->suggested_unroll_factor == 1)
     single_defuse_cycle = true;

-  if (single_defuse_cycle || lane_reducing)
+  if (single_defuse_cycle && !lane_reducing)
     {
       gcc_assert (op.code != COND_EXPR);

-      /* 4. Supportable by target?  */
-      bool ok = true;
-
-      /* 4.1. check support for the operation in the loop
+      /* 4. check support for the operation in the loop

         This isn't necessary for the lane reduction codes, since they
         can only be produced by pattern matching, and it's up to the
@@ -8378,14 +8508,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
         mixed-sign dot-products can be implemented using signed
         dot-products.  */
       machine_mode vec_mode = TYPE_MODE (vectype_in);
-      if (!lane_reducing
-         && !directly_supported_p (op.code, vectype_in, optab_vector))
+      if (!directly_supported_p (op.code, vectype_in, optab_vector))
         {
           if (dump_enabled_p ())
             dump_printf (MSG_NOTE, "op not supported by target.\n");
          if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
              || !vect_can_vectorize_without_simd_p (op.code))
-           ok = false;
+           single_defuse_cycle = false;
          else
            if (dump_enabled_p ())
              dump_printf (MSG_NOTE, "proceeding using word mode.\n");
@@ -8398,16 +8527,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
            dump_printf (MSG_NOTE, "using word mode not possible.\n");
          return false;
        }
-
-      /* lane-reducing operations have to go through vect_transform_reduction.
-         For the other cases try without the single cycle optimization.  */
-      if (!ok)
-       {
-         if (lane_reducing)
-           return false;
-         else
-           single_defuse_cycle = false;
-       }
     }
   if (dump_enabled_p () && single_defuse_cycle)
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -8415,22 +8534,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
                     "multiple vectors to one in the loop body\n");
   STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;

-  /* If the reduction stmt is one of the patterns that have lane
-     reduction embedded we cannot handle the case of ! single_defuse_cycle.  */
-  if ((ncopies > 1 && ! single_defuse_cycle)
-      && lane_reducing)
-    {
-      if (dump_enabled_p ())
-       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                        "multi def-use cycle not possible for lane-reducing "
-                        "reduction operation\n");
-      return false;
-    }
+  /* For lane-reducing operation, the below processing related to single
+     defuse-cycle will be done in its own vectorizable function.  One more
+     thing to note is that the operation must not be involved in fold-left
+     reduction.  */
+  single_defuse_cycle &= !lane_reducing;

   if (slp_node
-      && !(!single_defuse_cycle
-          && !lane_reducing
-          && reduction_type != FOLD_LEFT_REDUCTION))
+      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
     for (i = 0; i < (int) op.num_ops; i++)
       if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
        {
@@ -8443,28 +8554,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
                             reduction_type, ncopies, cost_vec);
   /* Cost the reduction op inside the loop if transformed via
-     vect_transform_reduction.  Otherwise this is costed by the
-     separate vectorizable_* routines.  */
-  if (single_defuse_cycle || lane_reducing)
-    {
-      int factor = 1;
-      if (vect_is_emulated_mixed_dot_prod (stmt_info))
-       /* Three dot-products and a subtraction.  */
-       factor = 4;
-      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
-                       stmt_info, 0, vect_body);
-    }
+     vect_transform_reduction for non-lane-reducing operation.  Otherwise
+     this is costed by the separate vectorizable_* routines.  */
+  if (single_defuse_cycle)
+    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);

   if (dump_enabled_p ()
       && reduction_type == FOLD_LEFT_REDUCTION)
     dump_printf_loc (MSG_NOTE, vect_location,
                     "using an in-order (fold-left) reduction.\n");
   STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
-  /* All but single defuse-cycle optimized, lane-reducing and fold-left
-     reductions go through their own vectorizable_* routines.  */
-  if (!single_defuse_cycle
-      && !lane_reducing
-      && reduction_type != FOLD_LEFT_REDUCTION)
+
+  /* All but single defuse-cycle optimized and fold-left reductions go
+     through their own vectorizable_* routines.  */
+  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
     {
       stmt_vec_info tem
        = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
@@ -8654,6 +8757,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   bool lane_reducing = lane_reducing_op_p (code);
   gcc_assert (single_defuse_cycle || lane_reducing);

+  if (lane_reducing)
+    {
+      /* The last operand of lane-reducing op is for reduction.  */
+      gcc_assert (reduc_index == (int) op.num_ops - 1);
+
+      /* Now lane-reducing op is contained in some slp node.  */
+      gcc_assert (slp_node);
+    }
+
   /* Create the destination vector  */
   tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
   tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
@@ -8698,6 +8810,62 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
                         reduc_index == 2 ? op.ops[2] : NULL_TREE,
                         &vec_oprnds[2]);
     }
+  else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
+    {
+      /* For a lane-reducing op covered by a single-lane slp node, the input
+        vectype of the reduction PHI determines the number of vectorized
+        def-use cycles, which might be more than the effective number of
+        copies of the vectorized lane-reducing statements.  The gap could be
+        filled by generating extra trivial pass-through copies.  For example:
+
+          int sum = 0;
+          for (i)
+            {
+              sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
+              sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
+              sum += n[i];               // normal <vector(4) int>
+            }
+
+        The vector size is 128-bit and the vectorization factor is 16.
+        Reduction
+        statements would be transformed as:
+
+          vector<4> int sum_v0 = { 0, 0, 0, 0 };
+          vector<4> int sum_v1 = { 0, 0, 0, 0 };
+          vector<4> int sum_v2 = { 0, 0, 0, 0 };
+          vector<4> int sum_v3 = { 0, 0, 0, 0 };
+
+          for (i / 16)
+            {
+              sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
+              sum_v1 = sum_v1;  // copy
+              sum_v2 = sum_v2;  // copy
+              sum_v3 = sum_v3;  // copy
+
+              sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
+              sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
+              sum_v2 = sum_v2;  // copy
+              sum_v3 = sum_v3;  // copy
+
+              sum_v0 += n_v0[i: 0  ~ 3 ];
+              sum_v1 += n_v1[i: 4  ~ 7 ];
+              sum_v2 += n_v2[i: 8  ~ 11];
+              sum_v3 += n_v3[i: 12 ~ 15];
+            }
+       */
+      unsigned using_ncopies = vec_oprnds[0].length ();
+      unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
+
+      gcc_assert (using_ncopies <= reduc_ncopies);
+
+      if (using_ncopies < reduc_ncopies)
+       {
+         for (unsigned i = 0; i < op.num_ops - 1; i++)
+           {
+             gcc_assert (vec_oprnds[i].length () == using_ncopies);
+             vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
+           }
+       }
+    }

   bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
   unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
@@ -8706,7 +8874,18 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
     {
       gimple *new_stmt;
       tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
-      if (masked_loop_p && !mask_by_cond_expr)
+
+      if (!vop[0] || !vop[1])
+       {
+         tree reduc_vop = vec_oprnds[reduc_index][i];
+
+         /* No need to generate a vectorized statement for this pass-through
+            copy; reuse the def of the incoming reduction vector instead.  */
+         gcc_assert (reduc_vop);
+
+         new_stmt = SSA_NAME_DEF_STMT (reduc_vop);
+       }
+      else if (masked_loop_p && !mask_by_cond_expr)
        {
          /* No conditional ifns have been defined for lane-reducing op
             yet.  */
@@ -8735,8 +8914,22 @@ vect_transform_reduction (loop_vec_info loop_vinfo,

          if (masked_loop_p && mask_by_cond_expr)
            {
+             unsigned nvectors = vec_num * ncopies;
+             tree stmt_vectype_in = vectype_in;
+
+             /* For a single-lane slp node on a lane-reducing op, we need to
+                compute the exact number of vector stmts from its input
+                vectype, since the value got from the slp node is
+                over-estimated.  TODO: properly set this number somewhere,
+                so that this fixup could be removed.  */
+             if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
+               {
+                 stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
+                 nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);
+               }
+
              tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks,
-                                             vec_num * ncopies, vectype_in, i);
+                                             nvectors, stmt_vectype_in, i);
              build_vect_cond_expr (code, vop, mask, gsi);
            }

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 840e162c7f0..845647b4399 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -13350,6 +13350,8 @@ vect_analyze_stmt (vec_info *vinfo,
                                      NULL, NULL, node, cost_vec)
          || vectorizable_load (vinfo, stmt_info, NULL, NULL, node, cost_vec)
          || vectorizable_store (vinfo, stmt_info, NULL, NULL, node, cost_vec)
+         || vectorizable_lane_reducing (as_a <loop_vec_info> (vinfo),
+                                        stmt_info, node, cost_vec)
          || vectorizable_reduction (as_a <loop_vec_info> (vinfo), stmt_info,
                                     node, node_instance, cost_vec)
          || vectorizable_induction (as_a <loop_vec_info> (vinfo), stmt_info,
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 60224f4e284..94736736dcc 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2455,6 +2455,8 @@ extern loop_vec_info vect_create_loop_vinfo (class loop *, vec_info_shared *,
 extern bool vectorizable_live_operation (vec_info *, stmt_vec_info,
                                         slp_tree, slp_instance, int,
                                         bool, stmt_vector_for_cost *);
+extern bool vectorizable_lane_reducing (loop_vec_info, stmt_vec_info,
+                                       slp_tree, stmt_vector_for_cost *);
 extern bool vectorizable_reduction (loop_vec_info, stmt_vec_info,
                                    slp_tree, slp_instance,
                                    stmt_vector_for_cost *);
--
2.17.1

________________________________________
From: Feng Xue OS <fxue@os.amperecomputing.com>
Sent: Tuesday, June 25, 2024 5:32 PM
To: Richard Biener
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]

>>
>> >> -      if (slp_node)
>> >> +      if (slp_node && SLP_TREE_LANES (slp_node) > 1)
>> >
>> > Hmm, that looks wrong.  It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off
>> > instead, which is bad.
>> >
>> >>         nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
>> >>        else
>> >>         nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
>> >> @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
>> >>      }
>> >>  }
>> >>
>> >> +/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
>> >> +   the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC.
>> >> +   Now there are three such kinds of operations: dot-prod/widen-sum/sad
>> >> +   (sum-of-absolute-differences).
>> >> +
>> >> +   For a lane-reducing operation, the loop reduction path that it lies in,
>> >> +   may contain normal operation, or other lane-reducing operation of different
>> >> +   input type size, an example as:
>> >> +
>> >> +     int sum = 0;
>> >> +     for (i)
>> >> +       {
>> >> +         ...
>> >> +         sum += d0[i] * d1[i];       // dot-prod <vector(16) char>
>> >> +         sum += w[i];                // widen-sum <vector(16) char>
>> >> +         sum += abs(s0[i] - s1[i]);  // sad <vector(8) short>
>> >> +         sum += n[i];                // normal <vector(4) int>
>> >> +         ...
>> >> +       }
>> >> +
>> >> +   Vectorization factor is essentially determined by operation whose input
>> >> +   vectype has the most lanes ("vector(16) char" in the example), while we
>> >> +   need to choose input vectype with the least lanes ("vector(4) int" in the
>> >> +   example) for the reduction PHI statement.  */
>> >> +
>> >> +bool
>> >> +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
>> >> +                           slp_tree slp_node, stmt_vector_for_cost *cost_vec)
>> >> +{
>> >> +  gimple *stmt = stmt_info->stmt;
>> >> +
>> >> +  if (!lane_reducing_stmt_p (stmt))
>> >> +    return false;
>> >> +
>> >> +  tree type = TREE_TYPE (gimple_assign_lhs (stmt));
>> >> +
>> >> +  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
>> >> +    return false;
>> >> +
>> >> +  /* Do not try to vectorize bit-precision reductions.  */
>> >> +  if (!type_has_mode_precision_p (type))
>> >> +    return false;
>> >> +
>> >> +  if (!slp_node)
>> >> +    return false;
>> >> +
>> >> +  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
>> >> +    {
>> >> +      stmt_vec_info def_stmt_info;
>> >> +      slp_tree slp_op;
>> >> +      tree op;
>> >> +      tree vectype;
>> >> +      enum vect_def_type dt;
>> >> +
>> >> +      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
>> >> +                              &slp_op, &dt, &vectype, &def_stmt_info))
>> >> +       {
>> >> +         if (dump_enabled_p ())
>> >> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> >> +                            "use not simple.\n");
>> >> +         return false;
>> >> +       }
>> >> +
>> >> +      if (!vectype)
>> >> +       {
>> >> +         vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op),
>> >> +                                                slp_op);
>> >> +         if (!vectype)
>> >> +           return false;
>> >> +       }
>> >> +
>> >> +      if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))
>> >> +       {
>> >> +         if (dump_enabled_p ())
>> >> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> >> +                            "incompatible vector types for invariants\n");
>> >> +         return false;
>> >> +       }
>> >> +
>> >> +      if (i == STMT_VINFO_REDUC_IDX (stmt_info))
>> >> +       continue;
>> >> +
>> >> +      /* There should be at most one cycle def in the stmt.  */
>> >> +      if (VECTORIZABLE_CYCLE_DEF (dt))
>> >> +       return false;
>> >> +    }
>> >> +
>> >> +  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
>> >> +
>> >> +  /* TODO: Support lane-reducing operation that does not directly participate
>> >> +     in loop reduction. */
>> >> +  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
>> >> +    return false;
>> >> +
>> >> +  /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not
>> >> +     recoginized.  */
>> >> +  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
>> >> +  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
>> >> +
>> >> +  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
>> >> +  int ncopies_for_cost;
>> >> +
>> >> +  if (SLP_TREE_LANES (slp_node) > 1)
>> >> +    {
>> >> +      /* Now lane-reducing operations in a non-single-lane slp node should only
>> >> +        come from the same loop reduction path.  */
>> >> +      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
>> >> +      ncopies_for_cost = 1;
>> >> +    }
>> >> +  else
>> >> +    {
>> >> +      ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);
>> >
>> > OK, so the fact that the ops are lane-reducing means they effectively
>> > change the VF for the result.  That's only possible as we tightly control
>> > code generation and "adjust" to the expected VF (by inserting the copies
>> > you mentioned above), but only up to the highest number of outputs
>> > created in the reduction chain.  In that sense instead of talking and recording
>> > "input vector types" wouldn't it make more sense to record the effective
>> > vectorization factor for the reduction instance?  That VF would be at most
>> > the loops VF but could be as low as 1.  Once we have a non-lane-reducing
>> > operation in the reduction chain it would be always equal to the loops VF.
>> >
>> > ncopies would then be always determined by that reduction instance VF and
>> > the accumulator vector type (STMT_VINFO_VECTYPE).  This reduction
>> > instance VF would also trivially indicate the force-single-def-use-cycle
>> > case, possibly simplifying code?
>>
>> I tried to add such an effective VF, while the vectype_in is still needed in some
>> scenarios, such as when checking whether a dot-prod stmt is emulated or not.
>> The former could be deduced from the later, so recording both things seems
>> to be redundant. Another consideration is that for normal op, ncopies
>> is determined from type (STMT_VINFO_VECTYPE), but for lane-reducing op,
>> it is from VF. So, a better means to make them unified?
>
> AFAICS reductions are special in that they, for the accumulation SSA cycle,
> do not adhere to the loops VF but as optimization can chose a smaller one.
> OTOH STMT_VINFO_VECTYPE is for the vector type used for individual
> operations which even for lane-reducing ops is adhered to - those just
> may use a smaller VF, that of the reduction SSA cycle.
>
> So what's redundant is STMT_VINFO_REDUC_VECTYPE_IN - or rather
> it's not fully redundant but needlessly replicated over all stmts participating
> in the reduction instead of recording the reduction VF in the reduc_info and
> using that (plus STMT_VINFO_VECTYPE) to compute the effective ncopies
> for stmts in the reduction cycle.
>
> At least that was my idea ...
>

For lane-reducing ops and the single-defuse-cycle optimization, we could assume
no lane would be reduced, and always generate vectorized statements according
to the normal VF; if a placeholder is needed, just insert some trivial statement
like a zero-initialization or a pass-through copy. And define an "effective VF
or ncopies" to control lane-reducing related aspects in analysis and codegen
(such as the below vect_get_loop_mask). Since all things will become SLP-based
eventually, I think a suitable place to add such a field might be the slp_node,
as a supplement to "vect_stmts_size", and it is expected to be adjusted in
vectorizable_reduction. So could we do the refinement as separate patches when
the non-slp code path is to be removed?
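
To make that concrete, the correction could conceptually be a one-time fixup
during analysis, along these lines (reusing SLP_TREE_NUMBER_OF_VEC_STMTS as
the place to record it is just an assumption for illustration):

   /* Hypothetical sketch: record the effective number of vector stmts for
      a single-lane lane-reducing node once, so that codegen and loop-mask
      computation need not recompute it from the input vectype.  */
   if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
     SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node)
       = vect_get_num_copies (loop_vinfo,
                              STMT_VINFO_REDUC_VECTYPE_IN (stmt_info));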

>> >> +      gcc_assert (ncopies_for_cost >= 1);
>> >> +    }
>> >> +
>> >> +  if (vect_is_emulated_mixed_dot_prod (stmt_info))
>> >> +    {
>> >> +      /* We need extra two invariants: one that contains the minimum signed
>> >> +        value and one that contains half of its negative.  */
>> >> +      int prologue_stmts = 2;
>> >> +      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
>> >> +                                       scalar_to_vec, stmt_info, 0,
>> >> +                                       vect_prologue);
>> >> +      if (dump_enabled_p ())
>> >> +       dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
>> >> +                    "extra prologue_cost = %d .\n", cost);
>> >> +
>> >> +      /* Three dot-products and a subtraction.  */
>> >> +      ncopies_for_cost *= 4;
>> >> +    }
>> >> +
>> >> +  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
>> >> +                   vect_body);
>> >> +
>> >> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
>> >> +    {
>> >> +      enum tree_code code = gimple_assign_rhs_code (stmt);
>> >> +      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
>> >> +                                                 slp_node, code, type,
>> >> +                                                 vectype_in);
>> >> +    }
>> >> +
>> >
>> > Add a comment:
>> >
>> >     /* Transform via vect_transform_reduction.  */
>> >
>> >> +  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
>> >> +  return true;
>> >> +}
>> >> +
>> >>  /* Function vectorizable_reduction.
>> >>
>> >>     Check if STMT_INFO performs a reduction operation that can be vectorized.
>> >> @@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>> >>    if (!type_has_mode_precision_p (op.type))
>> >>      return false;
>> >>
>> >> -  /* For lane-reducing ops we're reducing the number of reduction PHIs
>> >> -     which means the only use of that may be in the lane-reducing operation.  */
>> >> -  if (lane_reducing
>> >> -      && reduc_chain_length != 1
>> >> -      && !only_slp_reduc_chain)
>> >> -    {
>> >> -      if (dump_enabled_p ())
>> >> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> >> -                        "lane-reducing reduction with extra stmts.\n");
>> >> -      return false;
>> >> -    }
>> >> -
>> >>    /* Lane-reducing ops also never can be used in a SLP reduction group
>> >>       since we'll mix lanes belonging to different reductions.  But it's
>> >>       OK to use them in a reduction chain or when the reduction group
>> >> @@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>> >>        && loop_vinfo->suggested_unroll_factor == 1)
>> >>      single_defuse_cycle = true;
>> >>
>> >> -  if (single_defuse_cycle || lane_reducing)
>> >> +  if (single_defuse_cycle && !lane_reducing)
>> >
>> > If there's also a non-lane-reducing plus in the chain don't we have to
>> > check for that reduction op?  So shouldn't it be
>> > single_defuse_cycle && ... fact that we don't record
>> > (non-lane-reducing op there) ...
>>
>> Quite not understand this point.  For a non-lane-reducing op in the chain,
>> it should be handled in its own vectorizable_xxx function? The below check
>> is only for the first statement (vect_reduction_def) in the reduction.
>
> Hmm.  So we have vectorizable_lane_reducing_* for the check on the
> lane-reducing stmts, vectorizable_* for !single-def-use stmts.  And the
> following is then just for the case there's a single def that's not
> lane-reducing
> and we're forcing a single-def-use and thus go via vect_transform_reduction?

Yes. A non-lane-reducing op under the forced single-defuse-cycle is handled in
this function. This logic is the same as in the original code.
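
For reference, a minimal example of the case that still takes this path (just
an illustration, not from the testsuite): a plain reduction that is not
lane-reducing, but which may be forced into a single def-use cycle when
ncopies is greater than one:

   int
   f (int *a, int n)
   {
     int sum = 0;
     for (int i = 0; i < n; i++)
       sum += a[i];  /* normal op; support checked via directly_supported_p */
     return sum;
   }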

>> >
>> >>      {
>> >>        gcc_assert (op.code != COND_EXPR);
>> >>
>> >> -      /* 4. Supportable by target?  */
>> >> -      bool ok = true;
>> >> -
>> >> -      /* 4.1. check support for the operation in the loop
>> >> +      /* 4. check support for the operation in the loop
>> >>
>> >>          This isn't necessary for the lane reduction codes, since they
>> >>          can only be produced by pattern matching, and it's up to the
>> >> @@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>> >>          mixed-sign dot-products can be implemented using signed
>> >>          dot-products.  */
>> >>        machine_mode vec_mode = TYPE_MODE (vectype_in);
>> >> -      if (!lane_reducing
>> >> -         && !directly_supported_p (op.code, vectype_in, optab_vector))
>> >> +      if (!directly_supported_p (op.code, vectype_in, optab_vector))
>> >>          {
>> >>            if (dump_enabled_p ())
>> >>              dump_printf (MSG_NOTE, "op not supported by target.\n");
>> >>           if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
>> >>               || !vect_can_vectorize_without_simd_p (op.code))
>> >> -           ok = false;
>> >> +           single_defuse_cycle = false;
>> >>           else
>> >>             if (dump_enabled_p ())
>> >>               dump_printf (MSG_NOTE, "proceeding using word mode.\n");
>> >> @@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>> >>             dump_printf (MSG_NOTE, "using word mode not possible.\n");
>> >>           return false;
>> >>         }
>> >> -
>> >> -      /* lane-reducing operations have to go through vect_transform_reduction.
>> >> -         For the other cases try without the single cycle optimization.  */
>> >> -      if (!ok)
>> >> -       {
>> >> -         if (lane_reducing)
>> >> -           return false;
>> >> -         else
>> >> -           single_defuse_cycle = false;
>> >> -       }
>> >>      }
>> >>    if (dump_enabled_p () && single_defuse_cycle)
>> >>      dump_printf_loc (MSG_NOTE, vect_location,
>> >> @@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>> >>                      "multiple vectors to one in the loop body\n");
>> >>    STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;
>> >>
>> >> -  /* If the reduction stmt is one of the patterns that have lane
>> >> -     reduction embedded we cannot handle the case of ! single_defuse_cycle.  */
>> >> -  if ((ncopies > 1 && ! single_defuse_cycle)
>> >> -      && lane_reducing)
>> >> -    {
>> >> -      if (dump_enabled_p ())
>> >> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> >> -                        "multi def-use cycle not possible for lane-reducing "
>> >> -                        "reduction operation\n");
>> >> -      return false;
>> >> -    }
>> >> +  /* For lane-reducing operation, the below processing related to single
>> >> +     defuse-cycle will be done in its own vectorizable function.  One more
>> >> +     thing to note is that the operation must not be involved in fold-left
>> >> +     reduction.  */
>> >> +  single_defuse_cycle &= !lane_reducing;
>> >>
>> >>    if (slp_node
>> >> -      && !(!single_defuse_cycle
>> >> -          && !lane_reducing
>> >> -          && reduction_type != FOLD_LEFT_REDUCTION))
>> >> +      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
>> >>      for (i = 0; i < (int) op.num_ops; i++)
>> >>        if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
>> >>         {
>> >> @@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>> >>    vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
>> >>                              reduction_type, ncopies, cost_vec);
>> >>    /* Cost the reduction op inside the loop if transformed via
>> >> -     vect_transform_reduction.  Otherwise this is costed by the
>> >> -     separate vectorizable_* routines.  */
>> >> -  if (single_defuse_cycle || lane_reducing)
>> >> -    {
>> >> -      int factor = 1;
>> >> -      if (vect_is_emulated_mixed_dot_prod (stmt_info))
>> >> -       /* Three dot-products and a subtraction.  */
>> >> -       factor = 4;
>> >> -      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
>> >> -                       stmt_info, 0, vect_body);
>> >> -    }
>> >> +     vect_transform_reduction for non-lane-reducing operation.  Otherwise
>> >> +     this is costed by the separate vectorizable_* routines.  */
>> >> +  if (single_defuse_cycle)
>> >> +    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
>> >>
>> >>    if (dump_enabled_p ()
>> >>        && reduction_type == FOLD_LEFT_REDUCTION)
>> >>      dump_printf_loc (MSG_NOTE, vect_location,
>> >>                      "using an in-order (fold-left) reduction.\n");
>> >>    STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
>> >> -  /* All but single defuse-cycle optimized, lane-reducing and fold-left
>> >> -     reductions go through their own vectorizable_* routines.  */
>> >> -  if (!single_defuse_cycle
>> >> -      && !lane_reducing
>> >> -      && reduction_type != FOLD_LEFT_REDUCTION)
>> >> +
>> >> +  /* All but single defuse-cycle optimized and fold-left reductions go
>> >> +     through their own vectorizable_* routines.  */
>> >> +  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
>> >>      {
>> >>        stmt_vec_info tem
>> >>         = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
>> >> @@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>> >>    bool lane_reducing = lane_reducing_op_p (code);
>> >>    gcc_assert (single_defuse_cycle || lane_reducing);
>> >>
>> >> +  if (lane_reducing)
>> >> +    {
>> >> +      /* The last operand of lane-reducing op is for reduction.  */
>> >> +      gcc_assert (reduc_index == (int) op.num_ops - 1);
>> >> +
>> >> +      /* Now all lane-reducing ops are covered by some slp node.  */
>> >> +      gcc_assert (slp_node);
>> >> +    }
>> >> +
>> >>    /* Create the destination vector  */
>> >>    tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
>> >>    tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
>> >> @@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>> >>                          reduc_index == 2 ? op.ops[2] : NULL_TREE,
>> >>                          &vec_oprnds[2]);
>> >>      }
>> >> +  else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1
>> >> +          && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ())
>> >> +    {
>> >> +      /* For lane-reducing op covered by single-lane slp node, the input
>> >> +        vectype of the reduction PHI determines copies of vectorized def-use
>> >> +        cycles, which might be more than effective copies of vectorized lane-
>> >> +        reducing reduction statements.  This could be complemented by
>> >> +        generating extra trivial pass-through copies.  For example:
>> >> +
>> >> +          int sum = 0;
>> >> +          for (i)
>> >> +            {
>> >> +              sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
>> >> +              sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
>> >> +              sum += n[i];               // normal <vector(4) int>
>> >> +            }
>> >> +
>> >> +        The vector size is 128-bit,vectorization factor is 16.  Reduction
>> >> +        statements would be transformed as:
>> >> +
>> >> +          vector<4> int sum_v0 = { 0, 0, 0, 0 };
>> >> +          vector<4> int sum_v1 = { 0, 0, 0, 0 };
>> >> +          vector<4> int sum_v2 = { 0, 0, 0, 0 };
>> >> +          vector<4> int sum_v3 = { 0, 0, 0, 0 };
>> >> +
>> >> +          for (i / 16)
>> >> +            {
>> >> +              sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
>> >> +              sum_v1 = sum_v1;  // copy
>> >> +              sum_v2 = sum_v2;  // copy
>> >> +              sum_v3 = sum_v3;  // copy
>> >> +
>> >> +              sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
>> >> +              sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
>> >> +              sum_v2 = sum_v2;  // copy
>> >> +              sum_v3 = sum_v3;  // copy
>> >> +
>> >> +              sum_v0 += n_v0[i: 0  ~ 3 ];
>> >> +              sum_v1 += n_v1[i: 4  ~ 7 ];
>> >> +              sum_v2 += n_v2[i: 8  ~ 11];
>> >> +              sum_v3 += n_v3[i: 12 ~ 15];
>> >> +            }
>> >> +       */
>> >> +      unsigned using_ncopies = vec_oprnds[0].length ();
>> >> +      unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
>> >> +
>> >
>> > assert reduc_ncopies >= using_ncopies?  Maybe assert
>> > reduc_index == op.num_ops - 1 given you use one above
>> > and the other below?  Or simply iterate till op.num_ops
>> > and sip i == reduc_index.
>> >
>> >> +      for (unsigned i = 0; i < op.num_ops - 1; i++)
>> >> +       {
>> >> +         gcc_assert (vec_oprnds[i].length () == using_ncopies);
>> >> +         vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
>> >> +       }
>> >> +    }
>> >>
>> >>    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
>> >>    unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
>> >> @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>> >>      {
>> >>        gimple *new_stmt;
>> >>        tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
>> >> -      if (masked_loop_p && !mask_by_cond_expr)
>> >> +
>> >> +      if (!vop[0] || !vop[1])
>> >> +       {
>> >> +         tree reduc_vop = vec_oprnds[reduc_index][i];
>> >> +
>> >> +         /* Insert trivial copy if no need to generate vectorized
>> >> +            statement.  */
>> >> +         gcc_assert (reduc_vop);
>> >> +
>> >> +         new_stmt = gimple_build_assign (vec_dest, reduc_vop);
>> >> +         new_temp = make_ssa_name (vec_dest, new_stmt);
>> >> +         gimple_set_lhs (new_stmt, new_temp);
>> >> +         vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
>> >
>> > I think you could simply do
>> >
>> >                slp_node->push_vec_def (reduc_vop);
>> >                continue;
>> >
>> > without any code generation.
>> >
>>
>> OK, that would be easy. Here comes another question, this patch assumes
>> lane-reducing op would always be contained in a slp node, since single-lane
>> slp node feature has been enabled. But I got some regression if I enforced
>> such constraint on lane-reducing op check. Those cases are founded to
>> be unvectorizable with single-lane slp, so this should not be what we want?
>> and need to be fixed?
>
> Yes, in the end we need to chase down all unsupported cases and fix them
> (there's known issues with load permutes, I'm working on that - hopefully
> when finding a continuous stretch of time...).
>
>>
>> >> +       }
>> >> +      else if (masked_loop_p && !mask_by_cond_expr)
>> >>         {
>> >>           /* No conditional ifns have been defined for lane-reducing op
>> >>              yet.  */
>> >> @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>> >>
>> >>           if (masked_loop_p && mask_by_cond_expr)
>> >>             {
>> >> +             tree stmt_vectype_in = vectype_in;
>> >> +             unsigned nvectors = vec_num * ncopies;
>> >> +
>> >> +             if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
>> >> +               {
>> >> +                 /* Input vectype of the reduction PHI may be defferent from
>> >
>> > different
>> >
>> >> +                    that of lane-reducing operation.  */
>> >> +                 stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
>> >> +                 nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);
>> >
>> > I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS.
>>
>> To partially vectorizing a dot_prod<16 * char> with 128-bit vector width,
>> we should pass (nvector=4, vectype=<4 *int>) instead of (nvector=1, vectype=<16 *char>)
>> to vect_get_loop_mask?
>
> Probably - it depends on the vectorization factor.  What I wanted to
> point out is that
> vec_num (likely from SLP_TREE_NUMBER_OF_VEC_STMTS) is wrong.  The
> place setting SLP_TREE_NUMBER_OF_VEC_STMTS needs to be adjusted,
> or we should forgo with it (but that's possibly a post-only-SLP
> cleanup to be done).
>
> See vect_slp_analyze_node_operations_1 where that's computed.  For reductions
> it's probably not quite right (and we might have latent issues like
> those you are
> "fixing" with code like above).  The order we analyze stmts might also be not
> optimal for reductions with SLP - in fact given that stmt analysis
> relies on a fixed VF
> it would probably make sense to determine the reduction VF in advance as well.
> But again this sounds like post-only-SLP cleanup opportunities.
>
> In the end I might suggest to always use reduct-VF and vectype to determine
> the number of vector stmts rather than computing ncopies/vec_num separately.
>

Thanks,
Feng

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0002-vect-Support-multiple-lane-reducing-operations-for-l.patch --]
[-- Type: text/x-patch; name="0002-vect-Support-multiple-lane-reducing-operations-for-l.patch", Size: 41377 bytes --]

From 516d6689a1c916c4aa8e59a36c1f8a159df99f13 Mon Sep 17 00:00:00 2001
From: Feng Xue <fxue@os.amperecomputing.com>
Date: Wed, 29 May 2024 17:22:36 +0800
Subject: [PATCH 2/3] vect: Support multiple lane-reducing operations for loop
 reduction [PR114440]

For a lane-reducing operation (dot-prod/widen-sum/sad) in a loop reduction, the
current vectorizer can only handle the pattern if the reduction chain does not
contain any other operation, whether normal or lane-reducing.

Actually, to allow multiple arbitrary lane-reducing operations, we need to
support vectorization of a loop reduction chain with mixed input vectypes.
Since the number of lanes of a vectype may vary with the operation, the
effective ncopies of the vectorized statements may also differ between
operations, which causes a mismatch among the vectorized def-use cycles. A
simple way out is to align all operations with the one that has the most
ncopies; the gap can be filled by generating extra trivial pass-through
copies. For example:

   int sum = 0;
   for (i)
     {
       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
       sum += w[i];               // widen-sum <vector(16) char>
       sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
       sum += n[i];               // normal <vector(4) int>
     }

The vector size is 128 bits and the vectorization factor is 16. Reduction statements
would be transformed as:

   vector<4> int sum_v0 = { 0, 0, 0, 0 };
   vector<4> int sum_v1 = { 0, 0, 0, 0 };
   vector<4> int sum_v2 = { 0, 0, 0, 0 };
   vector<4> int sum_v3 = { 0, 0, 0, 0 };

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 += n_v0[i: 0  ~ 3 ];
       sum_v1 += n_v1[i: 4  ~ 7 ];
       sum_v2 += n_v2[i: 8  ~ 11];
       sum_v3 += n_v3[i: 12 ~ 15];
     }

2024-03-22  Feng Xue  <fxue@os.amperecomputing.com>

gcc/
	PR tree-optimization/114440
	* tree-vectorizer.h (vectorizable_lane_reducing): New function
	declaration.
	* tree-vect-stmts.cc (vect_analyze_stmt): Call new function
	vectorizable_lane_reducing to analyze lane-reducing operation.
	* tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation
	code related to emulated_mixed_dot_prod.
	(vect_reduction_update_partial_vector_usage): Compute ncopies in the
	original way for a single-lane slp node.
	(vectorizable_lane_reducing): New function.
	(vectorizable_reduction): Allow multiple lane-reducing operations in
	loop reduction. Move some original lane-reducing related code to
	vectorizable_lane_reducing.
	(vect_transform_reduction): Extend transformation to support reduction
	statements with mixed input vectypes.

gcc/testsuite/
	PR tree-optimization/114440
	* gcc.dg/vect/vect-reduc-chain-1.c: New test.
	* gcc.dg/vect/vect-reduc-chain-2.c: New test.
	* gcc.dg/vect/vect-reduc-chain-3.c: New test.
	* gcc.dg/vect/vect-reduc-chain-dot-slp-1.c: New test.
	* gcc.dg/vect/vect-reduc-chain-dot-slp-2.c: New test.
	* gcc.dg/vect/vect-reduc-chain-dot-slp-3.c: New test.
	* gcc.dg/vect/vect-reduc-chain-dot-slp-4.c: New test.
	* gcc.dg/vect/vect-reduc-dot-slp-1.c: New test.
---
 .../gcc.dg/vect/vect-reduc-chain-1.c          |  62 ++++
 .../gcc.dg/vect/vect-reduc-chain-2.c          |  77 ++++
 .../gcc.dg/vect/vect-reduc-chain-3.c          |  66 ++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c  |  95 +++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c  |  67 ++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c  |  79 +++++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c  |  63 ++++
 .../gcc.dg/vect/vect-reduc-dot-slp-1.c        |  60 ++++
 gcc/tree-vect-loop.cc                         | 333 ++++++++++++++----
 gcc/tree-vect-stmts.cc                        |   2 +
 gcc/tree-vectorizer.h                         |   2 +
 11 files changed, 836 insertions(+), 70 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c

diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
new file mode 100644
index 00000000000..04bfc419dbd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
@@ -0,0 +1,62 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_2 char *restrict c,
+   SIGNEDNESS_2 char *restrict d,
+   SIGNEDNESS_1 int *restrict e)
+{
+  for (int i = 0; i < N; ++i)
+    {
+      res += a[i] * b[i];
+      res += c[i] * d[i];
+      res += e[i];
+    }
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[N], b[N];
+  SIGNEDNESS_2 char c[N], d[N];
+  SIGNEDNESS_1 int e[N];
+  int expected = 0x12345;
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      c[i] = BASE + i * 2;
+      d[i] = BASE + OFFSET + i * 3;
+      e[i] = i;
+      asm volatile ("" ::: "memory");
+      expected += a[i] * b[i];
+      expected += c[i] * d[i];
+      expected += e[i];
+    }
+  if (f (0x12345, a, b, c, d, e) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 2 "vect" { target vect_sdot_qi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
new file mode 100644
index 00000000000..6c803b80120
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
@@ -0,0 +1,77 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 unsigned
+#define SIGNEDNESS_3 signed
+#define SIGNEDNESS_4 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+fn (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_3 char *restrict c,
+   SIGNEDNESS_3 char *restrict d,
+   SIGNEDNESS_4 short *restrict e,
+   SIGNEDNESS_4 short *restrict f,
+   SIGNEDNESS_1 int *restrict g)
+{
+  for (int i = 0; i < N; ++i)
+    {
+      res += a[i] * b[i];
+      res += i + 1;
+      res += c[i] * d[i];
+      res += e[i] * f[i];
+      res += g[i];
+    }
+  return res;
+}
+
+#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -126 : 4)
+#define BASE4 ((SIGNEDNESS_4 int) -1 < 0 ? -1026 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[N], b[N];
+  SIGNEDNESS_3 char c[N], d[N];
+  SIGNEDNESS_4 short e[N], f[N];
+  SIGNEDNESS_1 int g[N];
+  int expected = 0x12345;
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE2 + i * 5;
+      b[i] = BASE2 + OFFSET + i * 4;
+      c[i] = BASE3 + i * 2;
+      d[i] = BASE3 + OFFSET + i * 3;
+      e[i] = BASE4 + i * 6;
+      f[i] = BASE4 + OFFSET + i * 5;
+      g[i] = i;
+      asm volatile ("" ::: "memory");
+      expected += a[i] * b[i];
+      expected += i + 1;
+      expected += c[i] * d[i];
+      expected += e[i] * f[i];
+      expected += g[i];
+    }
+  if (fn (0x12345, a, b, c, d, e, f, g) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_sdot_qi } } } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_udot_qi } } } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_sdot_hi } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
new file mode 100644
index 00000000000..a41e4b176c4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
@@ -0,0 +1,66 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 unsigned
+#define SIGNEDNESS_3 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_3 short *restrict c,
+   SIGNEDNESS_3 short *restrict d,
+   SIGNEDNESS_1 int *restrict e)
+{
+  for (int i = 0; i < N; ++i)
+    {
+      short diff = a[i] - b[i];
+      SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff;
+      res += abs;
+      res += c[i] * d[i];
+      res += e[i];
+    }
+  return res;
+}
+
+#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -1236 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[N], b[N];
+  SIGNEDNESS_3 short c[N], d[N];
+  SIGNEDNESS_1 int e[N];
+  int expected = 0x12345;
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE2 + i * 5;
+      b[i] = BASE2 - i * 4;
+      c[i] = BASE3 + i * 2;
+      d[i] = BASE3 + OFFSET + i * 3;
+      e[i] = i;
+      asm volatile ("" ::: "memory");
+      short diff = a[i] - b[i];
+      SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff;
+      expected += abs;
+      expected += c[i] * d[i];
+      expected += e[i];
+    }
+  if (f (0x12345, a, b, c, d, e) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = SAD_EXPR" "vect" { target vect_udot_qi } } } */
+/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target vect_sdot_hi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
new file mode 100644
index 00000000000..c2831fbcc8e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
@@ -0,0 +1,95 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *a,
+   SIGNEDNESS_2 char *b,
+   int step, int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[0] * b[0];
+      res += a[1] * b[1];
+      res += a[2] * b[2];
+      res += a[3] * b[3];
+      res += a[4] * b[4];
+      res += a[5] * b[5];
+      res += a[6] * b[6];
+      res += a[7] * b[7];
+      res += a[8] * b[8];
+      res += a[9] * b[9];
+      res += a[10] * b[10];
+      res += a[11] * b[11];
+      res += a[12] * b[12];
+      res += a[13] * b[13];
+      res += a[14] * b[14];
+      res += a[15] * b[15];
+
+      a += step;
+      b += step;
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[100], b[100];
+  int expected = 0x12345;
+  int step = 16;
+  int n = 2;
+  int t = 0;
+
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      asm volatile ("" ::: "memory");
+    }
+
+  for (int i = 0; i < n; i++)
+    {
+      asm volatile ("" ::: "memory");
+      expected += a[t + 0] * b[t + 0];
+      expected += a[t + 1] * b[t + 1];
+      expected += a[t + 2] * b[t + 2];
+      expected += a[t + 3] * b[t + 3];
+      expected += a[t + 4] * b[t + 4];
+      expected += a[t + 5] * b[t + 5];
+      expected += a[t + 6] * b[t + 6];
+      expected += a[t + 7] * b[t + 7];
+      expected += a[t + 8] * b[t + 8];
+      expected += a[t + 9] * b[t + 9];
+      expected += a[t + 10] * b[t + 10];
+      expected += a[t + 11] * b[t + 11];
+      expected += a[t + 12] * b[t + 12];
+      expected += a[t + 13] * b[t + 13];
+      expected += a[t + 14] * b[t + 14];
+      expected += a[t + 15] * b[t + 15];
+      t += step;
+    }
+
+  if (f (0x12345, a, b, step, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 16 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
new file mode 100644
index 00000000000..4114264a364
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
@@ -0,0 +1,67 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *a,
+   SIGNEDNESS_2 char *b,
+   int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[5 * i + 0] * b[5 * i + 0];
+      res += a[5 * i + 1] * b[5 * i + 1];
+      res += a[5 * i + 2] * b[5 * i + 2];
+      res += a[5 * i + 3] * b[5 * i + 3];
+      res += a[5 * i + 4] * b[5 * i + 4];
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[100], b[100];
+  int expected = 0x12345;
+  int n = 18;
+
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      asm volatile ("" ::: "memory");
+    }
+
+  for (int i = 0; i < n; i++)
+    {
+      asm volatile ("" ::: "memory");
+      expected += a[5 * i + 0] * b[5 * i + 0];
+      expected += a[5 * i + 1] * b[5 * i + 1];
+      expected += a[5 * i + 2] * b[5 * i + 2];
+      expected += a[5 * i + 3] * b[5 * i + 3];
+      expected += a[5 * i + 4] * b[5 * i + 4];
+    }
+
+  if (f (0x12345, a, b, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 5 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
new file mode 100644
index 00000000000..2cdecc36d16
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
@@ -0,0 +1,79 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 short *a,
+   SIGNEDNESS_2 short *b,
+   int step, int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[0] * b[0];
+      res += a[1] * b[1];
+      res += a[2] * b[2];
+      res += a[3] * b[3];
+      res += a[4] * b[4];
+      res += a[5] * b[5];
+      res += a[6] * b[6];
+      res += a[7] * b[7];
+
+      a += step;
+      b += step;
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 short a[100], b[100];
+  int expected = 0x12345;
+  int step = 8;
+  int n = 2;
+  int t = 0;
+
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      asm volatile ("" ::: "memory");
+    }
+
+  for (int i = 0; i < n; i++)
+    {
+      asm volatile ("" ::: "memory");
+      expected += a[t + 0] * b[t + 0];
+      expected += a[t + 1] * b[t + 1];
+      expected += a[t + 2] * b[t + 2];
+      expected += a[t + 3] * b[t + 3];
+      expected += a[t + 4] * b[t + 4];
+      expected += a[t + 5] * b[t + 5];
+      expected += a[t + 6] * b[t + 6];
+      expected += a[t + 7] * b[t + 7];
+      t += step;
+    }
+
+  if (f (0x12345, a, b, step, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 8 "vect"  { target vect_sdot_hi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
new file mode 100644
index 00000000000..32c0f30c77b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
@@ -0,0 +1,63 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 short *a,
+   SIGNEDNESS_2 short *b,
+   int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      res += a[3 * i + 0] * b[3 * i + 0];
+      res += a[3 * i + 1] * b[3 * i + 1];
+      res += a[3 * i + 2] * b[3 * i + 2];
+    }
+
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 short a[100], b[100];
+  int expected = 0x12345;
+  int n = 18;
+
+  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
+    {
+      a[i] = BASE + i * 5;
+      b[i] = BASE + OFFSET + i * 4;
+      asm volatile ("" ::: "memory");
+    }
+
+  for (int i = 0; i < n; i++)
+    {
+      asm volatile ("" ::: "memory");
+      expected += a[3 * i + 0] * b[3 * i + 0];
+      expected += a[3 * i + 1] * b[3 * i + 1];
+      expected += a[3 * i + 2] * b[3 * i + 2];
+    }
+
+  if (f (0x12345, a, b, n) != expected)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 3 "vect"  { target vect_sdot_hi } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
new file mode 100644
index 00000000000..84c82b023d4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
@@ -0,0 +1,60 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-do compile } */
+/* { dg-additional-options "--param vect-epilogues-nomask=0 -fdump-tree-optimized" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res0,
+   SIGNEDNESS_1 int res1,
+   SIGNEDNESS_1 int res2,
+   SIGNEDNESS_1 int res3,
+   SIGNEDNESS_1 int res4,
+   SIGNEDNESS_1 int res5,
+   SIGNEDNESS_1 int res6,
+   SIGNEDNESS_1 int res7,
+   SIGNEDNESS_1 int res8,
+   SIGNEDNESS_1 int res9,
+   SIGNEDNESS_1 int resA,
+   SIGNEDNESS_1 int resB,
+   SIGNEDNESS_1 int resC,
+   SIGNEDNESS_1 int resD,
+   SIGNEDNESS_1 int resE,
+   SIGNEDNESS_1 int resF,
+   SIGNEDNESS_2 char *a,
+   SIGNEDNESS_2 char *b)
+{
+  for (int i = 0; i < 64; i += 16)
+    {
+      res0 += a[i + 0x00] * b[i + 0x00];
+      res1 += a[i + 0x01] * b[i + 0x01];
+      res2 += a[i + 0x02] * b[i + 0x02];
+      res3 += a[i + 0x03] * b[i + 0x03];
+      res4 += a[i + 0x04] * b[i + 0x04];
+      res5 += a[i + 0x05] * b[i + 0x05];
+      res6 += a[i + 0x06] * b[i + 0x06];
+      res7 += a[i + 0x07] * b[i + 0x07];
+      res8 += a[i + 0x08] * b[i + 0x08];
+      res9 += a[i + 0x09] * b[i + 0x09];
+      resA += a[i + 0x0A] * b[i + 0x0A];
+      resB += a[i + 0x0B] * b[i + 0x0B];
+      resC += a[i + 0x0C] * b[i + 0x0C];
+      resD += a[i + 0x0D] * b[i + 0x0D];
+      resE += a[i + 0x0E] * b[i + 0x0E];
+      resF += a[i + 0x0F] * b[i + 0x0F];
+    }
+
+  return res0 ^ res1 ^ res2 ^ res3 ^ res4 ^ res5 ^ res6 ^ res7 ^
+         res8 ^ res9 ^ resA ^ resB ^ resC ^ resD ^ resE ^ resF;
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump-not "DOT_PROD_EXPR" "optimized" } } */
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 419f4b08d2b..6bfb0e72905 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -5324,8 +5324,6 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
   if (!gimple_extract_op (orig_stmt_info->stmt, &op))
     gcc_unreachable ();
 
-  bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
-
   if (reduction_type == EXTRACT_LAST_REDUCTION)
     /* No extra instructions are needed in the prologue.  The loop body
        operations are costed in vectorizable_condition.  */
@@ -5360,12 +5358,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
 	   initial result of the data reduction, initial value of the index
 	   reduction.  */
 	prologue_stmts = 4;
-      else if (emulated_mixed_dot_prod)
-	/* We need the initial reduction value and two invariants:
-	   one that contains the minimum signed value and one that
-	   contains half of its negative.  */
-	prologue_stmts = 3;
       else
+	/* We need the initial reduction value.  */
 	prologue_stmts = 1;
       prologue_cost += record_stmt_cost (cost_vec, prologue_stmts,
 					 scalar_to_vec, stmt_info, 0,
@@ -7466,7 +7460,10 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
       vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
       unsigned nvectors;
 
-      if (slp_node)
+      /* TODO: The number of vector statements for a lane-reducing op is over-
+	 estimated; we have to recompute it when the containing slp node is
+	 single-lane.  We need a general means to correct this value.  */
+      if (slp_node && SLP_TREE_LANES (slp_node) > 1)
 	nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
       else
 	nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
@@ -7478,6 +7475,154 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
     }
 }
 
+/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
+   the context of LOOP_VINFO; the vector cost will be recorded in COST_VEC.
+   Currently there are three such kinds of operations: dot-prod/widen-sum/sad
+   (sum-of-absolute-differences).
+
+   For a lane-reducing operation, the loop reduction path that it lies in
+   may contain a normal operation, or another lane-reducing operation with a
+   different input type size.  An example:
+
+     int sum = 0;
+     for (i)
+       {
+         ...
+         sum += d0[i] * d1[i];       // dot-prod <vector(16) char>
+         sum += w[i];                // widen-sum <vector(16) char>
+         sum += abs(s0[i] - s1[i]);  // sad <vector(8) short>
+         sum += n[i];                // normal <vector(4) int>
+         ...
+       }
+
+   The vectorization factor is essentially determined by the operation whose
+   input vectype has the most lanes ("vector(16) char" in the example), while
+   we need to choose the input vectype with the least lanes ("vector(4) int"
+   in the example) for the reduction PHI statement.  */
+
+bool
+vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
+			    slp_tree slp_node, stmt_vector_for_cost *cost_vec)
+{
+  gimple *stmt = stmt_info->stmt;
+
+  if (!lane_reducing_stmt_p (stmt))
+    return false;
+
+  tree type = TREE_TYPE (gimple_assign_lhs (stmt));
+
+  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
+    return false;
+
+  /* Do not try to vectorize bit-precision reductions.  */
+  if (!type_has_mode_precision_p (type))
+    return false;
+
+  /* A lane-reducing op should be contained in some slp node.  */
+  if (!slp_node)
+    return false;
+
+  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
+    {
+      stmt_vec_info def_stmt_info;
+      slp_tree slp_op;
+      tree op;
+      tree vectype;
+      enum vect_def_type dt;
+
+      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
+			       &slp_op, &dt, &vectype, &def_stmt_info))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "use not simple.\n");
+	  return false;
+	}
+
+      if (!vectype)
+	{
+	  vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op),
+						 slp_op);
+	  if (!vectype)
+	    return false;
+	}
+
+      if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "incompatible vector types for invariants\n");
+	  return false;
+	}
+
+      if (i == STMT_VINFO_REDUC_IDX (stmt_info))
+	continue;
+
+      /* There should be at most one cycle def in the stmt.  */
+      if (VECTORIZABLE_CYCLE_DEF (dt))
+	return false;
+    }
+
+  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
+
+  /* TODO: Support lane-reducing operations that do not directly participate
+     in loop reduction.  */
+  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
+    return false;
+
+  /* A lane-reducing pattern inside any inner loop of LOOP_VINFO is not
+     recognized.  */
+  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
+  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
+
+  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
+  int ncopies_for_cost;
+
+  if (SLP_TREE_LANES (slp_node) > 1)
+    {
+      /* Now lane-reducing operations in a non-single-lane slp node should only
+	 come from the same loop reduction path.  */
+      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
+      ncopies_for_cost = 1;
+    }
+  else
+    {
+      ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);
+      gcc_assert (ncopies_for_cost >= 1);
+    }
+
+  if (vect_is_emulated_mixed_dot_prod (stmt_info))
+    {
+      /* We need extra two invariants: one that contains the minimum signed
+	 value and one that contains half of its negative.  */
+      int prologue_stmts = 2;
+      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
+					scalar_to_vec, stmt_info, 0,
+					vect_prologue);
+      if (dump_enabled_p ())
+	dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
+		     "extra prologue_cost = %d .\n", cost);
+
+      /* Three dot-products and a subtraction.  */
+      ncopies_for_cost *= 4;
+    }
+
+  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
+		    vect_body);
+
+  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+    {
+      enum tree_code code = gimple_assign_rhs_code (stmt);
+      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
+						  slp_node, code, type,
+						  vectype_in);
+    }
+
+  /* Transform via vect_transform_reduction.  */
+  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
+  return true;
+}
+
 /* Function vectorizable_reduction.
 
    Check if STMT_INFO performs a reduction operation that can be vectorized.
@@ -7811,18 +7956,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   if (!type_has_mode_precision_p (op.type))
     return false;
 
-  /* For lane-reducing ops we're reducing the number of reduction PHIs
-     which means the only use of that may be in the lane-reducing operation.  */
-  if (lane_reducing
-      && reduc_chain_length != 1
-      && !only_slp_reduc_chain)
-    {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "lane-reducing reduction with extra stmts.\n");
-      return false;
-    }
-
   /* Lane-reducing ops also never can be used in a SLP reduction group
      since we'll mix lanes belonging to different reductions.  But it's
      OK to use them in a reduction chain or when the reduction group
@@ -8362,14 +8495,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
       && loop_vinfo->suggested_unroll_factor == 1)
     single_defuse_cycle = true;
 
-  if (single_defuse_cycle || lane_reducing)
+  if (single_defuse_cycle && !lane_reducing)
     {
       gcc_assert (op.code != COND_EXPR);
 
-      /* 4. Supportable by target?  */
-      bool ok = true;
-
-      /* 4.1. check support for the operation in the loop
+      /* 4. check support for the operation in the loop
 
 	 This isn't necessary for the lane reduction codes, since they
 	 can only be produced by pattern matching, and it's up to the
@@ -8378,14 +8508,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 	 mixed-sign dot-products can be implemented using signed
 	 dot-products.  */
       machine_mode vec_mode = TYPE_MODE (vectype_in);
-      if (!lane_reducing
-	  && !directly_supported_p (op.code, vectype_in, optab_vector))
+      if (!directly_supported_p (op.code, vectype_in, optab_vector))
         {
           if (dump_enabled_p ())
             dump_printf (MSG_NOTE, "op not supported by target.\n");
 	  if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
 	      || !vect_can_vectorize_without_simd_p (op.code))
-	    ok = false;
+	    single_defuse_cycle = false;
 	  else
 	    if (dump_enabled_p ())
 	      dump_printf (MSG_NOTE, "proceeding using word mode.\n");
@@ -8398,16 +8527,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 	    dump_printf (MSG_NOTE, "using word mode not possible.\n");
 	  return false;
 	}
-
-      /* lane-reducing operations have to go through vect_transform_reduction.
-         For the other cases try without the single cycle optimization.  */
-      if (!ok)
-	{
-	  if (lane_reducing)
-	    return false;
-	  else
-	    single_defuse_cycle = false;
-	}
     }
   if (dump_enabled_p () && single_defuse_cycle)
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -8415,22 +8534,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 		     "multiple vectors to one in the loop body\n");
   STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;
 
-  /* If the reduction stmt is one of the patterns that have lane
-     reduction embedded we cannot handle the case of ! single_defuse_cycle.  */
-  if ((ncopies > 1 && ! single_defuse_cycle)
-      && lane_reducing)
-    {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "multi def-use cycle not possible for lane-reducing "
-			 "reduction operation\n");
-      return false;
-    }
+  /* For a lane-reducing operation, the processing below related to the single
+     defuse-cycle will be done in its own vectorizable function.  One more
+     thing to note is that the operation must not be involved in a fold-left
+     reduction.  */
+  single_defuse_cycle &= !lane_reducing;
 
   if (slp_node
-      && !(!single_defuse_cycle
-	   && !lane_reducing
-	   && reduction_type != FOLD_LEFT_REDUCTION))
+      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
     for (i = 0; i < (int) op.num_ops; i++)
       if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
 	{
@@ -8443,28 +8554,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
 			     reduction_type, ncopies, cost_vec);
   /* Cost the reduction op inside the loop if transformed via
-     vect_transform_reduction.  Otherwise this is costed by the
-     separate vectorizable_* routines.  */
-  if (single_defuse_cycle || lane_reducing)
-    {
-      int factor = 1;
-      if (vect_is_emulated_mixed_dot_prod (stmt_info))
-	/* Three dot-products and a subtraction.  */
-	factor = 4;
-      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
-			stmt_info, 0, vect_body);
-    }
+     vect_transform_reduction for a non-lane-reducing operation.  Otherwise
+     this is costed by the separate vectorizable_* routines.  */
+  if (single_defuse_cycle)
+    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
 
   if (dump_enabled_p ()
       && reduction_type == FOLD_LEFT_REDUCTION)
     dump_printf_loc (MSG_NOTE, vect_location,
 		     "using an in-order (fold-left) reduction.\n");
   STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
-  /* All but single defuse-cycle optimized, lane-reducing and fold-left
-     reductions go through their own vectorizable_* routines.  */
-  if (!single_defuse_cycle
-      && !lane_reducing
-      && reduction_type != FOLD_LEFT_REDUCTION)
+
+  /* All but single defuse-cycle optimized and fold-left reductions go
+     through their own vectorizable_* routines.  */
+  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
     {
       stmt_vec_info tem
 	= vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
@@ -8654,6 +8757,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   bool lane_reducing = lane_reducing_op_p (code);
   gcc_assert (single_defuse_cycle || lane_reducing);
 
+  if (lane_reducing)
+    {
+      /* The last operand of lane-reducing op is for reduction.  */
+      gcc_assert (reduc_index == (int) op.num_ops - 1);
+
+      /* Now a lane-reducing op is always contained in some slp node.  */
+      gcc_assert (slp_node);
+    }
+
   /* Create the destination vector  */
   tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
   tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
@@ -8698,6 +8810,62 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 			 reduc_index == 2 ? op.ops[2] : NULL_TREE,
 			 &vec_oprnds[2]);
     }
+  else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
+    {
+      /* For a lane-reducing op covered by a single-lane slp node, the input
+	 vectype of the reduction PHI determines the number of copies of the
+	 vectorized def-use cycles, which might be more than the effective
+	 copies of the vectorized lane-reducing statements.  The gap can be
+	 filled by generating extra trivial pass-through copies.  For example:
+
+	   int sum = 0;
+	   for (i)
+	     {
+	       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
+	       sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
+	       sum += n[i];               // normal <vector(4) int>
+	     }
+
+	 The vector size is 128 bits and the vectorization factor is 16.
+	 Reduction statements would be transformed as:
+
+	   vector<4> int sum_v0 = { 0, 0, 0, 0 };
+	   vector<4> int sum_v1 = { 0, 0, 0, 0 };
+	   vector<4> int sum_v2 = { 0, 0, 0, 0 };
+	   vector<4> int sum_v3 = { 0, 0, 0, 0 };
+
+	   for (i / 16)
+	     {
+	       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
+	       sum_v1 = sum_v1;  // copy
+	       sum_v2 = sum_v2;  // copy
+	       sum_v3 = sum_v3;  // copy
+
+	       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
+	       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
+	       sum_v2 = sum_v2;  // copy
+	       sum_v3 = sum_v3;  // copy
+
+	       sum_v0 += n_v0[i: 0  ~ 3 ];
+	       sum_v1 += n_v1[i: 4  ~ 7 ];
+	       sum_v2 += n_v2[i: 8  ~ 11];
+	       sum_v3 += n_v3[i: 12 ~ 15];
+	     }
+	*/
+      unsigned using_ncopies = vec_oprnds[0].length ();
+      unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
+
+      gcc_assert (using_ncopies <= reduc_ncopies);
+
+      if (using_ncopies < reduc_ncopies)
+	{
+	  for (unsigned i = 0; i < op.num_ops - 1; i++)
+	    {
+	      gcc_assert (vec_oprnds[i].length () == using_ncopies);
+	      vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
+	    }
+	}
+    }
 
   bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
   unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
@@ -8706,7 +8874,18 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
     {
       gimple *new_stmt;
       tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
-      if (masked_loop_p && !mask_by_cond_expr)
+
+      if (!vop[0] || !vop[1])
+	{
+	  tree reduc_vop = vec_oprnds[reduc_index][i];
+
+	  /* No vectorized statement needs to be generated; reuse the
+	     definition of the reduction input as-is.  */
+	  gcc_assert (reduc_vop);
+
+	  new_stmt = SSA_NAME_DEF_STMT (reduc_vop);
+	}
+      else if (masked_loop_p && !mask_by_cond_expr)
 	{
 	  /* No conditional ifns have been defined for lane-reducing op
 	     yet.  */
@@ -8735,8 +8914,22 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 
 	  if (masked_loop_p && mask_by_cond_expr)
 	    {
+	      unsigned nvectors = vec_num * ncopies;
+	      tree stmt_vectype_in = vectype_in;
+
+	      /* For a single-lane slp node on a lane-reducing op, we need to
+		 compute the exact number of vector stmts from its input
+		 vectype, since the value obtained from the slp node is
+		 over-estimated.  TODO: properly set this number somewhere,
+		 so that this fixup can be removed.  */
+	      if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
+		{
+		  stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
+		  nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);
+		}
+
 	      tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks,
-					      vec_num * ncopies, vectype_in, i);
+					      nvectors, stmt_vectype_in, i);
 	      build_vect_cond_expr (code, vop, mask, gsi);
 	    }
 
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 840e162c7f0..845647b4399 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -13350,6 +13350,8 @@ vect_analyze_stmt (vec_info *vinfo,
 				      NULL, NULL, node, cost_vec)
 	  || vectorizable_load (vinfo, stmt_info, NULL, NULL, node, cost_vec)
 	  || vectorizable_store (vinfo, stmt_info, NULL, NULL, node, cost_vec)
+	  || vectorizable_lane_reducing (as_a <loop_vec_info> (vinfo),
+					 stmt_info, node, cost_vec)
 	  || vectorizable_reduction (as_a <loop_vec_info> (vinfo), stmt_info,
 				     node, node_instance, cost_vec)
 	  || vectorizable_induction (as_a <loop_vec_info> (vinfo), stmt_info,
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 60224f4e284..94736736dcc 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2455,6 +2455,8 @@ extern loop_vec_info vect_create_loop_vinfo (class loop *, vec_info_shared *,
 extern bool vectorizable_live_operation (vec_info *, stmt_vec_info,
 					 slp_tree, slp_instance, int,
 					 bool, stmt_vector_for_cost *);
+extern bool vectorizable_lane_reducing (loop_vec_info, stmt_vec_info,
+					slp_tree, stmt_vector_for_cost *);
 extern bool vectorizable_reduction (loop_vec_info, stmt_vec_info,
 				    slp_tree, slp_instance,
 				    stmt_vector_for_cost *);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
  2024-06-26 14:50         ` Feng Xue OS
@ 2024-06-28 13:06           ` Richard Biener
  0 siblings, 0 replies; 9+ messages in thread
From: Richard Biener @ 2024-06-28 13:06 UTC (permalink / raw)
  To: Feng Xue OS; +Cc: gcc-patches

On Wed, Jun 26, 2024 at 4:50 PM Feng Xue OS <fxue@os.amperecomputing.com> wrote:
>
> Updated the patch.
>
> For a lane-reducing operation (dot-prod/widen-sum/sad) in a loop reduction,
> the current vectorizer can only handle the pattern if the reduction chain
> contains no other operation, whether that other is normal or lane-reducing.
>
> Actually, to allow multiple arbitrary lane-reducing operations, we need to
> support vectorization of a loop reduction chain with mixed input vectypes.
> Since the lanes of a vectype may vary with the operation, the effective
> ncopies of the vectorized statements may also differ between operations,
> which causes a mismatch on the vectorized def-use cycles. A simple way is
> to align all operations with the one that has the most ncopies; the gap can
> be filled by generating extra trivial pass-through copies. For example:
>
>    int sum = 0;
>    for (i)
>      {
>        sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
>        sum += w[i];               // widen-sum <vector(16) char>
>        sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
>        sum += n[i];               // normal <vector(4) int>
>      }
>
> The vector size is 128-bit, and the vectorization factor is 16.  Reduction
> statements would be transformed as:
>
>    vector<4> int sum_v0 = { 0, 0, 0, 0 };
>    vector<4> int sum_v1 = { 0, 0, 0, 0 };
>    vector<4> int sum_v2 = { 0, 0, 0, 0 };
>    vector<4> int sum_v3 = { 0, 0, 0, 0 };
>
>    for (i / 16)
>      {
>        sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
>        sum_v1 = sum_v1;  // copy
>        sum_v2 = sum_v2;  // copy
>        sum_v3 = sum_v3;  // copy
>
>        sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
>        sum_v1 = sum_v1;  // copy
>        sum_v2 = sum_v2;  // copy
>        sum_v3 = sum_v3;  // copy
>
>        sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
>        sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
>        sum_v2 = sum_v2;  // copy
>        sum_v3 = sum_v3;  // copy
>
>        sum_v0 += n_v0[i: 0  ~ 3 ];
>        sum_v1 += n_v1[i: 4  ~ 7 ];
>        sum_v2 += n_v2[i: 8  ~ 11];
>        sum_v3 += n_v3[i: 12 ~ 15];
>      }
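>
>    (Illustrative sketch, not part of the patch: after the loop, the four
>    accumulator vectors would be combined by the usual reduction epilogue,
>    roughly as follows.)
>
>    vector<4> int sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;
>    sum = REDUC_PLUS (sum_v);  // horizontal add of the four lanes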
>
> 2024-03-22 Feng Xue <fxue@os.amperecomputing.com>
>
> gcc/
>         PR tree-optimization/114440
>         * tree-vectorizer.h (vectorizable_lane_reducing): New function
>         declaration.
>         * tree-vect-stmts.cc (vect_analyze_stmt): Call new function
>         vectorizable_lane_reducing to analyze lane-reducing operation.
>         * tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation
>         code related to emulated_mixed_dot_prod.
>         (vect_reduction_update_partial_vector_usage): Compute ncopies by
>         the original means for a single-lane slp node.
>         (vectorizable_lane_reducing): New function.
>         (vectorizable_reduction): Allow multiple lane-reducing operations in
>         loop reduction. Move some original lane-reducing related code to
>         vectorizable_lane_reducing.
>         (vect_transform_reduction): Extend transformation to support reduction
>         statements with mixed input vectypes.
>
> gcc/testsuite/
>         PR tree-optimization/114440
>         * gcc.dg/vect/vect-reduc-chain-1.c
>         * gcc.dg/vect/vect-reduc-chain-2.c
>         * gcc.dg/vect/vect-reduc-chain-3.c
>         * gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
>         * gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
>         * gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
>         * gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
>         * gcc.dg/vect/vect-reduc-dot-slp-1.c
> ---
>  .../gcc.dg/vect/vect-reduc-chain-1.c          |  62 ++++
>  .../gcc.dg/vect/vect-reduc-chain-2.c          |  77 ++++
>  .../gcc.dg/vect/vect-reduc-chain-3.c          |  66 ++++
>  .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c  |  95 +++++
>  .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c  |  67 ++++
>  .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c  |  79 +++++
>  .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c  |  63 ++++
>  .../gcc.dg/vect/vect-reduc-dot-slp-1.c        |  60 ++++
>  gcc/tree-vect-loop.cc                         | 333 ++++++++++++++----
>  gcc/tree-vect-stmts.cc                        |   2 +
>  gcc/tree-vectorizer.h                         |   2 +
>  11 files changed, 836 insertions(+), 70 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
> new file mode 100644
> index 00000000000..04bfc419dbd
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
> @@ -0,0 +1,62 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> +
> +#include "tree-vect.h"
> +
> +#define N 50
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +f (SIGNEDNESS_1 int res,
> +   SIGNEDNESS_2 char *restrict a,
> +   SIGNEDNESS_2 char *restrict b,
> +   SIGNEDNESS_2 char *restrict c,
> +   SIGNEDNESS_2 char *restrict d,
> +   SIGNEDNESS_1 int *restrict e)
> +{
> +  for (int i = 0; i < N; ++i)
> +    {
> +      res += a[i] * b[i];
> +      res += c[i] * d[i];
> +      res += e[i];
> +    }
> +  return res;
> +}
> +
> +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
> +#define OFFSET 20
> +
> +int
> +main (void)
> +{
> +  check_vect ();
> +
> +  SIGNEDNESS_2 char a[N], b[N];
> +  SIGNEDNESS_2 char c[N], d[N];
> +  SIGNEDNESS_1 int e[N];
> +  int expected = 0x12345;
> +  for (int i = 0; i < N; ++i)
> +    {
> +      a[i] = BASE + i * 5;
> +      b[i] = BASE + OFFSET + i * 4;
> +      c[i] = BASE + i * 2;
> +      d[i] = BASE + OFFSET + i * 3;
> +      e[i] = i;
> +      asm volatile ("" ::: "memory");
> +      expected += a[i] * b[i];
> +      expected += c[i] * d[i];
> +      expected += e[i];
> +    }
> +  if (f (0x12345, a, b, c, d, e) != expected)
> +    __builtin_abort ();
> +}
> +
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 2 "vect" { target vect_sdot_qi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
> new file mode 100644
> index 00000000000..6c803b80120
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
> @@ -0,0 +1,77 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> +
> +#include "tree-vect.h"
> +
> +#define N 50
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 unsigned
> +#define SIGNEDNESS_3 signed
> +#define SIGNEDNESS_4 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +fn (SIGNEDNESS_1 int res,
> +   SIGNEDNESS_2 char *restrict a,
> +   SIGNEDNESS_2 char *restrict b,
> +   SIGNEDNESS_3 char *restrict c,
> +   SIGNEDNESS_3 char *restrict d,
> +   SIGNEDNESS_4 short *restrict e,
> +   SIGNEDNESS_4 short *restrict f,
> +   SIGNEDNESS_1 int *restrict g)
> +{
> +  for (int i = 0; i < N; ++i)
> +    {
> +      res += a[i] * b[i];
> +      res += i + 1;
> +      res += c[i] * d[i];
> +      res += e[i] * f[i];
> +      res += g[i];
> +    }
> +  return res;
> +}
> +
> +#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
> +#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -126 : 4)
> +#define BASE4 ((SIGNEDNESS_4 int) -1 < 0 ? -1026 : 373)
> +#define OFFSET 20
> +
> +int
> +main (void)
> +{
> +  check_vect ();
> +
> +  SIGNEDNESS_2 char a[N], b[N];
> +  SIGNEDNESS_3 char c[N], d[N];
> +  SIGNEDNESS_4 short e[N], f[N];
> +  SIGNEDNESS_1 int g[N];
> +  int expected = 0x12345;
> +  for (int i = 0; i < N; ++i)
> +    {
> +      a[i] = BASE2 + i * 5;
> +      b[i] = BASE2 + OFFSET + i * 4;
> +      c[i] = BASE3 + i * 2;
> +      d[i] = BASE3 + OFFSET + i * 3;
> +      e[i] = BASE4 + i * 6;
> +      f[i] = BASE4 + OFFSET + i * 5;
> +      g[i] = i;
> +      asm volatile ("" ::: "memory");
> +      expected += a[i] * b[i];
> +      expected += i + 1;
> +      expected += c[i] * d[i];
> +      expected += e[i] * f[i];
> +      expected += g[i];
> +    }
> +  if (fn (0x12345, a, b, c, d, e, f, g) != expected)
> +    __builtin_abort ();
> +}
> +
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_sdot_qi } } } } */
> +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_udot_qi } } } } */
> +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target { vect_sdot_hi } } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
> new file mode 100644
> index 00000000000..a41e4b176c4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
> @@ -0,0 +1,66 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> +/* { dg-require-effective-target vect_int } */
> +
> +#include "tree-vect.h"
> +
> +#define N 50
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 unsigned
> +#define SIGNEDNESS_3 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +f (SIGNEDNESS_1 int res,
> +   SIGNEDNESS_2 char *restrict a,
> +   SIGNEDNESS_2 char *restrict b,
> +   SIGNEDNESS_3 short *restrict c,
> +   SIGNEDNESS_3 short *restrict d,
> +   SIGNEDNESS_1 int *restrict e)
> +{
> +  for (int i = 0; i < N; ++i)
> +    {
> +      short diff = a[i] - b[i];
> +      SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff;
> +      res += abs;
> +      res += c[i] * d[i];
> +      res += e[i];
> +    }
> +  return res;
> +}
> +
> +#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
> +#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -1236 : 373)
> +#define OFFSET 20
> +
> +int
> +main (void)
> +{
> +  check_vect ();
> +
> +  SIGNEDNESS_2 char a[N], b[N];
> +  SIGNEDNESS_3 short c[N], d[N];
> +  SIGNEDNESS_1 int e[N];
> +  int expected = 0x12345;
> +  for (int i = 0; i < N; ++i)
> +    {
> +      a[i] = BASE2 + i * 5;
> +      b[i] = BASE2 - i * 4;
> +      c[i] = BASE3 + i * 2;
> +      d[i] = BASE3 + OFFSET + i * 3;
> +      e[i] = i;
> +      asm volatile ("" ::: "memory");
> +      short diff = a[i] - b[i];
> +      SIGNEDNESS_2 short abs = diff < 0 ? -diff : diff;
> +      expected += abs;
> +      expected += c[i] * d[i];
> +      expected += e[i];
> +    }
> +  if (f (0x12345, a, b, c, d, e) != expected)
> +    __builtin_abort ();
> +}
> +
> +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = SAD_EXPR" "vect" { target vect_udot_qi } } } */
> +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ = DOT_PROD_EXPR" "vect" { target vect_sdot_hi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
> new file mode 100644
> index 00000000000..c2831fbcc8e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
> @@ -0,0 +1,95 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> +
> +#include "tree-vect.h"
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +f (SIGNEDNESS_1 int res,
> +   SIGNEDNESS_2 char *a,
> +   SIGNEDNESS_2 char *b,
> +   int step, int n)
> +{
> +  for (int i = 0; i < n; i++)
> +    {
> +      res += a[0] * b[0];
> +      res += a[1] * b[1];
> +      res += a[2] * b[2];
> +      res += a[3] * b[3];
> +      res += a[4] * b[4];
> +      res += a[5] * b[5];
> +      res += a[6] * b[6];
> +      res += a[7] * b[7];
> +      res += a[8] * b[8];
> +      res += a[9] * b[9];
> +      res += a[10] * b[10];
> +      res += a[11] * b[11];
> +      res += a[12] * b[12];
> +      res += a[13] * b[13];
> +      res += a[14] * b[14];
> +      res += a[15] * b[15];
> +
> +      a += step;
> +      b += step;
> +    }
> +
> +  return res;
> +}
> +
> +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
> +#define OFFSET 20
> +
> +int
> +main (void)
> +{
> +  check_vect ();
> +
> +  SIGNEDNESS_2 char a[100], b[100];
> +  int expected = 0x12345;
> +  int step = 16;
> +  int n = 2;
> +  int t = 0;
> +
> +  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
> +    {
> +      a[i] = BASE + i * 5;
> +      b[i] = BASE + OFFSET + i * 4;
> +      asm volatile ("" ::: "memory");
> +    }
> +
> +  for (int i = 0; i < n; i++)
> +    {
> +      asm volatile ("" ::: "memory");
> +      expected += a[t + 0] * b[t + 0];
> +      expected += a[t + 1] * b[t + 1];
> +      expected += a[t + 2] * b[t + 2];
> +      expected += a[t + 3] * b[t + 3];
> +      expected += a[t + 4] * b[t + 4];
> +      expected += a[t + 5] * b[t + 5];
> +      expected += a[t + 6] * b[t + 6];
> +      expected += a[t + 7] * b[t + 7];
> +      expected += a[t + 8] * b[t + 8];
> +      expected += a[t + 9] * b[t + 9];
> +      expected += a[t + 10] * b[t + 10];
> +      expected += a[t + 11] * b[t + 11];
> +      expected += a[t + 12] * b[t + 12];
> +      expected += a[t + 13] * b[t + 13];
> +      expected += a[t + 14] * b[t + 14];
> +      expected += a[t + 15] * b[t + 15];
> +      t += step;
> +    }
> +
> +  if (f (0x12345, a, b, step, n) != expected)
> +    __builtin_abort ();
> +}
> +
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 16 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
> new file mode 100644
> index 00000000000..4114264a364
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
> @@ -0,0 +1,67 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> +
> +#include "tree-vect.h"
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +f (SIGNEDNESS_1 int res,
> +   SIGNEDNESS_2 char *a,
> +   SIGNEDNESS_2 char *b,
> +   int n)
> +{
> +  for (int i = 0; i < n; i++)
> +    {
> +      res += a[5 * i + 0] * b[5 * i + 0];
> +      res += a[5 * i + 1] * b[5 * i + 1];
> +      res += a[5 * i + 2] * b[5 * i + 2];
> +      res += a[5 * i + 3] * b[5 * i + 3];
> +      res += a[5 * i + 4] * b[5 * i + 4];
> +    }
> +
> +  return res;
> +}
> +
> +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
> +#define OFFSET 20
> +
> +int
> +main (void)
> +{
> +  check_vect ();
> +
> +  SIGNEDNESS_2 char a[100], b[100];
> +  int expected = 0x12345;
> +  int n = 18;
> +
> +  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
> +    {
> +      a[i] = BASE + i * 5;
> +      b[i] = BASE + OFFSET + i * 4;
> +      asm volatile ("" ::: "memory");
> +    }
> +
> +  for (int i = 0; i < n; i++)
> +    {
> +      asm volatile ("" ::: "memory");
> +      expected += a[5 * i + 0] * b[5 * i + 0];
> +      expected += a[5 * i + 1] * b[5 * i + 1];
> +      expected += a[5 * i + 2] * b[5 * i + 2];
> +      expected += a[5 * i + 3] * b[5 * i + 3];
> +      expected += a[5 * i + 4] * b[5 * i + 4];
> +    }
> +
> +  if (f (0x12345, a, b, n) != expected)
> +    __builtin_abort ();
> +}
> +
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 5 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
> new file mode 100644
> index 00000000000..2cdecc36d16
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
> @@ -0,0 +1,79 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> +
> +#include "tree-vect.h"
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +f (SIGNEDNESS_1 int res,
> +   SIGNEDNESS_2 short *a,
> +   SIGNEDNESS_2 short *b,
> +   int step, int n)
> +{
> +  for (int i = 0; i < n; i++)
> +    {
> +      res += a[0] * b[0];
> +      res += a[1] * b[1];
> +      res += a[2] * b[2];
> +      res += a[3] * b[3];
> +      res += a[4] * b[4];
> +      res += a[5] * b[5];
> +      res += a[6] * b[6];
> +      res += a[7] * b[7];
> +
> +      a += step;
> +      b += step;
> +    }
> +
> +  return res;
> +}
> +
> +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)
> +#define OFFSET 20
> +
> +int
> +main (void)
> +{
> +  check_vect ();
> +
> +  SIGNEDNESS_2 short a[100], b[100];
> +  int expected = 0x12345;
> +  int step = 8;
> +  int n = 2;
> +  int t = 0;
> +
> +  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
> +    {
> +      a[i] = BASE + i * 5;
> +      b[i] = BASE + OFFSET + i * 4;
> +      asm volatile ("" ::: "memory");
> +    }
> +
> +  for (int i = 0; i < n; i++)
> +    {
> +      asm volatile ("" ::: "memory");
> +      expected += a[t + 0] * b[t + 0];
> +      expected += a[t + 1] * b[t + 1];
> +      expected += a[t + 2] * b[t + 2];
> +      expected += a[t + 3] * b[t + 3];
> +      expected += a[t + 4] * b[t + 4];
> +      expected += a[t + 5] * b[t + 5];
> +      expected += a[t + 6] * b[t + 6];
> +      expected += a[t + 7] * b[t + 7];
> +      t += step;
> +    }
> +
> +  if (f (0x12345, a, b, step, n) != expected)
> +    __builtin_abort ();
> +}
> +
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 8 "vect"  { target vect_sdot_hi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
> new file mode 100644
> index 00000000000..32c0f30c77b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
> @@ -0,0 +1,63 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> +
> +#include "tree-vect.h"
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +f (SIGNEDNESS_1 int res,
> +   SIGNEDNESS_2 short *a,
> +   SIGNEDNESS_2 short *b,
> +   int n)
> +{
> +  for (int i = 0; i < n; i++)
> +    {
> +      res += a[3 * i + 0] * b[3 * i + 0];
> +      res += a[3 * i + 1] * b[3 * i + 1];
> +      res += a[3 * i + 2] * b[3 * i + 2];
> +    }
> +
> +  return res;
> +}
> +
> +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)
> +#define OFFSET 20
> +
> +int
> +main (void)
> +{
> +  check_vect ();
> +
> +  SIGNEDNESS_2 short a[100], b[100];
> +  int expected = 0x12345;
> +  int n = 18;
> +
> +  for (int i = 0; i < sizeof (a) / sizeof (a[0]); ++i)
> +    {
> +      a[i] = BASE + i * 5;
> +      b[i] = BASE + OFFSET + i * 4;
> +      asm volatile ("" ::: "memory");
> +    }
> +
> +  for (int i = 0; i < n; i++)
> +    {
> +      asm volatile ("" ::: "memory");
> +      expected += a[3 * i + 0] * b[3 * i + 0];
> +      expected += a[3 * i + 1] * b[3 * i + 1];
> +      expected += a[3 * i + 2] * b[3 * i + 2];
> +    }
> +
> +  if (f (0x12345, a, b, n) != expected)
> +    __builtin_abort ();
> +}
> +
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 3 "vect"  { target vect_sdot_hi } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
> new file mode 100644
> index 00000000000..84c82b023d4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
> @@ -0,0 +1,60 @@
> +/* Disabling epilogues until we find a better way to deal with scans.  */
> +/* { dg-do compile } */
> +/* { dg-additional-options "--param vect-epilogues-nomask=0 -fdump-tree-optimized" } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */
> +
> +#include "tree-vect.h"
> +
> +#ifndef SIGNEDNESS_1
> +#define SIGNEDNESS_1 signed
> +#define SIGNEDNESS_2 signed
> +#endif
> +
> +SIGNEDNESS_1 int __attribute__ ((noipa))
> +f (SIGNEDNESS_1 int res0,
> +   SIGNEDNESS_1 int res1,
> +   SIGNEDNESS_1 int res2,
> +   SIGNEDNESS_1 int res3,
> +   SIGNEDNESS_1 int res4,
> +   SIGNEDNESS_1 int res5,
> +   SIGNEDNESS_1 int res6,
> +   SIGNEDNESS_1 int res7,
> +   SIGNEDNESS_1 int res8,
> +   SIGNEDNESS_1 int res9,
> +   SIGNEDNESS_1 int resA,
> +   SIGNEDNESS_1 int resB,
> +   SIGNEDNESS_1 int resC,
> +   SIGNEDNESS_1 int resD,
> +   SIGNEDNESS_1 int resE,
> +   SIGNEDNESS_1 int resF,
> +   SIGNEDNESS_2 char *a,
> +   SIGNEDNESS_2 char *b)
> +{
> +  for (int i = 0; i < 64; i += 16)
> +    {
> +      res0 += a[i + 0x00] * b[i + 0x00];
> +      res1 += a[i + 0x01] * b[i + 0x01];
> +      res2 += a[i + 0x02] * b[i + 0x02];
> +      res3 += a[i + 0x03] * b[i + 0x03];
> +      res4 += a[i + 0x04] * b[i + 0x04];
> +      res5 += a[i + 0x05] * b[i + 0x05];
> +      res6 += a[i + 0x06] * b[i + 0x06];
> +      res7 += a[i + 0x07] * b[i + 0x07];
> +      res8 += a[i + 0x08] * b[i + 0x08];
> +      res9 += a[i + 0x09] * b[i + 0x09];
> +      resA += a[i + 0x0A] * b[i + 0x0A];
> +      resB += a[i + 0x0B] * b[i + 0x0B];
> +      resC += a[i + 0x0C] * b[i + 0x0C];
> +      resD += a[i + 0x0D] * b[i + 0x0D];
> +      resE += a[i + 0x0E] * b[i + 0x0E];
> +      resF += a[i + 0x0F] * b[i + 0x0F];
> +    }
> +
> +  return res0 ^ res1 ^ res2 ^ res3 ^ res4 ^ res5 ^ res6 ^ res7 ^
> +         res8 ^ res9 ^ resA ^ resB ^ resC ^ resD ^ resE ^ resF;
> +}
> +
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
> +/* { dg-final { scan-tree-dump-not "DOT_PROD_EXPR" "optimized" } } */
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 419f4b08d2b..6bfb0e72905 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -5324,8 +5324,6 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
>    if (!gimple_extract_op (orig_stmt_info->stmt, &op))
>      gcc_unreachable ();
>
> -  bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
> -
>    if (reduction_type == EXTRACT_LAST_REDUCTION)
>      /* No extra instructions are needed in the prologue.  The loop body
>         operations are costed in vectorizable_condition.  */
> @@ -5360,12 +5358,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
>            initial result of the data reduction, initial value of the index
>            reduction.  */
>         prologue_stmts = 4;
> -      else if (emulated_mixed_dot_prod)
> -       /* We need the initial reduction value and two invariants:
> -          one that contains the minimum signed value and one that
> -          contains half of its negative.  */
> -       prologue_stmts = 3;
>        else
> +       /* We need the initial reduction value.  */
>         prologue_stmts = 1;
>        prologue_cost += record_stmt_cost (cost_vec, prologue_stmts,
>                                          scalar_to_vec, stmt_info, 0,
> @@ -7466,7 +7460,10 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
>        vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
>        unsigned nvectors;
>
> -      if (slp_node)
> +      /* TODO: The number of vector statements for a lane-reducing op is
> +        over-estimated, so we have to recompute it when the containing slp
> +        node is single-lane.  Need a general means to correct this value.  */
> +      if (slp_node && SLP_TREE_LANES (slp_node) > 1)
>         nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
>        else
>         nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
> @@ -7478,6 +7475,154 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
>      }
>  }
>
> +/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
> +   the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC.
> +   Now there are three such kinds of operations: dot-prod/widen-sum/sad
> +   (sum-of-absolute-differences).
> +
> +   For a lane-reducing operation, the loop reduction path that it lies in
> +   may contain a normal operation, or another lane-reducing operation of
> +   different input type size, as in this example:
> +
> +     int sum = 0;
> +     for (i)
> +       {
> +         ...
> +         sum += d0[i] * d1[i];       // dot-prod <vector(16) char>
> +         sum += w[i];                // widen-sum <vector(16) char>
> +         sum += abs(s0[i] - s1[i]);  // sad <vector(8) short>
> +         sum += n[i];                // normal <vector(4) int>
> +         ...
> +       }
> +
> +   The vectorization factor is essentially determined by the operation whose
> +   input vectype has the most lanes ("vector(16) char" in the example), while
> +   we need to choose the input vectype with the fewest lanes ("vector(4) int"
> +   in the example) for the reduction PHI statement.  */
> +
> +bool
> +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> +                           slp_tree slp_node, stmt_vector_for_cost *cost_vec)
> +{
> +  gimple *stmt = stmt_info->stmt;
> +
> +  if (!lane_reducing_stmt_p (stmt))
> +    return false;
> +
> +  tree type = TREE_TYPE (gimple_assign_lhs (stmt));
> +
> +  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
> +    return false;
> +
> +  /* Do not try to vectorize bit-precision reductions.  */
> +  if (!type_has_mode_precision_p (type))
> +    return false;
> +
> +  /* A lane-reducing op should be contained in some slp node.  */
> +  if (!slp_node)
> +    return false;
> +
> +  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
> +    {
> +      stmt_vec_info def_stmt_info;
> +      slp_tree slp_op;
> +      tree op;
> +      tree vectype;
> +      enum vect_def_type dt;
> +
> +      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
> +                              &slp_op, &dt, &vectype, &def_stmt_info))
> +       {
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                            "use not simple.\n");
> +         return false;
> +       }
> +
> +      if (!vectype)
> +       {
> +         vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op),
> +                                                slp_op);
> +         if (!vectype)
> +           return false;
> +       }
> +
> +      if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))
> +       {
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                            "incompatible vector types for invariants\n");
> +         return false;
> +       }
> +
> +      if (i == STMT_VINFO_REDUC_IDX (stmt_info))
> +       continue;
> +
> +      /* There should be at most one cycle def in the stmt.  */
> +      if (VECTORIZABLE_CYCLE_DEF (dt))
> +       return false;
> +    }
> +
> +  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
> +
> +  /* TODO: Support lane-reducing operations that do not directly participate
> +     in loop reduction.  */
> +  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
> +    return false;
> +
> +  /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not
> +     recognized.  */
> +  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
> +  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
> +
> +  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> +  int ncopies_for_cost;
> +
> +  if (SLP_TREE_LANES (slp_node) > 1)
> +    {
> +      /* Now lane-reducing operations in a non-single-lane slp node should only
> +        come from the same loop reduction path.  */
> +      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
> +      ncopies_for_cost = 1;
> +    }
> +  else
> +    {
> +      ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);
> +      gcc_assert (ncopies_for_cost >= 1);
> +    }
> +
> +  if (vect_is_emulated_mixed_dot_prod (stmt_info))
> +    {
> +      /* We need two extra invariants: one that contains the minimum signed
> +        value and one that contains half of its negative.  */
> +      int prologue_stmts = 2;
> +      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
> +                                       scalar_to_vec, stmt_info, 0,
> +                                       vect_prologue);
> +      if (dump_enabled_p ())
> +       dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
> +                    "extra prologue_cost = %d .\n", cost);
> +
> +      /* Three dot-products and a subtraction.  */
> +      ncopies_for_cost *= 4;
> +    }
> +
> +  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
> +                   vect_body);
> +
> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> +    {
> +      enum tree_code code = gimple_assign_rhs_code (stmt);
> +      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
> +                                                 slp_node, code, type,
> +                                                 vectype_in);
> +    }
> +
> +  /* Transform via vect_transform_reduction.  */
> +  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
> +  return true;
> +}
> +
>  /* Function vectorizable_reduction.
>
>     Check if STMT_INFO performs a reduction operation that can be vectorized.
> @@ -7811,18 +7956,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>    if (!type_has_mode_precision_p (op.type))
>      return false;
>
> -  /* For lane-reducing ops we're reducing the number of reduction PHIs
> -     which means the only use of that may be in the lane-reducing operation.  */
> -  if (lane_reducing
> -      && reduc_chain_length != 1
> -      && !only_slp_reduc_chain)
> -    {
> -      if (dump_enabled_p ())
> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> -                        "lane-reducing reduction with extra stmts.\n");
> -      return false;
> -    }
> -
>    /* Lane-reducing ops also never can be used in a SLP reduction group
>       since we'll mix lanes belonging to different reductions.  But it's
>       OK to use them in a reduction chain or when the reduction group
> @@ -8362,14 +8495,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>        && loop_vinfo->suggested_unroll_factor == 1)
>      single_defuse_cycle = true;
>
> -  if (single_defuse_cycle || lane_reducing)
> +  if (single_defuse_cycle && !lane_reducing)
>      {
>        gcc_assert (op.code != COND_EXPR);
>
> -      /* 4. Supportable by target?  */
> -      bool ok = true;
> -
> -      /* 4.1. check support for the operation in the loop
> +      /* 4. check support for the operation in the loop
>
>          This isn't necessary for the lane reduction codes, since they
>          can only be produced by pattern matching, and it's up to the
> @@ -8378,14 +8508,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>          mixed-sign dot-products can be implemented using signed
>          dot-products.  */
>        machine_mode vec_mode = TYPE_MODE (vectype_in);
> -      if (!lane_reducing
> -         && !directly_supported_p (op.code, vectype_in, optab_vector))
> +      if (!directly_supported_p (op.code, vectype_in, optab_vector))
>          {
>            if (dump_enabled_p ())
>              dump_printf (MSG_NOTE, "op not supported by target.\n");
>           if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
>               || !vect_can_vectorize_without_simd_p (op.code))
> -           ok = false;
> +           single_defuse_cycle = false;
>           else
>             if (dump_enabled_p ())
>               dump_printf (MSG_NOTE, "proceeding using word mode.\n");
> @@ -8398,16 +8527,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>             dump_printf (MSG_NOTE, "using word mode not possible.\n");
>           return false;
>         }
> -
> -      /* lane-reducing operations have to go through vect_transform_reduction.
> -         For the other cases try without the single cycle optimization.  */
> -      if (!ok)
> -       {
> -         if (lane_reducing)
> -           return false;
> -         else
> -           single_defuse_cycle = false;
> -       }
>      }
>    if (dump_enabled_p () && single_defuse_cycle)
>      dump_printf_loc (MSG_NOTE, vect_location,
> @@ -8415,22 +8534,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>                      "multiple vectors to one in the loop body\n");
>    STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;
>
> -  /* If the reduction stmt is one of the patterns that have lane
> -     reduction embedded we cannot handle the case of ! single_defuse_cycle.  */
> -  if ((ncopies > 1 && ! single_defuse_cycle)
> -      && lane_reducing)
> -    {
> -      if (dump_enabled_p ())
> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> -                        "multi def-use cycle not possible for lane-reducing "
> -                        "reduction operation\n");
> -      return false;
> -    }
> +  /* For a lane-reducing operation, the processing below related to the
> +     single defuse-cycle will be done in its own vectorizable function.
> +     One more thing to note is that the operation must not be involved
> +     in a fold-left reduction.  */
> +  single_defuse_cycle &= !lane_reducing;
>
>    if (slp_node
> -      && !(!single_defuse_cycle
> -          && !lane_reducing
> -          && reduction_type != FOLD_LEFT_REDUCTION))
> +      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
>      for (i = 0; i < (int) op.num_ops; i++)
>        if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
>         {
> @@ -8443,28 +8554,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>    vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
>                              reduction_type, ncopies, cost_vec);
>    /* Cost the reduction op inside the loop if transformed via
> -     vect_transform_reduction.  Otherwise this is costed by the
> -     separate vectorizable_* routines.  */
> -  if (single_defuse_cycle || lane_reducing)
> -    {
> -      int factor = 1;
> -      if (vect_is_emulated_mixed_dot_prod (stmt_info))
> -       /* Three dot-products and a subtraction.  */
> -       factor = 4;
> -      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
> -                       stmt_info, 0, vect_body);
> -    }
> +     vect_transform_reduction for non-lane-reducing operation.  Otherwise
> +     this is costed by the separate vectorizable_* routines.  */
> +  if (single_defuse_cycle)
> +    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
>
>    if (dump_enabled_p ()
>        && reduction_type == FOLD_LEFT_REDUCTION)
>      dump_printf_loc (MSG_NOTE, vect_location,
>                      "using an in-order (fold-left) reduction.\n");
>    STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
> -  /* All but single defuse-cycle optimized, lane-reducing and fold-left
> -     reductions go through their own vectorizable_* routines.  */
> -  if (!single_defuse_cycle
> -      && !lane_reducing
> -      && reduction_type != FOLD_LEFT_REDUCTION)
> +
> +  /* All but single defuse-cycle optimized and fold-left reductions go
> +     through their own vectorizable_* routines.  */
> +  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
>      {
>        stmt_vec_info tem
>         = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
> @@ -8654,6 +8757,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>    bool lane_reducing = lane_reducing_op_p (code);
>    gcc_assert (single_defuse_cycle || lane_reducing);
>
> +  if (lane_reducing)
> +    {
> +      /* The last operand of lane-reducing op is for reduction.  */
> +      gcc_assert (reduc_index == (int) op.num_ops - 1);
> +
> +      /* Now lane-reducing op is contained in some slp node.  */
> +      gcc_assert (slp_node);
> +    }
> +
>    /* Create the destination vector  */
>    tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
>    tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
> @@ -8698,6 +8810,62 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>                          reduc_index == 2 ? op.ops[2] : NULL_TREE,
>                          &vec_oprnds[2]);
>      }
> +  else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
> +    {
> +      /* For a lane-reducing op covered by a single-lane slp node, the input
> +        vectype of the reduction PHI determines the copies of vectorized
> +        def-use cycles, which might be more than the effective copies of the
> +        vectorized lane-reducing reduction statements.  The gap can be
> +        filled by generating extra trivial pass-through copies.  For example:
> +
> +          int sum = 0;
> +          for (i)
> +            {
> +              sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
> +              sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
> +              sum += n[i];               // normal <vector(4) int>
> +            }
> +
> +        The vector size is 128-bit, and the vectorization factor is 16.
> +        Reduction statements would be transformed as:
> +
> +          vector<4> int sum_v0 = { 0, 0, 0, 0 };
> +          vector<4> int sum_v1 = { 0, 0, 0, 0 };
> +          vector<4> int sum_v2 = { 0, 0, 0, 0 };
> +          vector<4> int sum_v3 = { 0, 0, 0, 0 };
> +
> +          for (i / 16)
> +            {
> +              sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
> +              sum_v1 = sum_v1;  // copy
> +              sum_v2 = sum_v2;  // copy
> +              sum_v3 = sum_v3;  // copy
> +
> +              sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
> +              sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
> +              sum_v2 = sum_v2;  // copy
> +              sum_v3 = sum_v3;  // copy
> +
> +              sum_v0 += n_v0[i: 0  ~ 3 ];
> +              sum_v1 += n_v1[i: 4  ~ 7 ];
> +              sum_v2 += n_v2[i: 8  ~ 11];
> +              sum_v3 += n_v3[i: 12 ~ 15];
> +            }
> +       */
> +      unsigned using_ncopies = vec_oprnds[0].length ();
> +      unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
> +
> +      gcc_assert (using_ncopies <= reduc_ncopies);
> +
> +      if (using_ncopies < reduc_ncopies)
> +       {
> +         for (unsigned i = 0; i < op.num_ops - 1; i++)
> +           {
> +             gcc_assert (vec_oprnds[i].length () == using_ncopies);
> +             vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
> +           }
> +       }
> +    }
>
>    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
>    unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
> @@ -8706,7 +8874,18 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>      {
>        gimple *new_stmt;
>        tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
> -      if (masked_loop_p && !mask_by_cond_expr)
> +
> +      if (!vop[0] || !vop[1])
> +       {
> +         tree reduc_vop = vec_oprnds[reduc_index][i];
> +
> +         /* Insert a trivial copy when there is no need to generate a
> +            vectorized statement.  */
> +         gcc_assert (reduc_vop);
> +
> +         new_stmt = SSA_NAME_DEF_STMT (reduc_vop);
> +       }
> +      else if (masked_loop_p && !mask_by_cond_expr)
>         {
>           /* No conditional ifns have been defined for lane-reducing op
>              yet.  */
> @@ -8735,8 +8914,22 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>
>           if (masked_loop_p && mask_by_cond_expr)
>             {
> +             unsigned nvectors = vec_num * ncopies;
> +             tree stmt_vectype_in = vectype_in;
> +
> +             /* For a single-lane slp node of a lane-reducing op, we need to
> +                compute the exact number of vector stmts from its input
> +                vectype, since the value got from the slp node is
> +                over-estimated.  TODO: properly set the number somewhere,
> +                so that this fixup could be removed.  */
> +             if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
> +               {
> +                 stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> +                 nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);
> +               }

As said, I don't like this much.  vect_slp_analyze_node_operations_1 sets this
and I think the existing "exception"

  /* Calculate the number of vector statements to be created for the
     scalar stmts in this node.  For SLP reductions it is equal to the
     number of vector statements in the children (which has already been
     calculated by the recursive call).  Otherwise it is the number of
     scalar elements in one scalar iteration (DR_GROUP_SIZE) multiplied by
     VF divided by the number of elements in a vector.  */
  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
      && !STMT_VINFO_DATA_REF (stmt_info)
      && REDUC_GROUP_FIRST_ELEMENT (stmt_info))
    {
      for (unsigned i = 0; i < SLP_TREE_CHILDREN (node).length (); ++i)
        if (SLP_TREE_DEF_TYPE (SLP_TREE_CHILDREN (node)[i]) ==
vect_internal_def)
          {
            SLP_TREE_NUMBER_OF_VEC_STMTS (node)
              = SLP_TREE_NUMBER_OF_VEC_STMTS (SLP_TREE_CHILDREN (node)[i]);
            break;
          }
    }

could be changed (or amended if replacing doesn't work out) to

  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
      && STMT_VINFO_REDUC_IDX (stmt_info)
      // do we have this always set?
      && STMT_VINFO_REDUC_VECTYPE_IN (stmt_info))
   {
      do the same as in else {} but using VECTYPE_IN
   }

Or maybe scrap the special case and use STMT_VINFO_REDUC_VECTYPE_IN,
when that's set, instead of SLP_TREE_VECTYPE?  As said, having a wrong
SLP_TREE_NUMBER_OF_VEC_STMTS is going to backfire.
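
For concreteness, a minimal sketch of that amended hunk (with the
assumptions that STMT_VINFO_REDUC_IDX and STMT_VINFO_REDUC_VECTYPE_IN are
reliably set for the stmts in question, that the -1 sentinel of
STMT_VINFO_REDUC_IDX is checked explicitly, and that vinfo is a
loop_vec_info at this point) could look like:

  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
      && STMT_VINFO_REDUC_IDX (stmt_info) >= 0
      && STMT_VINFO_REDUC_VECTYPE_IN (stmt_info))
    {
      /* Compute the number of vector stmts from the reduction input
         vectype, analogous to the VF / nunits computation done in the
         else branch for the non-reduction case.  */
      tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
      SLP_TREE_NUMBER_OF_VEC_STMTS (node)
        = vect_get_num_copies (as_a <loop_vec_info> (vinfo), vectype_in);
    }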

> +
>               tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks,
> -                                             vec_num * ncopies, vectype_in, i);
> +                                             nvectors, stmt_vectype_in, i);
>               build_vect_cond_expr (code, vop, mask, gsi);
>             }
>
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 840e162c7f0..845647b4399 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -13350,6 +13350,8 @@ vect_analyze_stmt (vec_info *vinfo,
>                                       NULL, NULL, node, cost_vec)
>           || vectorizable_load (vinfo, stmt_info, NULL, NULL, node, cost_vec)
>           || vectorizable_store (vinfo, stmt_info, NULL, NULL, node, cost_vec)
> +         || vectorizable_lane_reducing (as_a <loop_vec_info> (vinfo),
> +                                        stmt_info, node, cost_vec)
>           || vectorizable_reduction (as_a <loop_vec_info> (vinfo), stmt_info,
>                                      node, node_instance, cost_vec)
>           || vectorizable_induction (as_a <loop_vec_info> (vinfo), stmt_info,
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 60224f4e284..94736736dcc 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -2455,6 +2455,8 @@ extern loop_vec_info vect_create_loop_vinfo (class loop *, vec_info_shared *,
>  extern bool vectorizable_live_operation (vec_info *, stmt_vec_info,
>                                          slp_tree, slp_instance, int,
>                                          bool, stmt_vector_for_cost *);
> +extern bool vectorizable_lane_reducing (loop_vec_info, stmt_vec_info,
> +                                       slp_tree, stmt_vector_for_cost *);
>  extern bool vectorizable_reduction (loop_vec_info, stmt_vec_info,
>                                     slp_tree, slp_instance,
>                                     stmt_vector_for_cost *);
> --
> 2.17.1
>
> ________________________________________
> From: Feng Xue OS <fxue@os.amperecomputing.com>
> Sent: Tuesday, June 25, 2024 5:32 PM
> To: Richard Biener
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
>
> >>
> >> >> -      if (slp_node)
> >> >> +      if (slp_node && SLP_TREE_LANES (slp_node) > 1)
> >> >
> >> > Hmm, that looks wrong.  It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off
> >> > instead, which is bad.
> >> >
> >> >>         nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
> >> >>        else
> >> >>         nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
> >> >> @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
> >> >>      }
> >> >>  }
> >> >>
> >> >> +/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
> >> >> +   the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC.
> >> >> +   Now there are three such kinds of operations: dot-prod/widen-sum/sad
> >> >> +   (sum-of-absolute-differences).
> >> >> +
> >> >> +   For a lane-reducing operation, the loop reduction path that it lies in
> >> >> +   may contain a normal operation, or another lane-reducing operation of
> >> >> +   different input type size, as in this example:
> >> >> +
> >> >> +     int sum = 0;
> >> >> +     for (i)
> >> >> +       {
> >> >> +         ...
> >> >> +         sum += d0[i] * d1[i];       // dot-prod <vector(16) char>
> >> >> +         sum += w[i];                // widen-sum <vector(16) char>
> >> >> +         sum += abs(s0[i] - s1[i]);  // sad <vector(8) short>
> >> >> +         sum += n[i];                // normal <vector(4) int>
> >> >> +         ...
> >> >> +       }
> >> >> +
> >> >> +   The vectorization factor is essentially determined by the operation
> >> >> +   whose input vectype has the most lanes ("vector(16) char" in the
> >> >> +   example), while we need to choose the input vectype with the fewest
> >> >> +   lanes ("vector(4) int" in the example) for the reduction PHI statement.  */
> >> >> +
> >> >> +bool
> >> >> +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> >> >> +                           slp_tree slp_node, stmt_vector_for_cost *cost_vec)
> >> >> +{
> >> >> +  gimple *stmt = stmt_info->stmt;
> >> >> +
> >> >> +  if (!lane_reducing_stmt_p (stmt))
> >> >> +    return false;
> >> >> +
> >> >> +  tree type = TREE_TYPE (gimple_assign_lhs (stmt));
> >> >> +
> >> >> +  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
> >> >> +    return false;
> >> >> +
> >> >> +  /* Do not try to vectorize bit-precision reductions.  */
> >> >> +  if (!type_has_mode_precision_p (type))
> >> >> +    return false;
> >> >> +
> >> >> +  if (!slp_node)
> >> >> +    return false;
> >> >> +
> >> >> +  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
> >> >> +    {
> >> >> +      stmt_vec_info def_stmt_info;
> >> >> +      slp_tree slp_op;
> >> >> +      tree op;
> >> >> +      tree vectype;
> >> >> +      enum vect_def_type dt;
> >> >> +
> >> >> +      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
> >> >> +                              &slp_op, &dt, &vectype, &def_stmt_info))
> >> >> +       {
> >> >> +         if (dump_enabled_p ())
> >> >> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> +                            "use not simple.\n");
> >> >> +         return false;
> >> >> +       }
> >> >> +
> >> >> +      if (!vectype)
> >> >> +       {
> >> >> +         vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op),
> >> >> +                                                slp_op);
> >> >> +         if (!vectype)
> >> >> +           return false;
> >> >> +       }
> >> >> +
> >> >> +      if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))
> >> >> +       {
> >> >> +         if (dump_enabled_p ())
> >> >> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> +                            "incompatible vector types for invariants\n");
> >> >> +         return false;
> >> >> +       }
> >> >> +
> >> >> +      if (i == STMT_VINFO_REDUC_IDX (stmt_info))
> >> >> +       continue;
> >> >> +
> >> >> +      /* There should be at most one cycle def in the stmt.  */
> >> >> +      if (VECTORIZABLE_CYCLE_DEF (dt))
> >> >> +       return false;
> >> >> +    }
> >> >> +
> >> >> +  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
> >> >> +
> >> >> +  /* TODO: Support lane-reducing operations that do not directly participate
> >> >> +     in loop reduction.  */
> >> >> +  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
> >> >> +    return false;
> >> >> +
> >> >> +  /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not
> >> >> +     recognized.  */
> >> >> +  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
> >> >> +  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
> >> >> +
> >> >> +  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> >> >> +  int ncopies_for_cost;
> >> >> +
> >> >> +  if (SLP_TREE_LANES (slp_node) > 1)
> >> >> +    {
> >> >> +      /* Now lane-reducing operations in a non-single-lane slp node should only
> >> >> +        come from the same loop reduction path.  */
> >> >> +      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
> >> >> +      ncopies_for_cost = 1;
> >> >> +    }
> >> >> +  else
> >> >> +    {
> >> >> +      ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);
> >> >
> >> > OK, so the fact that the ops are lane-reducing means they effectively
> >> > change the VF for the result.  That's only possible as we tightly control
> >> > code generation and "adjust" to the expected VF (by inserting the copies
> >> > you mentioned above), but only up to the highest number of outputs
> >> > created in the reduction chain.  In that sense, instead of talking about
> >> > and recording "input vector types", wouldn't it make more sense to record
> >> > the effective vectorization factor for the reduction instance?  That VF
> >> > would be at most the loop's VF but could be as low as 1.  Once we have a
> >> > non-lane-reducing operation in the reduction chain it would always be
> >> > equal to the loop's VF.
> >> >
> >> > ncopies would then be always determined by that reduction instance VF and
> >> > the accumulator vector type (STMT_VINFO_VECTYPE).  This reduction
> >> > instance VF would also trivially indicate the force-single-def-use-cycle
> >> > case, possibly simplifying code?
> >>
> >> I tried to add such an effective VF, but the vectype_in is still needed in some
> >> scenarios, such as when checking whether a dot-prod stmt is emulated or not.
> >> The former could be deduced from the latter, so recording both things seems
> >> to be redundant. Another consideration is that for a normal op, ncopies
> >> is determined from the type (STMT_VINFO_VECTYPE), but for a lane-reducing op,
> >> it is from the VF. So, is there a better means to unify them?
> >
> > AFAICS reductions are special in that they, for the accumulation SSA cycle,
> > do not adhere to the loop's VF but as an optimization can choose a smaller one.
> > OTOH STMT_VINFO_VECTYPE is for the vector type used for individual
> > operations which even for lane-reducing ops is adhered to - those just
> > may use a smaller VF, that of the reduction SSA cycle.
> >
> > So what's redundant is STMT_VINFO_REDUC_VECTYPE_IN - or rather
> > it's not fully redundant but needlessly replicated over all stmts participating
> > in the reduction instead of recording the reduction VF in the reduc_info and
> > using that (plus STMT_VINFO_VECTYPE) to compute the effective ncopies
> > for stmts in the reduction cycle.
> >
> > At least that was my idea ...
> >
>
> For lane-reducing ops and the single-defuse-cycle optimization, we could assume
> no lanes are reduced, and always generate vectorized statements according to the
> normal VF; where a placeholder is needed, we would just insert some trivial
> statement like a zero-initialization or a pass-through copy.  We could then
> define an "effective VF or ncopies" to control the lane-reducing related aspects
> in analysis and codegen (such as the vect_get_loop_mask below).  Since everything
> will eventually become SLP-based, I think a suitable place to add such a field
> might be the slp_node, as a supplement to "vect_stmts_size", and it would be
> adjusted in vectorizable_reduction.  So could we do this refinement in separate
> patches once the non-slp code path is removed?  A minimal sketch follows.
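>
> As a minimal strawman of what I mean (the new field and where it is set are
> hypothetical, not part of this patch; vec_stmts_size is the field behind
> SLP_TREE_NUMBER_OF_VEC_STMTS):
>
>   struct _slp_tree {
>     ...
>     /* Number of vector stmts to create for this node.  */
>     unsigned int vec_stmts_size;
>     /* Effective copies that do real lane-reducing work; the gap up to
>        vec_stmts_size would be filled with trivial pass-through copies.
>        Adjusted in vectorizable_reduction.  */
>     unsigned int effective_vec_stmts_size;
>   };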
>
> >> >> +      gcc_assert (ncopies_for_cost >= 1);
> >> >> +    }
> >> >> +
> >> >> +  if (vect_is_emulated_mixed_dot_prod (stmt_info))
> >> >> +    {
> >> >> +      /* We need two extra invariants: one that contains the minimum signed
> >> >> +        value and one that contains half of its negative.  */
> >> >> +      int prologue_stmts = 2;
> >> >> +      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
> >> >> +                                       scalar_to_vec, stmt_info, 0,
> >> >> +                                       vect_prologue);
> >> >> +      if (dump_enabled_p ())
> >> >> +       dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
> >> >> +                    "extra prologue_cost = %d .\n", cost);
> >> >> +
> >> >> +      /* Three dot-products and a subtraction.  */
> >> >> +      ncopies_for_cost *= 4;
> >> >> +    }
> >> >> +
> >> >> +  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
> >> >> +                   vect_body);
> >> >> +
> >> >> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> >> >> +    {
> >> >> +      enum tree_code code = gimple_assign_rhs_code (stmt);
> >> >> +      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
> >> >> +                                                 slp_node, code, type,
> >> >> +                                                 vectype_in);
> >> >> +    }
> >> >> +
> >> >
> >> > Add a comment:
> >> >
> >> >     /* Transform via vect_transform_reduction.  */
> >> >
> >> >> +  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
> >> >> +  return true;
> >> >> +}
> >> >> +
> >> >>  /* Function vectorizable_reduction.
> >> >>
> >> >>     Check if STMT_INFO performs a reduction operation that can be vectorized.
> >> >> @@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>    if (!type_has_mode_precision_p (op.type))
> >> >>      return false;
> >> >>
> >> >> -  /* For lane-reducing ops we're reducing the number of reduction PHIs
> >> >> -     which means the only use of that may be in the lane-reducing operation.  */
> >> >> -  if (lane_reducing
> >> >> -      && reduc_chain_length != 1
> >> >> -      && !only_slp_reduc_chain)
> >> >> -    {
> >> >> -      if (dump_enabled_p ())
> >> >> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> -                        "lane-reducing reduction with extra stmts.\n");
> >> >> -      return false;
> >> >> -    }
> >> >> -
> >> >>    /* Lane-reducing ops also never can be used in a SLP reduction group
> >> >>       since we'll mix lanes belonging to different reductions.  But it's
> >> >>       OK to use them in a reduction chain or when the reduction group
> >> >> @@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>        && loop_vinfo->suggested_unroll_factor == 1)
> >> >>      single_defuse_cycle = true;
> >> >>
> >> >> -  if (single_defuse_cycle || lane_reducing)
> >> >> +  if (single_defuse_cycle && !lane_reducing)
> >> >
> >> > If there's also a non-lane-reducing plus in the chain don't we have to
> >> > check for that reduction op?  So shouldn't it be
> >> > single_defuse_cycle && ... fact that we don't record
> >> > (non-lane-reducing op there) ...
> >>
> >> I don't quite understand this point.  A non-lane-reducing op in the chain
> >> should be handled in its own vectorizable_xxx function, shouldn't it?  The
> >> check below is only for the first statement (vect_reduction_def) in the
> >> reduction.
> >
> > Hmm.  So we have vectorizable_lane_reducing_* for the check on the
> > lane-reducing stmts, vectorizable_* for !single-def-use stmts.  And the
> > following is then just for the case there's a single def that's not
> > lane-reducing
> > and we're forcing a single-def-use and thus go via vect_transform_reduction?
>
> Yes.  A non-lane-reducing op with single-defuse-cycle is handled in this
> function.  This logic is the same as in the original.
>
> >> >
> >> >>      {
> >> >>        gcc_assert (op.code != COND_EXPR);
> >> >>
> >> >> -      /* 4. Supportable by target?  */
> >> >> -      bool ok = true;
> >> >> -
> >> >> -      /* 4.1. check support for the operation in the loop
> >> >> +      /* 4. check support for the operation in the loop
> >> >>
> >> >>          This isn't necessary for the lane reduction codes, since they
> >> >>          can only be produced by pattern matching, and it's up to the
> >> >> @@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>          mixed-sign dot-products can be implemented using signed
> >> >>          dot-products.  */
> >> >>        machine_mode vec_mode = TYPE_MODE (vectype_in);
> >> >> -      if (!lane_reducing
> >> >> -         && !directly_supported_p (op.code, vectype_in, optab_vector))
> >> >> +      if (!directly_supported_p (op.code, vectype_in, optab_vector))
> >> >>          {
> >> >>            if (dump_enabled_p ())
> >> >>              dump_printf (MSG_NOTE, "op not supported by target.\n");
> >> >>           if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
> >> >>               || !vect_can_vectorize_without_simd_p (op.code))
> >> >> -           ok = false;
> >> >> +           single_defuse_cycle = false;
> >> >>           else
> >> >>             if (dump_enabled_p ())
> >> >>               dump_printf (MSG_NOTE, "proceeding using word mode.\n");
> >> >> @@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>             dump_printf (MSG_NOTE, "using word mode not possible.\n");
> >> >>           return false;
> >> >>         }
> >> >> -
> >> >> -      /* lane-reducing operations have to go through vect_transform_reduction.
> >> >> -         For the other cases try without the single cycle optimization.  */
> >> >> -      if (!ok)
> >> >> -       {
> >> >> -         if (lane_reducing)
> >> >> -           return false;
> >> >> -         else
> >> >> -           single_defuse_cycle = false;
> >> >> -       }
> >> >>      }
> >> >>    if (dump_enabled_p () && single_defuse_cycle)
> >> >>      dump_printf_loc (MSG_NOTE, vect_location,
> >> >> @@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>                      "multiple vectors to one in the loop body\n");
> >> >>    STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;
> >> >>
> >> >> -  /* If the reduction stmt is one of the patterns that have lane
> >> >> -     reduction embedded we cannot handle the case of ! single_defuse_cycle.  */
> >> >> -  if ((ncopies > 1 && ! single_defuse_cycle)
> >> >> -      && lane_reducing)
> >> >> -    {
> >> >> -      if (dump_enabled_p ())
> >> >> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> -                        "multi def-use cycle not possible for lane-reducing "
> >> >> -                        "reduction operation\n");
> >> >> -      return false;
> >> >> -    }
> >> >> +  /* For a lane-reducing operation, the processing below related to the
> >> >> +     single defuse-cycle will be done in its own vectorizable function.  One
> >> >> +     more thing to note is that the operation must not be involved in a
> >> >> +     fold-left reduction.  */
> >> >> +  single_defuse_cycle &= !lane_reducing;
> >> >>
> >> >>    if (slp_node
> >> >> -      && !(!single_defuse_cycle
> >> >> -          && !lane_reducing
> >> >> -          && reduction_type != FOLD_LEFT_REDUCTION))
> >> >> +      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
> >> >>      for (i = 0; i < (int) op.num_ops; i++)
> >> >>        if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
> >> >>         {
> >> >> @@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>    vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
> >> >>                              reduction_type, ncopies, cost_vec);
> >> >>    /* Cost the reduction op inside the loop if transformed via
> >> >> -     vect_transform_reduction.  Otherwise this is costed by the
> >> >> -     separate vectorizable_* routines.  */
> >> >> -  if (single_defuse_cycle || lane_reducing)
> >> >> -    {
> >> >> -      int factor = 1;
> >> >> -      if (vect_is_emulated_mixed_dot_prod (stmt_info))
> >> >> -       /* Three dot-products and a subtraction.  */
> >> >> -       factor = 4;
> >> >> -      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
> >> >> -                       stmt_info, 0, vect_body);
> >> >> -    }
> >> >> +     vect_transform_reduction for non-lane-reducing operation.  Otherwise
> >> >> +     this is costed by the separate vectorizable_* routines.  */
> >> >> +  if (single_defuse_cycle)
> >> >> +    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
> >> >>
> >> >>    if (dump_enabled_p ()
> >> >>        && reduction_type == FOLD_LEFT_REDUCTION)
> >> >>      dump_printf_loc (MSG_NOTE, vect_location,
> >> >>                      "using an in-order (fold-left) reduction.\n");
> >> >>    STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
> >> >> -  /* All but single defuse-cycle optimized, lane-reducing and fold-left
> >> >> -     reductions go through their own vectorizable_* routines.  */
> >> >> -  if (!single_defuse_cycle
> >> >> -      && !lane_reducing
> >> >> -      && reduction_type != FOLD_LEFT_REDUCTION)
> >> >> +
> >> >> +  /* All but single defuse-cycle optimized and fold-left reductions go
> >> >> +     through their own vectorizable_* routines.  */
> >> >> +  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
> >> >>      {
> >> >>        stmt_vec_info tem
> >> >>         = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
> >> >> @@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>    bool lane_reducing = lane_reducing_op_p (code);
> >> >>    gcc_assert (single_defuse_cycle || lane_reducing);
> >> >>
> >> >> +  if (lane_reducing)
> >> >> +    {
> >> >> +      /* The last operand of lane-reducing op is for reduction.  */
> >> >> +      gcc_assert (reduc_index == (int) op.num_ops - 1);
> >> >> +
> >> >> +      /* Now all lane-reducing ops are covered by some slp node.  */
> >> >> +      gcc_assert (slp_node);
> >> >> +    }
> >> >> +
> >> >>    /* Create the destination vector  */
> >> >>    tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
> >> >>    tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
> >> >> @@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>                          reduc_index == 2 ? op.ops[2] : NULL_TREE,
> >> >>                          &vec_oprnds[2]);
> >> >>      }
> >> >> +  else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1
> >> >> +          && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ())
> >> >> +    {
> >> >> +      /* For a lane-reducing op covered by a single-lane slp node, the input
> >> >> +        vectype of the reduction PHI determines copies of vectorized def-use
> >> >> +        cycles, which might be more than the effective copies of vectorized
> >> >> +        lane-reducing reduction statements.  This could be complemented by
> >> >> +        generating extra trivial pass-through copies.  For example:
> >> >> +
> >> >> +          int sum = 0;
> >> >> +          for (i)
> >> >> +            {
> >> >> +              sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
> >> >> +              sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
> >> >> +              sum += n[i];               // normal <vector(4) int>
> >> >> +            }
> >> >> +
> >> >> +        The vector size is 128-bit, the vectorization factor is 16.  Reduction
> >> >> +        statements would be transformed as:
> >> >> +
> >> >> +          vector<4> int sum_v0 = { 0, 0, 0, 0 };
> >> >> +          vector<4> int sum_v1 = { 0, 0, 0, 0 };
> >> >> +          vector<4> int sum_v2 = { 0, 0, 0, 0 };
> >> >> +          vector<4> int sum_v3 = { 0, 0, 0, 0 };
> >> >> +
> >> >> +          for (i / 16)
> >> >> +            {
> >> >> +              sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
> >> >> +              sum_v1 = sum_v1;  // copy
> >> >> +              sum_v2 = sum_v2;  // copy
> >> >> +              sum_v3 = sum_v3;  // copy
> >> >> +
> >> >> +              sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
> >> >> +              sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
> >> >> +              sum_v2 = sum_v2;  // copy
> >> >> +              sum_v3 = sum_v3;  // copy
> >> >> +
> >> >> +              sum_v0 += n_v0[i: 0  ~ 3 ];
> >> >> +              sum_v1 += n_v1[i: 4  ~ 7 ];
> >> >> +              sum_v2 += n_v2[i: 8  ~ 11];
> >> >> +              sum_v3 += n_v3[i: 12 ~ 15];
> >> >> +            }
> >> >> +       */
> >> >> +      unsigned using_ncopies = vec_oprnds[0].length ();
> >> >> +      unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
> >> >> +
> >> >
> >> > assert reduc_ncopies >= using_ncopies?  Maybe assert
> >> > reduc_index == op.num_ops - 1 given you use one above
> >> > and the other below?  Or simply iterate till op.num_ops
> >> > and skip i == reduc_index.
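> >> >
> >> > I.e. something like (untested):
> >> >
> >> >   gcc_assert (reduc_ncopies >= using_ncopies);
> >> >   for (unsigned i = 0; i < op.num_ops; i++)
> >> >     if (i != (unsigned) reduc_index)
> >> >       {
> >> >         gcc_assert (vec_oprnds[i].length () == using_ncopies);
> >> >         vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
> >> >       }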
> >> >
> >> >> +      for (unsigned i = 0; i < op.num_ops - 1; i++)
> >> >> +       {
> >> >> +         gcc_assert (vec_oprnds[i].length () == using_ncopies);
> >> >> +         vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
> >> >> +       }
> >> >> +    }
> >> >>
> >> >>    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
> >> >>    unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
> >> >> @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>      {
> >> >>        gimple *new_stmt;
> >> >>        tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
> >> >> -      if (masked_loop_p && !mask_by_cond_expr)
> >> >> +
> >> >> +      if (!vop[0] || !vop[1])
> >> >> +       {
> >> >> +         tree reduc_vop = vec_oprnds[reduc_index][i];
> >> >> +
> >> >> +         /* Insert trivial copy if no need to generate vectorized
> >> >> +            statement.  */
> >> >> +         gcc_assert (reduc_vop);
> >> >> +
> >> >> +         new_stmt = gimple_build_assign (vec_dest, reduc_vop);
> >> >> +         new_temp = make_ssa_name (vec_dest, new_stmt);
> >> >> +         gimple_set_lhs (new_stmt, new_temp);
> >> >> +         vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> >> >
> >> > I think you could simply do
> >> >
> >> >                slp_node->push_vec_def (reduc_vop);
> >> >                continue;
> >> >
> >> > without any code generation.
> >> >
> >>
> >> OK, that would be easy.  Here comes another question: this patch assumes a
> >> lane-reducing op would always be contained in a slp node, since the
> >> single-lane slp node feature has been enabled.  But I got some regressions
> >> when I enforced such a constraint in the lane-reducing op check.  Those cases
> >> were found to be unvectorizable with single-lane slp, so this should not be
> >> what we want, and needs to be fixed?
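> >>
> >> (The constraint I tried is roughly the following, placed in the analysis of
> >> lane-reducing ops - hypothetical placement:
> >>
> >>   /* With single-lane slp enabled, every lane-reducing op is expected
> >>      to be covered by some slp node.  */
> >>   gcc_assert (slp_node);
> >>
> >> and it is this assertion that fires on those cases.)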
> >
> > Yes, in the end we need to chase down all unsupported cases and fix them
> > (there are known issues with load permutes, I'm working on that - hopefully
> > when finding a continuous stretch of time...).
> >
> >>
> >> >> +       }
> >> >> +      else if (masked_loop_p && !mask_by_cond_expr)
> >> >>         {
> >> >>           /* No conditional ifns have been defined for lane-reducing op
> >> >>              yet.  */
> >> >> @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>
> >> >>           if (masked_loop_p && mask_by_cond_expr)
> >> >>             {
> >> >> +             tree stmt_vectype_in = vectype_in;
> >> >> +             unsigned nvectors = vec_num * ncopies;
> >> >> +
> >> >> +             if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
> >> >> +               {
> >> >> +                 /* Input vectype of the reduction PHI may be defferent from
> >> >
> >> > different
> >> >
> >> >> +                    that of lane-reducing operation.  */
> >> >> +                 stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> >> >> +                 nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);
> >> >
> >> > I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS.
> >>
> >> To partially vectorize a dot_prod<16 * char> with a 128-bit vector width,
> >> should we pass (nvectors=4, vectype=<4 * int>) instead of (nvectors=1,
> >> vectype=<16 * char>) to vect_get_loop_mask?
> >
> > Probably - it depends on the vectorization factor.  What I wanted to
> > point out is that
> > vec_num (likely from SLP_TREE_NUMBER_OF_VEC_STMTS) is wrong.  The
> > place setting SLP_TREE_NUMBER_OF_VEC_STMTS needs to be adjusted,
> > or we should forgo with it (but that's possibly a post-only-SLP
> > cleanup to be done).
> >
> > See vect_slp_analyze_node_operations_1 where that's computed.  For reductions
> > it's probably not quite right (and we might have latent issues like
> > those you are
> > "fixing" with code like above).  The order we analyze stmts might also be not
> > optimal for reductions with SLP - in fact given that stmt analysis
> > relies on a fixed VF
> > it would probably make sense to determine the reduction VF in advance as well.
> > But again this sounds like post-only-SLP cleanup opportunities.
> >
> > In the end I might suggest always using the reduction VF and vectype to
> > determine the number of vector stmts rather than computing ncopies/vec_num
> > separately.
> >
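> > Concretely - with reduc_vf being that hypothetical recorded
> > reduction-instance VF - the computation in
> > vect_slp_analyze_node_operations_1 for nodes in a reduction cycle might then
> > become something like:
> >
> >   SLP_TREE_NUMBER_OF_VEC_STMTS (node)
> >     = vect_get_num_vectors (reduc_vf * SLP_TREE_LANES (node),
> >                             SLP_TREE_VECTYPE (node));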
>
> Thanks,
> Feng

