public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH 8/8] vect: Optimize order of lane-reducing statements in loop def-use cycles
@ 2024-06-16  7:32 Feng Xue OS
  2024-06-20  6:02 ` Feng Xue OS
  0 siblings, 1 reply; 3+ messages in thread
From: Feng Xue OS @ 2024-06-16  7:32 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 5009 bytes --]

When transforming multiple lane-reducing operations in a loop reduction
chain, the corresponding vectorized statements were originally generated
into def-use cycles starting from index 0. A def-use cycle with a smaller
index would therefore contain more statements, which means a longer
instruction dependency chain. For example:

   int sum = 0;
   for (i)
     {
       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
       sum += w[i];               // widen-sum <vector(16) char>
       sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
     }

Original transformation result:

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy
     }

For higher instruction parallelism in the final vectorized loop, it is
optimal to distribute the effective vectorized lane-reducing statements
evenly among all def-use cycles. With that transformation, DOT_PROD,
WIDEN_SUM and the SADs are generated into separate cycles, so the
instruction dependency can be eliminated.

Thanks,
Feng
---
gcc/
	PR tree-optimization/114440
	* tree-vectorizer.h (struct _stmt_vec_info): Add a new field
	reduc_result_pos.
	* tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing
	statements in an optimized order.
---
 gcc/tree-vect-loop.cc | 39 +++++++++++++++++++++++++++++++++++----
 gcc/tree-vectorizer.h |  6 ++++++
 2 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 6d91665a341..c7e13d655d8 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8828,9 +8828,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 	       sum_v2 = sum_v2;  // copy
 	       sum_v3 = sum_v3;  // copy
 
-	       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
-	       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
-	       sum_v2 = sum_v2;  // copy
+	       sum_v0 = sum_v0;  // copy
+	       sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
+	       sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
 	       sum_v3 = sum_v3;  // copy
 
 	       sum_v0 += n_v0[i: 0  ~ 3 ];
@@ -8838,14 +8838,45 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 	       sum_v2 += n_v2[i: 8  ~ 11];
 	       sum_v3 += n_v3[i: 12 ~ 15];
 	     }
-	*/
+
+	 Moreover, for higher instruction parallelism in the final
+	 vectorized loop, the effective vectorized lane-reducing statements
+	 are distributed evenly among all def-use cycles.  In the above
+	 example, the SADs are generated into cycles other than the one
+	 holding DOT_PROD.  */
       unsigned using_ncopies = vec_oprnds[0].length ();
       unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
+      unsigned result_pos = reduc_info->reduc_result_pos;
+
+      reduc_info->reduc_result_pos
+		= (result_pos + using_ncopies) % reduc_ncopies;
+      gcc_assert (result_pos < reduc_ncopies);
 
       for (unsigned i = 0; i < op.num_ops - 1; i++)
 	{
 	  gcc_assert (vec_oprnds[i].length () == using_ncopies);
 	  vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
+
+	  /* Find suitable def-use cycles to generate vectorized statements
+	     into, and reorder operands based on the selection.  */
+	  if (result_pos)
+	    {
+	      unsigned count = reduc_ncopies - using_ncopies;
+	      unsigned start = result_pos - count;
+
+	      if ((int) start < 0)
+		{
+		  count = result_pos;
+		  start = 0;
+		}
+
+	      for (unsigned j = using_ncopies; j > start; j--)
+		{
+		  unsigned k = j - 1;
+		  std::swap (vec_oprnds[i][k], vec_oprnds[i][k + count]);
+		  gcc_assert (!vec_oprnds[i][k]);
+		}
+	    }
 	}
     }
 
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 94736736dcc..64c6571a293 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1402,6 +1402,12 @@ public:
   /* The vector type for performing the actual reduction.  */
   tree reduc_vectype;
 
+  /* For a loop reduction with multiple vectorized results (ncopies > 1),
+     a lane-reducing operation participating in it may not use all of
+     those results.  This field specifies the result index starting from
+     which the next lane-reducing operation will be assigned.  */
+  unsigned int reduc_result_pos;
+
   /* If IS_REDUC_INFO is true and if the vector code is performing
      N scalar reductions in parallel, this variable gives the initial
      scalar values of those N reductions.  */
-- 
2.17.1

[-- Attachment #2: 0008-vect-Optimize-order-of-lane-reducing-statements-in-l.patch --]
[-- Type: text/x-patch; name="0008-vect-Optimize-order-of-lane-reducing-statements-in-l.patch", Size: 5712 bytes --]

From 1f2e05a6787eb4449a24a9d6e371ae162855aaff Mon Sep 17 00:00:00 2001
From: Feng Xue <fxue@os.amperecomputing.com>
Date: Wed, 29 May 2024 17:28:14 +0800
Subject: [PATCH 8/8] vect: Optimize order of lane-reducing statements in loop
 def-use cycles

When transforming multiple lane-reducing operations in a loop reduction
chain, the corresponding vectorized statements were originally generated
into def-use cycles starting from index 0. A def-use cycle with a smaller
index would therefore contain more statements, which means a longer
instruction dependency chain. For example:

   int sum = 0;
   for (i)
     {
       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
       sum += w[i];               // widen-sum <vector(16) char>
       sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
     }

Original transformation result:

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy
     }

For higher instruction parallelism in the final vectorized loop, it is
optimal to distribute the effective vectorized lane-reducing statements
evenly among all def-use cycles. Transformed as below, DOT_PROD,
WIDEN_SUM and the SADs are generated into separate cycles, so the
instruction dependency can be eliminated.

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = sum_v0;  // copy
       sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = sum_v0;  // copy
       sum_v1 = sum_v1;  // copy
       sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2);
       sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3);
     }

2024-03-22 Feng Xue <fxue@os.amperecomputing.com>

gcc/
	PR tree-optimization/114440
	* tree-vectorizer.h (struct _stmt_vec_info): Add a new field
	reduc_result_pos.
	* tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing
	statements in an optimized order.
---
 gcc/tree-vect-loop.cc | 39 +++++++++++++++++++++++++++++++++++----
 gcc/tree-vectorizer.h |  6 ++++++
 2 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 6d91665a341..c7e13d655d8 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8828,9 +8828,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 	       sum_v2 = sum_v2;  // copy
 	       sum_v3 = sum_v3;  // copy
 
-	       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
-	       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
-	       sum_v2 = sum_v2;  // copy
+	       sum_v0 = sum_v0;  // copy
+	       sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
+	       sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
 	       sum_v3 = sum_v3;  // copy
 
 	       sum_v0 += n_v0[i: 0  ~ 3 ];
@@ -8838,14 +8838,45 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 	       sum_v2 += n_v2[i: 8  ~ 11];
 	       sum_v3 += n_v3[i: 12 ~ 15];
 	     }
-	*/
+
+	 Moreover, for higher instruction parallelism in the final
+	 vectorized loop, the effective vectorized lane-reducing statements
+	 are distributed evenly among all def-use cycles.  In the above
+	 example, the SADs are generated into cycles other than the one
+	 holding DOT_PROD.  */
       unsigned using_ncopies = vec_oprnds[0].length ();
       unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
+      unsigned result_pos = reduc_info->reduc_result_pos;
+
+      reduc_info->reduc_result_pos
+		= (result_pos + using_ncopies) % reduc_ncopies;
+      gcc_assert (result_pos < reduc_ncopies);
 
       for (unsigned i = 0; i < op.num_ops - 1; i++)
 	{
 	  gcc_assert (vec_oprnds[i].length () == using_ncopies);
 	  vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
+
+	  /* Find suitable def-use cycles to generate vectorized statements
+	     into, and reorder operands based on the selection.  */
+	  if (result_pos)
+	    {
+	      unsigned count = reduc_ncopies - using_ncopies;
+	      unsigned start = result_pos - count;
+
+	      if ((int) start < 0)
+		{
+		  count = result_pos;
+		  start = 0;
+		}
+
+	      for (unsigned j = using_ncopies; j > start; j--)
+		{
+		  unsigned k = j - 1;
+		  std::swap (vec_oprnds[i][k], vec_oprnds[i][k + count]);
+		  gcc_assert (!vec_oprnds[i][k]);
+		}
+	    }
 	}
     }
 
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 94736736dcc..64c6571a293 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1402,6 +1402,12 @@ public:
   /* The vector type for performing the actual reduction.  */
   tree reduc_vectype;
 
+  /* For a loop reduction with multiple vectorized results (ncopies > 1),
+     a lane-reducing operation participating in it may not use all of
+     those results.  This field specifies the result index starting from
+     which the next lane-reducing operation will be assigned.  */
+  unsigned int reduc_result_pos;
+
   /* If IS_REDUC_INFO is true and if the vector code is performing
      N scalar reductions in parallel, this variable gives the initial
      scalar values of those N reductions.  */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH 8/8] vect: Optimize order of lane-reducing statements in loop  def-use cycles
  2024-06-16  7:32 [PATCH 8/8] vect: Optimize order of lane-reducing statements in loop def-use cycles Feng Xue OS
@ 2024-06-20  6:02 ` Feng Xue OS
  2024-06-26 14:54   ` Feng Xue OS
  0 siblings, 1 reply; 3+ messages in thread
From: Feng Xue OS @ 2024-06-20  6:02 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 11902 bytes --]

This patch was updated with some new changes.

When transforming multiple lane-reducing operations in a loop reduction
chain, the corresponding vectorized statements were originally generated
into def-use cycles starting from index 0. A def-use cycle with a smaller
index would therefore contain more statements, which means a longer
instruction dependency chain. For example:

   int sum = 0;
   for (i)
     {
       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
       sum += w[i];               // widen-sum <vector(16) char>
       sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
     }

Original transformation result:

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy
     }

For higher instruction parallelism in the final vectorized loop, it is
optimal to distribute the effective vectorized lane-reducing statements
evenly among all def-use cycles. Transformed as below, DOT_PROD,
WIDEN_SUM and the SADs are generated into separate cycles, so the
instruction dependency can be eliminated.

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = sum_v0;  // copy
       sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = sum_v0;  // copy
       sum_v1 = sum_v1;  // copy
       sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2);
       sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3);
     }

2024-03-22 Feng Xue <fxue@os.amperecomputing.com>

gcc/
        PR tree-optimization/114440
        * tree-vectorizer.h (struct _stmt_vec_info): Add a new field
        reduc_result_pos.
        * tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing
        statements in an optimized order.
---
 gcc/tree-vect-loop.cc | 43 +++++++++++++++++++++++++++++++++++++++----
 gcc/tree-vectorizer.h |  6 ++++++
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 5a27a2c3d9c..adee54350d4 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8821,9 +8821,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
               sum_v2 = sum_v2;  // copy
               sum_v3 = sum_v3;  // copy

-              sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
-              sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
-              sum_v2 = sum_v2;  // copy
+              sum_v0 = sum_v0;  // copy
+              sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
+              sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
               sum_v3 = sum_v3;  // copy

               sum_v0 += n_v0[i: 0  ~ 3 ];
@@ -8831,7 +8831,12 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
               sum_v2 += n_v2[i: 8  ~ 11];
               sum_v3 += n_v3[i: 12 ~ 15];
             }
-       */
+
+        Moreover, for higher instruction parallelism in the final
+        vectorized loop, the effective vectorized lane-reducing statements
+        are distributed evenly among all def-use cycles.  In the above
+        example, the SADs are generated into cycles other than the one
+        holding DOT_PROD.  */
       tree phi_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
       unsigned all_ncopies = vect_get_num_copies (loop_vinfo, phi_vectype_in);
       unsigned use_ncopies = vec_oprnds[0].length ();
@@ -8855,6 +8860,36 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
              gcc_assert (vec_oprnds[i].length () == use_ncopies);
              vec_oprnds[i].safe_grow_cleared (all_ncopies);
            }
+
+         /* Find suitable def-use cycles to generate vectorized statements
+            into, and reorder operands based on the selection.  */
+         unsigned curr_pos = reduc_info->reduc_result_pos;
+         unsigned next_pos = (curr_pos + use_ncopies) % all_ncopies;
+
+         gcc_assert (curr_pos < all_ncopies);
+          reduc_info->reduc_result_pos = next_pos;
+
+         if (curr_pos)
+           {
+             unsigned count = all_ncopies - use_ncopies;
+             unsigned start = curr_pos - count;
+
+             if ((int) start < 0)
+               {
+                 count = curr_pos;
+                 start = 0;
+               }
+
+             for (unsigned i = 0; i < op.num_ops - 1; i++)
+               {
+                 for (unsigned j = use_ncopies; j > start; j--)
+                   {
+                     unsigned k = j - 1;
+                     std::swap (vec_oprnds[i][k], vec_oprnds[i][k + count]);
+                     gcc_assert (!vec_oprnds[i][k]);
+                   }
+               }
+           }
        }
     }

diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 94736736dcc..64c6571a293 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1402,6 +1402,12 @@ public:
   /* The vector type for performing the actual reduction.  */
   tree reduc_vectype;

+  /* For loop reduction with multiple vectorized results (ncopies > 1), a
+     lane-reducing operation participating in it may not use all of those
+     results, this field specifies result index starting from which any
+     following land-reducing operation would be assigned to.  */
+  unsigned int reduc_result_pos;
+
   /* If IS_REDUC_INFO is true and if the vector code is performing
      N scalar reductions in parallel, this variable gives the initial
      scalar values of those N reductions.  */
--
2.17.1




^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH 8/8] vect: Optimize order of lane-reducing statements in loop  def-use cycles
  2024-06-20  6:02 ` Feng Xue OS
@ 2024-06-26 14:54   ` Feng Xue OS
  0 siblings, 0 replies; 3+ messages in thread
From: Feng Xue OS @ 2024-06-26 14:54 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 18409 bytes --]

This patch is also adjusted in line with changes to two of its dependent patches.

When transforming multiple lane-reducing operations in a loop reduction
chain, the corresponding vectorized statements were originally generated
into def-use cycles starting from index 0. A def-use cycle with a smaller
index would therefore contain more statements, which means a longer
instruction dependency chain. For example:

   int sum = 0;
   for (i)
     {
       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
       sum += w[i];               // widen-sum <vector(16) char>
       sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
     }

Original transformation result:

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy
     }

For higher instruction parallelism in the final vectorized loop, it is
optimal to distribute the effective vectorized lane-reducing statements
evenly among all def-use cycles. Transformed as below, DOT_PROD,
WIDEN_SUM and the SADs are generated into separate cycles, so the
instruction dependency can be eliminated.

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = sum_v0;  // copy
       sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = sum_v0;  // copy
       sum_v1 = sum_v1;  // copy
       sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2);
       sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3);
     }

2024-03-22  Feng Xue  <fxue@os.amperecomputing.com>

gcc/
        PR tree-optimization/114440
        * tree-vectorizer.h (struct _stmt_vec_info): Add a new field
        reduc_result_pos.
        * tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing
        statements in an optimized order.
---
 gcc/tree-vect-loop.cc | 43 +++++++++++++++++++++++++++++++++++++++----
 gcc/tree-vectorizer.h |  6 ++++++
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 6bfb0e72905..783c4f2b153 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8841,9 +8841,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
               sum_v2 = sum_v2;  // copy
               sum_v3 = sum_v3;  // copy

-              sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
-              sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
-              sum_v2 = sum_v2;  // copy
+              sum_v0 = sum_v0;  // copy
+              sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
+              sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
               sum_v3 = sum_v3;  // copy

               sum_v0 += n_v0[i: 0  ~ 3 ];
@@ -8851,7 +8851,12 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
               sum_v2 += n_v2[i: 8  ~ 11];
               sum_v3 += n_v3[i: 12 ~ 15];
             }
-       */
+
+        Moreover, for higher instruction parallelism in the final vectorized
+        loop, the effective vectorized lane-reducing statements are
+        distributed evenly among all def-use cycles.  In the above example,
+        the SADs are generated into cycles other than the one used by
+        DOT_PROD.  */
       unsigned using_ncopies = vec_oprnds[0].length ();
       unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();

@@ -8864,6 +8869,36 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
              gcc_assert (vec_oprnds[i].length () == using_ncopies);
              vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
            }
+
+         /* Find suitable def-use cycles to generate vectorized statements
+            into, and reorder operands based on the selection.  */
+         unsigned curr_pos = reduc_info->reduc_result_pos;
+         unsigned next_pos = (curr_pos + using_ncopies) % reduc_ncopies;
+
+         gcc_assert (curr_pos < reduc_ncopies);
+          reduc_info->reduc_result_pos = next_pos;
+
+         if (curr_pos)
+           {
+             unsigned count = reduc_ncopies - using_ncopies;
+             unsigned start = curr_pos - count;
+
+             if ((int) start < 0)
+               {
+                 count = curr_pos;
+                 start = 0;
+               }
+
+             for (unsigned i = 0; i < op.num_ops - 1; i++)
+               {
+                 for (unsigned j = using_ncopies; j > start; j--)
+                   {
+                     unsigned k = j - 1;
+                     std::swap (vec_oprnds[i][k], vec_oprnds[i][k + count]);
+                     gcc_assert (!vec_oprnds[i][k]);
+                   }
+               }
+           }
        }
     }

diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 94736736dcc..64c6571a293 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1402,6 +1402,12 @@ public:
   /* The vector type for performing the actual reduction.  */
   tree reduc_vectype;

+  /* For loop reduction with multiple vectorized results (ncopies > 1), a
+     lane-reducing operation participating in it may not use all of those
+     results; this field specifies the result index starting from which the
+     next lane-reducing operation would be assigned.  */
+  unsigned int reduc_result_pos;
+
   /* If IS_REDUC_INFO is true and if the vector code is performing
      N scalar reductions in parallel, this variable gives the initial
      scalar values of those N reductions.  */
--
2.17.1

________________________________________
From: Feng Xue OS <fxue@os.amperecomputing.com>
Sent: Thursday, June 20, 2024 2:02 PM
To: Richard Biener
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH 8/8] vect: Optimize order of lane-reducing statements in loop  def-use cycles

This patch was updated with some new changes.

When transforming multiple lane-reducing operations in a loop reduction chain,
the corresponding vectorized statements were originally generated into def-use
cycles starting from index 0. A def-use cycle with a smaller index would thus
contain more statements, which means a longer instruction dependency chain.
For example:

   int sum = 0;
   for (i)
     {
       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
       sum += w[i];               // widen-sum <vector(16) char>
       sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
     }

Original transformation result:

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy
     }

To achieve higher instruction parallelism in the final vectorized loop, it is
optimal to distribute the effective vectorized lane-reducing statements evenly
among all def-use cycles. After the transformation below, DOT_PROD, WIDEN_SUM
and the SADs are generated into separate cycles, so the instruction dependency
among them can be eliminated.

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = sum_v0;  // copy
       sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = sum_v0;  // copy
       sum_v1 = sum_v1;  // copy
       sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2);
       sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3);
     }

2024-03-22  Feng Xue  <fxue@os.amperecomputing.com>

gcc/
        PR tree-optimization/114440
        * tree-vectorizer.h (struct _stmt_vec_info): Add a new field
        reduc_result_pos.
        * tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing
        statements in an optimized order.
---
 gcc/tree-vect-loop.cc | 43 +++++++++++++++++++++++++++++++++++++++----
 gcc/tree-vectorizer.h |  6 ++++++
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 5a27a2c3d9c..adee54350d4 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8821,9 +8821,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
               sum_v2 = sum_v2;  // copy
               sum_v3 = sum_v3;  // copy

-              sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
-              sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
-              sum_v2 = sum_v2;  // copy
+              sum_v0 = sum_v0;  // copy
+              sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
+              sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
               sum_v3 = sum_v3;  // copy

               sum_v0 += n_v0[i: 0  ~ 3 ];
@@ -8831,7 +8831,12 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
               sum_v2 += n_v2[i: 8  ~ 11];
               sum_v3 += n_v3[i: 12 ~ 15];
             }
-       */
+
+        Moreover, for higher instruction parallelism in the final vectorized
+        loop, the effective vectorized lane-reducing statements are
+        distributed evenly among all def-use cycles.  In the above example,
+        the SADs are generated into cycles other than the one used by
+        DOT_PROD.  */
       tree phi_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
       unsigned all_ncopies = vect_get_num_copies (loop_vinfo, phi_vectype_in);
       unsigned use_ncopies = vec_oprnds[0].length ();
@@ -8855,6 +8860,36 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
              gcc_assert (vec_oprnds[i].length () == use_ncopies);
              vec_oprnds[i].safe_grow_cleared (all_ncopies);
            }
+
+         /* Find suitable def-use cycles to generate vectorized statements
+            into, and reorder operands based on the selection.  */
+         unsigned curr_pos = reduc_info->reduc_result_pos;
+         unsigned next_pos = (curr_pos + use_ncopies) % all_ncopies;
+
+         gcc_assert (curr_pos < all_ncopies);
+          reduc_info->reduc_result_pos = next_pos;
+
+         if (curr_pos)
+           {
+             unsigned count = all_ncopies - use_ncopies;
+             unsigned start = curr_pos - count;
+
+             if ((int) start < 0)
+               {
+                 count = curr_pos;
+                 start = 0;
+               }
+
+             for (unsigned i = 0; i < op.num_ops - 1; i++)
+               {
+                 for (unsigned j = use_ncopies; j > start; j--)
+                   {
+                     unsigned k = j - 1;
+                     std::swap (vec_oprnds[i][k], vec_oprnds[i][k + count]);
+                     gcc_assert (!vec_oprnds[i][k]);
+                   }
+               }
+           }
        }
     }

diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 94736736dcc..64c6571a293 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1402,6 +1402,12 @@ public:
   /* The vector type for performing the actual reduction.  */
   tree reduc_vectype;

+  /* For loop reduction with multiple vectorized results (ncopies > 1), a
+     lane-reducing operation participating in it may not use all of those
+     results; this field specifies the result index starting from which the
+     next lane-reducing operation would be assigned.  */
+  unsigned int reduc_result_pos;
+
   /* If IS_REDUC_INFO is true and if the vector code is performing
      N scalar reductions in parallel, this variable gives the initial
      scalar values of those N reductions.  */
--
2.17.1

________________________________________
From: Feng Xue OS <fxue@os.amperecomputing.com>
Sent: Sunday, June 16, 2024 3:32 PM
To: Richard Biener
Cc: gcc-patches@gcc.gnu.org
Subject: [PATCH 8/8] vect: Optimize order of lane-reducing statements in loop  def-use cycles

When transforming multiple lane-reducing operations in a loop reduction chain,
the corresponding vectorized statements were originally generated into def-use
cycles starting from index 0. A def-use cycle with a smaller index would thus
contain more statements, which means a longer instruction dependency chain.
For example:

   int sum = 0;
   for (i)
     {
       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
       sum += w[i];               // widen-sum <vector(16) char>
       sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
     }

Original transformation result:

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy
     }

To achieve higher instruction parallelism in the final vectorized loop, it is
optimal to distribute the effective vectorized lane-reducing statements evenly
among all def-use cycles. After the transformation below, DOT_PROD, WIDEN_SUM
and the SADs are generated into separate cycles, so the instruction dependency
among them can be eliminated.

Thanks,
Feng
---
gcc/
        PR tree-optimization/114440
        * tree-vectorizer.h (struct _stmt_vec_info): Add a new field
        reduc_result_pos.
        * tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing
        statements in an optimized order.
---
 gcc/tree-vect-loop.cc | 39 +++++++++++++++++++++++++++++++++++----
 gcc/tree-vectorizer.h |  6 ++++++
 2 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 6d91665a341..c7e13d655d8 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8828,9 +8828,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
               sum_v2 = sum_v2;  // copy
               sum_v3 = sum_v3;  // copy

-              sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
-              sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
-              sum_v2 = sum_v2;  // copy
+              sum_v0 = sum_v0;  // copy
+              sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
+              sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
               sum_v3 = sum_v3;  // copy

               sum_v0 += n_v0[i: 0  ~ 3 ];
@@ -8838,14 +8838,45 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
               sum_v2 += n_v2[i: 8  ~ 11];
               sum_v3 += n_v3[i: 12 ~ 15];
             }
-       */
+
+        Moreover, for higher instruction parallelism in the final vectorized
+        loop, the effective vectorized lane-reducing statements are
+        distributed evenly among all def-use cycles.  In the above example,
+        the SADs are generated into cycles other than the one used by
+        DOT_PROD.  */
       unsigned using_ncopies = vec_oprnds[0].length ();
       unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
+      unsigned result_pos = reduc_info->reduc_result_pos;
+
+      reduc_info->reduc_result_pos
+               = (result_pos + using_ncopies) % reduc_ncopies;
+      gcc_assert (result_pos < reduc_ncopies);

       for (unsigned i = 0; i < op.num_ops - 1; i++)
        {
          gcc_assert (vec_oprnds[i].length () == using_ncopies);
          vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
+
+         /* Find suitable def-use cycles to generate vectorized statements
+            into, and reorder operands based on the selection.  */
+         if (result_pos)
+           {
+             unsigned count = reduc_ncopies - using_ncopies;
+             unsigned start = result_pos - count;
+
+             if ((int) start < 0)
+               {
+                 count = result_pos;
+                 start = 0;
+               }
+
+             for (unsigned j = using_ncopies; j > start; j--)
+               {
+                 unsigned k = j - 1;
+                 std::swap (vec_oprnds[i][k], vec_oprnds[i][k + count]);
+                 gcc_assert (!vec_oprnds[i][k]);
+               }
+           }
        }
     }

diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 94736736dcc..64c6571a293 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1402,6 +1402,12 @@ public:
   /* The vector type for performing the actual reduction.  */
   tree reduc_vectype;

+  /* For loop reduction with multiple vectorized results (ncopies > 1), a
+     lane-reducing operation participating in it may not use all of those
+     results; this field specifies the result index starting from which the
+     next lane-reducing operation would be assigned.  */
+  unsigned int reduc_result_pos;
+
   /* If IS_REDUC_INFO is true and if the vector code is performing
      N scalar reductions in parallel, this variable gives the initial
      scalar values of those N reductions.  */
--
2.17.1

[-- Attachment #2: 0003-vect-Optimize-order-of-lane-reducing-statements-in-l.patch --]
[-- Type: text/x-patch; name="0003-vect-Optimize-order-of-lane-reducing-statements-in-l.patch", Size: 5849 bytes --]

From 6878a60309fac518368a31bb52976e4f9562dac3 Mon Sep 17 00:00:00 2001
From: Feng Xue <fxue@os.amperecomputing.com>
Date: Wed, 29 May 2024 17:28:14 +0800
Subject: [PATCH 3/3] vect: Optimize order of lane-reducing statements in loop
 def-use cycles

When transforming multiple lane-reducing operations in a loop reduction chain,
the corresponding vectorized statements were originally generated into def-use
cycles starting from index 0. A def-use cycle with a smaller index would thus
contain more statements, which means a longer instruction dependency chain.
For example:

   int sum = 0;
   for (i)
     {
       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
       sum += w[i];               // widen-sum <vector(16) char>
       sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
     }

Original transformation result:

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy
     }

To achieve higher instruction parallelism in the final vectorized loop, it is
optimal to distribute the effective vectorized lane-reducing statements evenly
among all def-use cycles. After the transformation below, DOT_PROD, WIDEN_SUM
and the SADs are generated into separate cycles, so the instruction dependency
among them can be eliminated.

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = sum_v0;  // copy
       sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = sum_v0;  // copy
       sum_v1 = sum_v1;  // copy
       sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2);
       sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3);
     }

2024-03-22  Feng Xue  <fxue@os.amperecomputing.com>

gcc/
	PR tree-optimization/114440
	* tree-vectorizer.h (struct _stmt_vec_info): Add a new field
	reduc_result_pos.
	* tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing
	statements in an optimized order.
---
 gcc/tree-vect-loop.cc | 43 +++++++++++++++++++++++++++++++++++++++----
 gcc/tree-vectorizer.h |  6 ++++++
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 6bfb0e72905..783c4f2b153 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8841,9 +8841,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 	       sum_v2 = sum_v2;  // copy
 	       sum_v3 = sum_v3;  // copy
 
-	       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
-	       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
-	       sum_v2 = sum_v2;  // copy
+	       sum_v0 = sum_v0;  // copy
+	       sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
+	       sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
 	       sum_v3 = sum_v3;  // copy
 
 	       sum_v0 += n_v0[i: 0  ~ 3 ];
@@ -8851,7 +8851,12 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 	       sum_v2 += n_v2[i: 8  ~ 11];
 	       sum_v3 += n_v3[i: 12 ~ 15];
 	     }
-	*/
+
+	 Moreover, for higher instruction parallelism in the final vectorized
+	 loop, the effective vectorized lane-reducing statements are
+	 distributed evenly among all def-use cycles.  In the above example,
+	 the SADs are generated into cycles other than the one used by
+	 DOT_PROD.  */
       unsigned using_ncopies = vec_oprnds[0].length ();
       unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
 
@@ -8864,6 +8869,36 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 	      gcc_assert (vec_oprnds[i].length () == using_ncopies);
 	      vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
 	    }
+
+	  /* Find suitable def-use cycles to generate vectorized statements
+	     into, and reorder operands based on the selection.  */
+	  unsigned curr_pos = reduc_info->reduc_result_pos;
+	  unsigned next_pos = (curr_pos + using_ncopies) % reduc_ncopies;
+
+	  gcc_assert (curr_pos < reduc_ncopies);
+          reduc_info->reduc_result_pos = next_pos;
+
+	  if (curr_pos)
+	    {
+	      unsigned count = reduc_ncopies - using_ncopies;
+	      unsigned start = curr_pos - count;
+
+	      if ((int) start < 0)
+		{
+		  count = curr_pos;
+		  start = 0;
+		}
+
+	      for (unsigned i = 0; i < op.num_ops - 1; i++)
+		{
+		  for (unsigned j = using_ncopies; j > start; j--)
+		    {
+		      unsigned k = j - 1;
+		      std::swap (vec_oprnds[i][k], vec_oprnds[i][k + count]);
+		      gcc_assert (!vec_oprnds[i][k]);
+		    }
+		}
+	    }
 	}
     }
 
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 94736736dcc..64c6571a293 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1402,6 +1402,12 @@ public:
   /* The vector type for performing the actual reduction.  */
   tree reduc_vectype;
 
+  /* For loop reduction with multiple vectorized results (ncopies > 1), a
+     lane-reducing operation participating in it may not use all of those
+     results; this field specifies the result index starting from which the
+     next lane-reducing operation would be assigned.  */
+  unsigned int reduc_result_pos;
+
   /* If IS_REDUC_INFO is true and if the vector code is performing
      N scalar reductions in parallel, this variable gives the initial
      scalar values of those N reductions.  */
-- 
2.17.1



end of thread, other threads:[~2024-06-26 14:54 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-16  7:32 [PATCH 8/8] vect: Optimize order of lane-reducing statements in loop def-use cycles Feng Xue OS
2024-06-20  6:02 ` Feng Xue OS
2024-06-26 14:54   ` Feng Xue OS
