public inbox for gcc-patches@gcc.gnu.org
* [RFC/PATCH] vect: Recog mul_highpart pattern
@ 2021-07-13  8:52 Kewen.Lin
  2021-07-13  8:58 ` [PATCH] rs6000: Support [u]mul<mode>3_highpart for vector Kewen.Lin
  2021-07-13  9:35 ` [RFC/PATCH] vect: Recog mul_highpart pattern Richard Biener
  0 siblings, 2 replies; 31+ messages in thread
From: Kewen.Lin @ 2021-07-13  8:52 UTC (permalink / raw)
  To: GCC Patches
  Cc: Richard Biener, Richard Sandiford, Segher Boessenkool, Bill Schmidt

[-- Attachment #1: Type: text/plain, Size: 1030 bytes --]

Hi,

When I added support for the multiply highpart instructions newly
introduced in Power10, I noticed that the vectorizer currently
doesn't try to vectorize the multiply highpart pattern; I hope
this isn't intentional?

This patch extends the existing mulhs pattern handling to cover
plain multiply highpart.  An alternative would be to recognize the
mul_highpart operation in a generic place for scalar code when the
target supports the optab for the scalar operation; that is based
on the assumption that a target which supports the vector version
of multiply highpart also has the scalar version.  Since the
function can_mult_highpart_p can check/handle mult_highpart well
even without mul_highpart optab support, I think recognizing this
pattern in the vectorizer is the better choice.  Is it on the
right track?
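
For reference, a minimal example of the kind of scalar loop this
aims to cover (names and types here are just for illustration):

```
#define N 128
extern short a[N], b[N], c[N];

/* Plain multiply highpart: only the high half of the widened
   product is kept, i.e. pattern 3) in the patch below.  */
void
mulh (void)
{
  for (int i = 0; i < N; i++)
    c[i] = ((int) a[i] * (int) b[i]) >> 16;
}
```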

Bootstrapped & regtested on powerpc64le-linux-gnu P9,
x86_64-redhat-linux and aarch64-linux-gnu.

BR,
Kewen
-----
gcc/ChangeLog:

	* tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
	recog normal multiply highpart.


[-- Attachment #2: 0001-vect-Recog-mul_highpart-pattern.patch --]
[-- Type: text/plain, Size: 4279 bytes --]

---
 gcc/tree-vect-patterns.c | 67 ++++++++++++++++++++++++++++------------
 1 file changed, 48 insertions(+), 19 deletions(-)

diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index b2e7fc2cc7a..9253c8088e9 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -1896,8 +1896,15 @@ vect_recog_over_widening_pattern (vec_info *vinfo,
 
    1) Multiply high with scaling
      TYPE res = ((TYPE) a * (TYPE) b) >> c;
+     Here, c is bitsize (TYPE) / 2 - 1.
+
    2) ... or also with rounding
      TYPE res = (((TYPE) a * (TYPE) b) >> d + 1) >> 1;
+     Here, d is bitsize (TYPE) / 2 - 2.
+
+   3) Normal multiply high
+     TYPE res = ((TYPE) a * (TYPE) b) >> e;
+     Here, e is bitsize (TYPE) / 2.
 
    where only the bottom half of res is used.  */
 
@@ -1942,7 +1949,6 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
   stmt_vec_info mulh_stmt_info;
   tree scale_term;
   internal_fn ifn;
-  unsigned int expect_offset;
 
   /* Check for the presence of the rounding term.  */
   if (gimple_assign_rhs_code (rshift_input_stmt) == PLUS_EXPR)
@@ -1991,25 +1997,37 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
 
       /* Get the scaling term.  */
       scale_term = gimple_assign_rhs2 (plus_input_stmt);
+      /* Check that the scaling factor is correct.  */
+      if (TREE_CODE (scale_term) != INTEGER_CST)
+	return NULL;
+
+      /* Check pattern 2).  */
+      if (wi::to_widest (scale_term) + target_precision + 2
+	  != TYPE_PRECISION (lhs_type))
+	return NULL;
 
-      expect_offset = target_precision + 2;
       ifn = IFN_MULHRS;
     }
   else
     {
       mulh_stmt_info = rshift_input_stmt_info;
       scale_term = gimple_assign_rhs2 (last_stmt);
+      /* Check that the scaling factor is correct.  */
+      if (TREE_CODE (scale_term) != INTEGER_CST)
+	return NULL;
 
-      expect_offset = target_precision + 1;
-      ifn = IFN_MULHS;
+      /* Check for pattern 1).  */
+      if (wi::to_widest (scale_term) + target_precision + 1
+	  == TYPE_PRECISION (lhs_type))
+	ifn = IFN_MULHS;
+      /* Check for pattern 3).  */
+      else if (wi::to_widest (scale_term) + target_precision
+	       == TYPE_PRECISION (lhs_type))
+	ifn = IFN_LAST;
+      else
+	return NULL;
     }
 
-  /* Check that the scaling factor is correct.  */
-  if (TREE_CODE (scale_term) != INTEGER_CST
-      || wi::to_widest (scale_term) + expect_offset
-	   != TYPE_PRECISION (lhs_type))
-    return NULL;
-
   /* Check whether the scaling input term can be seen as two widened
      inputs multiplied together.  */
   vect_unpromoted_value unprom_mult[2];
@@ -2029,9 +2047,14 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
 
   /* Check for target support.  */
   tree new_vectype = get_vectype_for_scalar_type (vinfo, new_type);
-  if (!new_vectype
-      || !direct_internal_fn_supported_p
-	    (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
+  if (!new_vectype)
+    return NULL;
+  if (ifn != IFN_LAST
+      && !direct_internal_fn_supported_p (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
+    return NULL;
+  else if (ifn == IFN_LAST
+	   && !can_mult_highpart_p (TYPE_MODE (new_vectype),
+				    TYPE_UNSIGNED (new_type)))
     return NULL;
 
   /* The IR requires a valid vector type for the cast result, even though
@@ -2040,14 +2063,20 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
   if (!*type_out)
     return NULL;
 
-  /* Generate the IFN_MULHRS call.  */
+  gimple *mulhrs_stmt;
   tree new_var = vect_recog_temp_ssa_var (new_type, NULL);
   tree new_ops[2];
-  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type,
-		       unprom_mult, new_vectype);
-  gcall *mulhrs_stmt
-    = gimple_build_call_internal (ifn, 2, new_ops[0], new_ops[1]);
-  gimple_call_set_lhs (mulhrs_stmt, new_var);
+  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type, unprom_mult,
+		       new_vectype);
+  if (ifn == IFN_LAST)
+    mulhrs_stmt = gimple_build_assign (new_var, MULT_HIGHPART_EXPR, new_ops[0],
+				       new_ops[1]);
+  else
+    {
+      /* Generate the IFN_MULHRS call.  */
+      mulhrs_stmt = gimple_build_call_internal (ifn, 2, new_ops[0], new_ops[1]);
+      gimple_call_set_lhs (mulhrs_stmt, new_var);
+    }
   gimple_set_location (mulhrs_stmt, gimple_location (last_stmt));
 
   if (dump_enabled_p ())
-- 
2.17.1


* [PATCH] rs6000: Support [u]mul<mode>3_highpart for vector
  2021-07-13  8:52 [RFC/PATCH] vect: Recog mul_highpart pattern Kewen.Lin
@ 2021-07-13  8:58 ` Kewen.Lin
  2021-07-13 22:07   ` Segher Boessenkool
  2021-07-13  9:35 ` [RFC/PATCH] vect: Recog mul_highpart pattern Richard Biener
  1 sibling, 1 reply; 31+ messages in thread
From: Kewen.Lin @ 2021-07-13  8:58 UTC (permalink / raw)
  To: GCC Patches; +Cc: Segher Boessenkool, Bill Schmidt, David Edelsohn

[-- Attachment #1: Type: text/plain, Size: 786 bytes --]

Hi,

This patch makes the vector multiply high (part) instructions
newly introduced in Power10 exploitable in vectorized loops by
renaming the existing define_insns to the standard pattern names.
It depends on the patch which enables the vectorizer to recognize
mul_highpart.

Tested on powerpc64le-linux-gnu P9 with P10-capable binutils;
will test more once the vectorizer patch lands.

BR,
Kewen.
-----
gcc/ChangeLog:

	* config/rs6000/vsx.md (mulhs_<mode>): Rename to...
	(smul<mode>3_highpart): ... this.
	(mulhu_<mode>): Rename to...
	(umul<mode>3_highpart): ... this.
	* config/rs6000/rs6000-builtin.def (MULHS_V2DI, MULHS_V4SI,
	MULHU_V2DI, MULHU_V4SI): Adjust.

gcc/testsuite/ChangeLog:

	* gcc.target/powerpc/mul-vectorize-3.c: New test.
	* gcc.target/powerpc/mul-vectorize-4.c: New test.

[-- Attachment #2: 0001-rs6000-Support-u-mul-mode-3_highpart-for-vector.patch --]
[-- Type: text/plain, Size: 4952 bytes --]

---
 gcc/config/rs6000/rs6000-builtin.def          |  8 ++---
 gcc/config/rs6000/vsx.md                      |  4 +--
 .../gcc.target/powerpc/mul-vectorize-3.c      | 32 ++++++++++++++++++
 .../gcc.target/powerpc/mul-vectorize-4.c      | 33 +++++++++++++++++++
 4 files changed, 71 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/mul-vectorize-3.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/mul-vectorize-4.c

diff --git a/gcc/config/rs6000/rs6000-builtin.def b/gcc/config/rs6000/rs6000-builtin.def
index 592efe31b04..cbacbc6b785 100644
--- a/gcc/config/rs6000/rs6000-builtin.def
+++ b/gcc/config/rs6000/rs6000-builtin.def
@@ -3016,10 +3016,10 @@ BU_P10V_AV_2 (MODS_V2DI, "vmodsd", CONST, modv2di3)
 BU_P10V_AV_2 (MODS_V4SI, "vmodsw", CONST, modv4si3)
 BU_P10V_AV_2 (MODU_V2DI, "vmodud", CONST, umodv2di3)
 BU_P10V_AV_2 (MODU_V4SI, "vmoduw", CONST, umodv4si3)
-BU_P10V_AV_2 (MULHS_V2DI, "vmulhsd", CONST, mulhs_v2di)
-BU_P10V_AV_2 (MULHS_V4SI, "vmulhsw", CONST, mulhs_v4si)
-BU_P10V_AV_2 (MULHU_V2DI, "vmulhud", CONST, mulhu_v2di)
-BU_P10V_AV_2 (MULHU_V4SI, "vmulhuw", CONST, mulhu_v4si)
+BU_P10V_AV_2 (MULHS_V2DI, "vmulhsd", CONST, smulv2di3_highpart)
+BU_P10V_AV_2 (MULHS_V4SI, "vmulhsw", CONST, smulv4si3_highpart)
+BU_P10V_AV_2 (MULHU_V2DI, "vmulhud", CONST, umulv2di3_highpart)
+BU_P10V_AV_2 (MULHU_V4SI, "vmulhuw", CONST, umulv4si3_highpart)
 BU_P10V_AV_2 (MULLD_V2DI, "vmulld", CONST, mulv2di3)
 
 BU_P10V_VSX_1 (VXXSPLTIW_V4SI, "vxxspltiw_v4si", CONST, xxspltiw_v4si)
diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index f622873d758..6f6fc0bd835 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -6351,7 +6351,7 @@ (define_insn "umod<mode>3"
   [(set_attr "type" "vecdiv")
    (set_attr "size" "<bits>")])
 
-(define_insn "mulhs_<mode>"
+(define_insn "smul<mode>3_highpart"
   [(set (match_operand:VIlong 0 "vsx_register_operand" "=v")
 	(mult:VIlong (ashiftrt
 		       (match_operand:VIlong 1 "vsx_register_operand" "v")
@@ -6363,7 +6363,7 @@ (define_insn "mulhs_<mode>"
   "vmulhs<wd> %0,%1,%2"
   [(set_attr "type" "veccomplex")])
 
-(define_insn "mulhu_<mode>"
+(define_insn "umul<mode>3_highpart"
   [(set (match_operand:VIlong 0 "vsx_register_operand" "=v")
 	(us_mult:VIlong (ashiftrt
 			  (match_operand:VIlong 1 "vsx_register_operand" "v")
diff --git a/gcc/testsuite/gcc.target/powerpc/mul-vectorize-3.c b/gcc/testsuite/gcc.target/powerpc/mul-vectorize-3.c
new file mode 100644
index 00000000000..2c89c0faec2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/mul-vectorize-3.c
@@ -0,0 +1,32 @@
+/* { dg-require-effective-target power10_ok } */
+/* { dg-options "-mdejagnu-cpu=power10 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops -fdump-tree-vect-details" } */
+
+/* Test vectorizer can exploit ISA 3.1 instructions Vector Multiply
+   High Signed/Unsigned Word for both signed and unsigned int high part
+   multiplication.  */
+
+#define N 128
+
+extern signed int si_a[N], si_b[N], si_c[N];
+extern unsigned int ui_a[N], ui_b[N], ui_c[N];
+
+typedef signed long long sLL;
+typedef unsigned long long uLL;
+
+__attribute__ ((noipa)) void
+test_si ()
+{
+  for (int i = 0; i < N; i++)
+    si_c[i] = ((sLL) si_a[i] * (sLL) si_b[i]) >> 32;
+}
+
+__attribute__ ((noipa)) void
+test_ui ()
+{
+  for (int i = 0; i < N; i++)
+    ui_c[i] = ((uLL) ui_a[i] * (uLL) ui_b[i]) >> 32;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" } } */
+/* { dg-final { scan-assembler-times {\mvmulhsw\M} 1 } } */
+/* { dg-final { scan-assembler-times {\mvmulhuw\M} 1 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/mul-vectorize-4.c b/gcc/testsuite/gcc.target/powerpc/mul-vectorize-4.c
new file mode 100644
index 00000000000..265e7588bb6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/mul-vectorize-4.c
@@ -0,0 +1,33 @@
+/* { dg-require-effective-target power10_ok } */
+/* { dg-require-effective-target int128 } */
+/* { dg-options "-mdejagnu-cpu=power10 -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops -fdump-tree-vect-details" } */
+
+/* Test vectorizer can exploit ISA 3.1 instructions Vector Multiply
+   High Signed/Unsigned Doubleword for both signed and unsigned long
+   long high part multiplication.  */
+
+#define N 128
+
+extern signed long long sll_a[N], sll_b[N], sll_c[N];
+extern unsigned long long ull_a[N], ull_b[N], ull_c[N];
+
+typedef signed __int128 s128;
+typedef unsigned __int128 u128;
+
+__attribute__ ((noipa)) void
+test_sll ()
+{
+  for (int i = 0; i < N; i++)
+    sll_c[i] = ((s128) sll_a[i] * (s128) sll_b[i]) >> 64;
+}
+
+__attribute__ ((noipa)) void
+test_ull ()
+{
+  for (int i = 0; i < N; i++)
+    ull_c[i] = ((u128) ull_a[i] * (u128) ull_b[i]) >> 64;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" } } */
+/* { dg-final { scan-assembler-times {\mvmulhsd\M} 1 } } */
+/* { dg-final { scan-assembler-times {\mvmulhud\M} 1 } } */
-- 
2.17.1


* Re: [RFC/PATCH] vect: Recog mul_highpart pattern
  2021-07-13  8:52 [RFC/PATCH] vect: Recog mul_highpart pattern Kewen.Lin
  2021-07-13  8:58 ` [PATCH] rs6000: Support [u]mul<mode>3_highpart for vector Kewen.Lin
@ 2021-07-13  9:35 ` Richard Biener
  2021-07-13  9:40   ` Richard Sandiford
  2021-07-13 10:25   ` Kewen.Lin
  1 sibling, 2 replies; 31+ messages in thread
From: Richard Biener @ 2021-07-13  9:35 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Richard Sandiford, Segher Boessenkool, Bill Schmidt

On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi,
>
> When I added the support for Power10 newly introduced multiply
> highpart instrutions, I noticed that currently vectorizer
> doesn't try to vectorize multiply highpart pattern, I hope
> this isn't intentional?
>
> This patch is to extend the existing pattern mulhs handlings
> to cover multiply highpart.  Another alternative seems to
> recog mul_highpart operation in a general place applied for
> scalar code when the target supports the optab for the scalar
> operation, it's based on the assumption that one target which
> supports vector version of multiply highpart should have the
> scalar version.  I noticed that the function can_mult_highpart_p
> can check/handle mult_highpart well even without mul_highpart
> optab support, I think to recog this pattern in vectorizer
> is better.  Is it on the right track?

I think it's on the right track.  Using IFN_LAST is a bit awkward
in case yet another case pops up, so maybe you can use
a code_helper instance instead, which unifies tree_code,
builtin_code and internal_fn?

I also notice that can_mult_highpart_p will return true if
only vec_widen_[us]mult_{even,odd,hi,lo} are available,
but then the result might be less optimal (or even not
handled later)?

That is, what about adding optab internal functions
for [us]mul_highpart instead, much like the existing
ones for MULH{R,}S?

Richard.

>
> Bootstrapped & regtested on powerpc64le-linux-gnu P9,
> x86_64-redhat-linux and aarch64-linux-gnu.
>
> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
>         * tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
>         recog normal multiply highpart.
>

* Re: [RFC/PATCH] vect: Recog mul_highpart pattern
  2021-07-13  9:35 ` [RFC/PATCH] vect: Recog mul_highpart pattern Richard Biener
@ 2021-07-13  9:40   ` Richard Sandiford
  2021-07-13  9:44     ` Richard Biener
  2021-07-13 10:25   ` Kewen.Lin
  1 sibling, 1 reply; 31+ messages in thread
From: Richard Sandiford @ 2021-07-13  9:40 UTC (permalink / raw)
  To: Richard Biener; +Cc: Kewen.Lin, GCC Patches, Segher Boessenkool, Bill Schmidt

Richard Biener <richard.guenther@gmail.com> writes:
> On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi,
>>
>> When I added the support for Power10 newly introduced multiply
>> highpart instrutions, I noticed that currently vectorizer
>> doesn't try to vectorize multiply highpart pattern, I hope
>> this isn't intentional?
>>
>> This patch is to extend the existing pattern mulhs handlings
>> to cover multiply highpart.  Another alternative seems to
>> recog mul_highpart operation in a general place applied for
>> scalar code when the target supports the optab for the scalar
>> operation, it's based on the assumption that one target which
>> supports vector version of multiply highpart should have the
>> scalar version.  I noticed that the function can_mult_highpart_p
>> can check/handle mult_highpart well even without mul_highpart
>> optab support, I think to recog this pattern in vectorizer
>> is better.  Is it on the right track?
>
> I think it's on the right track, using IFN_LAST is a bit awkward
> in case yet another case pops up so maybe you can use
> a code_helper instance instead which unifies tree_code,
> builtin_code and internal_fn?
>
> I also notice that can_mult_highpart_p will return true if
> only vec_widen_[us]mult_{even,odd,hi,lo} are available,
> but then the result might be less optimal (or even not
> handled later)?
>
> That is, what about adding optab internal functions
> for [us]mul_highpart instead, much like the existing
> ones for MULH{R,}S?

Yeah, that'd be my preference too FWIW.  All uses of MULT_HIGHPART_EXPR
already have to be guarded by can_mult_highpart_p, so replacing it with
a directly-mapped ifn seems like a natural fit.  (Then can_mult_highpart_p
should be replaced with a direct_internal_fn_supported_p query.)
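
(Roughly, the target check in the pattern would then become something
like:

  direct_internal_fn_supported_p (IFN_MULH, new_vectype,
				  OPTIMIZE_FOR_SPEED)

with the final IFN name still to be decided, rather than going through
can_mult_highpart_p.)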

Thanks,
Richard

* Re: [RFC/PATCH] vect: Recog mul_highpart pattern
  2021-07-13  9:40   ` Richard Sandiford
@ 2021-07-13  9:44     ` Richard Biener
  2021-07-13 10:11       ` Richard Sandiford
  0 siblings, 1 reply; 31+ messages in thread
From: Richard Biener @ 2021-07-13  9:44 UTC (permalink / raw)
  To: Richard Biener, Kewen.Lin, GCC Patches, Segher Boessenkool,
	Bill Schmidt, Richard Sandiford

On Tue, Jul 13, 2021 at 11:40 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>
> >> Hi,
> >>
> >> When I added the support for Power10 newly introduced multiply
> >> highpart instrutions, I noticed that currently vectorizer
> >> doesn't try to vectorize multiply highpart pattern, I hope
> >> this isn't intentional?
> >>
> >> This patch is to extend the existing pattern mulhs handlings
> >> to cover multiply highpart.  Another alternative seems to
> >> recog mul_highpart operation in a general place applied for
> >> scalar code when the target supports the optab for the scalar
> >> operation, it's based on the assumption that one target which
> >> supports vector version of multiply highpart should have the
> >> scalar version.  I noticed that the function can_mult_highpart_p
> >> can check/handle mult_highpart well even without mul_highpart
> >> optab support, I think to recog this pattern in vectorizer
> >> is better.  Is it on the right track?
> >
> > I think it's on the right track, using IFN_LAST is a bit awkward
> > in case yet another case pops up so maybe you can use
> > a code_helper instance instead which unifies tree_code,
> > builtin_code and internal_fn?
> >
> > I also notice that can_mult_highpart_p will return true if
> > only vec_widen_[us]mult_{even,odd,hi,lo} are available,
> > but then the result might be less optimal (or even not
> > handled later)?
> >
> > That is, what about adding optab internal functions
> > for [us]mul_highpart instead, much like the existing
> > ones for MULH{R,}S?
>
> Yeah, that's be my preference too FWIW.  All uses of MULT_HIGHPART_EXPR
> already have to be guarded by can_mult_highpart_p, so replacing it with
> a directly-mapped ifn seems like a natural fit.  (Then can_mult_highpart_p
> should be replaced with a direct_internal_fn_supported_p query.)

But note can_mult_highpart_p covers use via vec_widen_[us]mult_{even,odd,hi,lo}
but I think this specific pattern should key on [us]mul_highpart only?

Because vec_widen_* implies a higher VF (or else we might miss vectorizing?)?

Richard.


> Thanks,
> Richard

* Re: [RFC/PATCH] vect: Recog mul_highpart pattern
  2021-07-13  9:44     ` Richard Biener
@ 2021-07-13 10:11       ` Richard Sandiford
  0 siblings, 0 replies; 31+ messages in thread
From: Richard Sandiford @ 2021-07-13 10:11 UTC (permalink / raw)
  To: Richard Biener; +Cc: Kewen.Lin, GCC Patches, Segher Boessenkool, Bill Schmidt

Richard Biener <richard.guenther@gmail.com> writes:
> On Tue, Jul 13, 2021 at 11:40 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Richard Biener <richard.guenther@gmail.com> writes:
>> > On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> When I added the support for Power10 newly introduced multiply
>> >> highpart instrutions, I noticed that currently vectorizer
>> >> doesn't try to vectorize multiply highpart pattern, I hope
>> >> this isn't intentional?
>> >>
>> >> This patch is to extend the existing pattern mulhs handlings
>> >> to cover multiply highpart.  Another alternative seems to
>> >> recog mul_highpart operation in a general place applied for
>> >> scalar code when the target supports the optab for the scalar
>> >> operation, it's based on the assumption that one target which
>> >> supports vector version of multiply highpart should have the
>> >> scalar version.  I noticed that the function can_mult_highpart_p
>> >> can check/handle mult_highpart well even without mul_highpart
>> >> optab support, I think to recog this pattern in vectorizer
>> >> is better.  Is it on the right track?
>> >
>> > I think it's on the right track, using IFN_LAST is a bit awkward
>> > in case yet another case pops up so maybe you can use
>> > a code_helper instance instead which unifies tree_code,
>> > builtin_code and internal_fn?
>> >
>> > I also notice that can_mult_highpart_p will return true if
>> > only vec_widen_[us]mult_{even,odd,hi,lo} are available,
>> > but then the result might be less optimal (or even not
>> > handled later)?
>> >
>> > That is, what about adding optab internal functions
>> > for [us]mul_highpart instead, much like the existing
>> > ones for MULH{R,}S?
>>
>> Yeah, that's be my preference too FWIW.  All uses of MULT_HIGHPART_EXPR
>> already have to be guarded by can_mult_highpart_p, so replacing it with
>> a directly-mapped ifn seems like a natural fit.  (Then can_mult_highpart_p
>> should be replaced with a direct_internal_fn_supported_p query.)
>
> But note can_mult_highpart_t covers use via vec_widen_[us]mult_{even,odd,hi,lo}
> but I think this specific pattern should key on [us]mul_highpart only?
>
> Because vec_widen_* implies a higher VF (or else we might miss vectorizing?)?

But wouldn't it be better to do the existing hi/lo/even/odd conversion in
gimple, rather than hide it in expand?  (Yes, this is feature creep. :-))
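
(Very roughly, and purely as a sketch with the exact types and the
permute selector hand-waved, that would mean emitting something like:

  even = VEC_WIDEN_MULT_EVEN_EXPR <a, b>;
  odd  = VEC_WIDEN_MULT_ODD_EXPR <a, b>;
  high = VEC_PERM_EXPR <even_lo, odd_lo, sel-of-high-halves>;

where even_lo/odd_lo are the narrow-element views of the two wide
results, instead of a bare highpart multiply that only expand knows
how to open-code.)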

Richard

* Re: [RFC/PATCH] vect: Recog mul_highpart pattern
  2021-07-13  9:35 ` [RFC/PATCH] vect: Recog mul_highpart pattern Richard Biener
  2021-07-13  9:40   ` Richard Sandiford
@ 2021-07-13 10:25   ` Kewen.Lin
  2021-07-13 12:42     ` Richard Biener
  1 sibling, 1 reply; 31+ messages in thread
From: Kewen.Lin @ 2021-07-13 10:25 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, Richard Sandiford, Segher Boessenkool, Bill Schmidt

Hi Richi,

Thanks for the comments!

on 2021/7/13 5:35 PM, Richard Biener wrote:
> On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi,
>>
>> When I added the support for Power10 newly introduced multiply
>> highpart instrutions, I noticed that currently vectorizer
>> doesn't try to vectorize multiply highpart pattern, I hope
>> this isn't intentional?
>>
>> This patch is to extend the existing pattern mulhs handlings
>> to cover multiply highpart.  Another alternative seems to
>> recog mul_highpart operation in a general place applied for
>> scalar code when the target supports the optab for the scalar
>> operation, it's based on the assumption that one target which
>> supports vector version of multiply highpart should have the
>> scalar version.  I noticed that the function can_mult_highpart_p
>> can check/handle mult_highpart well even without mul_highpart
>> optab support, I think to recog this pattern in vectorizer
>> is better.  Is it on the right track?
> 
> I think it's on the right track, using IFN_LAST is a bit awkward
> in case yet another case pops up so maybe you can use
> a code_helper instance instead which unifies tree_code,
> builtin_code and internal_fn?
> 

If a new requirement pops up which doesn't have/introduce any IFN
but does have an existing tree_code, can we add one more field of
type tree_code, so that on the IFN_LAST path we can check the
different requirements guarded by that tree_code variable?

> I also notice that can_mult_highpart_p will return true if
> only vec_widen_[us]mult_{even,odd,hi,lo} are available,
> but then the result might be less optimal (or even not
> handled later)?
> 

I think it will always be handled?  The expander calls

rtx
expand_mult_highpart (machine_mode mode, rtx op0, rtx op1,
		      rtx target, bool uns_p)

which will further check with can_mult_highpart_p.

For the below case, 

#define SHT_CNT 16

__attribute__ ((noipa)) void
test ()
{
  for (int i = 0; i < N; i++)
    sh_c[i] = ((SI) sh_a[i] * (SI) sh_b[i]) >> 16;
}

Without this patch, it uses widen_mult as below:

  vect__1.5_19 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.18_24 * 1];
  vect__3.8_14 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.18_24 * 1];
  vect_patt_22.9_13 = WIDEN_MULT_LO_EXPR <vect__3.8_14, vect__1.5_19>;
  vect_patt_22.9_9 = WIDEN_MULT_HI_EXPR <vect__3.8_14, vect__1.5_19>;
  vect__6.10_25 = vect_patt_22.9_13 >> 16;
  vect__6.10_26 = vect_patt_22.9_9 >> 16;
  vect__7.11_27 = VEC_PACK_TRUNC_EXPR <vect__6.10_25, vect__6.10_26>;
  MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.18_24 * 1] = vect__7.11_27;

.L2:
        lxvx 33,7,9
        lxvx 32,8,9
        vmulosh 13,0,1    // widen mult
        vmulesh 0,0,1
        xxmrglw 33,32,45  // merge
        xxmrghw 32,32,45
        vsraw 1,1,12      // shift
        vsraw 0,0,12
        vpkuwum 0,0,1     // pack
        stxvx 32,10,9
        addi 9,9,16
        bdnz .L2


With this patch, it ends up with:

  vect__1.5_14 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.17_24 * 1];
  vect__3.8_8 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.17_24 * 1];
  vect_patt_21.9_25 = vect__3.8_8 h* vect__1.5_14;
  MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.17_24 * 1] = vect_patt_21.9_25;

.L2:
        lxvx 32,8,9
        lxvx 33,10,9
        vmulosh 13,0,1   // widen mult
        vmulesh 0,0,1
        vperm 0,0,13,12  // perm on widen mults
        stxvx 32,7,9
        addi 9,9,16
        bdnz .L2


> That is, what about adding optab internal functions
> for [us]mul_highpart instead, much like the existing
> ones for MULH{R,}S?
> 

OK, I was thinking of the IFN way at the beginning, but was worried
it would be considered unnecessary since there is already an
existing tree_code.  :-)  Will update it to use the IFN way.
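
(Roughly along the lines of the existing MULH{R,}S entries in
internal-fn.def, with the name below just a placeholder until the
actual patch is written:

  DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST | ECF_NOTHROW, first,
				smul_highpart, umul_highpart, binary)

reusing the existing [us]mul_highpart optabs.)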

BR,
Kewen

> Richard.
> 
>>
>> Bootstrapped & regtested on powerpc64le-linux-gnu P9,
>> x86_64-redhat-linux and aarch64-linux-gnu.
>>
>> BR,
>> Kewen
>> -----
>> gcc/ChangeLog:
>>
>>         * tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
>>         recog normal multiply highpart.
>>



* Re: [RFC/PATCH] vect: Recog mul_highpart pattern
  2021-07-13 10:25   ` Kewen.Lin
@ 2021-07-13 12:42     ` Richard Biener
  2021-07-13 14:59       ` Kewen.Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Richard Biener @ 2021-07-13 12:42 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Richard Sandiford, Segher Boessenkool, Bill Schmidt

On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Richi,
>
> Thanks for the comments!
>
> on 2021/7/13 5:35 PM, Richard Biener wrote:
> > On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>
> >> Hi,
> >>
> >> When I added the support for Power10 newly introduced multiply
> >> highpart instrutions, I noticed that currently vectorizer
> >> doesn't try to vectorize multiply highpart pattern, I hope
> >> this isn't intentional?
> >>
> >> This patch is to extend the existing pattern mulhs handlings
> >> to cover multiply highpart.  Another alternative seems to
> >> recog mul_highpart operation in a general place applied for
> >> scalar code when the target supports the optab for the scalar
> >> operation, it's based on the assumption that one target which
> >> supports vector version of multiply highpart should have the
> >> scalar version.  I noticed that the function can_mult_highpart_p
> >> can check/handle mult_highpart well even without mul_highpart
> >> optab support, I think to recog this pattern in vectorizer
> >> is better.  Is it on the right track?
> >
> > I think it's on the right track, using IFN_LAST is a bit awkward
> > in case yet another case pops up so maybe you can use
> > a code_helper instance instead which unifies tree_code,
> > builtin_code and internal_fn?
> >
>
> If there is one new requirement which doesn't have/introduce IFN
> stuffs but have one existing tree_code, can we add one more field
> with type tree_code, then for the IFN_LAST path we can check the
> different requirements under the guard with that tree_code variable?
>
> > I also notice that can_mult_highpart_p will return true if
> > only vec_widen_[us]mult_{even,odd,hi,lo} are available,
> > but then the result might be less optimal (or even not
> > handled later)?
> >
>
> I think it will be handled always?  The expander calls
>
> rtx
> expand_mult_highpart (machine_mode mode, rtx op0, rtx op1,
>                       rtx target, bool uns_p)
>
> which will further check with can_mult_highpart_p.
>
> For the below case,
>
> #define SHT_CNT 16
>
> __attribute__ ((noipa)) void
> test ()
> {
>   for (int i = 0; i < N; i++)
>     sh_c[i] = ((SI) sh_a[i] * (SI) sh_b[i]) >> 16;
> }
>
> Without this patch, it use widen_mult like below:
>
>   vect__1.5_19 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.18_24 * 1];
>   vect__3.8_14 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.18_24 * 1];
>   vect_patt_22.9_13 = WIDEN_MULT_LO_EXPR <vect__3.8_14, vect__1.5_19>;
>   vect_patt_22.9_9 = WIDEN_MULT_HI_EXPR <vect__3.8_14, vect__1.5_19>;
>   vect__6.10_25 = vect_patt_22.9_13 >> 16;
>   vect__6.10_26 = vect_patt_22.9_9 >> 16;
>   vect__7.11_27 = VEC_PACK_TRUNC_EXPR <vect__6.10_25, vect__6.10_26>;
>   MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.18_24 * 1] = vect__7.11_27;
>
> .L2:
>         lxvx 33,7,9
>         lxvx 32,8,9
>         vmulosh 13,0,1    // widen mult
>         vmulesh 0,0,1
>         xxmrglw 33,32,45  // merge
>         xxmrghw 32,32,45
>         vsraw 1,1,12      // shift
>         vsraw 0,0,12
>         vpkuwum 0,0,1     // pack
>         stxvx 32,10,9
>         addi 9,9,16
>         bdnz .L2
>
>
> With this patch, it ends up with:
>
>   vect__1.5_14 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.17_24 * 1];
>   vect__3.8_8 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.17_24 * 1];
>   vect_patt_21.9_25 = vect__3.8_8 h* vect__1.5_14;
>   MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.17_24 * 1] = vect_patt_21.9_25;

Yes, so I'm curious what it ends up with/without the patch on x86_64 which
can do vec_widen_[us]mult_{even,odd} but not [us]mul_highpart.

Richard.

> .L2:
>         lxvx 32,8,9
>         lxvx 33,10,9
>         vmulosh 13,0,1   // widen mult
>         vmulesh 0,0,1
>         vperm 0,0,13,12  // perm on widen mults
>         stxvx 32,7,9
>         addi 9,9,16
>         bdnz .L2
>
>
> > That is, what about adding optab internal functions
> > for [us]mul_highpart instead, much like the existing
> > ones for MULH{R,}S?
> >
>
> OK, I was thinking the IFN way at the beginning, but was worried
> that it's easy to be blamed saying it's not necessary since there
> is one existing tree_code.  :-)  Will update it with IFN way.
>
> BR,
> Kewen
>
> > Richard.
> >
> >>
> >> Bootstrapped & regtested on powerpc64le-linux-gnu P9,
> >> x86_64-redhat-linux and aarch64-linux-gnu.
> >>
> >> BR,
> >> Kewen
> >> -----
> >> gcc/ChangeLog:
> >>
> >>         * tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
> >>         recog normal multiply highpart.
> >>
>
>

* Re: [RFC/PATCH] vect: Recog mul_highpart pattern
  2021-07-13 12:42     ` Richard Biener
@ 2021-07-13 14:59       ` Kewen.Lin
  2021-07-14  6:38         ` Richard Biener
  0 siblings, 1 reply; 31+ messages in thread
From: Kewen.Lin @ 2021-07-13 14:59 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, Richard Sandiford, Segher Boessenkool, Bill Schmidt

on 2021/7/13 8:42 PM, Richard Biener wrote:
> On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi Richi,
>>
>> Thanks for the comments!
>>
>> on 2021/7/13 5:35 PM, Richard Biener wrote:
>>> On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> When I added the support for Power10 newly introduced multiply
>>>> highpart instrutions, I noticed that currently vectorizer
>>>> doesn't try to vectorize multiply highpart pattern, I hope
>>>> this isn't intentional?
>>>>
>>>> This patch is to extend the existing pattern mulhs handlings
>>>> to cover multiply highpart.  Another alternative seems to
>>>> recog mul_highpart operation in a general place applied for
>>>> scalar code when the target supports the optab for the scalar
>>>> operation, it's based on the assumption that one target which
>>>> supports vector version of multiply highpart should have the
>>>> scalar version.  I noticed that the function can_mult_highpart_p
>>>> can check/handle mult_highpart well even without mul_highpart
>>>> optab support, I think to recog this pattern in vectorizer
>>>> is better.  Is it on the right track?
>>>
>>> I think it's on the right track, using IFN_LAST is a bit awkward
>>> in case yet another case pops up so maybe you can use
>>> a code_helper instance instead which unifies tree_code,
>>> builtin_code and internal_fn?
>>>
>>
>> If there is one new requirement which doesn't have/introduce IFN
>> stuffs but have one existing tree_code, can we add one more field
>> with type tree_code, then for the IFN_LAST path we can check the
>> different requirements under the guard with that tree_code variable?
>>
>>> I also notice that can_mult_highpart_p will return true if
>>> only vec_widen_[us]mult_{even,odd,hi,lo} are available,
>>> but then the result might be less optimal (or even not
>>> handled later)?
>>>
>>
>> I think it will be handled always?  The expander calls
>>
>> rtx
>> expand_mult_highpart (machine_mode mode, rtx op0, rtx op1,
>>                       rtx target, bool uns_p)
>>
>> which will further check with can_mult_highpart_p.
>>
>> For the below case,
>>
>> #define SHT_CNT 16
>>
>> __attribute__ ((noipa)) void
>> test ()
>> {
>>   for (int i = 0; i < N; i++)
>>     sh_c[i] = ((SI) sh_a[i] * (SI) sh_b[i]) >> 16;
>> }
>>
>> Without this patch, it use widen_mult like below:
>>
>>   vect__1.5_19 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.18_24 * 1];
>>   vect__3.8_14 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.18_24 * 1];
>>   vect_patt_22.9_13 = WIDEN_MULT_LO_EXPR <vect__3.8_14, vect__1.5_19>;
>>   vect_patt_22.9_9 = WIDEN_MULT_HI_EXPR <vect__3.8_14, vect__1.5_19>;
>>   vect__6.10_25 = vect_patt_22.9_13 >> 16;
>>   vect__6.10_26 = vect_patt_22.9_9 >> 16;
>>   vect__7.11_27 = VEC_PACK_TRUNC_EXPR <vect__6.10_25, vect__6.10_26>;
>>   MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.18_24 * 1] = vect__7.11_27;
>>
>> .L2:
>>         lxvx 33,7,9
>>         lxvx 32,8,9
>>         vmulosh 13,0,1    // widen mult
>>         vmulesh 0,0,1
>>         xxmrglw 33,32,45  // merge
>>         xxmrghw 32,32,45
>>         vsraw 1,1,12      // shift
>>         vsraw 0,0,12
>>         vpkuwum 0,0,1     // pack
>>         stxvx 32,10,9
>>         addi 9,9,16
>>         bdnz .L2
>>
>>
>> With this patch, it ends up with:
>>
>>   vect__1.5_14 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.17_24 * 1];
>>   vect__3.8_8 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.17_24 * 1];
>>   vect_patt_21.9_25 = vect__3.8_8 h* vect__1.5_14;
>>   MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.17_24 * 1] = vect_patt_21.9_25;
> 
> Yes, so I'm curious what it ends up with/without the patch on x86_64 which
> can do vec_widen_[us]mult_{even,odd} but not [us]mul_highpart.
> 

For test case:

```
#define N 32
typedef signed int bigType;
typedef signed short smallType;
#define SH_CNT 16

extern smallType small_a[N], small_b[N], small_c[N];

__attribute__((noipa)) void test_si(int n) {
  for (int i = 0; i < n; i++)
    small_c[i] = ((bigType)small_a[i] * (bigType)small_b[i]) >> SH_CNT;
}

```

on x86_64, with option set: -mfpmath=sse -msse2 -O2 -ftree-vectorize

1) without this patch, the pattern isn't recognized; the IR looks like:

  <bb 4> [local count: 94607391]:
  bnd.5_34 = niters.4_25 >> 3;
  _13 = (sizetype) bnd.5_34;
  _29 = _13 * 16;

  <bb 5> [local count: 378429566]:
  # ivtmp.34_4 = PHI <ivtmp.34_5(5), 0(4)>
  vect__1.10_40 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.34_4 * 1];
  vect__3.13_43 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.34_4 * 1];
  vect_patt_18.14_44 = WIDEN_MULT_LO_EXPR <vect__1.10_40, vect__3.13_43>;
  vect_patt_18.14_45 = WIDEN_MULT_HI_EXPR <vect__1.10_40, vect__3.13_43>;
  vect__6.15_46 = vect_patt_18.14_44 >> 16;
  vect__6.15_47 = vect_patt_18.14_45 >> 16;
  vect__7.16_48 = VEC_PACK_TRUNC_EXPR <vect__6.15_46, vect__6.15_47>;
  MEM <vector(8) short int> [(short int *)&small_c + ivtmp.34_4 * 1] = vect__7.16_48;
  ivtmp.34_5 = ivtmp.34_4 + 16;
  if (ivtmp.34_5 != _29)
    goto <bb 5>; [75.00%]
  else
    goto <bb 6>; [25.00%]

...

*asm*:

.L4:
        movdqu  small_b(%rax), %xmm3
        movdqu  small_a(%rax), %xmm1
        addq    $16, %rax
        movdqu  small_a-16(%rax), %xmm2
        pmullw  %xmm3, %xmm1
        pmulhw  %xmm3, %xmm2
        movdqa  %xmm1, %xmm0
        punpcklwd       %xmm2, %xmm0
        punpckhwd       %xmm2, %xmm1
        psrad   $16, %xmm1
        psrad   $16, %xmm0
        movdqa  %xmm0, %xmm2
        punpcklwd       %xmm1, %xmm0
        punpckhwd       %xmm1, %xmm2
        movdqa  %xmm0, %xmm1
        punpckhwd       %xmm2, %xmm1
        punpcklwd       %xmm2, %xmm0
        punpcklwd       %xmm1, %xmm0
        movups  %xmm0, small_c-16(%rax)
        cmpq    %rdx, %rax
        jne     .L4
        movl    %edi, %eax
        andl    $-8, %eax
        testb   $7, %dil
        je      .L14

*insn dist in loop*

      1 addq
      3 movdqa
      3 movdqu
      1 movups
      1 pmulhw
      1 pmullw
      2 psrad
      3 punpckhwd
      4 punpcklwd


2) with this patch, but with the mul_highpart optab query made to return false, the IR
looks like:

  <bb 4> [local count: 94607391]:
  bnd.5_37 = niters.4_22 >> 3;
  _13 = (sizetype) bnd.5_37;
  _32 = _13 * 16;

  <bb 5> [local count: 378429566]:
  # ivtmp.33_4 = PHI <ivtmp.33_5(5), 0(4)>
  vect__1.10_43 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.33_4 * 1];
  vect__3.13_46 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.33_4 * 1];
  vect_patt_26.14_47 = vect__1.10_43 h* vect__3.13_46;
  MEM <vector(8) short int> [(short int *)&small_c + ivtmp.33_4 * 1] = vect_patt_26.14_47;
  ivtmp.33_5 = ivtmp.33_4 + 16;
  if (ivtmp.33_5 != _32)
    goto <bb 5>; [75.00%]
  else
    goto <bb 6>; [25.00%]

*asm*:

.L4:
        movdqu  small_b(%rax), %xmm3
        movdqu  small_a(%rax), %xmm1
        addq    $16, %rax
        movdqu  small_a-16(%rax), %xmm2
        pmullw  %xmm3, %xmm1
        pmulhw  %xmm3, %xmm2
        movdqa  %xmm1, %xmm0
        punpcklwd       %xmm2, %xmm0
        punpckhwd       %xmm2, %xmm1
        movdqa  %xmm0, %xmm2
        punpcklwd       %xmm1, %xmm0
        punpckhwd       %xmm1, %xmm2
        movdqa  %xmm0, %xmm1
        punpckhwd       %xmm2, %xmm1
        punpcklwd       %xmm2, %xmm0
        punpckhwd       %xmm1, %xmm0
        movups  %xmm0, small_c-16(%rax)
        cmpq    %rdx, %rax
        jne     .L4
        movl    %edi, %eax
        andl    $-8, %eax
        testb   $7, %dil
        je      .L14

*insn dist*:

      1 addq
      3 movdqa
      3 movdqu
      1 movups
      1 pmulhw
      1 pmullw
      4 punpckhwd
      3 punpcklwd

I know little about i386 asm, but this seems better from the insn
distribution; the version without this patch uses two more psrad
instructions.

3) FWIW, with this patch as well as the existing optab support, the IR looks like:

  <bb 4> [local count: 94607391]:
  bnd.5_40 = niters.4_19 >> 3;
  _75 = (sizetype) bnd.5_40;
  _91 = _75 * 16;

  <bb 5> [local count: 378429566]:
  # ivtmp.48_53 = PHI <ivtmp.48_45(5), 0(4)>
  vect__1.10_48 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.48_53 * 1];
  vect__3.13_51 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.48_53 * 1];
  vect_patt_26.14_52 = vect__1.10_48 h* vect__3.13_51;
  MEM <vector(8) short int> [(short int *)&small_c + ivtmp.48_53 * 1] = vect_patt_26.14_52;
  ivtmp.48_45 = ivtmp.48_53 + 16;
  if (ivtmp.48_45 != _91)
    goto <bb 5>; [75.00%]
  else
    goto <bb 6>; [25.00%]

  <bb 6> [local count: 94607391]:
  niters_vector_mult_vf.6_41 = niters.4_19 & 4294967288;
  tmp.7_42 = (int) niters_vector_mult_vf.6_41;
  _61 = niters.4_19 & 7;
  if (_61 == 0)
    goto <bb 12>; [12.50%]
  else
    goto <bb 7>; [87.50%]

  <bb 7> [local count: 93293400]:
  # i_38 = PHI <tmp.7_42(6), 0(3)>
  # _44 = PHI <niters_vector_mult_vf.6_41(6), 0(3)>
  niters.18_60 = niters.4_19 - _44;
  _76 = niters.18_60 + 4294967295;
  if (_76 <= 2)
    goto <bb 9>; [10.00%]
  else
    goto <bb 8>; [90.00%]

  <bb 8> [local count: 167928121]:
  _85 = (sizetype) _44;
  _86 = _85 * 2;
  vectp_small_a.23_84 = &small_a + _86;
  vectp_small_b.26_90 = &small_b + _86;
  vectp_small_c.31_98 = &small_c + _86;
  vect__9.24_89 = MEM <vector(4) short int> [(short int *)vectp_small_a.23_84];
  vect__28.27_95 = MEM <vector(4) short int> [(short int *)vectp_small_b.26_90];
  vect_patt_23.28_96 = vect__9.24_89 h* vect__28.27_95;
  MEM <vector(4) short int> [(short int *)vectp_small_c.31_98] = vect_patt_23.28_96;
  niters_vector_mult_vf.20_80 = niters.18_60 & 4294967292;
  _82 = (int) niters_vector_mult_vf.20_80;
  tmp.21_81 = i_38 + _82;
  _74 = niters.18_60 & 3;
  if (_74 == 0)
    goto <bb 11>; [25.00%]
  else
    goto <bb 9>; [75.00%]

...

*asm*:

.L4:
        movdqu  small_a(%rax), %xmm0
        movdqu  small_b(%rax), %xmm2
        addq    $16, %rax
        pmulhw  %xmm2, %xmm0
        movups  %xmm0, small_c-16(%rax)
        cmpq    %rdx, %rax
        jne     .L4
        movl    %ecx, %edx
        andl    $-8, %edx
        movl    %edx, %eax
        testb   $7, %cl
        je      .L19
.L3:
        movl    %ecx, %esi
        subl    %edx, %esi
        leal    -1(%rsi), %edi
        cmpl    $2, %edi
        jbe     .L6
        movq    small_a(%rdx,%rdx), %xmm0
        movq    small_b(%rdx,%rdx), %xmm1
        pmulhw  %xmm1, %xmm0
        movq    %xmm0, small_c(%rdx,%rdx)
        movl    %esi, %edx
        andl    $-4, %edx
        addl    %edx, %eax
        andl    $3, %esi
        je      .L1

I guess the proposed IFN would map directly to [us]mul_highpart?
Or do you expect it to also cover what we support in can_mult_highpart_p?

BR,
Kewen

* Re: [PATCH] rs6000: Support [u]mul<mode>3_highpart for vector
  2021-07-13  8:58 ` [PATCH] rs6000: Support [u]mul<mode>3_highpart for vector Kewen.Lin
@ 2021-07-13 22:07   ` Segher Boessenkool
  2021-07-14  2:12     ` Kewen.Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Segher Boessenkool @ 2021-07-13 22:07 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: GCC Patches, Bill Schmidt, David Edelsohn

Hi!

On Tue, Jul 13, 2021 at 04:58:42PM +0800, Kewen.Lin wrote:
> This patch is to make Power10 newly introduced vector
> multiply high (part) instructions exploited in vectorized
> loops, it renames existing define_insns as standard pattern
> names.  It depends on that patch which enables vectorizer
> to recog mul_highpart.

It actually is correct already, it will just not be used yet, right?
But the testcases will fail until the generic support lands.

Okay for trunk.  Thanks!


Segher

* Re: [PATCH] rs6000: Support [u]mul<mode>3_highpart for vector
  2021-07-13 22:07   ` Segher Boessenkool
@ 2021-07-14  2:12     ` Kewen.Lin
  2021-07-14 18:38       ` Segher Boessenkool
  0 siblings, 1 reply; 31+ messages in thread
From: Kewen.Lin @ 2021-07-14  2:12 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: GCC Patches, Bill Schmidt, David Edelsohn

on 2021/7/14 6:07 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Tue, Jul 13, 2021 at 04:58:42PM +0800, Kewen.Lin wrote:
>> This patch is to make Power10 newly introduced vector
>> multiply high (part) instructions exploited in vectorized
>> loops, it renames existing define_insns as standard pattern
>> names.  It depends on that patch which enables vectorizer
>> to recog mul_highpart.
> 
> It actually is correct already, it will just not be used yet, right?

Yes, the names are just not standard.  :)

> But the testcases will fail until the generic support lands.
> 

Yes!

> Okay for trunk.  Thanks!
> 
> 

Thanks!

BR,
Kewen

* Re: [RFC/PATCH] vect: Recog mul_highpart pattern
  2021-07-13 14:59       ` Kewen.Lin
@ 2021-07-14  6:38         ` Richard Biener
  2021-07-14  7:45           ` Kewen.Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Richard Biener @ 2021-07-14  6:38 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Richard Sandiford, Segher Boessenkool, Bill Schmidt

On Tue, Jul 13, 2021 at 4:59 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> on 2021/7/13 8:42 PM, Richard Biener wrote:
> > On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>
> >> Hi Richi,
> >>
> >> Thanks for the comments!
> >>
> >> on 2021/7/13 5:35 PM, Richard Biener wrote:
> >>> On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> When I added the support for Power10 newly introduced multiply
> >>>> highpart instrutions, I noticed that currently vectorizer
> >>>> doesn't try to vectorize multiply highpart pattern, I hope
> >>>> this isn't intentional?
> >>>>
> >>>> This patch is to extend the existing pattern mulhs handlings
> >>>> to cover multiply highpart.  Another alternative seems to
> >>>> recog mul_highpart operation in a general place applied for
> >>>> scalar code when the target supports the optab for the scalar
> >>>> operation, it's based on the assumption that one target which
> >>>> supports vector version of multiply highpart should have the
> >>>> scalar version.  I noticed that the function can_mult_highpart_p
> >>>> can check/handle mult_highpart well even without mul_highpart
> >>>> optab support, I think to recog this pattern in vectorizer
> >>>> is better.  Is it on the right track?
> >>>
> >>> I think it's on the right track, using IFN_LAST is a bit awkward
> >>> in case yet another case pops up so maybe you can use
> >>> a code_helper instance instead which unifies tree_code,
> >>> builtin_code and internal_fn?
> >>>
> >>
> >> If there is one new requirement which doesn't have/introduce IFN
> >> stuffs but have one existing tree_code, can we add one more field
> >> with type tree_code, then for the IFN_LAST path we can check the
> >> different requirements under the guard with that tree_code variable?
> >>
> >>> I also notice that can_mult_highpart_p will return true if
> >>> only vec_widen_[us]mult_{even,odd,hi,lo} are available,
> >>> but then the result might be less optimal (or even not
> >>> handled later)?
> >>>
> >>
> >> I think it will be handled always?  The expander calls
> >>
> >> rtx
> >> expand_mult_highpart (machine_mode mode, rtx op0, rtx op1,
> >>                       rtx target, bool uns_p)
> >>
> >> which will further check with can_mult_highpart_p.
> >>
> >> For the below case,
> >>
> >> #define SHT_CNT 16
> >>
> >> __attribute__ ((noipa)) void
> >> test ()
> >> {
> >>   for (int i = 0; i < N; i++)
> >>     sh_c[i] = ((SI) sh_a[i] * (SI) sh_b[i]) >> 16;
> >> }
> >>
> >> Without this patch, it use widen_mult like below:
> >>
> >>   vect__1.5_19 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.18_24 * 1];
> >>   vect__3.8_14 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.18_24 * 1];
> >>   vect_patt_22.9_13 = WIDEN_MULT_LO_EXPR <vect__3.8_14, vect__1.5_19>;
> >>   vect_patt_22.9_9 = WIDEN_MULT_HI_EXPR <vect__3.8_14, vect__1.5_19>;
> >>   vect__6.10_25 = vect_patt_22.9_13 >> 16;
> >>   vect__6.10_26 = vect_patt_22.9_9 >> 16;
> >>   vect__7.11_27 = VEC_PACK_TRUNC_EXPR <vect__6.10_25, vect__6.10_26>;
> >>   MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.18_24 * 1] = vect__7.11_27;
> >>
> >> .L2:
> >>         lxvx 33,7,9
> >>         lxvx 32,8,9
> >>         vmulosh 13,0,1    // widen mult
> >>         vmulesh 0,0,1
> >>         xxmrglw 33,32,45  // merge
> >>         xxmrghw 32,32,45
> >>         vsraw 1,1,12      // shift
> >>         vsraw 0,0,12
> >>         vpkuwum 0,0,1     // pack
> >>         stxvx 32,10,9
> >>         addi 9,9,16
> >>         bdnz .L2
> >>
> >>
> >> With this patch, it ends up with:
> >>
> >>   vect__1.5_14 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.17_24 * 1];
> >>   vect__3.8_8 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.17_24 * 1];
> >>   vect_patt_21.9_25 = vect__3.8_8 h* vect__1.5_14;
> >>   MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.17_24 * 1] = vect_patt_21.9_25;
> >
> > Yes, so I'm curious what it ends up with/without the patch on x86_64 which
> > can do vec_widen_[us]mult_{even,odd} but not [us]mul_highpart.
> >
>
> For test case:
>
> ```
> #define N 32
> typedef signed int bigType;
> typedef signed short smallType;
> #define SH_CNT 16
>
> extern smallType small_a[N], small_b[N], small_c[N];
>
> __attribute__((noipa)) void test_si(int n) {
>   for (int i = 0; i < n; i++)
>     small_c[i] = ((bigType)small_a[i] * (bigType)small_b[i]) >> SH_CNT;
> }
>
> ```
>
> on x86_64, with option set: -mfpmath=sse -msse2 -O2 -ftree-vectorize
>
> 1) without this patch, the pattern isn't recognized, the IR looks like:
>
>   <bb 4> [local count: 94607391]:
>   bnd.5_34 = niters.4_25 >> 3;
>   _13 = (sizetype) bnd.5_34;
>   _29 = _13 * 16;
>
>   <bb 5> [local count: 378429566]:
>   # ivtmp.34_4 = PHI <ivtmp.34_5(5), 0(4)>
>   vect__1.10_40 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.34_4 * 1];
>   vect__3.13_43 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.34_4 * 1];
>   vect_patt_18.14_44 = WIDEN_MULT_LO_EXPR <vect__1.10_40, vect__3.13_43>;
>   vect_patt_18.14_45 = WIDEN_MULT_HI_EXPR <vect__1.10_40, vect__3.13_43>;
>   vect__6.15_46 = vect_patt_18.14_44 >> 16;
>   vect__6.15_47 = vect_patt_18.14_45 >> 16;
>   vect__7.16_48 = VEC_PACK_TRUNC_EXPR <vect__6.15_46, vect__6.15_47>;
>   MEM <vector(8) short int> [(short int *)&small_c + ivtmp.34_4 * 1] = vect__7.16_48;
>   ivtmp.34_5 = ivtmp.34_4 + 16;
>   if (ivtmp.34_5 != _29)
>     goto <bb 5>; [75.00%]
>   else
>     goto <bb 6>; [25.00%]
>
> ...
>
> *asm*:
>
> .L4:
>         movdqu  small_b(%rax), %xmm3
>         movdqu  small_a(%rax), %xmm1
>         addq    $16, %rax
>         movdqu  small_a-16(%rax), %xmm2
>         pmullw  %xmm3, %xmm1
>         pmulhw  %xmm3, %xmm2
>         movdqa  %xmm1, %xmm0
>         punpcklwd       %xmm2, %xmm0
>         punpckhwd       %xmm2, %xmm1
>         psrad   $16, %xmm1
>         psrad   $16, %xmm0
>         movdqa  %xmm0, %xmm2
>         punpcklwd       %xmm1, %xmm0
>         punpckhwd       %xmm1, %xmm2
>         movdqa  %xmm0, %xmm1
>         punpckhwd       %xmm2, %xmm1
>         punpcklwd       %xmm2, %xmm0
>         punpcklwd       %xmm1, %xmm0
>         movups  %xmm0, small_c-16(%rax)
>         cmpq    %rdx, %rax
>         jne     .L4
>         movl    %edi, %eax
>         andl    $-8, %eax
>         testb   $7, %dil
>         je      .L14
>
> *insn dist in loop*
>
>       1 addq
>       3 movdqa
>       3 movdqu
>       1 movups
>       1 pmulhw
>       1 pmullw
>       2 psrad
>       3 punpckhwd
>       4 punpcklwd
>
>
> 2) with this patch but make the mul_highpart optab query return false, the IR looks
> like:
>
>   <bb 4> [local count: 94607391]:
>   bnd.5_37 = niters.4_22 >> 3;
>   _13 = (sizetype) bnd.5_37;
>   _32 = _13 * 16;
>
>   <bb 5> [local count: 378429566]:
>   # ivtmp.33_4 = PHI <ivtmp.33_5(5), 0(4)>
>   vect__1.10_43 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.33_4 * 1];
>   vect__3.13_46 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.33_4 * 1];
>   vect_patt_26.14_47 = vect__1.10_43 h* vect__3.13_46;
>   MEM <vector(8) short int> [(short int *)&small_c + ivtmp.33_4 * 1] = vect_patt_26.14_47;
>   ivtmp.33_5 = ivtmp.33_4 + 16;
>   if (ivtmp.33_5 != _32)
>     goto <bb 5>; [75.00%]
>   else
>     goto <bb 6>; [25.00%]
>
> *asm*:
>
> .L4:
>         movdqu  small_b(%rax), %xmm3
>         movdqu  small_a(%rax), %xmm1
>         addq    $16, %rax
>         movdqu  small_a-16(%rax), %xmm2
>         pmullw  %xmm3, %xmm1
>         pmulhw  %xmm3, %xmm2
>         movdqa  %xmm1, %xmm0
>         punpcklwd       %xmm2, %xmm0
>         punpckhwd       %xmm2, %xmm1
>         movdqa  %xmm0, %xmm2
>         punpcklwd       %xmm1, %xmm0
>         punpckhwd       %xmm1, %xmm2
>         movdqa  %xmm0, %xmm1
>         punpckhwd       %xmm2, %xmm1
>         punpcklwd       %xmm2, %xmm0
>         punpckhwd       %xmm1, %xmm0
>         movups  %xmm0, small_c-16(%rax)
>         cmpq    %rdx, %rax
>         jne     .L4
>         movl    %edi, %eax
>         andl    $-8, %eax
>         testb   $7, %dil
>         je      .L14
>
> *insn dist*:
>
>       1 addq
>       3 movdqa
>       3 movdqu
>       1 movups
>       1 pmulhw
>       1 pmullw
>       4 punpckhwd
>       3 punpcklwd
>
> I know little on i386 asm, but this seems better from insn distribution,
> the one without this patch uses two more psrad (s).

So the advantage of 1) would be more appropriate costing in the vectorizer
(x86 does not have a native vector highpart multiply).

> 3) FWIW with this patch as well as existing optab supports, the IR looks like:
>
>   <bb 4> [local count: 94607391]:
>   bnd.5_40 = niters.4_19 >> 3;
>   _75 = (sizetype) bnd.5_40;
>   _91 = _75 * 16;
>
>   <bb 5> [local count: 378429566]:
>   # ivtmp.48_53 = PHI <ivtmp.48_45(5), 0(4)>
>   vect__1.10_48 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.48_53 * 1];
>   vect__3.13_51 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.48_53 * 1];
>   vect_patt_26.14_52 = vect__1.10_48 h* vect__3.13_51;
>   MEM <vector(8) short int> [(short int *)&small_c + ivtmp.48_53 * 1] = vect_patt_26.14_52;
>   ivtmp.48_45 = ivtmp.48_53 + 16;
>   if (ivtmp.48_45 != _91)
>     goto <bb 5>; [75.00%]
>   else
>     goto <bb 6>; [25.00%]
>
>   <bb 6> [local count: 94607391]:
>   niters_vector_mult_vf.6_41 = niters.4_19 & 4294967288;
>   tmp.7_42 = (int) niters_vector_mult_vf.6_41;
>   _61 = niters.4_19 & 7;
>   if (_61 == 0)
>     goto <bb 12>; [12.50%]
>   else
>     goto <bb 7>; [87.50%]
>
>   <bb 7> [local count: 93293400]:
>   # i_38 = PHI <tmp.7_42(6), 0(3)>
>   # _44 = PHI <niters_vector_mult_vf.6_41(6), 0(3)>
>   niters.18_60 = niters.4_19 - _44;
>   _76 = niters.18_60 + 4294967295;
>   if (_76 <= 2)
>     goto <bb 9>; [10.00%]
>   else
>     goto <bb 8>; [90.00%]
>
>   <bb 8> [local count: 167928121]:
>   _85 = (sizetype) _44;
>   _86 = _85 * 2;
>   vectp_small_a.23_84 = &small_a + _86;
>   vectp_small_b.26_90 = &small_b + _86;
>   vectp_small_c.31_98 = &small_c + _86;
>   vect__9.24_89 = MEM <vector(4) short int> [(short int *)vectp_small_a.23_84];
>   vect__28.27_95 = MEM <vector(4) short int> [(short int *)vectp_small_b.26_90];
>   vect_patt_23.28_96 = vect__9.24_89 h* vect__28.27_95;
>   MEM <vector(4) short int> [(short int *)vectp_small_c.31_98] = vect_patt_23.28_96;
>   niters_vector_mult_vf.20_80 = niters.18_60 & 4294967292;
>   _82 = (int) niters_vector_mult_vf.20_80;
>   tmp.21_81 = i_38 + _82;
>   _74 = niters.18_60 & 3;
>   if (_74 == 0)
>     goto <bb 11>; [25.00%]
>   else
>     goto <bb 9>; [75.00%]
>
> ...
>
> *asm*:
>
> .L4:
>         movdqu  small_a(%rax), %xmm0
>         movdqu  small_b(%rax), %xmm2
>         addq    $16, %rax
>         pmulhw  %xmm2, %xmm0
>         movups  %xmm0, small_c-16(%rax)
>         cmpq    %rdx, %rax
>         jne     .L4
>         movl    %ecx, %edx
>         andl    $-8, %edx
>         movl    %edx, %eax
>         testb   $7, %cl
>         je      .L19
> .L3:
>         movl    %ecx, %esi
>         subl    %edx, %esi
>         leal    -1(%rsi), %edi
>         cmpl    $2, %edi
>         jbe     .L6
>         movq    small_a(%rdx,%rdx), %xmm0
>         movq    small_b(%rdx,%rdx), %xmm1
>         pmulhw  %xmm1, %xmm0
>         movq    %xmm0, small_c(%rdx,%rdx)
>         movl    %esi, %edx
>         andl    $-4, %edx
>         addl    %edx, %eax
>         andl    $3, %esi
>         je      .L1

Oh, so indeed x86 has a vector highpart multiply, not grep-friendly
macroized as <s>mul<mode>3_highpart ;)

> I guess the proposed IFN would be directly mapped for [us]mul_highpart?

Yes.

> or do you expect it will also cover what we support in can_mult_highpart_p?

See above - since we manage to use widen_mult we should expose this
detail for the purpose of costing.  So the pattern would directly ask for
an optab IFN mapping to [us]mul_highpart.

Richard.

> BR,
> Kewen

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC/PATCH] vect: Recog mul_highpart pattern
  2021-07-14  6:38         ` Richard Biener
@ 2021-07-14  7:45           ` Kewen.Lin
  2021-07-14  8:38             ` Richard Sandiford
  2021-07-15  7:06             ` [PATCH v3] " Kewen.Lin
  0 siblings, 2 replies; 31+ messages in thread
From: Kewen.Lin @ 2021-07-14  7:45 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, Richard Sandiford, Segher Boessenkool, Bill Schmidt

[-- Attachment #1: Type: text/plain, Size: 13218 bytes --]

on 2021/7/14 2:38 PM, Richard Biener wrote:
> On Tue, Jul 13, 2021 at 4:59 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> on 2021/7/13 下午8:42, Richard Biener wrote:
>>> On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>
>>>> Hi Richi,
>>>>
>>>> Thanks for the comments!
>>>>
>>>> on 2021/7/13 下午5:35, Richard Biener wrote:
>>>>> On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> When I added the support for Power10 newly introduced multiply
>>>>>> highpart instrutions, I noticed that currently vectorizer
>>>>>> doesn't try to vectorize multiply highpart pattern, I hope
>>>>>> this isn't intentional?
>>>>>>
>>>>>> This patch is to extend the existing pattern mulhs handlings
>>>>>> to cover multiply highpart.  Another alternative seems to
>>>>>> recog mul_highpart operation in a general place applied for
>>>>>> scalar code when the target supports the optab for the scalar
>>>>>> operation, it's based on the assumption that one target which
>>>>>> supports vector version of multiply highpart should have the
>>>>>> scalar version.  I noticed that the function can_mult_highpart_p
>>>>>> can check/handle mult_highpart well even without mul_highpart
>>>>>> optab support, I think to recog this pattern in vectorizer
>>>>>> is better.  Is it on the right track?
>>>>>
>>>>> I think it's on the right track, using IFN_LAST is a bit awkward
>>>>> in case yet another case pops up so maybe you can use
>>>>> a code_helper instance instead which unifies tree_code,
>>>>> builtin_code and internal_fn?
>>>>>
>>>>
>>>> If there is one new requirement which doesn't have/introduce IFN
>>>> stuffs but have one existing tree_code, can we add one more field
>>>> with type tree_code, then for the IFN_LAST path we can check the
>>>> different requirements under the guard with that tree_code variable?
>>>>
>>>>> I also notice that can_mult_highpart_p will return true if
>>>>> only vec_widen_[us]mult_{even,odd,hi,lo} are available,
>>>>> but then the result might be less optimal (or even not
>>>>> handled later)?
>>>>>
>>>>
>>>> I think it will be handled always?  The expander calls
>>>>
>>>> rtx
>>>> expand_mult_highpart (machine_mode mode, rtx op0, rtx op1,
>>>>                       rtx target, bool uns_p)
>>>>
>>>> which will further check with can_mult_highpart_p.
>>>>
>>>> For the below case,
>>>>
>>>> #define SHT_CNT 16
>>>>
>>>> __attribute__ ((noipa)) void
>>>> test ()
>>>> {
>>>>   for (int i = 0; i < N; i++)
>>>>     sh_c[i] = ((SI) sh_a[i] * (SI) sh_b[i]) >> 16;
>>>> }
>>>>
>>>> Without this patch, it use widen_mult like below:
>>>>
>>>>   vect__1.5_19 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.18_24 * 1];
>>>>   vect__3.8_14 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.18_24 * 1];
>>>>   vect_patt_22.9_13 = WIDEN_MULT_LO_EXPR <vect__3.8_14, vect__1.5_19>;
>>>>   vect_patt_22.9_9 = WIDEN_MULT_HI_EXPR <vect__3.8_14, vect__1.5_19>;
>>>>   vect__6.10_25 = vect_patt_22.9_13 >> 16;
>>>>   vect__6.10_26 = vect_patt_22.9_9 >> 16;
>>>>   vect__7.11_27 = VEC_PACK_TRUNC_EXPR <vect__6.10_25, vect__6.10_26>;
>>>>   MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.18_24 * 1] = vect__7.11_27;
>>>>
>>>> .L2:
>>>>         lxvx 33,7,9
>>>>         lxvx 32,8,9
>>>>         vmulosh 13,0,1    // widen mult
>>>>         vmulesh 0,0,1
>>>>         xxmrglw 33,32,45  // merge
>>>>         xxmrghw 32,32,45
>>>>         vsraw 1,1,12      // shift
>>>>         vsraw 0,0,12
>>>>         vpkuwum 0,0,1     // pack
>>>>         stxvx 32,10,9
>>>>         addi 9,9,16
>>>>         bdnz .L2
>>>>
>>>>
>>>> With this patch, it ends up with:
>>>>
>>>>   vect__1.5_14 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.17_24 * 1];
>>>>   vect__3.8_8 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.17_24 * 1];
>>>>   vect_patt_21.9_25 = vect__3.8_8 h* vect__1.5_14;
>>>>   MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.17_24 * 1] = vect_patt_21.9_25;
>>>
>>> Yes, so I'm curious what it ends up with/without the patch on x86_64 which
>>> can do vec_widen_[us]mult_{even,odd} but not [us]mul_highpart.
>>>
>>
>> For test case:
>>
>> ```
>> #define N 32
>> typedef signed int bigType;
>> typedef signed short smallType;
>> #define SH_CNT 16
>>
>> extern smallType small_a[N], small_b[N], small_c[N];
>>
>> __attribute__((noipa)) void test_si(int n) {
>>   for (int i = 0; i < n; i++)
>>     small_c[i] = ((bigType)small_a[i] * (bigType)small_b[i]) >> SH_CNT;
>> }
>>
>> ```
>>
>> on x86_64, with option set: -mfpmath=sse -msse2 -O2 -ftree-vectorize
>>
>> 1) without this patch, the pattern isn't recognized, the IR looks like:
>>
>>   <bb 4> [local count: 94607391]:
>>   bnd.5_34 = niters.4_25 >> 3;
>>   _13 = (sizetype) bnd.5_34;
>>   _29 = _13 * 16;
>>
>>   <bb 5> [local count: 378429566]:
>>   # ivtmp.34_4 = PHI <ivtmp.34_5(5), 0(4)>
>>   vect__1.10_40 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.34_4 * 1];
>>   vect__3.13_43 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.34_4 * 1];
>>   vect_patt_18.14_44 = WIDEN_MULT_LO_EXPR <vect__1.10_40, vect__3.13_43>;
>>   vect_patt_18.14_45 = WIDEN_MULT_HI_EXPR <vect__1.10_40, vect__3.13_43>;
>>   vect__6.15_46 = vect_patt_18.14_44 >> 16;
>>   vect__6.15_47 = vect_patt_18.14_45 >> 16;
>>   vect__7.16_48 = VEC_PACK_TRUNC_EXPR <vect__6.15_46, vect__6.15_47>;
>>   MEM <vector(8) short int> [(short int *)&small_c + ivtmp.34_4 * 1] = vect__7.16_48;
>>   ivtmp.34_5 = ivtmp.34_4 + 16;
>>   if (ivtmp.34_5 != _29)
>>     goto <bb 5>; [75.00%]
>>   else
>>     goto <bb 6>; [25.00%]
>>
>> ...
>>
>> *asm*:
>>
>> .L4:
>>         movdqu  small_b(%rax), %xmm3
>>         movdqu  small_a(%rax), %xmm1
>>         addq    $16, %rax
>>         movdqu  small_a-16(%rax), %xmm2
>>         pmullw  %xmm3, %xmm1
>>         pmulhw  %xmm3, %xmm2
>>         movdqa  %xmm1, %xmm0
>>         punpcklwd       %xmm2, %xmm0
>>         punpckhwd       %xmm2, %xmm1
>>         psrad   $16, %xmm1
>>         psrad   $16, %xmm0
>>         movdqa  %xmm0, %xmm2
>>         punpcklwd       %xmm1, %xmm0
>>         punpckhwd       %xmm1, %xmm2
>>         movdqa  %xmm0, %xmm1
>>         punpckhwd       %xmm2, %xmm1
>>         punpcklwd       %xmm2, %xmm0
>>         punpcklwd       %xmm1, %xmm0
>>         movups  %xmm0, small_c-16(%rax)
>>         cmpq    %rdx, %rax
>>         jne     .L4
>>         movl    %edi, %eax
>>         andl    $-8, %eax
>>         testb   $7, %dil
>>         je      .L14
>>
>> *insn dist in loop*
>>
>>       1 addq
>>       3 movdqa
>>       3 movdqu
>>       1 movups
>>       1 pmulhw
>>       1 pmullw
>>       2 psrad
>>       3 punpckhwd
>>       4 punpcklwd
>>
>>
>> 2) with this patch but make the mul_highpart optab query return false, the IR looks
>> like:
>>
>>   <bb 4> [local count: 94607391]:
>>   bnd.5_37 = niters.4_22 >> 3;
>>   _13 = (sizetype) bnd.5_37;
>>   _32 = _13 * 16;
>>
>>   <bb 5> [local count: 378429566]:
>>   # ivtmp.33_4 = PHI <ivtmp.33_5(5), 0(4)>
>>   vect__1.10_43 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.33_4 * 1];
>>   vect__3.13_46 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.33_4 * 1];
>>   vect_patt_26.14_47 = vect__1.10_43 h* vect__3.13_46;
>>   MEM <vector(8) short int> [(short int *)&small_c + ivtmp.33_4 * 1] = vect_patt_26.14_47;
>>   ivtmp.33_5 = ivtmp.33_4 + 16;
>>   if (ivtmp.33_5 != _32)
>>     goto <bb 5>; [75.00%]
>>   else
>>     goto <bb 6>; [25.00%]
>>
>> *asm*:
>>
>> .L4:
>>         movdqu  small_b(%rax), %xmm3
>>         movdqu  small_a(%rax), %xmm1
>>         addq    $16, %rax
>>         movdqu  small_a-16(%rax), %xmm2
>>         pmullw  %xmm3, %xmm1
>>         pmulhw  %xmm3, %xmm2
>>         movdqa  %xmm1, %xmm0
>>         punpcklwd       %xmm2, %xmm0
>>         punpckhwd       %xmm2, %xmm1
>>         movdqa  %xmm0, %xmm2
>>         punpcklwd       %xmm1, %xmm0
>>         punpckhwd       %xmm1, %xmm2
>>         movdqa  %xmm0, %xmm1
>>         punpckhwd       %xmm2, %xmm1
>>         punpcklwd       %xmm2, %xmm0
>>         punpckhwd       %xmm1, %xmm0
>>         movups  %xmm0, small_c-16(%rax)
>>         cmpq    %rdx, %rax
>>         jne     .L4
>>         movl    %edi, %eax
>>         andl    $-8, %eax
>>         testb   $7, %dil
>>         je      .L14
>>
>> *insn dist*:
>>
>>       1 addq
>>       3 movdqa
>>       3 movdqu
>>       1 movups
>>       1 pmulhw
>>       1 pmullw
>>       4 punpckhwd
>>       3 punpcklwd
>>
>> I know little on i386 asm, but this seems better from insn distribution,
>> the one without this patch uses two more psrad (s).
> 
> So the advantage of 1) would be more appropriate costing in the vectorizer
> (x86 does not have a native vector highpart multiply).
> 
>> 3) FWIW with this patch as well as existing optab supports, the IR looks like:
>>
>>   <bb 4> [local count: 94607391]:
>>   bnd.5_40 = niters.4_19 >> 3;
>>   _75 = (sizetype) bnd.5_40;
>>   _91 = _75 * 16;
>>
>>   <bb 5> [local count: 378429566]:
>>   # ivtmp.48_53 = PHI <ivtmp.48_45(5), 0(4)>
>>   vect__1.10_48 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.48_53 * 1];
>>   vect__3.13_51 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.48_53 * 1];
>>   vect_patt_26.14_52 = vect__1.10_48 h* vect__3.13_51;
>>   MEM <vector(8) short int> [(short int *)&small_c + ivtmp.48_53 * 1] = vect_patt_26.14_52;
>>   ivtmp.48_45 = ivtmp.48_53 + 16;
>>   if (ivtmp.48_45 != _91)
>>     goto <bb 5>; [75.00%]
>>   else
>>     goto <bb 6>; [25.00%]
>>
>>   <bb 6> [local count: 94607391]:
>>   niters_vector_mult_vf.6_41 = niters.4_19 & 4294967288;
>>   tmp.7_42 = (int) niters_vector_mult_vf.6_41;
>>   _61 = niters.4_19 & 7;
>>   if (_61 == 0)
>>     goto <bb 12>; [12.50%]
>>   else
>>     goto <bb 7>; [87.50%]
>>
>>   <bb 7> [local count: 93293400]:
>>   # i_38 = PHI <tmp.7_42(6), 0(3)>
>>   # _44 = PHI <niters_vector_mult_vf.6_41(6), 0(3)>
>>   niters.18_60 = niters.4_19 - _44;
>>   _76 = niters.18_60 + 4294967295;
>>   if (_76 <= 2)
>>     goto <bb 9>; [10.00%]
>>   else
>>     goto <bb 8>; [90.00%]
>>
>>   <bb 8> [local count: 167928121]:
>>   _85 = (sizetype) _44;
>>   _86 = _85 * 2;
>>   vectp_small_a.23_84 = &small_a + _86;
>>   vectp_small_b.26_90 = &small_b + _86;
>>   vectp_small_c.31_98 = &small_c + _86;
>>   vect__9.24_89 = MEM <vector(4) short int> [(short int *)vectp_small_a.23_84];
>>   vect__28.27_95 = MEM <vector(4) short int> [(short int *)vectp_small_b.26_90];
>>   vect_patt_23.28_96 = vect__9.24_89 h* vect__28.27_95;
>>   MEM <vector(4) short int> [(short int *)vectp_small_c.31_98] = vect_patt_23.28_96;
>>   niters_vector_mult_vf.20_80 = niters.18_60 & 4294967292;
>>   _82 = (int) niters_vector_mult_vf.20_80;
>>   tmp.21_81 = i_38 + _82;
>>   _74 = niters.18_60 & 3;
>>   if (_74 == 0)
>>     goto <bb 11>; [25.00%]
>>   else
>>     goto <bb 9>; [75.00%]
>>
>> ...
>>
>> *asm*:
>>
>> .L4:
>>         movdqu  small_a(%rax), %xmm0
>>         movdqu  small_b(%rax), %xmm2
>>         addq    $16, %rax
>>         pmulhw  %xmm2, %xmm0
>>         movups  %xmm0, small_c-16(%rax)
>>         cmpq    %rdx, %rax
>>         jne     .L4
>>         movl    %ecx, %edx
>>         andl    $-8, %edx
>>         movl    %edx, %eax
>>         testb   $7, %cl
>>         je      .L19
>> .L3:
>>         movl    %ecx, %esi
>>         subl    %edx, %esi
>>         leal    -1(%rsi), %edi
>>         cmpl    $2, %edi
>>         jbe     .L6
>>         movq    small_a(%rdx,%rdx), %xmm0
>>         movq    small_b(%rdx,%rdx), %xmm1
>>         pmulhw  %xmm1, %xmm0
>>         movq    %xmm0, small_c(%rdx,%rdx)
>>         movl    %esi, %edx
>>         andl    $-4, %edx
>>         addl    %edx, %eax
>>         andl    $3, %esi
>>         je      .L1
> 
> Oh, so indeed x86 has a vector highpart multiply, not grep-friendly
> macroized as <s>mul<mode>3_highpart ;)
> 
>> I guess the proposed IFN would be directly mapped for [us]mul_highpart?
> 
> Yes.
> 

Thanks for confirming!  The related patch v2 is attached and the testing
is ongoing.
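
To make the three shapes concrete, here is a hand-written illustration (not
taken from the patch's testsuite; all names are made up) of the statements
the extended pattern is meant to recognize once they sit in a vectorizable
loop, using a 16-bit input type and a 32-bit product:

```
#define N 1024
typedef signed short T; /* narrow type, 16 bits */
typedef signed int W;   /* wide type, 32 bits */

T a[N], b[N], r1[N], r2[N], r3[N];

void
shapes (void)
{
  for (int i = 0; i < N; i++)
    {
      /* 1) multiply high with scaling: shift by 32/2 - 1 = 15 -> IFN_MULHS.  */
      r1[i] = ((W) a[i] * (W) b[i]) >> 15;
      /* 2) ... with rounding: shift by 32/2 - 2 = 14, then round -> IFN_MULHRS.  */
      r2[i] = ((((W) a[i] * (W) b[i]) >> 14) + 1) >> 1;
      /* 3) normal multiply high: shift by 32/2 = 16 -> the new IFN_MULH.  */
      r3[i] = ((W) a[i] * (W) b[i]) >> 16;
    }
}
```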

>> or do you expect it will also cover what we support in can_mult_highpart_p?
> 
> See above - since we manage to use widen_mult we should expose this
> detail for the purpose of costing.  So the pattern would directly ask for
> an optab IFN mapping to [us]mul_highpart.
> 

OK, the costing issue seems to exist already, since vect_recog_divmod_pattern
checks can_mult_highpart_p and generates a mul_highpart in some paths, which
probably ends up as widen_mult_{odd,even,lo,hi}.  I guess we can fix the
related costing routine by querying can_mult_highpart_p and pricing the
mul_highpart higher when it has to be implemented with widen_mult_{odd,...}.
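
For example (illustrative only, not an existing testcase), a plain division
by a constant already takes that path today, since vect_recog_divmod_pattern
introduces the highpart multiply itself:

```
#define N 1024
unsigned short a[N], b[N];

void
div_by_const (void)
{
  for (int i = 0; i < N; i++)
    /* Strength-reduced into a highpart multiply plus shifts when
       can_mult_highpart_p allows it for the vector type.  */
    b[i] = a[i] / 7;
}
```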

btw, IIUC Richard S. proposed generating these widen_mult_{odd,even,lo,hi}
in gimple rather than at expand time; since a vector permute is involved, it
seems we have to generate them during transforming?  Otherwise the
mul_highpart in scalar code, like the one from divmod pattern recog, seems
hard to expand?

BR,
Kewen
-----
gcc/ChangeLog:

	* internal-fn.c (first_commutative_argument): Add info for IFN_MULH.
	* internal-fn.def (IFN_MULH): New internal function.
	* tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
	recog normal multiply highpart as IFN_MULH.

[-- Attachment #2: vect-recog-hmul-v2.patch --]
[-- Type: text/plain, Size: 4656 bytes --]

---
 gcc/internal-fn.c        |  1 +
 gcc/internal-fn.def      |  2 ++
 gcc/tree-vect-patterns.c | 45 +++++++++++++++++++++++++++-------------
 3 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index fb8b43d1ce2..b1b4289357c 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -3703,6 +3703,7 @@ first_commutative_argument (internal_fn fn)
     case IFN_FNMS:
     case IFN_AVG_FLOOR:
     case IFN_AVG_CEIL:
+    case IFN_MULH:
     case IFN_MULHS:
     case IFN_MULHRS:
     case IFN_FMIN:
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index c3b8e730960..ed6d7de1680 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -169,6 +169,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_FLOOR, ECF_CONST | ECF_NOTHROW, first,
 DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL, ECF_CONST | ECF_NOTHROW, first,
 			      savg_ceil, uavg_ceil, binary)
 
+DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST | ECF_NOTHROW, first,
+			      smul_highpart, umul_highpart, binary)
 DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST | ECF_NOTHROW, first,
 			      smulhs, umulhs, binary)
 DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST | ECF_NOTHROW, first,
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index b2e7fc2cc7a..4581cc6b51c 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -1896,8 +1896,15 @@ vect_recog_over_widening_pattern (vec_info *vinfo,
 
    1) Multiply high with scaling
      TYPE res = ((TYPE) a * (TYPE) b) >> c;
+     Here, c is bitsize (TYPE) / 2 - 1.
+
    2) ... or also with rounding
      TYPE res = (((TYPE) a * (TYPE) b) >> d + 1) >> 1;
+     Here, d is bitsize (TYPE) / 2 - 2.
+
+   3) Normal multiply high
+     TYPE res = ((TYPE) a * (TYPE) b) >> e;
+     Here, e is bitsize (TYPE) / 2.
 
    where only the bottom half of res is used.  */
 
@@ -1942,7 +1949,6 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
   stmt_vec_info mulh_stmt_info;
   tree scale_term;
   internal_fn ifn;
-  unsigned int expect_offset;
 
   /* Check for the presence of the rounding term.  */
   if (gimple_assign_rhs_code (rshift_input_stmt) == PLUS_EXPR)
@@ -1991,25 +1997,37 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
 
       /* Get the scaling term.  */
       scale_term = gimple_assign_rhs2 (plus_input_stmt);
+      /* Check that the scaling factor is correct.  */
+      if (TREE_CODE (scale_term) != INTEGER_CST)
+	return NULL;
+
+      /* Check pattern 2).  */
+      if (wi::to_widest (scale_term) + target_precision + 2
+	  != TYPE_PRECISION (lhs_type))
+	return NULL;
 
-      expect_offset = target_precision + 2;
       ifn = IFN_MULHRS;
     }
   else
     {
       mulh_stmt_info = rshift_input_stmt_info;
       scale_term = gimple_assign_rhs2 (last_stmt);
+      /* Check that the scaling factor is correct.  */
+      if (TREE_CODE (scale_term) != INTEGER_CST)
+	return NULL;
 
-      expect_offset = target_precision + 1;
-      ifn = IFN_MULHS;
+      /* Check for pattern 1).  */
+      if (wi::to_widest (scale_term) + target_precision + 1
+	  == TYPE_PRECISION (lhs_type))
+	ifn = IFN_MULHS;
+      /* Check for pattern 3).  */
+      else if (wi::to_widest (scale_term) + target_precision
+	       == TYPE_PRECISION (lhs_type))
+	ifn = IFN_MULH;
+      else
+	return NULL;
     }
 
-  /* Check that the scaling factor is correct.  */
-  if (TREE_CODE (scale_term) != INTEGER_CST
-      || wi::to_widest (scale_term) + expect_offset
-	   != TYPE_PRECISION (lhs_type))
-    return NULL;
-
   /* Check whether the scaling input term can be seen as two widened
      inputs multiplied together.  */
   vect_unpromoted_value unprom_mult[2];
@@ -2030,8 +2048,7 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
   /* Check for target support.  */
   tree new_vectype = get_vectype_for_scalar_type (vinfo, new_type);
   if (!new_vectype
-      || !direct_internal_fn_supported_p
-	    (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
+      || !direct_internal_fn_supported_p (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
     return NULL;
 
   /* The IR requires a valid vector type for the cast result, even though
@@ -2043,8 +2060,8 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
   /* Generate the IFN_MULHRS call.  */
   tree new_var = vect_recog_temp_ssa_var (new_type, NULL);
   tree new_ops[2];
-  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type,
-		       unprom_mult, new_vectype);
+  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type, unprom_mult,
+		       new_vectype);
   gcall *mulhrs_stmt
     = gimple_build_call_internal (ifn, 2, new_ops[0], new_ops[1]);
   gimple_call_set_lhs (mulhrs_stmt, new_var);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC/PATCH] vect: Recog mul_highpart pattern
  2021-07-14  7:45           ` Kewen.Lin
@ 2021-07-14  8:38             ` Richard Sandiford
  2021-07-14 10:02               ` Kewen.Lin
  2021-07-15  7:06             ` [PATCH v3] " Kewen.Lin
  1 sibling, 1 reply; 31+ messages in thread
From: Richard Sandiford @ 2021-07-14  8:38 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: Richard Biener, GCC Patches, Segher Boessenkool, Bill Schmidt

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> gcc/ChangeLog:
>
> 	* internal-fn.c (first_commutative_argument): Add info for IFN_MULH.
> 	* internal-fn.def (IFN_MULH): New internal function.
> 	* tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
> 	recog normal multiply highpart as IFN_MULH.

LGTM FWIW, although:

> @@ -2030,8 +2048,7 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
>    /* Check for target support.  */
>    tree new_vectype = get_vectype_for_scalar_type (vinfo, new_type);
>    if (!new_vectype
> -      || !direct_internal_fn_supported_p
> -	    (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
> +      || !direct_internal_fn_supported_p (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
>      return NULL;
>  
>    /* The IR requires a valid vector type for the cast result, even though
> @@ -2043,8 +2060,8 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
>    /* Generate the IFN_MULHRS call.  */
>    tree new_var = vect_recog_temp_ssa_var (new_type, NULL);
>    tree new_ops[2];
> -  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type,
> -		       unprom_mult, new_vectype);
> +  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type, unprom_mult,
> +		       new_vectype);
>    gcall *mulhrs_stmt
>      = gimple_build_call_internal (ifn, 2, new_ops[0], new_ops[1]);
>    gimple_call_set_lhs (mulhrs_stmt, new_var);

…these changes look like formatting only.  (I guess it's down to whether
or not the 80th column should be kept free for an “end of line+1” cursor.)

Thanks,
Richard

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC/PATCH] vect: Recog mul_highpart pattern
  2021-07-14  8:38             ` Richard Sandiford
@ 2021-07-14 10:02               ` Kewen.Lin
  2021-07-14 11:32                 ` Richard Sandiford
  0 siblings, 1 reply; 31+ messages in thread
From: Kewen.Lin @ 2021-07-14 10:02 UTC (permalink / raw)
  To: richard.sandiford
  Cc: Richard Biener, GCC Patches, Segher Boessenkool, Bill Schmidt

Hi Richard,

on 2021/7/14 4:38 PM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> gcc/ChangeLog:
>>
>> 	* internal-fn.c (first_commutative_argument): Add info for IFN_MULH.
>> 	* internal-fn.def (IFN_MULH): New internal function.
>> 	* tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
>> 	recog normal multiply highpart as IFN_MULH.
> 
> LGTM FWIW, although:
> 

Thanks for the review!

>> @@ -2030,8 +2048,7 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
>>    /* Check for target support.  */
>>    tree new_vectype = get_vectype_for_scalar_type (vinfo, new_type);
>>    if (!new_vectype
>> -      || !direct_internal_fn_supported_p
>> -	    (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
>> +      || !direct_internal_fn_supported_p (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
>>      return NULL;
>>  
>>    /* The IR requires a valid vector type for the cast result, even though
>> @@ -2043,8 +2060,8 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
>>    /* Generate the IFN_MULHRS call.  */
>>    tree new_var = vect_recog_temp_ssa_var (new_type, NULL);
>>    tree new_ops[2];
>> -  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type,
>> -		       unprom_mult, new_vectype);
>> +  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type, unprom_mult,
>> +		       new_vectype);
>>    gcall *mulhrs_stmt
>>      = gimple_build_call_internal (ifn, 2, new_ops[0], new_ops[1]);
>>    gimple_call_set_lhs (mulhrs_stmt, new_var);
> 
> …these changes look like formatting only.  (I guess it's down to whether
> or not the 80th column should be kept free for an “end of line+1” cursor.)
> 

Yeah, just for formatting, the formatting tool (clang-format) reformatted
them.  Thanks for the information on "end of line+1" cursor, I didn't know
that before.  I guess you prefer me to keep the original format?  If so I
will remove them when committing it.  I was thinking whether I should change
field ColumnLimit of my .clang-format to 79 to avoid this kind of case to
be caught by formatting tool again.  Hope reviewers won't nit-pick the exact
80 column cases then. :)

BR,
Kewen

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC/PATCH] vect: Recog mul_highpart pattern
  2021-07-14 10:02               ` Kewen.Lin
@ 2021-07-14 11:32                 ` Richard Sandiford
  2021-07-14 19:32                   ` Segher Boessenkool
  2021-07-15  1:37                   ` Kewen.Lin
  0 siblings, 2 replies; 31+ messages in thread
From: Richard Sandiford @ 2021-07-14 11:32 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: Richard Biener, GCC Patches, Segher Boessenkool, Bill Schmidt

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi Richard,
>
> on 2021/7/14 下午4:38, Richard Sandiford wrote:
>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>> gcc/ChangeLog:
>>>
>>> 	* internal-fn.c (first_commutative_argument): Add info for IFN_MULH.
>>> 	* internal-fn.def (IFN_MULH): New internal function.
>>> 	* tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
>>> 	recog normal multiply highpart as IFN_MULH.
>> 
>> LGTM FWIW, although:
>> 
>
> Thanks for the review!
>
>>> @@ -2030,8 +2048,7 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
>>>    /* Check for target support.  */
>>>    tree new_vectype = get_vectype_for_scalar_type (vinfo, new_type);
>>>    if (!new_vectype
>>> -      || !direct_internal_fn_supported_p
>>> -	    (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
>>> +      || !direct_internal_fn_supported_p (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
>>>      return NULL;
>>>  
>>>    /* The IR requires a valid vector type for the cast result, even though
>>> @@ -2043,8 +2060,8 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
>>>    /* Generate the IFN_MULHRS call.  */
>>>    tree new_var = vect_recog_temp_ssa_var (new_type, NULL);
>>>    tree new_ops[2];
>>> -  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type,
>>> -		       unprom_mult, new_vectype);
>>> +  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type, unprom_mult,
>>> +		       new_vectype);
>>>    gcall *mulhrs_stmt
>>>      = gimple_build_call_internal (ifn, 2, new_ops[0], new_ops[1]);
>>>    gimple_call_set_lhs (mulhrs_stmt, new_var);
>> 
>> …these changes look like formatting only.  (I guess it's down to whether
>> or not the 80th column should be kept free for an “end of line+1” cursor.)
>> 
>
> Yeah, just for formatting, the formatting tool (clang-format) reformatted
> them.  Thanks for the information on "end of line+1" cursor, I didn't know
> that before.  I guess you prefer me to keep the original format?  If so I
> will remove them when committing it.  I was thinking whether I should change
> field ColumnLimit of my .clang-format to 79 to avoid this kind of case to
> be caught by formatting tool again.  Hope reviewers won't nit-pick the exact
> 80 column cases then. :)

TBH, 79 vs. 80 isn't normally something I'd worry about when reviewing
new code.  But I know in the past people have asked for 79 to be used
for the “end+1” reason, so I don't think we should “fix” existing code
that honours the 79 limit so that it no longer does, especially when the
lines surrounding the code aren't changing.

There's also a risk of yo-yo-ing if someone else is using clang-format
and does have the limit set to 79 columns.

So yeah, I think it'd better to commit without the two hunks above.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] rs6000: Support [u]mul<mode>3_highpart for vector
  2021-07-14  2:12     ` Kewen.Lin
@ 2021-07-14 18:38       ` Segher Boessenkool
  0 siblings, 0 replies; 31+ messages in thread
From: Segher Boessenkool @ 2021-07-14 18:38 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: GCC Patches, Bill Schmidt, David Edelsohn

On Wed, Jul 14, 2021 at 10:12:46AM +0800, Kewen.Lin wrote:
> on 2021/7/14 上午6:07, Segher Boessenkool wrote:
> > Hi!
> > 
> > On Tue, Jul 13, 2021 at 04:58:42PM +0800, Kewen.Lin wrote:
> >> This patch is to make Power10 newly introduced vector
> >> multiply high (part) instructions exploited in vectorized
> >> loops, it renames existing define_insns as standard pattern
> >> names.  It depends on that patch which enables vectorizer
> >> to recog mul_highpart.
> > 
> > It actually is correct already, it will just not be used yet, right?
> 
> Yes, the names are just not standard.  :)

I meant after this patch is applied :-)

Doesn't change much though -- applying it right now is fine, but you can
wait for the generic code to get in first, to make the new tests not
fail.


Segher

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC/PATCH] vect: Recog mul_highpart pattern
  2021-07-14 11:32                 ` Richard Sandiford
@ 2021-07-14 19:32                   ` Segher Boessenkool
  2021-07-15  1:40                     ` Kewen.Lin
  2021-07-15  1:37                   ` Kewen.Lin
  1 sibling, 1 reply; 31+ messages in thread
From: Segher Boessenkool @ 2021-07-14 19:32 UTC (permalink / raw)
  To: Kewen.Lin, Richard Biener, GCC Patches, Bill Schmidt, richard.sandiford

On Wed, Jul 14, 2021 at 12:32:24PM +0100, Richard Sandiford wrote:
> TBH, 79 vs. 80 isn't normally something I'd worry about when reviewing
> new code.  But I know in the past people have asked for 79 to be used
> for the “end+1” reason, so I don't think we should “fix” existing code
> that honours the 79 limit so that it no longer does, especially when the
> lines surrounding the code aren't changing.

The normal rule is you cannot go over 80.  It is perfectly fine to have
shorter lines, certainly if that is nice for some other reason, so
automatically (by some tool) changing this is Just Wrong.


Segher

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC/PATCH] vect: Recog mul_highpart pattern
  2021-07-14 11:32                 ` Richard Sandiford
  2021-07-14 19:32                   ` Segher Boessenkool
@ 2021-07-15  1:37                   ` Kewen.Lin
  1 sibling, 0 replies; 31+ messages in thread
From: Kewen.Lin @ 2021-07-15  1:37 UTC (permalink / raw)
  To: richard.sandiford
  Cc: Richard Biener, GCC Patches, Segher Boessenkool, Bill Schmidt

on 2021/7/14 7:32 PM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> Hi Richard,
>>
>> on 2021/7/14 下午4:38, Richard Sandiford wrote:
>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>> gcc/ChangeLog:
>>>>
>>>> 	* internal-fn.c (first_commutative_argument): Add info for IFN_MULH.
>>>> 	* internal-fn.def (IFN_MULH): New internal function.
>>>> 	* tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
>>>> 	recog normal multiply highpart as IFN_MULH.
>>>
>>> LGTM FWIW, although:
>>>
>>
>> Thanks for the review!
>>
>>>> @@ -2030,8 +2048,7 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
>>>>    /* Check for target support.  */
>>>>    tree new_vectype = get_vectype_for_scalar_type (vinfo, new_type);
>>>>    if (!new_vectype
>>>> -      || !direct_internal_fn_supported_p
>>>> -	    (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
>>>> +      || !direct_internal_fn_supported_p (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
>>>>      return NULL;
>>>>  
>>>>    /* The IR requires a valid vector type for the cast result, even though
>>>> @@ -2043,8 +2060,8 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
>>>>    /* Generate the IFN_MULHRS call.  */
>>>>    tree new_var = vect_recog_temp_ssa_var (new_type, NULL);
>>>>    tree new_ops[2];
>>>> -  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type,
>>>> -		       unprom_mult, new_vectype);
>>>> +  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type, unprom_mult,
>>>> +		       new_vectype);
>>>>    gcall *mulhrs_stmt
>>>>      = gimple_build_call_internal (ifn, 2, new_ops[0], new_ops[1]);
>>>>    gimple_call_set_lhs (mulhrs_stmt, new_var);
>>>
>>> …these changes look like formatting only.  (I guess it's down to whether
>>> or not the 80th column should be kept free for an “end of line+1” cursor.)
>>>
>>
>> Yeah, just for formatting, the formatting tool (clang-format) reformatted
>> them.  Thanks for the information on "end of line+1" cursor, I didn't know
>> that before.  I guess you prefer me to keep the original format?  If so I
>> will remove them when committing it.  I was thinking whether I should change
>> field ColumnLimit of my .clang-format to 79 to avoid this kind of case to
>> be caught by formatting tool again.  Hope reviewers won't nit-pick the exact
>> 80 column cases then. :)
> 
> TBH, 79 vs. 80 isn't normally something I'd worry about when reviewing
> new code.  But I know in the past people have asked for 79 to be used
> for the “end+1” reason, so I don't think we should “fix” existing code
> that honours the 79 limit so that it no longer does, especially when the
> lines surrounding the code aren't changing.
> 

Thanks for the explanation!  Agree.

> There's also a risk of yo-yo-ing if someone else is using clang-format
> and does have the limit set to 79 columns.
> 
> So yeah, I think it'd better to commit without the two hunks above.
> 

Will fix them.  Thanks for catching this and for the explanations!

BR,
Kewen

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC/PATCH] vect: Recog mul_highpart pattern
  2021-07-14 19:32                   ` Segher Boessenkool
@ 2021-07-15  1:40                     ` Kewen.Lin
  2021-07-15 23:08                       ` Segher Boessenkool
  0 siblings, 1 reply; 31+ messages in thread
From: Kewen.Lin @ 2021-07-15  1:40 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Richard Biener, GCC Patches, Bill Schmidt, richard.sandiford

on 2021/7/15 3:32 AM, Segher Boessenkool wrote:
> On Wed, Jul 14, 2021 at 12:32:24PM +0100, Richard Sandiford wrote:
>> TBH, 79 vs. 80 isn't normally something I'd worry about when reviewing
>> new code.  But I know in the past people have asked for 79 to be used
>> for the “end+1” reason, so I don't think we should “fix” existing code
>> that honours the 79 limit so that it no longer does, especially when the
>> lines surrounding the code aren't changing.
> 
> The normal rule is you cannot go over 80.  It is perfectly fine to have
> shorter lines, certainly if that is nice for some other reason, so
> automatically (by some tool) changing this is Just Wrong.
> 

OK, could this be applied to the changelog entry too?  I guess yes?

BR,
Kewen

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3] vect: Recog mul_highpart pattern
  2021-07-14  7:45           ` Kewen.Lin
  2021-07-14  8:38             ` Richard Sandiford
@ 2021-07-15  7:06             ` Kewen.Lin
  2021-07-15  7:17               ` Uros Bizjak
  1 sibling, 1 reply; 31+ messages in thread
From: Kewen.Lin @ 2021-07-15  7:06 UTC (permalink / raw)
  To: Richard Biener
  Cc: Richard Sandiford, Bill Schmidt, GCC Patches, Segher Boessenkool,
	Uros Bizjak

[-- Attachment #1: Type: text/plain, Size: 2242 bytes --]

on 2021/7/14 3:45 PM, Kewen.Lin via Gcc-patches wrote:
> on 2021/7/14 下午2:38, Richard Biener wrote:
>> On Tue, Jul 13, 2021 at 4:59 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>
>>> on 2021/7/13 下午8:42, Richard Biener wrote:
>>>> On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>>> I guess the proposed IFN would be directly mapped for [us]mul_highpart?
>>
>> Yes.
>>
> 
> Thanks for confirming!  The related patch v2 is attached and the testing
> is ongoing.
> 

It's bootstrapped & regtested on powerpc64le-linux-gnu P9 and
aarch64-linux-gnu.  But on x86_64-redhat-linux there are XPASSes as below:

XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhuw
XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhuw
XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhw
XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhw

They weren't exposed in the testing run with the previous patch which
doesn't use IFN way.  By investigating it, the difference comes from
the different costing on MULT_HIGHPART_EXPR and IFN_MULH.

For MULT_HIGHPART_EXPR, it's costed by 16 from below call:

	case MULT_EXPR:
	case WIDEN_MULT_EXPR:
	case MULT_HIGHPART_EXPR:
	  stmt_cost = ix86_multiplication_cost (ix86_cost, mode);

While for IFN_MULH, it's costed by 4 as normal stmt so the total cost
becomes profitable and the expected vectorization happens.

One conservative fix seems to make IFN_MULH costing go through the
unique cost interface for multiplication, that is:

      case CFN_MULH:
        stmt_cost = ix86_multiplication_cost (ix86_cost, mode);
        break;

As the test case marks the checks as "xfail", probably it's good to
revisit the costing on mul_highpart to ensure it's not priced more.

The attached patch also addressed Richard S.'s review comments on
two reformatting hunks.  Is it ok for trunk?

BR,
Kewen
-----
gcc/ChangeLog:

	* internal-fn.c (first_commutative_argument): Add info for IFN_MULH.
	* internal-fn.def (IFN_MULH): New internal function.
	* tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
	recog normal multiply highpart as IFN_MULH.
	* config/i386/i386.c (ix86_add_stmt_cost): Adjust for combined
	function CFN_MULH.

[-- Attachment #2: vect-recog-hmul-v3.patch --]
[-- Type: text/plain, Size: 4150 bytes --]

---
 gcc/config/i386/i386.c   |  3 +++
 gcc/internal-fn.c        |  1 +
 gcc/internal-fn.def      |  2 ++
 gcc/tree-vect-patterns.c | 38 ++++++++++++++++++++++++++++----------
 4 files changed, 34 insertions(+), 10 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index a93128fa0a4..1dd9108353c 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -22559,6 +22559,9 @@ ix86_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 				   mode == SFmode ? ix86_cost->fmass
 				   : ix86_cost->fmasd);
 	break;
+      case CFN_MULH:
+	stmt_cost = ix86_multiplication_cost (ix86_cost, mode);
+	break;
       default:
 	break;
       }
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index fb8b43d1ce2..b1b4289357c 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -3703,6 +3703,7 @@ first_commutative_argument (internal_fn fn)
     case IFN_FNMS:
     case IFN_AVG_FLOOR:
     case IFN_AVG_CEIL:
+    case IFN_MULH:
     case IFN_MULHS:
     case IFN_MULHRS:
     case IFN_FMIN:
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index c3b8e730960..ed6d7de1680 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -169,6 +169,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_FLOOR, ECF_CONST | ECF_NOTHROW, first,
 DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL, ECF_CONST | ECF_NOTHROW, first,
 			      savg_ceil, uavg_ceil, binary)
 
+DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST | ECF_NOTHROW, first,
+			      smul_highpart, umul_highpart, binary)
 DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST | ECF_NOTHROW, first,
 			      smulhs, umulhs, binary)
 DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST | ECF_NOTHROW, first,
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index b2e7fc2cc7a..ada89d7060b 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -1896,8 +1896,15 @@ vect_recog_over_widening_pattern (vec_info *vinfo,
 
    1) Multiply high with scaling
      TYPE res = ((TYPE) a * (TYPE) b) >> c;
+     Here, c is bitsize (TYPE) / 2 - 1.
+
    2) ... or also with rounding
      TYPE res = (((TYPE) a * (TYPE) b) >> d + 1) >> 1;
+     Here, d is bitsize (TYPE) / 2 - 2.
+
+   3) Normal multiply high
+     TYPE res = ((TYPE) a * (TYPE) b) >> e;
+     Here, e is bitsize (TYPE) / 2.
 
    where only the bottom half of res is used.  */
 
@@ -1942,7 +1949,6 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
   stmt_vec_info mulh_stmt_info;
   tree scale_term;
   internal_fn ifn;
-  unsigned int expect_offset;
 
   /* Check for the presence of the rounding term.  */
   if (gimple_assign_rhs_code (rshift_input_stmt) == PLUS_EXPR)
@@ -1991,25 +1997,37 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
 
       /* Get the scaling term.  */
       scale_term = gimple_assign_rhs2 (plus_input_stmt);
+      /* Check that the scaling factor is correct.  */
+      if (TREE_CODE (scale_term) != INTEGER_CST)
+	return NULL;
+
+      /* Check pattern 2).  */
+      if (wi::to_widest (scale_term) + target_precision + 2
+	  != TYPE_PRECISION (lhs_type))
+	return NULL;
 
-      expect_offset = target_precision + 2;
       ifn = IFN_MULHRS;
     }
   else
     {
       mulh_stmt_info = rshift_input_stmt_info;
       scale_term = gimple_assign_rhs2 (last_stmt);
+      /* Check that the scaling factor is correct.  */
+      if (TREE_CODE (scale_term) != INTEGER_CST)
+	return NULL;
 
-      expect_offset = target_precision + 1;
-      ifn = IFN_MULHS;
+      /* Check for pattern 1).  */
+      if (wi::to_widest (scale_term) + target_precision + 1
+	  == TYPE_PRECISION (lhs_type))
+	ifn = IFN_MULHS;
+      /* Check for pattern 3).  */
+      else if (wi::to_widest (scale_term) + target_precision
+	       == TYPE_PRECISION (lhs_type))
+	ifn = IFN_MULH;
+      else
+	return NULL;
     }
 
-  /* Check that the scaling factor is correct.  */
-  if (TREE_CODE (scale_term) != INTEGER_CST
-      || wi::to_widest (scale_term) + expect_offset
-	   != TYPE_PRECISION (lhs_type))
-    return NULL;
-
   /* Check whether the scaling input term can be seen as two widened
      inputs multiplied together.  */
   vect_unpromoted_value unprom_mult[2];
-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3] vect: Recog mul_highpart pattern
  2021-07-15  7:06             ` [PATCH v3] " Kewen.Lin
@ 2021-07-15  7:17               ` Uros Bizjak
  2021-07-15  8:04                 ` Kewen.Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Uros Bizjak @ 2021-07-15  7:17 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Richard Biener, Richard Sandiford, Bill Schmidt, GCC Patches,
	Segher Boessenkool

On Thu, Jul 15, 2021 at 9:07 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> on 2021/7/14 下午3:45, Kewen.Lin via Gcc-patches wrote:
> > on 2021/7/14 下午2:38, Richard Biener wrote:
> >> On Tue, Jul 13, 2021 at 4:59 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>>
> >>> on 2021/7/13 下午8:42, Richard Biener wrote:
> >>>> On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>
> >>> I guess the proposed IFN would be directly mapped for [us]mul_highpart?
> >>
> >> Yes.
> >>
> >
> > Thanks for confirming!  The related patch v2 is attached and the testing
> > is ongoing.
> >
>
> It's bootstrapped & regtested on powerpc64le-linux-gnu P9 and
> aarch64-linux-gnu.  But on x86_64-redhat-linux there are XPASSes as below:
>
> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhuw
> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhuw
> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhw
> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhw

These XFAILs should be removed after your patch.

This is PR100696 [1], we want PMULH.W here, so x86 part of the patch
is actually not needed.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100696

Uros.

> They weren't exposed in the testing run with the previous patch which
> doesn't use IFN way.  By investigating it, the difference comes from
> the different costing on MULT_HIGHPART_EXPR and IFN_MULH.
>
> For MULT_HIGHPART_EXPR, it's costed by 16 from below call:
>
>         case MULT_EXPR:
>         case WIDEN_MULT_EXPR:
>         case MULT_HIGHPART_EXPR:
>           stmt_cost = ix86_multiplication_cost (ix86_cost, mode);
>
> While for IFN_MULH, it's costed by 4 as normal stmt so the total cost
> becomes profitable and the expected vectorization happens.
>
> One conservative fix seems to make IFN_MULH costing go through the
> unique cost interface for multiplication, that is:
>
>       case CFN_MULH:
>         stmt_cost = ix86_multiplication_cost (ix86_cost, mode);
>         break;
>
> As the test case marks the checks as "xfail", probably it's good to
> revisit the costing on mul_highpart to ensure it's not priced more.
>
> The attached patch also addressed Richard S.'s review comments on
> two reformatting hunks.  Is it ok for trunk?
>
> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
>         * internal-fn.c (first_commutative_argument): Add info for IFN_MULH.
>         * internal-fn.def (IFN_MULH): New internal function.
>         * tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
>         recog normal multiply highpart as IFN_MULH.
>         * config/i386/i386.c (ix86_add_stmt_cost): Adjust for combined
>         function CFN_MULH.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3] vect: Recog mul_highpart pattern
  2021-07-15  7:17               ` Uros Bizjak
@ 2021-07-15  8:04                 ` Kewen.Lin
  2021-07-15  8:23                   ` Uros Bizjak
  2021-07-15  8:40                   ` Kewen.Lin
  0 siblings, 2 replies; 31+ messages in thread
From: Kewen.Lin @ 2021-07-15  8:04 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: Richard Biener, Richard Sandiford, Bill Schmidt, GCC Patches,
	Segher Boessenkool

Hi Uros,

on 2021/7/15 3:17 PM, Uros Bizjak wrote:
> On Thu, Jul 15, 2021 at 9:07 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> on 2021/7/14 下午3:45, Kewen.Lin via Gcc-patches wrote:
>>> on 2021/7/14 下午2:38, Richard Biener wrote:
>>>> On Tue, Jul 13, 2021 at 4:59 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>>
>>>>> on 2021/7/13 下午8:42, Richard Biener wrote:
>>>>>> On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>
>>>>> I guess the proposed IFN would be directly mapped for [us]mul_highpart?
>>>>
>>>> Yes.
>>>>
>>>
>>> Thanks for confirming!  The related patch v2 is attached and the testing
>>> is ongoing.
>>>
>>
>> It's bootstrapped & regtested on powerpc64le-linux-gnu P9 and
>> aarch64-linux-gnu.  But on x86_64-redhat-linux there are XPASSes as below:
>>
>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhuw
>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhuw
>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhw
>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhw
> 
> These XFAILs should be removed after your patch.
> 
I'm curious whether it's intentional not to specify -fno-vect-cost-model
for this test case.  As noted above, this case is sensitive on how we
cost mult_highpart.  Without cost modeling, the XFAILs can be removed
only with this mul_highpart pattern support, no matter how we model it
(x86 part of this patch exists or not).
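
Just to illustrate what I mean (this is not the actual header of
pr100637-3w.c, only a sketch of the kind of change being asked about):

```
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -msse2 -fno-vect-cost-model" } */
```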

> This is PR100696 [1], we want PMULH.W here, so x86 part of the patch
> is actually not needed.
> 

Thanks for the information!  The justification for the x86 part is that:
the IFN_MULH essentially covers MULT_HIGHPART_EXPR with mul_highpart
optab support, i386 port has already customized costing for 
MULT_HIGHPART_EXPR (should mean/involve the case with mul_highpart optab
support), if we don't follow the same way for IFN_MULH, I'm worried that
we may cost the IFN_MULH wrongly.  If taking IFN_MULH as normal stmt is
a right thing (we shouldn't cost it specially), it at least means we
have to adjust ix86_multiplication_cost for MULT_HIGHPART_EXPR when it
has direct mul_highpart optab support, I think they should be costed
consistently.  Does it sound reasonable?

BR,
Kewen

> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100696
> 
> Uros.
> 
>> They weren't exposed in the testing run with the previous patch which
>> doesn't use IFN way.  By investigating it, the difference comes from
>> the different costing on MULT_HIGHPART_EXPR and IFN_MULH.
>>
>> For MULT_HIGHPART_EXPR, it's costed by 16 from below call:
>>
>>         case MULT_EXPR:
>>         case WIDEN_MULT_EXPR:
>>         case MULT_HIGHPART_EXPR:
>>           stmt_cost = ix86_multiplication_cost (ix86_cost, mode);
>>
>> While for IFN_MULH, it's costed by 4 as normal stmt so the total cost
>> becomes profitable and the expected vectorization happens.
>>
>> One conservative fix seems to make IFN_MULH costing go through the
>> unique cost interface for multiplication, that is:
>>
>>       case CFN_MULH:
>>         stmt_cost = ix86_multiplication_cost (ix86_cost, mode);
>>         break;
>>
>> As the test case marks the checks as "xfail", probably it's good to
>> revisit the costing on mul_highpart to ensure it's not priced more.
>>
>> The attached patch also addressed Richard S.'s review comments on
>> two reformatting hunks.  Is it ok for trunk?
>>
>> BR,
>> Kewen
>> -----
>> gcc/ChangeLog:
>>
>>         * internal-fn.c (first_commutative_argument): Add info for IFN_MULH.
>>         * internal-fn.def (IFN_MULH): New internal function.
>>         * tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
>>         recog normal multiply highpart as IFN_MULH.
>>         * config/i386/i386.c (ix86_add_stmt_cost): Adjust for combined
>>         function CFN_MULH.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3] vect: Recog mul_highpart pattern
  2021-07-15  8:04                 ` Kewen.Lin
@ 2021-07-15  8:23                   ` Uros Bizjak
  2021-07-15  8:49                     ` Kewen.Lin
  2021-07-15  8:40                   ` Kewen.Lin
  1 sibling, 1 reply; 31+ messages in thread
From: Uros Bizjak @ 2021-07-15  8:23 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Richard Biener, Richard Sandiford, Bill Schmidt, GCC Patches,
	Segher Boessenkool

On Thu, Jul 15, 2021 at 10:04 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Uros,
>
> on 2021/7/15 下午3:17, Uros Bizjak wrote:
> > On Thu, Jul 15, 2021 at 9:07 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>
> >> on 2021/7/14 下午3:45, Kewen.Lin via Gcc-patches wrote:
> >>> on 2021/7/14 下午2:38, Richard Biener wrote:
> >>>> On Tue, Jul 13, 2021 at 4:59 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>>>>
> >>>>> on 2021/7/13 下午8:42, Richard Biener wrote:
> >>>>>> On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>>>
> >>>>> I guess the proposed IFN would be directly mapped for [us]mul_highpart?
> >>>>
> >>>> Yes.
> >>>>
> >>>
> >>> Thanks for confirming!  The related patch v2 is attached and the testing
> >>> is ongoing.
> >>>
> >>
> >> It's bootstrapped & regtested on powerpc64le-linux-gnu P9 and
> >> aarch64-linux-gnu.  But on x86_64-redhat-linux there are XPASSes as below:
> >>
> >> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhuw
> >> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhuw
> >> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhw
> >> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhw
> >
> > These XFAILs should be removed after your patch.
> >
> I'm curious whether it's intentional not to specify -fno-vect-cost-model
> for this test case.  As noted above, this case is sensitive on how we
> cost mult_highpart.  Without cost modeling, the XFAILs can be removed
> only with this mul_highpart pattern support, no matter how we model it
> (x86 part of this patch exists or not).
>
> > This is PR100696 [1], we want PMULH.W here, so x86 part of the patch
> > is actually not needed.
> >
>
> Thanks for the information!  The justification for the x86 part is that:
> the IFN_MULH essentially covers MULT_HIGHPART_EXPR with mul_highpart
> optab support, i386 port has already customized costing for
> MULT_HIGHPART_EXPR (should mean/involve the case with mul_highpart optab
> support), if we don't follow the same way for IFN_MULH, I'm worried that
> we may cost the IFN_MULH wrongly.  If taking IFN_MULH as normal stmt is
> a right thing (we shouldn't cost it specially), it at least means we
> have to adjust ix86_multiplication_cost for MULT_HIGHPART_EXPR when it
> has direct mul_highpart optab support, I think they should be costed
> consistently.  Does it sound reasonable?

Ah, I was under impression that i386 part was introduced to avoid
generation of PMULHW instructions in the testcases above (to keep
XFAILs). Based on your explanation - yes, the costing function should
be the same. So, the x86 part is OK.

Thanks,
Uros.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3] vect: Recog mul_highpart pattern
  2021-07-15  8:04                 ` Kewen.Lin
  2021-07-15  8:23                   ` Uros Bizjak
@ 2021-07-15  8:40                   ` Kewen.Lin
  2021-07-15 11:58                     ` Richard Biener
  1 sibling, 1 reply; 31+ messages in thread
From: Kewen.Lin @ 2021-07-15  8:40 UTC (permalink / raw)
  To: Richard Biener, Richard Sandiford
  Cc: Bill Schmidt, GCC Patches, Uros Bizjak, Segher Boessenkool

on 2021/7/15 4:04 PM, Kewen.Lin via Gcc-patches wrote:
> Hi Uros,
> 
> on 2021/7/15 下午3:17, Uros Bizjak wrote:
>> On Thu, Jul 15, 2021 at 9:07 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>
>>> on 2021/7/14 下午3:45, Kewen.Lin via Gcc-patches wrote:
>>>> on 2021/7/14 下午2:38, Richard Biener wrote:
>>>>> On Tue, Jul 13, 2021 at 4:59 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>>>
>>>>>> on 2021/7/13 下午8:42, Richard Biener wrote:
>>>>>>> On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>>
>>>>>> I guess the proposed IFN would be directly mapped for [us]mul_highpart?
>>>>>
>>>>> Yes.
>>>>>
>>>>
>>>> Thanks for confirming!  The related patch v2 is attached and the testing
>>>> is ongoing.
>>>>
>>>
>>> It's bootstrapped & regtested on powerpc64le-linux-gnu P9 and
>>> aarch64-linux-gnu.  But on x86_64-redhat-linux there are XPASSes as below:
>>>
>>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhuw
>>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhuw
>>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhw
>>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhw
>>
>> These XFAILs should be removed after your patch.
>>
> I'm curious whether it's intentional not to specify -fno-vect-cost-model
> for this test case.  As noted above, this case is sensitive on how we
> cost mult_highpart.  Without cost modeling, the XFAILs can be removed
> only with this mul_highpart pattern support, no matter how we model it
> (x86 part of this patch exists or not).
> 
>> This is PR100696 [1], we want PMULH.W here, so x86 part of the patch
>> is actually not needed.
>>
> 
> Thanks for the information!  The justification for the x86 part is that:
> the IFN_MULH essentially covers MULT_HIGHPART_EXPR with mul_highpart
> optab support, i386 port has already customized costing for 
> MULT_HIGHPART_EXPR (should mean/involve the case with mul_highpart optab
> support), if we don't follow the same way for IFN_MULH, I'm worried that
> we may cost the IFN_MULH wrongly.  If taking IFN_MULH as normal stmt is
> a right thing (we shouldn't cost it specially), it at least means we
> have to adjust ix86_multiplication_cost for MULT_HIGHPART_EXPR when it
> has direct mul_highpart optab support, I think they should be costed
> consistently.  Does it sound reasonable?
> 

Hi Richard(s),

This possibly-inconsistent handling seems like a counter-example to using a
new IFN rather than the existing tree_code: it looks harder to maintain,
since we have to remember to keep all of its handlings consistent.  ;)
From this perspective, maybe it's better to move back to using the tree_code
and guard it with can_mult_highpart_p == 1 (which acts just like the IFN and
avoids the costing issue Richi pointed out before)?

What do you think?

BR,
Kewen

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3] vect: Recog mul_highpart pattern
  2021-07-15  8:23                   ` Uros Bizjak
@ 2021-07-15  8:49                     ` Kewen.Lin
  2021-07-15  9:41                       ` Uros Bizjak
  0 siblings, 1 reply; 31+ messages in thread
From: Kewen.Lin @ 2021-07-15  8:49 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: Richard Biener, Richard Sandiford, Bill Schmidt, GCC Patches,
	Segher Boessenkool

on 2021/7/15 4:23 PM, Uros Bizjak wrote:
> On Thu, Jul 15, 2021 at 10:04 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi Uros,
>>
>> on 2021/7/15 下午3:17, Uros Bizjak wrote:
>>> On Thu, Jul 15, 2021 at 9:07 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>
>>>> on 2021/7/14 下午3:45, Kewen.Lin via Gcc-patches wrote:
>>>>> on 2021/7/14 下午2:38, Richard Biener wrote:
>>>>>> On Tue, Jul 13, 2021 at 4:59 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>>>>
>>>>>>> on 2021/7/13 下午8:42, Richard Biener wrote:
>>>>>>>> On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>>>
>>>>>>> I guess the proposed IFN would be directly mapped for [us]mul_highpart?
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>
>>>>> Thanks for confirming!  The related patch v2 is attached and the testing
>>>>> is ongoing.
>>>>>
>>>>
>>>> It's bootstrapped & regtested on powerpc64le-linux-gnu P9 and
>>>> aarch64-linux-gnu.  But on x86_64-redhat-linux there are XPASSes as below:
>>>>
>>>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhuw
>>>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhuw
>>>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhw
>>>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhw
>>>
>>> These XFAILs should be removed after your patch.
>>>
>> I'm curious whether it's intentional not to specify -fno-vect-cost-model
>> for this test case.  As noted above, this case is sensitive on how we
>> cost mult_highpart.  Without cost modeling, the XFAILs can be removed
>> only with this mul_highpart pattern support, no matter how we model it
>> (x86 part of this patch exists or not).
>>
>>> This is PR100696 [1], we want PMULH.W here, so x86 part of the patch
>>> is actually not needed.
>>>
>>
>> Thanks for the information!  The justification for the x86 part is that:
>> the IFN_MULH essentially covers MULT_HIGHPART_EXPR with mul_highpart
>> optab support, i386 port has already customized costing for
>> MULT_HIGHPART_EXPR (should mean/involve the case with mul_highpart optab
>> support), if we don't follow the same way for IFN_MULH, I'm worried that
>> we may cost the IFN_MULH wrongly.  If taking IFN_MULH as normal stmt is
>> a right thing (we shouldn't cost it specially), it at least means we
>> have to adjust ix86_multiplication_cost for MULT_HIGHPART_EXPR when it
>> has direct mul_highpart optab support, I think they should be costed
>> consistently.  Does it sound reasonable?
> 
> Ah, I was under impression that i386 part was introduced to avoid
> generation of PMULHW instructions in the testcases above (to keep
> XFAILs). Based on your explanation - yes, the costing function should
> be the same. So, the x86 part is OK.
> 

Thanks!  It does have the effect of keeping the XFAILs.  ;)  I guess this
case doesn't care much about the costing, just like most vectorization
test cases?  If so, do you want me to remove the XFAILs and add the extra
option "-fno-vect-cost-model" along with this patch?

BR,
Kewen


* Re: [PATCH v3] vect: Recog mul_highpart pattern
  2021-07-15  8:49                     ` Kewen.Lin
@ 2021-07-15  9:41                       ` Uros Bizjak
  0 siblings, 0 replies; 31+ messages in thread
From: Uros Bizjak @ 2021-07-15  9:41 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Richard Biener, Richard Sandiford, Bill Schmidt, GCC Patches,
	Segher Boessenkool

On Thu, Jul 15, 2021 at 10:49 AM Kewen.Lin <linkw@linux.ibm.com> wrote:

> on 2021/7/15 下午4:23, Uros Bizjak wrote:
> > On Thu, Jul 15, 2021 at 10:04 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>
> >> Hi Uros,
> >>
> >> on 2021/7/15 下午3:17, Uros Bizjak wrote:
> >>> On Thu, Jul 15, 2021 at 9:07 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>>>
> >>>> on 2021/7/14 下午3:45, Kewen.Lin via Gcc-patches wrote:
> >>>>> on 2021/7/14 下午2:38, Richard Biener wrote:
> >>>>>> On Tue, Jul 13, 2021 at 4:59 PM Kewen.Lin <linkw@linux.ibm.com>
> wrote:
> >>>>>>>
> >>>>>>> on 2021/7/13 下午8:42, Richard Biener wrote:
> >>>>>>>> On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <linkw@linux.ibm.com>
> wrote:
> >>>>>>
> >>>>>>> I guess the proposed IFN would be directly mapped for
> [us]mul_highpart?
> >>>>>>
> >>>>>> Yes.
> >>>>>>
> >>>>>
> >>>>> Thanks for confirming!  The related patch v2 is attached and the
> testing
> >>>>> is ongoing.
> >>>>>
> >>>>
> >>>> It's bootstrapped & regtested on powerpc64le-linux-gnu P9 and
> >>>> aarch64-linux-gnu.  But on x86_64-redhat-linux there are XPASSes as
> below:
> >>>>
> >>>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhuw
> >>>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhuw
> >>>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhw
> >>>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhw
> >>>
> >>> These XFAILs should be removed after your patch.
> >>>
> >> I'm curious whether it's intentional not to specify -fno-vect-cost-model
> >> for this test case.  As noted above, this case is sensitive on how we
> >> cost mult_highpart.  Without cost modeling, the XFAILs can be removed
> >> only with this mul_highpart pattern support, no matter how we model it
> >> (x86 part of this patch exists or not).
> >>
> >>> This is PR100696 [1], we want PMULH.W here, so x86 part of the patch
> >>> is actually not needed.
> >>>
> >>
> >> Thanks for the information!  The justification for the x86 part is that:
> >> the IFN_MULH essentially covers MULT_HIGHPART_EXPR with mul_highpart
> >> optab support, i386 port has already customized costing for
> >> MULT_HIGHPART_EXPR (should mean/involve the case with mul_highpart optab
> >> support), if we don't follow the same way for IFN_MULH, I'm worried that
> >> we may cost the IFN_MULH wrongly.  If taking IFN_MULH as normal stmt is
> >> a right thing (we shouldn't cost it specially), it at least means we
> >> have to adjust ix86_multiplication_cost for MULT_HIGHPART_EXPR when it
> >> has direct mul_highpart optab support, I think they should be costed
> >> consistently.  Does it sound reasonable?
> >
> > Ah, I was under impression that i386 part was introduced to avoid
> > generation of PMULHW instructions in the testcases above (to keep
> > XFAILs). Based on your explanation - yes, the costing function should
> > be the same. So, the x86 part is OK.
> >
>
> Thanks!  It does have the effect to keep XFAILs.  ;)  I guess the case
> doesn't care about the costing much just like most vectorization cases?
> If so, do you want me to remove the xfails with one extra option
> "-fno-vect-cost-model" along with this patch.
>

Yes, please do so. The testcase cares only about PMULHW generation.

Thanks,
Uros.


> BR,
> Kewen
>


* Re: [PATCH v3] vect: Recog mul_highpart pattern
  2021-07-15  8:40                   ` Kewen.Lin
@ 2021-07-15 11:58                     ` Richard Biener
  2021-07-16  5:33                       ` [PATCH v4] " Kewen.Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Richard Biener @ 2021-07-15 11:58 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Richard Sandiford, Bill Schmidt, GCC Patches, Uros Bizjak,
	Segher Boessenkool

On Thu, Jul 15, 2021 at 10:41 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> on 2021/7/15 下午4:04, Kewen.Lin via Gcc-patches wrote:
> > Hi Uros,
> >
> > on 2021/7/15 下午3:17, Uros Bizjak wrote:
> >> On Thu, Jul 15, 2021 at 9:07 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>>
> >>> on 2021/7/14 下午3:45, Kewen.Lin via Gcc-patches wrote:
> >>>> on 2021/7/14 下午2:38, Richard Biener wrote:
> >>>>> On Tue, Jul 13, 2021 at 4:59 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>>>>>
> >>>>>> on 2021/7/13 下午8:42, Richard Biener wrote:
> >>>>>>> On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>>>>
> >>>>>> I guess the proposed IFN would be directly mapped for [us]mul_highpart?
> >>>>>
> >>>>> Yes.
> >>>>>
> >>>>
> >>>> Thanks for confirming!  The related patch v2 is attached and the testing
> >>>> is ongoing.
> >>>>
> >>>
> >>> It's bootstrapped & regtested on powerpc64le-linux-gnu P9 and
> >>> aarch64-linux-gnu.  But on x86_64-redhat-linux there are XPASSes as below:
> >>>
> >>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhuw
> >>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhuw
> >>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhw
> >>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhw
> >>
> >> These XFAILs should be removed after your patch.
> >>
> > I'm curious whether it's intentional not to specify -fno-vect-cost-model
> > for this test case.  As noted above, this case is sensitive on how we
> > cost mult_highpart.  Without cost modeling, the XFAILs can be removed
> > only with this mul_highpart pattern support, no matter how we model it
> > (x86 part of this patch exists or not).
> >
> >> This is PR100696 [1], we want PMULH.W here, so x86 part of the patch
> >> is actually not needed.
> >>
> >
> > Thanks for the information!  The justification for the x86 part is that:
> > the IFN_MULH essentially covers MULT_HIGHPART_EXPR with mul_highpart
> > optab support, i386 port has already customized costing for
> > MULT_HIGHPART_EXPR (should mean/involve the case with mul_highpart optab
> > support), if we don't follow the same way for IFN_MULH, I'm worried that
> > we may cost the IFN_MULH wrongly.  If taking IFN_MULH as normal stmt is
> > a right thing (we shouldn't cost it specially), it at least means we
> > have to adjust ix86_multiplication_cost for MULT_HIGHPART_EXPR when it
> > has direct mul_highpart optab support, I think they should be costed
> > consistently.  Does it sound reasonable?
> >
>
> Hi Richard(s),
>
> This possibly inconsistent handling problem seems like a counter example
> better to use a new IFN rather than the existing tree_code, it seems hard
> to maintain (should remember to keep consistent for its handlings).  ;)
> From this perspective, maybe it's better to move backward to use tree_code
> and guard it under can_mult_highpart_p == 1 (just like IFN and avoid
> costing issue Richi pointed out before)?
>
> What do you think?

No, whenever we want to do code generation based on machine
capabilities the canonical way to test for those is to look at optabs
and then it's most natural to keep that 1:1 relation and emit
internal function calls which directly map to supported optabs
instead of going back to some tree codes.

When targets "lie" and provide expanders for something they can
only emulate then they have to compensate in their costing.
But as I understand this isn't the case for x86 here.

Now, in this case we already have the MULT_HIGHPART_EXPR tree, so yes,
it might make sense to use that instead of introducing an alternate way
via the direct internal function.  Somebody decided that MULT_HIGHPART
is generic enough to warrant this - but I see that expand_mult_highpart
can fail unless can_mult_highpart_p holds, and this is exactly one of
the cases we want to avoid: either we can handle something generally,
in which case it can be a tree code, or we can't, and then it should be
1:1 tied to optabs at best (mult_highpart has scalar support only for
the direct optab, vector support also for widen_mult).
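
To illustrate the 1:1 mapping concretely (just a sketch, with new_vectype
as an assumed placeholder for the pattern's vector type, not the exact
patch code): the pattern can simply ask whether the target provides the
optab behind the IFN and give up otherwise:

  /* Sketch: IFN_MULH maps 1:1 to the [us]mul_highpart optabs, so the
     pattern bails out unless the target really supports the operation
     on this vector type.  */
  if (!direct_internal_fn_supported_p (IFN_MULH, new_vectype,
                                       OPTIMIZE_FOR_SPEED))
    return NULL;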

Richard.

>
> BR,
> Kewen


* Re: [RFC/PATCH] vect: Recog mul_highpart pattern
  2021-07-15  1:40                     ` Kewen.Lin
@ 2021-07-15 23:08                       ` Segher Boessenkool
  0 siblings, 0 replies; 31+ messages in thread
From: Segher Boessenkool @ 2021-07-15 23:08 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: Richard Biener, GCC Patches, Bill Schmidt, richard.sandiford

On Thu, Jul 15, 2021 at 09:40:52AM +0800, Kewen.Lin wrote:
> on 2021/7/15 上午3:32, Segher Boessenkool wrote:
> > The normal rule is you cannot go over 80.  It is perfectly fine to have
> > shorter lines, certainly if that is nice for some other reason, so
> > automatically (by some tool) changing this is Just Wrong.
> 
> OK, could this be applied to changelog entry too?  I guess yes?

Yes, lines of length 80 are fine in changelogs.

But try not to make short lines (that do not end an entry).  A changelog
that looks different from other changelogs is harder to read.  Normally
you have a whole bunch of totally boring entries ("New." or "Likewise."
for example), and the few that are longer naturally stand out then,
making it easier to scan the changelogs (which is what they are used for
most of the time: search for something, and press "n" a lot).
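
For example, a made-up fragment in that style (hypothetical files and
functions, not from this patch) would look like:

	* foo.c (foo_set_flag): New function.
	* bar.c (bar_set_flag): Likewise.
	* baz.c (baz_init): Use foo_set_flag and bar_set_flag instead of
	setting each flag by hand.

Here the first two entries are the boring ones, and the last one stands
out when scanning.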

Also try to describe even the less trivial changes somewhat briefly in
changelogs: changelogs just say *what* changed, not *why*, and it is
okay to leave out details (this is a tradeoff of course).


Segher


* [PATCH v4] vect: Recog mul_highpart pattern
  2021-07-15 11:58                     ` Richard Biener
@ 2021-07-16  5:33                       ` Kewen.Lin
  2021-07-19 10:35                         ` Richard Biener
  0 siblings, 1 reply; 31+ messages in thread
From: Kewen.Lin @ 2021-07-16  5:33 UTC (permalink / raw)
  To: Richard Biener
  Cc: Richard Sandiford, Bill Schmidt, GCC Patches, Uros Bizjak,
	Segher Boessenkool

[-- Attachment #1: Type: text/plain, Size: 4958 bytes --]

on 2021/7/15 7:58 PM, Richard Biener wrote:
> On Thu, Jul 15, 2021 at 10:41 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> on 2021/7/15 下午4:04, Kewen.Lin via Gcc-patches wrote:
>>> Hi Uros,
>>>
>>> on 2021/7/15 下午3:17, Uros Bizjak wrote:
>>>> On Thu, Jul 15, 2021 at 9:07 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>>
>>>>> on 2021/7/14 下午3:45, Kewen.Lin via Gcc-patches wrote:
>>>>>> on 2021/7/14 下午2:38, Richard Biener wrote:
>>>>>>> On Tue, Jul 13, 2021 at 4:59 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>>>>>
>>>>>>>> on 2021/7/13 下午8:42, Richard Biener wrote:
>>>>>>>>> On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>>>>
>>>>>>>> I guess the proposed IFN would be directly mapped for [us]mul_highpart?
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>
>>>>>> Thanks for confirming!  The related patch v2 is attached and the testing
>>>>>> is ongoing.
>>>>>>
>>>>>
>>>>> It's bootstrapped & regtested on powerpc64le-linux-gnu P9 and
>>>>> aarch64-linux-gnu.  But on x86_64-redhat-linux there are XPASSes as below:
>>>>>
>>>>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhuw
>>>>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhuw
>>>>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhw
>>>>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhw
>>>>
>>>> These XFAILs should be removed after your patch.
>>>>
>>> I'm curious whether it's intentional not to specify -fno-vect-cost-model
>>> for this test case.  As noted above, this case is sensitive on how we
>>> cost mult_highpart.  Without cost modeling, the XFAILs can be removed
>>> only with this mul_highpart pattern support, no matter how we model it
>>> (x86 part of this patch exists or not).
>>>
>>>> This is PR100696 [1], we want PMULH.W here, so x86 part of the patch
>>>> is actually not needed.
>>>>
>>>
>>> Thanks for the information!  The justification for the x86 part is that:
>>> the IFN_MULH essentially covers MULT_HIGHPART_EXPR with mul_highpart
>>> optab support, i386 port has already customized costing for
>>> MULT_HIGHPART_EXPR (should mean/involve the case with mul_highpart optab
>>> support), if we don't follow the same way for IFN_MULH, I'm worried that
>>> we may cost the IFN_MULH wrongly.  If taking IFN_MULH as normal stmt is
>>> a right thing (we shouldn't cost it specially), it at least means we
>>> have to adjust ix86_multiplication_cost for MULT_HIGHPART_EXPR when it
>>> has direct mul_highpart optab support, I think they should be costed
>>> consistently.  Does it sound reasonable?
>>>
>>
>> Hi Richard(s),
>>
>> This possibly inconsistent handling problem seems like a counter example
>> better to use a new IFN rather than the existing tree_code, it seems hard
>> to maintain (should remember to keep consistent for its handlings).  ;)
>> From this perspective, maybe it's better to move backward to use tree_code
>> and guard it under can_mult_highpart_p == 1 (just like IFN and avoid
>> costing issue Richi pointed out before)?
>>
>> What do you think?
> 
> No, whenever we want to do code generation based on machine
> capabilities the canonical way to test for those is to look at optabs
> and then it's most natural to keep that 1:1 relation and emit
> internal function calls which directly map to supported optabs
> instead of going back to some tree codes.
> 
> When targets "lie" and provide expanders for something they can
> only emulate then they have to compensate in their costing.
> But as I understand this isn't the case for x86 here.
> 
> Now, in this case we already have the MULT_HIGHPART_EXPR tree,
> so yes, it might make sense to use that instead of introducing an
> alternate way via the direct internal function.  Somebody decided
> that MULT_HIGHPART is generic enough to warrant this - but I
> see that expand_mult_highpart can fail unless can_mult_highpart_p
> and this is exactly one of the cases we want to avoid - either
> we can handle something generally in which case it can be a
> tree code or we can't, then it should be 1:1 tied to optabs at best
> (mult_highpart has scalar support only for the direct optab,
> vector support also for widen_mult).
> 

Thanks for the detailed explanation!  The attached v4 follows the
preferred IFN approach just like v3, with extra test case updates.

Bootstrapped & regtested again on powerpc64le-linux-gnu P9,
x86_64-redhat-linux and aarch64-linux-gnu.

Is it ok for trunk?

BR,
Kewen
-----
gcc/ChangeLog:

	PR tree-optimization/100696
	* internal-fn.c (first_commutative_argument): Add info for IFN_MULH.
	* internal-fn.def (IFN_MULH): New internal function.
	* tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
	recog normal multiply highpart as IFN_MULH.
	* config/i386/i386.c (ix86_add_stmt_cost): Adjust for combined
	function CFN_MULH.

gcc/testsuite/ChangeLog:

	PR tree-optimization/100696
	* gcc.target/i386/pr100637-3w.c: Adjust for mul_highpart recog.

[-- Attachment #2: vect-recog-hmul-v4.patch --]
[-- Type: text/plain, Size: 5204 bytes --]

---
 gcc/config/i386/i386.c                      |  3 ++
 gcc/internal-fn.c                           |  1 +
 gcc/internal-fn.def                         |  2 ++
 gcc/testsuite/gcc.target/i386/pr100637-3w.c |  6 ++--
 gcc/tree-vect-patterns.c                    | 38 +++++++++++++++------
 5 files changed, 37 insertions(+), 13 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index a93128fa0a4..1dd9108353c 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -22559,6 +22559,9 @@ ix86_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 				   mode == SFmode ? ix86_cost->fmass
 				   : ix86_cost->fmasd);
 	break;
+      case CFN_MULH:
+	stmt_cost = ix86_multiplication_cost (ix86_cost, mode);
+	break;
       default:
 	break;
       }
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index fb8b43d1ce2..b1b4289357c 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -3703,6 +3703,7 @@ first_commutative_argument (internal_fn fn)
     case IFN_FNMS:
     case IFN_AVG_FLOOR:
     case IFN_AVG_CEIL:
+    case IFN_MULH:
     case IFN_MULHS:
     case IFN_MULHRS:
     case IFN_FMIN:
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index c3b8e730960..ed6d7de1680 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -169,6 +169,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_FLOOR, ECF_CONST | ECF_NOTHROW, first,
 DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL, ECF_CONST | ECF_NOTHROW, first,
 			      savg_ceil, uavg_ceil, binary)
 
+DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST | ECF_NOTHROW, first,
+			      smul_highpart, umul_highpart, binary)
 DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST | ECF_NOTHROW, first,
 			      smulhs, umulhs, binary)
 DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST | ECF_NOTHROW, first,
diff --git a/gcc/testsuite/gcc.target/i386/pr100637-3w.c b/gcc/testsuite/gcc.target/i386/pr100637-3w.c
index b951f30f571..4ea467b4af5 100644
--- a/gcc/testsuite/gcc.target/i386/pr100637-3w.c
+++ b/gcc/testsuite/gcc.target/i386/pr100637-3w.c
@@ -1,6 +1,6 @@
 /* PR target/100637 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-vectorize -msse4" } */
+/* { dg-options "-O2 -ftree-vectorize -msse4 -fno-vect-cost-model" } */
 
 short r[2], a[2], b[2];
 unsigned short ur[2], ua[2], ub[2];
@@ -13,7 +13,7 @@ void mulh (void)
     r[i] = ((int) a[i] * b[i]) >> 16;
 }
 
-/* { dg-final { scan-assembler "pmulhw" { xfail *-*-* } } } */
+/* { dg-final { scan-assembler "pmulhw" } } */
 
 void mulhu (void)
 {
@@ -23,7 +23,7 @@ void mulhu (void)
     ur[i] = ((unsigned int) ua[i] * ub[i]) >> 16;
 }
 
-/* { dg-final { scan-assembler "pmulhuw" { xfail *-*-* } } } */
+/* { dg-final { scan-assembler "pmulhuw" } } */
 
 void mulhrs (void)
 {
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index b2e7fc2cc7a..ada89d7060b 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -1896,8 +1896,15 @@ vect_recog_over_widening_pattern (vec_info *vinfo,
 
    1) Multiply high with scaling
      TYPE res = ((TYPE) a * (TYPE) b) >> c;
+     Here, c is bitsize (TYPE) / 2 - 1.
+
    2) ... or also with rounding
      TYPE res = (((TYPE) a * (TYPE) b) >> d + 1) >> 1;
+     Here, d is bitsize (TYPE) / 2 - 2.
+
+   3) Normal multiply high
+     TYPE res = ((TYPE) a * (TYPE) b) >> e;
+     Here, e is bitsize (TYPE) / 2.
 
    where only the bottom half of res is used.  */
 
@@ -1942,7 +1949,6 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
   stmt_vec_info mulh_stmt_info;
   tree scale_term;
   internal_fn ifn;
-  unsigned int expect_offset;
 
   /* Check for the presence of the rounding term.  */
   if (gimple_assign_rhs_code (rshift_input_stmt) == PLUS_EXPR)
@@ -1991,25 +1997,37 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
 
       /* Get the scaling term.  */
       scale_term = gimple_assign_rhs2 (plus_input_stmt);
+      /* Check that the scaling factor is correct.  */
+      if (TREE_CODE (scale_term) != INTEGER_CST)
+	return NULL;
+
+      /* Check pattern 2).  */
+      if (wi::to_widest (scale_term) + target_precision + 2
+	  != TYPE_PRECISION (lhs_type))
+	return NULL;
 
-      expect_offset = target_precision + 2;
       ifn = IFN_MULHRS;
     }
   else
     {
       mulh_stmt_info = rshift_input_stmt_info;
       scale_term = gimple_assign_rhs2 (last_stmt);
+      /* Check that the scaling factor is correct.  */
+      if (TREE_CODE (scale_term) != INTEGER_CST)
+	return NULL;
 
-      expect_offset = target_precision + 1;
-      ifn = IFN_MULHS;
+      /* Check for pattern 1).  */
+      if (wi::to_widest (scale_term) + target_precision + 1
+	  == TYPE_PRECISION (lhs_type))
+	ifn = IFN_MULHS;
+      /* Check for pattern 3).  */
+      else if (wi::to_widest (scale_term) + target_precision
+	       == TYPE_PRECISION (lhs_type))
+	ifn = IFN_MULH;
+      else
+	return NULL;
     }
 
-  /* Check that the scaling factor is correct.  */
-  if (TREE_CODE (scale_term) != INTEGER_CST
-      || wi::to_widest (scale_term) + expect_offset
-	   != TYPE_PRECISION (lhs_type))
-    return NULL;
-
   /* Check whether the scaling input term can be seen as two widened
      inputs multiplied together.  */
   vect_unpromoted_value unprom_mult[2];
-- 
2.17.1
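
For reference (illustrative SSA names, not from an actual dump): with
this patch the mulh loop in pr100637-3w.c is recognized by the pattern
and the vect dump shows the highpart multiply roughly as

  vect__x = .MULH (vect_a, vect_b);

which the expander then maps directly to the [us]mul_highpart optab
(pmulhw / pmulhuw on x86).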



* Re: [PATCH v4] vect: Recog mul_highpart pattern
  2021-07-16  5:33                       ` [PATCH v4] " Kewen.Lin
@ 2021-07-19 10:35                         ` Richard Biener
  0 siblings, 0 replies; 31+ messages in thread
From: Richard Biener @ 2021-07-19 10:35 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Richard Sandiford, Bill Schmidt, GCC Patches, Uros Bizjak,
	Segher Boessenkool

On Fri, Jul 16, 2021 at 7:33 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> on 2021/7/15 下午7:58, Richard Biener wrote:
> > On Thu, Jul 15, 2021 at 10:41 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>
> >> on 2021/7/15 下午4:04, Kewen.Lin via Gcc-patches wrote:
> >>> Hi Uros,
> >>>
> >>> on 2021/7/15 下午3:17, Uros Bizjak wrote:
> >>>> On Thu, Jul 15, 2021 at 9:07 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>>>>
> >>>>> on 2021/7/14 下午3:45, Kewen.Lin via Gcc-patches wrote:
> >>>>>> on 2021/7/14 下午2:38, Richard Biener wrote:
> >>>>>>> On Tue, Jul 13, 2021 at 4:59 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>>>>>>>
> >>>>>>>> on 2021/7/13 下午8:42, Richard Biener wrote:
> >>>>>>>>> On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>>>>>>
> >>>>>>>> I guess the proposed IFN would be directly mapped for [us]mul_highpart?
> >>>>>>>
> >>>>>>> Yes.
> >>>>>>>
> >>>>>>
> >>>>>> Thanks for confirming!  The related patch v2 is attached and the testing
> >>>>>> is ongoing.
> >>>>>>
> >>>>>
> >>>>> It's bootstrapped & regtested on powerpc64le-linux-gnu P9 and
> >>>>> aarch64-linux-gnu.  But on x86_64-redhat-linux there are XPASSes as below:
> >>>>>
> >>>>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhuw
> >>>>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhuw
> >>>>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhw
> >>>>> XFAIL->XPASS: gcc.target/i386/pr100637-3w.c scan-assembler pmulhw
> >>>>
> >>>> These XFAILs should be removed after your patch.
> >>>>
> >>> I'm curious whether it's intentional not to specify -fno-vect-cost-model
> >>> for this test case.  As noted above, this case is sensitive on how we
> >>> cost mult_highpart.  Without cost modeling, the XFAILs can be removed
> >>> only with this mul_highpart pattern support, no matter how we model it
> >>> (x86 part of this patch exists or not).
> >>>
> >>>> This is PR100696 [1], we want PMULH.W here, so x86 part of the patch
> >>>> is actually not needed.
> >>>>
> >>>
> >>> Thanks for the information!  The justification for the x86 part is that:
> >>> the IFN_MULH essentially covers MULT_HIGHPART_EXPR with mul_highpart
> >>> optab support, i386 port has already customized costing for
> >>> MULT_HIGHPART_EXPR (should mean/involve the case with mul_highpart optab
> >>> support), if we don't follow the same way for IFN_MULH, I'm worried that
> >>> we may cost the IFN_MULH wrongly.  If taking IFN_MULH as normal stmt is
> >>> a right thing (we shouldn't cost it specially), it at least means we
> >>> have to adjust ix86_multiplication_cost for MULT_HIGHPART_EXPR when it
> >>> has direct mul_highpart optab support, I think they should be costed
> >>> consistently.  Does it sound reasonable?
> >>>
> >>
> >> Hi Richard(s),
> >>
> >> This possibly inconsistent handling problem seems like a counter example
> >> better to use a new IFN rather than the existing tree_code, it seems hard
> >> to maintain (should remember to keep consistent for its handlings).  ;)
> >> From this perspective, maybe it's better to move backward to use tree_code
> >> and guard it under can_mult_highpart_p == 1 (just like IFN and avoid
> >> costing issue Richi pointed out before)?
> >>
> >> What do you think?
> >
> > No, whenever we want to do code generation based on machine
> > capabilities the canonical way to test for those is to look at optabs
> > and then it's most natural to keep that 1:1 relation and emit
> > internal function calls which directly map to supported optabs
> > instead of going back to some tree codes.
> >
> > When targets "lie" and provide expanders for something they can
> > only emulate then they have to compensate in their costing.
> > But as I understand this isn't the case for x86 here.
> >
> > Now, in this case we already have the MULT_HIGHPART_EXPR tree,
> > so yes, it might make sense to use that instead of introducing an
> > alternate way via the direct internal function.  Somebody decided
> > that MULT_HIGHPART is generic enough to warrant this - but I
> > see that expand_mult_highpart can fail unless can_mult_highpart_p
> > and this is exactly one of the cases we want to avoid - either
> > we can handle something generally in which case it can be a
> > tree code or we can't, then it should be 1:1 tied to optabs at best
> > (mult_highpart has scalar support only for the direct optab,
> > vector support also for widen_mult).
> >
>
> Thanks for the detailed explanation!  The attached v4 follows the
> preferred IFN way like v3, just with extra test case updates.
>
> Bootstrapped & regtested again on powerpc64le-linux-gnu P9,
> x86_64-redhat-linux and aarch64-linux-gnu.
>
> Is it ok for trunk?

OK.

Thanks,
Richard.

> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
>         PR tree-optimization/100696
>         * internal-fn.c (first_commutative_argument): Add info for IFN_MULH.
>         * internal-fn.def (IFN_MULH): New internal function.
>         * tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
>         recog normal multiply highpart as IFN_MULH.
>         * config/i386/i386.c (ix86_add_stmt_cost): Adjust for combined
>         function CFN_MULH.
>
> gcc/testsuite/ChangeLog:
>
>         PR tree-optimization/100696
>         * gcc.target/i386/pr100637-3w.c: Adjust for mul_highpart recog.


Thread overview: 31+ messages
2021-07-13  8:52 [RFC/PATCH] vect: Recog mul_highpart pattern Kewen.Lin
2021-07-13  8:58 ` [PATCH] rs6000: Support [u]mul<mode>3_highpart for vector Kewen.Lin
2021-07-13 22:07   ` Segher Boessenkool
2021-07-14  2:12     ` Kewen.Lin
2021-07-14 18:38       ` Segher Boessenkool
2021-07-13  9:35 ` [RFC/PATCH] vect: Recog mul_highpart pattern Richard Biener
2021-07-13  9:40   ` Richard Sandiford
2021-07-13  9:44     ` Richard Biener
2021-07-13 10:11       ` Richard Sandiford
2021-07-13 10:25   ` Kewen.Lin
2021-07-13 12:42     ` Richard Biener
2021-07-13 14:59       ` Kewen.Lin
2021-07-14  6:38         ` Richard Biener
2021-07-14  7:45           ` Kewen.Lin
2021-07-14  8:38             ` Richard Sandiford
2021-07-14 10:02               ` Kewen.Lin
2021-07-14 11:32                 ` Richard Sandiford
2021-07-14 19:32                   ` Segher Boessenkool
2021-07-15  1:40                     ` Kewen.Lin
2021-07-15 23:08                       ` Segher Boessenkool
2021-07-15  1:37                   ` Kewen.Lin
2021-07-15  7:06             ` [PATCH v3] " Kewen.Lin
2021-07-15  7:17               ` Uros Bizjak
2021-07-15  8:04                 ` Kewen.Lin
2021-07-15  8:23                   ` Uros Bizjak
2021-07-15  8:49                     ` Kewen.Lin
2021-07-15  9:41                       ` Uros Bizjak
2021-07-15  8:40                   ` Kewen.Lin
2021-07-15 11:58                     ` Richard Biener
2021-07-16  5:33                       ` [PATCH v4] " Kewen.Lin
2021-07-19 10:35                         ` Richard Biener
