From: Renlin Li
Subject: Re: [RFC][PATCH]Merge VEC_COND_EXPR into MASK_STORE after loop vectorization
To: Richard Biener
Cc: GCC Patches, Richard Sandiford, Ramana Radhakrishnan, James Greenhalgh
Date: Tue, 20 Nov 2018 14:57:00 -0000

Hi Richard,

On 11/14/2018 02:59 PM, Richard Biener wrote:
> On Fri, Nov 9, 2018 at 4:49 PM Renlin Li wrote:
>>
>> Hi Richard,
>>
>> On 11/09/2018 11:48 AM, Richard Biener wrote:
>>> On Thu, Nov 8, 2018 at 5:55 PM Renlin Li wrote:
>>>>
>>>> Hi Richard,
>>>>
> I don't see the masked load here on x86_64 btw. (I don't see
> if-conversion generating a load).
> I guess that's again when store-data-races are allowed that it uses a
> RMW cycle and vectorization generating the masked variants for the
> loop-mask.  Which means for SVE if-conversion should prefer the
> masked-store variant even when store data races are allowed?

Yes, it looks like for SVE the masked-store variant is preferred even when
store data races are allowed.  This decision is made in if-cvt.

mask_store needs a pointer, and if it is created from an array access, we
need to make sure the data-reference analysis can properly analyse the
relationship between the array reference and the pointer reference, so that
no versioned loop is generated during loop vectorization.  (This is a
general improvement, and could be done in a separate patch?)
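For illustration, here is a rough source-level sketch of the if-conversion
behaviour being discussed (it is only a sketch of the idea, using the loop
from the testcase in the attached patch; the real transformation happens on
GIMPLE):

   /* Original scalar loop.  */
   for (unsigned i = 0; i < 16; i++)
     if (a & 1)
       array[i] = a + 1;

   /* With --param allow-store-data-races=1 if-conversion can make the store
      unconditional, re-storing the old value on the non-selected iterations
      (a read-modify-write), roughly:  */
   for (unsigned i = 0; i < 16; i++)
     array[i] = (a & 1) ? a + 1 : array[i];

The second form is what the vectorizer then turns into the
MASK_LOAD / VEC_COND_EXPR / MASK_STORE sequence discussed below.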
>
>>
>>> I was wondering whether we can implement
>>>
>>>   l = [masked]load;
>>>   tem = cond ? x : l;
>>>   masked-store = tem;
>>>
>>> pattern matching in a regular pass - forwprop for example.  Note the
>>> load doesn't need to be masked, correct?  In fact if it is masked you
>>> need to make sure the conditional never accesses parts that are masked
>>> in the load, no?  Or require the mask to be the same as that used by
>>> the store.  But then you still cannot simply replace the store mask
>>> with a new mask generated from the conditional?
>>
>> Yes, this would require the mask for the load and the store to be the same.
>> This matches the pattern before loop vectorization.
>> The mask here is the loop mask, to ensure we are bounded by the number of
>> iterations.
>>
>> The new mask is (original mask & condition mask) (example shown above).
>> In this case, fewer lanes will be stored.
>>
>> It is possible to do that in forwprop.
>> I could try to integrate the change into it if it is the correct place to go.
>>
>> As the pattern is initially generated by the loop vectorizer, I did the
>> change right after it, before it gets converted into other forms.  For
>> example, forwprop will transform the original code into:
>>
>>   vect__2.4_29 = vect_cst__27 + { 1, ... };
>>   _16 = (void *) ivtmp.13_25;
>>   _2 = &MEM[base: _16, offset: 0B];
>>   vect__ifc__24.7_33 = .MASK_LOAD (_2, 4B, loop_mask_32);
>>   _28 = vect_cst__34 != { 0, ... };
>>   _35 = .COND_ADD (_28, vect_cst__27, { 1, ... }, vect__ifc__24.7_33);
>>   vect__ifc__26.8_36 = _35;
>>   .MASK_STORE (_2, 4B, loop_mask_32, vect__ifc__26.8_36);
>>   ivtmp_41 = ivtmp_40 + POLY_INT_CST [4, 4];
>>   next_mask_43 = .WHILE_ULT (ivtmp_41, 16, { 0, ... });
>>   ivtmp.13_15 = ivtmp.13_25 + POLY_INT_CST [16, 16];
>>
>> This makes the pattern matching less straightforward.
>
> Ah, that's because of the .COND_ADDs (I wonder about the copy that's
> left - forwprop should eliminate copies).  Note the pattern-matching
> could go in the
>
>   /* Apply forward propagation to all stmts in the basic-block.
>      Note we update GSI within the loop as necessary.  */
>
> loop which comes before the match.pd pattern matching so you'd
> still see the form without the .COND_ADD I think.
>
> There _is_ precedence for some masked-store post-processing
> in the vectorizer (optimize_mask_stores), not that I like that
> very much either.  Eventually those can be at least combined...

Thanks for your suggestion; indeed, .COND_ADD is generated later, in the
fold_stmt function.

I have updated the patch in the style of "forward propagation".  It starts
from a VEC_COND and forward-propagates it into a MASK_STORE when the
following pattern is found:

   X = MASK_LOAD (PTR, -, MASK)
   VAL = ...
   Y = VEC_COND (cond, VAL, X)
   MASK_STORE (PTR, -, MASK, Y)

> That said, I prefer the forwprop place for any pattern matching
> and the current patch needs more comments to understand
> what it is doing (the DCE it does is IMHO premature).  You
> should also modify the masked store in-place rather than
> building a new one.  I don't like how you need to use
> find_data_references_in_stmt, can't you simply compare
> the address and size arguments?

find_data_references_in_stmt was used because the two data references are
created as two new SSA_NAMEs from the same scalar pointer by the loop
vectorizer.  I cannot directly compare the addresses as they are complicated
by loop information.

By moving the functionality into forwprop, those complications have been
removed by the optimizers that run in between.  This makes a simple
comparison possible.

> It should also work for
> a non-masked load I guess and thus apply to non-SVE if
> we manage to feed the masked store with another conditional.

You are right: a non-masked load is a load with a mask of all ones.  As long
as the store mask is a subset of the load mask, and they load from the same
address, we can do this combining.  (I haven't added this yet as I don't
have a test case for it.)

This probably indicates there are more cases we could rewrite around masked
operations.
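To make the non-masked-load case concrete, here is a sketch in the same
pseudo notation as the comment in the patch (LOAD here just stands for an
ordinary unconditional vector load, i.e. a load whose mask is all ones; it
is not an internal function):

   X = LOAD (PTR)
   VAL = ...
   Y = VEC_COND (cond, VAL, X)
   MASK_STORE (PTR, -, MASK, Y)
     ==>
   AND_MASK = cond & MASK
   MASK_STORE (PTR, -, AND_MASK, VAL)

The lanes where MASK is set but cond is false would only have re-stored the
value just loaded, so dropping them from the store mask does not change the
stored result.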
Again, thanks for your review!

Renlin

>
> Richard.
>
>>
>> Thanks,
>> Renlin
>>
>>>
>>> Richard.
>>>

[Attachment: forwprop.diff]

commit bf42fb31fc5f9beda31a025ec3dbe244b558ed9a
Author: Renlin Li
Date:   Mon Nov 19 15:27:02 2018 +0000

    forwprop

diff --git a/gcc/testsuite/gcc.target/aarch64/sve/combine_vcond_mask_store_1.c b/gcc/testsuite/gcc.target/aarch64/sve/combine_vcond_mask_store_1.c
new file mode 100644
index 0000000..9213f06
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/combine_vcond_mask_store_1.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize --param allow-store-data-races=1" } */
+
+static int array[100];
+int test (int a, int i)
+{
+  for (unsigned i = 0; i < 16; i++)
+    {
+      if (a & 1)
+	array[i] = a + 1;
+    }
+  return array[i];
+}
+
+/* { dg-final { scan-assembler-not {ld1d} } } */
diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c
index 7449eaf..c954b2d 100644
--- a/gcc/tree-ssa-forwprop.c
+++ b/gcc/tree-ssa-forwprop.c
@@ -629,6 +629,163 @@ forward_propagate_into_cond (gimple_stmt_iterator *gsi_p)
   return 0;
 }
 
+/* Given the following pattern:
+     X = MASK_LOAD (PTR, -, MASK)
+     VAL = ...
+     Y = VEC_COND (cond, VAL, X)
+     MASK_STORE (PTR, -, MASK, Y)
+
+   They could be combined into a single MASK_STORE with new mask.
+   The new mask is the combination of original mask and the value selection mask
+   in VEC_COND_EXPR.
+
+     AND_MASK = MASK & cond
+     MASK_STORE (PTR, -, AND_MASK, VAL)
+
+   After the transformation, the MASK_LOAD and VEC_COND_EXPR might be dead.  */
+
+static bool
+forward_propagate_vec_cond_expr (gimple_stmt_iterator *gsi)
+{
+  gimple *stmt = gsi_stmt (*gsi);
+  if (!is_gimple_assign (stmt) || gimple_assign_rhs_code (stmt) != VEC_COND_EXPR)
+    return false;
+
+  imm_use_iterator iter;
+  gimple *use_stmt;
+  use_operand_p use_p;
+  tree lhs = gimple_assign_lhs (stmt);
+  bool rewrite = false;
+
+  FOR_EACH_IMM_USE_FAST (use_p, iter, lhs)
+    {
+      use_stmt = USE_STMT (use_p);
+      if (gimple_call_internal_p (use_stmt, IFN_MASK_STORE)
+	  && !gimple_has_volatile_ops (use_stmt))
+	{
+	  rewrite = true;
+	  break;
+	}
+    }
+
+  /* Early return if it doesn't have a user we are interested in.  */
+  if (!rewrite)
+    return false;
+
+  tree sel_cond = gimple_assign_rhs1 (stmt);
+  tree true_val = gimple_assign_rhs2 (stmt);
+  tree false_val = gimple_assign_rhs3 (stmt);
+
+  FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
+    {
+      if (!gimple_call_internal_p (use_stmt, IFN_MASK_STORE))
+	continue;
+
+      gimple *mask_store = use_stmt;
+      tree store_ptr = gimple_call_arg (mask_store, 0);
+      tree vec_op = gimple_call_arg (mask_store, 3);
+      tree store_mask = gimple_call_arg (mask_store, 2);
+      if (vec_op != lhs)
+	continue;
+
+      /* Looking for this pattern:
+	   X = MASK_LOAD (PTR, -, MASK)
+	   VAL = ...
+	   Y = VEC_COND (cond, VAL, X)
+	   MASK_STORE (PTR, -, MASK, Y)  */
+      gimple *mask_load = NULL;
+      bool reverse_p = false;
+
+      /* A[i] = cond ? val : A[i]  */
+      if (TREE_CODE (false_val) == SSA_NAME)
+	{
+	  gimple *def = SSA_NAME_DEF_STMT (false_val);
+	  if (gimple_call_internal_p (def, IFN_MASK_LOAD))
+	    mask_load = def;
+	}
+      /* A[i] = cond ? A[i] : val
+	 Transform into:
+	 A[i] = !cond ? val : A[i]  */
+      if (mask_load == NULL && TREE_CODE (true_val) == SSA_NAME)
+	{
+	  gimple *def = SSA_NAME_DEF_STMT (true_val);
+	  if (gimple_call_internal_p (def, IFN_MASK_LOAD))
+	    {
+	      enum tree_code code = TREE_CODE (sel_cond);
+	      tree op_type = TREE_TYPE (TREE_OPERAND (sel_cond, 0));
+	      code = invert_tree_comparison (code, HONOR_NANS (op_type));
+	      if (code == ERROR_MARK)
+		return false;
+
+	      sel_cond = build2_loc (EXPR_LOCATION (sel_cond), code,
+				     TREE_TYPE (sel_cond),
+				     TREE_OPERAND (sel_cond, 0),
+				     TREE_OPERAND (sel_cond, 1));
+	      mask_load = def;
+	      reverse_p = true;
+	    }
+	}
+
+      /* The pair must be in the same basic block, use the same mask,
+	 and access the same memory.  */
+      if (mask_load == NULL
+	  || gimple_has_volatile_ops (mask_load)
+	  || gimple_bb (mask_store) != gimple_bb (mask_load)
+	  || gimple_vuse (mask_store) != gimple_vuse (mask_load)
+	  || !operand_equal_p (store_mask, gimple_call_arg (mask_load, 2), 0)
+	  || !operand_equal_p (store_ptr, gimple_call_arg (mask_load, 0), 0)
+	  || !operand_equal_p (gimple_call_arg (mask_load, 1),
+			       gimple_call_arg (mask_store, 1),
+			       0))
+	continue;
+
+      if (dump_file)
+	{
+	  fprintf (dump_file, " Merge '");
+	  print_gimple_expr (dump_file, mask_load, 0);
+	  fprintf (dump_file, " into '");
+	  print_gimple_expr (dump_file, mask_store, 0);
+	  fprintf (dump_file, "'\n");
+	}
+
+      tree sel_mask
+	= force_gimple_operand_gsi (gsi,
+				    unshare_expr (sel_cond),
+				    true, NULL_TREE,
+				    true, GSI_SAME_STMT);
+
+      /* Replace the condition in COND_EXPR with the new variable.  */
+      if (reverse_p)
+	{
+	  gimple_set_op (stmt, 1, sel_mask);
+	  gimple_set_op (stmt, 2, false_val);
+	  gimple_set_op (stmt, 3, true_val);
+
+	  update_stmt (stmt);
+	  true_val = false_val;
+	}
+      else
+	{
+	  gimple_set_op (stmt, 1, sel_mask);
+	  update_stmt (stmt);
+	}
+
+      /* Create the new mask which should have less active lanes.  */
+      tree and_mask = make_temp_ssa_name (TREE_TYPE (store_mask),
+					  NULL, "vec_mask_and");
+      gimple *and_stmt = gimple_build_assign (and_mask, BIT_AND_EXPR,
+					      sel_mask, store_mask);
+      gsi_insert_before (gsi, and_stmt, GSI_SAME_STMT);
+
+      /* Update the MASK_STORE with new mask and value.  */
+      gimple_call_set_arg (mask_store, 2, and_mask);
+      gimple_call_set_arg (mask_store, 3, true_val);
+      update_stmt (mask_store);
+    }
+
+  return has_zero_uses (lhs);
+}
+
 /* We've just substituted an ADDR_EXPR into stmt.  Update all the
    relevant data structures to match.  */
 
@@ -2440,6 +2597,12 @@ pass_forwprop::execute (function *fun)
 	    else
 	      gsi_next (&gsi);
 	  }
+	else if (code == VEC_COND_EXPR
+		 && forward_propagate_vec_cond_expr (&gsi))
+	  {
+	    release_defs (stmt);
+	    gsi_remove (&gsi, true);
+	  }
 	else
 	  gsi_next (&gsi);
       }