Date: Thu, 20 May 2021 12:22:41 +0200 (CEST)
From: Richard Biener
To: "Andre Vieira (lists)"
cc: Richard Sandiford, "gcc-patches@gcc.gnu.org"
Subject: Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization

On Mon, 17 May 2021, Andre Vieira (lists) wrote:

> Hi,
>
> So this is my second attempt at finding a way to improve how we
> generate the vector IVs and teach the vectorizer to share them between
> the main loop and its epilogues.
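The idea in the paragraph above (handing the main loop's updated IV to the epilogue instead of recomputing its starting point) can be sketched in plain C. VF and the function name are illustrative stand-ins, not the vectorizer's actual output:

```c
/* Illustrative sketch only: the "vectorized" main loop advances the
   index IV by VF elements per iteration, and the scalar epilogue then
   starts from the updated IV value instead of recomputing it from the
   main loop's iteration count.  */
#define VF 4

void add_arrays (int *x, const int *y, int n)
{
  int i = 0;
  /* Main loop: VF elements per iteration (the inner loop stands in
     for a single vector operation).  */
  for (; i + VF <= n; i += VF)
    for (int j = 0; j < VF; ++j)
      x[i + j] += y[i + j];
  /* Scalar epilogue: reuses the live value of 'i' as its base, which
     is the sharing the patch is after.  */
  for (; i < n; ++i)
    x[i] += y[i];
}
```

The same pattern applies when the base is a data-reference pointer rather than an index.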
> On IRC we discussed my idea to use the loop's control_iv, but that was
> a terrible idea and I quickly threw it in the bin. The main problem,
> which for some reason I failed to see, was that the control_iv
> increases by 's' and the datarefs by 's' * NELEMENTS, where 's' is
> usually 1 and NELEMENTS is the number of elements we handle per
> iteration. That means the epilogue loops would have to start from the
> last loop's IV * the last loop's NELEMENTS, and that would just cause
> a mess.
>
> Instead I started to think about creating IVs for the datarefs, and
> what I thought worked best was to create these in the scalar loop
> before peeling. That way the peeling machinery takes care of
> duplicating them for the vector and scalar epilogues, and it also
> takes care of adding phi-nodes for the skip_vector paths.

How does this work for if-converted loops, where we use the
non-if-converted scalar loop for (scalar) peeling but the if-converted
scalar loop for vectorized epilogues?  I suppose you're only adjusting
the if-converted copy.

> These new IVs have two functions:
> 1) 'vect_create_data_ref_ptr' can use them to:
>  a) if it's the main loop: replace the 'initial' value of the main
>     loop's IV and the initial values in the skip_vector phi-nodes
>  b) update the skip_vector phi-nodes' argument for the non-skip path
>     with the updated vector ptr.

b) means the prologue IV will not be dead there, so we actually need to
compute it?  I suppose IVOPTs could be taught to replace an IV with its
final value (based on some other IV) when it's unused?  Or does it
already magically do the right thing?

> 2) They are used for the scalar epilogue, ensuring they share the same
> dataref ptr.
>
> There are still a variety of 'hacky' elements here and a lot of
> testing to be done, but I hope to be able to clean them away.
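The final-value replacement raised above can be illustrated with a small C sketch. The function names are made up, and this shows the general transformation only, not IVOPTs' actual implementation:

```c
/* Sketch of final-value replacement: if the pointer IV is otherwise
   unused inside the loop, its value at loop exit can be computed
   directly from the initial value and the trip count, and the
   loop-carried increment can be deleted.  */
int *final_iv_carried (int *p, int n)
{
  for (int i = 0; i < n; ++i)
    p++;                        /* IV kept live just for its exit value */
  return p;
}

int *final_iv_replaced (int *p, int n)
{
  return p + (n > 0 ? n : 0);   /* final value computed directly */
}
```

Expressing the prologue IV's exit value in terms of another IV this way is what would let the otherwise-dead IV disappear.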
> One of the main issues I had was that I had to skip a couple of checks
> and things for the added phi-nodes and update statements, as these do
> not have a stmt_vec_info representation.  Though I'm not sure adding
> this representation at their creation was much cleaner... It is
> something I could play around with, but I thought this was a good
> moment to ask you for input. For instance, maybe we could do this
> transformation before analysis?
>
> Also be aware that because I create an IV for each dataref this leads
> to regressions with SVE codegen, for instance. NEON is able to use the
> post-index addressing mode to increment each dr IV at access time, but
> SVE can't do this.  For this I don't know if maybe we could try to be
> smart and create shared IVs. So rather than base them on the actual
> vector ptr, use a shared sizetype IV that can be shared among dr IVs
> with the same step. Or maybe this is something for IVOPTs?

Certainly IVOPTs could decide to use the newly created IVs in the
scalar loops for the DRs therein as well.  But since IVOPTs only
considers a single loop at a time it will probably not pay too much
attention, and it is only influenced by the out-of-loop uses of the
final values of the IVs.

My gut feeling tells me that whatever we do, we'll have to look into
improving IVOPTs to consider multiple loops.

> Let me know what ya think!

Now as to the patch itself.  We shouldn't amend struct data_reference;
try to use dr_vec_info instead.

--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -4941,11 +4941,14 @@ vect_create_data_ref_ptr (vec_info *vinfo, stmt_vec_info stmt_info,
     }
 
   standard_iv_increment_position (loop, &incr_gsi, &insert_after);
+  gphi *update_phi
+    = as_a <gphi *> (SSA_NAME_DEF_STMT (*dr->iv_bases->get (loop)));

and avoid this by having some explicit representation of IVs.  iv_bases
is a map that exists for each DR and maps the loop to the IV def, it
seems.  I'd have added the map to loop_vinfo, mapping the (vect-)DR to
the IV [def].
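The shared-IV idea raised above (one sizetype offset IV serving every dataref with the same step, rather than one pointer IV per dataref) could look roughly like this at the source level. VF and all names are hypothetical; this assumes n >= 0:

```c
#include <stddef.h>

/* per_dr_ivs keeps one pointer IV per data reference (a good fit for
   NEON post-index addressing); shared_offset_iv keeps a single
   sizetype offset IV that every dataref with the same step indexes
   from (the scheme suggested for SVE).  Both leave the tail to an
   epilogue, here simply untouched.  */
#define VF 4

void per_dr_ivs (int *x, const int *y, int n)
{
  int *px = x;                  /* one IV per dataref */
  const int *py = y;
  for (int i = 0; i + VF <= n; i += VF, px += VF, py += VF)
    for (int j = 0; j < VF; ++j)
      px[j] += py[j];
}

void shared_offset_iv (int *x, const int *y, int n)
{
  /* single offset IV shared by both datarefs */
  for (size_t off = 0; off + VF <= (size_t) n; off += VF)
    for (int j = 0; j < VF; ++j)
      x[off + j] += y[off + j];
}
```

The second form trades per-dataref post-increment for base-plus-offset addressing, which is why which one wins is target-dependent.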
That probably makes the convenience of transforming the scalar loop
before peeling go away, but I wonder whether that's a good idea anyway.

@@ -9484,6 +9480,51 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
   tree advance;
   drs_init_vec orig_drs_init;
 
+  if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) == NULL)
+    {
+      struct data_reference *dr;
+      unsigned int i;
+      gimple_seq stmts;
+      gimple_stmt_iterator gsi
+        = gsi_for_stmt (get_loop_exit_condition (loop));
+
+      FOR_EACH_VEC_ELT (LOOP_VINFO_DATAREFS (loop_vinfo), i, dr)
+        {
+          tree &iv_base = dr->iv_bases->get_or_insert (loop);

there's a comment missing - does this try to replace the IVs used by
the scalar DRs in the non-if-converted loop by the new artificial IVs?

I notice that even for

void foo (int * __restrict x, int *y, int n1, int n2)
{
  int i;
  for (i = 0; i < n1; ++i)
    x[i] += y[i];
  for (; i < n2; ++i)
    x[i] -= y[i];
}

on x86 we're unable to re-use the IV from the first loop but end up
with

.L3:
        movl    (%rsi,%rax,4), %r8d
        addl    %r8d, (%rdi,%rax,4)
        addq    $1, %rax
        cmpq    %rax, %r9
        jne     .L3
.L2:
        cmpl    %edx, %ecx
        jle     .L1
        movslq  %edx, %rdx        <--- init from n1
        .p2align 4,,10
        .p2align 3
.L5:
        movl    (%rsi,%rdx,4), %eax
        subl    %eax, (%rdi,%rdx,4)
        addq    $1, %rdx
        cmpl    %edx, %ecx
        jg      .L5

We're also making it difficult for IVOPTs to eventually see the
connection, since we've propagated the final value of the IV from the
first loop.  Disabling SCCP makes it even worse, though.

I wonder how other compilers deal with this situation... and how we
prevent IVOPTs from messing things up again.

Richard.
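For reference, the final-value propagation described in the message corresponds roughly to rewriting foo's second loop with a fresh IV initialized from n1, which is what hides the reuse opportunity from IVOPTs. This is a source-level sketch only (the real transformation happens on GIMPLE) and assumes n1 >= 0:

```c
/* foo as given in the thread, plus an approximation of its shape after
   final-value replacement: the second loop gets a fresh IV initialized
   from n1, so the two loops no longer share an IV.  */
void foo (int * __restrict x, int *y, int n1, int n2)
{
  int i;
  for (i = 0; i < n1; ++i)
    x[i] += y[i];
  for (; i < n2; ++i)
    x[i] -= y[i];
}

void foo_after_fvr (int * __restrict x, int *y, int n1, int n2)
{
  for (int i = 0; i < n1; ++i)
    x[i] += y[i];
  for (int j = n1; j < n2; ++j)   /* fresh IV, init propagated from n1 */
    x[j] -= y[j];
}
```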