Date: Wed, 16 Jun 2021 13:13:55 +0200 (CEST)
From: Richard Biener
To: "Andre Vieira (lists)"
Cc: Richard Sandiford, "gcc-patches@gcc.gnu.org"
Subject: Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization
On Wed, 16 Jun 2021, Andre Vieira (lists) wrote:

>
> On 14/06/2021 11:57, Richard Biener wrote:
> > On Mon, 14 Jun 2021, Richard Biener wrote:
> >
> >> Indeed.  For example a simple
> >>
> >>   int a[1024], b[1024], c[1024];
> >>
> >>   void foo(int n)
> >>   {
> >>     for (int i = 0; i < n; ++i)
> >>       a[i+1] += c[i+i] ? b[i+1] : 0;
> >>   }
> >>
> >> should usually see peeling for alignment (though on x86 you need
> >> exotic -march= since cost models generally have equal aligned and
> >> unaligned access costs).  For example with -mavx2 -mtune=atom
> >> we'll see an alignment peeling prologue, an AVX2 vector loop,
> >> an SSE2 vectorized epilogue and a scalar epilogue.  It also
> >> shows the original scalar loop being used in the scalar prologue
> >> and epilogue.
> >>
> >> We're not even trying to make the counting IV easily used
> >> across loops (we're not counting scalar iterations in the
> >> vector loops).
> > Specifically we see
> >
> >   [local count: 94607391]:
> >   niters_vector_mult_vf.10_62 = bnd.9_61 << 3;
> >   _67 = niters_vector_mult_vf.10_62 + 7;
> >   _64 = (int) niters_vector_mult_vf.10_62;
> >   tmp.11_63 = i_43 + _64;
> >   if (niters.8_45 == niters_vector_mult_vf.10_62)
> >     goto ; [12.50%]
> >   else
> >     goto ; [87.50%]
> >
> > after the main vect loop, recomputing the original IV (i) rather
> > than using the inserted canonical IV.  And then the vectorized
> > epilogue header check doing
> >
> >   [local count: 93293400]:
> >   # i_59 = PHI
> >   # _66 = PHI <_67(33), 0(18)>
> >   _96 = (unsigned int) n_10(D);
> >   niters.26_95 = _96 - _66;
> >   _108 = (unsigned int) n_10(D);
> >   _109 = _108 - _66;
> >   _110 = _109 + 4294967295;
> >   if (_110 <= 3)
> >     goto ; [10.00%]
> >   else
> >     goto ; [90.00%]
> >
> > re-computing everything from scratch again (also notice how
> > the main vect loop guard jumps around the alignment prologue
> > as well and lands here - and the vectorized epilogue using
> > unaligned accesses - good!).
> >
> > That is, I'd expect _much_ easier jobs if we'd manage to
> > track the number of performed scalar iterations (or the
> > number of scalar iterations remaining) using the canonical
> > IV we add to all loops across all of the involved loops.
> >
> > Richard.
>
> So I am now looking at using an IV that counts scalar iterations rather
> than vector iterations, and reusing that through all loops (prologue,
> main loop, vect_epilogue and scalar epilogue).  The first part is easy,
> since that's what we already do for partial vectors or non-constant VFs.
> The latter requires some plumbing, and removing a lot of the code in
> there that creates new IVs going from [0, niters - previous iterations].
> I don't yet have a clear-cut view of how to do this; I first thought of
> keeping track of the 'control' IV in the loop_vinfo, but the prologue
> and scalar epilogues won't have one.
> 'loop' keeps a control_ivs struct, but that is used for overflow
> detection and only keeps track of what looks like a constant 'base'
> and 'step'.  Not quite sure how all that works, but intuitively it
> doesn't seem like the right thing to reuse.

Maybe it's enough to maintain this [remaining] scalar iterations counter
between loops, thus after the vector loop do

  remain_scalar_iter -= vector_iters * vf;

etc.  This should make it possible to do some first-order cleanups,
avoiding some repeated computations.  It does involve placing additional
PHIs for this remain_scalar_iter var of course (I'd be hesitant to rely
on the SSA renamer for this due to its expense).  I think that for all
later jump-around tests, tracking remaining scalar iters is more
convenient than tracking performed scalar iters.

> I'll go hack around and keep you posted on progress.

Thanks - it's an iffy area ...

Richard.
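To make the suggestion above concrete, here is a hedged C sketch - not GCC code, and the names (remain_scalar_iter, MAIN_VF, EPIL_VF, prologue_peel) are illustrative - of how one shared remaining-scalar-iterations counter could thread through the peeling prologue, the main vector loop, the vectorized epilogue and the scalar epilogue. Each stage simply decrements the counter (the main loop effectively performing remain_scalar_iter -= vector_iters * vf, as in the mail), so each later guard is a plain compare of the counter instead of a recomputation of niters from n:

```c
#include <assert.h>

/* Illustrative vectorization factors; in GCC these would come from the
   target's vector modes (e.g. 8 ints per AVX2 vector, 4 per SSE2).  */
#define MAIN_VF 8
#define EPIL_VF 4

/* Sketch of a[i] += b[i] lowered the way the vectorizer splits it,
   but with one shared remaining-iteration counter across all loops.  */
void foo(int *a, const int *b, int n, int prologue_peel)
{
    int remain_scalar_iter = n;   /* scalar iterations still to do */
    int i = 0;                    /* performed scalar iterations   */

    /* Alignment peeling prologue: scalar, bounded by the peel count.  */
    for (int p = 0; p < prologue_peel && remain_scalar_iter > 0; ++p) {
        a[i] += b[i];
        ++i;
        --remain_scalar_iter;
    }

    /* Main vector loop: each "vector" iteration covers MAIN_VF scalar
       iterations, so the counter steps down by MAIN_VF per iteration.  */
    while (remain_scalar_iter >= MAIN_VF) {
        for (int l = 0; l < MAIN_VF; ++l)   /* stands in for a vector op */
            a[i + l] += b[i + l];
        i += MAIN_VF;
        remain_scalar_iter -= MAIN_VF;
    }

    /* Vectorized epilogue at a smaller VF: its guard is just the
       compare of the shared counter above, nothing recomputed.  */
    while (remain_scalar_iter >= EPIL_VF) {
        for (int l = 0; l < EPIL_VF; ++l)
            a[i + l] += b[i + l];
        i += EPIL_VF;
        remain_scalar_iter -= EPIL_VF;
    }

    /* Scalar epilogue for the tail.  */
    while (remain_scalar_iter > 0) {
        a[i] += b[i];
        ++i;
        --remain_scalar_iter;
    }

    assert(i == n && remain_scalar_iter == 0);
}
```

Note that i (performed iterations) falls out as a by-product here; as said above, for the jump-around guards only the remaining count is actually needed.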