From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=8IU4=LK=suse.de=rguenther@sourceware.org>
Received: from smtp-out1.suse.de (smtp-out1.suse.de [IPv6:2a07:de40:b251:101:10:150:64:1])
	by sourceware.org (Postfix) with ESMTPS id 490AA3858C24
	for <gcc-patches@gcc.gnu.org>; Fri,  5 Apr 2024 07:07:09 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 490AA3858C24
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=suse.de
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 490AA3858C24
Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=2a07:de40:b251:101:10:150:64:1
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1712300831; cv=none;
	b=wm51DMB3I/GKfHDNydFX+5HQwWBRSNj0wCxkgOsYfflao5uHDFGy7KQXKOSTTThKebKoex+o1XNKTYDvRfJhGjKw79MBkfvxBmgID+S1I0T9obtTIUzJa880Jnqhh0w5fJQQxl1bztHoSX7noK3IrUlrJv9RYX4P5BQ9xaFNsjQ=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
	t=1712300831; c=relaxed/simple;
	bh=tmGzJJZqWMZSjWIz5O47NcZbea0DqQ3cS5ZxmS9U2mY=;
	h=DKIM-Signature:DKIM-Signature:DKIM-Signature:DKIM-Signature:Date:
	 From:To:Subject:Message-ID:MIME-Version; b=sxbCTrGxpoWlIwaPkOYxEDxHR+4l8u5HzBSM4Zuci1u8ZCpE1LSrZS1JeNiD+WOpbj/Wsg1bX/1qSHtWVWRfRZ629RATN6bGvSv8Cw7Lkx6mXP764DzO41sr4ouhj5IgU889wm78ud2TRqzSFtC3e/Q/N1jrwmIrzcVCFN296a4=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: from [10.168.5.241] (unknown [10.168.5.241])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256)
	(No client certificate requested)
	by smtp-out1.suse.de (Postfix) with ESMTPS id CC60F219AE;
	Fri,  5 Apr 2024 07:07:06 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa;
	t=1712300828; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc:
	 mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=vrUTCPGT2qowZRs/hk7xjxkOz0lgnSegd5aFvmyfmWY=;
	b=ssrNxTWhcyoETMRO+fyK5uYu8whwaJ1tnzYV8s7ioifN2KvhyM7HJX8j4/Kl8+hxmE99jK
	iwQqz46im/PDUnbcsaP4eOxbWSHq+Qw588SQfreuCUA+G5nAQdbMQrml5XlH9yiFGKeX3u
	aSdpb0U+xmvkhj6H1j5fgGeuhQiOr1s=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
	s=susede2_ed25519; t=1712300828;
	h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc:
	 mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=vrUTCPGT2qowZRs/hk7xjxkOz0lgnSegd5aFvmyfmWY=;
	b=/xa/ZxHphQFRfdGDMtXgLWC+xOwK2xO7j+k87J7Z+A4lYblzqbxgTOs5BQkgZXf13XWdYk
	f2Br6EapHxtP84CA==
Authentication-Results: smtp-out1.suse.de;
	none
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa;
	t=1712300826; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc:
	 mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=vrUTCPGT2qowZRs/hk7xjxkOz0lgnSegd5aFvmyfmWY=;
	b=exsU5beaSixEeyvRoK936l0pRZCnp2ZyyA7pUgXiU1dYaFQy8GH4lYtIfJb6n//m58AVLJ
	clDEp1pRSwhK7QeerxEjhPSFOrznzy3rAX6km9EXCOkl9E9JCCRrgo8a1wfV2QRYd17gTM
	BnrWboKbpDAE1PGE6AYS8YqMvNVtumw=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
	s=susede2_ed25519; t=1712300826;
	h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc:
	 mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=vrUTCPGT2qowZRs/hk7xjxkOz0lgnSegd5aFvmyfmWY=;
	b=ciOeW622/RJ/u+DaVjlAb82bFLABMkssZ52pv/yzBZWHStBEwfo4XXvgKQFsVE6XPFa0pr
	CBdCD8mJrN8drDAA==
Date: Fri, 5 Apr 2024 09:07:06 +0200 (CEST)
From: Richard Biener <rguenther@suse.de>
To: Tamar Christina <tamar.christina@arm.com>
cc: gcc-patches@gcc.gnu.org, nd@arm.com, jlaw@ventanamicro.com, 
    richard.sandiford@arm.com
Subject: Re: [PATCH]middle-end vect: adjust loop upper bounds when peeling
 for gaps and early break [PR114403]
In-Reply-To: <patch-18385-tamar@arm.com>
Message-ID: <8s8877p0-rqno-p9rr-7nno-2rr0n34n6q65@fhfr.qr>
References: <patch-18385-tamar@arm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
X-Spam-Level: 
X-Spamd-Result: default: False [-4.30 / 50.00];
	BAYES_HAM(-3.00)[100.00%];
	NEURAL_HAM_LONG(-1.00)[-1.000];
	NEURAL_HAM_SHORT(-0.20)[-1.000];
	MIME_GOOD(-0.10)[text/plain];
	MISSING_XM_UA(0.00)[];
	MIME_TRACE(0.00)[0:+];
	TO_DN_SOME(0.00)[];
	ARC_NA(0.00)[];
	RCVD_COUNT_ZERO(0.00)[0];
	FROM_HAS_DN(0.00)[];
	DKIM_SIGNED(0.00)[suse.de:s=susede2_rsa,suse.de:s=susede2_ed25519];
	FROM_EQ_ENVFROM(0.00)[];
	RCPT_COUNT_FIVE(0.00)[5];
	TO_MATCH_ENVRCPT_ALL(0.00)[];
	FUZZY_BLOCKED(0.00)[rspamd.com];
	DBL_BLOCKED_OPENRESOLVER(0.00)[tree-vect-loop.cc:url,generic-match-8.cc:url,generic-match-1.cc:url]
X-Spam-Score: -4.30
X-Spam-Status: No, score=-10.9 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,GIT_PATCH_0,SPF_HELO_NONE,SPF_PASS,TXREP,WEIRD_PORT autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

On Thu, 4 Apr 2024, Tamar Christina wrote:

> Hi All,
> 
> The report shows that we end up in a situation where the code has been peeled
> for gaps and we have an early break.
> 
> The code for peeling for gaps assumes that a scalar loop needs to perform at
> least one iteration.  However, this doesn't take into account early break, where
> the scalar loop may not need to be executed.

But we always re-start the vector iteration where the early break happens?

> That the early break loop can be partial is not accounted for in this scenario.
> Loop partiality is normally handled by setting bias_for_lowest to 1, but when
> peeling for gaps we end up with 0, which when the loop upper bounds are
> calculated means that a partial loop iteration loses the final partial iter:
> 
> Analyzing # of iterations of loop 1
>   exit condition [8, + , 18446744073709551615] != 0
>   bounds on difference of bases: -8 ... -8
>   result:
>     # of iterations 8, bounded by 8
> 
> and a VF=4 calculating:
> 
> Loop 1 iterates at most 1 times.
> Loop 1 likely iterates at most 1 times.
> Analyzing # of iterations of loop 1
>   exit condition [1, + , 1](no_overflow) < bnd.5505_39
>   bounds on difference of bases: 0 ... 4611686018427387902
> Matching expression match.pd:2011, generic-match-8.cc:27
> Applying pattern match.pd:2067, generic-match-1.cc:4813
>   result:
>     # of iterations bnd.5505_39 + 18446744073709551615, bounded by 4611686018427387902
> Estimating sizes for loop 1
> ...
>    Induction variable computation will be folded away.
>   size:   2 if (ivtmp_312 < bnd.5505_39)
>    Exit condition will be eliminated in last copy.
> size: 24-3, last_iteration: 24-5
>   Loop size: 24
>   Estimated size after unrolling: 26
> ;; Guessed iterations of loop 1 is 0.858446. New upper bound 1.
> 
> upper bound should be 2 not 1.

Why?  This means the vector loop latch will execute once (thus the body
is executed twice), isn't that correct?  Peeling for gaps means the
main IV will exit the loop in time.
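
To make the numbers in the quoted dump concrete (scalar latch bound 8,
VF=4), here is a minimal sketch of the arithmetic.  It is a simplified
model, assuming the vector bound is a rounded division of (scalar latch
bound + bias) by VF, minus one; the function name and its parameters are
hypothetical and not taken from tree-vect-loop.cc.

```c
#include <assert.h>

/* Simplified model (an assumption, not the GCC source): convert a bound
   on the scalar loop's latch count into a bound on the vector loop's
   latch count.  BIAS plays the role of bias_for_lowest.  */
unsigned long
vector_latch_bound (unsigned long scalar_latch_bound, unsigned long bias,
                    unsigned long vf, int final_iter_may_be_partial)
{
  assert (vf > 0);
  unsigned long iters = scalar_latch_bound + bias;
  unsigned long vec_iters = final_iter_may_be_partial
                            ? (iters + vf - 1) / vf  /* round up */
                            : iters / vf;            /* round down */
  /* The model only makes sense if the vector loop runs at least once.  */
  assert (vec_iters > 0);
  /* -1 converts the vector iteration count back to a latch count.  */
  return vec_iters - 1;
}
```

Under this model a bias of 0 gives a latch bound of 1 with either
rounding (two executions of the vector body, covering 8 of the 9 scalar
iterations), while a bias of 1 with rounding up gives 2.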

> 
> This patch forces bias_for_lowest to be 1 even when peeling for gaps.

(*)

> I have however not been able to write a standalone reproducer for this so I have
> no tests but bootstrap and LLVM build fine now.
> 
> The testcase:
> 
> #define COUNT 9
> #define SIZE COUNT * 4
> #define TYPE unsigned long
> 
> TYPE x[SIZE], y[SIZE];
> 
> void __attribute__((noipa))
> loop (TYPE val)
> {
>   for (int i = 0; i < COUNT; ++i)
>     {
>       if (x[i * 4] > val || x[i * 4 + 1] > val)
>         return;
>       x[i * 4] = y[i * 2] + 1;
>       x[i * 4 + 1] = y[i * 2] + 2;
>       x[i * 4 + 2] = y[i * 2 + 1] + 3;
>       x[i * 4 + 3] = y[i * 2 + 1] + 4;
>     }
> }
> 
> does perform the peeling for gaps and early break, however it creates a hybrid
> loop which works fine.  Adjusting the indices to non-linear also works.  So I'd
> like to submit the fix and work on a testcase separately if needed.

You can have peeling for gaps without SLP by doing interleaving.

#define COUNT 9
#define TYPE unsigned long

TYPE x[COUNT], y[COUNT*2];

void __attribute__((noipa))
loop (TYPE val)
{
  for (int i = 0; i < COUNT; ++i)
    {
      if (x[i] > val)
        return;
      x[i] = y[i * 2];
    }
}

gets me partial vectors and peeling for gaps with -O3 -march=armv8.2-a+sve
-fno-vect-cost-model (with cost modeling we choose ADVSIMD).  Does
this reproduce the issue?
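
To turn the candidate reproducer into a runtime check, something along
these lines could be used.  This is only a sketch: check_case and its
fill pattern (x[i] = i, y[j] = 100 + j) are hypothetical, and it only
exercises the vector path when built with the options mentioned above on
a target that supports them.

```c
#define COUNT 9
#define TYPE unsigned long

TYPE x[COUNT], y[COUNT * 2];

void __attribute__((noipa))
loop (TYPE val)
{
  for (int i = 0; i < COUNT; ++i)
    {
      if (x[i] > val)
        return;
      x[i] = y[i * 2];
    }
}

/* Hypothetical helper: initialize the arrays, run loop (VAL) and return
   the number of elements of x that differ from the scalar semantics.  */
int
check_case (TYPE val)
{
  for (int i = 0; i < COUNT; ++i)
    x[i] = i;
  for (int j = 0; j < COUNT * 2; ++j)
    y[j] = 100 + j;

  loop (val);

  int bad = 0;
  int stopped = 0;
  for (int i = 0; i < COUNT; ++i)
    {
      /* The loop returns at the first i whose original x[i] exceeds VAL;
         elements from there on must be left untouched.  */
      if (!stopped && (TYPE) i > val)
        stopped = 1;
      TYPE expect = stopped ? (TYPE) i : (TYPE) (100 + 2 * i);
      bad += x[i] != expect;
    }
  return bad;
}
```

A large VAL covers the case with no early break; small values force the
break inside the first or second vector iteration.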

Richard.


> Bootstrapped Regtested on x86_64-pc-linux-gnu no issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	PR tree-optimization/114403
> 	* tree-vect-loop.cc (vect_transform_loop): Adjust upper bounds for when
> 	peeling for gaps and early break.
> 
> ---
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 4375ebdcb493a90fd0501cbb4b07466077b525c3..bf1bb9b005c68fbb13ee1b1279424865b237245a 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -12139,7 +12139,8 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
>    /* The minimum number of iterations performed by the epilogue.  This
>       is 1 when peeling for gaps because we always need a final scalar
>       iteration.  */
> -  int min_epilogue_iters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0;
> +  int min_epilogue_iters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> +			   && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo) ? 1 : 0;

(*) This adjusts min_epilogue_iters though and honestly the whole code
looks like a mess.  I'm quoting a bit more here:

>    /* +1 to convert latch counts to loop iteration counts,
>       -min_epilogue_iters to remove iterations that cannot be performed
>         by the vector code.  */
  int bias_for_lowest = 1 - min_epilogue_iters;
  int bias_for_assumed = bias_for_lowest;

The variable names and comments now have nothing to do with the
actual magic we compute into them.

I think it would be an improvement to disentangle this a bit, for
example:

   /* +1 to convert latch counts to loop iteration counts.  */
   int bias_for_lowest = 1;
   /* Comment, explain why peeling for gaps isn't relevant.  */
   if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
     bias_for_lowest += ... ?
   /* When peeling for gaps we have at least one scalar iteration in
      the epilog.  */
   else if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
     bias_for_lowest -= 1;
   int bias_for_assumed = bias_for_lowest;
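
The same structure can be modeled outside the vectorizer to compare it
against the patch.  This is a sketch only: struct vect_flags is a
hypothetical stand-in for the loop_vec_info queries, and the early-break
adjustment is deliberately left as a parameter because its value is the
open question here.

```c
#include <stdbool.h>

/* Hypothetical stand-in for LOOP_VINFO_EARLY_BREAKS and
   LOOP_VINFO_PEELING_FOR_GAPS.  */
struct vect_flags
{
  bool early_breaks;
  bool peeling_for_gaps;
};

/* EARLY_BREAK_ADJUST is a placeholder for the undecided adjustment; it
   is an assumption of this sketch, not something the patch defines.  */
int
bias_for_lowest (struct vect_flags f, int early_break_adjust)
{
  /* +1 to convert latch counts to loop iteration counts.  */
  int bias = 1;
  if (f.early_breaks)
    bias += early_break_adjust;
  /* When peeling for gaps we have at least one scalar iteration in
     the epilog.  */
  else if (f.peeling_for_gaps)
    bias -= 1;
  return bias;
}
```

Passing 0 as the adjustment yields the same values as the quoted patch:
1 whenever early breaks are present, otherwise 0 with peeling for gaps
and 1 without.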

I'm still not convinced that you need to "ignore" peeling for gaps
when doing early exit vectorization.

Richard.