From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=D95m=DF=suse.de=rguenther@sourceware.org>
Received: from smtp-out1.suse.de (smtp-out1.suse.de [IPv6:2001:67c:2178:6::1c])
	by sourceware.org (Postfix) with ESMTPS id 073993858C66
	for <gcc-patches@gcc.gnu.org>; Wed, 19 Jul 2023 11:52:55 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 073993858C66
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=suse.de
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de
Received: from relay2.suse.de (relay2.suse.de [149.44.160.134])
	by smtp-out1.suse.de (Postfix) with ESMTP id 1738321BBC;
	Wed, 19 Jul 2023 11:52:54 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa;
	t=1689767574; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc:
	 mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=NoE6Hw9nZuMdFUDgUigjDZ7034UKGEr38TGOnofxKw0=;
	b=TIx0L5om5ErOxckNIzSa7o+ugAz2PzW3EWMqanIidIj3v1c5/HAzgU6E6KR2Zkr2MYNKBS
	yy2mnzZyDdpTHokmfiUwrADW9Yqlvgel4DP+mol5jRco9s3sRloMIMdsD/4Se6oAXCezC7
	TIZ7x3mbYgyrwPKTSIHta46wlzKT0hI=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
	s=susede2_ed25519; t=1689767574;
	h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc:
	 mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=NoE6Hw9nZuMdFUDgUigjDZ7034UKGEr38TGOnofxKw0=;
	b=qyD0F85FPgl6IUVdL5aGHNww9alKtm/bjeCOjnrurCKlklvL2+mY5w1oHWdEEr4aGO5TZK
	N2rU27vzgvS7pMCA==
Received: from wotan.suse.de (wotan.suse.de [10.160.0.1])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by relay2.suse.de (Postfix) with ESMTPS id EAB152C142;
	Wed, 19 Jul 2023 11:52:53 +0000 (UTC)
Date: Wed, 19 Jul 2023 11:52:53 +0000 (UTC)
From: Richard Biener <rguenther@suse.de>
To: Matthew Malcomson <matthew.malcomson@arm.com>
cc: gcc-patches@gcc.gnu.org, ook@ucw.cz, tamar.christina@arm.com
Subject: Re: vectorizer: Avoid an OOB access from vectorization
In-Reply-To: <24ccac55-7ec3-4f1b-b53b-6496df8b5c32@AZ-NEU-EX03.Arm.com>
Message-ID: <nycvar.YFH.7.77.849.2307191152200.12935@jbgna.fhfr.qr>
References: <7f2d155c-20e4-4bae-89d8-849882526a07@AZ-NEU-EX03.Arm.com> <24ccac55-7ec3-4f1b-b53b-6496df8b5c32@AZ-NEU-EX03.Arm.com>
User-Agent: Alpine 2.22 (LSU 394 2020-01-19)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
X-Spam-Status: No, score=-11.0 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,GIT_PATCH_0,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

On Tue, 18 Jul 2023, Matthew Malcomson wrote:

> Tamar pointed out it would be good to have a `scan-tree-dump` in the testcase
> just to make sure that when something is currently vectorizing it stays
> vectorizing (and hence that the new code is still likely running).
> 
> Attached patch has that change, also inlined for ease of reply.

The patch is OK with a suitable changelog.

Thanks,
Richard.

> ------------------------------
> > Our checks for whether the vectorization of a given loop would make an
> > out of bounds access miss the case when the vector we load is so large
> > as to span multiple iterations worth of data (while only being there to
> > implement a single iteration).
> > 
> > This patch adds a check for such an access.
> > 
> > Example where this was going wrong (smaller version of testcase added):
> > 
> > ```
> >   extern unsigned short multi_array[5][16][16];
> >   extern void initialise_s(int *);
> >   extern int get_sval();
> > 
> >   void foo() {
> >     int s0 = get_sval();
> >     int s[31];
> >     int i,j;
> >     initialise_s(&s[0]);
> >     s0 = get_sval();
> >     for (j=0; j < 16; j++)
> >       for (i=0; i < 16; i++)
> > 	multi_array[1][j][i]=s[j*2];
> >   }
> > ```
> > 
> > With the above loop we would load the `s[j*2]` integer into a 4 element
> > vector, which reads 3 extra elements than the scalar loop would.
> > `get_group_load_store_type` identifies that the loop requires a scalar
> > epilogue due to gaps.  However we do not identify that the above code
> > requires *two* scalar loops to be peeled due to the fact that each
> > iteration loads an amount of data from the *next* iteration (while not
> > using it).
> > 
> > Bootstrapped and regtested on aarch64-none-linux-gnu.
> > N.b. out of interest we came across this working with Morello.
> > 
> > 
> 
> 
> ###############     Attachment also inlined for ease of reply    ###############
> 
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-multi-peel-gaps.c b/gcc/testsuite/gcc.dg/vect/vect-multi-peel-gaps.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..1aab4c5a14d1e8346d89587bd9544a1516535a45
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-multi-peel-gaps.c
> @@ -0,0 +1,61 @@
> +/* For some targets we end up vectorizing the below loop such that the `sp`
> +   single integer is loaded into a 4 integer vector.
> +   While the writes are all safe, without 2 scalar loops being peeled into the
> +   epilogue we would read past the end of the 31 integer array.  This happens
> +   because we load a 4 integer chunk to only use the first integer and
> +   increment by 2 integers at a time, hence the last load needs s[30-33] and
> +   the penultimate load needs s[28-31].
> +   This testcase ensures that we do not crash due to that behaviour.  */
> +/* { dg-require-effective-target mmap } */
> +#include <sys/mman.h>
> +#include <stdio.h>
> +
> +#define MMAP_SIZE 0x20000
> +#define ADDRESS 0x1122000000
> +
> +#define MB_BLOCK_SIZE 16
> +#define VERT_PRED_16 0
> +#define HOR_PRED_16 1
> +#define DC_PRED_16 2
> +int *sptr;
> +extern void intrapred_luma_16x16();
> +unsigned short mprr_2[5][16][16];
> +void initialise_s(int *s) { }
> +int main() {
> +    void *s_mapping;
> +    void *end_s;
> +    s_mapping = mmap ((void *)ADDRESS, MMAP_SIZE, PROT_READ | PROT_WRITE,
> +		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> +    if (s_mapping == MAP_FAILED)
> +      {
> +	perror ("mmap");
> +	return 1;
> +      }
> +    end_s = (s_mapping + MMAP_SIZE);
> +    sptr = (int*)(end_s - sizeof(int[31]));
> +    intrapred_luma_16x16(sptr);
> +    return 0;
> +}
> +
> +void intrapred_luma_16x16(int * restrict sp) {
> +    for (int j=0; j < MB_BLOCK_SIZE; j++)
> +      {
> +	mprr_2[VERT_PRED_16][j][0]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][1]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][2]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][3]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][4]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][5]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][6]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][7]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][8]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][9]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][10]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][11]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][12]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][13]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][14]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][15]=sp[j*2];
> +      }
> +}
> +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" {target vect_int } } } */
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index c08d0ef951fc63adcfffc601917134ddf51ece45..1c8c6784cc7b5f2d327339ff55a5a5ea08835aab 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -2217,7 +2217,9 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
>  	     but the access in the loop doesn't cover the full vector
>  	     we can end up with no gap recorded but still excess
>  	     elements accessed, see PR103116.  Make sure we peel for
> -	     gaps if necessary and sufficient and give up if not.  */
> +	     gaps if necessary and sufficient and give up if not.
> +	     If there is a combination of the access not covering the full vector and
> +	     a gap recorded then we may need to peel twice.  */
>  	  if (loop_vinfo
>  	      && *memory_access_type == VMAT_CONTIGUOUS
>  	      && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
> @@ -2233,7 +2235,7 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
>  		     access excess elements.
>  		     ???  Enhancements include peeling multiple iterations
>  		     or using masked loads with a static mask.  */
> -		  || (group_size * cvf) % cnunits + group_size < cnunits)
> +		  || (group_size * cvf) % cnunits + group_size - gap < cnunits)
>  		{
>  		  if (dump_enabled_p ())
>  		    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> 
> 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)