From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by sourceware.org (Postfix) with ESMTP id C355E3858D3C for ; Tue, 25 Jul 2023 13:19:30 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org C355E3858D3C Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 0285315BF; Tue, 25 Jul 2023 06:20:13 -0700 (PDT) Received: from localhost (e121540-lin.manchester.arm.com [10.32.110.72]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id EEE093F5A1; Tue, 25 Jul 2023 06:19:28 -0700 (PDT) From: Richard Sandiford To: Matthew Malcomson Mail-Followup-To: Matthew Malcomson ,, , , richard.sandiford@arm.com Cc: , , Subject: Re: vectorizer: Avoid an OOB access from vectorization References: <7f2d155c-20e4-4bae-89d8-849882526a07@AZ-NEU-EX03.Arm.com> Date: Tue, 25 Jul 2023 14:19:27 +0100 In-Reply-To: <7f2d155c-20e4-4bae-89d8-849882526a07@AZ-NEU-EX03.Arm.com> (Matthew Malcomson's message of "Fri, 14 Jul 2023 16:11:25 +0100") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-Spam-Status: No, score=-26.4 required=5.0 tests=BAYES_00,GIT_PATCH_0,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,SPF_HELO_NONE,SPF_NONE,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Was leaving a bit of time in case Richi had any comments, but: Matthew Malcomson writes: > Our checks for whether the vectorization of a given loop would make an > out of bounds access miss the case when the vector we load is so large > as to span multiple iterations worth of data (while only being there to > implement a single iteration). > > This patch adds a check for such an access. > > Example where this was going wrong (smaller version of testcase added): > > ``` > extern unsigned short multi_array[5][16][16]; > extern void initialise_s(int *); > extern int get_sval(); > > void foo() { > int s0 = get_sval(); > int s[31]; > int i,j; > initialise_s(&s[0]); > s0 = get_sval(); > for (j=0; j < 16; j++) > for (i=0; i < 16; i++) > multi_array[1][j][i]=s[j*2]; > } > ``` > > With the above loop we would load the `s[j*2]` integer into a 4 element > vector, which reads 3 extra elements than the scalar loop would. > `get_group_load_store_type` identifies that the loop requires a scalar > epilogue due to gaps. However we do not identify that the above code > requires *two* scalar loops to be peeled due to the fact that each > iteration loads an amount of data from the *next* iteration (while not > using it). > > Bootstrapped and regtested on aarch64-none-linux-gnu. > N.b. out of interest we came across this working with Morello. > > > ############### Attachment also inlined for ease of reply ############### > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-multi-peel-gaps.c b/gcc/testsuite/gcc.dg/vect/vect-multi-peel-gaps.c > new file mode 100644 > index 0000000000000000000000000000000000000000..1b721fd26cab8d5583b153dd6b28c914db870ec3 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/vect-multi-peel-gaps.c > @@ -0,0 +1,60 @@ > +/* For some targets we end up vectorizing the below loop such that the `sp` > + single integer is loaded into a 4 integer vector. > + While the writes are all safe, without 2 scalar loops being peeled into the > + epilogue we would read past the end of the 31 integer array. This happens > + because we load a 4 integer chunk to only use the first integer and > + increment by 2 integers at a time, hence the last load needs s[30-33] and > + the penultimate load needs s[28-31]. > + This testcase ensures that we do not crash due to that behaviour. */ > +/* { dg-require-effective-target mmap } */ > +#include > +#include I think this should include "tree-vect.h" and should call check_vect in main. > + > +#define MMAP_SIZE 0x20000 > +#define ADDRESS 0x1122000000 > + > +#define MB_BLOCK_SIZE 16 > +#define VERT_PRED_16 0 > +#define HOR_PRED_16 1 > +#define DC_PRED_16 2 > +int *sptr; > +extern void intrapred_luma_16x16(); > +unsigned short mprr_2[5][16][16]; > +void initialise_s(int *s) { } > +int main() { > + void *s_mapping; > + void *end_s; > + s_mapping = mmap ((void *)ADDRESS, MMAP_SIZE, PROT_READ | PROT_WRITE, > + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); > + if (s_mapping == MAP_FAILED) > + { > + perror ("mmap"); > + return 1; > + } > + end_s = (s_mapping + MMAP_SIZE); > + sptr = (int*)(end_s - sizeof(int[31])); > + intrapred_luma_16x16(sptr); > + return 0; > +} > + > +void intrapred_luma_16x16(int * restrict sp) { > + for (int j=0; j < MB_BLOCK_SIZE; j++) > + { > + mprr_2[VERT_PRED_16][j][0]=sp[j*2]; > + mprr_2[VERT_PRED_16][j][1]=sp[j*2]; > + mprr_2[VERT_PRED_16][j][2]=sp[j*2]; > + mprr_2[VERT_PRED_16][j][3]=sp[j*2]; > + mprr_2[VERT_PRED_16][j][4]=sp[j*2]; > + mprr_2[VERT_PRED_16][j][5]=sp[j*2]; > + mprr_2[VERT_PRED_16][j][6]=sp[j*2]; > + mprr_2[VERT_PRED_16][j][7]=sp[j*2]; > + mprr_2[VERT_PRED_16][j][8]=sp[j*2]; > + mprr_2[VERT_PRED_16][j][9]=sp[j*2]; > + mprr_2[VERT_PRED_16][j][10]=sp[j*2]; > + mprr_2[VERT_PRED_16][j][11]=sp[j*2]; > + mprr_2[VERT_PRED_16][j][12]=sp[j*2]; > + mprr_2[VERT_PRED_16][j][13]=sp[j*2]; > + mprr_2[VERT_PRED_16][j][14]=sp[j*2]; > + mprr_2[VERT_PRED_16][j][15]=sp[j*2]; > + } > +} > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > index c08d0ef951fc63adcfffc601917134ddf51ece45..1c8c6784cc7b5f2d327339ff55a5a5ea08835aab 100644 > --- a/gcc/tree-vect-stmts.cc > +++ b/gcc/tree-vect-stmts.cc > @@ -2217,7 +2217,9 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info, > but the access in the loop doesn't cover the full vector > we can end up with no gap recorded but still excess > elements accessed, see PR103116. Make sure we peel for > - gaps if necessary and sufficient and give up if not. */ > + gaps if necessary and sufficient and give up if not. > + If there is a combination of the access not covering the full vector and > + a gap recorded then we may need to peel twice. */ Nit: long line. Might be worth adding a paragraph break. OK with those changes, thanks. Richard > if (loop_vinfo > && *memory_access_type == VMAT_CONTIGUOUS > && SLP_TREE_LOAD_PERMUTATION (slp_node).exists () > @@ -2233,7 +2235,7 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info, > access excess elements. > ??? Enhancements include peeling multiple iterations > or using masked loads with a static mask. */ > - || (group_size * cvf) % cnunits + group_size < cnunits) > + || (group_size * cvf) % cnunits + group_size - gap < cnunits) > { > if (dump_enabled_p ()) > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,