public inbox for gcc-patches@gcc.gnu.org
* vectorizer: Avoid an OOB access from vectorization
@ 2023-07-14 15:11 Matthew Malcomson
  2023-07-18 14:47 ` Matthew Malcomson
  2023-07-25 13:19 ` Richard Sandiford
  0 siblings, 2 replies; 4+ messages in thread
From: Matthew Malcomson @ 2023-07-14 15:11 UTC (permalink / raw)
  To: gcc-patches; +Cc: ook, rguenther, richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 5006 bytes --]

Our checks for whether the vectorization of a given loop would make an
out-of-bounds access miss the case where the vector we load is so large
that it spans multiple iterations' worth of data (while only being there
to implement a single iteration).

This patch adds a check for such an access.

Example where this was going wrong (smaller version of testcase added):

```
  extern unsigned short multi_array[5][16][16];
  extern void initialise_s(int *);
  extern int get_sval();

  void foo() {
    int s0 = get_sval();
    int s[31];
    int i,j;
    initialise_s(&s[0]);
    s0 = get_sval();
    for (j=0; j < 16; j++)
      for (i=0; i < 16; i++)
	multi_array[1][j][i]=s[j*2];
  }
```

With the above loop we would load the `s[j*2]` integer into a 4-element
vector, which reads 3 more elements than the scalar loop would.
`get_group_load_store_type` identifies that the loop requires a scalar
epilogue due to gaps.  However, we do not identify that the above code
requires *two* scalar iterations to be peeled, since each vector
iteration loads data belonging to the *next* iteration (while not
using it).
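
To make that concrete, here is a sketch of the last two vector loads on
the 31-element `s` array.  The `group_size`/`gap`/VF numbers below are my
reading of what the vectorizer records for this loop, so treat them as
illustrative rather than taken from a dump:

```
  /* s[] has valid indices 0..30; each iteration uses only s[j*2] but
     advances by 2 ints and loads a full 4-int vector.  */
  j = 14: load covers s[28..31]   -> s[31] is out of bounds
  j = 15: load covers s[30..33]   -> s[31..33] are out of bounds

  /* Assuming group_size = 2, gap = 1, cnunits = 4 and cvf = 1:
     before: (2*1) % 4 + 2     == 4, not < 4  => one peeled iteration
             deemed sufficient, so we vectorize and read past the end.
     after:  (2*1) % 4 + 2 - 1 == 3,     < 4  => give up on this access
             type instead.  */
```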

Bootstrapped and regtested on aarch64-none-linux-gnu.
N.b. out of interest, we came across this while working with Morello.


###############     Attachment also inlined for ease of reply    ###############


diff --git a/gcc/testsuite/gcc.dg/vect/vect-multi-peel-gaps.c b/gcc/testsuite/gcc.dg/vect/vect-multi-peel-gaps.c
new file mode 100644
index 0000000000000000000000000000000000000000..1b721fd26cab8d5583b153dd6b28c914db870ec3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-multi-peel-gaps.c
@@ -0,0 +1,60 @@
+/* For some targets we end up vectorizing the below loop such that the `sp`
+   single integer is loaded into a 4 integer vector.
+   While the writes are all safe, without 2 scalar loops being peeled into the
+   epilogue we would read past the end of the 31 integer array.  This happens
+   because we load a 4 integer chunk to only use the first integer and
+   increment by 2 integers at a time, hence the last load needs s[30-33] and
+   the penultimate load needs s[28-31].
+   This testcase ensures that we do not crash due to that behaviour.  */
+/* { dg-require-effective-target mmap } */
+#include <sys/mman.h>
+#include <stdio.h>
+
+#define MMAP_SIZE 0x20000
+#define ADDRESS 0x1122000000
+
+#define MB_BLOCK_SIZE 16
+#define VERT_PRED_16 0
+#define HOR_PRED_16 1
+#define DC_PRED_16 2
+int *sptr;
+extern void intrapred_luma_16x16();
+unsigned short mprr_2[5][16][16];
+void initialise_s(int *s) { }
+int main() {
+    void *s_mapping;
+    void *end_s;
+    s_mapping = mmap ((void *)ADDRESS, MMAP_SIZE, PROT_READ | PROT_WRITE,
+		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+    if (s_mapping == MAP_FAILED)
+      {
+	perror ("mmap");
+	return 1;
+      }
+    end_s = (s_mapping + MMAP_SIZE);
+    sptr = (int*)(end_s - sizeof(int[31]));
+    intrapred_luma_16x16(sptr);
+    return 0;
+}
+
+void intrapred_luma_16x16(int * restrict sp) {
+    for (int j=0; j < MB_BLOCK_SIZE; j++)
+      {
+	mprr_2[VERT_PRED_16][j][0]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][1]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][2]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][3]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][4]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][5]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][6]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][7]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][8]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][9]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][10]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][11]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][12]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][13]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][14]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][15]=sp[j*2];
+      }
+}
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index c08d0ef951fc63adcfffc601917134ddf51ece45..1c8c6784cc7b5f2d327339ff55a5a5ea08835aab 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -2217,7 +2217,9 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
 	     but the access in the loop doesn't cover the full vector
 	     we can end up with no gap recorded but still excess
 	     elements accessed, see PR103116.  Make sure we peel for
-	     gaps if necessary and sufficient and give up if not.  */
+	     gaps if necessary and sufficient and give up if not.
+	     If there is a combination of the access not covering the full vector and
+	     a gap recorded then we may need to peel twice.  */
 	  if (loop_vinfo
 	      && *memory_access_type == VMAT_CONTIGUOUS
 	      && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
@@ -2233,7 +2235,7 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
 		     access excess elements.
 		     ???  Enhancements include peeling multiple iterations
 		     or using masked loads with a static mask.  */
-		  || (group_size * cvf) % cnunits + group_size < cnunits)
+		  || (group_size * cvf) % cnunits + group_size - gap < cnunits)
 		{
 		  if (dump_enabled_p ())
 		    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,





* Re: vectorizer: Avoid an OOB access from vectorization
  2023-07-14 15:11 vectorizer: Avoid an OOB access from vectorization Matthew Malcomson
@ 2023-07-18 14:47 ` Matthew Malcomson
  2023-07-19 11:52   ` Richard Biener
  2023-07-25 13:19 ` Richard Sandiford
  1 sibling, 1 reply; 4+ messages in thread
From: Matthew Malcomson @ 2023-07-18 14:47 UTC (permalink / raw)
  To: gcc-patches; +Cc: ook, rguenther, tamar.christina

[-- Attachment #1: Type: text/plain, Size: 5481 bytes --]

Tamar pointed out it would be good to have a `scan-tree-dump` in the
testcase, just to make sure that whatever currently vectorizes keeps
vectorizing (and hence that the new code is still likely running).

Attached patch has that change, also inlined for ease of reply.
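
Concretely, the only delta relative to the first version is the `dg-final`
directive at the end of the new test:

```
/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" {target vect_int } } } */
```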

------------------------------
> Our checks for whether the vectorization of a given loop would make an
> out-of-bounds access miss the case where the vector we load is so large
> that it spans multiple iterations' worth of data (while only being there
> to implement a single iteration).
> 
> This patch adds a check for such an access.
> 
> Example where this was going wrong (smaller version of testcase added):
> 
> ```
>   extern unsigned short multi_array[5][16][16];
>   extern void initialise_s(int *);
>   extern int get_sval();
> 
>   void foo() {
>     int s0 = get_sval();
>     int s[31];
>     int i,j;
>     initialise_s(&s[0]);
>     s0 = get_sval();
>     for (j=0; j < 16; j++)
>       for (i=0; i < 16; i++)
> 	multi_array[1][j][i]=s[j*2];
>   }
> ```
> 
> With the above loop we would load the `s[j*2]` integer into a 4-element
> vector, which reads 3 more elements than the scalar loop would.
> `get_group_load_store_type` identifies that the loop requires a scalar
> epilogue due to gaps.  However, we do not identify that the above code
> requires *two* scalar iterations to be peeled, since each vector
> iteration loads data belonging to the *next* iteration (while not
> using it).
> 
> Bootstrapped and regtested on aarch64-none-linux-gnu.
> N.b. out of interest, we came across this while working with Morello.
> 
> 


###############     Attachment also inlined for ease of reply    ###############


diff --git a/gcc/testsuite/gcc.dg/vect/vect-multi-peel-gaps.c b/gcc/testsuite/gcc.dg/vect/vect-multi-peel-gaps.c
new file mode 100644
index 0000000000000000000000000000000000000000..1aab4c5a14d1e8346d89587bd9544a1516535a45
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-multi-peel-gaps.c
@@ -0,0 +1,61 @@
+/* For some targets we end up vectorizing the below loop such that the `sp`
+   single integer is loaded into a 4 integer vector.
+   While the writes are all safe, without 2 scalar loops being peeled into the
+   epilogue we would read past the end of the 31 integer array.  This happens
+   because we load a 4 integer chunk to only use the first integer and
+   increment by 2 integers at a time, hence the last load needs s[30-33] and
+   the penultimate load needs s[28-31].
+   This testcase ensures that we do not crash due to that behaviour.  */
+/* { dg-require-effective-target mmap } */
+#include <sys/mman.h>
+#include <stdio.h>
+
+#define MMAP_SIZE 0x20000
+#define ADDRESS 0x1122000000
+
+#define MB_BLOCK_SIZE 16
+#define VERT_PRED_16 0
+#define HOR_PRED_16 1
+#define DC_PRED_16 2
+int *sptr;
+extern void intrapred_luma_16x16();
+unsigned short mprr_2[5][16][16];
+void initialise_s(int *s) { }
+int main() {
+    void *s_mapping;
+    void *end_s;
+    s_mapping = mmap ((void *)ADDRESS, MMAP_SIZE, PROT_READ | PROT_WRITE,
+		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+    if (s_mapping == MAP_FAILED)
+      {
+	perror ("mmap");
+	return 1;
+      }
+    end_s = (s_mapping + MMAP_SIZE);
+    sptr = (int*)(end_s - sizeof(int[31]));
+    intrapred_luma_16x16(sptr);
+    return 0;
+}
+
+void intrapred_luma_16x16(int * restrict sp) {
+    for (int j=0; j < MB_BLOCK_SIZE; j++)
+      {
+	mprr_2[VERT_PRED_16][j][0]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][1]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][2]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][3]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][4]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][5]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][6]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][7]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][8]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][9]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][10]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][11]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][12]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][13]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][14]=sp[j*2];
+	mprr_2[VERT_PRED_16][j][15]=sp[j*2];
+      }
+}
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" {target vect_int } } } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index c08d0ef951fc63adcfffc601917134ddf51ece45..1c8c6784cc7b5f2d327339ff55a5a5ea08835aab 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -2217,7 +2217,9 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
 	     but the access in the loop doesn't cover the full vector
 	     we can end up with no gap recorded but still excess
 	     elements accessed, see PR103116.  Make sure we peel for
-	     gaps if necessary and sufficient and give up if not.  */
+	     gaps if necessary and sufficient and give up if not.
+	     If there is a combination of the access not covering the full vector and
+	     a gap recorded then we may need to peel twice.  */
 	  if (loop_vinfo
 	      && *memory_access_type == VMAT_CONTIGUOUS
 	      && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
@@ -2233,7 +2235,7 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
 		     access excess elements.
 		     ???  Enhancements include peeling multiple iterations
 		     or using masked loads with a static mask.  */
-		  || (group_size * cvf) % cnunits + group_size < cnunits)
+		  || (group_size * cvf) % cnunits + group_size - gap < cnunits)
 		{
 		  if (dump_enabled_p ())
 		    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,





* Re: vectorizer: Avoid an OOB access from vectorization
  2023-07-18 14:47 ` Matthew Malcomson
@ 2023-07-19 11:52   ` Richard Biener
  0 siblings, 0 replies; 4+ messages in thread
From: Richard Biener @ 2023-07-19 11:52 UTC (permalink / raw)
  To: Matthew Malcomson; +Cc: gcc-patches, ook, tamar.christina

On Tue, 18 Jul 2023, Matthew Malcomson wrote:

> Tamar pointed out it would be good to have a `scan-tree-dump` in the
> testcase, just to make sure that whatever currently vectorizes keeps
> vectorizing (and hence that the new code is still likely running).
> 
> Attached patch has that change, also inlined for ease of reply.

The patch is OK with a suitable changelog.

Thanks,
Richard.
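
(For readers unfamiliar with the convention: "a suitable changelog" means
a GCC ChangeLog entry.  A plausible shape for this patch, my sketch rather
than the committed text, would be:

```
gcc/ChangeLog:

	* tree-vect-stmts.cc (get_group_load_store_type): Account for
	the group gap when checking whether peeling a single scalar
	iteration is sufficient to avoid reading past the end of the
	accessed data.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/vect-multi-peel-gaps.c: New test.
```
)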

> ------------------------------
> > Our checks for whether the vectorization of a given loop would make an
> > out-of-bounds access miss the case where the vector we load is so large
> > that it spans multiple iterations' worth of data (while only being there
> > to implement a single iteration).
> > 
> > This patch adds a check for such an access.
> > 
> > Example where this was going wrong (smaller version of testcase added):
> > 
> > ```
> >   extern unsigned short multi_array[5][16][16];
> >   extern void initialise_s(int *);
> >   extern int get_sval();
> > 
> >   void foo() {
> >     int s0 = get_sval();
> >     int s[31];
> >     int i,j;
> >     initialise_s(&s[0]);
> >     s0 = get_sval();
> >     for (j=0; j < 16; j++)
> >       for (i=0; i < 16; i++)
> > 	multi_array[1][j][i]=s[j*2];
> >   }
> > ```
> > 
> > With the above loop we would load the `s[j*2]` integer into a 4-element
> > vector, which reads 3 more elements than the scalar loop would.
> > `get_group_load_store_type` identifies that the loop requires a scalar
> > epilogue due to gaps.  However, we do not identify that the above code
> > requires *two* scalar iterations to be peeled, since each vector
> > iteration loads data belonging to the *next* iteration (while not
> > using it).
> > 
> > Bootstrapped and regtested on aarch64-none-linux-gnu.
> > N.b. out of interest, we came across this while working with Morello.
> > 
> > 
> 
> 
> ###############     Attachment also inlined for ease of reply    ###############
> 
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-multi-peel-gaps.c b/gcc/testsuite/gcc.dg/vect/vect-multi-peel-gaps.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..1aab4c5a14d1e8346d89587bd9544a1516535a45
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-multi-peel-gaps.c
> @@ -0,0 +1,61 @@
> +/* For some targets we end up vectorizing the below loop such that the `sp`
> +   single integer is loaded into a 4 integer vector.
> +   While the writes are all safe, without 2 scalar loops being peeled into the
> +   epilogue we would read past the end of the 31 integer array.  This happens
> +   because we load a 4 integer chunk to only use the first integer and
> +   increment by 2 integers at a time, hence the last load needs s[30-33] and
> +   the penultimate load needs s[28-31].
> +   This testcase ensures that we do not crash due to that behaviour.  */
> +/* { dg-require-effective-target mmap } */
> +#include <sys/mman.h>
> +#include <stdio.h>
> +
> +#define MMAP_SIZE 0x20000
> +#define ADDRESS 0x1122000000
> +
> +#define MB_BLOCK_SIZE 16
> +#define VERT_PRED_16 0
> +#define HOR_PRED_16 1
> +#define DC_PRED_16 2
> +int *sptr;
> +extern void intrapred_luma_16x16();
> +unsigned short mprr_2[5][16][16];
> +void initialise_s(int *s) { }
> +int main() {
> +    void *s_mapping;
> +    void *end_s;
> +    s_mapping = mmap ((void *)ADDRESS, MMAP_SIZE, PROT_READ | PROT_WRITE,
> +		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> +    if (s_mapping == MAP_FAILED)
> +      {
> +	perror ("mmap");
> +	return 1;
> +      }
> +    end_s = (s_mapping + MMAP_SIZE);
> +    sptr = (int*)(end_s - sizeof(int[31]));
> +    intrapred_luma_16x16(sptr);
> +    return 0;
> +}
> +
> +void intrapred_luma_16x16(int * restrict sp) {
> +    for (int j=0; j < MB_BLOCK_SIZE; j++)
> +      {
> +	mprr_2[VERT_PRED_16][j][0]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][1]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][2]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][3]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][4]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][5]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][6]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][7]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][8]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][9]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][10]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][11]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][12]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][13]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][14]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][15]=sp[j*2];
> +      }
> +}
> +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" {target vect_int } } } */
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index c08d0ef951fc63adcfffc601917134ddf51ece45..1c8c6784cc7b5f2d327339ff55a5a5ea08835aab 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -2217,7 +2217,9 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
>  	     but the access in the loop doesn't cover the full vector
>  	     we can end up with no gap recorded but still excess
>  	     elements accessed, see PR103116.  Make sure we peel for
> -	     gaps if necessary and sufficient and give up if not.  */
> +	     gaps if necessary and sufficient and give up if not.
> +	     If there is a combination of the access not covering the full vector and
> +	     a gap recorded then we may need to peel twice.  */
>  	  if (loop_vinfo
>  	      && *memory_access_type == VMAT_CONTIGUOUS
>  	      && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
> @@ -2233,7 +2235,7 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
>  		     access excess elements.
>  		     ???  Enhancements include peeling multiple iterations
>  		     or using masked loads with a static mask.  */
> -		  || (group_size * cvf) % cnunits + group_size < cnunits)
> +		  || (group_size * cvf) % cnunits + group_size - gap < cnunits)
>  		{
>  		  if (dump_enabled_p ())
>  		    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> 
> 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)


* Re: vectorizer: Avoid an OOB access from vectorization
  2023-07-14 15:11 vectorizer: Avoid an OOB access from vectorization Matthew Malcomson
  2023-07-18 14:47 ` Matthew Malcomson
@ 2023-07-25 13:19 ` Richard Sandiford
  1 sibling, 0 replies; 4+ messages in thread
From: Richard Sandiford @ 2023-07-25 13:19 UTC (permalink / raw)
  To: Matthew Malcomson; +Cc: gcc-patches, ook, rguenther

Was leaving a bit of time in case Richi had any comments, but:

Matthew Malcomson <matthew.malcomson@arm.com> writes:
> Our checks for whether the vectorization of a given loop would make an
> out-of-bounds access miss the case where the vector we load is so large
> that it spans multiple iterations' worth of data (while only being there
> to implement a single iteration).
>
> This patch adds a check for such an access.
>
> Example where this was going wrong (smaller version of testcase added):
>
> ```
>   extern unsigned short multi_array[5][16][16];
>   extern void initialise_s(int *);
>   extern int get_sval();
>
>   void foo() {
>     int s0 = get_sval();
>     int s[31];
>     int i,j;
>     initialise_s(&s[0]);
>     s0 = get_sval();
>     for (j=0; j < 16; j++)
>       for (i=0; i < 16; i++)
> 	multi_array[1][j][i]=s[j*2];
>   }
> ```
>
> With the above loop we would load the `s[j*2]` integer into a 4-element
> vector, which reads 3 more elements than the scalar loop would.
> `get_group_load_store_type` identifies that the loop requires a scalar
> epilogue due to gaps.  However, we do not identify that the above code
> requires *two* scalar iterations to be peeled, since each vector
> iteration loads data belonging to the *next* iteration (while not
> using it).
>
> Bootstrapped and regtested on aarch64-none-linux-gnu.
> N.b. out of interest, we came across this while working with Morello.
>
>
> ###############     Attachment also inlined for ease of reply    ###############
>
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-multi-peel-gaps.c b/gcc/testsuite/gcc.dg/vect/vect-multi-peel-gaps.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..1b721fd26cab8d5583b153dd6b28c914db870ec3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-multi-peel-gaps.c
> @@ -0,0 +1,60 @@
> +/* For some targets we end up vectorizing the below loop such that the `sp`
> +   single integer is loaded into a 4 integer vector.
> +   While the writes are all safe, without 2 scalar loops being peeled into the
> +   epilogue we would read past the end of the 31 integer array.  This happens
> +   because we load a 4 integer chunk to only use the first integer and
> +   increment by 2 integers at a time, hence the last load needs s[30-33] and
> +   the penultimate load needs s[28-31].
> +   This testcase ensures that we do not crash due to that behaviour.  */
> +/* { dg-require-effective-target mmap } */
> +#include <sys/mman.h>
> +#include <stdio.h>

I think this should include "tree-vect.h" and should call check_vect in main.
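
(For reference, the usual shape of that convention in gcc.dg/vect runtime
tests; a sketch, with exact placement left to Matthew:

```
#include "tree-vect.h"

int
main (void)
{
  check_vect ();   /* Exits cleanly if the hardware cannot execute the
		      vector instructions the test was compiled with.  */
  /* ... rest of the test's main as posted ... */
  return 0;
}
```
)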

> +
> +#define MMAP_SIZE 0x20000
> +#define ADDRESS 0x1122000000
> +
> +#define MB_BLOCK_SIZE 16
> +#define VERT_PRED_16 0
> +#define HOR_PRED_16 1
> +#define DC_PRED_16 2
> +int *sptr;
> +extern void intrapred_luma_16x16();
> +unsigned short mprr_2[5][16][16];
> +void initialise_s(int *s) { }
> +int main() {
> +    void *s_mapping;
> +    void *end_s;
> +    s_mapping = mmap ((void *)ADDRESS, MMAP_SIZE, PROT_READ | PROT_WRITE,
> +		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> +    if (s_mapping == MAP_FAILED)
> +      {
> +	perror ("mmap");
> +	return 1;
> +      }
> +    end_s = (s_mapping + MMAP_SIZE);
> +    sptr = (int*)(end_s - sizeof(int[31]));
> +    intrapred_luma_16x16(sptr);
> +    return 0;
> +}
> +
> +void intrapred_luma_16x16(int * restrict sp) {
> +    for (int j=0; j < MB_BLOCK_SIZE; j++)
> +      {
> +	mprr_2[VERT_PRED_16][j][0]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][1]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][2]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][3]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][4]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][5]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][6]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][7]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][8]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][9]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][10]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][11]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][12]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][13]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][14]=sp[j*2];
> +	mprr_2[VERT_PRED_16][j][15]=sp[j*2];
> +      }
> +}
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index c08d0ef951fc63adcfffc601917134ddf51ece45..1c8c6784cc7b5f2d327339ff55a5a5ea08835aab 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -2217,7 +2217,9 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
>  	     but the access in the loop doesn't cover the full vector
>  	     we can end up with no gap recorded but still excess
>  	     elements accessed, see PR103116.  Make sure we peel for
> -	     gaps if necessary and sufficient and give up if not.  */
> +	     gaps if necessary and sufficient and give up if not.
> +	     If there is a combination of the access not covering the full vector and
> +	     a gap recorded then we may need to peel twice.  */

Nit: long line.  Might be worth adding a paragraph break.
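
One possible reflow, wrapped to the usual 80 columns, would be:

```
	     gaps if necessary and sufficient and give up if not.

	     If there is a combination of the access not covering the
	     full vector and a gap recorded then we may need to peel
	     twice.  */
```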

OK with those changes, thanks.

Richard

>  	  if (loop_vinfo
>  	      && *memory_access_type == VMAT_CONTIGUOUS
>  	      && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
> @@ -2233,7 +2235,7 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
>  		     access excess elements.
>  		     ???  Enhancements include peeling multiple iterations
>  		     or using masked loads with a static mask.  */
> -		  || (group_size * cvf) % cnunits + group_size < cnunits)
> +		  || (group_size * cvf) % cnunits + group_size - gap < cnunits)
>  		{
>  		  if (dump_enabled_p ())
>  		    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
