Re: [PATCH 4/4] Testsuite updates

* Re: [PATCH 4/4] Testsuite updates
@ 2024-05-22 10:58 Richard Biener
  2024-05-22 11:37 ` Richard Sandiford
  2024-05-22 15:14 ` Jeff Law
  0 siblings, 2 replies; 4+ messages in thread
From: Richard Biener @ 2024-05-22 10:58 UTC
  To: gcc-patches; +Cc: tamar.christina, richard.sandiford

On Tue, 21 May 2024, Richard Biener wrote:

> The gcc.dg/vect/slp-12a.c case is interesting as we currently split
> the 8 store group into lanes 0-5 which we SLP with an unroll factor
> of two (on x86-64 with SSE) and the remaining two lanes are using
> interleaving vectorization with a final unroll factor of four.  Thus
> we're using hybrid SLP within a single store group.  After the change
> we discover the same 0-5 lane SLP part as well as two single-lane
> parts feeding the full store group.  But that results in a load
> permutation that isn't supported (I have WIP patches to rectify that).
> So we end up cancelling SLP and vectorizing the whole loop with
> interleaving which is IMO good and results in better code.
> 
> This is similar for gcc.target/i386/pr52252-atom.c where interleaving
> generates much better code than hybrid SLP.  I'm unsure how to update
> the testcase though.
> 
> gcc.dg/vect/slp-21.c runs into similar situations.  Note that when
> we discard an instance while analyzing SLP operations we currently
> force the full loop to have no SLP because hybrid detection is
> broken.  It's probably not worth fixing this at this moment.
> 
> For gcc.dg/vect/pr97428.c we are not splitting the 16 store group
> into two but merge the two 8 lane loads into one before doing the
> store and thus have only a single SLP instance.  A similar situation
> happens in gcc.dg/vect/slp-11c.c but the branches feeding the
> single SLP store only have a single lane.  Likewise for
> gcc.dg/vect/vect-complex-5.c and gcc.dg/vect/vect-gather-2.c.
> 
> gcc.dg/vect/slp-cond-1.c has an additional SLP vectorization
> with a SLP store group of size two but two single-lane branches.
> 
> gcc.target/i386/pr98928.c ICEs in SLP permute optimization
> because we don't expect a constant and internal branch to be
> merged with a permute node in
> vect_optimize_slp_pass::change_vec_perm_layout:4859 (the only
> permutes merging two SLP nodes are two-operator nodes right now).
> This still requires fixing.
> 
> The whole series has been bootstrapped and tested on 
> x86_64-unknown-linux-gnu with the gcc.target/i386/pr98928.c FAIL
> unfixed.
> 
> Comments welcome (and hello ARM CI), RISC-V and other arch
> testing appreciated.  Unless there are comments to the contrary
> I plan to push patch 1 and 2 tomorrow.

RISC-V CI didn't trigger (not sure what magic is required).  Both
ARM and AARCH64 show that the "vectorizing stmts using SLP" scans are
a bit fragile because we sometimes cancel SLP because we want to use
load/store-lanes.

I have locally scrapped the SLP scanning for gcc.dg/vect/slp-21.c where
it doesn't really matter (and if we are finished with all-SLP it will
matter nowhere).  I've conditionalized the outcome based on
vect_load_lanes for gcc.dg/vect/slp-11c.c and
gcc.dg/vect/slp-cond-1.c.
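
A minimal sketch of what such conditionalized scans could look like,
modelled on the existing vect-complex-5.c directives in the patch.  The
expected counts here are assumptions for illustration and are not taken
from the committed tests:

```c
/* Sketch only: the counts are assumed, not copied from the final tests.
   Targets without load/store-lanes are expected to keep the SLP instance.  */
/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { ! vect_load_lanes } } } } */
/* Targets with load/store-lanes are expected to cancel SLP.  */
/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target vect_load_lanes } } } */
```

Splitting the scan on vect_load_lanes keeps the test meaningful on both
kinds of target instead of dropping the SLP check altogether.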

On AARCH64 additionally gcc.target/aarch64/sve/mask_struct_store_4.c
ICEs; I have a fix for that.

gcc.target/aarch64/pr99873_2.c FAILs because with a single
SLP store group merged from two two-lane load groups we cancel
the SLP and want to use load/store-lanes.  Shall I leave this
FAILing or XFAIL it?

Thanks,
Richard.

> Thanks,
> Richard.
> 
> 	* gcc.dg/vect/pr97428.c: Expect a single store SLP group.
> 	* gcc.dg/vect/slp-11c.c: Likewise.
> 	* gcc.dg/vect/vect-complex-5.c: Likewise.
> 	* gcc.dg/vect/slp-12a.c: Do not expect SLP.
> 	* gcc.dg/vect/slp-21.c: Likewise.
> 	* gcc.dg/vect/slp-cond-1.c: Expect one more SLP.
> 	* gcc.dg/vect/vect-gather-2.c: Expect SLP to be used.
> 	* gcc.target/i386/pr52252-atom.c: XFAIL test for palignr.
> ---
>  gcc/testsuite/gcc.dg/vect/pr97428.c          |  2 +-
>  gcc/testsuite/gcc.dg/vect/slp-11c.c          |  5 +++--
>  gcc/testsuite/gcc.dg/vect/slp-12a.c          |  6 +++++-
>  gcc/testsuite/gcc.dg/vect/slp-21.c           | 19 +++++--------------
>  gcc/testsuite/gcc.dg/vect/slp-cond-1.c       |  2 +-
>  gcc/testsuite/gcc.dg/vect/vect-complex-5.c   |  2 +-
>  gcc/testsuite/gcc.dg/vect/vect-gather-2.c    |  1 -
>  gcc/testsuite/gcc.target/i386/pr52252-atom.c |  3 ++-
>  8 files changed, 18 insertions(+), 22 deletions(-)
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/pr97428.c b/gcc/testsuite/gcc.dg/vect/pr97428.c
> index 60dd984cfd3..3cc9976c00c 100644
> --- a/gcc/testsuite/gcc.dg/vect/pr97428.c
> +++ b/gcc/testsuite/gcc.dg/vect/pr97428.c
> @@ -44,5 +44,5 @@ void foo_i2(dcmlx4_t dst[], const dcmlx_t src[], int n)
>  /* { dg-final { scan-tree-dump "Detected interleaving store of size 16" "vect" } } */
>  /* We're not able to peel & apply re-aligning to make accesses well-aligned for !vect_hw_misalign,
>     but we could by peeling the stores for alignment and applying re-aligning loads.  */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { xfail { ! vect_hw_misalign } } } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { ! vect_hw_misalign } } } } */
>  /* { dg-final { scan-tree-dump-not "gap of 6 elements" "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-11c.c b/gcc/testsuite/gcc.dg/vect/slp-11c.c
> index 0f680cd4e60..169b0d10eec 100644
> --- a/gcc/testsuite/gcc.dg/vect/slp-11c.c
> +++ b/gcc/testsuite/gcc.dg/vect/slp-11c.c
> @@ -13,7 +13,8 @@ main1 ()
>    unsigned int in[N*8] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63};
>    float out[N*8];
>  
> -  /* Different operations - not SLPable.  */
> +  /* Different operations - we SLP the store and split the group to two
> +     single-lane branches.  */
>    for (i = 0; i < N*4; i++)
>      {
>        out[i*2] = ((float) in[i*2] * 2 + 6) ;
> @@ -44,4 +45,4 @@ int main (void)
>  
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { { vect_uintfloat_cvt && vect_strided2 } && vect_int_mult } } } } */
>  /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! { { vect_uintfloat_cvt && vect_strided2 } && vect_int_mult } } } } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0  "vect"  } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1  "vect"  } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-12a.c b/gcc/testsuite/gcc.dg/vect/slp-12a.c
> index 973de6ada21..2f98dc9da0b 100644
> --- a/gcc/testsuite/gcc.dg/vect/slp-12a.c
> +++ b/gcc/testsuite/gcc.dg/vect/slp-12a.c
> @@ -40,6 +40,10 @@ main1 ()
>        out[i*8 + 3] = b3 - 1;
>        out[i*8 + 4] = b4 - 8;
>        out[i*8 + 5] = b5 - 7;
> +      /* Due to the use in the ia[i] store we keep the feeding expression
> +         in the form ((in[i*8 + 6] + 11) * 3 - 3) while other expressions
> +	 got associated as for example (in[i*5 + 5] * 4 + 33).  That
> +	 causes SLP discovery to fail.  */
>        out[i*8 + 6] = b6 - 3;
>        out[i*8 + 7] = b7 - 7;
>  
> @@ -76,5 +80,5 @@ int main (void)
>  
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_strided8 && vect_int_mult } } } } */
>  /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! { vect_strided8 && vect_int_mult } } } } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { { vect_strided8 && {! vect_load_lanes } } && vect_int_mult } } } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { { vect_strided8 && {! vect_load_lanes } } && vect_int_mult } } } } */
>  /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { ! { vect_strided8 && vect_int_mult } } } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-21.c b/gcc/testsuite/gcc.dg/vect/slp-21.c
> index 58751688414..dc153a53b47 100644
> --- a/gcc/testsuite/gcc.dg/vect/slp-21.c
> +++ b/gcc/testsuite/gcc.dg/vect/slp-21.c
> @@ -12,6 +12,7 @@ main1 ()
>    unsigned short out[N*8], out2[N*8], b0, b1, b2, b3, b4, a0, a1, a2, a3, b5;
>    unsigned short in[N*8];
>  
> +#pragma GCC novector
>    for (i = 0; i < N*8; i++)
>      {
>        in[i] = i;
> @@ -202,18 +203,8 @@ int main (void)
>    return 0;
>  }
>  
> -/* { dg-final { scan-tree-dump-times "vectorized 4 loops" 1 "vect"  { target { vect_strided4 || vect_extract_even_odd } } } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  { target  { ! { vect_strided4 || vect_extract_even_odd } } } } } */
> -/* Some targets can vectorize the second of the three main loops using
> -   hybrid SLP.  For 128-bit vectors, the required 4->3 permutations are:
> -
> -   { 0, 1, 2, 4, 5, 6, 8, 9 }
> -   { 2, 4, 5, 6, 8, 9, 10, 12 }
> -   { 5, 6, 8, 9, 10, 12, 13, 14 }
> -
> -   Not all vect_perm targets support that, and it's a bit too specific to have
> -   its own effective-target selector, so we just test targets directly.  */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect" { target { powerpc64*-*-* s390*-*-* loongarch*-*-* } } } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target { vect_strided4 && { ! { powerpc64*-*-* s390*-*-* loongarch*-*-* } } } } } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect"  { target { ! { vect_strided4 } } } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect"  { target { vect_strided4 || vect_extract_even_odd } } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect"  { target  { ! { vect_strided4 || vect_extract_even_odd } } } } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 6 "vect" { xfail *-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" } } */
>    
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-cond-1.c b/gcc/testsuite/gcc.dg/vect/slp-cond-1.c
> index 450c7141c96..16ab0cc7605 100644
> --- a/gcc/testsuite/gcc.dg/vect/slp-cond-1.c
> +++ b/gcc/testsuite/gcc.dg/vect/slp-cond-1.c
> @@ -125,4 +125,4 @@ main ()
>    return 0;
>  }
>  
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-complex-5.c b/gcc/testsuite/gcc.dg/vect/vect-complex-5.c
> index addcf60438c..ac562dc475c 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-complex-5.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-complex-5.c
> @@ -41,4 +41,4 @@ main (void)
>  }
>  
>  /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target vect_load_lanes } } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target { ! vect_load_lanes } xfail { ! vect_hw_misalign } } } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { ! vect_load_lanes } xfail { ! vect_hw_misalign } } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-gather-2.c b/gcc/testsuite/gcc.dg/vect/vect-gather-2.c
> index 4c23b808333..10e64e64d47 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-gather-2.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-gather-2.c
> @@ -36,6 +36,5 @@ f3 (int *restrict y, int *restrict x, int *restrict indices)
>      }
>  }
>  
> -/* { dg-final { scan-tree-dump-not "Loop contains only SLP stmts" vect } } */
>  /* { dg-final { scan-tree-dump "different gather base" vect { target { ! vect_gather_load_ifn } } } } */
>  /* { dg-final { scan-tree-dump "different gather scale" vect { target { ! vect_gather_load_ifn } } } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr52252-atom.c b/gcc/testsuite/gcc.target/i386/pr52252-atom.c
> index 11f94411575..02736d56d31 100644
> --- a/gcc/testsuite/gcc.target/i386/pr52252-atom.c
> +++ b/gcc/testsuite/gcc.target/i386/pr52252-atom.c
> @@ -25,4 +25,5 @@ matrix_mul (byte *in, byte *out, int size)
>      }
>  }
>  
> -/* { dg-final { scan-assembler "palignr" } } */
> +/* We are no longer using hybrid SLP.  */
> +/* { dg-final { scan-assembler "palignr" { xfail *-*-* } } } */
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


* Re: [PATCH 4/4] Testsuite updates
  2024-05-22 10:58 [PATCH 4/4] Testsuite updates Richard Biener
@ 2024-05-22 11:37 ` Richard Sandiford
  2024-05-22 15:14 ` Jeff Law
  1 sibling, 0 replies; 4+ messages in thread
From: Richard Sandiford @ 2024-05-22 11:37 UTC
  To: Richard Biener; +Cc: gcc-patches, tamar.christina

Richard Biener <rguenther@suse.de> writes:
> On Tue, 21 May 2024, Richard Biener wrote:
>
>> The gcc.dg/vect/slp-12a.c case is interesting as we currently split
>> the 8 store group into lanes 0-5 which we SLP with an unroll factor
>> of two (on x86-64 with SSE) and the remaining two lanes are using
>> interleaving vectorization with a final unroll factor of four.  Thus
>> we're using hybrid SLP within a single store group.  After the change
>> we discover the same 0-5 lane SLP part as well as two single-lane
>> parts feeding the full store group.  But that results in a load
>> permutation that isn't supported (I have WIP patches to rectify that).
>> So we end up cancelling SLP and vectorizing the whole loop with
>> interleaving which is IMO good and results in better code.
>> 
>> This is similar for gcc.target/i386/pr52252-atom.c where interleaving
>> generates much better code than hybrid SLP.  I'm unsure how to update
>> the testcase though.
>> 
>> gcc.dg/vect/slp-21.c runs into similar situations.  Note that when
>> we discard an instance while analyzing SLP operations we currently
>> force the full loop to have no SLP because hybrid detection is
>> broken.  It's probably not worth fixing this at this moment.
>> 
>> For gcc.dg/vect/pr97428.c we are not splitting the 16 store group
>> into two but merge the two 8 lane loads into one before doing the
>> store and thus have only a single SLP instance.  A similar situation
>> happens in gcc.dg/vect/slp-11c.c but the branches feeding the
>> single SLP store only have a single lane.  Likewise for
>> gcc.dg/vect/vect-complex-5.c and gcc.dg/vect/vect-gather-2.c.
>> 
>> gcc.dg/vect/slp-cond-1.c has an additional SLP vectorization
>> with a SLP store group of size two but two single-lane branches.
>> 
>> gcc.target/i386/pr98928.c ICEs in SLP permute optimization
>> because we don't expect a constant and internal branch to be
>> merged with a permute node in
>> vect_optimize_slp_pass::change_vec_perm_layout:4859 (the only
>> permutes merging two SLP nodes are two-operator nodes right now).
>> This still requires fixing.
>> 
>> The whole series has been bootstrapped and tested on 
>> x86_64-unknown-linux-gnu with the gcc.target/i386/pr98928.c FAIL
>> unfixed.
>> 
>> Comments welcome (and hello ARM CI), RISC-V and other arch
>> testing appreciated.  Unless there are comments to the contrary
>> I plan to push patch 1 and 2 tomorrow.
>
> RISC-V CI didn't trigger (not sure what magic is required).  Both
> ARM and AARCH64 show that the "vectorizing stmts using SLP" scans are
> a bit fragile because we sometimes cancel SLP because we want to use
> load/store-lanes.
>
> I have locally scrapped the SLP scanning for gcc.dg/vect/slp-21.c where
> it doesn't really matter (and if we are finished with all-SLP it will
> matter nowhere).  I've conditionalized the outcome based on
> vect_load_lanes for gcc.dg/vect/slp-11c.c and
> gcc.dg/vect/slp-cond-1.c.
>
> On AARCH64 additionally gcc.target/aarch64/sve/mask_struct_store_4.c
> ICEs; I have a fix for that.
>
> gcc.target/aarch64/pr99873_2.c FAILs because with a single
> SLP store group merged from two two-lane load groups we cancel
> the SLP and want to use load/store-lanes.  Shall I leave this
> FAILing or XFAIL it?

Yeah, agree it's probably worth leaving it FAILing for now, since it
is something we should try to fix for GCC 15.

Thanks,
Richard


* Re: [PATCH 4/4] Testsuite updates
  2024-05-22 10:58 [PATCH 4/4] Testsuite updates Richard Biener
  2024-05-22 11:37 ` Richard Sandiford
@ 2024-05-22 15:14 ` Jeff Law
  1 sibling, 0 replies; 4+ messages in thread
From: Jeff Law @ 2024-05-22 15:14 UTC
  To: Richard Biener, gcc-patches; +Cc: tamar.christina, richard.sandiford



On 5/22/24 4:58 AM, Richard Biener wrote:

> 
> RISC-V CI didn't trigger (not sure what magic is required).  Both
> ARM and AARCH64 show that the "vectorizing stmts using SLP" scans are
> a bit fragile because we sometimes cancel SLP because we want to use
> load/store-lanes.

The RISC-V tag on the subject line is the trigger.

Jeff


* [PATCH 4/4] Testsuite updates
@ 2024-05-21 12:48 Richard Biener
  0 siblings, 0 replies; 4+ messages in thread
From: Richard Biener @ 2024-05-21 12:48 UTC
  To: gcc-patches; +Cc: tamar.christina, richard.sandiford

The gcc.dg/vect/slp-12a.c case is interesting as we currently split
the 8 store group into lanes 0-5 which we SLP with an unroll factor
of two (on x86-64 with SSE) and the remaining two lanes are using
interleaving vectorization with a final unroll factor of four.  Thus
we're using hybrid SLP within a single store group.  After the change
we discover the same 0-5 lane SLP part as well as two single-lane
parts feeding the full store group.  But that results in a load
permutation that isn't supported (I have WIP patches to rectify that).
So we end up cancelling SLP and vectorizing the whole loop with
interleaving which is IMO good and results in better code.
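
For readers without the testcase at hand, a reduced sketch of the kind
of 8-lane store group being discussed.  This is not the actual
slp-12a.c source: only the lane 3-7 subtrahends, the lane 5 and 6
expressions and the ia[i] use are taken from the patch below.  The
function names, the GROUPS macro and the constants for lanes 0-4 and 7
are invented for illustration, and the sketch makes no claim about where
GCC splits this exact loop:

```c
#define GROUPS 4

/* Hypothetical 8-lane store group.  b6 has a second use in ia[i], so its
   feeding expression stays in the (x + 11) * 3 - 3 form, while a lane such
   as lane 5 can be reassociated to x * 4 + 33.  */
void
store_group8 (unsigned int *out, unsigned int *ia, const unsigned int *in)
{
  for (int i = 0; i < GROUPS; i++)
    {
      unsigned int b0 = (in[i*8] + 5) * 3;
      unsigned int b1 = (in[i*8 + 1] + 6) * 2;
      unsigned int b2 = (in[i*8 + 2] + 7) * 12;
      unsigned int b3 = (in[i*8 + 3] + 8) * 5;
      unsigned int b4 = (in[i*8 + 4] + 9) * 8;
      unsigned int b5 = (in[i*8 + 5] + 10) * 4;
      unsigned int b6 = (in[i*8 + 6] + 11) * 3;
      unsigned int b7 = (in[i*8 + 7] + 12) * 2;

      out[i*8] = b0 - 2;
      out[i*8 + 1] = b1 - 3;
      out[i*8 + 2] = b2 - 2;
      out[i*8 + 3] = b3 - 1;
      out[i*8 + 4] = b4 - 8;
      out[i*8 + 5] = b5 - 7;
      out[i*8 + 6] = b6 - 3;
      out[i*8 + 7] = b7 - 7;

      ia[i] = b6;  /* Second use of b6.  */
    }
}

/* Scalar reference check against the reassociated forms; returns the
   number of mismatching elements.  */
int
check_store_group8 (void)
{
  unsigned int in[GROUPS * 8], out[GROUPS * 8], ia[GROUPS];
  int bad = 0;

  for (int k = 0; k < GROUPS * 8; k++)
    in[k] = k;
  store_group8 (out, ia, in);

  for (int i = 0; i < GROUPS; i++)
    {
      unsigned int x5 = in[i*8 + 5], x6 = in[i*8 + 6];
      bad += out[i*8 + 5] != x5 * 4 + 33;
      bad += out[i*8 + 6] != x6 * 3 + 30;
      bad += ia[i] != x6 * 3 + 33;
    }
  return bad;
}
```

Calling check_store_group8 () from main and aborting on a non-zero
result would follow the usual shape of the vect tests.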

This is similar for gcc.target/i386/pr52252-atom.c where interleaving
generates much better code than hybrid SLP.  I'm unsure how to update
the testcase though.

gcc.dg/vect/slp-21.c runs into similar situations.  Note that when
we discard an instance while analyzing SLP operations we currently
force the full loop to have no SLP because hybrid detection is
broken.  It's probably not worth fixing this at this moment.

For gcc.dg/vect/pr97428.c we are not splitting the 16 store group
into two but merge the two 8 lane loads into one before doing the
store and thus have only a single SLP instance.  A similar situation
happens in gcc.dg/vect/slp-11c.c but the branches feeding the
single SLP store only have a single lane.  Likewise for
gcc.dg/vect/vect-complex-5.c and gcc.dg/vect/vect-gather-2.c.
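
To make the slp-11c.c shape concrete, a reduced sketch of a store group
of size two whose lanes are fed by different operations, so that each
lane forms its own single-lane branch under the single SLP store.  The
first statement matches the line visible in the patch context below.
The second statement, the function names and the PAIRS macro are
assumptions for illustration, not the actual test source:

```c
#define PAIRS 32

/* Hypothetical two-lane store group.  Lane 0 converts to float before
   the arithmetic, lane 1 does integer arithmetic and converts last, so
   the two lanes are not isomorphic.  */
void
two_lane_store (float *out, const unsigned int *in)
{
  for (int i = 0; i < PAIRS; i++)
    {
      out[i*2] = ((float) in[i*2] * 2 + 6);
      out[i*2 + 1] = (float) (in[i*2 + 1] * 3 + 7);
    }
}

/* Scalar reference check; returns the number of mismatching elements.
   All values are small integers, so the float compares are exact.  */
int
check_two_lane_store (void)
{
  unsigned int in[PAIRS * 2];
  float out[PAIRS * 2];
  int bad = 0;

  for (int k = 0; k < PAIRS * 2; k++)
    in[k] = k;
  two_lane_store (out, in);

  for (int k = 0; k < PAIRS * 2; k += 2)
    {
      bad += out[k] != (float) (2 * k + 6);
      bad += out[k + 1] != (float) (3 * (k + 1) + 7);
    }
  return bad;
}
```

With one store group and two single-lane branches, the dump would be
expected to report one SLP instance where the old expectation was none.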

gcc.dg/vect/slp-cond-1.c has an additional SLP vectorization
with a SLP store group of size two but two single-lane branches.

gcc.target/i386/pr98928.c ICEs in SLP permute optimization
because we don't expect a constant and internal branch to be
merged with a permute node in
vect_optimize_slp_pass::change_vec_perm_layout:4859 (the only
permutes merging two SLP nodes are two-operator nodes right now).
This still requires fixing.

The whole series has been bootstrapped and tested on 
x86_64-unknown-linux-gnu with the gcc.target/i386/pr98928.c FAIL
unfixed.

Comments welcome (and hello ARM CI), RISC-V and other arch
testing appreciated.  Unless there are comments to the contrary
I plan to push patch 1 and 2 tomorrow.

Thanks,
Richard.

	* gcc.dg/vect/pr97428.c: Expect a single store SLP group.
	* gcc.dg/vect/slp-11c.c: Likewise.
	* gcc.dg/vect/vect-complex-5.c: Likewise.
	* gcc.dg/vect/slp-12a.c: Do not expect SLP.
	* gcc.dg/vect/slp-21.c: Likewise.
	* gcc.dg/vect/slp-cond-1.c: Expect one more SLP.
	* gcc.dg/vect/vect-gather-2.c: Expect SLP to be used.
	* gcc.target/i386/pr52252-atom.c: XFAIL test for palignr.
---
 gcc/testsuite/gcc.dg/vect/pr97428.c          |  2 +-
 gcc/testsuite/gcc.dg/vect/slp-11c.c          |  5 +++--
 gcc/testsuite/gcc.dg/vect/slp-12a.c          |  6 +++++-
 gcc/testsuite/gcc.dg/vect/slp-21.c           | 19 +++++--------------
 gcc/testsuite/gcc.dg/vect/slp-cond-1.c       |  2 +-
 gcc/testsuite/gcc.dg/vect/vect-complex-5.c   |  2 +-
 gcc/testsuite/gcc.dg/vect/vect-gather-2.c    |  1 -
 gcc/testsuite/gcc.target/i386/pr52252-atom.c |  3 ++-
 8 files changed, 18 insertions(+), 22 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr97428.c b/gcc/testsuite/gcc.dg/vect/pr97428.c
index 60dd984cfd3..3cc9976c00c 100644
--- a/gcc/testsuite/gcc.dg/vect/pr97428.c
+++ b/gcc/testsuite/gcc.dg/vect/pr97428.c
@@ -44,5 +44,5 @@ void foo_i2(dcmlx4_t dst[], const dcmlx_t src[], int n)
 /* { dg-final { scan-tree-dump "Detected interleaving store of size 16" "vect" } } */
 /* We're not able to peel & apply re-aligning to make accesses well-aligned for !vect_hw_misalign,
    but we could by peeling the stores for alignment and applying re-aligning loads.  */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { xfail { ! vect_hw_misalign } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { ! vect_hw_misalign } } } } */
 /* { dg-final { scan-tree-dump-not "gap of 6 elements" "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-11c.c b/gcc/testsuite/gcc.dg/vect/slp-11c.c
index 0f680cd4e60..169b0d10eec 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-11c.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-11c.c
@@ -13,7 +13,8 @@ main1 ()
   unsigned int in[N*8] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63};
   float out[N*8];
 
-  /* Different operations - not SLPable.  */
+  /* Different operations - we SLP the store and split the group to two
+     single-lane branches.  */
   for (i = 0; i < N*4; i++)
     {
       out[i*2] = ((float) in[i*2] * 2 + 6) ;
@@ -44,4 +45,4 @@ int main (void)
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { { vect_uintfloat_cvt && vect_strided2 } && vect_int_mult } } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! { { vect_uintfloat_cvt && vect_strided2 } && vect_int_mult } } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0  "vect"  } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1  "vect"  } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-12a.c b/gcc/testsuite/gcc.dg/vect/slp-12a.c
index 973de6ada21..2f98dc9da0b 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-12a.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-12a.c
@@ -40,6 +40,10 @@ main1 ()
       out[i*8 + 3] = b3 - 1;
       out[i*8 + 4] = b4 - 8;
       out[i*8 + 5] = b5 - 7;
+      /* Due to the use in the ia[i] store we keep the feeding expression
+         in the form ((in[i*8 + 6] + 11) * 3 - 3) while other expressions
+	 got associated as, for example, (in[i*8 + 5] * 4 + 33).  That
+	 causes SLP discovery to fail.  */
       out[i*8 + 6] = b6 - 3;
       out[i*8 + 7] = b7 - 7;
 
@@ -76,5 +80,5 @@ int main (void)
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_strided8 && vect_int_mult } } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! { vect_strided8 && vect_int_mult } } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { { vect_strided8 && {! vect_load_lanes } } && vect_int_mult } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { { vect_strided8 && {! vect_load_lanes } } && vect_int_mult } } } } */
 /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { ! { vect_strided8 && vect_int_mult } } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-21.c b/gcc/testsuite/gcc.dg/vect/slp-21.c
index 58751688414..dc153a53b47 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-21.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-21.c
@@ -12,6 +12,7 @@ main1 ()
   unsigned short out[N*8], out2[N*8], b0, b1, b2, b3, b4, a0, a1, a2, a3, b5;
   unsigned short in[N*8];
 
+#pragma GCC novector
   for (i = 0; i < N*8; i++)
     {
       in[i] = i;
@@ -202,18 +203,8 @@ int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized 4 loops" 1 "vect"  { target { vect_strided4 || vect_extract_even_odd } } } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  { target  { ! { vect_strided4 || vect_extract_even_odd } } } } } */
-/* Some targets can vectorize the second of the three main loops using
-   hybrid SLP.  For 128-bit vectors, the required 4->3 permutations are:
-
-   { 0, 1, 2, 4, 5, 6, 8, 9 }
-   { 2, 4, 5, 6, 8, 9, 10, 12 }
-   { 5, 6, 8, 9, 10, 12, 13, 14 }
-
-   Not all vect_perm targets support that, and it's a bit too specific to have
-   its own effective-target selector, so we just test targets directly.  */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect" { target { powerpc64*-*-* s390*-*-* loongarch*-*-* } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target { vect_strided4 && { ! { powerpc64*-*-* s390*-*-* loongarch*-*-* } } } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect"  { target { ! { vect_strided4 } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect"  { target { vect_strided4 || vect_extract_even_odd } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect"  { target  { ! { vect_strided4 || vect_extract_even_odd } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 6 "vect" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" } } */
   
diff --git a/gcc/testsuite/gcc.dg/vect/slp-cond-1.c b/gcc/testsuite/gcc.dg/vect/slp-cond-1.c
index 450c7141c96..16ab0cc7605 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-cond-1.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-cond-1.c
@@ -125,4 +125,4 @@ main ()
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-complex-5.c b/gcc/testsuite/gcc.dg/vect/vect-complex-5.c
index addcf60438c..ac562dc475c 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-complex-5.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-complex-5.c
@@ -41,4 +41,4 @@ main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target vect_load_lanes } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target { ! vect_load_lanes } xfail { ! vect_hw_misalign } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { ! vect_load_lanes } xfail { ! vect_hw_misalign } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-gather-2.c b/gcc/testsuite/gcc.dg/vect/vect-gather-2.c
index 4c23b808333..10e64e64d47 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-gather-2.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-gather-2.c
@@ -36,6 +36,5 @@ f3 (int *restrict y, int *restrict x, int *restrict indices)
     }
 }
 
-/* { dg-final { scan-tree-dump-not "Loop contains only SLP stmts" vect } } */
 /* { dg-final { scan-tree-dump "different gather base" vect { target { ! vect_gather_load_ifn } } } } */
 /* { dg-final { scan-tree-dump "different gather scale" vect { target { ! vect_gather_load_ifn } } } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr52252-atom.c b/gcc/testsuite/gcc.target/i386/pr52252-atom.c
index 11f94411575..02736d56d31 100644
--- a/gcc/testsuite/gcc.target/i386/pr52252-atom.c
+++ b/gcc/testsuite/gcc.target/i386/pr52252-atom.c
@@ -25,4 +25,5 @@ matrix_mul (byte *in, byte *out, int size)
     }
 }
 
-/* { dg-final { scan-assembler "palignr" } } */
+/* We are no longer using hybrid SLP.  */
+/* { dg-final { scan-assembler "palignr" { xfail *-*-* } } } */
-- 
2.35.3

2024-05-22 10:58 [PATCH 4/4] Testsuite updates Richard Biener
2024-05-22 11:37 ` Richard Sandiford
2024-05-22 15:14 ` Jeff Law
  -- strict thread matches above, loose matches on Subject: below --
2024-05-21 12:48 Richard Biener
