[PATCH][AArch64] PR79262: Adjust vector cost

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH][AArch64] PR79262: Adjust vector cost
@ 2018-01-22 15:12 Wilco Dijkstra
  2018-01-22 15:44 ` Richard Biener
  2018-11-09 14:14 ` Wilco Dijkstra
  0 siblings, 2 replies; 12+ messages in thread
From: Wilco Dijkstra @ 2018-01-22 15:12 UTC (permalink / raw)
  To: GCC Patches; +Cc: nd, James Greenhalgh

PR79262 has been fixed for almost all AArch64 cpus, however the example is still
vectorized in a few cases, resulting in lower performance.  Increase the cost of
vector-to-scalar moves so it is more similar to the other vector costs. As a result
-mcpu=cortex-a53 no longer vectorizes the testcase - libquantum and SPECv6
performance improves.

OK for commit?

ChangeLog:
2018-01-22  Wilco Dijkstra  <wdijkstr@arm.com>

	PR target/79262
	* config/aarch64/aarch64.c (generic_vector_cost): Adjust vec_to_scalar_cost.
--

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index c6a83c881038873d8b68e36f906783be63ddde56..43f5b7162152ca92a916f4febee01f624c375202 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -403,7 +403,7 @@ static const struct cpu_vector_cost generic_vector_cost =
   1, /* vec_int_stmt_cost  */
   1, /* vec_fp_stmt_cost  */
   2, /* vec_permute_cost  */
-  1, /* vec_to_scalar_cost  */
+  2, /* vec_to_scalar_cost  */
   1, /* scalar_to_vec_cost  */
   1, /* vec_align_load_cost  */
   1, /* vec_unalign_load_cost  */

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH][AArch64] PR79262: Adjust vector cost
  2018-01-22 15:12 [PATCH][AArch64] PR79262: Adjust vector cost Wilco Dijkstra
@ 2018-01-22 15:44 ` Richard Biener
  2018-11-09 14:57   ` James Greenhalgh
  2018-11-09 14:14 ` Wilco Dijkstra
  1 sibling, 1 reply; 12+ messages in thread
From: Richard Biener @ 2018-01-22 15:44 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GCC Patches, nd, James Greenhalgh

On Mon, Jan 22, 2018 at 4:01 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> PR79262 has been fixed for almost all AArch64 cpus, however the example is still
> vectorized in a few cases, resulting in lower performance.  Increase the cost of
> vector-to-scalar moves so it is more similar to the other vector costs. As a result
> -mcpu=cortex-a53 no longer vectorizes the testcase - libquantum and SPECv6
> performance improves.
>
> OK for commit?

It would be better to dissect this cost into vec_to_scalar and vec_extract where
vec_to_scalar really means getting at the scalar value of a vector of
uniform values
which most targets can do without any instruction (just use a subreg).

I suppose we could also make vec_to_scalar equal to vector extraction and remove
the uses for the other case (reduction vector result to scalar reg).

Richard.

> ChangeLog:
> 2018-01-22  Wilco Dijkstra  <wdijkstr@arm.com>
>
>         PR target/79262
>         * config/aarch64/aarch64.c (generic_vector_cost): Adjust vec_to_scalar_cost.
> --
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index c6a83c881038873d8b68e36f906783be63ddde56..43f5b7162152ca92a916f4febee01f624c375202 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -403,7 +403,7 @@ static const struct cpu_vector_cost generic_vector_cost =
>    1, /* vec_int_stmt_cost  */
>    1, /* vec_fp_stmt_cost  */
>    2, /* vec_permute_cost  */
> -  1, /* vec_to_scalar_cost  */
> +  2, /* vec_to_scalar_cost  */
>    1, /* scalar_to_vec_cost  */
>    1, /* vec_align_load_cost  */
>    1, /* vec_unalign_load_cost  */

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH][AArch64] PR79262: Adjust vector cost
  2018-01-22 15:12 [PATCH][AArch64] PR79262: Adjust vector cost Wilco Dijkstra
  2018-01-22 15:44 ` Richard Biener
@ 2018-11-09 14:14 ` Wilco Dijkstra
  2018-11-09 14:54   ` James Greenhalgh
  2019-10-10 17:49   ` Wilco Dijkstra
  1 sibling, 2 replies; 12+ messages in thread
From: Wilco Dijkstra @ 2018-11-09 14:14 UTC (permalink / raw)
  To: GCC Patches; +Cc: nd, James Greenhalgh

PR79262 has been fixed for almost all AArch64 cpus, however the example is still
vectorized in a few cases, resulting in lower performance.  Increase the cost of
vector-to-scalar moves so it is more similar to the other vector costs. As a result
-mcpu=cortex-a53 no longer vectorizes the testcase - libquantum and SPECv6
performance improves.

OK for commit?

ChangeLog:
2018-01-22  Wilco Dijkstra  <wdijkstr@arm.com>

        PR target/79262
        * config/aarch64/aarch64.c (generic_vector_cost): Adjust vec_to_scalar_cost.
--

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index c6a83c881038873d8b68e36f906783be63ddde56..43f5b7162152ca92a916f4febee01f624c375202 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -403,7 +403,7 @@ static const struct cpu_vector_cost generic_vector_cost =
   1, /* vec_int_stmt_cost  */
   1, /* vec_fp_stmt_cost  */
   2, /* vec_permute_cost  */
-  1, /* vec_to_scalar_cost  */
+  2, /* vec_to_scalar_cost  */
   1, /* scalar_to_vec_cost  */
   1, /* vec_align_load_cost  */
   1, /* vec_unalign_load_cost  */
    

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH][AArch64] PR79262: Adjust vector cost
  2018-11-09 14:14 ` Wilco Dijkstra
@ 2018-11-09 14:54   ` James Greenhalgh
  2018-11-09 17:31     ` Wilco Dijkstra
  2019-10-10 17:49   ` Wilco Dijkstra
  1 sibling, 1 reply; 12+ messages in thread
From: James Greenhalgh @ 2018-11-09 14:54 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GCC Patches, nd

On Fri, Nov 09, 2018 at 08:14:27AM -0600, Wilco Dijkstra wrote:
> PR79262 has been fixed for almost all AArch64 cpus, however the example is still
> vectorized in a few cases, resulting in lower performance.Â  Increase the cost of
> vector-to-scalar moves so it is more similar to the other vector costs. As a result
> -mcpu=cortex-a53 no longer vectorizes the testcase - libquantum and SPECv6
> performance improves.
> 
> OK for commit?

No.

We have 7 unique target tuning structures in the AArch64 backend, of which
only one has a 2x ratio between scalar_int_cost and vec_to_scalar_cost. Other
ratios are 1, 3, 8, 3, 4, 6.

What makes this choice correct? What makes it more correct than what we
have now? On which of the 28 entries in config/aarch64/aarch64-cores.def does
performance improve? Are the Spec benchmarks sufficiently representative to
change the generic vectorisation costs?

Please validate the performance effect of this patch, which changes default
code generation for everyone, on more than one testcase in a bug report.

Thanks,
James

> ChangeLog:
> 2018-01-22Â  Wilco DijkstraÂ  <wdijkstr@arm.com>
> 
> Â Â Â Â Â Â Â  PR target/79262
> Â Â Â Â Â Â Â  * config/aarch64/aarch64.c (generic_vector_cost): Adjust vec_to_scalar_cost.
> --

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH][AArch64] PR79262: Adjust vector cost
  2018-01-22 15:44 ` Richard Biener
@ 2018-11-09 14:57   ` James Greenhalgh
  2018-11-09 17:43     ` Wilco Dijkstra
  0 siblings, 1 reply; 12+ messages in thread
From: James Greenhalgh @ 2018-11-09 14:57 UTC (permalink / raw)
  To: Richard Biener; +Cc: Wilco Dijkstra, GCC Patches, nd

On Mon, Jan 22, 2018 at 09:22:27AM -0600, Richard Biener wrote:
> On Mon, Jan 22, 2018 at 4:01 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> > PR79262 has been fixed for almost all AArch64 cpus, however the example is still
> > vectorized in a few cases, resulting in lower performance.  Increase the cost of
> > vector-to-scalar moves so it is more similar to the other vector costs. As a result
> > -mcpu=cortex-a53 no longer vectorizes the testcase - libquantum and SPECv6
> > performance improves.
> >
> > OK for commit?
> 
> It would be better to dissect this cost into vec_to_scalar and vec_extract where
> vec_to_scalar really means getting at the scalar value of a vector of
> uniform values
> which most targets can do without any instruction (just use a subreg).
> 
> I suppose we could also make vec_to_scalar equal to vector extraction and remove
> the uses for the other case (reduction vector result to scalar reg).

I have dug up Richard's comments from last year, which you appear to have
ignored and made no reference to when resubmitting the patch.

Please don't do that. Carefully consider Richard's review feedback before
resubmitting this patch.

To reiterate, it is not OK for trunk.

Thanks,
James

> 
> Richard.
> 
> > ChangeLog:
> > 2018-01-22  Wilco Dijkstra  <wdijkstr@arm.com>
> >
> >         PR target/79262
> >         * config/aarch64/aarch64.c (generic_vector_cost): Adjust vec_to_scalar_cost.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH][AArch64] PR79262: Adjust vector cost
  2018-11-09 14:54   ` James Greenhalgh
@ 2018-11-09 17:31     ` Wilco Dijkstra
  0 siblings, 0 replies; 12+ messages in thread
From: Wilco Dijkstra @ 2018-11-09 17:31 UTC (permalink / raw)
  To: James Greenhalgh; +Cc: GCC Patches, nd

Hi James,

> We have 7 unique target tuning structures in the AArch64 backend, of which
> only one has a 2x ratio between scalar_int_cost and vec_to_scalar_cost. Other
> ratios are 1, 3, 8, 3, 4, 6.

I wouldn't read too much in the exact value here - the costs are simply relative to
other values for the same tuning, ie. cores that use 4 or 6 also tend to use larger
values for the other entries.

> What makes this choice correct? What makes it more correct than what we
> have now? On which of the 28 entries in config/aarch64/aarch64-cores.def does
> performance improve? 

It's correct since it's the minimum value that stops vectorization for this particular
bug without increasing costs too much and accidentally blocking useful loops from
being vectorized (as confirmed by benchmarking). Given that all other vector cost
tunings already block vectorization for this case, this is clearly what is required
for best performance. So this improves performance for all 28 entries.

> Please validate the performance effect of this patch, which changes default
> code generation for everyone, on more than one testcase in a bug report.

I did validate the performance like I mentioned in the patch. Since it has been a
while, I can easily rerun the benchmarks using latest trunk to verify it's still a gain.

> Are the Spec benchmarks sufficiently representative to change the generic
> vectorisation costs?

SPEC is sufficiently large and complex to show there is a gain without regressions.
Do you have reason to believe other benchmarks might not see the same gain?

Wilco

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH][AArch64] PR79262: Adjust vector cost
  2018-11-09 14:57   ` James Greenhalgh
@ 2018-11-09 17:43     ` Wilco Dijkstra
  0 siblings, 0 replies; 12+ messages in thread
From: Wilco Dijkstra @ 2018-11-09 17:43 UTC (permalink / raw)
  To: James Greenhalgh, Richard Biener; +Cc: GCC Patches, nd

Hi James,

>On Mon, Jan 22, 2018 at 09:22:27AM -0600, Richard Biener wrote:
>> It would be better to dissect this cost into vec_to_scalar and vec_extract where
>> vec_to_scalar really means getting at the scalar value of a vector of
>> uniform values
>> which most targets can do without any instruction (just use a subreg).
>> 
>> I suppose we could also make vec_to_scalar equal to vector extraction and remove
>> the uses for the other case (reduction vector result to scalar reg).
>
> I have dug up Richard's comments from last year, which you appear to have
> ignored and made no reference to when resubmitting the patch.
>
> Please don't do that. Carefully consider Richard's review feedback before
> resubmitting this patch.

I seem to have missed this comment... However I can't see how it relates to my
patch, particularly since Richard claimed in PR79262 that this PR is a backend
issue. Sure it *would* be great to fix the vector cost infrastructure, but that's a lot
of work, and wasn't my goal here, nor is it required to fix PR79262. The existing
costs allow the issue to be solved, just like the other targets/tunings have done
already, so that's clearly the best approach for this PR.

Wilco

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH][AArch64] PR79262: Adjust vector cost
  2018-11-09 14:14 ` Wilco Dijkstra
  2018-11-09 14:54   ` James Greenhalgh
@ 2019-10-10 17:49   ` Wilco Dijkstra
  2019-10-16 21:35     ` Richard Sandiford
  2019-11-19 15:30     ` Wilco Dijkstra
  1 sibling, 2 replies; 12+ messages in thread
From: Wilco Dijkstra @ 2019-10-10 17:49 UTC (permalink / raw)
  To: GCC Patches, Richard Earnshaw, Kyrylo Tkachov, Richard Sandiford; +Cc: nd

ping

PR79262 has been fixed for almost all AArch64 cpus, however the example is still
vectorized in a few cases, resulting in lower performance.  Increase the cost of
vector-to-scalar moves so it is more similar to the other vector costs. As a result
-mcpu=cortex-a53 no longer vectorizes the testcase - libquantum and SPECv6
performance improves.

OK for commit?

ChangeLog:
2018-01-22  Wilco Dijkstra  <wdijkstr@arm.com>

        PR target/79262
        * config/aarch64/aarch64.c (generic_vector_cost): Adjust vec_to_scalar_cost.
--

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index c6a83c881038873d8b68e36f906783be63ddde56..43f5b7162152ca92a916f4febee01f624c375202 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -403,7 +403,7 @@ static const struct cpu_vector_cost generic_vector_cost =
   1, /* vec_int_stmt_cost  */
   1, /* vec_fp_stmt_cost  */
   2, /* vec_permute_cost  */
-  1, /* vec_to_scalar_cost  */
+  2, /* vec_to_scalar_cost  */
   1, /* scalar_to_vec_cost  */
   1, /* vec_align_load_cost  */
   1, /* vec_unalign_load_cost  */

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH][AArch64] PR79262: Adjust vector cost
  2019-10-10 17:49   ` Wilco Dijkstra
@ 2019-10-16 21:35     ` Richard Sandiford
  2019-11-19 15:30     ` Wilco Dijkstra
  1 sibling, 0 replies; 12+ messages in thread
From: Richard Sandiford @ 2019-10-16 21:35 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GCC Patches, Richard Earnshaw, Kyrylo Tkachov, nd

Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> ping
>
> PR79262 has been fixed for almost all AArch64 cpus, however the example is still
> vectorized in a few cases, resulting in lower performance.  Increase the cost of
> vector-to-scalar moves so it is more similar to the other vector costs. As a result
> -mcpu=cortex-a53 no longer vectorizes the testcase - libquantum and SPECv6
> performance improves.
>
> OK for commit?
>
> ChangeLog:
> 2018-01-22  Wilco Dijkstra  <wdijkstr@arm.com>
>
>         PR target/79262
>         * config/aarch64/aarch64.c (generic_vector_cost): Adjust vec_to_scalar_cost.

OK, thanks, and sorry for the delay.

qdf24xx_vector_cost is the only specific CPU cost table with a
vec_to_scalar_cost as low as 1.  It's not obvious how emphatic
that choice is though.  It looks like qdf24xx_vector_cost might
(very reasonably!) have started out as a copy of the generic costs
with some targeted changes.

But even if 1 is accurate there from a h/w perspective, the problem
is that the vectoriser's costings have a tendency to miss additional
overhead involved in scalarisation.  Although increasing the cost
to avoid that might be a bit of a hack, it's the accepted hack.

So I suspect in practice all CPUs will benefit from a higher cost,
not just those whose CPU tables already have one.  On that basis,
increasing the generic cost by the smallest possible amount should
be a good change across the board.

If anyone finds a counter-example, please let us know or file a bug.

Richard

> --
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index c6a83c881038873d8b68e36f906783be63ddde56..43f5b7162152ca92a916f4febee01f624c375202 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -403,7 +403,7 @@ static const struct cpu_vector_cost generic_vector_cost =
>    1, /* vec_int_stmt_cost  */
>    1, /* vec_fp_stmt_cost  */
>    2, /* vec_permute_cost  */
> -  1, /* vec_to_scalar_cost  */
> +  2, /* vec_to_scalar_cost  */
>    1, /* scalar_to_vec_cost  */
>    1, /* vec_align_load_cost  */
>    1, /* vec_unalign_load_cost  */

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH][AArch64] PR79262: Adjust vector cost
  2019-10-10 17:49   ` Wilco Dijkstra
  2019-10-16 21:35     ` Richard Sandiford
@ 2019-11-19 15:30     ` Wilco Dijkstra
  2019-11-19 15:53       ` Richard Sandiford
  1 sibling, 1 reply; 12+ messages in thread
From: Wilco Dijkstra @ 2019-11-19 15:30 UTC (permalink / raw)
  To: GCC Patches, Richard Earnshaw, Kyrylo Tkachov, Richard Sandiford; +Cc: nd


ping

PR79262 has been fixed for almost all AArch64 cpus, however the example is still
vectorized in a few cases, resulting in lower performance.  Increase the cost of
vector-to-scalar moves so it is more similar to the other vector costs. As a result
-mcpu=cortex-a53 no longer vectorizes the testcase - libquantum and SPECv6
performance improves.

OK for commit?

ChangeLog:
2018-01-22  Wilco Dijkstra  <wdijkstr@arm.com>

        PR target/79262
        * config/aarch64/aarch64.c (generic_vector_cost): Adjust vec_to_scalar_cost.
--

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index c6a83c881038873d8b68e36f906783be63ddde56..43f5b7162152ca92a916f4febee01f624c375202 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -403,7 +403,7 @@ static const struct cpu_vector_cost generic_vector_cost =
   1, /* vec_int_stmt_cost  */
   1, /* vec_fp_stmt_cost  */
   2, /* vec_permute_cost  */
-  1, /* vec_to_scalar_cost  */
+  2, /* vec_to_scalar_cost  */
   1, /* scalar_to_vec_cost  */
   1, /* vec_align_load_cost  */
   1, /* vec_unalign_load_cost  */

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH][AArch64] PR79262: Adjust vector cost
  2019-11-19 15:30     ` Wilco Dijkstra
@ 2019-11-19 15:53       ` Richard Sandiford
  2019-11-19 16:07         ` Wilco Dijkstra
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Sandiford @ 2019-11-19 15:53 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GCC Patches, Richard Earnshaw, Kyrylo Tkachov, nd

Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> ping

I acked this here: https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01229.html

>
> PR79262 has been fixed for almost all AArch64 cpus, however the example is still
> vectorized in a few cases, resulting in lower performance.  Increase the cost of
> vector-to-scalar moves so it is more similar to the other vector costs. As a result
> -mcpu=cortex-a53 no longer vectorizes the testcase - libquantum and SPECv6
> performance improves.
>
> OK for commit?
>
> ChangeLog:
> 2018-01-22  Wilco Dijkstra  <wdijkstr@arm.com>
>
>         PR target/79262
>         * config/aarch64/aarch64.c (generic_vector_cost): Adjust vec_to_scalar_cost.
> --
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index c6a83c881038873d8b68e36f906783be63ddde56..43f5b7162152ca92a916f4febee01f624c375202 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -403,7 +403,7 @@ static const struct cpu_vector_cost generic_vector_cost =
>    1, /* vec_int_stmt_cost  */
>    1, /* vec_fp_stmt_cost  */
>    2, /* vec_permute_cost  */
> -  1, /* vec_to_scalar_cost  */
> +  2, /* vec_to_scalar_cost  */
>    1, /* scalar_to_vec_cost  */
>    1, /* vec_align_load_cost  */
>    1, /* vec_unalign_load_cost  */

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH][AArch64] PR79262: Adjust vector cost
  2019-11-19 15:53       ` Richard Sandiford
@ 2019-11-19 16:07         ` Wilco Dijkstra
  0 siblings, 0 replies; 12+ messages in thread
From: Wilco Dijkstra @ 2019-11-19 16:07 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: GCC Patches, Richard Earnshaw, Kyrylo Tkachov, nd

Hi Richard,

> I acked this here: 
> https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01229.html

Thanks - I missed your email, but it's committed now. Yes we will
need to look at the vector costs again and retune them based on
recent vectorizer improvements and latest microarchitectures.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2019-11-19 16:06 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-22 15:12 [PATCH][AArch64] PR79262: Adjust vector cost Wilco Dijkstra
2018-01-22 15:44 ` Richard Biener
2018-11-09 14:57   ` James Greenhalgh
2018-11-09 17:43     ` Wilco Dijkstra
2018-11-09 14:14 ` Wilco Dijkstra
2018-11-09 14:54   ` James Greenhalgh
2018-11-09 17:31     ` Wilco Dijkstra
2019-10-10 17:49   ` Wilco Dijkstra
2019-10-16 21:35     ` Richard Sandiford
2019-11-19 15:30     ` Wilco Dijkstra
2019-11-19 15:53       ` Richard Sandiford
2019-11-19 16:07         ` Wilco Dijkstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).