From: Richard Sandiford <richard.sandiford@arm.com>
To: Hao Liu OS <hliu@os.amperecomputing.com>
Cc: "GCC-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]
Date: Mon, 24 Jul 2023 12:10:52 +0100 [thread overview]
Message-ID: <mptr0oxh8eb.fsf@arm.com> (raw)
In-Reply-To: <SJ2PR01MB8635742E07E2076FA2BE0560E139A@SJ2PR01MB8635.prod.exchangelabs.com> (Hao Liu's message of "Wed, 19 Jul 2023 04:33:48 +0000")
Hao Liu OS <hliu@os.amperecomputing.com> writes:
> This only affects the new costs in aarch64 backend. Currently, the reduction
> latency of vector body is too large as it is multiplied by stmt count. As the
> scalar reduction latency is small, the new costs model may think "scalar code
> would issue more quickly" and increase the vector body cost a lot, which will
> miss vectorization opportunities.
>
> Tested by bootstrapping on aarch64-linux-gnu.
>
> gcc/ChangeLog:
>
> PR target/110625
> * config/aarch64/aarch64.cc (count_ops): Remove the '* count'
> for reduction_latency.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/aarch64/pr110625.c: New testcase.
> ---
> gcc/config/aarch64/aarch64.cc | 5 +--
> gcc/testsuite/gcc.target/aarch64/pr110625.c | 46 +++++++++++++++++++++
> 2 files changed, 47 insertions(+), 4 deletions(-)
> create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625.c
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 560e5431636..27afa64b7d5 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16788,10 +16788,7 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
>      {
>        unsigned int base
>          = aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
> -
> -      /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
> -         that's not yet the case.  */
> -      ops->reduction_latency = MAX (ops->reduction_latency, base * count);
> +      ops->reduction_latency = MAX (ops->reduction_latency, base);
The ??? is referring to the single_defuse_cycle code in
vectorizable_reduction. E.g. consider:
long
f (long res, short *ptr1, short *ptr2, int n) {
  for (int i = 0; i < n; ++i)
    res += (long) ptr1[i] << ptr2[i];
  return res;
}
compiled at -O3. The main loop is:
        movi    v25.4s, 0                  // init accumulator
        lsl     x5, x5, 4
        .p2align 3,,7
.L4:
        ldr     q31, [x1, x4]
        ldr     q29, [x2, x4]
        add     x4, x4, 16
        sxtl    v30.4s, v31.4h
        sxtl2   v31.4s, v31.8h
        sxtl    v28.4s, v29.4h
        sxtl2   v29.4s, v29.8h
        sxtl    v27.2d, v30.2s
        sxtl2   v30.2d, v30.4s
        sxtl    v23.2d, v28.2s
        sxtl2   v26.2d, v28.4s
        sxtl    v24.2d, v29.2s
        sxtl    v28.2d, v31.2s
        sshl    v27.2d, v27.2d, v23.2d
        sshl    v30.2d, v30.2d, v26.2d
        sxtl2   v31.2d, v31.4s
        sshl    v28.2d, v28.2d, v24.2d
        add     v27.2d, v27.2d, v25.2d     // v25 -> v27
        sxtl2   v29.2d, v29.4s
        add     v30.2d, v30.2d, v27.2d     // v27 -> v30
        sshl    v31.2d, v31.2d, v29.2d
        add     v30.2d, v28.2d, v30.2d     // v30 -> v30
        add     v25.2d, v31.2d, v30.2d     // v30 -> v25
        cmp     x4, x5
        bne     .L4
Here count is 4 and the latency is 4 additions (8 cycles). So as a
worst case, the current cost is correct.
To remove the count in all cases, we would need to either:
(1) avoid single def-use cycles, or
(2) reassociate the reduction as a tree (4->2, 2->1, 1+acc->acc).
But looking again, it seems we do have the information to distinguish
the cases. We can do something like:
  stmt_vec_info reduc_info = info_for_reduction (m_vinfo, stmt_info);
  if (STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info))
    /* ??? Ideally we'd use a tree to reduce the copies down to 1 vector,
       and then accumulate that, but at the moment the loop-carried
       dependency includes all copies.  */
    ops->reduction_latency = MAX (ops->reduction_latency, base * count);
  else
    ops->reduction_latency = MAX (ops->reduction_latency, base);
(completely untested).
Thanks,
Richard