From: Richard Sandiford <richard.sandiford@arm.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH 6/8] aarch64: Tweak MLA vector costs
Date: Wed, 04 Aug 2021 13:19:42 +0100 [thread overview]
Message-ID: <mptr1f95rmp.fsf@arm.com> (raw)
In-Reply-To: <mpty29h5rvd.fsf@arm.com> (Richard Sandiford via Gcc-patches's message of "Wed, 04 Aug 2021 13:14:30 +0100")
Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> Richard Biener <richard.guenther@gmail.com> writes:
>> On Tue, Aug 3, 2021 at 2:10 PM Richard Sandiford via Gcc-patches
>> <gcc-patches@gcc.gnu.org> wrote:
>>>
>>> The issue-based vector costs currently assume that a multiply-add
>>> sequence can be implemented using a single instruction. This is
>>> generally true for scalars (which have a 4-operand instruction)
>>> and SVE (which allows the output to be tied to any input).
>>> However, for Advanced SIMD, multiplying two values and adding
>>> an invariant will end up being a move and an MLA.
>>>
>>> The only target to use the issue-based vector costs is Neoverse V1,
>>> which would generally prefer SVE in this case anyway. I therefore
>>> don't have a self-contained testcase. However, the distinction
>>> becomes more important with a later patch.
>>
>> But we do cost any invariants separately (for the prologue), so they
>> should be available in a register. How doesn't that work?
>
> Yeah, that works, and the prologue part is costed correctly. But the
> result of an Advanced SIMD FMLA is tied to the accumulator input, so if
> the accumulator input is an invariant, we need a register move (in the
> loop body) before the FMLA.
>
> E.g. for:
>
> void
> f (float *restrict x, float *restrict y, float *restrict z, float a)
> {
>   for (int i = 0; i < 100; ++i)
>     x[i] = y[i] * z[i] + 1.0f;
> }
>
> (the "+ 1.0f" was missing from the example in my previous message;
> not sure where that went)
>
> the scalar code is:
>
> .L2:
>         ldr     s1, [x1, x3]
>         ldr     s2, [x2, x3]
>         fmadd   s1, s1, s2, s0
>         str     s1, [x0, x3]
>         add     x3, x3, 4
>         cmp     x3, 400
>         bne     .L2
>
> the SVE code is:
>
> .L2:
>         ld1w    z1.s, p0/z, [x1, x3, lsl 2]
>         ld1w    z0.s, p0/z, [x2, x3, lsl 2]
>         fmad    z0.s, p1/m, z1.s, z2.s
>         st1w    z0.s, p0, [x0, x3, lsl 2]
>         add     x3, x3, x5
>         whilelo p0.s, w3, w4
>         b.any   .L2
>
> but the Advanced SIMD code is:
>
> .L2:
>         mov     v0.16b, v3.16b   // <-------- boo
>         ldr     q2, [x2, x3]
>         ldr     q1, [x1, x3]
>         fmla    v0.4s, v2.4s, v1.4s
>         str     q0, [x0, x3]
>         add     x3, x3, 16
>         cmp     x3, 400
>         bne     .L2
>
> Thanks,
> Richard
>
>
>>
>>> gcc/
>>> 	* config/aarch64/aarch64.c (aarch64_multiply_add_p): Add a vec_flags
>>> 	parameter.  Detect cases in which an Advanced SIMD MLA would almost
>>> 	certainly require a MOV.
>>> 	(aarch64_count_ops): Update accordingly.
>>> ---
>>> gcc/config/aarch64/aarch64.c | 25 ++++++++++++++++++++++---
>>> 1 file changed, 22 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>>> index 084f8caa0da..19045ef6944 100644
>>> --- a/gcc/config/aarch64/aarch64.c
>>> +++ b/gcc/config/aarch64/aarch64.c
>>> @@ -14767,9 +14767,12 @@ aarch64_integer_truncation_p (stmt_vec_info stmt_info)
>>>
>>>  /* Return true if STMT_INFO is the second part of a two-statement multiply-add
>>>     or multiply-subtract sequence that might be suitable for fusing into a
>>> -   single instruction.  */
>>> +   single instruction.  If VEC_FLAGS is zero, analyze the operation as
>>> +   a scalar one, otherwise analyze it as an operation on vectors with those
>>> +   VEC_* flags.  */
>>>  static bool
>>> -aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info)
>>> +aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info,
>>> +			unsigned int vec_flags)
>>>  {
>>>    gassign *assign = dyn_cast<gassign *> (stmt_info->stmt);
>>>    if (!assign)
>>> @@ -14797,6 +14800,22 @@ aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info)
>>>        if (!rhs_assign || gimple_assign_rhs_code (rhs_assign) != MULT_EXPR)
>>>  	continue;
>>>
>>> +      if (vec_flags & VEC_ADVSIMD)
>>> +	{
>>> +	  /* Scalar and SVE code can tie the result to any FMLA input (or none,
>>> +	     although that requires a MOVPRFX for SVE).  However, Advanced SIMD
>>> +	     only supports MLA forms, so will require a move if the result
>>> +	     cannot be tied to the accumulator.  The most important case in
>>> +	     which this is true is when the accumulator input is invariant.  */
>>> +	  rhs = gimple_op (assign, 3 - i);
>>> +	  if (TREE_CODE (rhs) != SSA_NAME)
>>> +	    return false;
>>> +	  def_stmt_info = vinfo->lookup_def (rhs);
>>> +	  if (!def_stmt_info
>>> +	      || STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_external_def)
>>> +	    return false;
>>> +	}
>>> +
>>>        return true;
>>>      }
>>>    return false;
>>> @@ -15232,7 +15251,7 @@ aarch64_count_ops (class vec_info *vinfo, aarch64_vector_costs *costs,
>>>      }
>>>
>>>    /* Assume that multiply-adds will become a single operation.  */
>>> -  if (stmt_info && aarch64_multiply_add_p (vinfo, stmt_info))
>>> +  if (stmt_info && aarch64_multiply_add_p (vinfo, stmt_info, vec_flags))
>>>      return;
>>>
>>>    /* When costing scalar statements in vector code, the count already