From: "Kewen.Lin" <linkw@linux.ibm.com>
To: richard.sandiford@arm.com
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH 00/10] vect: Reuse reduction accumulators between loops
Date: Sat, 10 Jul 2021 10:11:35 +0800
Message-ID: <7b5ba7c1-4e8b-9aab-c0f3-ce05a701c899@linux.ibm.com>
In-Reply-To: <mptpmvtrner.fsf@arm.com>

Hi Richard,

on 2021/7/8 8:38 PM, Richard Sandiford via Gcc-patches wrote:
> Quoting from the final patch in the series:
>
> ------------------------------------------------------------------------
> This patch adds support for reusing a main loop's reduction accumulator
> in an epilogue loop. This in turn lets the loops share a single piece
> of vector->scalar reduction code.
>
> The patch has the following restrictions:
>
> (1) The epilogue reduction can only operate on a single vector
> (e.g. ncopies must be 1 for non-SLP reductions, and the group size
> must be <= the element count for SLP reductions).
>
> (2) Both loops must use the same vector mode for their accumulators.
> This means that the patch is restricted to targets that support
> --param vect-partial-vector-usage=1.
>
> (3) The reduction must be a standard “tree code” reduction.
>
> However, these restrictions could be lifted in future. For example,
> if the main loop operates on 128-bit vectors and the epilogue loop
> operates on 64-bit vectors, we could in future reduce the 128-bit
> vector by one stage and use the 64-bit result as the starting point
> for the epilogue result.
>
> The patch tries to handle chained SLP reductions, unchained SLP
> reductions and non-SLP reductions. It also handles cases in which
> the epilogue loop is entered directly (rather than via the main loop)
> and cases in which the epilogue loop can be skipped.
> ------------------------------------------------------------------------
>
> However, it ended up being difficult to do that without some preparatory
> clean-ups. Some of them could probably stand on their own, but others
> are a bit “meh” without the final patch to justify them.
>
> The diff below shows the effect of the patch when compiling:
>
> unsigned short __attribute__((noipa))
> add_loop (unsigned short *x, int n)
> {
>   unsigned short res = 0;
>   for (int i = 0; i < n; ++i)
>     res += x[i];
>   return res;
> }
>
> with -O3 --param vect-partial-vector-usage=1 on an SVE target:
>
> add_loop:                          add_loop:
> .LFB0:                             .LFB0:
>     .cfi_startproc                     .cfi_startproc
>     mov x4, x0                   <
>     cmp w1, 0                          cmp w1, 0
>     ble .L7                            ble .L7
>     cnth x0                      |     cnth x4
>     sub w2, w1, #1                     sub w2, w1, #1
>     sub w3, w0, #1               |     sub w3, w4, #1
>     cmp w2, w3                         cmp w2, w3
>     bcc .L8                            bcc .L8
>     sub w0, w1, w0               |     sub w4, w1, w4
>     mov x3, 0                          mov x3, 0
>     cnth x5                            cnth x5
>     mov z0.b, #0                       mov z0.b, #0
>     ptrue p0.b, all                    ptrue p0.b, all
>     .p2align 3,,7                      .p2align 3,,7
> .L4:                               .L4:
>     ld1h z1.h, p0/z, [x4, x3,    |     ld1h z1.h, p0/z, [x0, x3,
>     mov x2, x3                         mov x2, x3
>     add x3, x3, x5                     add x3, x3, x5
>     add z0.h, z0.h, z1.h               add z0.h, z0.h, z1.h
>     cmp w0, w3                   |     cmp w4, w3
>     bcs .L4                            bcs .L4
>     uaddv d0, p0, z0.h           <
>     umov w0, v0.h[0]             <
>     inch x2                            inch x2
>     and w0, w0, 65535            <
>     cmp w1, w2                         cmp w1, w2
>     beq .L2                      |     beq .L6
> .L3:                               .L3:
>     sub w1, w1, w2                     sub w1, w1, w2
>     mov z1.b, #0                 |     add x2, x0, w2, uxtw 1
>     whilelo p0.h, wzr, w1              whilelo p0.h, wzr, w1
>     add x2, x4, w2, uxtw 1       |     ld1h z1.h, p0/z, [x2]
>     ptrue p1.b, all              |     add z0.h, p0/m, z0.h, z1.
>     ld1h z0.h, p0/z, [x2]        | .L6:
>     sel z0.h, p0, z0.h, z1.h     |     ptrue p0.b, all
>     uaddv d0, p1, z0.h           |     uaddv d0, p0, z0.h
>     fmov x1, d0                  |     umov w0, v0.h[0]
>     add w0, w0, w1, uxth         <
>     and w0, w0, 65535                  and w0, w0, 65535
> .L2:                             <
>     ret                                ret
>     .p2align 2,,3                      .p2align 2,,3
> .L7:                               .L7:
>     mov w0, 0                          mov w0, 0
>     ret                                ret
> .L8:                               .L8:
>     mov w2, 0                          mov w2, 0
>     mov w0, 0                    |     mov z0.b, #0
>     b .L3                              b .L3
>     .cfi_endproc                       .cfi_endproc
>
> Kewen, could you give this a spin on Power 10 to see whether it
> works/helps there? I've attached a combined diff.
>
Thanks for the combined diff file.

I'm sorry, but the current length-based partial vector support doesn't
handle reductions yet. There are no conditional operations controlled
by a length, and the values of inactive lanes are undefined, so we have
to preprocess the inactive lanes, according to the operation type,
either for the intermediate operations or for the final reduction.
That seems to need an efficient way to turn a length into a mask vector.
Power10 doesn't have a corresponding instruction, so we would have to
use some tricks. It's still on my TODO list.

As an experiment, I hacked vectorizable_operation to relax its check
for operations involved in a reduction. With that hack I can see this
patch series take effect for length-based partial vectors, so I believe
it will help them once we enable reductions there later.

Thanks for improving this!

The patch series was bootstrapped and regression-tested on Power10.
I also benchmarked it with SPEC2017, based on r12-2179, at Ofast unroll;
no notable regression or improvement was observed.

BR,
Kewen

Thread overview: 30+ messages
2021-07-08 12:38 Richard Sandiford
2021-07-08 12:39 ` [PATCH 01/10] vect: Simplify epilogue reduction code Richard Sandiford
2021-07-08 12:58 ` Richard Biener
2021-07-08 12:39 ` [PATCH 02/10] vect: Create array_slice of live-out stmts Richard Sandiford
2021-07-08 12:58 ` Richard Biener
2021-07-08 12:39 ` [PATCH 03/10] vect: Remove new_phis from Richard Sandiford
2021-07-08 12:59 ` Richard Biener
2021-07-08 12:40 ` [PATCH 04/10] vect: Ensure reduc_inputs always have vectype Richard Sandiford
2021-07-08 13:01 ` Richard Biener
2021-07-13 9:26 ` Richard Sandiford
2021-07-08 12:40 ` [PATCH 05/10] vect: Add a vect_phi_initial_value helper function Richard Sandiford
2021-07-08 13:05 ` Richard Biener
2021-07-08 13:12 ` Richard Sandiford
2021-07-08 12:40 ` [PATCH 06/10] vect: Pass reduc_info to get_initial_defs_for_reduction Richard Sandiford
2021-07-08 13:10 ` Richard Biener
2021-07-08 16:48 ` Richard Sandiford
2021-07-09 11:33 ` Richard Biener
2021-07-08 12:41 ` [PATCH 07/10] vect: Pass reduc_info to get_initial_def_for_reduction Richard Sandiford
2021-07-08 12:41 ` [PATCH 08/10] vect: Generalise neutral_op_for_slp_reduction Richard Sandiford
2021-07-08 13:13 ` Richard Biener
2021-07-08 12:41 ` [PATCH 09/10] vect: Simplify get_initial_def_for_reduction Richard Sandiford
2021-07-08 13:14 ` Richard Biener
2021-07-08 12:43 ` [PATCH 10/10] vect: Reuse reduction accumulators between loops Richard Sandiford
2021-07-09 11:58 ` Richard Biener
2021-07-09 13:12 ` Richard Sandiford
2021-07-12 6:32 ` Richard Biener
2021-07-12 17:55 ` Richard Sandiford
2021-07-13 6:09 ` Richard Biener
2021-07-10 2:11 ` Kewen.Lin [this message]
2021-07-13 9:27 ` [PATCH 00/10] " Richard Sandiford