From: 钟居哲 <juzhe.zhong@rivai.ai>
To: richard.sandiford <richard.sandiford@arm.com>
Cc: gcc-patches <gcc-patches@gcc.gnu.org>, rguenther <rguenther@suse.de>
Subject: Re: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support
Date: Wed, 24 May 2023 21:26:58 +0800 [thread overview]
Message-ID: <19FAF964BA45AEBD+202305242126582416128@rivai.ai> (raw)
In-Reply-To: <mptedn5hpf3.fsf@arm.com>
[-- Attachment #1: Type: text/plain, Size: 12097 bytes --]
>> In other words, why is this different from what
>>vect_set_loop_controls_directly would do?
Oh, I see. You are confused that why I do not make multiple-rgroup vec_trunk
handling inside "vect_set_loop_controls_directly".
Well. Frankly, I just replicate the handling of ARM SVE:
unsigned int nmasks = i + 1;
if (use_masks_p && (nmasks & 1) == 0)
{
rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
if (!half_rgc->controls.is_empty ()
&& vect_maybe_permute_loop_masks (&header_seq, rgc, half_rgc))
continue;
}
/* Try to use permutes to define the masks in DEST_RGM using the masks
in SRC_RGM, given that the former has twice as many masks as the
latter. Return true on success, adding any new statements to SEQ. */
static bool
vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
rgroup_controls *src_rgm)
{
tree src_masktype = src_rgm->type;
tree dest_masktype = dest_rgm->type;
machine_mode src_mode = TYPE_MODE (src_masktype);
insn_code icode1, icode2;
if (dest_rgm->max_nscalars_per_iter <= src_rgm->max_nscalars_per_iter
&& (icode1 = optab_handler (vec_unpacku_hi_optab,
src_mode)) != CODE_FOR_nothing
&& (icode2 = optab_handler (vec_unpacku_lo_optab,
src_mode)) != CODE_FOR_nothing)
{
/* Unpacking the source masks gives at least as many mask bits as
we need. We can then VIEW_CONVERT any excess bits away. */
machine_mode dest_mode = insn_data[icode1].operand[0].mode;
gcc_assert (dest_mode == insn_data[icode2].operand[0].mode);
tree unpack_masktype = vect_halve_mask_nunits (src_masktype, dest_mode);
for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
{
tree src = src_rgm->controls[i / 2];
tree dest = dest_rgm->controls[i];
tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1)
? VEC_UNPACK_HI_EXPR
: VEC_UNPACK_LO_EXPR);
gassign *stmt;
if (dest_masktype == unpack_masktype)
stmt = gimple_build_assign (dest, code, src);
else
{
tree temp = make_ssa_name (unpack_masktype);
stmt = gimple_build_assign (temp, code, src);
gimple_seq_add_stmt (seq, stmt);
stmt = gimple_build_assign (dest, VIEW_CONVERT_EXPR,
build1 (VIEW_CONVERT_EXPR,
dest_masktype, temp));
}
gimple_seq_add_stmt (seq, stmt);
}
return true;
}
vec_perm_indices indices[2];
if (dest_masktype == src_masktype
&& interleave_supported_p (&indices[0], src_masktype, 0)
&& interleave_supported_p (&indices[1], src_masktype, 1))
{
/* The destination requires twice as many mask bits as the source, so
we can use interleaving permutes to double up the number of bits. */
tree masks[2];
for (unsigned int i = 0; i < 2; ++i)
masks[i] = vect_gen_perm_mask_checked (src_masktype, indices[i]);
for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
{
tree src = src_rgm->controls[i / 2];
tree dest = dest_rgm->controls[i];
gimple *stmt = gimple_build_assign (dest, VEC_PERM_EXPR,
src, src, masks[i & 1]);
gimple_seq_add_stmt (seq, stmt);
}
return true;
}
return false;
}
I know this is just optimization for ARM SVE with sub_rgc (int16) is half size of rgc (int8).
But when I just copy the codes from ARM SVE and make it general for all cases (int8 <-> int64).
They all work well and codegen is good.
If you don't like this way, would you mind give me some suggestions?
Thanks.
juzhe.zhong@rivai.ai
From: Richard Sandiford
Date: 2023-05-24 20:41
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support
Sorry, I realised later that I had an implicit assumption here:
if there are multiple rgroups, it's better to have a single IV
for the smallest rgroup and scale that up to bigger rgroups.
E.g. if the loop control IV is taken from an N-control rgroup
and has a step S, an N*M-control rgroup would be based on M*S.
Of course, it's also OK to create multiple IVs if you prefer.
It's just a question of which approach gives the best output
in practice.
Another way of going from an N-control rgroup ("G1") to an N*M-control
rgroup ("G2") would be to reuse all N controls from G1. E.g. the
first M controls in G2 would come from G1[0], the next M from
G1[1], etc. That might lower the longest dependency chain.
But whatever we do, it doesn't feel like max_nscalars_per_iter
should be part of the decision. (I realise it will be part of
the decision for the follow-on SELECT_IV patch. But that's
because we require the number of elements processed in each
iteration to be a multiple of max_nscalars_per_iter, and AIUI
SELECT_IV wouldn't guarantee that. max_nscalars_per_iter shouldn't
matter for the current patch though.)
钟居哲 <juzhe.zhong@rivai.ai> writes:
> Hi, Richard. It's quite complicated for me and I am not sure whether I can catch up with you.
> So I will rather split the work step by step to implement the decrement IV
>
> For the first step you mentioned:
>
>>> (1) In vect_set_loop_condition_partial_vectors, for the first iteration of:
>
> >> FOR_EACH_VEC_ELT (*controls, i, rgc)
> >> if (!rgc->controls.is_empty ())
>
>>> call vect_set_loop_controls_directly. That is:
>
>>> >> /* See whether zero-based IV would ever generate all-false masks
>>> or zero length before wrapping around. */
>>> bool might_wrap_p = vect_rgroup_iv_might_wrap_p (loop_vinfo, rgc);
>>>
> /* Set up all controls for this group. */
>>> test_ctrl = vect_set_loop_controls_directly (loop, loop_vinfo,
> >> &preheader_seq,
> >> &header_seq,
> >> loop_cond_gsi, rgc,
> >> niters, niters_skip,
> >> might_wrap_p);
>
>>> needs to be an "if" that (for LOOP_VINFO_USING_DECREMENTING_IV_P)
>>> is only executed on the first iteration.
>
> Is it correct like this?
>
> FOR_EACH_VEC_ELT (*controls, i, rgc)
> if (!rgc->controls.is_empty ())
> {
> /* First try using permutes. This adds a single vector
> instruction to the loop for each mask, but needs no extra
> loop invariants or IVs. */
> unsigned int nmasks = i + 1;
> if (use_masks_p && (nmasks & 1) == 0)
> {
> rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
> if (!half_rgc->controls.is_empty ()
> && vect_maybe_permute_loop_masks (&header_seq, rgc, half_rgc))
> continue;
> }
>
> /* See whether zero-based IV would ever generate all-false masks
> or zero length before wrapping around. */
> bool might_wrap_p = vect_rgroup_iv_might_wrap_p (loop_vinfo, rgc);
>
> /* Set up all controls for this group. */
> test_ctrl = vect_set_loop_controls_directly (loop, loop_vinfo,
> &preheader_seq,
> &header_seq,
> loop_cond_gsi, rgc,
> niters, niters_skip,
> might_wrap_p);
>
> /* Decrement IV only run vect_set_loop_controls_directly once. */
> if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
> break;
> }
I meant something like:
FOR_EACH_VEC_ELT (*controls, i, rgc)
if (!rgc->controls.is_empty ())
{
/* First try using permutes. This adds a single vector
instruction to the loop for each mask, but needs no extra
loop invariants or IVs. */
unsigned int nmasks = i + 1;
if (use_masks_p && (nmasks & 1) == 0)
{
rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
if (!half_rgc->controls.is_empty ()
&& vect_maybe_permute_loop_masks (&header_seq, rgc, half_rgc))
continue;
}
if (!LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo)
|| !LOOP_VINFO_DECREMENTING_IV_STEP (loop_info))
{
/* See whether zero-based IV would ever generate all-false masks
or zero length before wrapping around. */
bool might_wrap_p = vect_rgroup_iv_might_wrap_p (loop_vinfo, rgc);
/* Set up all controls for this group. */
test_ctrl = vect_set_loop_controls_directly (loop, loop_vinfo,
&preheader_seq,
&header_seq,
loop_cond_gsi, rgc,
niters, niters_skip,
might_wrap_p);
}
if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo)
&& rgc->controls.length () > 1)
...use vect_adjust_loop_lens_control...
}
where LOOP_VINFO_DECREMENTING_IV_STEP (loop_info) is "S" from my
previous review.
vect_set_loop_controls_directly would then set
LOOP_VINFO_DECREMENTING_IV_STEP but would not call
vect_adjust_loop_lens_control.
But like I say, this is all based on the assumption that we should
have a single IV and scale it up for later rgroups. If you'd prefer
separate IVs then that's fine. But then I think it's less clear
why we have:
> + if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo)
> + && rgc->max_nscalars_per_iter == 1
> + && rgc != &LOOP_VINFO_LENS (loop_vinfo)[0])
> + {
> + /* Multiple rgroup (non-SLP):
> + ...
> + _38 = (unsigned long) n_12(D);
> + ...
> + # ivtmp_38 = PHI <ivtmp_39(3), 100(2)>
> + ...
> + _40 = MIN_EXPR <ivtmp_38, POLY_INT_CST [8, 8]>;
> + loop_len_21 = MIN_EXPR <_40, POLY_INT_CST [2, 2]>;
> + _41 = _40 - loop_len_21;
> + loop_len_20 = MIN_EXPR <_41, POLY_INT_CST [2, 2]>;
> + _42 = _40 - loop_len_20;
> + loop_len_19 = MIN_EXPR <_42, POLY_INT_CST [2, 2]>;
> + _43 = _40 - loop_len_19;
> + loop_len_16 = MIN_EXPR <_43, POLY_INT_CST [2, 2]>;
> + ...
> + vect__4.8_15 = .LEN_LOAD (_6, 64B, loop_len_21, 0);
> + ...
> + vect__4.9_8 = .LEN_LOAD (_13, 64B, loop_len_20, 0);
> + ...
> + vect__4.10_28 = .LEN_LOAD (_46, 64B, loop_len_19, 0);
> + ...
> + vect__4.11_30 = .LEN_LOAD (_49, 64B, loop_len_16, 0);
> + vect__7.13_31 = VEC_PACK_TRUNC_EXPR <vect__4.8_15, vect__4.9_8>;
> + vect__7.13_32 = VEC_PACK_TRUNC_EXPR <...>;
> + vect__7.12_33 = VEC_PACK_TRUNC_EXPR <...>;
> + ...
> + .LEN_STORE (_14, 16B, _40, vect__7.12_33, 0);
> + ivtmp_39 = ivtmp_38 - _40;
> + ...
> + if (ivtmp_39 != 0)
> + goto <bb 3>; [92.31%]
> + else
> + goto <bb 4>; [7.69%]
> + */
> + rgroup_controls *sub_rgc
> + = &(*controls)[nmasks / rgc->controls.length () - 1];
> + if (!sub_rgc->controls.is_empty ())
> + {
> + tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
> + vect_adjust_loop_lens_control (iv_type, &header_seq, rgc,
> + sub_rgc, NULL_TREE);
> + continue;
> + }
> + }
In other words, why is this different from what
vect_set_loop_controls_directly would do?
Thanks,
Richard
next prev parent reply other threads:[~2023-05-24 13:27 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-05-22 8:38 juzhe.zhong
2023-05-23 12:32 ` juzhe.zhong
2023-05-24 11:23 ` Richard Sandiford
2023-05-24 11:52 ` 钟居哲
2023-05-24 12:41 ` Richard Sandiford
2023-05-24 12:51 ` Richard Biener
2023-05-24 13:27 ` 钟居哲
2023-05-24 13:26 ` 钟居哲 [this message]
2023-05-24 14:01 ` Richard Sandiford
2023-05-24 14:10 ` 钟居哲
2023-05-24 14:57 ` Richard Sandiford
2023-05-24 14:58 ` 钟居哲
[not found] ` <2023052422100415521814@rivai.ai>
2023-05-24 14:11 ` 钟居哲
2023-05-24 14:22 ` 钟居哲
2023-05-24 14:35 ` 钟居哲
2023-05-24 12:19 ` 钟居哲
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=19FAF964BA45AEBD+202305242126582416128@rivai.ai \
--to=juzhe.zhong@rivai.ai \
--cc=gcc-patches@gcc.gnu.org \
--cc=rguenther@suse.de \
--cc=richard.sandiford@arm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).