From: Richard Sandiford <richard.sandiford@arm.com>
To: 钟居哲 (Juzhe Zhong)
Cc: gcc-patches, rguenther
Subject: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support
References: <20230522083814.1647787-1-juzhe.zhong@rivai.ai> <19FAF964BA45AEBD+202305242126582416128@rivai.ai>
In-Reply-To: <19FAF964BA45AEBD+202305242126582416128@rivai.ai> (钟居哲's message of "Wed, 24 May 2023 21:26:58 +0800")
Date: Wed, 24 May 2023 15:01:08 +0100

钟居哲 writes:
>>> In other words, why is this different from what
>>> vect_set_loop_controls_directly would do?
>
> Oh, I see.
> You are confused about why I do not do the multiple-rgroup vec_trunk
> handling inside "vect_set_loop_controls_directly".
>
> Well.  Frankly, I just replicate the handling of ARM SVE:
>
>       unsigned int nmasks = i + 1;
>       if (use_masks_p && (nmasks & 1) == 0)
> 	{
> 	  rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
> 	  if (!half_rgc->controls.is_empty ()
> 	      && vect_maybe_permute_loop_masks (&header_seq, rgc, half_rgc))
> 	    continue;
> 	}
>
> /* Try to use permutes to define the masks in DEST_RGM using the masks
>    in SRC_RGM, given that the former has twice as many masks as the
>    latter.  Return true on success, adding any new statements to SEQ.  */
>
> static bool
> vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
> 			       rgroup_controls *src_rgm)
> {
>   tree src_masktype = src_rgm->type;
>   tree dest_masktype = dest_rgm->type;
>   machine_mode src_mode = TYPE_MODE (src_masktype);
>   insn_code icode1, icode2;
>   if (dest_rgm->max_nscalars_per_iter <= src_rgm->max_nscalars_per_iter
>       && (icode1 = optab_handler (vec_unpacku_hi_optab,
> 				  src_mode)) != CODE_FOR_nothing
>       && (icode2 = optab_handler (vec_unpacku_lo_optab,
> 				  src_mode)) != CODE_FOR_nothing)
>     {
>       /* Unpacking the source masks gives at least as many mask bits as
> 	 we need.  We can then VIEW_CONVERT any excess bits away.  */
>       machine_mode dest_mode = insn_data[icode1].operand[0].mode;
>       gcc_assert (dest_mode == insn_data[icode2].operand[0].mode);
>       tree unpack_masktype = vect_halve_mask_nunits (src_masktype, dest_mode);
>       for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> 	{
> 	  tree src = src_rgm->controls[i / 2];
> 	  tree dest = dest_rgm->controls[i];
> 	  tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1)
> 			    ? VEC_UNPACK_HI_EXPR
> 			    : VEC_UNPACK_LO_EXPR);
> 	  gassign *stmt;
> 	  if (dest_masktype == unpack_masktype)
> 	    stmt = gimple_build_assign (dest, code, src);
> 	  else
> 	    {
> 	      tree temp = make_ssa_name (unpack_masktype);
> 	      stmt = gimple_build_assign (temp, code, src);
> 	      gimple_seq_add_stmt (seq, stmt);
> 	      stmt = gimple_build_assign (dest, VIEW_CONVERT_EXPR,
> 					  build1 (VIEW_CONVERT_EXPR,
> 						  dest_masktype, temp));
> 	    }
> 	  gimple_seq_add_stmt (seq, stmt);
> 	}
>       return true;
>     }
>
>   vec_perm_indices indices[2];
>   if (dest_masktype == src_masktype
>       && interleave_supported_p (&indices[0], src_masktype, 0)
>       && interleave_supported_p (&indices[1], src_masktype, 1))
>     {
>       /* The destination requires twice as many mask bits as the source, so
> 	 we can use interleaving permutes to double up the number of bits.  */
>       tree masks[2];
>       for (unsigned int i = 0; i < 2; ++i)
> 	masks[i] = vect_gen_perm_mask_checked (src_masktype, indices[i]);
>       for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> 	{
> 	  tree src = src_rgm->controls[i / 2];
> 	  tree dest = dest_rgm->controls[i];
> 	  gimple *stmt = gimple_build_assign (dest, VEC_PERM_EXPR,
> 					      src, src, masks[i & 1]);
> 	  gimple_seq_add_stmt (seq, stmt);
> 	}
>       return true;
>     }
>   return false;
> }
>
> I know this is just an optimization for ARM SVE, where sub_rgc (int16)
> is half the size of rgc (int8).  But when I just copied the code from
> ARM SVE and made it general for all cases (int8 <-> int64), they all
> worked well and the codegen is good.
>
> If you don't like this way, would you mind giving me some suggestions?

It's not a case of disliking one approach or disliking another.

There are two separate parts of this: one specific and one general.
The specific part is that the code had:

	  rgroup_controls *sub_rgc
	    = &(*controls)[nmasks / rgc->controls.length () - 1];
	  if (!sub_rgc->controls.is_empty ())
	    {
	      tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
	      vect_adjust_loop_lens_control (iv_type, &header_seq,
					     rgc, sub_rgc, NULL_TREE);
	      continue;
	    }

But AIUI, nmasks is always equal to rgc->controls.length ()
(if rgc->controls is non-empty).  So I think this always used
(*controls)[0] as the source rgroup.  And I think that's the only
case that would work, since vect_adjust_loop_lens_control only
reads from sub_rgc once.  Is that right?

It would be possible to make other source rgroups work, but it would
involve reading every element of sub_rgc->controls, not just
sub_rgc->controls[0].  (This is also what the SVE code does.)

The general point is that there are two extreme approaches:

(1) Use one IV, and base everything off that.

(2) Use a separate IV for every rgroup.

It's also possible to do something between these two extremes.
(And we should, if it gives better code.)

The code above reuses an existing IV.  Doing that for all multi-control
rgroups would give (1).  At the other extreme, (2) would involve calling
vect_set_loop_controls_directly for every rgroup.

Both approaches are fine.  I'm not against one or the other.

What I didn't understand was why your patch only reuses existing IVs
for max_nscalars_per_iter == 1.  Was it to avoid having to do a
multiplication (well, really a shift left) when moving from one rgroup
to another?  E.g. if one rgroup had:

    nscalars_per_iter == 2 && factor == 1

and another had:

    nscalars_per_iter == 4 && factor == 1

then we would need to multiply by 2 when going from the first rgroup
to the second.

If so, avoiding a multiplication seems like a good reason for the
choice you were making in the patch.  But we then need to check
max_nscalars_per_iter == 1 for both the source rgroup and the
destination rgroup, not just the destination.
And I think the condition for "no multiplication needed" should be
that:

    max_nscalars_per_iter * factor (aka "nitems_per_iter")

is the same for both groups.

Thanks,
Richard