From mboxrd@z Thu Jan  1 00:00:00 1970
From: Richard Sandiford <richard.sandiford@arm.com>
To: =?utf-8?B?6ZKf5bGF5ZOy?= <juzhe.zhong@rivai.ai>
Mail-Followup-To: =?utf-8?B?6ZKf5bGF5ZOy?=
 <juzhe.zhong@rivai.ai>,gcc-patches <gcc-patches@gcc.gnu.org>,  rguenther
 <rguenther@suse.de>, richard.sandiford@arm.com
Cc: gcc-patches <gcc-patches@gcc.gnu.org>,  rguenther <rguenther@suse.de>
Subject: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support
References: <20230522083814.1647787-1-juzhe.zhong@rivai.ai>
	<mpt7csyht2b.fsf@arm.com>
	<E89F6BFE64A78D84+202305241952507669172@rivai.ai>
Date: Wed, 24 May 2023 13:41:52 +0100
In-Reply-To: <E89F6BFE64A78D84+202305241952507669172@rivai.ai>
 (=?utf-8?B?IumSn+WxheWTsiIncw==?=
	message of "Wed, 24 May 2023 19:52:51 +0800")
Message-ID: <mptedn5hpf3.fsf@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
List-Id: <gcc-patches.gcc.gnu.org>

Sorry, I realised later that I had an implicit assumption here:
if there are multiple rgroups, it's better to have a single IV
for the smallest rgroup and scale that up to bigger rgroups.

E.g. if the loop control IV is taken from an N-control rgroup
and has a step S, an N*M-control rgroup would be based on M*S.

Of course, it's also OK to create multiple IVs if you prefer.
It's just a question of which approach gives the best output
in practice.

Another way of going from an N-control rgroup ("G1") to an N*M-control
rgroup ("G2") would be to reuse all N controls from G1.  E.g. the
first M controls in G2 would come from G1[0], the next M from
G1[1], etc.  That might lower the longest dependency chain.
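
To make the G1 -> G2 idea concrete, here is a toy model rather than
GCC code.  It assumes that lengths are plain scalar counts, that every
G1 control covers up to cap1 scalars, and that every G2 control covers
up to cap1 / m scalars.  The name split_from_g1 and its parameters are
made up for the illustration.

```cpp
#include <algorithm>
#include <vector>

// Toy model, not GCC code: derive the N*M lengths of G2 from the N
// lengths of G1.  Assumes cap1 is an exact multiple of m.
std::vector<unsigned> split_from_g1(const std::vector<unsigned> &g1,
                                    unsigned cap1, unsigned m)
{
  const unsigned cap2 = cap1 / m;  // capacity of one G2 control
  std::vector<unsigned> g2;
  g2.reserve(g1.size() * m);
  for (unsigned len : g1)
    for (unsigned k = 0; k < m; ++k)
      {
        // Scalars of this G1 control already covered by earlier G2 controls.
        unsigned used = std::min(len, k * cap2);
        g2.push_back(std::min(len - used, cap2));
      }
  return g2;
}
```

With g1 = {8, 5}, cap1 = 8 and m = 2 this gives {4, 4, 4, 1}.  Each G2
length is computed from exactly one G1 length, so no G2 length depends
on another G2 length.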

But whatever we do, it doesn't feel like max_nscalars_per_iter
should be part of the decision.  (I realise it will be part of
the decision for the follow-on SELECT_IV patch.  But that's
because we require the number of elements processed in each
iteration to be a multiple of max_nscalars_per_iter, and AIUI
SELECT_IV wouldn't guarantee that.  max_nscalars_per_iter shouldn't
matter for the current patch though.)

钟居哲 <juzhe.zhong@rivai.ai> writes:
> Hi, Richard.  It's quite complicated for me and I am not sure whether I can catch up with you.
> So I will rather split the work step by step to  implement the decrement IV
>
> For the first step you mentioned:
>
>>> (1) In vect_set_loop_condition_partial_vectors, for the first iteration of:
>
>>>   FOR_EACH_VEC_ELT (*controls, i, rgc)
>>>     if (!rgc->controls.is_empty ())
>
>>> call vect_set_loop_controls_directly.  That is:
>
>>>   /* See whether zero-based IV would ever generate all-false masks
>>>      or zero length before wrapping around.  */
>>>   bool might_wrap_p = vect_rgroup_iv_might_wrap_p (loop_vinfo, rgc);
>>>
>>>   /* Set up all controls for this group.  */
>>>   test_ctrl = vect_set_loop_controls_directly (loop, loop_vinfo,
>>>                                                &preheader_seq,
>>>                                                &header_seq,
>>>                                                loop_cond_gsi, rgc,
>>>                                                niters, niters_skip,
>>>                                                might_wrap_p);
>
>>> needs to be an "if" that (for LOOP_VINFO_USING_DECREMENTING_IV_P)
>>> is only executed on the first iteration.
>
> Is it correct like this?
>
>   FOR_EACH_VEC_ELT (*controls, i, rgc)
>     if (!rgc->controls.is_empty ())
>       {
>         /* First try using permutes.  This adds a single vector
>            instruction to the loop for each mask, but needs no extra
>            loop invariants or IVs.  */
>         unsigned int nmasks = i + 1;
>         if (use_masks_p && (nmasks & 1) == 0)
>           {
>             rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
>             if (!half_rgc->controls.is_empty ()
>                 && vect_maybe_permute_loop_masks (&header_seq, rgc, half_rgc))
>               continue;
>           }
>
>         /* See whether zero-based IV would ever generate all-false masks
>            or zero length before wrapping around.  */
>         bool might_wrap_p = vect_rgroup_iv_might_wrap_p (loop_vinfo, rgc);
>
>         /* Set up all controls for this group.  */
>         test_ctrl = vect_set_loop_controls_directly (loop, loop_vinfo,
>                                                      &preheader_seq,
>                                                      &header_seq,
>                                                      loop_cond_gsi, rgc,
>                                                      niters, niters_skip,
>                                                      might_wrap_p);
>
>         /* Decrement IV only run vect_set_loop_controls_directly once.  */
>         if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
>           break;
>       }

I meant something like:

  FOR_EACH_VEC_ELT (*controls, i, rgc)
    if (!rgc->controls.is_empty ())
      {
        /* First try using permutes.  This adds a single vector
           instruction to the loop for each mask, but needs no extra
           loop invariants or IVs.  */
        unsigned int nmasks = i + 1;
        if (use_masks_p && (nmasks & 1) == 0)
          {
            rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
            if (!half_rgc->controls.is_empty ()
                && vect_maybe_permute_loop_masks (&header_seq, rgc, half_rgc))
              continue;
          }

        if (!LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo)
            || !LOOP_VINFO_DECREMENTING_IV_STEP (loop_vinfo))
          {
            /* See whether zero-based IV would ever generate all-false masks
               or zero length before wrapping around.  */
            bool might_wrap_p = vect_rgroup_iv_might_wrap_p (loop_vinfo, rgc);

            /* Set up all controls for this group.  */
            test_ctrl = vect_set_loop_controls_directly (loop, loop_vinfo,
                                                         &preheader_seq,
                                                         &header_seq,
                                                         loop_cond_gsi, rgc,
                                                         niters, niters_skip,
                                                         might_wrap_p);
          }

        if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo)
            && rgc->controls.length () > 1)
          ...use vect_adjust_loop_lens_control...
      }

where LOOP_VINFO_DECREMENTING_IV_STEP (loop_vinfo) is "S" from my
previous review.

vect_set_loop_controls_directly would then set
LOOP_VINFO_DECREMENTING_IV_STEP but would not call
vect_adjust_loop_lens_control.

But like I say, this is all based on the assumption that we should
have a single IV and scale it up for later rgroups.  If you'd prefer
separate IVs then that's fine.  But then I think it's less clear
why we have:

> +	if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo)
> +	    && rgc->max_nscalars_per_iter == 1
> +	    && rgc != &LOOP_VINFO_LENS (loop_vinfo)[0])
> +	  {
> +	    /* Multiple rgroup (non-SLP):
> +	      ...
> +	      _38 = (unsigned long) n_12(D);
> +	      ...
> +	      # ivtmp_38 = PHI <ivtmp_39(3), 100(2)>
> +	      ...
> +	      _40 = MIN_EXPR <ivtmp_38, POLY_INT_CST [8, 8]>;
> +	      loop_len_21 = MIN_EXPR <_40, POLY_INT_CST [2, 2]>;
> +	      _41 = _40 - loop_len_21;
> +	      loop_len_20 = MIN_EXPR <_41, POLY_INT_CST [2, 2]>;
> +	      _42 = _40 - loop_len_20;
> +	      loop_len_19 = MIN_EXPR <_42, POLY_INT_CST [2, 2]>;
> +	      _43 = _40 - loop_len_19;
> +	      loop_len_16 = MIN_EXPR <_43, POLY_INT_CST [2, 2]>;
> +	      ...
> +	      vect__4.8_15 = .LEN_LOAD (_6, 64B, loop_len_21, 0);
> +	      ...
> +	      vect__4.9_8 = .LEN_LOAD (_13, 64B, loop_len_20, 0);
> +	      ...
> +	      vect__4.10_28 = .LEN_LOAD (_46, 64B, loop_len_19, 0);
> +	      ...
> +	      vect__4.11_30 = .LEN_LOAD (_49, 64B, loop_len_16, 0);
> +	      vect__7.13_31 = VEC_PACK_TRUNC_EXPR <vect__4.8_15, vect__4.9_8>;
> +	      vect__7.13_32 = VEC_PACK_TRUNC_EXPR <...>;
> +	      vect__7.12_33 = VEC_PACK_TRUNC_EXPR <...>;
> +	      ...
> +	      .LEN_STORE (_14, 16B, _40, vect__7.12_33, 0);
> +	      ivtmp_39 = ivtmp_38 - _40;
> +	      ...
> +	      if (ivtmp_39 != 0)
> +		goto <bb 3>; [92.31%]
> +	      else
> +		goto <bb 4>; [7.69%]
> +	    */
> +	    rgroup_controls *sub_rgc
> +	      = &(*controls)[nmasks / rgc->controls.length () - 1];
> +	    if (!sub_rgc->controls.is_empty ())
> +	      {
> +		tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
> +		vect_adjust_loop_lens_control (iv_type, &header_seq, rgc,
> +					       sub_rgc, NULL_TREE);
> +		continue;
> +	      }
> +	  }

In other words, why is this different from what
vect_set_loop_controls_directly would do?
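
For reference, here is a toy model (again not GCC code) of splitting
one total length across several controls.  It assumes plain integers in
place of poly-ints, with total playing the role of _40 and cap the role
of POLY_INT_CST [2, 2] in the quoted comment, and it assumes that each
later length is taken from whatever the earlier controls leave over.
The name split_total is made up for the illustration.

```cpp
#include <algorithm>
#include <vector>

// Toy model, not GCC code: split one total length across ncontrols
// controls that can each cover at most cap scalars.
std::vector<unsigned> split_total(unsigned total, unsigned cap,
                                  unsigned ncontrols)
{
  std::vector<unsigned> lens;
  unsigned remaining = total;
  for (unsigned i = 0; i < ncontrols; ++i)
    {
      unsigned len = std::min(remaining, cap);
      lens.push_back(len);
      remaining -= len;  // later controls only see what is left
    }
  return lens;
}
```

A total of 8 with cap 2 and four controls gives {2, 2, 2, 2}, and a
total of 3 gives {2, 1, 0, 0}.  Each length here depends on the one
before it, so the chain grows with the number of controls.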

Thanks,
Richard