From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <richard.sandiford@arm.com>
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
 by sourceware.org (Postfix) with ESMTP id 5087C385703C
 for <gcc-patches@gcc.gnu.org>; Fri, 24 Jul 2020 16:21:52 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 5087C385703C
Authentication-Results: sourceware.org;
 dmarc=none (p=none dis=none) header.from=arm.com
Authentication-Results: sourceware.org;
 spf=pass smtp.mailfrom=richard.sandiford@arm.com
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
 by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id F296D101E;
 Fri, 24 Jul 2020 09:21:51 -0700 (PDT)
Received: from localhost (e121540-lin.manchester.arm.com [10.32.98.126])
 by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 2B1D23F66F;
 Fri, 24 Jul 2020 09:21:51 -0700 (PDT)
From: Richard Sandiford <richard.sandiford@arm.com>
To: "Kewen.Lin" <linkw@linux.ibm.com>
Mail-Followup-To: "Kewen.Lin" <linkw@linux.ibm.com>,
 GCC Patches <gcc-patches@gcc.gnu.org>, Bill Schmidt <wschmidt@linux.ibm.com>,
 Segher Boessenkool <segher@kernel.crashing.org>,
 Richard Biener <richard.guenther@gmail.com>, richard.sandiford@arm.com
Cc: GCC Patches <gcc-patches@gcc.gnu.org>,
 Bill Schmidt <wschmidt@linux.ibm.com>,
 Segher Boessenkool <segher@kernel.crashing.org>,
 Richard Biener <richard.guenther@gmail.com>
Subject: Re: [PATCH v3] vect/rs6000: Support vector with length cost modeling
References: <419f1fad-05be-115c-1a53-cb710ae7b2dc@linux.ibm.com>
 <CAFiYyc07vGM7RBYmxWGwEpZuBA0vfvBMPeb+RQPMnXr+E55edw@mail.gmail.com>
 <a06e714e-04c3-8a2f-fa1d-02a72aecf7f4@linux.ibm.com>
 <mptv9ifud65.fsf@arm.com>
 <1aeabdc7-0cf4-055b-a3ec-74c283053cf5@linux.ibm.com>
 <ed2e2bdd-a207-2b6a-41cb-7348d61f24bc@linux.ibm.com>
Date: Fri, 24 Jul 2020 17:21:49 +0100
In-Reply-To: <ed2e2bdd-a207-2b6a-41cb-7348d61f24bc@linux.ibm.com> (Kewen Lin's
 message of "Thu, 23 Jul 2020 00:25:36 +0800")
Message-ID: <mpt5zacq3xe.fsf@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-8.3 required=5.0 tests=BAYES_00, KAM_DMARC_STATUS,
 SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <http://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <http://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Fri, 24 Jul 2020 16:21:54 -0000

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi,
>
> Sorry, please ignore the previously attached file, which isn't the latest=
 one
> although almost are the same.   The latest tested is attached here.=20=20
>
> Sorry for the inconvenience.
>
> BR,
> Kewen
>
> on 2020/7/22 =E4=B8=8B=E5=8D=8811:48, Kewen.Lin via Gcc-patches wrote:
>>=20
>> It's a great idea, by following your subsequent suggestion to make the s=
tructure
>> like:
>>=20
>>   - calculate peel_iters_prologue
>>   - calculate peel_iters_epilogue
>>   - add costs associated with peel_iters_prologue
>>   - add costs associated with peel_iters_epilogue
>>   - add costs related to branch taken/not_taken.
>>=20
>> the updated v3 is attached.
>>=20
>> Just bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit
>> param vect-partial-vector-usage=3D1, I'll test it without partial vectors
>> setting, also on aarch64 later.
>>=20
>> BR,
>> Kewen
>> -----
>> gcc/ChangeLog:
>>=20
>> 	* config/rs6000/rs6000.c (adjust_vect_cost_per_loop): New function.
>> 	(rs6000_finish_cost): Call adjust_vect_cost_per_loop.
>> 	* tree-vect-loop.c (vect_get_known_peeling_cost): Factor out some code
>> 	to determine peel_iters_epilogue to function ...
>> 	(vect_get_peel_iters_epilogue): ... this.  New function.
>> 	(vect_estimate_min_profitable_iters):  Add cost modeling for vector
>> 	with length, refactor cost calculation on peel_iters_prologue and
>> 	peel_iters_epilogue.
>>=20

Thanks, the rearrangement of the existing code looks good.  Could you
split that and the new LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) stuff
out into separate patches?

> -/* Calculate cost of peeling the loop PEEL_ITERS_PROLOGUE times.  */
> -int
> -vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_pr=
ologue,
> -                             int *peel_iters_epilogue,
> -                             stmt_vector_for_cost *scalar_cost_vec,
> -			     stmt_vector_for_cost *prologue_cost_vec,
> -			     stmt_vector_for_cost *epilogue_cost_vec)
> +/* Calculate how many iterations peeled for epilogue with information LO=
OP_VINFO
> +   and PEEL_ITERS_PROLOGUE.  */

Maybe:

/* Estimate the number of peeled epilogue iterations for LOOP_VINFO.
   PEEL_ITERS_PROLOGUE is the number of peeled prologue iterations,
   or -1 if not known.  */

> [=E2=80=A6]
> -      peel_iters_prologue =3D niters < peel_iters_prologue ?
> -                            niters : peel_iters_prologue;
> -      *peel_iters_epilogue =3D (niters - peel_iters_prologue) % assumed_=
vf;
> +      int npeel_prol
> +	=3D niters < peel_iters_prologue ? niters : peel_iters_prologue;
> +      int npeel_epil =3D (niters - npeel_prol) % assumed_vf;

Here and in the rest of the patch, I think we should stick with
the =E2=80=9Cogue=E2=80=9D rather than shorten it this much.

This is a comment about the original code, but it seems simpler to
write as either:

  if (peel_iters_prologue > niters)
    peel_iters_prologue =3D niters;

or:

  peel_iters_prologue =3D MIN (niters, peel_iters_prologue);

> [=E2=80=A6]
> +      /* Refer to the functions vect_set_loop_condition_partial_vectors

s/Refer/Referring/

> +	 and vect_set_loop_controls_directly, we need to generate each
> +	 length in the prologue and in the loop body if required. Although
> +	 there are some possible optimization, we consider the worst case

optimizations

> +	 here.  */
> +
> +      /* For now we only operate length-based partial vectors on Power,
> +	 which has constant VF all the time, we need some tweakings below
> +	 if it doesn't hold in future.  */
> +      gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ());

Where do you rely on this?  There didn't seem to be any obvious
to_constant uses.  Since this is =E2=80=9Conly=E2=80=9D a cost calculation,=
 we should
be using assumed_vf.

> +
> +      /* For wrap around checking.  */
> +      tree compare_type =3D LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
> +      unsigned int compare_precision =3D TYPE_PRECISION (compare_type);
> +      widest_int iv_limit =3D vect_iv_limit_for_partial_vectors (loop_vi=
nfo);
> +
> +      bool niters_known_p =3D LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
> +      bool need_iterate_p
> +	=3D (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> +	   && !vect_known_niters_smaller_than_vf (loop_vinfo));
> +
> +      /* Init min/max, shift and minus cost relative to single
> +	 scalar_stmt. For now we only use length-based partial vectors on
> +	 Power, target specific cost tweaking may be needed for other
> +	 ports in future.  */
> +      unsigned int min_max_cost =3D 2;
> +      unsigned int shift_cost =3D 1, minus_cost =3D 1;
> +
> +      /* Init cost relative to single scalar_stmt.  */
> +      unsigned int prol_cnt =3D 0;
> +      unsigned int body_cnt =3D 0;
> +
> +      rgroup_controls *rgc;
> +      unsigned int num_vectors_m1;
> +      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rg=
c)
> +	if (rgc->type)
> +	  {
> +	    unsigned nitems =3D rgc->max_nscalars_per_iter * rgc->factor;
>=20=20
> -      /* If peeled iterations are unknown, count a taken branch and a no=
t taken
> -         branch per peeled loop. Even if scalar loop iterations are know=
n,
> -         vector iterations are not known since peeled prologue iteration=
s are
> -         not known. Hence guards remain the same.  */
> -      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch=
_taken,
> -			    NULL, NULL_TREE, 0, vect_prologue);
> -      (void) add_stmt_cost (loop_vinfo,
> -			    target_cost_data, 1, cond_branch_not_taken,
> -			    NULL, NULL_TREE, 0, vect_prologue);
> -      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch=
_taken,
> -			    NULL, NULL_TREE, 0, vect_epilogue);
> -      (void) add_stmt_cost (loop_vinfo,
> -			    target_cost_data, 1, cond_branch_not_taken,
> -			    NULL, NULL_TREE, 0, vect_epilogue);
> -      stmt_info_for_cost *si;
> -      int j;
> -      FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j=
, si)
> -	{
> -	  (void) add_stmt_cost (loop_vinfo, target_cost_data,
> -				si->count * peel_iters_prologue,
> -				si->kind, si->stmt_info, si->vectype,
> -				si->misalign,
> -				vect_prologue);
> -	  (void) add_stmt_cost (loop_vinfo, target_cost_data,
> -				si->count * peel_iters_epilogue,
> -				si->kind, si->stmt_info, si->vectype,
> -				si->misalign,
> -				vect_epilogue);
> -	}
> -    }
> -  else
> -    {
> -      stmt_vector_for_cost prologue_cost_vec, epilogue_cost_vec;
> -      stmt_info_for_cost *si;
> -      int j;
> -      void *data =3D LOOP_VINFO_TARGET_COST_DATA (loop_vinfo);
> +	    /* Need one shift for niters_total computation.  */
> +	    if (!niters_known_p && nitems !=3D 1)
> +	      prol_cnt +=3D shift_cost;
>=20=20
> -      prologue_cost_vec.create (2);
> -      epilogue_cost_vec.create (2);
> -      peel_iters_prologue =3D npeel;
> +	    /* Need to handle wrap around.  */
> +	    if (iv_limit =3D=3D -1
> +		|| (wi::min_precision (iv_limit * nitems, UNSIGNED)
> +		    > compare_precision))
> +	      prol_cnt +=3D (min_max_cost + minus_cost);
>=20=20
> -      (void) vect_get_known_peeling_cost (loop_vinfo, peel_iters_prologu=
e,
> -					  &peel_iters_epilogue,
> -					  &LOOP_VINFO_SCALAR_ITERATION_COST
> -					    (loop_vinfo),
> -					  &prologue_cost_vec,
> -					  &epilogue_cost_vec);
> +	    /* Need to handle batch limit excepting for the 1st one.  */
> +	    prol_cnt +=3D (min_max_cost + minus_cost) * num_vectors_m1;
>=20=20
> -      FOR_EACH_VEC_ELT (prologue_cost_vec, j, si)
> -	(void) add_stmt_cost (loop_vinfo,
> -			      data, si->count, si->kind, si->stmt_info,
> -			      si->vectype, si->misalign, vect_prologue);
> +	    unsigned int num_vectors =3D num_vectors_m1 + 1;
> +	    /* Need to set up lengths in prologue, only one MIN required
> +	       since start index is zero.  */
> +	    prol_cnt +=3D min_max_cost * num_vectors;
>=20=20
> -      FOR_EACH_VEC_ELT (epilogue_cost_vec, j, si)
> -	(void) add_stmt_cost (loop_vinfo,
> -			      data, si->count, si->kind, si->stmt_info,
> -			      si->vectype, si->misalign, vect_epilogue);
> +	    /* Need to update lengths in body for next iteration.  */
> +	    if (need_iterate_p)
> +	      body_cnt +=3D (2 * min_max_cost + minus_cost) * num_vectors;
> +	  }
>=20=20
> -      prologue_cost_vec.release ();
> -      epilogue_cost_vec.release ();
> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt, scal=
ar_stmt,
> +			    NULL, NULL_TREE, 0, vect_prologue);
> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt, scal=
ar_stmt,
> +			    NULL, NULL_TREE, 0, vect_body);

IMO this seems to be reproducing too much of the functions that you
referred to.  And the danger with that is that they could easily
get out of sync later.

Thanks,
Richard