From: Richard Sandiford
To: Richard Biener
Cc: "Kewen.Lin", GCC Patches
Subject: Re: [PATCH] vect: Move VMAT_GATHER_SCATTER handlings from final loop nest
Date: Tue, 15 Aug 2023 09:44:29 +0100
In-Reply-To: (Richard Biener's message of "Tue, 15 Aug 2023 09:53:20 +0200")

Richard Biener writes:
> On Tue, Aug 15, 2023 at 4:44 AM Kewen.Lin wrote:
>>
>> on 2023/8/14 22:16, Richard Sandiford wrote:
>> > No, it was more that 219-142=77, so it seems like a lot of lines
>> > are being duplicated rather than simply being moved.
(Unlike for
>> > VMAT_LOAD_STORE_LANES, which was even a slight LOC saving, and so
>> > was a clear improvement.)
>> >
>> > So I was just wondering if there was any obvious factoring-out that
>> > could be done to reduce the duplication.
>>
>> ah, thanks for the clarification!
>>
>> I think the main duplication is in the loop body beginning and end;
>> let's look at them in detail:
>>
>> +  if (memory_access_type == VMAT_GATHER_SCATTER)
>> +    {
>> +      gcc_assert (alignment_support_scheme == dr_aligned
>> +		  || alignment_support_scheme == dr_unaligned_supported);
>> +      gcc_assert (!grouped_load && !slp_perm);
>> +
>> +      unsigned int inside_cost = 0, prologue_cost = 0;
>>
>> // These above are newly added.
>>
>> +      for (j = 0; j < ncopies; j++)
>> +	{
>> +	  /* 1. Create the vector or array pointer update chain.  */
>> +	  if (j == 0 && !costing_p)
>> +	    {
>> +	      if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> +		vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
>> +					     slp_node, &gs_info, &dataref_ptr,
>> +					     &vec_offsets);
>> +	      else
>> +		dataref_ptr
>> +		  = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
>> +					      at_loop, offset, &dummy, gsi,
>> +					      &ptr_incr, false, bump);
>> +	    }
>> +	  else if (!costing_p)
>> +	    {
>> +	      gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
>> +	      if (!STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> +		dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
>> +					       gsi, stmt_info, bump);
>> +	    }
>>
>> // These are for dataref_ptr; in the final loop nest we deal with more cases
>> on simd_lane_access_p and diff_first_stmt_info, but don't handle
>> STMT_VINFO_GATHER_SCATTER_P any more.  Very little (one case) can be shared
>> between them, so IMHO factoring it out seems like overkill.
>>
>> +	  if (mask && !costing_p)
>> +	    vec_mask = vec_masks[j];
>>
>> // It's merged out from the j == 0 and j != 0 arms.
>>
>> +	  gimple *new_stmt = NULL;
>> +	  for (i = 0; i < vec_num; i++)
>> +	    {
>> +	      tree final_mask = NULL_TREE;
>> +	      tree final_len = NULL_TREE;
>> +	      tree bias = NULL_TREE;
>> +	      if (!costing_p)
>> +		{
>> +		  if (loop_masks)
>> +		    final_mask
>> +		      = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
>> +					    vec_num * ncopies, vectype,
>> +					    vec_num * j + i);
>> +		  if (vec_mask)
>> +		    final_mask = prepare_vec_mask (loop_vinfo, mask_vectype,
>> +						   final_mask, vec_mask, gsi);
>> +
>> +		  if (i > 0 && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> +		    dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
>> +						   gsi, stmt_info, bump);
>> +		}
>>
>> // This part is directly copied from the original; the original gets updated
>> by removing && !STMT_VINFO_GATHER_SCATTER_P.  Due to its size, I didn't
>> consider this before -- do you prefer me to factor this part out?
>>
>> +	      if (gs_info.ifn != IFN_LAST)
>> +		{
>> ...
>> +		}
>> +	      else
>> +		{
>> +		  /* Emulated gather-scatter.  */
>> ...
>>
>> // This part is just moved from the original.
>>
>> +	      vec_dest = vect_create_destination_var (scalar_dest, vectype);
>> +	      /* DATA_REF is null if we've already built the statement.  */
>> +	      if (data_ref)
>> +		{
>> +		  vect_copy_ref_info (data_ref, DR_REF (first_dr_info->dr));
>> +		  new_stmt = gimple_build_assign (vec_dest, data_ref);
>> +		}
>> +	      new_temp = make_ssa_name (vec_dest, new_stmt);
>> +	      gimple_set_lhs (new_stmt, new_temp);
>> +	      vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
>> +
>> +	      /* Store vector loads in the corresponding SLP_NODE.
>> +		 */
>> +	      if (slp)
>> +		slp_node->push_vec_def (new_stmt);
>> +
>> +	      if (!slp && !costing_p)
>> +		STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
>> +	    }
>> +
>> +	  if (!slp && !costing_p)
>> +	    *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
>>
>> // This part is some subsequent handling; it's duplicated from the original
>> but drops some more useless code.  I guess this part is not worth
>> being factored out?
>>
>> +      if (costing_p)
>> +	{
>> +	  if (dump_enabled_p ())
>> +	    dump_printf_loc (MSG_NOTE, vect_location,
>> +			     "vect_model_load_cost: inside_cost = %u, "
>> +			     "prologue_cost = %u .\n",
>> +			     inside_cost, prologue_cost);
>> +	}
>> +      return true;
>> +    }
>>
>> // Duplicating the dumping; I guess it's unnecessary to factor it out.
>>
>> oh, I just noticed that this should be shortened to
>> "if (costing_p && dump_enabled_p ())" instead, just the same as what's
>> adopted for the VMAT_LOAD_STORE_LANES dumping.
>
> Just to mention, the original motivational idea was that even though we
> duplicate some code, we make it overall more readable and thus
> maintainable.

Not sure I necessarily agree with the "thus".  Maybe it tends to be true
with a good, well-factored API.  But the internal vector APIs make it
extremely easy to get things wrong.  If we have multiple copies of the
same operation, the tendency is to fix bugs in the copy that the bugs
were seen in.  It's easy to forget that other copies exist elsewhere
that probably need updating in the same way.

> In the end we might have vectorizable_load () for analysis but have not
> only load_vec_info_type but one for each VMAT_*, which means multiple
> separate vect_transform_load () functions.  Currently vectorizable_load
> is structured very inconsistently; having the transforms all hang off a
> single switch (vmat-kind) {} would be an improvement IMHO.

Yeah, agree vectorizable_load ought to be refactored.
> But sure, some of our internal APIs are verbose and maybe badly
> factored; any improvement there is welcome.  Inventing new random APIs
> just to save a few lines of code without actually making the code more
> readable is IMHO bad.

OK, fair enough.  So the idea is: see where we end up and then try to
improve/factor the APIs in a less peephole way?

Thanks,
Richard

> But, if we can for example enhance prepare_vec_mask to handle both the
> loop and the conditional mask, and handle querying the mask, that would
> be fine (of course you need to check all uses to see if that makes
> sense).
>
> Richard.
>
>> BR,
>> Kewen