From: Richard Sandiford <richard.sandiford@arm.com>
To: 钟居哲 (Juzhe Zhong)
Cc: gcc-patches, rguenther
Subject: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support
References: <20230522083814.1647787-1-juzhe.zhong@rivai.ai> <19FAF964BA45AEBD+202305242126582416128@rivai.ai>
In-Reply-To: <19FAF964BA45AEBD+202305242126582416128@rivai.ai> (钟居哲's message of "Wed, 24 May 2023 21:26:58 +0800")
Date: Wed, 24 May 2023 15:01:08 +0100

钟居哲 writes:
>>> In other words, why is this different from what
>>> vect_set_loop_controls_directly would do?
>
> Oh, I see.
> You are confused about why I do not do the multiple-rgroup vec_trunk
> handling inside "vect_set_loop_controls_directly".
>
> Well.  Frankly, I just replicate the handling of ARM SVE:
>
>       unsigned int nmasks = i + 1;
>       if (use_masks_p && (nmasks & 1) == 0)
> 	{
> 	  rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
> 	  if (!half_rgc->controls.is_empty ()
> 	      && vect_maybe_permute_loop_masks (&header_seq, rgc, half_rgc))
> 	    continue;
> 	}
>
> /* Try to use permutes to define the masks in DEST_RGM using the masks
>    in SRC_RGM, given that the former has twice as many masks as the
>    latter.  Return true on success, adding any new statements to SEQ.  */
>
> static bool
> vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
> 			       rgroup_controls *src_rgm)
> {
>   tree src_masktype = src_rgm->type;
>   tree dest_masktype = dest_rgm->type;
>   machine_mode src_mode = TYPE_MODE (src_masktype);
>   insn_code icode1, icode2;
>   if (dest_rgm->max_nscalars_per_iter <= src_rgm->max_nscalars_per_iter
>       && (icode1 = optab_handler (vec_unpacku_hi_optab,
> 				  src_mode)) != CODE_FOR_nothing
>       && (icode2 = optab_handler (vec_unpacku_lo_optab,
> 				  src_mode)) != CODE_FOR_nothing)
>     {
>       /* Unpacking the source masks gives at least as many mask bits as
> 	 we need.  We can then VIEW_CONVERT any excess bits away.  */
>       machine_mode dest_mode = insn_data[icode1].operand[0].mode;
>       gcc_assert (dest_mode == insn_data[icode2].operand[0].mode);
>       tree unpack_masktype = vect_halve_mask_nunits (src_masktype, dest_mode);
>       for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> 	{
> 	  tree src = src_rgm->controls[i / 2];
> 	  tree dest = dest_rgm->controls[i];
> 	  tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1)
> 			    ? VEC_UNPACK_HI_EXPR
> 			    : VEC_UNPACK_LO_EXPR);
> 	  gassign *stmt;
> 	  if (dest_masktype == unpack_masktype)
> 	    stmt = gimple_build_assign (dest, code, src);
> 	  else
> 	    {
> 	      tree temp = make_ssa_name (unpack_masktype);
> 	      stmt = gimple_build_assign (temp, code, src);
> 	      gimple_seq_add_stmt (seq, stmt);
> 	      stmt = gimple_build_assign (dest, VIEW_CONVERT_EXPR,
> 					  build1 (VIEW_CONVERT_EXPR,
> 						  dest_masktype, temp));
> 	    }
> 	  gimple_seq_add_stmt (seq, stmt);
> 	}
>       return true;
>     }
>
>   vec_perm_indices indices[2];
>   if (dest_masktype == src_masktype
>       && interleave_supported_p (&indices[0], src_masktype, 0)
>       && interleave_supported_p (&indices[1], src_masktype, 1))
>     {
>       /* The destination requires twice as many mask bits as the source, so
> 	 we can use interleaving permutes to double up the number of bits.  */
>       tree masks[2];
>       for (unsigned int i = 0; i < 2; ++i)
> 	masks[i] = vect_gen_perm_mask_checked (src_masktype, indices[i]);
>       for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> 	{
> 	  tree src = src_rgm->controls[i / 2];
> 	  tree dest = dest_rgm->controls[i];
> 	  gimple *stmt = gimple_build_assign (dest, VEC_PERM_EXPR,
> 					      src, src, masks[i & 1]);
> 	  gimple_seq_add_stmt (seq, stmt);
> 	}
>       return true;
>     }
>   return false;
> }
>
> I know this is just an optimization for ARM SVE, where sub_rgc (int16)
> is half the size of rgc (int8).  But when I just copied the code from
> ARM SVE and made it general for all cases (int8 <-> int64), they all
> worked well and the codegen is good.
>
> If you don't like this way, would you mind giving me some suggestions?

It's not a case of disliking one approach or disliking another.

There are two separate parts of this: one specific and one general.
The specific part is that the code had:

	  rgroup_controls *sub_rgc
	    = &(*controls)[nmasks / rgc->controls.length () - 1];
	  if (!sub_rgc->controls.is_empty ())
	    {
	      tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
	      vect_adjust_loop_lens_control (iv_type, &header_seq,
					     rgc, sub_rgc, NULL_TREE);
	      continue;
	    }

But AIUI, nmasks is always equal to rgc->controls.length ()
(if rgc->controls is non-empty).  So I think this always used
(*controls)[0] as the source rgroup.  And I think that's the only
case that would work, since vect_adjust_loop_lens_control only
reads from sub_rgc once.  Is that right?

It would be possible to make other source rgroups work, but it would
involve reading every element of sub_rgc->controls, not just
sub_rgc->controls[0].  (This is also what the SVE code does.)

The general point is that there are two extreme approaches:

(1) Use one IV, and base everything off that.

(2) Use a separate IV for every rgroup.

It's also possible to do something between these two extremes.
(And we should, if it gives better code.)

The code above reuses an existing IV.  Doing that for all multi-control
rgroups would give (1).  At the other extreme, (2) would involve calling
vect_set_loop_controls_directly for every rgroup.

Both approaches are fine.  I'm not against one or the other.

What I didn't understand was why your patch only reuses existing IVs
for max_nscalars_per_iter == 1.  Was it to avoid having to do a
multiplication (well, really a shift left) when moving from one rgroup
to another?  E.g. if one rgroup had:

    nscalars_per_iter == 2 && factor == 1

and another had:

    nscalars_per_iter == 4 && factor == 1

then we would need to multiply by 2 when going from the first rgroup
to the second.

If so, avoiding a multiplication seems like a good reason for the
choice you were making in the patch.  But we then need to check
max_nscalars_per_iter == 1 for both the source rgroup and the
destination rgroup, not just the destination.
And I think the condition for "no multiplication needed" should be
that:

    max_nscalars_per_iter * factor (aka "nitems_per_iter")

is the same for both groups.

Thanks,
Richard