From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 2901 invoked by alias); 11 Dec 2014 15:59:29 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 2887 invoked by uid 89); 11 Dec 2014 15:59:28 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-1.9 required=5.0 tests=AWL,BAYES_00,SPF_PASS autolearn=ham version=3.3.2 X-HELO: service87.mimecast.com Received: from service87.mimecast.com (HELO service87.mimecast.com) (91.220.42.44) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Thu, 11 Dec 2014 15:59:25 +0000 Received: from cam-owa1.Emea.Arm.com (fw-tnat.cambridge.arm.com [217.140.96.140]) by service87.mimecast.com; Thu, 11 Dec 2014 15:59:21 +0000 Received: from [10.1.209.51] ([10.1.255.212]) by cam-owa1.Emea.Arm.com with Microsoft SMTPSVC(6.0.3790.3959); Thu, 11 Dec 2014 15:59:20 +0000 Message-ID: <5489BF58.9090008@arm.com> Date: Thu, 11 Dec 2014 15:59:00 -0000 From: Alan Lawrence User-Agent: Thunderbird 2.0.0.24 (X11/20101213) MIME-Version: 1.0 To: Michael Meissner , Segher Boessenkool CC: "gcc-patches@gcc.gnu.org" Subject: Ping: Re: [PATCH 10/11][RS6000] Migrate reduction optabs to reduc_..._scal In-Reply-To: <5463ABD0.8020004@arm.com> X-MC-Unique: 114121115592104301 Content-Type: multipart/mixed; boundary="------------020100090706070703040306" X-IsSubscribed: yes X-SW-Source: 2014-12/txt/msg01024.txt.bz2 This is a multi-part message in MIME format. --------------020100090706070703040306 Content-Type: text/plain; charset=WINDOWS-1252; format=flowed Content-Transfer-Encoding: quoted-printable Content-length: 6118 So I'm afraid I'm not going to get involved in a discussion about=20 CANNOT_CHANGE_MODE_CLASS on RS6000, and what you might want to do there - s= orry,=20 but I don't think I can really contribute anything there. However, I *am* t= rying=20 to migrate all platforms off the old reduc_xxx optabs to the new version=20 producing a scalar. Hence, can I ping the attached patch (which is just a simple combination of= the=20 previously-posted patch + snippet)? No regressions on gcc112.fsffrance.org. This works in exactly the same way as the old code path, with a second insn= to=20 pull the scalar result out of the reduction, just as the expander would hav= e=20 done (or the bitfieldref before that), and avoiding the v2df combine patter= n=20 (again, as previously). gcc/ChangeLog: * config/rs6000/altivec.md (reduc_splus_): Rename to... (reduc_plus_scal_): ...this, add rs6000_expand_vector_extract. (reduc_uplus_v16qi): Remove. * config/rs6000/vector.md (VEC_reduc_name): change "splus" to "plus" (reduc__v2df): Remove. (reduc__scal_v2df): New. (reduc__v4sf): Rename to... (reduc__scal_v4sf): ...this, wrap VEC_reduc in a vec_select of element 3, add scratch register. > Have run check-gcc on gcc110.fsffrance.org (powerpc64-unknown-linux-gnu) = using this snippet on top of original patch; no regressions. >=20 >=20 > Alan Lawrence wrote: >=20 > So I'm no expert on RS6000 here, but following on from Segher's obser= vation about the change in pattern...so the difference in 'expand' is exact= ly that, a vsx_reduc_splus_v2df followed by a vec_extract to DF, becomes a = vsx_reduc_splus_v2df_scalar - as I expected the combiner to produce by comb= ining the two previous insns. >=20 >=20 > However, inspecting the logs from -fdump-rtl-combine-all, *without* m= y patch, when the combiner tries to put those two together, I see: >=20 >=20 > Trying 30 -> 31: > Failed to match this instruction: > (set (reg:DF 179 [ stmp_s_5.7D.2196 ]) > (vec_select:DF (plus:V2DF (vec_select:V2DF (reg:V2DF 173 [ vect_= s_5.6D.2195 ]) > (parallel [ > (const_int 1 [0x1]) > (const_int 0 [0]) > ])) > (reg:V2DF 173 [ vect_s_5.6D.2195 ])) > (parallel [ > (const_int 1 [0x1]) > ]))) >=20 > That is, it looks like combine_simplify_rtx has transformed the (vec_= concat (vec_select ... 1) (vec_select ... 0)) from the vsx_reduc_plus_v2df = insn, into a single vec_select, which does not match the vsx_reduc_plus_v2d= f_scalar insn. >=20 >=20 > So despite the comment (in vsx.md): >=20 > ;; Combiner patterns with the vector reduction patterns that knows we= can get > ;; to the top element of the V2DF array without doing an extract. >=20 > It looks like the code generation prior to my patch, considered bette= r, was because the combiner didn't actually use the pattern? >=20 >=20 > In that case whilst you may want to dig into register allocation, can= not_change_mode_class, etc., for other reasons, I think the best fix for mi= grating to reduc_plus_scal... is simply to avoid using the "Combiner" patte= rns and just emit two insns, the old pattern followed by a vec_extract. The= attached snippet does this (I won't call it a patch yet, and it applies on= top of the previous patch - I went the route of calling the two gen functi= ons rather than copying their RTL sequences, but could do the latter if tha= t were preferable???), and restores code generation to the original form on= your example above; it bootstraps OK but I'm still running check-gcc on th= e Compile Farm... >=20 >=20 > However, again on your example above, I note that if I *remove* the r= educ_plus_scal_v2df pattern altogether, I get: >=20 >=20 > .sum: > li 10,512 # 52 *movdi_internal64/4 [length =3D = 4] > ld 9,.LC2@toc(2) # 20 *movdi_internal64/2 [len= gth =3D 4] > xxlxor 0,0,0 # 17 *vsx_movv2df/12 [length =3D 4] > mtctr 10 # 48 *movdi_internal64/11 [length =3D = 4] > .align 4 > .L2: > lxvd2x 12,0,9 # 23 *vsx_movv2df/2 [length =3D 4] > addi 9,9,16 # 25 *adddi3_internal1/2 [length =3D = 4] > xvadddp 0,0,12 # 24 *vsx_addv2df3/1 [length =3D 4] > bdnz .L2 # 47 *ctrdi_internal1/1 [length =3D = 4] > xxsldwi 12,0,0,2 # 30 vsx_xxsldwi_v2df [len= gth =3D 4] > xvadddp 1,0,12 # 31 *vsx_addv2df3/1 [length =3D 4] > nop # 37 *vsx_extract_v2df_internal2/1 [length =3D = 4] > blr # 55 return [length =3D 4] >=20 > this is presumably using gcc's scalar reduction code, but (to my untr= ained eye on powerpc!) it looks even better than the first form above (the = same in the loop, and in the reduction, an xxpermdi is replaced by a nop !)= ... >=20 >=20 > --Alan >=20 >=20 > Segher Boessenkool wrote: >=20 > On Mon, Nov 10, 2014 at 05:36:24PM -0500, Michael Meissner wrote: >=20 > However, the double pattern is completely broken. This canno= t go in. >=20 > [snip] >=20 > It is unacceptable to have to do the inner loop doing a load,= vector add, and > store in the loop. >=20 > Before the patch, the final reduction used *vsx_reduc_splus_v2df;= after > the patch, it is *vsx_reduc_plus_v2df_scalar. The former does a = vector > add, the latter a float add. And it uses the same pseudoregister= for the > accumulator throughout. IRA decides a register is more expensive= than > memory for this, I suppose because it wants both V2DF and DF? It= doesn't > seem to like the subreg very much. >=20 > The new code does look nicer otherwise :-) >=20 >=20 > Segher --------------020100090706070703040306 Content-Type: text/x-patch; name=vec_rs6000_combined.patch Content-Transfer-Encoding: quoted-printable Content-Disposition: inline; filename="vec_rs6000_combined.patch" Content-length: 4661 diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md index d46ef191409211a0a9d213b57fb4e657cd5a3cb4..b79f7aa477ec700743ae7c57346= 72571e395b79d 100644 --- a/gcc/config/rs6000/altivec.md +++ b/gcc/config/rs6000/altivec.md @@ -2596,35 +2596,22 @@ operands[3] =3D gen_reg_rtx (GET_MODE (operands[0])); }) =20 -(define_expand "reduc_splus_" - [(set (match_operand:VIshort 0 "register_operand" "=3Dv") +(define_expand "reduc_plus_scal_" + [(set (match_operand: 0 "register_operand" "=3Dv") (unspec:VIshort [(match_operand:VIshort 1 "register_operand" "v")] UNSPEC_REDUC_PLUS))] "TARGET_ALTIVEC" { rtx vzero =3D gen_reg_rtx (V4SImode); rtx vtmp1 =3D gen_reg_rtx (V4SImode); - rtx dest =3D gen_lowpart (V4SImode, operands[0]); + rtx vtmp2 =3D gen_reg_rtx (mode); + rtx dest =3D gen_lowpart (V4SImode, vtmp2); + HOST_WIDE_INT last_elem =3D GET_MODE_NUNITS (mode) - 1; =20 emit_insn (gen_altivec_vspltisw (vzero, const0_rtx)); emit_insn (gen_altivec_vsum4ss (vtmp1, operands[1], vzero)); emit_insn (gen_altivec_vsumsws_direct (dest, vtmp1, vzero)); - DONE; -}) - -(define_expand "reduc_uplus_v16qi" - [(set (match_operand:V16QI 0 "register_operand" "=3Dv") - (unspec:V16QI [(match_operand:V16QI 1 "register_operand" "v")] - UNSPEC_REDUC_PLUS))] - "TARGET_ALTIVEC" -{ - rtx vzero =3D gen_reg_rtx (V4SImode); - rtx vtmp1 =3D gen_reg_rtx (V4SImode); - rtx dest =3D gen_lowpart (V4SImode, operands[0]); - - emit_insn (gen_altivec_vspltisw (vzero, const0_rtx)); - emit_insn (gen_altivec_vsum4ubs (vtmp1, operands[1], vzero)); - emit_insn (gen_altivec_vsumsws_direct (dest, vtmp1, vzero)); + rs6000_expand_vector_extract (operands[0], vtmp2, last_elem); DONE; }) =20 diff --git a/gcc/config/rs6000/vector.md b/gcc/config/rs6000/vector.md index e2946bd6e312e909471253fc2d75a4b25e050f82..cc01a96cb681a61b0708b34ffa4= 339529f3c1b9e 100644 --- a/gcc/config/rs6000/vector.md +++ b/gcc/config/rs6000/vector.md @@ -77,7 +77,7 @@ ;; Vector reduction code iterators (define_code_iterator VEC_reduc [plus smin smax]) =20 -(define_code_attr VEC_reduc_name [(plus "splus") +(define_code_attr VEC_reduc_name [(plus "plus") (smin "smin") (smax "smax")]) =20 @@ -990,20 +990,16 @@ =0C ;; Vector reduction expanders for VSX =20 -(define_expand "reduc__v2df" - [(parallel [(set (match_operand:V2DF 0 "vfloat_operand" "") - (VEC_reduc:V2DF - (vec_concat:V2DF - (vec_select:DF - (match_operand:V2DF 1 "vfloat_operand" "") - (parallel [(const_int 1)])) - (vec_select:DF - (match_dup 1) - (parallel [(const_int 0)]))) - (match_dup 1))) - (clobber (match_scratch:V2DF 2 ""))])] +(define_expand "reduc__scal_v2df" + [(match_operand:DF 0 "register_operand" "") + (VEC_reduc:V2DF (match_operand:V2DF 1 "vfloat_operand" "") (const_int 0= ))] "VECTOR_UNIT_VSX_P (V2DFmode)" - "") + { + rtx vec =3D gen_reg_rtx (V2DFmode); + emit_insn (gen_vsx_reduc__v2df (vec, operand1)); + emit_insn (gen_vsx_extract_v2df (operand0, vec, const1_rtx)); + DONE; + }) =20 ; The (VEC_reduc:V4SF ; (op1) @@ -1012,13 +1008,16 @@ ; is to allow us to use a code iterator, but not completely list all of the ; vector rotates, etc. to prevent canonicalization =20 -(define_expand "reduc__v4sf" - [(parallel [(set (match_operand:V4SF 0 "vfloat_operand" "") - (VEC_reduc:V4SF - (unspec:V4SF [(const_int 0)] UNSPEC_REDUC) - (match_operand:V4SF 1 "vfloat_operand" ""))) +(define_expand "reduc__scal_v4sf" + [(parallel [(set (match_operand:SF 0 "vfloat_operand" "") + (vec_select:SF + (VEC_reduc:V4SF + (unspec:V4SF [(const_int 0)] UNSPEC_REDUC) + (match_operand:V4SF 1 "vfloat_operand" "")) + (parallel [(const_int 3)]))) (clobber (match_scratch:V4SF 2 "")) - (clobber (match_scratch:V4SF 3 ""))])] + (clobber (match_scratch:V4SF 3 "")) + (clobber (match_scratch:V4SF 4 ""))])] "VECTOR_UNIT_VSX_P (V4SFmode)" "") =20 diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md index 27d464e07f7b77166047dd9ba41966aef411c029..2a30039c2bf415ab92fbd0b8067= 87a74b15d7dcb 100644 --- a/gcc/config/rs6000/vsx.md +++ b/gcc/config/rs6000/vsx.md @@ -2150,7 +2150,7 @@ =0C ;; Vector reduction insns and splitters =20 -(define_insn_and_split "*vsx_reduc__v2df" +(define_insn_and_split "vsx_reduc__v2df" [(set (match_operand:V2DF 0 "vfloat_operand" "=3D&wd,&?wa,wd,?wa") (VEC_reduc:V2DF (vec_concat:V2DF= --------------020100090706070703040306--