Hi Vladimir,

on 2021/6/30 11:24 PM, Vladimir Makarov wrote:
>
> On 2021-06-28 2:26 a.m., Kewen.Lin wrote:
>> Hi!
>>
>> on 2021/6/9 1:18 PM, Kewen.Lin via Gcc-patches wrote:
>>> Hi,
>>>
>>> PR100328 has some details about this issue, I will try to
>>> summarize it here.  In the hottest function
>>> LBM_performStreamCollideTRT of SPEC2017 bmk 519.lbm_r, there are
>>> many FMA style expressions (27 FMA, 19 FMS, 11 FNMA).  On rs6000,
>>> this kind of FMA style insn has two flavors: FLOAT_REG and
>>> VSX_REG.  The VSX_REG reg class has 64 registers, the first 32
>>> of which make up the whole FLOAT_REG.  There are some differences
>>> between these two flavors, taking "*fma4_fpr" as example:
>>>
>>> (define_insn "*fma4_fpr"
>>>    [(set (match_operand:SFDF 0 "gpc_reg_operand" "=,wa,wa")
>>>     (fma:SFDF
>>>       (match_operand:SFDF 1 "gpc_reg_operand" "%,wa,wa")
>>>       (match_operand:SFDF 2 "gpc_reg_operand" ",wa,0")
>>>       (match_operand:SFDF 3 "gpc_reg_operand" ",0,wa")))]
>>>
>>> // wa => A VSX register (VSR), vs0…vs63, aka. VSX_REG.
>>> // (f/d) => A floating point register, aka. FLOAT_REG.
>>>
>>> So for VSX_REG, we only have the destructive form: when a VSX_REG
>>> alternative is used, operand 2 or operand 3 is required to be
>>> the same as operand 0.  reload has to take care of this
>>> constraint and create some non-free register copies if required.
>>>
>>> Assuming one fma insn looks like:
>>>    op0 = FMA (op1, op2, op3)
>>>
>>> The best regclass for all of them is VSX_REG.  When op1, op2, op3
>>> are all dead, IRA simply creates three shuffle copies for them
>>> (here the operand order matters, since with the same freq, the
>>> one with the smaller number takes preference), but IMO both op2
>>> and op3 should take higher priority in the copy queue due to the
>>> matching constraint.
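
The destructive-form requirement described above can be illustrated with a small C model.  This is only a sketch, not GCC code: struct fma_regs, is_float_reg and needs_tie_copy are hypothetical names, the 0..63 register numbering (with 0..31 standing for FLOAT_REG) is an assumption made for illustration, and the commutativity of operands 1 and 2 is ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, not GCC code: hard registers assigned to the
   operands of  op0 = FMA (op1, op2, op3).  */
struct fma_regs { int op0, op1, op2, op3; };

/* Assumed numbering for this sketch only: 0..31 stand for FLOAT_REG,
   32..63 for the VSX_REG registers outside FLOAT_REG.  */
static bool
is_float_reg (int r)
{
  assert (r >= 0 && r < 64);
  return r < 32;
}

/* Return true if the tie required by the destructive VSX form is not
   satisfied, i.e. a register copy would be needed first.  */
bool
needs_tie_copy (struct fma_regs r)
{
  /* All operands in FLOAT_REG: the non-destructive flavor applies.  */
  if (is_float_reg (r.op0) && is_float_reg (r.op1)
      && is_float_reg (r.op2) && is_float_reg (r.op3))
    return false;

  /* VSX flavor: op0 must share its register with op2 or op3.  */
  return r.op0 != r.op2 && r.op0 != r.op3;
}
```

In this model the assignment {40, 1, 2, 3} needs a copy, while {40, 1, 40, 3} and {40, 1, 2, 40} do not, which is why copies involving op2 and op3 matter more than the one involving op1.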
>>>
>>> I noticed that there is one function ira_get_dup_out_num, which
>>> is meant to create this kind of constraint copy, but the code
>>> below appears to refuse to create one if there is an alternative
>>> whose regclass is valid and not likely to be spilled.
>>>
>>>        default:
>>>     {
>>>       enum constraint_num cn = lookup_constraint (str);
>>>       enum reg_class cl = reg_class_for_constraint (cn);
>>>       if (cl != NO_REGS
>>>           && !targetm.class_likely_spilled_p (cl))
>>>         goto fail;
>>>
>>>      ...
>>>
>>> I cooked one patch attached to make ira respect this kind of
>>> matching constraint, guarded with one parameter.  As I stated in
>>> the PR, I was not sure this is on the right track.  The RFC patch
>>> checks the matching constraint in all alternatives: if there is
>>> one alternative with a matching constraint that matches the
>>> current preferred regclass (or best of allocno?), it will record
>>> the output operand number and further create one constraint copy
>>> for it.  Normally this copy gets priority over shuffle copies and
>>> the matching constraint is more likely to be satisfied, so reload
>>> does not have to create extra copies to meet the matching
>>> constraint or the desirable register class.
>>>
>>> For FMA A,B,C,D, I think ideally copies A/B, A/C, A/D can firstly
>>> stay as shuffle copies, and later any of A,B,C,D gets assigned
>>> one hardware register which is a VSX register (VSX_REG) but not
>>> a FP register (FLOAT_REG), which means it has to pay costs once
>>> we can NOT go with VSX alternatives, so at that time it's
>>> important to respect the matching constraint, and then we can
>>> increase the freq for the remaining copies related to this (A/B,
>>> A/C, A/D).  This idea requires some side tables to record some
>>> information and seems a bit complicated in the current framework,
>>> so the proposed patch aggressively emphasizes the matching
>>> constraint at the time of creating copies.
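
The priority of constraint copies over shuffle copies can be sketched as a sort over a small copy queue.  This is a simplified model under stated assumptions, not IRA's implementation: struct copy_model, build_copy_queue and the divisor SHUFFLE_FREQ_DIVISOR are hypothetical, and the only ordering rules used are the ones mentioned in the mail (higher freq first, then smaller operand number).

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed for illustration only: shuffle copies get a fraction of the
   insn frequency so that constraint copies sort ahead of them.  */
#define SHUFFLE_FREQ_DIVISOR 8

enum copy_kind { SHUFFLE_COPY, CONSTRAINT_COPY };

/* Hypothetical record for a copy between op0 and one input operand.  */
struct copy_model
{
  enum copy_kind kind;
  int input_opnum;
  int freq;
};

/* Higher freq first; with equal freq the smaller operand number wins.  */
static int
copy_cmp (const void *a, const void *b)
{
  const struct copy_model *x = a, *y = b;
  if (x->freq != y->freq)
    return (y->freq > x->freq) - (y->freq < x->freq);
  return (x->input_opnum > y->input_opnum)
	 - (x->input_opnum < y->input_opnum);
}

/* Build the copies for op0 = FMA (op1, op2, op3) and sort them.
   Bit N of TIED_MASK set means operand N gets a constraint copy.  */
void
build_copy_queue (int insn_freq, unsigned tied_mask,
		  struct copy_model out[3])
{
  assert (insn_freq > 0);
  for (int n = 1; n <= 3; n++)
    {
      int tied = (tied_mask >> n) & 1;
      int shuffle_freq = insn_freq < SHUFFLE_FREQ_DIVISOR
			 ? 1 : insn_freq / SHUFFLE_FREQ_DIVISOR;
      out[n - 1].kind = tied ? CONSTRAINT_COPY : SHUFFLE_COPY;
      out[n - 1].input_opnum = n;
      out[n - 1].freq = tied ? insn_freq : shuffle_freq;
    }
  qsort (out, 3, sizeof out[0], copy_cmp);
}
```

With no tied operands the queue order is op1, op2, op3.  Marking op2 and op3 as tied moves them ahead of op1, which is the effect the patch aims for.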
>>>
>> Compared with the original patch (v1), this patch v3 has
>> considered: (this should be v2 for this mail list, but bump
>> it to be consistent with the PR's).
>>
>>    - Excluding the case where for one preferred register class
>>      there can be two or more alternatives, one of which has the
>>      matching constraint while another doesn't.  So for the
>>      given operand, even if it's assigned a hardware reg which
>>      doesn't meet the matching constraint, it can simply use
>>      the alternative which doesn't have the matching constraint,
>>      so no register move is needed.  One typical case is
>>      define_insn *mov_internal2 on rs6000.  So we
>>      shouldn't create a constraint copy for it.
>>
>>    - The possible free register move in the same register class:
>>      disable this if so, since the register move to meet the
>>      constraint is considered as free.
>>
>>    - Making it on by default, as suggested by Segher & Vladimir;
>>      we hope to get rid of the parameter if the benchmarking
>>      result looks good on major targets.
>>
>>    - Tweaking the cost when either side of the matching
>>      constraint is a hardware register.  Before this patch, the
>>      constraint copy is simply taken as a real move insn for pref
>>      and conflict cost with one hardware register.  After this
>>      patch, it's allowed that there are several input operands
>>      respecting the same matching constraint (but in different
>>      alternatives), so we should take it to be like a shuffle
>>      copy for some cases to avoid over preferring/disparaging.
>>
>> Please check the PR comments for more details.
>>
>> This patch can be bootstrapped & regtested on
>> powerpc64le-linux-gnu P9 and x86_64-redhat-linux, but has some
>> "XFAIL->XPASS" failures on aarch64-linux-gnu.  The failure list
>> was attached in the PR and I think the new assembly looks
>> improved (as expected).
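
The first exclusion in the list above can be read as a check over the insn's alternatives.  The following C sketch shows one simplified reading of it, not the patch's code: struct alt_model, enum rc_model and want_constraint_copy are hypothetical, and each alternative is reduced to one register class plus the number of the input operand tied to the output (0 if none).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical register classes for the model.  */
enum rc_model { RC_FLOAT, RC_VSX };

/* One insn alternative: the class it uses, and which input operand is
   tied to the output operand (0 means no input is tied).  */
struct alt_model
{
  enum rc_model cls;
  int tied_input;
};

/* Return true if a constraint copy between the output and input
   operand OPNUM looks worthwhile for preferred class PREF: some
   alternative of that class ties OPNUM to the output, and no
   alternative of that class is free of matching constraints.  */
bool
want_constraint_copy (const struct alt_model *alts, int n_alts,
		      enum rc_model pref, int opnum)
{
  assert (alts != NULL && n_alts > 0 && opnum > 0);
  bool seen_tied = false;
  for (int i = 0; i < n_alts; i++)
    {
      if (alts[i].cls != pref)
	continue;
      /* An untied alternative of the preferred class can be used
	 without any register move, so no copy is wanted.  */
      if (alts[i].tied_input == 0)
	return false;
      if (alts[i].tied_input == opnum)
	seen_tied = true;
    }
  return seen_tied;
}
```

For an fma-like table { {RC_FLOAT, 0}, {RC_VSX, 3}, {RC_VSX, 2} } with RC_VSX preferred, the model asks for copies for operands 2 and 3 but not for operand 1.  For a mov-like table { {RC_VSX, 0}, {RC_VSX, 1} } it asks for none, matching the case the bullet excludes.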
>>
>> With option Ofast unroll, this patch can help to improve SPEC2017
>> bmk 508.namd_r +2.42% and 519.lbm_r +2.43% on Power8 while
>> 508.namd_r +3.02% and 519.lbm_r +3.85% on Power9 without any
>> remarkable degradations.
>>
>> Since this patch likely benefits x86_64 and aarch64, but I don't
>> have performance machines with these arches at hand, could
>> someone kindly help to benchmark it if possible?
>>
>> Many thanks in advance!
>>
>> btw, you can simply ignore the part about parameter
>> ira-consider-dup-in-all-alts (its name/description), it's sort of
>> stale, I let it be for now as we will likely get rid of it.
>
> Kewen, thank you for addressing remarks for the previous version of the patch.  The patch is ok to commit with some minor changes:
>
> o In a comment for function ira_get_dup_out_num there is no mention of the effect of the param on the function's returned value and on the returned value of single_input_op_has_cstr_p, and this imho creates a wrong function interface description.
>
> o It would still be nice to change name op_no to op_regno in ira_get_dup_out_num.
>
> It is ok to commit the patch to the mainline on condition that you submit the patch switching off the parameter for x86-64 right after that, as Hongtao Liu has shown its negative effect on x86-64 SPEC2017.
>

Many thanks for your review!  I've updated the patch according to your
comments and also polished some comments and documentation wording a bit.

Does it look better to you?

BR,
Kewen