on 2021/6/28 3:20 PM, Hongtao Liu wrote:
> On Mon, Jun 28, 2021 at 3:12 PM Hongtao Liu wrote:
>>
>> On Mon, Jun 28, 2021 at 2:50 PM Kewen.Lin wrote:
>>>
>>> Hi!
>>>
>>> on 2021/6/9 1:18 PM, Kewen.Lin via Gcc-patches wrote:
>>>> Hi,
>>>>
>>>> PR100328 has some details about this issue, I am trying to
>>>> summarize it here.  In the hottest function LBM_performStreamCollideTRT
>>>> of SPEC2017 bmk 519.lbm_r, there are many FMA style expressions
>>>> (27 FMA, 19 FMS, 11 FNMA).  On rs6000, this kind of FMA style
>>>> insn has two flavors: FLOAT_REG and VSX_REG.  The VSX_REG reg
>>>> class has 64 registers, whose first 32 make up the
>>>> whole FLOAT_REG.  There are some differences between these two
>>>> flavors, taking "*fma<mode>4_fpr" as an example:
>>>>
>>>> (define_insn "*fma<mode>4_fpr"
>>>>   [(set (match_operand:SFDF 0 "gpc_reg_operand" "=<Ff>,wa,wa")
>>>>         (fma:SFDF
>>>>           (match_operand:SFDF 1 "gpc_reg_operand" "%<Ff>,wa,wa")
>>>>           (match_operand:SFDF 2 "gpc_reg_operand" "<Ff>,wa,0")
>>>>           (match_operand:SFDF 3 "gpc_reg_operand" "<Ff>,0,wa")))]
>>>>
>>>> // wa => A VSX register (VSR), vs0…vs63, aka. VSX_REG.
>>>> // <Ff> (f/d) => A floating point register, aka. FLOAT_REG.
>>>>
>>>> So for VSX_REG, we only have the destructive form: when a VSX_REG
>>>> alternative is used, operand 2 or operand 3 is required
>>>> to be the same as operand 0.  reload has to take care of this
>>>> constraint and create some non-free register copies if required.
>>>>
>>>> Assuming one fma insn looks like:
>>>>   op0 = FMA (op1, op2, op3)
>>>>
>>>> The best regclass of all of them is VSX_REG.  When op1, op2, op3 are all
>>>> dead, IRA simply creates three shuffle copies for them (here the operand
>>>> order matters, since with the same freq, the one with the smaller number
>>>> takes preference), but IMO both op2 and op3 should take higher priority
>>>> in the copy queue due to the matching constraint.
>>>>
>>>> I noticed that there is one function ira_get_dup_out_num, which is meant
>>>> to create this kind of constraint copy, but the code below appears to
>>>> refuse to create one if there is an alternative whose regclass is
>>>> valid and not likely to be spilled.
>>>>
>>>>       default:
>>>>         {
>>>>           enum constraint_num cn = lookup_constraint (str);
>>>>           enum reg_class cl = reg_class_for_constraint (cn);
>>>>           if (cl != NO_REGS
>>>>               && !targetm.class_likely_spilled_p (cl))
>>>>             goto fail;
>>>>
>>>>           ...
>>>>
>>>> I cooked one patch, attached, to make ira respect this kind of matching
>>>> constraint, guarded with one parameter.  As I stated in the PR, I was
>>>> not sure this is on the right track.  The RFC patch checks the
>>>> matching constraint in all alternatives: if there is one alternative
>>>> with a matching constraint that matches the current preferred regclass
>>>> (or best of allocno?), it will record the output operand number and
>>>> further create one constraint copy for it.  Normally it can get
>>>> priority over shuffle copies and the matching constraint will be
>>>> satisfied with higher probability, so that reload doesn't have to
>>>> create extra copies to meet the matching constraint or the desirable
>>>> register class.
>>>>
>>>> For FMA A,B,C,D, I think ideally copies A/B, A/C, A/D can firstly stay
>>>> as shuffle copies, and later, when any of A,B,C,D gets assigned a
>>>> hardware register which is a VSX register (VSX_REG) but not a FP
>>>> register (FLOAT_REG), which means we have to pay costs if we can NOT
>>>> go with VSX alternatives, it becomes important to respect
>>>> the matching constraint, so we can increase the freq for the remaining
>>>> copies related to this (A/B, A/C, A/D).  This idea requires some side
>>>> tables to record some information and seems a bit complicated in the
>>>> current framework, so the proposed patch aggressively emphasizes the
>>>> matching constraint at the time of creating copies.
>>>>
>>>
>>> Compared with the original patch (v1), this patch v3 has
>>> considered: (this should be v2 for this mailing list, but I bumped
>>> it to be consistent with the PR's).
>>>
>>>   - Excluding the case where, for one preferred register class,
>>>     there are two or more alternatives, one of which has the
>>>     matching constraint while another doesn't.  So for
>>>     the given operand, even if it's assigned a hardware reg
>>>     which doesn't meet the matching constraint, it can simply
>>>     use the alternative which doesn't have the matching constraint,
>>>     so no register move is needed.  One typical case is
>>>     define_insn *mov<mode>_internal2 on rs6000.  So we
>>>     shouldn't create a constraint copy for it.
>>>
>>>   - The possible free register move in the same register class:
>>>     disable this if so, since the register move to meet the
>>>     constraint is considered free.
>>>
>>>   - Making it on by default, as suggested by Segher & Vladimir.  We
>>>     hope to get rid of the parameter if the benchmarking results
>>>     look good on major targets.
>>>
>>>   - Tweaking the cost when either of the two sides of the matching
>>>     constraint is a hardware register.  Before this patch, the
>>>     constraint copy was simply taken as a real move insn for pref and
>>>     conflict cost with one hardware register.  After this patch,
>>>     it's allowed that there are several input operands
>>>     respecting the same matching constraint (but in different
>>>     alternatives), so we should treat it like a shuffle copy
>>>     in some cases to avoid over preferring/disparaging.
>>>
>>> Please check the PR comments for more details.
>>>
>>> This patch can be bootstrapped & regtested on
>>> powerpc64le-linux-gnu P9 and x86_64-redhat-linux, but has some
>>> "XFAIL->XPASS" failures on aarch64-linux-gnu.  The failure list
>>> was attached in the PR and I think the new assembly looks
>>> improved (as expected).
>>>
>>> With option Ofast unroll, this patch can help to improve SPEC2017
>>> bmk 508.namd_r +2.42% and 519.lbm_r +2.43% on Power8, and
>>> 508.namd_r +3.02% and 519.lbm_r +3.85% on Power9, without any
>>> remarkable degradations.
>>>
>>> Since this patch likely benefits x86_64 and aarch64, but I don't
>>> have performance machines with these arches at hand, could
>>> someone kindly help to benchmark it if possible?
>> I can help test it on Intel Cascade Lake and AMD Milan.

Thanks for your help, Hongtao!

> And could you rebase your patch on the latest trunk, I got several
> failures when applying the patch
> ~ git apply ira-v3.diff
> error: patch failed: gcc/doc/invoke.texi:13845
> error: gcc/doc/invoke.texi: patch does not apply
> error: patch failed: gcc/ira-conflicts.c:233
> error: gcc/ira-conflicts.c: patch does not apply
> error: patch failed: gcc/ira-int.h:971
> error: gcc/ira-int.h: patch does not apply
> error: patch failed: gcc/ira.c:1922
> error: gcc/ira.c: patch does not apply
> error: patch failed: gcc/params.opt:330
> error: gcc/params.opt: patch does not apply
>

I think it's due to unexpected git stat lines in the previously attached diff.

I have attached the format-patch file.  Please have a check.  Thanks!

BR,
Kewen
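
P.S. For readers following the thread, here is a minimal toy sketch of the
decision discussed above: when should an input operand get a constraint copy
with the output?  It is NOT the code from the patch and does not use any GCC
internals.  The names toy_dup_out_num, alt_is_free, covers and the table
layout are all made up for illustration, and the "free alternative" rule is
only my simplified reading of the first bullet in the v3 notes.

```c
#include <assert.h>

/* Toy register classes.  MATCH0 stands for the matching constraint "0",
   i.e. "this input must be in the same register as operand 0".  */
enum cls { FLOAT_REG, VSX_REG, MATCH0 };

#define N_OPS  4
#define N_ALTS 3

/* Toy model of the fma pattern quoted above: rows are operands 0..3,
   columns are the three alternatives (FLOAT_REG flavor, then two
   destructive VSX_REG flavors).  */
static const enum cls fma[N_OPS][N_ALTS] = {
  { FLOAT_REG, VSX_REG, VSX_REG },  /* op0 (output) */
  { FLOAT_REG, VSX_REG, VSX_REG },  /* op1 */
  { FLOAT_REG, VSX_REG, MATCH0  },  /* op2 */
  { FLOAT_REG, MATCH0,  VSX_REG },  /* op3 */
};

/* FLOAT_REG is a subset of VSX_REG, so a VSX_REG slot accepts any
   preferred class, while a FLOAT_REG slot only accepts FLOAT_REG.  */
static int covers (enum cls alt_cls, enum cls pref)
{
  return alt_cls == VSX_REG || alt_cls == pref;
}

/* An alternative is "free" if every operand can live in PREF and no
   operand is tied to the output, so no extra move is ever needed.  */
static int alt_is_free (const enum cls insn[][N_ALTS], int alt, enum cls pref)
{
  for (int op = 0; op < N_OPS; op++)
    if (insn[op][alt] == MATCH0 || !covers (insn[op][alt], pref))
      return 0;
  return 1;
}

/* Return 0 (the output operand number) if input OP should get a
   constraint copy with the output for preferred class PREF, else -1.  */
static int toy_dup_out_num (const enum cls insn[][N_ALTS], int op, enum cls pref)
{
  assert (op > 0 && op < N_OPS);
  int tied = 0;
  for (int alt = 0; alt < N_ALTS; alt++)
    {
      if (alt_is_free (insn, alt, pref))
        return -1;  /* some alternative needs no matching at all */
      if (insn[op][alt] == MATCH0 && covers (insn[0][alt], pref))
        tied = 1;   /* OP is tied to op0 in an alternative that fits PREF */
    }
  return tied ? 0 : -1;
}
```

In this toy model, with VSX_REG preferred, toy_dup_out_num (fma, 2, VSX_REG)
and toy_dup_out_num (fma, 3, VSX_REG) both return 0, while op1 returns -1.
With FLOAT_REG preferred, the first alternative is free, so every input
returns -1 and everything stays a shuffle copy.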