On Mon, Jan 28, 2019 at 9:08 AM H.J. Lu wrote: > > On Tue, Jan 22, 2019 at 5:28 AM H.J. Lu wrote: > > > > On Tue, Jan 22, 2019 at 4:08 AM Richard Biener > > wrote: > > > > > > On Mon, Jan 21, 2019 at 10:27 PM H.J. Lu wrote: > > > > > > > > On Mon, Jan 21, 2019 at 10:54 AM Jeff Law wrote: > > > > > > > > > > On 1/7/19 6:55 AM, H.J. Lu wrote: > > > > > > On Sun, Dec 30, 2018 at 8:40 AM H.J. Lu wrote: > > > > > >> On Wed, Nov 28, 2018 at 12:17 PM Jeff Law wrote: > > > > > >>> On 11/28/18 12:48 PM, H.J. Lu wrote: > > > > > >>>> On Mon, Nov 5, 2018 at 7:29 AM Jan Hubicka wrote: > > > > > >>>>>> On 11/5/18 7:21 AM, Jan Hubicka wrote: > > > > > >>>>>>>> Did you mean "the nearest common dominator"? > > > > > >>>>>>> If the nearest common dominator appears in the loop while all uses are > > > > > >>>>>>> out of loops, this will result in suboptimal xor placement. > > > > > >>>>>>> In this case you want to split edges out of the loop. > > > > > >>>>>>> > > > > > >>>>>>> In general this is what the LCM framework will do for you if the problem > > > > > >>>>>>> is modelled siimlar way as in mode_swtiching. At entry function mode is > > > > > >>>>>>> "no zero register needed" and all conversions need mode "zero register > > > > > >>>>>>> needed". Mode switching should then do the correct placement decisions > > > > > >>>>>>> (reaching minimal number of executions of xor). > > > > > >>>>>>> > > > > > >>>>>>> Jeff, whan is your optinion on the approach taken by the patch? > > > > > >>>>>>> It seems like a special case of more general issue, but I do not see > > > > > >>>>>>> very elegant way to solve it at least in the GCC 9 horisont, so if > > > > > >>>>>>> the placement is correct we can probalby go either with new pass or > > > > > >>>>>>> making this part of mode swithcing (which is anyway run by x86 backend) > > > > > >>>>>> So I haven't followed this discussion at all, but did touch on this > > > > > >>>>>> issue with some patch a month or two ago with a target patch that was > > > > > >>>>>> trying to avoid the partial stalls. > > > > > >>>>>> > > > > > >>>>>> My assumption is that we're trying to find one or more places to > > > > > >>>>>> initialize the upper half of an avx register so as to avoid partial > > > > > >>>>>> register stall at existing sites that set the upper half. > > > > > >>>>>> > > > > > >>>>>> This sounds like a classic PRE/LCM style problem (of which mode > > > > > >>>>>> switching is just another variant). A common-dominator approach is > > > > > >>>>>> closer to a classic GCSE and is going to result is more initializations > > > > > >>>>>> at sub-optimal points than a PRE/LCM style. > > > > > >>>>> yes, it is usual code placement problem. It is special case because the > > > > > >>>>> zero register is not modified by the conversion (just we need to have > > > > > >>>>> zero somewhere). So basically we do not have kills to the zero except > > > > > >>>>> for entry block. > > > > > >>>>> > > > > > >>>> Do you have testcase to show thatf the nearest common dominator > > > > > >>>> in the loop, while all uses areout of loops, leads to suboptimal xor > > > > > >>>> placement? > > > > > >>> I don't have a testcase, but it's all but certain nearest common > > > > > >>> dominator is going to be a suboptimal placement. That's going to create > > > > > >>> paths where you're going to emit the xor when it's not used. > > > > > >>> > > > > > >>> The whole point of the LCM algorithms is they are optimal in terms of > > > > > >>> expression evaluations. > > > > > >> We tried LCM and it didn't work well for this case. LCM places a single > > > > > >> VXOR close to the location where it is needed, which can be inside a > > > > > >> loop. There is nothing wrong with the LCM algorithms. But this doesn't > > > > > >> solve > > > > > >> > > > > > >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007 > > > > > >> > > > > > >> where VXOR is executed multiple times inside of a function, instead of > > > > > >> just once. We are investigating to generate a single VXOR at entry of the > > > > > >> nearest dominator for basic blocks with SF/DF conversions, which is in > > > > > >> the the fake loop that contains the whole function: > > > > > >> > > > > > >> bb = nearest_common_dominator_for_set (CDI_DOMINATORS, > > > > > >> convert_bbs); > > > > > >> while (bb->loop_father->latch > > > > > >> != EXIT_BLOCK_PTR_FOR_FN (cfun)) > > > > > >> bb = get_immediate_dominator (CDI_DOMINATORS, > > > > > >> bb->loop_father->header); > > > > > >> > > > > > >> insn = BB_HEAD (bb); > > > > > >> if (!NONDEBUG_INSN_P (insn)) > > > > > >> insn = next_nonnote_nondebug_insn (insn); > > > > > >> set = gen_rtx_SET (v4sf_const0, CONST0_RTX (V4SFmode)); > > > > > >> set_insn = emit_insn_before (set, insn); > > > > > >> > > > > > > Here is the updated patch. OK for trunk? > > > > > > > > > > > > Thanks. > > > > > > > > > > > > -- H.J. > > > > > > > > > > > > > > > > > > 0001-i386-Add-pass_remove_partial_avx_dependency.patch > > > > > > > > > > > > From 6eca7dbf282d7e2a5cde41bffeca66195d72d48e Mon Sep 17 00:00:00 2001 > > > > > > From: "H.J. Lu" > > > > > > Date: Mon, 7 Jan 2019 05:44:59 -0800 > > > > > > Subject: [PATCH] i386: Add pass_remove_partial_avx_dependency > > > > > > > > > > > > With -mavx, for > > > > > > > > > > > > $ cat foo.i > > > > > > extern float f; > > > > > > extern double d; > > > > > > extern int i; > > > > > > > > > > > > void > > > > > > foo (void) > > > > > > { > > > > > > d = f; > > > > > > f = i; > > > > > > } > > > > > > > > > > > > we need to generate > > > > > > > > > > > > vxorp[ds] %xmmN, %xmmN, %xmmN > > > > > > ... > > > > > > vcvtss2sd f(%rip), %xmmN, %xmmX > > > > > > ... > > > > > > vcvtsi2ss i(%rip), %xmmN, %xmmY > > > > > > > > > > > > to avoid partial XMM register stall. This patch adds a pass to generate > > > > > > a single > > > > > > > > > > > > vxorps %xmmN, %xmmN, %xmmN > > > > > > > > > > > > at entry of the nearest dominator for basic blocks with SF/DF conversions, > > > > > > which is in the fake loop that contains the whole function, instead of > > > > > > generating one > > > > > > > > > > > > vxorp[ds] %xmmN, %xmmN, %xmmN > > > > > > > > > > > > for each SF/DF conversion. > > > > > > > > > > > > NB: The LCM algorithm isn't appropriate here since it may place a vxorps > > > > > > inside the loop. Simple testcase show this: > > > > > > > > > > > > $ cat badcase.c > > > > > > > > > > > > extern float f; > > > > > > extern double d; > > > > > > > > > > > > void > > > > > > foo (int n, int k) > > > > > > { > > > > > > for (int j = 0; j != n; j++) > > > > > > if (j < k) > > > > > > d = f; > > > > > > } > > > > > > > > > > > > It generates > > > > > > > > > > > > ... > > > > > > loop: > > > > > > if(j < k) > > > > > > vxorps %xmm0, %xmm0, %xmm0 > > > > > > vcvtss2sd %xmm1, %xmm0, %xmm0 > > > > > > ... > > > > > > loopend > > > > > > ... > > > > > > > > > > > > This is because LCM only works when there is a certain benifit. But for > > > > > > conditional branch, LCM wouldn't move > > > > > > > > > > > > vxorps %xmm0, %xmm0, %xmm0 > > > > > It works this way for a reason. There are obviously paths through the > > > > > loop where the conversion does not happen and thus the vxor is not > > > > > needed or desirable on those paths. > > > > > > > > > > That's a fundamental property of the LCM algorithm -- it never inserts > > > > > an evaluation on a path through the CFG where it will not be used. > > > > > > > > > > Your algorithm of inserting into the dominator block will introduce > > > > > runtime executions of the vxor on paths where it is not needed. > > > > > > > > > > It's well known that relaxing that property of LCM can result in better > > > > > code generation in some circumstances. Block copying and loop > > > > > restructuring are the gold standard for dealing with this kind of problem. > > > > > > > > > > In this case you could split the iteration space so that you have two > > > > > loops. one for 0..k and the other for k..n. Note that GCC has support > > > > > for this kind of loop restructuring. This has the advantage that the j > > > > > < k test does not happen each iteration of the loop and the vxor stuff > > > > > via LCM would be optimal. > > > > > > > > > > There's many other cases where copying and restructuring results in > > > > > better common subexpression elimination (which is what you're doing). > > > > > Probably the best work I've seen in this space is Bodik's thesis. > > > > > Click's work from '95 touches on some of this as well, but isn't as > > > > > relevant to this specific instance. > > > > > > > > > > Anyway, whether or not the patch should move forward is really up to Jan > > > > > (and Uros if he wants to be involved) I think. I'm not fundamentally > > > > > opposed to HJ's approach as I'm aware of the different tradeoffs. > > > > > > > > > > HJ's approach of pulling into the dominator block can result in > > > > > unnecessary evaluations. But it can also reduce the number of > > > > > evaluations in other cases. It really depends on the runtime behavior > > > > > of the code. One could argue that the vxor stuff we're talking about > > > > > is most likely happening in loops, and probably not in deeply nested > > > > > control structures within those loops. Thus pulling them out more > > > > > aggressively ala LICM may be better than LCM. > > > > > > > > True, there is a trade-off. My approach inserts a vxorps at the last possible > > > > position. Yes, vxorps will always be executed even if it may not be required. > > > > Since it is executed only once in all cases, it is a win overall. > > > > > > Hopefully a simple vpxor won't end up powering up the other AVX512 > > > unit if it lay dormant ... > > > > A 128-bit AVX vpxor won't touch AVX512. > > > > > And if we ever get to the state of having two separate ISAs in the same > > > function then you'd need to make sure you can execute vpxor in the > > > place you are inserting since it may now be executed when it wasn't > > > before (and I assume you already check that you do not zero the > > > reg if there's a value life in it if the conditional def you are instrumenting > > > is not executed). > > > > A dedicated pseudo register is allocated and zeroed for INT->FP and > > FP->FP conversions. IRA/LRA take care of the rest. > > > > PING: > > https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00298.html > Here is the updated patch adjusted after PR target/89071 fix. OK for trunk? Thanks. -- H.J.