From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 61236 invoked by alias); 29 Jul 2015 09:18:07 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 61227 invoked by uid 89); 29 Jul 2015 09:18:07 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-2.0 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,RCVD_IN_DNSWL_LOW,SPF_PASS autolearn=ham version=3.3.2 X-HELO: mail-ig0-f178.google.com Received: from mail-ig0-f178.google.com (HELO mail-ig0-f178.google.com) (209.85.213.178) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-GCM-SHA256 encrypted) ESMTPS; Wed, 29 Jul 2015 09:18:05 +0000 Received: by igbpg9 with SMTP id pg9so8519463igb.0 for ; Wed, 29 Jul 2015 02:18:03 -0700 (PDT) MIME-Version: 1.0 X-Received: by 10.50.110.72 with SMTP id hy8mr3902047igb.36.1438161483352; Wed, 29 Jul 2015 02:18:03 -0700 (PDT) Received: by 10.107.32.140 with HTTP; Wed, 29 Jul 2015 02:18:03 -0700 (PDT) In-Reply-To: <55B7C27B.9000406@yahoo.com> References: <55A961C1.1070206@yahoo.com> <55B7C27B.9000406@yahoo.com> Date: Wed, 29 Jul 2015 09:51:00 -0000 Message-ID: Subject: Re: Another benefit of the new if converter: better performance for half hammocks when running the generated code on a modern high-speed CPU with write-back caching, relative to the code produced by the old if converter given the same source code From: Richard Biener To: Abe Cc: Alan Lawrence , "gcc-patches@gcc.gnu.org" , Sebastian Pop Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-IsSubscribed: yes X-SW-Source: 2015-07/txt/msg02439.txt.bz2 On Tue, Jul 28, 2015 at 7:57 PM, Abe wrote: > [Richard wrote:] >> >> Note the store to *pointer can be done unconditionally > > > Yes; if I`m mapping things correctly in my mind, this is > something that Sebastian [and Alan, via email?] and I have > already discussed and which we plan to fix in good time. > > Please note that this is a minor problem at most, > if/when it it safe to assume that the target can handle > two vectorized conditional operations in the same loop, > since anything remotely resembling an expensive > operation in the [pure] condition should be [is being?] > computed once per index and stored in a temporary. > For example: if the source code looks something like: > > if ( condition(index) ) A[index] =3D foo(index); > // important: no else here > > ... then the new converter currently converts it to something like: > > /* an appropriate type goes here */ __compiler_temp_1 =3D condition(ind= ex); > /* the type of the scalar goes here */ * pointer =3D __compiler_temp_1 ? > &A[index] : &scratchpad; > /* an appropriate type goes here */ __compiler_temp_2 =3D foo(index); > *pointer =3D __compiler_temp_1 ? __compiler_temp_2 : scratchpad; > > =E2=80=A6 so "condition(index)" is being evaluated only once > per {evaluation that exists in the source code}. > > The fix for this would/will therefor be a minor optimization IMO; > the benefit would/will be that in/for iterations/columns > for which the condition is false, the scratchpad will not be > needlessly read from in order to derive the value to throw away. > Always throwing away the unneeded result of evaluating "foo(index)" > is good enough, and by removing an unneeded conditional expression > the burden on the vectorizer is reduced: it now only needs: > {vectorized decision followed by vectorized store} > in each such loop, not: > {vectorized decision followed by vectorized decision followed by > vectorized store}. > [intentionally omitting whatever else it must do > in a vectorized manner in the same loop] > > This is something we [Sebastian and I] plan on fixing eventually anyway, > i.e. regardless of whether or not it fixes a test case we already have. > > > [Richard wrote:] >> >> and note that another performance issue of if-conversion >> is that foo(bar) is executed unconditionally. > > > AFAIK this is a fundamental limitation/necessity of if conversion. > > A fundamental assumption/requirement here is that "foo(bar)"/"foo(index)" > is/are both pure and low-cost. [I`ve renamed the example to "foo(index)" > to show that it`s not loop-invariant, since if it were then LICM should > make multiple evaluations of it unneeded and probably not going to happen > unless you are targeting a VLIW ISA and have an unused slot in the > instruction word if you do LICM on the sub-instruction in question.] > > If "foo(index)" is not being checked for purity, > then we have a correctness bug. > > If "foo(index)" is not being checked for low evaluation cost, > then we have a performance bug IMO. The compiler should use its > existing estimation mechanism[s] to make an educated guess on > the cost of "foo(index)" and intentionally not do if conversion > if/when {the predicted cost of evaluating "foo(index)" > for each iteration regardless of the condition bits} > is too high even in the presence of vectorization. > > > [Richard wrote:] >> >> We have a bugreport that >> if (C[index]) A[index] =3D exp (x); >> massively slows down things if C[index] is almost never true. > > > Quite understandable. However, unfortunately I cannot think of > any mechanism that already exists in GCC [or any other compiler > the internals of which I am even slightly familiar] to estimate > the probability of the elements of an arbitrary array -- > or [worse yet] of the probability of an arbitrary expression`s > evaluation result -- being convertible to either particular > Boolean value. Perhaps this is feasible if/when "C[...]" is > truly an array, i.e. not a pointer, and the array`s contents > are known at compile time. Otherwise, it seems to require > pointer analysis at best, and is infeasible at worst > [e.g. a pointer received from another translation unit]. > > I think the only thing we can do about this, other than alter our > plans for defaulting the if conversion, is to endeavor to make profiling > [e.g. "gprof"] able to "understand" that a certain piece of code has been > if-converted and able to suggest -- based on profiling -- that the > conversion should be undone b/c it is "costing" more than it is "saving", > even with vectorization, which IMO should be an extremely rare occurrence > if/once we are checking e.g. "exp(x)" [assuming it`s not loop-invariant] > for low cost of evaluation. > > IOW, whatever we have [or will set] the threshold on evaluation cost of > the RHS expression for if conversion of source code like the above example > should, IMO, solve most instances of the abovementioned problem. > The remaining problem cases will likely be something like: > {"exp(x)" is _not_ loop-invariant, > the probability of C[index] being convertible to true is very low, > _and_ the statically-estimated evaluation cost of "exp(x)" > is both under the maximum and too close to that maximum}. > Arguably, barring bugs in the cost estimation, > if this happens too often in real-world code, > then we have set the maximum too high and should adjust it. > > >> Dependent on whether the scratchpad is shared for different sized access= es >> or not the scratchpad may also introduce store hazards in the CPU >> because if scratchpad-as-int =3D ...; follows scratchpad-as-short =3D ..= .; >> the latter only partially kills the preceeding store to the scratchpad >> (usually CPUs can't merge the two requests in the store buffer). > > > It seems to me that the worst that can happen as a result of the preceding > is that non-{in-order} CPUs will be somewhat restricted in how much they > can re-order some of the internal operations that result from the decoding > of the machine-code instructions. This seems to me to not be a terribly = big > problem since the code in question should be in a vectorized loop anyway. > If the preceding causes a long stall on a large # of CPUs, then we > should do something about it, of course. Otherwise, I propose that > the preceding may be a largely-theoretical performance problem > the solving of which doesn`t need to be a high priority > until/unless it is proven to be a very real-world problem. > > >> Not sure which approach to allocating the scratchpad you are using >> currently. > > > We are currently using something at least "morally equivalent" to a singl= e: > > char scratchpad[64]; > > ... for each routine with a conversion that triggers scratchpad generatio= n. > > >> I still have to review the patch itself ... (and so many others) > > > The work of a mother or a GCC maintainer is never done. ;-) > > >> How do you expect to fix the performance regressions? > > > Mostly by the fruit of my labor with help from Sebastian. I`d _really_ l= ike > to get the new converter into trunk ASAP so that there will be many more > "eyes" on the code. Also, it is important IMO that this code should not > languish for months/years [again], since that is what happened approx. 5 > years > ago when Sebastian wrote it in the first place. I don`t think a branch is > the right place for this code, since the branch would allow the code to d= ie. Well. We don't generally introduce regressions with changes. Iff there are no regressions with --param allow-store-data-races=3D1 then I guess it's fine as a wrong-code fix (well, the patch still needs review - I hope to get to that this week). Richard. > Regards, > > Abe