From: Abe
Date: Tue, 28 Jul 2015 18:18:00 -0000
To: Richard Biener
CC: Alan Lawrence, "gcc-patches@gcc.gnu.org", Sebastian Pop
Subject: Re: Another benefit of the new if converter: better performance
 for half hammocks when running the generated code on a modern high-speed
 CPU with write-back caching, relative to the code produced by the old if
 converter given the same source code

[Richard wrote:]
> Note the store to *pointer can be done unconditionally

Yes; if I'm mapping things correctly in my mind, this is something that
Sebastian [and Alan, via email?] and I have already discussed and which
we plan to fix in good time.

Please note that this is a minor problem at most, if/when it is safe to
assume that the target can handle two vectorized conditional operations
in the same loop, since anything remotely resembling an expensive
operation in the [pure] condition should be [is being?] computed once per
index and stored in a temporary.  For example, if the source code looks
something like:

  if ( condition(index) )
    A[index] = foo(index);   /* important: no "else" here */

... then the new converter currently converts it to something like:

  /* an appropriate type goes here */      __compiler_temp_1 = condition(index);
  /* the type of the scalar goes here */   *pointer = __compiler_temp_1 ? &A[index] : &scratchpad;
  /* an appropriate type goes here */      __compiler_temp_2 = foo(index);
                                           *pointer = __compiler_temp_1 ? __compiler_temp_2 : scratchpad;

... so "condition(index)" is evaluated only once per {evaluation that
exists in the source code}.  The fix for this will therefore be a minor
optimization IMO; the benefit will be that in iterations for which the
condition is false, the scratchpad is not needlessly read in order to
derive a value that is then thrown away.
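For concreteness, here is a minimal sketch of the simplified conversion we
have in mind, written as ordinary compilable C with made-up concrete types,
and with "condition" and "foo" left as extern placeholders for the pure
expressions from the example above [the real converter operates on GIMPLE,
not on C source, so this is an illustration of intent only]:

  /* Sketch only: the condition still selects the store target, but the
     store through the selected pointer is unconditional, so the
     scratchpad is only ever written, never read.  */

  extern _Bool  condition (long index);   /* the pure, cheap condition       */
  extern double foo       (long index);   /* the pure, cheap right-hand side */

  void converted_half_hammock (double *A, long n)
  {
    double scratchpad;                    /* sink for results to be discarded */

    for (long index = 0; index < n; index++)
      {
        _Bool   temp_1 = condition (index);
        double *ptr    = temp_1 ? &A[index] : &scratchpad;
        double  temp_2 = foo (index);     /* still evaluated on every iteration */

        *ptr = temp_2;                    /* unconditional store                */
      }
  }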
Always throwing away the unneeded result of evaluating "foo(index)" is good
enough, and by removing an unneeded conditional expression the burden on
the vectorizer is reduced: it then needs only {vectorized decision followed
by vectorized store} in each such loop, not {vectorized decision followed
by vectorized decision followed by vectorized store}.  [I am intentionally
omitting whatever else it must do in a vectorized manner in the same loop.]
This is something we [Sebastian and I] plan on fixing eventually anyway,
i.e. regardless of whether or not it fixes a test case we already have.

[Richard wrote:]
> and note that another performance issue of if-conversion
> is that foo(bar) is executed unconditionally.

AFAIK this is a fundamental limitation/necessity of if conversion.  A
fundamental assumption/requirement here is that "foo(bar)"/"foo(index)" is
both pure and low-cost.  [I've renamed the example to "foo(index)" to show
that it is not loop-invariant; if it were, LICM should make the repeated
evaluations unneeded, and they would probably not happen anyway -- unless
you are targeting a VLIW ISA and would have an unused slot in the
instruction word if you did LICM on the sub-instruction in question.]

If "foo(index)" is not being checked for purity, then we have a
correctness bug.  If "foo(index)" is not being checked for low evaluation
cost, then we have a performance bug IMO.  The compiler should use its
existing cost-estimation mechanism[s] to make an educated guess at the
cost of "foo(index)" and intentionally not do if conversion if/when {the
predicted cost of evaluating "foo(index)" on every iteration, regardless
of the condition bits} is too high even in the presence of vectorization.

[Richard wrote:]
> We have a bugreport that
>   if (C[index]) A[index] = exp (x);
> massively slows down things if C[index] is almost never true.

Quite understandable.  However, unfortunately I cannot think of any
mechanism that already exists in GCC [or in any other compiler with whose
internals I am even slightly familiar] to estimate the probability that
the elements of an arbitrary array -- or, worse yet, the result of an
arbitrary expression -- will convert to one particular Boolean value
rather than the other.  Perhaps this is feasible if/when "C[...]" is truly
an array, i.e. not a pointer, and the array's contents are known at
compile time.  Otherwise, it seems to require pointer analysis at best,
and is infeasible at worst [e.g. for a pointer received from another
translation unit].

I think the only thing we can do about this, other than alter our plans
for enabling the if conversion by default, is to endeavor to make
profiling [e.g. "gprof"] able to "understand" that a certain piece of code
has been if-converted, and able to suggest -- based on profiling -- that
the conversion should be undone because it is "costing" more than it is
"saving", even with vectorization.  IMO that should be an extremely rare
occurrence if/once we are checking e.g. "exp(x)" [assuming it is not
loop-invariant] for low cost of evaluation.  IOW, wherever we have set
[or will set] the threshold on the evaluation cost of the RHS expression
for if conversion of source code like the above example, that threshold
should, IMO, take care of most instances of the abovementioned problem.
The remaining problem cases will likely be something like: {"exp(x)" is
_not_ loop-invariant, the probability of C[index] being convertible to
true is very low, _and_ the statically-estimated evaluation cost of
"exp(x)" is both under the maximum and too close to that maximum}; see
the sketch below.
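To make that concrete, here is what such a case looks like after
conversion [illustration only: the names and types are made up, x[index]
is used so that the call is clearly not loop-invariant, and the real
converter emits GIMPLE rather than C]:

  #include <math.h>

  /* After if conversion, exp() runs unconditionally on every iteration;
     only the store target depends on C[index].  When C[index] is almost
     never true, nearly all of that work is wasted relative to the
     original scalar loop, which would have skipped the call.  */
  void converted_rarely_true (double *A, const int *C, const double *x, long n)
  {
    double scratchpad;                                      /* sink for discarded results    */

    for (long index = 0; index < n; index++)
      {
        double  temp = exp (x[index]);                      /* unconditional; usually wasted */
        double *ptr  = C[index] ? &A[index] : &scratchpad;  /* conditional address selection */

        *ptr = temp;                                        /* unconditional store           */
      }
  }

Whether this is worth it depends entirely on how often C[index] is true
and on how expensive the unconditional computation is -- which is exactly
what the cost threshold is meant to police.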
Arguably, barring bugs in the cost estimation, if such cases occur too
often in real-world code, then we have set the maximum too high and should
adjust it.

> Dependent on whether the scratchpad is shared for different sized accesses
> or not the scratchpad may also introduce store hazards in the CPU
> because if scratchpad-as-int = ...; follows scratchpad-as-short = ...;
> the latter only partially kills the preceeding store to the scratchpad
> (usually CPUs can't merge the two requests in the store buffer).

It seems to me that the worst that can happen as a result of the preceding
is that out-of-order CPUs will be somewhat restricted in how much they can
re-order some of the internal operations that result from decoding the
machine-code instructions.  This seems to me to not be a terribly big
problem, since the code in question should be in a vectorized loop anyway.
If the preceding causes a long stall on a large number of CPUs, then we
should do something about it, of course.  Otherwise, I propose that this
may be a largely-theoretical performance problem whose solution does not
need to be a high priority until/unless it is proven to be a real-world
problem.

> Not sure which approach to allocating the scratchpad you are using currently.

We are currently using something at least "morally equivalent" to a single:

  char scratchpad[64];

... for each routine with a conversion that triggers scratchpad generation.

> I still have to review the patch itself ... (and so many others)

The work of a mother or a GCC maintainer is never done.  ;-)

> How do you expect to fix the performance regressions?

Mostly by the fruit of my labor, with help from Sebastian.  I'd _really_
like to get the new converter into trunk ASAP so that there will be many
more "eyes" on the code.  Also, it is important IMO that this code not
languish for months/years [again], since that is what happened approx.
5 years ago when Sebastian wrote it in the first place.  I don't think a
branch is the right place for this code, since a branch would allow the
code to die.

Regards,

Abe