Date: Fri, 17 Jul 2015 20:33:00 -0000
From: Abe
To: Richard Biener, Alan Lawrence, "gcc-patches@gcc.gnu.org"
Subject: Another benefit of the new if converter: better performance for half hammocks when running the generated code on a modern high-speed CPU with write-back caching, relative to the code produced by the old if converter given the same source code

Dear all,

Another benefit of the new if converter that perhaps I neglected to mention/explain...
TLDR: for some source code that can reasonably be expected to exist in "real-world code", when the false/true values of the condition bits of a given "if" in a given loop are very well-clustered, the code produced by the new converter runs _much_ faster on the same inputs than the code produced by the old converter when write-back caching is in effect. The long explanation follows.

In the case of a loop with a "half hammock" that looks something like this:

  if (C[index])  A[index] = foo(bar); // important: no else here

... or, problem-wise equivalently:

  if (C[index])
    ; // empty "then" section
  else
    B[index] = foo(bar);

... the latter of which is semantically equivalent to:

  if (! C[index])  B[index] = foo(bar); // important: no else here

... the old if converter does something that may massively damage performance. Basically, the old if converter converts...

  if (C[index])  A[index] = foo(bar); // important: no else here

... to the equivalent of:

  __compiler_temp = foo(bar);
  A[index] = C[index] ? __compiler_temp : A[index];

For now, let's assume the preceding conversion is valid even in the face of multithreading, since multithreading bugs introduced by an "optimization" are a whole other ball of wax than what this message is about; for now, let's assume that all of A[] is thread-local and that no nasty, sneaky pointer-passing has occurred.

The problem is this: the compiler cannot, in the general case, predict what the values of C[] will be at runtime. Therefore, it cannot [again, in the general case] conclude "this is definitely worth it" or "this is definitely _not_ worth it". All the compiler can do statically, without profiling information, is to assume a probability of 50% that any given element of C[] is equivalent to true -- which, under an assumption of vectorization, suggests that the vectorization factor is going to make the transformation worthwhile. However: what if the values of C[] are mostly equivalent to false, not to true?
For such cases, the old if converter yielded code that may cause a big performance degradation due to the if conversion, even in the presence of vectorization. If we assume that the CPU hardware is not checking whether writes to an address actually change the contents, then each execution of "A[index] = C[index] ? foo(bar) : A[index];" causes a write to occur *_no matter what the value of "C[index]" is/was_*. Now, instead of reading the whole of A[] and writing a tiny fraction of it, the program is reading all of A[] and also (re)writing at least almost all of A[] -- possibly writing all of it even when the probability of the relevant elements of C[] is _not_ 100%, due to the cache-line granularity of writes: when every cache line from A[] is modified, all of A[] will be rewritten.

The preceding problem could be somewhat ameliorated by profiling, provided that the data you run through your program while profiling is a good representation of the data run through the same program by "real executions", but there is no need for that extra work or extra qualification given the existence of the new if converter. Moreover, the profiling approach to "fixing" this problem with the old converter would only yield a binary decision -- "do convert this if" vs. "don't convert this if" -- and in the cases where the decision is to do/retain the conversion, the converted code is still going to rewrite the whole array.

The new converter's conversion, OTOH, can produce better performance than the old converter's conversion in cases where the elements of C[] in our example are very clustered: in other words, the _overall_ probability can still be close [or equal] to the hardest-to-deal-with case of 50%, but there is a large degree of clustering of the "true" values and the "false" values. For example, all the "false" values come first in C[].
In this case, if a profiling-based approach using the old converter decides to do/keep the conversion, then there are still lots of wasted writes that the new converter would avoid, assuming at least one level of write-back cache in the relevant data path.

The key to understanding how/why the new converter's resulting code is better than that of the old converter is this: the new converter uses a scratchpad to "throw away" useless writes. This not only fixes problems with speculative writes through a null pointer that the pre-conversion code never actually performs; it also fixes the above-described potential performance problem, at least on architectures with a write-back data cache, which AFAIK covers most/all modern high-speed CPUs.

The new converter converts something like (same example as one of the above):

  if (C[index])  A[index] = foo(bar); // important: no else here

... into something like:

  /* the type of the scalar goes here */ * pointer = C[index] ? &A[index] : &scratchpad;
  __compiler_temp = foo(bar);
  *pointer = C[index] ? __compiler_temp : scratchpad;

... so that in the event of C[] being mostly false, most of the reads are _from_ the scratchpad and most of the writes are _to_ the scratchpad. This not only probably* saves a lot of the potentially massive number of wasted writes caused by the old conversion rewriting all of A[] no matter what the values of C[]'s elements are; it also saves a lot of reads when there is enough clustering of false values in C[] that entire cache lines' worth of A[] need never be read from main RAM.

Caching and the "chunking effect" of cache lines, besides qualifying how much reading is saved by the new conversion relative to the old one, are also the reason I needed to qualify "probably* saves most/all the [...] wasted writes": it depends on how well or how poorly the values in C[] are clustered.
In the case that is best for the new converter, the new converter saves at least almost all the waste of the old converter's code on both the read path _and_ the write path [modulo at most one cache line of both reads and writes].

I hope the above helps to explain why the new converter is worth merging to trunk, even though for now it still has performance regressions which we expect to fix before the first GCC 6.x release is tagged.

Regards,

Abe