Date: Fri, 17 Jul 2015 20:33:00 -0000
From: Abe
To: Richard Biener, Alan Lawrence, "gcc-patches@gcc.gnu.org"
Subject: Another benefit of the new if converter: better performance for half hammocks when running the generated code on a modern high-speed CPU with write-back caching, relative to the code produced by the old if converter given the same source code

Dear all,

Another benefit of the new if converter that perhaps I neglected to mention/explain...
TLDR: for some source code that can reasonably be expected to exist in "real-world code", when the false/true values of the condition bits of a given "if" in a given loop are very well-clustered, the code produced by the new converter runs _much_ faster on the same inputs than the code produced by the old converter when write-back caching is in effect. The long explanation follows.

In the case of a loop with a "half hammock" that looks something like this:

  if (C[index])  A[index] = foo(bar); // important: no else here

... or, problem-wise equivalently:

  if (C[index])
    ; // empty "then" section
  else
    B[index] = foo(bar);

... the latter of which is semantically equivalent to:

  if (! C[index])  B[index] = foo(bar); // important: no else here

... the old if converter does something that may massively damage performance. Basically, the old if converter converts...

  if (C[index])  A[index] = foo(bar); // important: no else here

... to the equivalent of:

  __compiler_temp = foo(bar);
  A[index] = C[index] ? __compiler_temp : A[index];

For now, let's assume the preceding conversion is valid even in the face of multithreading, since multithreading bugs introduced by an "optimization" are a whole other ball of wax than what this message is about; for now, let's assume that all of A[] is thread-local and that no nasty, sneaky pointer-passing has occurred.

The problem is this: the compiler cannot, in the general case, predict what the values of C[] will be at runtime. Therefore, it cannot [again, in the general case] conclude "this is definitely worth it" or "this is definitely _not_ worth it". All the compiler can do statically, without profiling information, is to assume a probability of 50% that any given element of C[] is equivalent to true -- which, under an assumption of vectorization, suggests that the vectorization factor is going to make the transformation worthwhile. However: what if the values of C[] are mostly equivalent to false, not to true?
For such cases, the old if converter yielded code that may cause a big performance degradation due to the if conversion, even in the presence of vectorization. If we assume that the CPU hardware is not checking whether writes to an address actually change the contents, then each execution of "A[index] = C[index] ? foo(bar) : A[index];" causes a write to occur *_no matter what the value of "C[index]" is/was_*. Now, instead of reading the whole of A[] and writing a tiny fraction of it, the program is reading all of A[] and also (re)writing at least almost all of A[] -- possibly writing all of it even when the probability of the relevant elements of C[] is _not_ 100%, due to the cache-line granularity of writes: when every cache line from A[] is modified, all of A[] will be rewritten.

The preceding problem could be somewhat ameliorated by profiling, provided that the data you run through your program while profiling is a good representation of the data run through the same program by "real executions", but there is no need for that extra work or extra qualification given the existence of the new if converter. Moreover, the profiling approach to "fixing" this problem with the old converter would only yield a binary decision -- "do convert this if" vs. "don't convert this if" -- and in the cases where the decision is to do/retain the conversion, the converted code is still going to rewrite the whole array.

The new converter's conversion, OTOH, can produce better performance than the old converter's conversion in cases where the elements of C[] in our example are very clustered: in other words, the _overall_ probability can still be close [or equal] to the hardest-to-deal-with case of 50%, but there is a large degree of clustering of the "true" values and the "false" values. For example, all the "false" values come first in C[].
In this case, if a profiling-based approach using the old converter decides to do/keep the conversion, then there are still lots of wasted writes that the new converter would avoid, assuming at least one level of write-back cache in the relevant data path.

The key to understanding how/why the new converter's resulting code is better than that of the old converter is this: the new converter uses a scratchpad to "throw away" useless writes. This not only fixes problems with speculative writes through a null pointer that the pre-conversion code never actually performs; it also fixes the above-described potential performance problem, at least on architectures with a write-back data cache, which AFAIK covers most/all modern high-speed CPUs.

The new converter converts something like (same example as one of the above):

  if (C[index])  A[index] = foo(bar); // important: no else here

... into something like:

  /* the type of the scalar goes here */ * pointer = C[index] ? &A[index] : &scratchpad;
  __compiler_temp = foo(bar);
  *pointer = C[index] ? __compiler_temp : scratchpad;

... so that in the event of C[] being mostly false, most of the reads are _from_ the scratchpad and most of the writes are _to_ the scratchpad. This not only probably* saves a lot of the potentially massive number of wasted writes caused by the old conversion rewriting all of A[] no matter what the values of C[]'s elements are; it also saves a lot of reads when there is enough clustering of false values in C[] that entire cache lines' worth of A[] need never be read from main RAM.

Caching and the "chunking effect" of cache lines, besides qualifying how much reading is saved by the new conversion relative to the old one, are also the reason I needed to qualify "probably* saves most/all the [...] wasted writes": it depends on how well or how poorly the values in C[] are clustered.
In the case that is best for the new converter, the new converter saves at least almost all the waste of the old converter's code on both the read path _and_ the write path [modulo at most one cache line of both reads and writes].

I hope the above helps to explain why the new converter is worth merging to trunk, even though for now it still has performance regressions which we expect to fix before the first GCC 6.x release is tagged.

Regards,

Abe