From: Abe
Date: Tue, 28 Jul 2015 18:18:00 -0000
To: Richard Biener
CC: Alan Lawrence, "gcc-patches@gcc.gnu.org", Sebastian Pop
Subject: Re: Another benefit of the new if converter: better performance
 for half hammocks when running the generated code on a modern high-speed
 CPU with write-back caching, relative to the code produced by the old if
 converter given the same source code

[Richard wrote:]
> Note the store to *pointer can be done unconditionally

Yes; if I'm mapping things correctly in my mind, this is something that
Sebastian [and Alan, via email?] and I have already discussed and which
we plan to fix in good time.

Please note that this is a minor problem at most, if/when it is safe to
assume that the target can handle two vectorized conditional operations
in the same loop, since anything remotely resembling an expensive
operation in the [pure] condition should be [is being?] computed once per
index and stored in a temporary.  For example, if the source code looks
something like:

  if ( condition(index) )
    A[index] = foo(index);   /* important: no "else" here */

... then the new converter currently converts it to something like:

  /* an appropriate type goes here */      __compiler_temp_1 = condition(index);
  /* the type of the scalar goes here */   *pointer = __compiler_temp_1 ? &A[index] : &scratchpad;
  /* an appropriate type goes here */      __compiler_temp_2 = foo(index);
                                           *pointer = __compiler_temp_1 ? __compiler_temp_2 : scratchpad;

... so "condition(index)" is evaluated only once per {evaluation that
exists in the source code}.  The fix for this will therefore be a minor
optimization IMO; the benefit will be that in iterations for which the
condition is false, the scratchpad is not needlessly read in order to
derive a value that is then thrown away.
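For concreteness, here is a minimal sketch of the simplified conversion we
have in mind, written as ordinary compilable C with made-up concrete types,
and with "condition" and "foo" left as extern placeholders for the pure
expressions from the example above [the real converter operates on GIMPLE,
not on C source, so this is an illustration of intent only]:

  /* Sketch only: the condition still selects the store target, but the
     store through the selected pointer is unconditional, so the
     scratchpad is only ever written, never read.  */

  extern _Bool  condition (long index);   /* the pure, cheap condition       */
  extern double foo       (long index);   /* the pure, cheap right-hand side */

  void converted_half_hammock (double *A, long n)
  {
    double scratchpad;                    /* sink for results to be discarded */

    for (long index = 0; index < n; index++)
      {
        _Bool   temp_1 = condition (index);
        double *ptr    = temp_1 ? &A[index] : &scratchpad;
        double  temp_2 = foo (index);     /* still evaluated on every iteration */

        *ptr = temp_2;                    /* unconditional store                */
      }
  }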
Always throwing away the unneeded result of evaluating "foo(index)" is good
enough, and by removing an unneeded conditional expression the burden on
the vectorizer is reduced: it then needs only {vectorized decision followed
by vectorized store} in each such loop, not {vectorized decision followed
by vectorized decision followed by vectorized store}.  [I am intentionally
omitting whatever else it must do in a vectorized manner in the same loop.]
This is something we [Sebastian and I] plan on fixing eventually anyway,
i.e. regardless of whether or not it fixes a test case we already have.

[Richard wrote:]
> and note that another performance issue of if-conversion
> is that foo(bar) is executed unconditionally.

AFAIK this is a fundamental limitation/necessity of if conversion.  A
fundamental assumption/requirement here is that "foo(bar)"/"foo(index)" is
both pure and low-cost.  [I've renamed the example to "foo(index)" to show
that it is not loop-invariant; if it were, LICM should make the repeated
evaluations unneeded, and they would probably not happen anyway -- unless
you are targeting a VLIW ISA and would have an unused slot in the
instruction word if you did LICM on the sub-instruction in question.]

If "foo(index)" is not being checked for purity, then we have a
correctness bug.  If "foo(index)" is not being checked for low evaluation
cost, then we have a performance bug IMO.  The compiler should use its
existing cost-estimation mechanism[s] to make an educated guess at the
cost of "foo(index)" and intentionally not do if conversion if/when {the
predicted cost of evaluating "foo(index)" on every iteration, regardless
of the condition bits} is too high even in the presence of vectorization.

[Richard wrote:]
> We have a bugreport that
>   if (C[index]) A[index] = exp (x);
> massively slows down things if C[index] is almost never true.

Quite understandable.  However, unfortunately I cannot think of any
mechanism that already exists in GCC [or in any other compiler with whose
internals I am even slightly familiar] to estimate the probability that
the elements of an arbitrary array -- or, worse yet, the result of an
arbitrary expression -- will convert to one particular Boolean value
rather than the other.  Perhaps this is feasible if/when "C[...]" is truly
an array, i.e. not a pointer, and the array's contents are known at
compile time.  Otherwise, it seems to require pointer analysis at best,
and is infeasible at worst [e.g. for a pointer received from another
translation unit].

I think the only thing we can do about this, other than alter our plans
for enabling the if conversion by default, is to endeavor to make
profiling [e.g. "gprof"] able to "understand" that a certain piece of code
has been if-converted, and able to suggest -- based on profiling -- that
the conversion should be undone because it is "costing" more than it is
"saving", even with vectorization.  IMO that should be an extremely rare
occurrence if/once we are checking e.g. "exp(x)" [assuming it is not
loop-invariant] for low cost of evaluation.  IOW, wherever we have set
[or will set] the threshold on the evaluation cost of the RHS expression
for if conversion of source code like the above example, that threshold
should, IMO, take care of most instances of the abovementioned problem.
The remaining problem cases will likely be something like: {"exp(x)" is
_not_ loop-invariant, the probability of C[index] being convertible to
true is very low, _and_ the statically-estimated evaluation cost of
"exp(x)" is both under the maximum and too close to that maximum}; see
the sketch below.
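To make that concrete, here is what such a case looks like after
conversion [illustration only: the names and types are made up, x[index]
is used so that the call is clearly not loop-invariant, and the real
converter emits GIMPLE rather than C]:

  #include <math.h>

  /* After if conversion, exp() runs unconditionally on every iteration;
     only the store target depends on C[index].  When C[index] is almost
     never true, nearly all of that work is wasted relative to the
     original scalar loop, which would have skipped the call.  */
  void converted_rarely_true (double *A, const int *C, const double *x, long n)
  {
    double scratchpad;                                      /* sink for discarded results    */

    for (long index = 0; index < n; index++)
      {
        double  temp = exp (x[index]);                      /* unconditional; usually wasted */
        double *ptr  = C[index] ? &A[index] : &scratchpad;  /* conditional address selection */

        *ptr = temp;                                        /* unconditional store           */
      }
  }

Whether this is worth it depends entirely on how often C[index] is true
and on how expensive the unconditional computation is -- which is exactly
what the cost threshold is meant to police.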
Arguably, barring bugs in the cost estimation, if such cases occur too
often in real-world code, then we have set the maximum too high and should
adjust it.

> Dependent on whether the scratchpad is shared for different sized accesses
> or not the scratchpad may also introduce store hazards in the CPU
> because if scratchpad-as-int = ...; follows scratchpad-as-short = ...;
> the latter only partially kills the preceeding store to the scratchpad
> (usually CPUs can't merge the two requests in the store buffer).

It seems to me that the worst that can happen as a result of the preceding
is that out-of-order CPUs will be somewhat restricted in how much they can
re-order some of the internal operations that result from decoding the
machine-code instructions.  This seems to me to not be a terribly big
problem, since the code in question should be in a vectorized loop anyway.
If the preceding causes a long stall on a large number of CPUs, then we
should do something about it, of course.  Otherwise, I propose that this
may be a largely-theoretical performance problem whose solution does not
need to be a high priority until/unless it is proven to be a real-world
problem.

> Not sure which approach to allocating the scratchpad you are using currently.

We are currently using something at least "morally equivalent" to a single:

  char scratchpad[64];

... for each routine with a conversion that triggers scratchpad generation.

> I still have to review the patch itself ... (and so many others)

The work of a mother or a GCC maintainer is never done.  ;-)

> How do you expect to fix the performance regressions?

Mostly by the fruit of my labor, with help from Sebastian.  I'd _really_
like to get the new converter into trunk ASAP so that there will be many
more "eyes" on the code.  Also, it is important IMO that this code not
languish for months/years [again], since that is what happened approx.
5 years ago when Sebastian wrote it in the first place.  I don't think a
branch is the right place for this code, since a branch would allow the
code to die.

Regards,

Abe