public inbox for gcc-patches@gcc.gnu.org
* Another benefit of the new if converter: better performance for half hammocks when running the generated code on a modern high-speed CPU with write-back caching, relative to the code produced by the old if converter given the same source code
@ 2015-07-17 20:33 Abe
  2015-07-28 10:27 ` Richard Biener
  0 siblings, 1 reply; 7+ messages in thread
From: Abe @ 2015-07-17 20:33 UTC (permalink / raw)
  To: Richard Biener, Alan Lawrence, gcc-patches

Dear all,

Another benefit of the new if converter that perhaps I neglected to mention/explain...

TLDR: for some source code that can reasonably be expected to exist in "real-world code",
when the false/true values of the condition bits of a given "if" in a given loop are very
well-clustered, the code produced by the new converter runs _much_ faster on the same
inputs than the code produced by the old converter when write-back caching is in effect.

The long explanation follows.



In the case of a loop with a "half-hammock" that looks something like this:

   if (C[index])  A[index] = foo(bar);
   // important: no else here

... or problem-wise-equivalently:

   if (C[index])  ; // empty "then" section
   else           B[index] = foo(bar);

... the latter of which is semantically equivalent to:

   if (! C[index])  B[index] = foo(bar);
   // important: no else here


... the old if converter does something that may massively damage performance.

Basically, the old if converter converts...

   if (C[index])  A[index] = foo(bar);
   // important: no else here

... to the equivalent of:

   __compiler_temp = foo(bar);
   A[index] = C[index] ? __compiler_temp : A[index];


For now, let's assume the preceding conversion is valid even in the face of multithreading,
since multithreading bugs introduced by an "optimization" are a different
ball of wax from what this message is about; for now, let's assume that
all of A[] is thread-local and that no nasty, sneaky pointer-passing has occurred.

The problem is this: the compiler cannot, in the general case, predict what the values of C[]
will be at runtime.  Therefore, it cannot [again, in the general case] conclude either
"this is definitely worth it" or "this is definitely _not_ worth it".  All the compiler can do
statically, without profiling information, is to assume a probability of 50% for the
elements of C[] being equivalent to true -- in which case, under an assumption of
vectorization, the vectorization factor makes the transformation look worthwhile.

However: what if the values of C[] are mostly equivalent to false, not to true?  For such
cases, the old if converter yields code that may cause a big performance degradation due to
the if conversion, even in the presence of vectorization.  If we assume that the CPU hardware
does not check whether a write to an address actually changes the contents, then each execution
of "A[index] = C[index] ? foo(bar) : A[index];" causes a write to occur *_no matter
what the value of "C[index]" is/was_*.  Now, instead of reading the whole of A[] and writing
a tiny fraction of it, the program reads all of A[] and also (re)writes at least
almost all of A[] (possibly all of it, even when the probability of the
relevant elements of C[] being true is _not_ 100%, due to the cache-line granularity of writes:
when every cache line from A[] is modified, all of A[] will be rewritten).

The preceding problem could be somewhat ameliorated by profiling, provided that the data
you run through your program while profiling is a good representation of the data run
through the same program by "real executions", but there is no need for that extra work
or extra qualification given the existence of the new if converter.  Plus, the profiling
approach to "fixing" this problem with the old converter would only yield a binary
decision -- "do convert this if" vs. "don't convert this if" -- and in cases where the
decision is to do/retain the conversion, the converted code still rewrites the whole
array.  The new converter's conversion, OTOH, can produce better performance than the
conversion from the old converter in cases where the elements of C[] in our example are
strongly clustered: in other words, the _overall_ probability can still be close [or equal] to the
hardest-to-deal-with challenge of 50%, but with a large degree of clustering of the
"true" values and the "false" values -- for example, all the "false" values come first in C[].
In this case, if a profiling-based approach using the old converter decides to do/keep
the conversion, then there are still lots of wasted writes that the new converter
would avoid, assuming at least one level of write-back cache in the relevant data path.

The key to understanding how/why the new converter's resulting code
is better than that of the old converter is this: the new converter uses a scratchpad
to "throw away" useless writes.  This not only fixes correctness problems with speculative writes
through a null pointer which the pre-conversion code never actually performs, it also fixes
the above-described potential performance problem, at least on architectures with
a write-back data cache, which AFAIK covers most/all modern high-speed CPUs.
The new converter converts something like (same example as one of the above):

   if (C[index])  A[index] = foo(bar);
   // important: no else here

... into something like:

   /* the type of the scalar goes here */ * pointer = C[index] ? &A[index] : &scratchpad;
   __compiler_temp = foo(bar);
   *pointer = C[index] ? __compiler_temp : scratchpad;

... so that in the event of C[] being mostly false, most of the reads are _from_ the scratchpad
and most of the writes are _to_ the scratchpad.  This not only probably* saves a lot of the
potentially-massive number of wasted writes caused by the old conversion rewriting all of A[]
no matter what the values of C[]'s elements are, it also saves a lot of reads when there is
enough clustering of false values in C[] that entire cache lines' worth of A[] need not be
read from main RAM.  Caching and the "chunking effect" of cache lines, besides qualifying how
much reading the new conversion saves relative to the old one, are also the reason I needed
to qualify "probably* saves most/all the [...] wasted writes": it depends on how well or
how poorly the values in C[] are clustered.  In the case that is best for the new converter,
the new converter saves at least almost all the waste of the old converter's code on both
the read path _and_ the write path [modulo at most one cache line of both reads and writes].


I hope the above helps to explain why the new converter is worth merging to trunk,
even though for now it still has performance regressions which
we expect to fix before the first GCC 6.x is tagged for release.


Regards,

Abe

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Another benefit of the new if converter: better performance for half hammocks when running the generated code on a modern high-speed CPU with write-back caching, relative to the code produced by the old if converter given the same source code
  2015-07-17 20:33 Another benefit of the new if converter: better performance for half hammocks when running the generated code on a modern high-speed CPU with write-back caching, relative to the code produced by the old if converter given the same source code Abe
@ 2015-07-28 10:27 ` Richard Biener
  2015-07-28 18:18   ` Abe
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Biener @ 2015-07-28 10:27 UTC (permalink / raw)
  To: Abe; +Cc: Alan Lawrence, gcc-patches

On Fri, Jul 17, 2015 at 10:12 PM, Abe <abe_skolnik@yahoo.com> wrote:
> Dear all,
>
> Another benefit of the new if converter that perhaps I neglected to
> mention/explain...
>
> TLDR: for some source code which can reasonably be expected to exist in
> "real-world code",
> when the false/true values of the condition bits of a given "if" in a given
> loop are very
> well-clustered, the code produced by the new converter runs _much_ faster
> for the same
> inputs than the code produced by the old converter when write-back caching
> is in effect.
>
> The long explanation follows.
>
>
>
> In the case of a loop with a "half-hammock" that looks something like this:
>
>   if (C[index])  A[index] = foo(bar);
>   // important: no else here
>
> ... or problem-wise-equivalently:
>
>   if (C[index])  ; // empty "then" section
>   else           B[index] = foo(bar);
>
> ... the latter of which is semantically equivalent to:
>
>   if (! C[index])  B[index] = foo(bar);
>   // important: no else here
>
>
> ... the old if converter does something that may massively damage
> performance.
>
> Basically, the old if converter converts...
>
>   if (C[index])  A[index] = foo(bar);
>   // important: no else here
>
> ... to the equivalent of:
>
>   __compiler_temp = foo(bar);
>   A[index] = C[index] ? __compiler_temp : A[index];
>
>
> For now, let`s assume the preceding conversion is valid even in the face of
> multithreading,
> since multithreading bugs introduced by an "optimization" are a whole other
> ball of wax than what this message is all about; for now, let`s assume that
> all of A[] is thread-local and no nasty, sneaky pointer-passing has
> occurred.
>
> The problem is this: the compiler cannot, in the general case, predict what
> the values of C[]
> will be at runtime.  Therefor, it cannot [again, in the general case] arrive
> at a conclusion
> "this is definitely worth it" or "this is definitely _not_ worth it".  All
> the compiler can do
> statically without profiling information is to say "I guess a probability of
> 50% on the
> elements of C[] being equivalent to true", which -- under an assumption of
> vectorization --
> means that the vectorization factor is going to make the transformation
> worthwhile.
>
> However: what if the values of C[] are mostly equivalent to false, not to
> true?  For such
> cases, the old if converter yielded code that may cause a big performance
> degradation due to
> the if conversion, even in the presence of vectorization.  If we assume that
> the CPU hardware
> is not checking to see whether writes to an address change the contents,
> then each execution
> of "A[index] = C[index] ? foo(bar) : A[index];" is causing a write to occur
> *_no matter
> what the value of "C[index]" is/was_*.  Now, instead of reading the whole
> A[] and writing
> a tiny fraction of it, the program is reading all of A[] and also
> (re)writing at least
> almost all of A[] (possibly writing all of it even when the probability of
> the
> relevant elements of C[] is _not_ 100%, due to cache-line granularity of
> writes:
> when every cache line from A[] is modified, all of A[] will be rewritten).
>
> The preceding problem could be somewhat ameliorated by profiling, providing
> that the data
> you run through your program while profiling is a good representation of the
> data run
> through the same program by "real executions", but there is no need for that
> extra work
> or extra qualification given the existence of the new if converter.  Plus,
> the profiling
> approach to "fixing" this problem with the old converter would only result
> in a binary
> decision -- "do convert this if" vs. "don`t convert this if" -- which in
> cases where the
> decision is to do/retain the conversion, the converted code is going to
> rewrite the whole
> array.  The new converter`s conversion, OTOH, can produce better performance
> than the
> conversion from the old converter in cases where the elements of C[] in our
> example are
> very clustered: in other words, the _overall_ probability can still be close
> [or =] to the
> hardest-to-deal-with challenge of 50%, but there is a large degree of
> clustering of the
> "true" values and the "false" values.  For example, all the "false" values
> come first in C[].
> In this case, if a profiling-based approach using the old converter decides
> to do/keep
> the conversion, then there are still lots of wasted writes that the new
> converter
> would avoid, assuming at least one level of write-back cache in the relevant
> data path.
>
> The key factor to understand for understanding how/why the new converter`s
> resulting code
> is better than that of the old converter is this: the new converter uses a
> scratchpad
> to "throw away" useless writes.  This not only fixes problems with
> speculative writes
> through a null pointer that the pre-conversion code never actually does, it
> also fixes
> the above-described potential performance problem, at least on architectures
> with
> write-back data cache, which AFAIK covers most/all modern high-speed CPUs.
> The new converter converts something like (same example as one of the
> above):
>
>   if (C[index])  A[index] = foo(bar);
>   // important: no else here
>
> ... into something like:
>
>   /* the type of the scalar goes here */ * pointer = C[index] ? &A[index] :
> &scratchpad;
>   __compiler_temp = foo(bar);
>   *pointer = C[index] ? __compiler_temp : scratchpad;

Note the store to *pointer can be done unconditionally

  *pointer = __compiler_temp;

and note that another performance issue of if-conversion is that foo(bar)
is executed unconditionally.  We have a bugreport that

  if (C[index]) A[index] = exp (x);

massively slows down things if C[index] is almost never true.  So it
is not only the store to A[] but in some cases even more so the
unconditional execution of foo(bar).  Of course the new if-converter
doesn't fix that.

Depending on whether or not the scratchpad is shared for different-sized accesses,
the scratchpad may also introduce store hazards in the CPU,
because if scratchpad-as-short = ...; follows scratchpad-as-int = ...;
the narrower store only partially kills the preceding store to the scratchpad
(usually CPUs can't merge the two requests in the store buffer).

Not sure which approach to allocating the scratchpad you are using currently.
I still have to review the patch itself ... (and so many others)

> ... so that in the event of C[] being mostly false, most of the reads are
> _from_ the scratchpad
> and most of the writes are _to_ the scratchpad.  This not only probably*
> saves a lot of the
> potentially-massive number of wasted writes due to the old conversion
> rewriting all of A[]
> no matter what the values of C[]`s elements are, it also saves a lot of
> reads when there is
> enough clustering of false values in C[] that entire cache lines worth of
> A[] can be not
> read from main RAM.  Caching and the "chunking effect" of cache lines,
> besides qualifying how
> much reading is saved by the new conversion relative to the old one, also is
> the reason I needed
> to qualify "probably* saves most/all the [...] wasted writes": it depends on
> how well or
> how poorly the values in C[] are clustered.  In the case that is best for
> the new converter,
> the new converter saves at least almost all the waste of the old converter`s
> code on both
> the read path _and_ the write path [modulo at most one cache line of both
> read and write].
>
>
> I hope the above helps to explain why the new converter is worth merging to
> trunk,
> even though for now it still has performance regressions which
> we expect to fix before the first GCC 6.x is tagged for release.

How do you expect to fix the performance regressions?

Richard.

>
> Regards,
>
> Abe


* Re: Another benefit of the new if converter: better performance for half hammocks when running the generated code on a modern high-speed CPU with write-back caching, relative to the code produced by the old if converter given the same source code
  2015-07-28 10:27 ` Richard Biener
@ 2015-07-28 18:18   ` Abe
  2015-07-29  9:51     ` Richard Biener
  0 siblings, 1 reply; 7+ messages in thread
From: Abe @ 2015-07-28 18:18 UTC (permalink / raw)
  To: Richard Biener; +Cc: Alan Lawrence, gcc-patches, Sebastian Pop

[Richard wrote:]
> Note the store to *pointer can be done unconditionally

Yes; if I'm mapping things correctly in my mind, this is
something that Sebastian [and Alan, via email?] and I have
already discussed and which we plan to fix in good time.

Please note that this is a minor problem at most,
if/when it is safe to assume that the target can handle
two vectorized conditional operations in the same loop,
since anything remotely resembling an expensive
operation in the [pure] condition should be [is being?]
computed once per index and stored in a temporary.
For example: if the source code looks something like:

   if ( condition(index) )  A[index] = foo(index);
   // important: no else here

... then the new converter currently converts it to something like:

   /* an appropriate type goes here */ __compiler_temp_1 = condition(index);
   /* the type of the scalar goes here */ * pointer = __compiler_temp_1 ? &A[index] : &scratchpad;
   /* an appropriate type goes here */ __compiler_temp_2 = foo(index);
   *pointer = __compiler_temp_1 ? __compiler_temp_2 : scratchpad;

… so "condition(index)" is being evaluated only once
per {evaluation that exists in the source code}.

The fix for this would/will therefor be a minor optimization IMO;
the benefit would/will be that in/for iterations/columns
for which the condition is false, the scratchpad will not be
needlessly read from in order to derive the value to throw away.
Always throwing away the unneeded result of evaluating "foo(index)"
is good enough, and by removing an unneeded conditional expression
the burden on the vectorizer is reduced: it now only needs:
   {vectorized decision followed by vectorized store}
in each such loop, not:
   {vectorized decision followed by vectorized decision followed by vectorized store}.
[intentionally omitting whatever else it must do
  in a vectorized manner in the same loop]

This is something we [Sebastian and I] plan on fixing eventually anyway,
i.e. regardless of whether or not it fixes a test case we already have.


[Richard wrote:]
> and note that another performance issue of if-conversion
> is  that foo(bar) is executed unconditionally.

AFAIK this is a fundamental limitation/necessity of if conversion.

A fundamental assumption/requirement here is that "foo(bar)"/"foo(index)"
is/are both pure and low-cost.  [I've renamed the example to "foo(index)"
to show that it's not loop-invariant; if it were, then LICM should
make multiple evaluations of it unneeded, and they would probably not happen
unless you are targeting a VLIW ISA and have an unused slot in the
instruction word after doing LICM on the sub-instruction in question.]

If "foo(index)" is not being checked for purity,
then we have a correctness bug.

If "foo(index)" is not being checked for low evaluation cost,
then we have a performance bug IMO.  The compiler should use its
existing estimation mechanism[s] to make an educated guess on
the cost of "foo(index)" and intentionally not do if conversion
if/when {the predicted cost of evaluating "foo(index)"
          for each iteration regardless of the condition bits}
is too high even in the presence of vectorization.


[Richard wrote:]
> We have a bugreport that
>    if (C[index]) A[index] = exp (x);
> massively slows down things if C[index] is almost never true.

Quite understandable.  However, unfortunately I cannot think of
any mechanism that already exists in GCC [or in any other compiler
with whose internals I am even slightly familiar] to estimate
the probability of the elements of an arbitrary array --
or [worse yet] the probability of an arbitrary expression's
evaluation result -- being convertible to either particular
Boolean value.  Perhaps this is feasible if/when "C[...]" is
truly an array, i.e. not a pointer, and the array's contents
are known at compile time.  Otherwise, it seems to require
pointer analysis at best, and is infeasible at worst
[e.g. a pointer received from another translation unit].

I think the only thing we can do about this, other than altering our
plans for defaulting the if conversion, is to endeavor to make profiling
[e.g. "gprof"] able to "understand" that a certain piece of code has been
if-converted, and able to suggest -- based on profiling -- that the
conversion should be undone because it is "costing" more than it is "saving",
even with vectorization; IMO that should be an extremely rare occurrence
if/once we are checking e.g. "exp(x)" [assuming it's not loop-invariant]
for low cost of evaluation.

IOW, whatever threshold we have [or will set] on the evaluation cost of
the RHS expression for if conversion of source code like the above example
should, IMO, solve most instances of the abovementioned problem.
The remaining problem cases will likely be something like:
   {"exp(x)" is _not_ loop-invariant,
    the probability of C[index] being convertible to true is very low,
    _and_ the statically-estimated evaluation cost of "exp(x)"
    is both under the maximum and too close to that maximum}.
Arguably, barring bugs in the cost estimation,
if this happens too often in real-world code,
then we have set the maximum too high and should adjust it.


> Dependent on whether the scratchpad is shared for different sized accesses
> or not the scratchpad may also introduce store hazards in the CPU
> because if scratchpad-as-int = ...; follows scratchpad-as-short = ...;
> the latter only partially kills the preceeding store to the scratchpad
> (usually CPUs can't merge the two requests in the store buffer).

It seems to me that the worst that can happen as a result of the preceding
is that out-of-order CPUs will be somewhat restricted in how much they
can re-order some of the internal operations that result from the decoding
of the machine-code instructions.  This seems to me not to be a terribly big
problem, since the code in question should be in a vectorized loop anyway.
If the preceding causes a long stall on a large number of CPUs, then we
should do something about it, of course.  Otherwise, I propose that
the preceding may be a largely-theoretical performance problem,
the solving of which doesn't need to be a high priority
until/unless it is proven to be a very real-world problem.


> Not sure which approach to allocating the scratchpad you are using currently.

We are currently using something at least "morally equivalent" to a single:

   char scratchpad[64];

... for each routine with a conversion that triggers scratchpad generation.


> I still have to review the patch itself ... (and so many others)

The work of a mother or a GCC maintainer is never done.  ;-)


> How do you expect to fix the performance regressions?

Mostly by the fruit of my labor, with help from Sebastian.  I'd _really_ like
to get the new converter into trunk ASAP so that there will be many more
"eyes" on the code.  Also, it is important IMO that this code not
languish for months/years [again], since that is what happened approx. 5 years
ago when Sebastian wrote it in the first place.  I don't think a branch is
the right place for this code, since the branch would allow the code to die.

Regards,

Abe


* Re: Another benefit of the new if converter: better performance for half hammocks when running the generated code on a modern high-speed CPU with write-back caching, relative to the code produced by the old if converter given the same source code
  2015-07-28 18:18   ` Abe
@ 2015-07-29  9:51     ` Richard Biener
  2015-07-29 17:19       ` Abe
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Biener @ 2015-07-29  9:51 UTC (permalink / raw)
  To: Abe; +Cc: Alan Lawrence, gcc-patches, Sebastian Pop

On Tue, Jul 28, 2015 at 7:57 PM, Abe <abe_skolnik@yahoo.com> wrote:
> [Richard wrote:]
>>
>> Note the store to *pointer can be done unconditionally
>
>
> Yes; if I`m mapping things correctly in my mind, this is
> something that Sebastian [and Alan, via email?] and I have
> already discussed and which we plan to fix in good time.
>
> Please note that this is a minor problem at most,
> if/when it it safe to assume that the target can handle
> two vectorized conditional operations in the same loop,
> since anything remotely resembling an expensive
> operation in the [pure] condition should be [is being?]
> computed once per index and stored in a temporary.
> For example: if the source code looks something like:
>
>   if ( condition(index) )  A[index] = foo(index);
>   // important: no else here
>
> ... then the new converter currently converts it to something like:
>
>   /* an appropriate type goes here */ __compiler_temp_1 = condition(index);
>   /* the type of the scalar goes here */ * pointer = __compiler_temp_1 ?
> &A[index] : &scratchpad;
>   /* an appropriate type goes here */ __compiler_temp_2 = foo(index);
>   *pointer = __compiler_temp_1 ? __compiler_temp_2 : scratchpad;
>
> … so "condition(index)" is being evaluated only once
> per {evaluation that exists in the source code}.
>
> The fix for this would/will therefor be a minor optimization IMO;
> the benefit would/will be that in/for iterations/columns
> for which the condition is false, the scratchpad will not be
> needlessly read from in order to derive the value to throw away.
> Always throwing away the unneeded result of evaluating "foo(index)"
> is good enough, and by removing an unneeded conditional expression
> the burden on the vectorizer is reduced: it now only needs:
>   {vectorized decision followed by vectorized store}
> in each such loop, not:
>   {vectorized decision followed by vectorized decision followed by
> vectorized store}.
> [intentionally omitting whatever else it must do
>  in a vectorized manner in the same loop]
>
> This is something we [Sebastian and I] plan on fixing eventually anyway,
> i.e. regardless of whether or not it fixes a test case we already have.
>
>
> [Richard wrote:]
>>
>> and note that another performance issue of if-conversion
>> is  that foo(bar) is executed unconditionally.
>
>
> AFAIK this is a fundamental limitation/necessity of if conversion.
>
> A fundamental assumption/requirement here is that "foo(bar)"/"foo(index)"
> is/are both pure and low-cost.  [I`ve renamed the example to "foo(index)"
> to show that it`s not loop-invariant, since if it were then LICM should
> make multiple evaluations of it unneeded and probably not going to happen
> unless you are targeting a VLIW ISA and have an unused slot in the
> instruction word if you do LICM on the sub-instruction in question.]
>
> If "foo(index)" is not being checked for purity,
> then we have a correctness bug.
>
> If "foo(index)" is not being checked for low evaluation cost,
> then we have a performance bug IMO.  The compiler should use its
> existing estimation mechanism[s] to make an educated guess on
> the cost of "foo(index)" and intentionally not do if conversion
> if/when {the predicted cost of evaluating "foo(index)"
>          for each iteration regardless of the condition bits}
> is too high even in the presence of vectorization.
>
>
> [Richard wrote:]
>>
>> We have a bugreport that
>>    if (C[index]) A[index] = exp (x);
>> massively slows down things if C[index] is almost never true.
>
>
> Quite understandable.  However, unfortunately I cannot think of
> any mechanism that already exists in GCC [or any other compiler
> the internals of which I am even slightly familiar] to estimate
> the probability of the elements of an arbitrary array --
> or [worse yet] of the probability of an arbitrary expression`s
> evaluation result -- being convertible to either particular
> Boolean value.  Perhaps this is feasible if/when "C[...]" is
> truly an array, i.e. not a pointer, and the array`s contents
> are known at compile time.  Otherwise, it seems to require
> pointer analysis at best, and is infeasible at worst
> [e.g. a pointer received from another translation unit].
>
> I think the only thing we can do about this, other than alter our
> plans for defaulting the if conversion, is to endeavor to make profiling
> [e.g. "gprof"] able to "understand" that a certain piece of code has been
> if-converted and able to suggest -- based on profiling -- that the
> conversion should be undone b/c it is "costing" more than it is "saving",
> even with vectorization, which IMO should be an extremely rare occurrence
> if/once we are checking e.g. "exp(x)" [assuming it`s not loop-invariant]
> for low cost of evaluation.
>
> IOW, whatever we have [or will set] the threshold on evaluation cost of
> the RHS expression for if conversion of source code like the above example
> should, IMO, solve most instances of the abovementioned problem.
> The remaining problem cases will likely be something like:
>   {"exp(x)" is _not_ loop-invariant,
>    the probability of C[index] being convertible to true is very low,
>    _and_ the statically-estimated evaluation cost of "exp(x)"
>    is both under the maximum and too close to that maximum}.
> Arguably, barring bugs in the cost estimation,
> if this happens too often in real-world code,
> then we have set the maximum too high and should adjust it.
>
>
>> Dependent on whether the scratchpad is shared for different sized accesses
>> or not the scratchpad may also introduce store hazards in the CPU
>> because if scratchpad-as-int = ...; follows scratchpad-as-short = ...;
>> the latter only partially kills the preceeding store to the scratchpad
>> (usually CPUs can't merge the two requests in the store buffer).
>
>
> It seems to me that the worst that can happen as a result of the preceding
> is that non-{in-order} CPUs will be somewhat restricted in how much they
> can re-order some of the internal operations that result from the decoding
> of the machine-code instructions.  This seems to me to not be a terribly big
> problem since the code in question should be in a vectorized loop anyway.
> If the preceding causes a long stall on a large # of CPUs, then we
> should do something about it, of course.  Otherwise, I propose that
> the preceding may be a largely-theoretical performance problem
> the solving of which doesn`t need to be a high priority
> until/unless it is proven to be a very real-world problem.
>
>
>> Not sure which approach to allocating the scratchpad you are using
>> currently.
>
>
> We are currently using something at least "morally equivalent" to a single:
>
>   char scratchpad[64];
>
> ... for each routine with a conversion that triggers scratchpad generation.
>
>
>> I still have to review the patch itself ... (and so many others)
>
>
> The work of a mother or a GCC maintainer is never done.  ;-)
>
>
>> How do you expect to fix the performance regressions?
>
>
> Mostly by the fruit of my labor with help from Sebastian.  I`d _really_ like
> to get the new converter into trunk ASAP so that there will be many more
> "eyes" on the code.  Also, it is important IMO that this code should not
> languish for months/years [again], since that is what happened approx. 5
> years
> ago when Sebastian wrote it in the first place.  I don`t think a branch is
> the right place for this code, since the branch would allow the code to die.

Well.  We don't generally introduce regressions with changes.  Iff there
are no regressions with --param allow-store-data-races=1 then I guess
it's fine as a wrong-code fix (well, the patch still needs review - I hope
to get to that this week).

Richard.

> Regards,
>
> Abe


* Re: Another benefit of the new if converter: better performance for half hammocks when running the generated code on a modern high-speed CPU with write-back caching, relative to the code produced by the old if converter given the same source code
  2015-07-29  9:51     ` Richard Biener
@ 2015-07-29 17:19       ` Abe
  2015-07-31 10:24         ` Richard Biener
  0 siblings, 1 reply; 7+ messages in thread
From: Abe @ 2015-07-29 17:19 UTC (permalink / raw)
  To: Richard Biener; +Cc: Alan Lawrence, gcc-patches, Sebastian Pop

> Well.  We don't generally introduce regressions with changes.

Understood.  Regressions are bad, of course.  TTBOMK the
regressions in question are temporary.  Once they are gone,
I think we can then look at whether or not we still
need to keep the old if converter in trunk.  Ideally,
it eventually becomes redundant and unneeded.


> (well, the patch still needs review -
> I hope to get to that this week).

After I`ve done the SPEC-based analysis, my next planned steps
on this work are to disable the code that [in my WIP] currently
causes conversion to be enabled by default when autovectorization
is enabled, then to re-integrate the old converter and implement
the switches that will give GCC users access to the modes I described
in a recent email from me.  You might prefer to delay your code review
until I have that all done and a new version of the patch submitted.

Regards,

Abe


* Re: Another benefit of the new if converter: better performance for half hammocks when running the generated code on a modern high-speed CPU with write-back caching, relative to the code produced by the old if converter given the same source code
  2015-07-29 17:19       ` Abe
@ 2015-07-31 10:24         ` Richard Biener
  2015-07-31 18:29           ` Abe
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Biener @ 2015-07-31 10:24 UTC (permalink / raw)
  To: Abe; +Cc: Alan Lawrence, gcc-patches, Sebastian Pop

On Wed, Jul 29, 2015 at 6:25 PM, Abe <abe_skolnik@yahoo.com> wrote:
>> Well.  We don't generally introduce regressions with changes.
>
>
> Understood.  Regressions are bad, of course.  TTBOMK the
> regressions in question are temporary.  Once they are gone,
> I think we can then look at whether or not we still
> need to keep the old if converter in trunk.  Ideally,
> it eventually becomes redundant and unneeded.
>
>
>> (well, the patch still needs review -
>> I hope to get to that this week).
>
> After I`ve done the SPEC-based analysis, my next planned steps
> on this work are to disable the code that [in my WIP] currently
> causes conversion to be enabled by default when autovectorization
> is enabled, then to re-integrate the old converter and implement
> the switches that will give GCC users access to the modes I described
> in a recent email from me.  You might prefer to delay your code review
> until I have that all done and a new version of the patch submitted.

I'm not sure we want two if-converters.  What we do want is avoid
using a scratch-pad if it is safe to do (for loads and stores)
and if the user tells us he is fine with store data races (for stores).
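[Editorial note: the store-data-race form that this opt-in permits can be modeled in plain C as below -- an illustrative sketch, not GCC internals.]

```c
/* Illustrative C model of the racy if-conversion of
   "if (C[i]) A[i] = v;" -- the form that requires the user's
   opt-in (e.g. --param allow-store-data-races=1), because A[i]
   is loaded and re-stored even when C[i] is false.  In a
   single-threaded program the result is unchanged, but a
   concurrent writer to A[i] can have its update overwritten. */
static void half_hammock_racy(int *A, const int *C, int n, int v)
{
    for (int i = 0; i < n; i++)
        A[i] = C[i] ? v : A[i];  /* unconditional load + store */
}
```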

Does the "new" if-converter get rid of the analysis code that
determined "safe"?  If so you should re-instantiate that.

Richard.

> Regards,
>
> Abe


* Re: Another benefit of the new if converter: better performance for half hammocks when running the generated code on a modern high-speed CPU with write-back caching, relative to the code produced by the old if converter given the same source code
  2015-07-31 10:24         ` Richard Biener
@ 2015-07-31 18:29           ` Abe
  0 siblings, 0 replies; 7+ messages in thread
From: Abe @ 2015-07-31 18:29 UTC (permalink / raw)
  To: Richard Biener; +Cc: Alan Lawrence, gcc-patches, Sebastian Pop

[Abe wrote:]
>> After I`ve done the SPEC-based analysis, my next planned steps
>> on this work are to disable the code that [in my WIP] currently
>> causes conversion to be enabled by default when autovectorization
>> is enabled, then to re-integrate the old converter and implement
>> the switches that will give GCC users access to the modes I described
>> in a recent email from me.  You might prefer to delay your code review
>> until I have that all done and a new version of the patch submitted.

[Richard wrote:]
> I'm not sure we want two if-converters.

Not _permanently_, no.  However, keeping the old converter available
and active -- [1] by default under the conditions in which current GCC
trunk activates if conversion by default, and [2] when the existing/old
flags call for it -- lets the old converter still apply in all cases in
which it currently applies, while the new converter lives in trunk
where it has eyes on it, so that Sebastian and I won`t be the only
people working on it anymore [I hope ;-)].

The preceding will also make comparison of the two converters easier
[use the same compiler build, just vary your flags] and more reliable
[all other compiler components will be identical, so we have
  a stronger guarantee of apples-to-apples comparison].  All of that
will help us make the new converter better and better until the old
one becomes completely unneeded, at which point we can remove it
and "massage" the user-visible flags and the semantics thereof --
ideally only doing this in a release just before the release of
a new GCC major version number so we don`t introduce breaking
changes in the middle of a major version number, of course.
[not pretending to "teach" anybody anything here --
  just showing that I, too, understand the basics of releasing
  decent software -- or at least _some_ of those basics  ;-)]


> What we do want is avoid using a scratch-pad
> if it is safe to do (for loads and stores)

I strongly agree.  In fact, TTBOMK that is a big part of improving
the new converter so it no longer has performance regressions.


> and if the user tells us he is fine with store data races (for stores).

Wouldn`t it be nice -- if we can do so without killing ourselves over it --
to only take the flag in question as _permission_ to generate a data race?
IOW, it would be nice if a cost-model-based analysis would be able to tell
us e.g. "for this loop, it`s worth it due to the many saved machine cycles"
versus "for this loop, it`s not worth it: you only stand to save a cycle
or two, and you potentially sacrifice some correctness in the process".
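[Editorial note: a gate of the kind suggested above might look like this tiny sketch. Everything here -- the names, the threshold, the cycle estimates -- is invented for illustration and is not an existing GCC interface.]

```c
/* Hypothetical cost-model gate: treat the data-race flag as
   *permission* only, and additionally require that the estimated
   per-iteration saving of the racy/converted form clears a
   threshold before actually generating the race. */
static int worth_racy_store(int est_cycles_branchy, int est_cycles_racy,
                            int min_saving)
{
    return (est_cycles_branchy - est_cycles_racy) >= min_saving;
}
```

With such a gate, a loop saving twenty estimated cycles per iteration would be converted, while one saving only a cycle or two would be left alone even though the flag permits the race.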


> Does the "new" if-converter get rid of the analysis code that
> determined "safe"?  If so you should re-instantiate that.

I don`t have a good answer for that right now, and I don`t anticipate
having the correct answer soon enough to justify delaying this reply.
I`ll get back to you on this question.

Regards,

Abe



Thread overview: 7+ messages
2015-07-17 20:33 Another benefit of the new if converter: better performance for half hammocks when running the generated code on a modern high-speed CPU with write-back caching, relative to the code produced by the old if converter given the same source code Abe
2015-07-28 10:27 ` Richard Biener
2015-07-28 18:18   ` Abe
2015-07-29  9:51     ` Richard Biener
2015-07-29 17:19       ` Abe
2015-07-31 10:24         ` Richard Biener
2015-07-31 18:29           ` Abe
