Re: [PATCH] PR18002: Undo fold_single_bit_test in do

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* Re: [PATCH] PR18002: Undo fold_single_bit_test in do_jump
       [not found] <BDDDEBEF.82E7%schlie@comcast.net>
@ 2004-12-09 17:26 ` Roger Sayle
  2004-12-09 17:53   ` Eric Botcazou
  2004-12-09 21:34   ` Paul Schlie
  0 siblings, 2 replies; 3+ messages in thread
From: Roger Sayle @ 2004-12-09 17:26 UTC (permalink / raw)
  To: Paul Schlie; +Cc: Ian Lance Taylor, gcc-patches, gcc

Hi Paul,
On Thu, 9 Dec 2004, Paul Schlie wrote:
> Thanks for the clarification of context, but the problem introduced
> by the initial transform of expressions of the form:
>
>  if (x & <pow2-const>) ... else ... ; // which is a typical bit test
> =>
>  if ((x >> <log2-const>) & 1) ... else ... ;
>
> Wreak havoc on targets with relatively inefficient shifts, and seems
> generally counter productive in circumstances where conditional jumps
> are inherently unavoidable, therefore would be nice, if not imperative,
> to have remedied. (which I had hoped the refinement was attempting to
> address)

I think you have a fundamental misunderstanding about how and why
code transformations are applied in modern optimizing compilers.
Many of the transformations that a compiler makes to a program do
not concern code generation at all and are often not optimizations.

The easiest to understand canonicalization transformation is for
commutative binary operators, where a compiler will transform the
tree "1 + x" into the equivalent "x + 1".  Typically this will not
improve performance or even affect the code we generate.  Instead
the motivation is that it greatly simplifies the task of recognizing
a tree that represents an addition by a constant.  This improves
compile-time performance and reduces the amount of code needed to
write a compiler.

The next level up, might be the canonicalization of "x + x" into
the equivalent "x * 2".  This transformation is critical for things
such as recognizing that "x + x + x + x" is the same as "x * 4".
However, notice that on most modern processors, a multiplication
is typically much slower than an addition.  This canonicalization
would appear to hurt performance and be a pessimization.  The important
thing to realize is that GCC's tree representation is an abstract
intermediate form that represents the semantics of a program, and
unlike RTL, it is not a faithful representation of the code that will
be executed.  The important bit to notice is that although at the tree
level we convert "x + x" to "x * 2", which is easier to analyse and
manipulate, at RTL expansion time, we convert "x * 2" back into a
single addition before the compiler's output is generated.

Internally, "if (x & C)" has the semantics "if ((x & C) != 0)" when
interpreted by language front-ends.  Which is a tree with two operators
a NE_EXPR and a BIT_AND_EXPR, which has identical semantics to
equivalent expression "(x >> C') & 1", which again has two operators
BIT_AND_EXPR and a RSHIFT_EXPR.  Both are equivalent, so we can
CSE/GCSE the value of one into the use of the other, so that it makes
sense at the conceptual level that both are internally represented
identically.

The important thing to remember is that this is a semantic issue
and not a performance one.  Which representation gets used for
trees is irrelevant provided that the final code we generate is
"optimal" for its context.  In this particular case, it was historically
preferable to canonicalize to the shift-form, as this simplified the
task of when to undo the transformation on expansion into RTL form.

It is vital to get beyond the belief that tree-ssa form is related
to the efficiency of the final machine code.  fold-const.c appears
to convert expressions into more expensive forms, gimplification
would appear to introduce large numbers of unnecessary variables and
trivial assignments between them, spliting loop headers and critical
blocks would appear to introduce large numbers of empty basic blocks
and unconditional jumps.

The truth is than none of the above should adversely affect a program's
performance, and indeed the goal of applying these transformations is
to improve the program globally, and not just tree-node by tree-node.

Hopefully, this explains why its not important that you're aware of
a tree transformation that appears counter-intuitive, only that the
code which is generated for the PRs is at least as efficient as it
was for 3.3.1, but also demonstrably more correct.

Roger
--

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] PR18002: Undo fold_single_bit_test in do_jump
  2004-12-09 17:26 ` [PATCH] PR18002: Undo fold_single_bit_test in do_jump Roger Sayle
@ 2004-12-09 17:53   ` Eric Botcazou
  2004-12-09 21:34   ` Paul Schlie
  1 sibling, 0 replies; 3+ messages in thread
From: Eric Botcazou @ 2004-12-09 17:53 UTC (permalink / raw)
  To: Roger Sayle; +Cc: gcc, Paul Schlie, Ian Lance Taylor, gcc-patches

> It is vital to get beyond the belief that tree-ssa form is related
> to the efficiency of the final machine code.  fold-const.c appears
> to convert expressions into more expensive forms, gimplification
> would appear to introduce large numbers of unnecessary variables and
> trivial assignments between them, spliting loop headers and critical
> blocks would appear to introduce large numbers of empty basic blocks
> and unconditional jumps.
>
> The truth is than none of the above should adversely affect a program's
> performance, and indeed the goal of applying these transformations is
> to improve the program globally, and not just tree-node by tree-node.

"Should" indeed, so this must be carefully monitored.  See PR tree-opt/18707 
for an example of a seamingly harmful form of canonicalization that ends up 
seriously hurting the performance of a loop.

-- 
Eric Botcazou

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] PR18002: Undo fold_single_bit_test in do_jump
  2004-12-09 17:26 ` [PATCH] PR18002: Undo fold_single_bit_test in do_jump Roger Sayle
  2004-12-09 17:53   ` Eric Botcazou
@ 2004-12-09 21:34   ` Paul Schlie
  1 sibling, 0 replies; 3+ messages in thread
From: Paul Schlie @ 2004-12-09 21:34 UTC (permalink / raw)
  To: Roger Sayle; +Cc: Ian Lance Taylor, gcc-patches, gcc

Hi, Roger, thanks for efforts and explanations, (annotations below)

> From: Roger Sayle <roger@eyesopen.com>
> Hi Paul,
> On Thu, 9 Dec 2004, Paul Schlie wrote:
>> Thanks for the clarification of context, but the problem introduced
>> by the initial transform of expressions of the form:
>> 
>>  if (x & <pow2-const>) ... else ... ; // which is a typical bit test
>> =>
>>  if ((x >> <log2-const>) & 1) ... else ... ;
>> 
>> Wreak havoc on targets with relatively inefficient shifts, and seems
>> generally counter productive in circumstances where conditional jumps
>> are inherently unavoidable, therefore would be nice, if not imperative,
>> to have remedied. (which I had hoped the refinement was attempting to
>> address)
> 
> I think you have a fundamental misunderstanding about how and why
> code transformations are applied in modern optimizing compilers.
> Many of the transformations that a compiler makes to a program do
> not concern code generation at all and are often not optimizations.
> 
> The easiest to understand canonicalization transformation is for
> commutative binary operators, where a compiler will transform the
> tree "1 + x" into the equivalent "x + 1".  Typically this will not
> improve performance or even affect the code we generate.  Instead
> the motivation is that it greatly simplifies the task of recognizing
> a tree that represents an addition by a constant.  This improves
> compile-time performance and reduces the amount of code needed to
> write a compiler.
> 
> The next level up, might be the canonicalization of "x + x" into
> the equivalent "x * 2".  This transformation is critical for things
> such as recognizing that "x + x + x + x" is the same as "x * 4".
> However, notice that on most modern processors, a multiplication
> is typically much slower than an addition.  This canonicalization
> would appear to hurt performance and be a pessimization.  The important
> thing to realize is that GCC's tree representation is an abstract
> intermediate form that represents the semantics of a program, and
> unlike RTL, it is not a faithful representation of the code that will
> be executed.  The important bit to notice is that although at the tree
> level we convert "x + x" to "x * 2", which is easier to analyse and
> manipulate, at RTL expansion time, we convert "x * 2" back into a
> single addition before the compiler's output is generated.
> 
> Internally, "if (x & C)" has the semantics "if ((x & C) != 0)" when
> interpreted by language front-ends.  Which is a tree with two operators
> a NE_EXPR and a BIT_AND_EXPR, which has identical semantics to
> equivalent expression "(x >> C') & 1", which again has two operators
> BIT_AND_EXPR and a RSHIFT_EXPR.  Both are equivalent, so we can
> CSE/GCSE the value of one into the use of the other, so that it makes
> sense at the conceptual level that both are internally represented
> identically.

- but unless I misunderstand, the operation if (x & C) typically does not
require an explicit test for != 0, as it's fairly common for condition codes
to be generated on all logical and arithmetic operations; but the expression
which although semantically equivalent to if ((x >> C') & 1) then has the
analogous semantics of if (((x >> C') & 1) != 0) given your explanation
above.  Which seems clearly more complex, and implies a tree of 3
operations, as opposed to 2 (but again the != 0 is typically not required
my most targets it typically is implied by the &); therefore in most typical
cases, (x & C) implies a single explicit operations, and ((x >> C') & 1)
implies the requirement for 2? (what am I missing?)

> The important thing to remember is that this is a semantic issue
> and not a performance one.  Which representation gets used for
> trees is irrelevant provided that the final code we generate is
> "optimal" for its context.  In this particular case, it was historically
> preferable to canonicalize to the shift-form, as this simplified the
> task of when to undo the transformation on expansion into RTL form.
> 
> It is vital to get beyond the belief that tree-ssa form is related
> to the efficiency of the final machine code.  fold-const.c appears
> to convert expressions into more expensive forms, gimplification
> would appear to introduce large numbers of unnecessary variables and
> trivial assignments between them, spliting loop headers and critical
> blocks would appear to introduce large numbers of empty basic blocks
> and unconditional jumps.
>
> The truth is than none of the above should adversely affect a program's
> performance, and indeed the goal of applying these transformations is
> to improve the program globally, and not just tree-node by tree-node.

- I believe I understand, however since back-ends presently look for
  particular patterns which it knows the target's instruction set can
  perform more efficiently, any transforms which affect final representation
  (although arguably semantically equivalent), aren't likely not recognized
  if syntactically different. (Would be nice if the GCC's envisioned
  idealized canonical forms were documented so that back-ends could be
  correspondingly refined, and/or possibly so that the back-end may be
  refined to automatically optimally map the canonical forms onto the
  target's supported forms?)

> Hopefully, this explains why its not important that you're aware of
> a tree transformation that appears counter-intuitive, only that the
> code which is generated for the PRs is at least as efficient as it
> was for 3.3.1, but also demonstrably more correct.

- your patch definitely improves the code from your earlier example:

  int foo(int a) { return (a & (1L<<23)) ? 1 : 2; }

(which semantically as noted is basically a sign test, although oddly coded)

But it's not clear that it helps if the bit is within the range of the
original operand (although hope it does) i.e.:
 
 char foo(char a) { return (a & (1 << 5)) ? 1 : 2; }

Which 3.3.1 I believe treats as:

 char foo(char a) { return (a & 32) ? 1 : 2; }

Where if (a & 32) is recognized by the avr .md as a conditional single
bit-test & skip:

        sbrs r24,5
        rjmp .L2
        ldi r24,lo8(1)
        ret
.L2:
        ldi r24,lo8(2)
        ret

Which would be nice to retain, but was lost as the back-end saw the
relatively more complex tree ((a >> 5) & 1)? (I guess one could argue that
just such a more complex pattern now needs to be now recognized by the back
end as being equivalent to (a & 32), as the literal execution of a shift by
5 requires as many cycles times the size of the operand, not a minor cost).

[ and although somewhat tangential, it would be nice if 4.0 didn't so often
  needlessly promote expression's operands and operations beyond their
  true need, as it should ideally consider an operations target precision
  requirements when determining the necessity of operand/operation
  promotion, which although not explicitly spelled out in C, is implied as a
  logical consequence of assignment operation combined with operation
  semantics, i.e. (a & 32) need not be promoted beyond char regardless of if
  "a" or 32 are considered signed or unsigned values, if it would have no
  effect on the resulting stored value/state, correspondingly nor should the
  operands/operations of the transformed expression be needlessly promoted ]

> Roger
> --


^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2004-12-09 21:34 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <BDDDEBEF.82E7%schlie@comcast.net>
2004-12-09 17:26 ` [PATCH] PR18002: Undo fold_single_bit_test in do_jump Roger Sayle
2004-12-09 17:53   ` Eric Botcazou
2004-12-09 21:34   ` Paul Schlie

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).