__attribute__((early

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* __attribute__((early_branch))
@ 2019-04-30 18:35 cmdLP #CODE
  2019-04-30 19:53 ` __attribute__((early_branch)) Jeff Law
  0 siblings, 1 reply; 5+ messages in thread
From: cmdLP #CODE @ 2019-04-30 18:35 UTC (permalink / raw)
  To: gcc

Hello GCC-team,

I use GCC for all my C and C++ programs. I know how to use GCC, but I am
not a contributor to GCC (yet). I often discover some problems C and C++
code have in general. There is often the choice between fast or readable
code. Some code I have seen performs well but looks ugly (goto, etc.);
other code is simple, but has some performance problems. What if we could
make the simple code perform well?

There is a common problem with conditional branches inside loops. This can
decrease the performance of a program. To fix this issue, the conditional
branch should be moved outside of the loop. Sometimes this optimization is
done by the compiler, but guessing on optimizations done by the compiler is
really bad. Often it is not easy to transform the source code to have the
conditional branching outside the loop. Instead I propose a new attribute,
which forces the compiler to do a conditional branch (based on the
annotated parameter) at the beginning of a function. It branches to the
corresponding code of the function compiled with the value being constant.

Here is a code example, which contains this issue.

    enum reduce_op
    {
        REDUCE_ADD,
        REDUCE_MULT,
        REDUCE_MIN,
        REDUCE_MAX
    };

    /* performance critical function */
    void reduce_data(enum reduce_op reduce,
                     unsigned const* data,
                     unsigned data_size)
    {
        unsigned i, result, item;

        result = reduce == REDUCE_MULT ?  1u
               : reduce == REDUCE_MIN  ? ~0u // ~0u is UINT_MAX
               :                          0u;

        for(i = 0; i < data_size; ++i)
        {
            item = data[i];

            switch(reduce)
            {
                case REDUCE_ADD:
                    result += item;
                break;

                case REDUCE_MULT:
                    result *= item;
                break;

                case REDUCE_MIN:
                    if(item < result) result = item;
                    // RIP: result <?= item;
                break;

                case REDUCE_MAX:
                    if(item > result) result = item;
                    // RIP: result >?= item;
                break;
            }
        }

        return result;
    }

The value of  reduce  does not change inside the function. For this
example, the optimization is trivial. But consider more complex examples.
The function should be optimized to:

    void reduce_data(enum reduce_op reduce,
                     unsigned const* data,
                     unsigned data_size)
    {
        unsigned i, result, item;

        switch(reduce)
        {
            case REDUCE_ADD:
                result = 0;

                for(i = 0; i < data_size; ++i)
                    result += data[i];

                return result;

            case REDUCE_MULT:
                result = 1;

                for(i = 0; i < data_size; ++i)
                    result *= data[i];

                return result;

            case REDUCE_MIN:
                result = ~0u;

                for(i = 0; i < data_size; ++i)
                {
                    item = data[i];
                    if(item < result) result = item;
                }

                return result;

            case REDUCE_MAX:
                result = 0;

                for(i = 0; i < data_size; ++i)
                {
                    item = data[i];
                    if(item > result) result = item;
                }

                return result;

            default: return 0u;
        }
    }

As I have mentioned, the value is treated as constant in the code. The
compiler simply compiles the body of the function for each enum-value as if
it was constant and puts it in a conditional branch. This means that
__builtin_constant_p evaluates to true/1, when the parameter has the
attribute. But duplicate code should also be avoided, the initialization of
the loop counter  i  could be moved before the branch.

You might ask, what about values which are not declared in the enum? The
compiler simply compiles a default case as in the example code above. The
default case equals to the normal code (as if the variable was not
annotated with the attribute), except that the compiler knows, that the
enum value is never a value declared in the enum. In this case the return
value is always 0. In the default case __builtin_constant_p with the
annotated variable evaluates to false. (But as  reduce == REDUCE_ADD  is
always false  __builtin_constant_p(reduce == REDUCE_ADD)  should be always
true.)

The attribute can only be applied in the implementation of a function. I
call the attribute early_branch. The usage of the attribute could be

    void reduce_data(enum reduce_op reduce __attribute__((early_branch)),
                     unsigned const* data,
                     unsigned data_size)
    {
        ...
    }

The way to remove the default case could be to just test against non-enum
values and insert __builtin_unreachable()

    if(reduce != REDUCE_ADD  &&
       reduce != REDUCE_MULT &&
       reduce != REDUCE_MIN  &&
       reduce != REDUCE_MAX) __builtin_unreachable();

Or test if it is not constant

    if(!__builtin_constant_p(reduce)) __builtin_unreachable();

For this case, there could be a problem when the proposed attribute is not
available. There could be either a builtin __builtin_default_branch_p or a
macro definition.

    #if __has_attribute(early_branch)
    #define is_default_branch(v) (!__builtin_constant_p(v))
    #else
    #define is_default_branch(v) 0
    #endif

The attribute can take additional parameters, which can specify the values
to compile the code seperately for, this can include non-enum values

    void do_stuff(unsigned element_size __attribute__((early_branch(1, 2,
4, 8)))) { ... }

The bool/_Bool type is interpreted as an enum with two values.

What if the variable changes?

When the variable changes there should be a jump to the corresponding
parallel code of the compiled code with new value.

Unoptimized

    /* removes comments starting with # and ending in a newline */
    void remove_comment(char* dst,
                        char const* src)
    {
        // initialization nessecary
        int state __attribute__((early_branch(0, 1))) = 0;

        char c;

        while(*src)
        {
            c = *src++;

            switch(state)
            {
                case 0:
                    if(c == '#')
                        state = 1;
                    else
                        *dst++ = c;

                    break;

                case 1:
                    if(c == '\n')
                    {
                        *dst++ = '\n';
                        state = 0;
                    }

                    break;
            }
        }
        *dst = '\0';
    }

changed to

    void remove_comment(char* dst,
                        char const* src)
    {
        char c;

        switch(0)
        {
            case 0:
                while(*src)
                {
                    c = *src++;
                    if(c == '#')
                        goto branch1;
                    else
                        *dst++ = c;
                    branch0:
                }
                break;

            case 1:
                while(*src)
                {
                    if(*src++ == '\n')
                    {
                        *dst++ = '\n';
                        goto branch0;
                    }
                    branch1:
                }
                break;
        }
        *dst = '\0';
    }

You see that the state is not tested in each iteration of the loop(s). For
complex cases (like parsers), this can improve performance. This attribute
is needed to avoid such a goto usage. I have seen such code, which is
unmaintainable but performs better than the first solution.

I hope that some of this is implementable. I'd like to discuss about this
and want to know what you think.

cmdLP

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: __attribute__((early_branch))
  2019-04-30 18:35 __attribute__((early_branch)) cmdLP #CODE
@ 2019-04-30 19:53 ` Jeff Law
  2019-05-02 12:18   ` __attribute__((early_branch)) Richard Biener
  0 siblings, 1 reply; 5+ messages in thread
From: Jeff Law @ 2019-04-30 19:53 UTC (permalink / raw)
  To: cmdLP #CODE, gcc

On 4/30/19 12:34 PM, cmdLP #CODE wrote:
> Hello GCC-team,
> 
> I use GCC for all my C and C++ programs. I know how to use GCC, but I am
> not a contributor to GCC (yet). I often discover some problems C and C++
> code have in general. There is often the choice between fast or readable
> code. Some code I have seen performs well but looks ugly (goto, etc.);
> other code is simple, but has some performance problems. What if we could
> make the simple code perform well?
> 
> There is a common problem with conditional branches inside loops. This can
> decrease the performance of a program. To fix this issue, the conditional
> branch should be moved outside of the loop. Sometimes this optimization is
> done by the compiler, but guessing on optimizations done by the compiler is
> really bad. Often it is not easy to transform the source code to have the
> conditional branching outside the loop. Instead I propose a new attribute,
> which forces the compiler to do a conditional branch (based on the
> annotated parameter) at the beginning of a function. It branches to the
> corresponding code of the function compiled with the value being constant.
> 
> Here is a code example, which contains this issue.
> 
>     enum reduce_op
>     {
>         REDUCE_ADD,
>         REDUCE_MULT,
>         REDUCE_MIN,
>         REDUCE_MAX
>     };
> 
>     /* performance critical function */
>     void reduce_data(enum reduce_op reduce,
>                      unsigned const* data,
>                      unsigned data_size)
>     {
>         unsigned i, result, item;
> 
>         result = reduce == REDUCE_MULT ?  1u
>                : reduce == REDUCE_MIN  ? ~0u // ~0u is UINT_MAX
>                :                          0u;
> 
>         for(i = 0; i < data_size; ++i)
>         {
>             item = data[i];
> 
>             switch(reduce)
>             {
>                 case REDUCE_ADD:
>                     result += item;
>                 break;
> 
>                 case REDUCE_MULT:
>                     result *= item;
>                 break;
> 
>                 case REDUCE_MIN:
>                     if(item < result) result = item;
>                     // RIP: result <?= item;
>                 break;
> 
>                 case REDUCE_MAX:
>                     if(item > result) result = item;
>                     // RIP: result >?= item;
>                 break;
>             }
>         }
> 
>         return result;
>     }
> 
> The value of  reduce  does not change inside the function. For this
> example, the optimization is trivial. But consider more complex examples.
> The function should be optimized to:
[ .... ]
This is loop unswitching.  It's a standard GCC optimization.  If it's
not working as well as it should, we're far better off determining why
and fixing the automatic transformation rather than relying on
attributes to drive the transformation.


> What if the variable changes?
> 
> When the variable changes there should be a jump to the corresponding
> parallel code of the compiled code with new value.
> 
> Unoptimized
> 
>     /* removes comments starting with # and ending in a newline */
>     void remove_comment(char* dst,
>                         char const* src)
>     {
>         // initialization nessecary
>         int state __attribute__((early_branch(0, 1))) = 0;
> 
>         char c;
> 
>         while(*src)
>         {
>             c = *src++;
> 
>             switch(state)
>             {
>                 case 0:
>                     if(c == '#')
>                         state = 1;
>                     else
>                         *dst++ = c;
> 
>                     break;
> 
>                 case 1:
>                     if(c == '\n')
>                     {
>                         *dst++ = '\n';
>                         state = 0;
>                     }
> 
>                     break;
>             }
>         }
>         *dst = '\0';
>     }
> 
> changed to
> 
>     void remove_comment(char* dst,
>                         char const* src)
>     {
>         char c;
> 
>         switch(0)
>         {
>             case 0:
>                 while(*src)
>                 {
>                     c = *src++;
>                     if(c == '#')
>                         goto branch1;
>                     else
>                         *dst++ = c;
>                     branch0:
>                 }
>                 break;
> 
>             case 1:
>                 while(*src)
>                 {
>                     if(*src++ == '\n')
>                     {
>                         *dst++ = '\n';
>                         goto branch0;
>                     }
>                     branch1:
>                 }
>                 break;
>         }
>         *dst = '\0';
>     }
> 
> You see that the state is not tested in each iteration of the loop(s). For
> complex cases (like parsers), this can improve performance. This attribute
> is needed to avoid such a goto usage. I have seen such code, which is
> unmaintainable but performs better than the first solution.
This is backwards jump threading which was primarily designed to help
state machines by avoiding evaluation of the switch at each iteration of
the loop and instead jumping directly to the desired target.

Again, if it's not working for a particular case we're far better off
figuring out why than relying on attributes to drive the transformation.

jeff

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: __attribute__((early_branch))
  2019-04-30 19:53 ` __attribute__((early_branch)) Jeff Law
@ 2019-05-02 12:18   ` Richard Biener
  2019-05-02 16:16     ` __attribute__((early_branch)) Segher Boessenkool
  0 siblings, 1 reply; 5+ messages in thread
From: Richard Biener @ 2019-05-02 12:18 UTC (permalink / raw)
  To: Jeff Law; +Cc: cmdLP #CODE, GCC Development

On Tue, Apr 30, 2019 at 9:53 PM Jeff Law <law@redhat.com> wrote:
>
> On 4/30/19 12:34 PM, cmdLP #CODE wrote:
> > Hello GCC-team,
> >
> > I use GCC for all my C and C++ programs. I know how to use GCC, but I am
> > not a contributor to GCC (yet). I often discover some problems C and C++
> > code have in general. There is often the choice between fast or readable
> > code. Some code I have seen performs well but looks ugly (goto, etc.);
> > other code is simple, but has some performance problems. What if we could
> > make the simple code perform well?
> >
> > There is a common problem with conditional branches inside loops. This can
> > decrease the performance of a program. To fix this issue, the conditional
> > branch should be moved outside of the loop. Sometimes this optimization is
> > done by the compiler, but guessing on optimizations done by the compiler is
> > really bad. Often it is not easy to transform the source code to have the
> > conditional branching outside the loop. Instead I propose a new attribute,
> > which forces the compiler to do a conditional branch (based on the
> > annotated parameter) at the beginning of a function. It branches to the
> > corresponding code of the function compiled with the value being constant.
> >
> > Here is a code example, which contains this issue.
> >
> >     enum reduce_op
> >     {
> >         REDUCE_ADD,
> >         REDUCE_MULT,
> >         REDUCE_MIN,
> >         REDUCE_MAX
> >     };
> >
> >     /* performance critical function */
> >     void reduce_data(enum reduce_op reduce,
> >                      unsigned const* data,
> >                      unsigned data_size)
> >     {
> >         unsigned i, result, item;
> >
> >         result = reduce == REDUCE_MULT ?  1u
> >                : reduce == REDUCE_MIN  ? ~0u // ~0u is UINT_MAX
> >                :                          0u;
> >
> >         for(i = 0; i < data_size; ++i)
> >         {
> >             item = data[i];
> >
> >             switch(reduce)
> >             {
> >                 case REDUCE_ADD:
> >                     result += item;
> >                 break;
> >
> >                 case REDUCE_MULT:
> >                     result *= item;
> >                 break;
> >
> >                 case REDUCE_MIN:
> >                     if(item < result) result = item;
> >                     // RIP: result <?= item;
> >                 break;
> >
> >                 case REDUCE_MAX:
> >                     if(item > result) result = item;
> >                     // RIP: result >?= item;
> >                 break;
> >             }
> >         }
> >
> >         return result;
> >     }
> >
> > The value of  reduce  does not change inside the function. For this
> > example, the optimization is trivial. But consider more complex examples.
> > The function should be optimized to:
> [ .... ]
> This is loop unswitching.  It's a standard GCC optimization.  If it's
> not working as well as it should, we're far better off determining why
> and fixing the automatic transformation rather than relying on
> attributes to drive the transformation.

It's currently not implemented for switch () stmts, just for conditionals.
This also hurts SPEC cactusADM.  There might be a missed-optimization
bug about this.  A simple recursive implementation might be possible;
unswitch one case at a time - maybe order by profile probability.  We
already recurse on the unswitched bodies (in case multiple conditions
can be unswitched)

>
> > What if the variable changes?
> >
> > When the variable changes there should be a jump to the corresponding
> > parallel code of the compiled code with new value.
> >
> > Unoptimized
> >
> >     /* removes comments starting with # and ending in a newline */
> >     void remove_comment(char* dst,
> >                         char const* src)
> >     {
> >         // initialization nessecary
> >         int state __attribute__((early_branch(0, 1))) = 0;
> >
> >         char c;
> >
> >         while(*src)
> >         {
> >             c = *src++;
> >
> >             switch(state)
> >             {
> >                 case 0:
> >                     if(c == '#')
> >                         state = 1;
> >                     else
> >                         *dst++ = c;
> >
> >                     break;
> >
> >                 case 1:
> >                     if(c == '\n')
> >                     {
> >                         *dst++ = '\n';
> >                         state = 0;
> >                     }
> >
> >                     break;
> >             }
> >         }
> >         *dst = '\0';
> >     }
> >
> > changed to
> >
> >     void remove_comment(char* dst,
> >                         char const* src)
> >     {
> >         char c;
> >
> >         switch(0)
> >         {
> >             case 0:
> >                 while(*src)
> >                 {
> >                     c = *src++;
> >                     if(c == '#')
> >                         goto branch1;
> >                     else
> >                         *dst++ = c;
> >                     branch0:
> >                 }
> >                 break;
> >
> >             case 1:
> >                 while(*src)
> >                 {
> >                     if(*src++ == '\n')
> >                     {
> >                         *dst++ = '\n';
> >                         goto branch0;
> >                     }
> >                     branch1:
> >                 }
> >                 break;
> >         }
> >         *dst = '\0';
> >     }
> >
> > You see that the state is not tested in each iteration of the loop(s). For
> > complex cases (like parsers), this can improve performance. This attribute
> > is needed to avoid such a goto usage. I have seen such code, which is
> > unmaintainable but performs better than the first solution.
> This is backwards jump threading which was primarily designed to help
> state machines by avoiding evaluation of the switch at each iteration of
> the loop and instead jumping directly to the desired target.
>
> Again, if it's not working for a particular case we're far better off
> figuring out why than relying on attributes to drive the transformation.
>
> jeff
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: __attribute__((early_branch))
  2019-05-02 12:18   ` __attribute__((early_branch)) Richard Biener
@ 2019-05-02 16:16     ` Segher Boessenkool
  2019-05-03  7:11       ` __attribute__((early_branch)) Richard Biener
  0 siblings, 1 reply; 5+ messages in thread
From: Segher Boessenkool @ 2019-05-02 16:16 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jeff Law, cmdLP #CODE, GCC Development

On Thu, May 02, 2019 at 02:17:51PM +0200, Richard Biener wrote:
> On Tue, Apr 30, 2019 at 9:53 PM Jeff Law <law@redhat.com> wrote:
> > This is loop unswitching.  It's a standard GCC optimization.  If it's
> > not working as well as it should, we're far better off determining why
> > and fixing the automatic transformation rather than relying on
> > attributes to drive the transformation.
> 
> It's currently not implemented for switch () stmts, just for conditionals.
> This also hurts SPEC cactusADM.  There might be a missed-optimization
> bug about this.  A simple recursive implementation might be possible;
> unswitch one case at a time - maybe order by profile probability.  We
> already recurse on the unswitched bodies (in case multiple conditions
> can be unswitched)

Well, if for some case value we can prove the controlling expression is
constant in the loop, we can almost always prove it is constant without
looking at the case value?  So we can pull the whole switch statement
outside just as easily?


Segher

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: __attribute__((early_branch))
  2019-05-02 16:16     ` __attribute__((early_branch)) Segher Boessenkool
@ 2019-05-03  7:11       ` Richard Biener
  0 siblings, 0 replies; 5+ messages in thread
From: Richard Biener @ 2019-05-03  7:11 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Jeff Law, cmdLP #CODE, GCC Development

On Thu, May 2, 2019 at 6:16 PM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> On Thu, May 02, 2019 at 02:17:51PM +0200, Richard Biener wrote:
> > On Tue, Apr 30, 2019 at 9:53 PM Jeff Law <law@redhat.com> wrote:
> > > This is loop unswitching.  It's a standard GCC optimization.  If it's
> > > not working as well as it should, we're far better off determining why
> > > and fixing the automatic transformation rather than relying on
> > > attributes to drive the transformation.
> >
> > It's currently not implemented for switch () stmts, just for conditionals.
> > This also hurts SPEC cactusADM.  There might be a missed-optimization
> > bug about this.  A simple recursive implementation might be possible;
> > unswitch one case at a time - maybe order by profile probability.  We
> > already recurse on the unswitched bodies (in case multiple conditions
> > can be unswitched)
>
> Well, if for some case value we can prove the controlling expression is
> constant in the loop, we can almost always prove it is constant without
> looking at the case value?  So we can pull the whole switch statement
> outside just as easily?

There isn't any infrastructure to "easily" do that (copy the loop N times,
wrap it in a switch stmt and put the N loop copies into the switch cases).
The infrastructure we have (loop versioning) manages to copy a loop once
and wrap the two copies with a conditional.  It might be also preferable
to only unswitch the most frequently executed case to avoid code size
explosion (IIRC the cactusADM case has 4 cases, only one is actually
executed).

Richard.

>
>
> Segher

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2019-05-03  7:11 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-30 18:35 __attribute__((early_branch)) cmdLP #CODE
2019-04-30 19:53 ` __attribute__((early_branch)) Jeff Law
2019-05-02 12:18   ` __attribute__((early_branch)) Richard Biener
2019-05-02 16:16     ` __attribute__((early_branch)) Segher Boessenkool
2019-05-03  7:11       ` __attribute__((early_branch)) Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).