public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
* Optimization for static local variables
@ 2017-06-14 11:14 Prachi Godbole
  2017-06-14 11:26 ` Bin.Cheng
  2017-06-14 11:35 ` Richard Biener
  0 siblings, 2 replies; 4+ messages in thread
From: Prachi Godbole @ 2017-06-14 11:14 UTC (permalink / raw)
  To: gcc

I'm developing a solution to optimize away intermediate stores (loads) for static local variables which are assigned to before referenced on every path through a function.

Currently GCC eliminates all loads/stores in a straight line code but not in control structures. AFAIU, passes like sccvn, lim (store motion in particular) sink etc. try to propagate the assignment exprs through SSA vars instead of through memory thereby eliminating almost all loads and some stores but still not all.

For example:

void some_function()
{              
    static some_type sum;
    ...
    for( i = 0 ; i < n ; i++ )
         {
             sum = matrix [i][n] ; 

             for( j = 0 ; j < i ; j++ )
                 {
                     sum -= matrix [i][j] * matrix [j][n] ; 
                 }

            matrix [i][n] = sum ;
        }
    ...
}

Resulting SSA:

...
  <bb 14> [2.25%]:
  _10 = matrix[0][n_13(D)];
  sum = _10;
  goto <bb 10>; [100.00%]
...
  <bb 4> [12.75%]:
  _1 = matrix[i_16][n_13(D)];
  if (i_16 > 0)
    goto <bb 6>; [100.00%]
  else
    goto <bb 9>; [0.00%]
...
  <bb 6> [85.00%]:
  # j_24 = PHI <j_18(6), 0(4)>
  # sum_lsm.4_27 = PHI <_6(6), _1(4)>
  _3 = matrix[i_16][j_24];
  _4 = matrix[j_24][n_13(D)];
  _5 = _3 * _4;
  _6 = sum_lsm.4_27 - _5;
  j_18 = j_24 + 1;
  if (i_16 > j_18)
    goto <bb 6>; [85.00%]
  else
    goto <bb 7>; [15.00%]
...
  <bb 7> [12.75%]:
  # _40 = PHI <_6(6)>
  goto <bb 9>; [100.00%]

  <bb 9> [12.75%]:
  # sum_lsm.4_19 = PHI <_40(7), _1(4)>

  <bb 10> [15.00%]:
  # i_20 = PHI <i_16(9), 0(14)>
  # sum_lsm.4_8 = PHI <sum_lsm.4_19(9), _10(14)>
  # sum_lsm.5_22 = PHI <1(9), 0(14)>
  matrix[i_20][n_13(D)] = sum_lsm.4_8;
  i_16 = i_20 + 1;
  if (n_13(D) > i_16)
    goto <bb 4>; [85.00%]
  else
    goto <bb 11>; [15.00%]

  <bb 11> [2.25%]:
  # sum_lsm.4_39 = PHI <sum_lsm.4_8(10)>
  # sum_lsm.5_38 = PHI <sum_lsm.5_22(10)>
  if (sum_lsm.5_38 != 0)
    goto <bb 12>; [0.00%]
  else
    goto <bb 13>; [100.00%]

  <bb 12> [0.00%]:
  sum = sum_lsm.4_39;
...

Although all intermediate loads are eliminated here, store to sum before and after the 'j' loop (see bb 14 and bb 12) still remain which can be eliminated.

My solution would reuse a lot of data structures and routines from the sink pass. So my question is, can I add this optimization into the sink pass itself or would it require a new pass and if yes, what would be the ideal place for it.
Any other pointers are more than welcome.

Thanks,
Prachi

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Optimization for static local variables
  2017-06-14 11:14 Optimization for static local variables Prachi Godbole
@ 2017-06-14 11:26 ` Bin.Cheng
  2017-06-14 11:35 ` Richard Biener
  1 sibling, 0 replies; 4+ messages in thread
From: Bin.Cheng @ 2017-06-14 11:26 UTC (permalink / raw)
  To: Prachi Godbole; +Cc: gcc

On Wed, Jun 14, 2017 at 12:14 PM, Prachi Godbole
<Prachi.Godbole@imgtec.com> wrote:
> I'm developing a solution to optimize away intermediate stores (loads) for static local variables which are assigned to before referenced on every path through a function.
>
> Currently GCC eliminates all loads/stores in a straight line code but not in control structures. AFAIU, passes like sccvn, lim (store motion in particular) sink etc. try to propagate the assignment exprs through SSA vars instead of through memory thereby eliminating almost all loads and some stores but still not all.
>
> For example:
>
> void some_function()
> {
>     static some_type sum;
>     ...
>     for( i = 0 ; i < n ; i++ )
>          {
>              sum = matrix [i][n] ;
>
>              for( j = 0 ; j < i ; j++ )
>                  {
>                      sum -= matrix [i][j] * matrix [j][n] ;
>                  }
>
>             matrix [i][n] = sum ;
>         }
>     ...
> }
>
> Resulting SSA:
>
> ...
>   <bb 14> [2.25%]:
>   _10 = matrix[0][n_13(D)];
>   sum = _10;
>   goto <bb 10>; [100.00%]
> ...
>   <bb 4> [12.75%]:
>   _1 = matrix[i_16][n_13(D)];
>   if (i_16 > 0)
>     goto <bb 6>; [100.00%]
>   else
>     goto <bb 9>; [0.00%]
> ...
>   <bb 6> [85.00%]:
>   # j_24 = PHI <j_18(6), 0(4)>
>   # sum_lsm.4_27 = PHI <_6(6), _1(4)>
>   _3 = matrix[i_16][j_24];
>   _4 = matrix[j_24][n_13(D)];
>   _5 = _3 * _4;
>   _6 = sum_lsm.4_27 - _5;
>   j_18 = j_24 + 1;
>   if (i_16 > j_18)
>     goto <bb 6>; [85.00%]
>   else
>     goto <bb 7>; [15.00%]
> ...
>   <bb 7> [12.75%]:
>   # _40 = PHI <_6(6)>
>   goto <bb 9>; [100.00%]
>
>   <bb 9> [12.75%]:
>   # sum_lsm.4_19 = PHI <_40(7), _1(4)>
>
>   <bb 10> [15.00%]:
>   # i_20 = PHI <i_16(9), 0(14)>
>   # sum_lsm.4_8 = PHI <sum_lsm.4_19(9), _10(14)>
>   # sum_lsm.5_22 = PHI <1(9), 0(14)>
>   matrix[i_20][n_13(D)] = sum_lsm.4_8;
>   i_16 = i_20 + 1;
>   if (n_13(D) > i_16)
>     goto <bb 4>; [85.00%]
>   else
>     goto <bb 11>; [15.00%]
>
>   <bb 11> [2.25%]:
>   # sum_lsm.4_39 = PHI <sum_lsm.4_8(10)>
>   # sum_lsm.5_38 = PHI <sum_lsm.5_22(10)>
>   if (sum_lsm.5_38 != 0)
>     goto <bb 12>; [0.00%]
>   else
>     goto <bb 13>; [100.00%]
>
>   <bb 12> [0.00%]:
>   sum = sum_lsm.4_39;
> ...
>
> Although all intermediate loads are eliminated here, store to sum before and after the 'j' loop (see bb 14 and bb 12) still remain which can be eliminated.
It looks like dead store elimination to sum.  Maybe static local
variable can be better handled there?

Thanks,
bin
>
> My solution would reuse a lot of data structures and routines from the sink pass. So my question is, can I add this optimization into the sink pass itself or would it require a new pass and if yes, what would be the ideal place for it.
> Any other pointers are more than welcome.
>
> Thanks,
> Prachi

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Optimization for static local variables
  2017-06-14 11:14 Optimization for static local variables Prachi Godbole
  2017-06-14 11:26 ` Bin.Cheng
@ 2017-06-14 11:35 ` Richard Biener
  2017-06-14 13:19   ` Prachi Godbole
  1 sibling, 1 reply; 4+ messages in thread
From: Richard Biener @ 2017-06-14 11:35 UTC (permalink / raw)
  To: Prachi Godbole; +Cc: gcc

On Wed, Jun 14, 2017 at 1:14 PM, Prachi Godbole
<Prachi.Godbole@imgtec.com> wrote:
> I'm developing a solution to optimize away intermediate stores (loads) for static local variables which are assigned to before referenced on every path through a function.
>
> Currently GCC eliminates all loads/stores in a straight line code but not in control structures. AFAIU, passes like sccvn, lim (store motion in particular) sink etc. try to propagate the assignment exprs through SSA vars instead of through memory thereby eliminating almost all loads and some stores but still not all.
>
> For example:
>
> void some_function()
> {
>     static some_type sum;
>     ...
>     for( i = 0 ; i < n ; i++ )
>          {
>              sum = matrix [i][n] ;
>
>              for( j = 0 ; j < i ; j++ )
>                  {
>                      sum -= matrix [i][j] * matrix [j][n] ;
>                  }
>
>             matrix [i][n] = sum ;
>         }
>     ...
> }
>
> Resulting SSA:
>
> ...
>   <bb 14> [2.25%]:
>   _10 = matrix[0][n_13(D)];
>   sum = _10;
>   goto <bb 10>; [100.00%]
> ...
>   <bb 4> [12.75%]:
>   _1 = matrix[i_16][n_13(D)];
>   if (i_16 > 0)
>     goto <bb 6>; [100.00%]
>   else
>     goto <bb 9>; [0.00%]
> ...
>   <bb 6> [85.00%]:
>   # j_24 = PHI <j_18(6), 0(4)>
>   # sum_lsm.4_27 = PHI <_6(6), _1(4)>
>   _3 = matrix[i_16][j_24];
>   _4 = matrix[j_24][n_13(D)];
>   _5 = _3 * _4;
>   _6 = sum_lsm.4_27 - _5;
>   j_18 = j_24 + 1;
>   if (i_16 > j_18)
>     goto <bb 6>; [85.00%]
>   else
>     goto <bb 7>; [15.00%]
> ...
>   <bb 7> [12.75%]:
>   # _40 = PHI <_6(6)>
>   goto <bb 9>; [100.00%]
>
>   <bb 9> [12.75%]:
>   # sum_lsm.4_19 = PHI <_40(7), _1(4)>
>
>   <bb 10> [15.00%]:
>   # i_20 = PHI <i_16(9), 0(14)>
>   # sum_lsm.4_8 = PHI <sum_lsm.4_19(9), _10(14)>
>   # sum_lsm.5_22 = PHI <1(9), 0(14)>
>   matrix[i_20][n_13(D)] = sum_lsm.4_8;
>   i_16 = i_20 + 1;
>   if (n_13(D) > i_16)
>     goto <bb 4>; [85.00%]
>   else
>     goto <bb 11>; [15.00%]
>
>   <bb 11> [2.25%]:
>   # sum_lsm.4_39 = PHI <sum_lsm.4_8(10)>
>   # sum_lsm.5_38 = PHI <sum_lsm.5_22(10)>
>   if (sum_lsm.5_38 != 0)
>     goto <bb 12>; [0.00%]
>   else
>     goto <bb 13>; [100.00%]
>
>   <bb 12> [0.00%]:
>   sum = sum_lsm.4_39;
> ...
>
> Although all intermediate loads are eliminated here, store to sum before and after the 'j' loop (see bb 14 and bb 12) still remain which can be eliminated.
>
> My solution would reuse a lot of data structures and routines from the sink pass. So my question is, can I add this optimization into the sink pass itself or would it require a new pass and if yes, what would be the ideal place for it.
> Any other pointers are more than welcome.

So the idea is that for any store to such variable that is dominated
by a kill the function exit doesn't count as a use, right?
(unless the variable escapes the scope of the function, for example
via a nested function -- which might be the case that
is very difficult to detect).

I'm not sure what part of the sink pass is suitable for this but as
sinking walks post dominators backwards which also DSE
does it would fit DSE as well, no?  You'd pre-compute blocks
containing kills (or do two backwards passes) and then
special case GIMPLE_RETURN in dse_classify_store where previously
ref_maybe_used_by_stmt_p covers those.

nested fn case:

void foo (void)
{
  static int i = 0;
  int __attribute__((noinline)) bar () { return i; }
  int j = bar ();
  i = 1;
}

I think we don't mark 'i' TREE_ADDRESSABLE (escaped) here so we simply
treat 'i' as a global variable accessible
from other scopes.

Richard.

> Thanks,
> Prachi

^ permalink raw reply	[flat|nested] 4+ messages in thread

* RE: Optimization for static local variables
  2017-06-14 11:35 ` Richard Biener
@ 2017-06-14 13:19   ` Prachi Godbole
  0 siblings, 0 replies; 4+ messages in thread
From: Prachi Godbole @ 2017-06-14 13:19 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc

> On Wed, Jun 14, 2017 at 1:14 PM, Prachi Godbole
> <Prachi.Godbole@imgtec.com> wrote:
> > I'm developing a solution to optimize away intermediate stores (loads) for
> static local variables which are assigned to before referenced on every path
> through a function.
> >
> > Currently GCC eliminates all loads/stores in a straight line code but not in
> control structures. AFAIU, passes like sccvn, lim (store motion in particular)
> sink etc. try to propagate the assignment exprs through SSA vars instead of
> through memory thereby eliminating almost all loads and some stores but
> still not all.
> >
> > For example:
> >
> > void some_function()
> > {
> >     static some_type sum;
> >     ...
> >     for( i = 0 ; i < n ; i++ )
> >          {
> >              sum = matrix [i][n] ;
> >
> >              for( j = 0 ; j < i ; j++ )
> >                  {
> >                      sum -= matrix [i][j] * matrix [j][n] ;
> >                  }
> >
> >             matrix [i][n] = sum ;
> >         }
> >     ...
> > }
> >
> > Resulting SSA:
> >
> > ...
> >   <bb 14> [2.25%]:
> >   _10 = matrix[0][n_13(D)];
> >   sum = _10;
> >   goto <bb 10>; [100.00%]
> > ...
> >   <bb 4> [12.75%]:
> >   _1 = matrix[i_16][n_13(D)];
> >   if (i_16 > 0)
> >     goto <bb 6>; [100.00%]
> >   else
> >     goto <bb 9>; [0.00%]
> > ...
> >   <bb 6> [85.00%]:
> >   # j_24 = PHI <j_18(6), 0(4)>
> >   # sum_lsm.4_27 = PHI <_6(6), _1(4)>
> >   _3 = matrix[i_16][j_24];
> >   _4 = matrix[j_24][n_13(D)];
> >   _5 = _3 * _4;
> >   _6 = sum_lsm.4_27 - _5;
> >   j_18 = j_24 + 1;
> >   if (i_16 > j_18)
> >     goto <bb 6>; [85.00%]
> >   else
> >     goto <bb 7>; [15.00%]
> > ...
> >   <bb 7> [12.75%]:
> >   # _40 = PHI <_6(6)>
> >   goto <bb 9>; [100.00%]
> >
> >   <bb 9> [12.75%]:
> >   # sum_lsm.4_19 = PHI <_40(7), _1(4)>
> >
> >   <bb 10> [15.00%]:
> >   # i_20 = PHI <i_16(9), 0(14)>
> >   # sum_lsm.4_8 = PHI <sum_lsm.4_19(9), _10(14)>
> >   # sum_lsm.5_22 = PHI <1(9), 0(14)>
> >   matrix[i_20][n_13(D)] = sum_lsm.4_8;
> >   i_16 = i_20 + 1;
> >   if (n_13(D) > i_16)
> >     goto <bb 4>; [85.00%]
> >   else
> >     goto <bb 11>; [15.00%]
> >
> >   <bb 11> [2.25%]:
> >   # sum_lsm.4_39 = PHI <sum_lsm.4_8(10)>
> >   # sum_lsm.5_38 = PHI <sum_lsm.5_22(10)>
> >   if (sum_lsm.5_38 != 0)
> >     goto <bb 12>; [0.00%]
> >   else
> >     goto <bb 13>; [100.00%]
> >
> >   <bb 12> [0.00%]:
> >   sum = sum_lsm.4_39;
> > ...
> >
> > Although all intermediate loads are eliminated here, store to sum before
> and after the 'j' loop (see bb 14 and bb 12) still remain which can be
> eliminated.
> >
> > My solution would reuse a lot of data structures and routines from the sink
> pass. So my question is, can I add this optimization into the sink pass itself or
> would it require a new pass and if yes, what would be the ideal place for it.
> > Any other pointers are more than welcome.
> 
> So the idea is that for any store to such variable that is dominated by a kill the
> function exit doesn't count as a use, right?
> (unless the variable escapes the scope of the function, for example via a
> nested function -- which might be the case that is very difficult to detect).
> 
Yes that is the general idea. I had thought of initially being conservative and not optimizing at all when the variable could potentially escape the scope.

> I'm not sure what part of the sink pass is suitable for this but as sinking walks
> post dominators backwards which also DSE does it would fit DSE as well, no?
> You'd pre-compute blocks containing kills (or do two backwards passes) and
> then special case GIMPLE_RETURN in dse_classify_store where previously
> ref_maybe_used_by_stmt_p covers those.
> 
I was looking for a one time solution rather than doing it as and when the opportunities present themselves.
But doing it in DSE is more advantageous, I can see that. So I'll go ahead with DSE.

> nested fn case:
> 
> void foo (void)
> {
>   static int i = 0;
>   int __attribute__((noinline)) bar () { return i; }
>   int j = bar ();
>   i = 1;
> }
> 
> I think we don't mark 'i' TREE_ADDRESSABLE (escaped) here so we simply
> treat 'i' as a global variable accessible from other scopes.
> 
> Richard.
> 
> > Thanks,
> > Prachi

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2017-06-14 13:19 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-14 11:14 Optimization for static local variables Prachi Godbole
2017-06-14 11:26 ` Bin.Cheng
2017-06-14 11:35 ` Richard Biener
2017-06-14 13:19   ` Prachi Godbole

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).