[Bug tree-optimization/94092] New: Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug tree-optimization/94092] New: Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+
@ 2020-03-09  6:06 bina2374 at gmail dot com
  2020-03-09  6:17 ` [Bug tree-optimization/94092] " pinskia at gcc dot gnu.org
                   ` (16 more replies)
  0 siblings, 17 replies; 18+ messages in thread
From: bina2374 at gmail dot com @ 2020-03-09  6:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

            Bug ID: 94092
           Summary: Code size and performance degradations after
                    -ftree-loop-distribute-patterns was enabled at -O[2s]+
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bina2374 at gmail dot com
  Target Milestone: ---
            Target: riscv64-unknown-elf

Since https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440,
loop-distribute-patterns was enabled at -O[2s]+. We understand that this patch
is helpful in some cases, but not for all.
In the original bugzilla, the loop count is known, and the performance and code
size are all improved finally. However, if loop count is unknown, the results
may become worse.

void init_array(int N, int *array) {
  int i;
  for (i = 0; i < N; i++)
    array[i] = 0;
}

init_array:
        mv      a2,a0
        mv      a0,a1
        ble     a2,zero,.L1
        slli    a2,a2,2
        li      a1,0
        tail    memset
.L1:
        ret

For coremark, this is not only harmful to performance, but also code size.
Therefore, I suggest that we have to discuss how to resolve the case of unknown
loop count.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+
  2020-03-09  6:06 [Bug tree-optimization/94092] New: Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+ bina2374 at gmail dot com
@ 2020-03-09  6:17 ` pinskia at gcc dot gnu.org
  2020-03-09  6:29 ` pinskia at gcc dot gnu.org
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: pinskia at gcc dot gnu.org @ 2020-03-09  6:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
>For coremark, this is not only harmful to performance, but also code size.


Bad, very bad benchmark ....

SPEC CPU is closer to real code but still bad benchmarks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+
  2020-03-09  6:06 [Bug tree-optimization/94092] New: Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+ bina2374 at gmail dot com
  2020-03-09  6:17 ` [Bug tree-optimization/94092] " pinskia at gcc dot gnu.org
@ 2020-03-09  6:29 ` pinskia at gcc dot gnu.org
  2020-03-09  6:30 ` pinskia at gcc dot gnu.org
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: pinskia at gcc dot gnu.org @ 2020-03-09  6:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #1)
> >For coremark, this is not only harmful to performance, but also code size.
> 
> 
> Bad, very bad benchmark ....
Coremark only handles very very small data sets by default.  I think you should
run your real code over this and see if it improves or not.

Also -O2 does not care exactly about code size, that -Os only.

Can you provide a test where besides coremark which decreases in performance
and increase in size?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+
  2020-03-09  6:06 [Bug tree-optimization/94092] New: Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+ bina2374 at gmail dot com
  2020-03-09  6:17 ` [Bug tree-optimization/94092] " pinskia at gcc dot gnu.org
  2020-03-09  6:29 ` pinskia at gcc dot gnu.org
@ 2020-03-09  6:30 ` pinskia at gcc dot gnu.org
  2020-03-09  7:37 ` glisse at gcc dot gnu.org
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: pinskia at gcc dot gnu.org @ 2020-03-09  6:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |WAITING
   Last reconfirmed|                            |2020-03-09

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+
  2020-03-09  6:06 [Bug tree-optimization/94092] New: Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+ bina2374 at gmail dot com
                   ` (2 preceding siblings ...)
  2020-03-09  6:30 ` pinskia at gcc dot gnu.org
@ 2020-03-09  7:37 ` glisse at gcc dot gnu.org
  2020-03-09  9:48 ` rguenth at gcc dot gnu.org
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: glisse at gcc dot gnu.org @ 2020-03-09  7:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #3 from Marc Glisse <glisse at gcc dot gnu.org> ---
Does profile feedback (so we have an idea on the loop count) make any
difference? It seems clear that for a loop that in practice just copies one
long, having to arrange the arguments, make a function call, test for
alignment, etc, is a lot of overhead. What to do about it though...

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+
  2020-03-09  6:06 [Bug tree-optimization/94092] New: Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+ bina2374 at gmail dot com
                   ` (3 preceding siblings ...)
  2020-03-09  7:37 ` glisse at gcc dot gnu.org
@ 2020-03-09  9:48 ` rguenth at gcc dot gnu.org
  2021-02-19  8:05 ` aoliva at gcc dot gnu.org
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-03-09  9:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
With profile feedback we (target or middle-end) can produce specialized
RTL expansion doing small copies inline and larget ones offline.  The
idea of GIMPLE level pattern detection is that even for small sizes
the target usually knows how to expand the copy optimally while the
user may have written a byte copying loop.

Of course that requires targets to pay attention.

Note most compiler optimization involves some heuristics and clearly heuristics
can be off.  I wonder if you can obtain better coremark results by using
link-time optimization.  Iff you're only after benchmark numbers...

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+
  2020-03-09  6:06 [Bug tree-optimization/94092] New: Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+ bina2374 at gmail dot com
                   ` (4 preceding siblings ...)
  2020-03-09  9:48 ` rguenth at gcc dot gnu.org
@ 2021-02-19  8:05 ` aoliva at gcc dot gnu.org
  2021-02-23 23:57 ` wilson at gcc dot gnu.org
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: aoliva at gcc dot gnu.org @ 2021-02-19  8:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

Alexandre Oliva <aoliva at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |ASSIGNED
                 CC|                            |aoliva at gcc dot gnu.org
           Assignee|unassigned at gcc dot gnu.org      |aoliva at gcc dot gnu.org

--- Comment #5 from Alexandre Oliva <aoliva at gcc dot gnu.org> ---
Mine

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+
  2020-03-09  6:06 [Bug tree-optimization/94092] New: Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+ bina2374 at gmail dot com
                   ` (5 preceding siblings ...)
  2021-02-19  8:05 ` aoliva at gcc dot gnu.org
@ 2021-02-23 23:57 ` wilson at gcc dot gnu.org
  2021-02-24  7:43 ` rguenth at gcc dot gnu.org
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: wilson at gcc dot gnu.org @ 2021-02-23 23:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #6 from Jim Wilson <wilson at gcc dot gnu.org> ---
Trying Alex's patch, it doesn't work on this testcase.

One problem is that the loop bound is unknown, so the memset size is estimated
as max 0x1fffffffc bytes.  The code calls
default_use_by_pieces_infrastructure_p which wants the total number of
instructions to be less than SET_RATIO which is 16.  The total number of
instructions is computed as 0x1fffffffc.  So the attempt fails.

There is a contributing problem here that the alignment of the dest was
computed as 8-bits, even though we have a pointer to int which has 32-bit
alignment on a strict alignment system.  That problem happens in
expand_builtin_memset_args which calls get_pointer_alignment on dest.  DEST is
a SSN_NAME, and the SSA_NAME pointer info says align is 0, so it defaults to
8-bit byte alignment.  I haven't tried to figure out where that went wrong.

The patch does work as advertised on libgcc soft-float code.  For an rv32i
build I see the memset calls in libgcc.a reduced from 31 to 10.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+
  2020-03-09  6:06 [Bug tree-optimization/94092] New: Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+ bina2374 at gmail dot com
                   ` (6 preceding siblings ...)
  2021-02-23 23:57 ` wilson at gcc dot gnu.org
@ 2021-02-24  7:43 ` rguenth at gcc dot gnu.org
  2021-02-24  9:05 ` bina2374 at gmail dot com
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-02-24  7:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
Yes, it was expected that the patch cannot handle all cases since most
definitely the ldist transform loses information on the access that is
otherwise used to improve alignment info.  I suggested to add
__builtin_mem{set,cpy,move} variants with an extra argument specifying the
(common, for cpy/move) 'alignment size'.

Note that for unbound loop bounds we want to dispatch to libc memset even
when we know much about alignment since libc is expected to have optimal
code sequences for a variety of alignment/size combinations.  As said elsewhere
I believe we have code that can dynamically dispatch between a short inline
sequence and a libcall dependent on the actual length but I don't remember
whether this is/was in generic code or in target specific code (but I think
it is done only when value-profile data is available).

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+
  2020-03-09  6:06 [Bug tree-optimization/94092] New: Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+ bina2374 at gmail dot com
                   ` (7 preceding siblings ...)
  2021-02-24  7:43 ` rguenth at gcc dot gnu.org
@ 2021-02-24  9:05 ` bina2374 at gmail dot com
  2021-02-24  9:20 ` rguenth at gcc dot gnu.org
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: bina2374 at gmail dot com @ 2021-02-24  9:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #8 from Mel Chen <bina2374 at gmail dot com> ---
Sorry for using the bad example to describe the problem I am facing. Let me
clarify my question with a more precise example.

void array_mul(int N, int *C, short *A, short *B) {
  int i, j;
  for (i = 0; i < N; i++) {
    C[i] = 0; // Will be transformed to __builtin_memset
    for (j = 0; j < N; j++) {
      C[i] += (int)A[i * N + j] * (int)B[j];
    }
  }
}

If I compile the case with -O2 -fno-tree-loop-distribute-patterns, the store
operation 'C[i] = 0' can be eliminated by dead store elimination (dse3). But
without -fno-tree-loop-distribute-patterns, it will be transformed to memset by
loop distribution (ldist) because ldist executes before dse3. Finally the
memset will not be eliminated.

Another point is if there are other operations in the same level loop as the
store operation, is it really beneficial to do loop distribution and then
convert to builtin function?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+
  2020-03-09  6:06 [Bug tree-optimization/94092] New: Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+ bina2374 at gmail dot com
                   ` (8 preceding siblings ...)
  2021-02-24  9:05 ` bina2374 at gmail dot com
@ 2021-02-24  9:20 ` rguenth at gcc dot gnu.org
  2021-02-26  6:35 ` bina2374 at gmail dot com
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-02-24  9:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Mel Chen from comment #8)
> Sorry for using the bad example to describe the problem I am facing. Let me
> clarify my question with a more precise example.
> 
> void array_mul(int N, int *C, short *A, short *B) {
>   int i, j;
>   for (i = 0; i < N; i++) {
>     C[i] = 0; // Will be transformed to __builtin_memset
>     for (j = 0; j < N; j++) {
>       C[i] += (int)A[i * N + j] * (int)B[j];
>     }
>   }
> }
> 
> If I compile the case with -O2 -fno-tree-loop-distribute-patterns, the store
> operation 'C[i] = 0' can be eliminated by dead store elimination (dse3). But
> without -fno-tree-loop-distribute-patterns, it will be transformed to memset
> by loop distribution (ldist) because ldist executes before dse3. Finally the
> memset will not be eliminated.
> 
> Another point is if there are other operations in the same level loop as the
> store operation, is it really beneficial to do loop distribution and then
> convert to builtin function?

Sure, it shows a cost modeling issue given that usually loop distribution
merges partitions which touch the same memory stream (but IIRC maybe only
for loads).  But more to the point we're missing to eliminate the dead store
which should be appearant at least after PRE - LIM2 applied store motion
but only PRE elides the resulting load of C[i].  Usually DCE and DSE come in
pairs but after PRE we have DCE, CDDCE w/o accompaning DSE only with the
next DSE only happening after loop distribution.

Which means we should eventually do

diff --git a/gcc/passes.def b/gcc/passes.def
index e9ed3c7bc57..be3a9becde0 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -254,6 +254,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_sancov);
       NEXT_PASS (pass_asan);
       NEXT_PASS (pass_tsan);
+      NEXT_PASS (pass_dse);
       NEXT_PASS (pass_dce);
       /* Pass group that runs when 1) enabled, 2) there are loops
         in the function.  Make sure to run pass_fix_loops before

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+
  2020-03-09  6:06 [Bug tree-optimization/94092] New: Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+ bina2374 at gmail dot com
                   ` (9 preceding siblings ...)
  2021-02-24  9:20 ` rguenth at gcc dot gnu.org
@ 2021-02-26  6:35 ` bina2374 at gmail dot com
  2021-03-03  6:30 ` bina2374 at gmail dot com
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: bina2374 at gmail dot com @ 2021-02-26  6:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #10 from Mel Chen <bina2374 at gmail dot com> ---
(In reply to Richard Biener from comment #9)
> (In reply to Mel Chen from comment #8)
> > Sorry for using the bad example to describe the problem I am facing. Let me
> > clarify my question with a more precise example.
> > 
> > void array_mul(int N, int *C, short *A, short *B) {
> >   int i, j;
> >   for (i = 0; i < N; i++) {
> >     C[i] = 0; // Will be transformed to __builtin_memset
> >     for (j = 0; j < N; j++) {
> >       C[i] += (int)A[i * N + j] * (int)B[j];
> >     }
> >   }
> > }
> > 
> > If I compile the case with -O2 -fno-tree-loop-distribute-patterns, the store
> > operation 'C[i] = 0' can be eliminated by dead store elimination (dse3). But
> > without -fno-tree-loop-distribute-patterns, it will be transformed to memset
> > by loop distribution (ldist) because ldist executes before dse3. Finally the
> > memset will not be eliminated.
> > 
> > Another point is if there are other operations in the same level loop as the
> > store operation, is it really beneficial to do loop distribution and then
> > convert to builtin function?
> 
> Sure, it shows a cost modeling issue given that usually loop distribution
> merges partitions which touch the same memory stream (but IIRC maybe only
> for loads).  But more to the point we're missing to eliminate the dead store
> which should be appearant at least after PRE - LIM2 applied store motion
> but only PRE elides the resulting load of C[i].  Usually DCE and DSE come in
> pairs but after PRE we have DCE, CDDCE w/o accompaning DSE only with the
> next DSE only happening after loop distribution.
> 
> Which means we should eventually do
> 
> diff --git a/gcc/passes.def b/gcc/passes.def
> index e9ed3c7bc57..be3a9becde0 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -254,6 +254,7 @@ along with GCC; see the file COPYING3.  If not see
>        NEXT_PASS (pass_sancov);
>        NEXT_PASS (pass_asan);
>        NEXT_PASS (pass_tsan);
> +      NEXT_PASS (pass_dse);
>        NEXT_PASS (pass_dce);
>        /* Pass group that runs when 1) enabled, 2) there are loops
>          in the function.  Make sure to run pass_fix_loops before

Yes, doing DSE before ldist is a simple and effective way.
This patch has been verified to be work on coremark. Not only improved
performance, but also code size.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+
  2020-03-09  6:06 [Bug tree-optimization/94092] New: Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+ bina2374 at gmail dot com
                   ` (10 preceding siblings ...)
  2021-02-26  6:35 ` bina2374 at gmail dot com
@ 2021-03-03  6:30 ` bina2374 at gmail dot com
  2021-03-03  8:03 ` rguenther at suse dot de
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: bina2374 at gmail dot com @ 2021-03-03  6:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #11 from Mel Chen <bina2374 at gmail dot com> ---
(In reply to Mel Chen from comment #10)
> (In reply to Richard Biener from comment #9)
> > (In reply to Mel Chen from comment #8)
> > > Sorry for using the bad example to describe the problem I am facing. Let me
> > > clarify my question with a more precise example.
> > > 
> > > void array_mul(int N, int *C, short *A, short *B) {
> > >   int i, j;
> > >   for (i = 0; i < N; i++) {
> > >     C[i] = 0; // Will be transformed to __builtin_memset
> > >     for (j = 0; j < N; j++) {
> > >       C[i] += (int)A[i * N + j] * (int)B[j];
> > >     }
> > >   }
> > > }
> > > 
> > > If I compile the case with -O2 -fno-tree-loop-distribute-patterns, the store
> > > operation 'C[i] = 0' can be eliminated by dead store elimination (dse3). But
> > > without -fno-tree-loop-distribute-patterns, it will be transformed to memset
> > > by loop distribution (ldist) because ldist executes before dse3. Finally the
> > > memset will not be eliminated.
> > > 
> > > Another point is if there are other operations in the same level loop as the
> > > store operation, is it really beneficial to do loop distribution and then
> > > convert to builtin function?
> > 
> > Sure, it shows a cost modeling issue given that usually loop distribution
> > merges partitions which touch the same memory stream (but IIRC maybe only
> > for loads).  But more to the point we're missing to eliminate the dead store
> > which should be appearant at least after PRE - LIM2 applied store motion
> > but only PRE elides the resulting load of C[i].  Usually DCE and DSE come in
> > pairs but after PRE we have DCE, CDDCE w/o accompaning DSE only with the
> > next DSE only happening after loop distribution.
> > 
> > Which means we should eventually do
> > 
> > diff --git a/gcc/passes.def b/gcc/passes.def
> > index e9ed3c7bc57..be3a9becde0 100644
> > --- a/gcc/passes.def
> > +++ b/gcc/passes.def
> > @@ -254,6 +254,7 @@ along with GCC; see the file COPYING3.  If not see
> >        NEXT_PASS (pass_sancov);
> >        NEXT_PASS (pass_asan);
> >        NEXT_PASS (pass_tsan);
> > +      NEXT_PASS (pass_dse);
> >        NEXT_PASS (pass_dce);
> >        /* Pass group that runs when 1) enabled, 2) there are loops
> >          in the function.  Make sure to run pass_fix_loops before
> 
> Yes, doing DSE before ldist is a simple and effective way.
> This patch has been verified to be work on coremark. Not only improved
> performance, but also code size.

The test case gcc.dg/tree-ssa/ldist-33.c is failure after I added DSE.

/* { dg-do compile { target size32plus } } */
/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns
-fdump-tree-ldist-details" } */

#define N (1024)
double a[N][N], b[N][N], c[N][N];

void
foo (void)
{
  unsigned i, j, k;

  for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
      {
        c[i][j] = 0.0;
        for (k = 0; k < N; ++k)
          c[i][j] += a[i][k] * b[k][j];
      }
}

/* { dg-final { scan-tree-dump "Loop nest . distributed: split to 1 loops and 1
library" "ldist" } } */

It is similar to the example I showed earlier. DSE eliminated 'c[i][j] = 0.0'
so no loop will be split. My question is how to handle this test case? Add
-fno-tree-dse into dg-options, modify this test case, delete this test case, or
others.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+
  2020-03-09  6:06 [Bug tree-optimization/94092] New: Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+ bina2374 at gmail dot com
                   ` (11 preceding siblings ...)
  2021-03-03  6:30 ` bina2374 at gmail dot com
@ 2021-03-03  8:03 ` rguenther at suse dot de
  2021-04-29  4:24 ` aoliva at gcc dot gnu.org
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: rguenther at suse dot de @ 2021-03-03  8:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #12 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 3 Mar 2021, bina2374 at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092
> 
> --- Comment #11 from Mel Chen <bina2374 at gmail dot com> ---
> (In reply to Mel Chen from comment #10)
> > (In reply to Richard Biener from comment #9)
> > > (In reply to Mel Chen from comment #8)
> > > > Sorry for using the bad example to describe the problem I am facing. Let me
> > > > clarify my question with a more precise example.
> > > > 
> > > > void array_mul(int N, int *C, short *A, short *B) {
> > > >   int i, j;
> > > >   for (i = 0; i < N; i++) {
> > > >     C[i] = 0; // Will be transformed to __builtin_memset
> > > >     for (j = 0; j < N; j++) {
> > > >       C[i] += (int)A[i * N + j] * (int)B[j];
> > > >     }
> > > >   }
> > > > }
> > > > 
> > > > If I compile the case with -O2 -fno-tree-loop-distribute-patterns, the store
> > > > operation 'C[i] = 0' can be eliminated by dead store elimination (dse3). But
> > > > without -fno-tree-loop-distribute-patterns, it will be transformed to memset
> > > > by loop distribution (ldist) because ldist executes before dse3. Finally the
> > > > memset will not be eliminated.
> > > > 
> > > > Another point is if there are other operations in the same level loop as the
> > > > store operation, is it really beneficial to do loop distribution and then
> > > > convert to builtin function?
> > > 
> > > Sure, it shows a cost modeling issue given that usually loop distribution
> > > merges partitions which touch the same memory stream (but IIRC maybe only
> > > for loads).  But more to the point we're missing to eliminate the dead store
> > > which should be appearant at least after PRE - LIM2 applied store motion
> > > but only PRE elides the resulting load of C[i].  Usually DCE and DSE come in
> > > pairs but after PRE we have DCE, CDDCE w/o accompaning DSE only with the
> > > next DSE only happening after loop distribution.
> > > 
> > > Which means we should eventually do
> > > 
> > > diff --git a/gcc/passes.def b/gcc/passes.def
> > > index e9ed3c7bc57..be3a9becde0 100644
> > > --- a/gcc/passes.def
> > > +++ b/gcc/passes.def
> > > @@ -254,6 +254,7 @@ along with GCC; see the file COPYING3.  If not see
> > >        NEXT_PASS (pass_sancov);
> > >        NEXT_PASS (pass_asan);
> > >        NEXT_PASS (pass_tsan);
> > > +      NEXT_PASS (pass_dse);
> > >        NEXT_PASS (pass_dce);
> > >        /* Pass group that runs when 1) enabled, 2) there are loops
> > >          in the function.  Make sure to run pass_fix_loops before
> > 
> > Yes, doing DSE before ldist is a simple and effective way.
> > This patch has been verified to be work on coremark. Not only improved
> > performance, but also code size.
> 
> The test case gcc.dg/tree-ssa/ldist-33.c is failure after I added DSE.
> 
> /* { dg-do compile { target size32plus } } */
> /* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns
> -fdump-tree-ldist-details" } */
> 
> #define N (1024)
> double a[N][N], b[N][N], c[N][N];
> 
> void
> foo (void)
> {
>   unsigned i, j, k;
> 
>   for (i = 0; i < N; ++i)
>     for (j = 0; j < N; ++j)
>       {
>         c[i][j] = 0.0;
>         for (k = 0; k < N; ++k)
>           c[i][j] += a[i][k] * b[k][j];
>       }
> }
> 
> /* { dg-final { scan-tree-dump "Loop nest . distributed: split to 1 loops and 1
> library" "ldist" } } */
> 
> It is similar to the example I showed earlier. DSE eliminated 'c[i][j] = 0.0'
> so no loop will be split. My question is how to handle this test case? Add
> -fno-tree-dse into dg-options, modify this test case, delete this test case, or
> others.

I'd say disable store-motion with a comment (-fno-tree-loop-im, we might
also finally add a separate switch to control the store-motion part of
GIMPLE invariant motion).  I think
the testcase should show we can recognize the memset in a non-innermost
loop (IIRC the above mimics what bwaves does).

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+
  2020-03-09  6:06 [Bug tree-optimization/94092] New: Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+ bina2374 at gmail dot com
                   ` (12 preceding siblings ...)
  2021-03-03  8:03 ` rguenther at suse dot de
@ 2021-04-29  4:24 ` aoliva at gcc dot gnu.org
  2021-04-30 14:38 ` law at gcc dot gnu.org
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: aoliva at gcc dot gnu.org @ 2021-04-29  4:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

Alexandre Oliva <aoliva at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|aoliva at gcc dot gnu.org          |unassigned at gcc dot gnu.org
                URL|https://gcc.gnu.org/piperma |
                   |il/gcc-patches/2021-Februar |
                   |y/565558.html               |

--- Comment #13 from Alexandre Oliva <aoliva at gcc dot gnu.org> ---
Since Jim Wilson says
https://gcc.gnu.org/pipermail/gcc-patches/2021-February/565558.html does not
address the problem in this bug, after all, I'm removing myself as assignee.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+
  2020-03-09  6:06 [Bug tree-optimization/94092] New: Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+ bina2374 at gmail dot com
                   ` (13 preceding siblings ...)
  2021-04-29  4:24 ` aoliva at gcc dot gnu.org
@ 2021-04-30 14:38 ` law at gcc dot gnu.org
  2021-05-03  6:28 ` rguenther at suse dot de
  2021-05-04 12:26 ` rguenth at gcc dot gnu.org
  16 siblings, 0 replies; 18+ messages in thread
From: law at gcc dot gnu.org @ 2021-04-30 14:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

Jeffrey A. Law <law at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |law at gcc dot gnu.org

--- Comment #14 from Jeffrey A. Law <law at gcc dot gnu.org> ---
WRT Jim's comment about alignments in c#6.

Right now a pointer's alignment is really only used to eliminate unnecessary
masking -- we don't propagate a pointer's known alignment to improve the known
alignment of memory operations involving that pointer.

This is something I'd cobbled together for a closely related issue which will
try to increase the known alignment of a MEM by using the alignment of a
pointer to that MEM.

We've gone a slightly different (and more effective) route for that internal
issue, but this may still be worth polishing a bit and submitting.

diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
index 972512e81..be9ff76b5 100644
--- a/gcc/emit-rtl.c
+++ b/gcc/emit-rtl.c
@@ -859,6 +859,28 @@ gen_rtx_MEM (machine_mode mode, rtx addr)
      we clear it here.  */
   MEM_ATTRS (rt) = 0;

+  /* If we can deduce a higher alignment for the memory access
+     based on the pointer, then it's advantageous to do so.   */
+  unsigned int align = 0;
+  if (REG_P (addr)
+      && REG_POINTER (addr))
+    align = REGNO_POINTER_ALIGN (REGNO (addr));
+  else if (GET_CODE (addr) == PLUS
+          && REG_P (XEXP (addr, 0))
+          && REG_POINTER (XEXP (addr, 0))
+          && REGNO_POINTER_ALIGN (REGNO (XEXP (addr, 0)))
+          && GET_CODE (XEXP (addr, 1)) == CONST_INT)
+    {
+      unsigned int tmp = 1 << (ffs_hwi (INTVAL (XEXP (addr, 1))) - 1);
+      /* ALIGN is in bits.  */
+      tmp <<= 3;
+      align = REGNO_POINTER_ALIGN (REGNO (XEXP (addr, 0)));
+      align = (align > tmp) ? tmp : align;
+    }
+
+  if (align > mode_mem_attrs[(int) mode]->align)
+    set_mem_align (rt, align);
+
   return rt;
 }

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+
  2020-03-09  6:06 [Bug tree-optimization/94092] New: Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+ bina2374 at gmail dot com
                   ` (14 preceding siblings ...)
  2021-04-30 14:38 ` law at gcc dot gnu.org
@ 2021-05-03  6:28 ` rguenther at suse dot de
  2021-05-04 12:26 ` rguenth at gcc dot gnu.org
  16 siblings, 0 replies; 18+ messages in thread
From: rguenther at suse dot de @ 2021-05-03  6:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #15 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 30 Apr 2021, law at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092
> 
> Jeffrey A. Law <law at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |law at gcc dot gnu.org
> 
> --- Comment #14 from Jeffrey A. Law <law at gcc dot gnu.org> ---
> WRT Jim's comment about alignments in c#6.
> 
> Right now a pointer's alignment is really only used to eliminate unnecessary
> masking -- we don't propagate a pointer's known alignment to improve the known
> alignment of memory operations involving that pointer.

Hmm, but we do - in set_mem_attributes_minus_bitpos.  But maybe below
is for way later RTL generation,

> This is something I'd cobbled together for a closely related issue which will
> try to increase the known alignment of a MEM by using the alignment of a
> pointer to that MEM.
> 
> We've gone a slightly different (and more effective) route for that internal
> issue, but this may still be worth polishing a bit and submitting.
> 
> diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
> index 972512e81..be9ff76b5 100644
> --- a/gcc/emit-rtl.c
> +++ b/gcc/emit-rtl.c
> @@ -859,6 +859,28 @@ gen_rtx_MEM (machine_mode mode, rtx addr)
>       we clear it here.  */
>    MEM_ATTRS (rt) = 0;
> 
> +  /* If we can deduce a higher alignment for the memory access
> +     based on the pointer, then it's advantageous to do so.   */
> +  unsigned int align = 0;
> +  if (REG_P (addr)
> +      && REG_POINTER (addr))
> +    align = REGNO_POINTER_ALIGN (REGNO (addr));
> +  else if (GET_CODE (addr) == PLUS
> +          && REG_P (XEXP (addr, 0))
> +          && REG_POINTER (XEXP (addr, 0))
> +          && REGNO_POINTER_ALIGN (REGNO (XEXP (addr, 0)))
> +          && GET_CODE (XEXP (addr, 1)) == CONST_INT)
> +    {
> +      unsigned int tmp = 1 << (ffs_hwi (INTVAL (XEXP (addr, 1))) - 1);
> +      /* ALIGN is in bits.  */
> +      tmp <<= 3;
> +      align = REGNO_POINTER_ALIGN (REGNO (XEXP (addr, 0)));
> +      align = (align > tmp) ? tmp : align;
> +    }
> +
> +  if (align > mode_mem_attrs[(int) mode]->align)
> +    set_mem_align (rt, align);
> +
>    return rt;
>  }
> 
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+
  2020-03-09  6:06 [Bug tree-optimization/94092] New: Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+ bina2374 at gmail dot com
                   ` (15 preceding siblings ...)
  2021-05-03  6:28 ` rguenther at suse dot de
@ 2021-05-04 12:26 ` rguenth at gcc dot gnu.org
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-05-04 12:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEW

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-05-04 12:26 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-09  6:06 [Bug tree-optimization/94092] New: Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+ bina2374 at gmail dot com
2020-03-09  6:17 ` [Bug tree-optimization/94092] " pinskia at gcc dot gnu.org
2020-03-09  6:29 ` pinskia at gcc dot gnu.org
2020-03-09  6:30 ` pinskia at gcc dot gnu.org
2020-03-09  7:37 ` glisse at gcc dot gnu.org
2020-03-09  9:48 ` rguenth at gcc dot gnu.org
2021-02-19  8:05 ` aoliva at gcc dot gnu.org
2021-02-23 23:57 ` wilson at gcc dot gnu.org
2021-02-24  7:43 ` rguenth at gcc dot gnu.org
2021-02-24  9:05 ` bina2374 at gmail dot com
2021-02-24  9:20 ` rguenth at gcc dot gnu.org
2021-02-26  6:35 ` bina2374 at gmail dot com
2021-03-03  6:30 ` bina2374 at gmail dot com
2021-03-03  8:03 ` rguenther at suse dot de
2021-04-29  4:24 ` aoliva at gcc dot gnu.org
2021-04-30 14:38 ` law at gcc dot gnu.org
2021-05-03  6:28 ` rguenther at suse dot de
2021-05-04 12:26 ` rguenth at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).