public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/98350] New: Reassociation breaks FMA chains
From: ktkachov at gcc dot gnu.org @ 2020-12-17 15:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350

            Bug ID: 98350
           Summary: Reassociation breaks FMA chains
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---

Consider the testcase:

#define N 1024
double a[N];
double b[N];
double c[N];
double d[N];
double e[N];
double f[N];
double g[N];
double h[N];
double j[N];
double k[N];
double l[N];
double m[N];
double o[N];
double p[N];


void
foo (void)
{
  for (int i = 0; i < N; i++)
  {
    a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i]
            + k[i] * l[i] + m[i] * o[i] + p[i];
  }
}

For -Ofast --param=tree-reassoc-width=1 GCC generates the loop:
.L2:
        ldr     q1, [x1, x0]
        ldr     q0, [x12, x0]
        ldr     q3, [x14, x0]
        fadd    v0.2d, v0.2d, v1.2d
        ldr     q1, [x13, x0]
        ldr     q2, [x11, x0]
        fmla    v0.2d, v3.2d, v1.2d
        ldr     q1, [x10, x0]
        ldr     q3, [x9, x0]
        fmla    v0.2d, v2.2d, v1.2d
        ldr     q1, [x8, x0]
        ldr     q2, [x7, x0]
        fmla    v0.2d, v3.2d, v1.2d
        ldr     q1, [x6, x0]
        ldr     q3, [x5, x0]
        fmla    v0.2d, v2.2d, v1.2d
        ldr     q1, [x4, x0]
        ldr     q2, [x3, x0]
        fmla    v0.2d, v3.2d, v1.2d
        ldr     q1, [x2, x0]
        fmla    v0.2d, v2.2d, v1.2d
        str     q0, [x1, x0]
        add     x0, x0, 16
        cmp     x0, 8192
        bne     .L2

With --param=tree-reassoc-width=4 it generates:
.L2:
        ldr     q5, [x11, x0]
        ldr     q4, [x7, x0]
        ldr     q0, [x3, x0]
        ldr     q3, [x12, x0]
        ldr     q1, [x8, x0]
        ldr     q2, [x4, x0]
        fmul    v3.2d, v3.2d, v5.2d
        fmul    v1.2d, v1.2d, v4.2d
        fmul    v2.2d, v2.2d, v0.2d
        ldr     q16, [x1, x0]
        ldr     q18, [x14, x0]
        ldr     q17, [x13, x0]
        ldr     q0, [x2, x0]
        ldr     q7, [x10, x0]
        ldr     q6, [x9, x0]
        ldr     q5, [x6, x0]
        ldr     q4, [x5, x0]
        fmla    v3.2d, v18.2d, v17.2d
        fadd    v0.2d, v0.2d, v16.2d
        fmla    v1.2d, v7.2d, v6.2d
        fmla    v2.2d, v5.2d, v4.2d
        fadd    v0.2d, v0.2d, v3.2d
        fadd    v1.2d, v1.2d, v2.2d
        fadd    v0.2d, v0.2d, v1.2d
        str     q0, [x1, x0]
        add     x0, x0, 16
        cmp     x0, 8192
        bne     .L2

The reassociation is evident: the single chain of one fadd and six fmla has
become three fmul, three fmla and four fadd. The problem here is that the fmla
chains are something we'd want to preserve.
Is there a way we can get the reassoc pass to handle FMAs more intelligently?


* [Bug tree-optimization/98350] Reassociation breaks FMA chains
From: rguenth at gcc dot gnu.org @ 2021-01-05  8:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|unknown                     |11.0
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-01-05
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
There is no built-in way, and yes, reassoc-width is known to have this effect.

What could be done is to move or duplicate FMA discovery from
pass_optimize_widening_mul to reassoc(*).  The simplistic idea would be to
perform a separate FMA detection on the OPS array.

The question is how to handle imperfect chains where reassoc would order
after rank, like

a[i] += b[i]* c[i] + d[i] + f[i] * g[i] + h[i] + k[i] * l[i] + m[i] + p[i];

and also how to avoid breaking the special heuristics the current FMA
formation pass has.  Alternatively, it would be possible to alter only
rewrite_expr_tree_parallel so that it avoids splitting FMA chains in unwanted
ways.

(*) Note that since reassoc doesn't handle signed integer arithmetic, it
cannot fully replace late FMA detection.


* [Bug tree-optimization/98350] Reassociation breaks FMA chains
From: pinskia at gcc dot gnu.org @ 2021-09-06 18:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |acsawdey at gcc dot gnu.org

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
*** Bug 70912 has been marked as a duplicate of this bug. ***


* [Bug tree-optimization/98350] Reassociation breaks FMA chains
From: dizhao at os dot amperecomputing.com @ 2023-03-23  4:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350

Di Zhao <dizhao at os dot amperecomputing.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dizhao at os dot amperecomputing.com

--- Comment #3 from Di Zhao <dizhao at os dot amperecomputing.com> ---
Created attachment 54735
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54735&action=edit
move-FLOAT_MODE_P-ahead-to-insert-more-FMAs


* [Bug tree-optimization/98350] Reassociation breaks FMA chains
From: dizhao at os dot amperecomputing.com @ 2023-03-23  4:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350

--- Comment #4 from Di Zhao <dizhao at os dot amperecomputing.com> ---
I've found the same problem with gcc-12 and gcc-13 (trunk).

By improving the workaround in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84114, more FMAs can be inserted
for vector mode. For the testcase in this tracker, 6 "fmla" can be generated
with attachment 54735. The compile option I used is "-Ofast -mcpu=neoverse-n1".


* [Bug tree-optimization/98350] Reassociation breaks FMA chains
From: pinskia at gcc dot gnu.org @ 2023-05-19  7:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kyukhin at gcc dot gnu.org

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
*** Bug 70479 has been marked as a duplicate of this bug. ***


* [Bug tree-optimization/98350] Reassociation breaks FMA chains
From: cvs-commit at gcc dot gnu.org @ 2023-05-30  6:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350

--- Comment #6 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Lili Cui <cuilili@gcc.gnu.org>:

https://gcc.gnu.org/g:e5405f065bace0685cb3b8878d1dfc7a6e7ef409

commit r14-1371-ge5405f065bace0685cb3b8878d1dfc7a6e7ef409
Author: Lili Cui <lili.cui@intel.com>
Date:   Tue May 30 05:47:47 2023 +0000

    Handle FMA friendly in reassoc pass

    Make some changes in the reassoc pass to make it friendlier to the later
    FMA pass.  Using FMA instead of mult + add reduces register pressure and
    the number of instructions retired.

    There are mainly two changes:
    1. Put no-mult ops and mult ops alternately at the end of the queue, which
    is conducive to generating more FMAs and reduces the loss of FMAs when
    breaking the chain.
    2. Rewrite the rewrite_expr_tree_parallel function to try to build parallel
    chains according to the given reassociation width, preserving as many FMA
    opportunities as possible.

    With the patch applied:

    On ICX:
    507.cactuBSSN_r: Improved by 1.7% for multi-copy.
    503.bwaves_r   : Improved by 0.60% for single-copy.
    507.cactuBSSN_r: Improved by 1.10% for single-copy.
    519.lbm_r      : Improved by 2.21% for single-copy.
    No measurable changes for other benchmarks.

    On aarch64:
    507.cactuBSSN_r: Improved by 1.7% for multi-copy.
    503.bwaves_r   : Improved by 6.00% for single-copy.
    No measurable changes for other benchmarks.

    TEST1:

    float
    foo (float a, float b, float c, float d, float *e)
    {
       return *e + a * b + c * d;
    }

    For "-Ofast -mfpmath=sse -mfma" GCC generates:
            vmulss  %xmm3, %xmm2, %xmm2
            vfmadd132ss     %xmm1, %xmm2, %xmm0
            vaddss  (%rdi), %xmm0, %xmm0
            ret

    With this patch GCC generates:
            vfmadd213ss   (%rdi), %xmm1, %xmm0
            vfmadd231ss   %xmm2, %xmm3, %xmm0
            ret

    TEST2:

    for (int i = 0; i < N; i++)
    {
      a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i]
              + k[i] * l[i] + m[i] * o[i] + p[i];
    }

    For "-Ofast -mfpmath=sse -mfma" GCC generates:
            vmovapd e(%rax), %ymm4
            vmulpd  d(%rax), %ymm4, %ymm3
            addq    $32, %rax
            vmovapd c-32(%rax), %ymm5
            vmovapd j-32(%rax), %ymm6
            vmulpd  h-32(%rax), %ymm6, %ymm2
            vmovapd a-32(%rax), %ymm6
            vaddpd  p-32(%rax), %ymm6, %ymm0
            vmovapd g-32(%rax), %ymm7
            vfmadd231pd     b-32(%rax), %ymm5, %ymm3
            vmovapd o-32(%rax), %ymm4
            vmulpd  m-32(%rax), %ymm4, %ymm1
            vmovapd l-32(%rax), %ymm5
            vfmadd231pd     f-32(%rax), %ymm7, %ymm2
            vfmadd231pd     k-32(%rax), %ymm5, %ymm1
            vaddpd  %ymm3, %ymm0, %ymm0
            vaddpd  %ymm2, %ymm0, %ymm0
            vaddpd  %ymm1, %ymm0, %ymm0
            vmovapd %ymm0, a-32(%rax)
            cmpq    $8192, %rax
            jne     .L4
            vzeroupper
            ret

    With this patch applied, GCC breaks the chain with width = 2 and generates
    6 FMAs:

            vmovapd a(%rax), %ymm2
            vmovapd c(%rax), %ymm0
            addq    $32, %rax
            vmovapd e-32(%rax), %ymm1
            vmovapd p-32(%rax), %ymm5
            vmovapd g-32(%rax), %ymm3
            vmovapd j-32(%rax), %ymm6
            vmovapd l-32(%rax), %ymm4
            vmovapd o-32(%rax), %ymm7
            vfmadd132pd     b-32(%rax), %ymm2, %ymm0
            vfmadd132pd     d-32(%rax), %ymm5, %ymm1
            vfmadd231pd     f-32(%rax), %ymm3, %ymm0
            vfmadd231pd     h-32(%rax), %ymm6, %ymm1
            vfmadd231pd     k-32(%rax), %ymm4, %ymm0
            vfmadd231pd     m-32(%rax), %ymm7, %ymm1
            vaddpd  %ymm1, %ymm0, %ymm0
            vmovapd %ymm0, a-32(%rax)
            cmpq    $8192, %rax
            jne     .L2
            vzeroupper
            ret

    gcc/ChangeLog:

            PR tree-optimization/98350
            * tree-ssa-reassoc.cc
            (rewrite_expr_tree_parallel): Rewrite this function.
            (rank_ops_for_fma): New.
            (reassociate_bb): Handle new function.

    gcc/testsuite/ChangeLog:

            PR tree-optimization/98350
            * gcc.dg/pr98350-1.c: New test.
            * gcc.dg/pr98350-2.c: Ditto.

