public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/110449] New: Vect: use a small step to calculate the loop induction if the loop is unrolled during loop vectorization
@ 2023-06-28  9:22 hliu at amperecomputing dot com
  2023-06-28 20:11 ` [Bug tree-optimization/110449] " rsandifo at gcc dot gnu.org
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: hliu at amperecomputing dot com @ 2023-06-28  9:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110449

            Bug ID: 110449
           Summary: Vect: use a small step to calculate the loop induction
                    if the loop is unrolled during loop vectorization
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

This is inspired by clang. Compile the follwing case with "-mcpu=neoverse-n2
-O3":

void foo(int *arr, int val, int step) {
  for (int i = 0; i < 1024; i++) {
    arr[i] = val;
    val += step;
  }
}

It will be unrolled by 2 during vectorization. GCC generates code:
        fmov    s29, w2                 # step
        shl     v27.2s, v29.2s, 3       # 8*step
        shl     v28.2s, v29.2s, 2       # 4*step
        ...
.L2:
        mov     v30.16b, v31.16b
        add     v31.4s, v31.4s, v27.4s  # += 8*step
        add     v29.4s, v30.4s, v28.4s  # += 4*step
        stp     q30, q29, [x0]
        add     x0, x0, 32
        cmp     x1, x0
        bne     .L2

The v27 (i.e. "8*step") is actually not necessary. We can use v29 + v28 (i.e.
"+ 4*step") and generate simpler code:
        fmov    s29, w2                 # step
        shl     v28.2s, v29.2s, 2       # 4*step
        ...
.L2:
        add     v29.4s, v30.4s, v28.4s  # += 4*step
        stp     q30, q29, [x0]
        add     x0, x0, 32
        add     v30.4s, v29.4s, v28.4s  # += 4*step
        cmp     x1, x0
        bne     .L2

This has two benefits:
(1) Save 1 vector register and one "mov" instructon
(2) For floating point, the result value of small step should be closer to the
original scalar result value than large step. I.e. "A + 4*step + ... + 4*step"
should be closer to "A + step + ... + step" than "A + 8*step + ... 8*step".

Do you think if this is reasonable? 

I have a simple patch to enhance the tree-vect-loop.cc
"vectorizable_induction()" to achieve this. Will send out the patch for code
review later.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [Bug tree-optimization/110449] Vect: use a small step to calculate the loop induction if the loop is unrolled during loop vectorization
  2023-06-28  9:22 [Bug tree-optimization/110449] New: Vect: use a small step to calculate the loop induction if the loop is unrolled during loop vectorization hliu at amperecomputing dot com
@ 2023-06-28 20:11 ` rsandifo at gcc dot gnu.org
  2023-06-29  3:15 ` hliu at amperecomputing dot com
  2023-07-06 16:05 ` cvs-commit at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2023-06-28 20:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110449

rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org,
                   |                            |rsandifo at gcc dot gnu.org

--- Comment #1 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
Interesting idea!  But I think the ideal thing here would be
to do the 8*step after the store:

.L2:
        add     v29.4s, v31.4s, v28.4s  # += 4*step
        stp     q31, q29, [x0]
        add     v31.4s, v31.4s, v27.4s  # += 8*step
        add     x0, x0, 32
        cmp     x1, x0
        bne     .L2

This has the advantage that the loop-carried dependency
is only one ADD instruction deep, rather than 2 ADDs deep.

I haven't looked how easy it would be to do though…

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [Bug tree-optimization/110449] Vect: use a small step to calculate the loop induction if the loop is unrolled during loop vectorization
  2023-06-28  9:22 [Bug tree-optimization/110449] New: Vect: use a small step to calculate the loop induction if the loop is unrolled during loop vectorization hliu at amperecomputing dot com
  2023-06-28 20:11 ` [Bug tree-optimization/110449] " rsandifo at gcc dot gnu.org
@ 2023-06-29  3:15 ` hliu at amperecomputing dot com
  2023-07-06 16:05 ` cvs-commit at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: hliu at amperecomputing dot com @ 2023-06-29  3:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110449

--- Comment #2 from Hao Liu <hliu at amperecomputing dot com> ---
That looks better than the currently generated code (it saves one "MOV"
instruction). Yes, it has the loop-carried dependency advantage. But it still
uses one more register for "8*step" (There may be a register pressure problem
for complicated code, not for this simple case). 

This is still a floating point precision problem. There is a PR84201 discussed
about the same problem for X86:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84201. The larger step makes the
floating point calculation result has larger gap compared to the original
scalar calculation result. E.g. The SPEC2017 fp benchmark 549.fotonik may
result in VE (Validation Error) after unrolling a loop of double: 
   319    do ifreq = 1, tmppower%nofreq <------ HERE
   320      frequency(ifreq,ipower) = freq
   321      freq = freq + freqstep
   322    end do

it uses 4*step for unrolled vectorization version other than the 2*step for
non-unrolled vectorization version. The SPEC fp result checks the "relative
tolerance" of the fp results and it is higher than the current standard (i.e.
the compare command line option of "--reltol 1e-10").

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [Bug tree-optimization/110449] Vect: use a small step to calculate the loop induction if the loop is unrolled during loop vectorization
  2023-06-28  9:22 [Bug tree-optimization/110449] New: Vect: use a small step to calculate the loop induction if the loop is unrolled during loop vectorization hliu at amperecomputing dot com
  2023-06-28 20:11 ` [Bug tree-optimization/110449] " rsandifo at gcc dot gnu.org
  2023-06-29  3:15 ` hliu at amperecomputing dot com
@ 2023-07-06 16:05 ` cvs-commit at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-07-06 16:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110449

--- Comment #3 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jeff Law <law@gcc.gnu.org>:

https://gcc.gnu.org/g:224fd59b2dc8a5fa78a309a09863afe9b3cf2111

commit r14-2367-g224fd59b2dc8a5fa78a309a09863afe9b3cf2111
Author: Hao Liu OS <hliu@os.amperecomputing.com>
Date:   Thu Jul 6 10:04:46 2023 -0600

    Vect: use a small step to calculate induction for the unrolled loop (PR
tree-optimization/110449)

    If a loop is unrolled by n times during vectoriation, two steps are used to
    calculate the induction variable:
      - The small step for the unrolled ith-copy: vec_1 = vec_iv + (VF/n *
Step)
      - The large step for the whole loop: vec_loop = vec_iv + (VF * Step)

    This patch calculates an extra vec_n to replace vec_loop:
      vec_n = vec_prev + (VF/n * S) = vec_iv + (VF/n * S) * n = vec_loop.

    So that we can save the large step register and related operations.

    gcc/ChangeLog:

            PR tree-optimization/110449
            * tree-vect-loop.cc (vectorizable_induction): use vec_n to replace
            vec_loop for the unrolled loop.

    gcc/testsuite/ChangeLog:

            * gcc.target/aarch64/pr110449.c: New testcase.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2023-07-06 16:05 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-28  9:22 [Bug tree-optimization/110449] New: Vect: use a small step to calculate the loop induction if the loop is unrolled during loop vectorization hliu at amperecomputing dot com
2023-06-28 20:11 ` [Bug tree-optimization/110449] " rsandifo at gcc dot gnu.org
2023-06-29  3:15 ` hliu at amperecomputing dot com
2023-07-06 16:05 ` cvs-commit at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).