public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/113104] New: Suboptimal loop-based slp node splicing across iterations
From: fxue at os dot amperecomputing.com @ 2023-12-21  8:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

            Bug ID: 113104
           Summary: Suboptimal loop-based slp node splicing across
                    iterations
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

Given a partial vector-sized slp node in a loop, code generation utilizes
inter-iteration parallelism to achieve full vectorization by splicing the defs
of the node from multiple iterations into one vector. This strategy is not
always beneficial and could be refined in some situations. Specifically, we had
better not splice a node if it participates in a full-vector-sized operation;
otherwise an unneeded permute and vextract are introduced.

Suppose the target vector size is 128 bits, and an slp node is mapped to VEC_OP
in an iteration. Depending on whether the backend supports LO/HI versions of
the operation, there are two kinds of code sequences for splicing.

  // Isolated 2 iterations
  res_v128_I0 = VEC_OP(opnd_v64_I0, ...)   // iteration #0
  res_v128_I1 = VEC_OP(opnd_v64_I1, ...)   // iteration #1 


  // Spliced (1)
  opnd_v128_I0_I1 = { opnd_v64_I0, opnd_v64_I1 }      // extra permute
  opnd_v64_lo = [vec_unpack_lo_expr] opnd_v128_I0_I1; // extra vextract
  opnd_v64_hi = [vec_unpack_hi_expr] opnd_v128_I0_I1; // extra vextract
  res_v128_I0 = VEC_OP(opnd_v64_lo, ...)
  res_v128_I1 = VEC_OP(opnd_v64_hi, ...)

  // Spliced (2)
  opnd_v128_I0_I1 = { opnd_v64_I0, opnd_v64_I1 }  // extra permute
  res_v128_I0 = VEC_OP_LO(opnd_v128_I0_I1, ...)   // similar or same as VEC_OP
  res_v128_I1 = VEC_OP_HI(opnd_v128_I0_I1, ...)   // similar or same as VEC_OP

Sometimes such a permute and vextract can be optimized away by backend passes,
but sometimes they cannot. Here is a case on aarch64.

  int test(unsigned array[4][4]);

  int foo(unsigned short *a, unsigned long n)
  {
    unsigned array[4][4];

    for (unsigned i = 0; i < 4; i++, a += n)
      {
        array[i][0] = a[0] << 6;
        array[i][1] = a[1] << 6;
        array[i][2] = a[2] << 6;
        array[i][3] = a[3] << 6;
      }

    return test(array);
  }
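
(The report does not give the compilation command; the code below is presumably
produced by something along these lines, where the cross-compiler name and -O3
are assumptions:)

  aarch64-linux-gnu-gcc -O3 -S foo.c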


        // Current code generation
        mov     x2, x0
        stp     x29, x30, [sp, -80]!
        add     x3, x2, x1, lsl 1
        lsl     x1, x1, 1
        mov     x29, sp
        add     x4, x3, x1
        ldr     d0, [x2]
        movi    v30.4s, 0
        add     x0, sp, 16
        ldr     d31, [x2, x1]
        ldr     d29, [x3, x1]
        ldr     d28, [x4, x1]
        ins     v0.d[1], v31.d[0]            //
        ins     v29.d[1], v28.d[0]           // 
        zip1    v1.8h, v0.8h, v30.8h         // superfluous  
        zip2    v0.8h, v0.8h, v30.8h         //
        zip1    v31.8h, v29.8h, v30.8h       //
        zip2    v29.8h, v29.8h, v30.8h       //
        shl     v1.4s, v1.4s, 6
        shl     v0.4s, v0.4s, 6
        shl     v31.4s, v31.4s, 6
        shl     v29.4s, v29.4s, 6
        stp     q1, q0, [sp, 16]
        stp     q31, q29, [sp, 48]
        bl      test
        ldp     x29, x30, [sp], 80
        ret


        // May be optimized to:
        stp     x29, x30, [sp, -80]!
        mov     x29, sp
        mov     x2, x0
        add     x0, sp, 16
        lsl     x3, x1, 1
        add     x1, x2, x1, lsl 1
        add     x4, x1, x3
        ldr     d31, [x2, x3]
        ushll   v31.4s, v31.4h, 6
        ldr     d30, [x2]
        ushll   v30.4s, v30.4h, 6
        str     q30, [sp, 16]
        ldr     d30, [x1, x3]
        ushll   v30.4s, v30.4h, 6
        str     q31, [sp, 32]
        ldr     d31, [x4, x3]
        ushll   v31.4s, v31.4h, 6
        stp     q30, q31, [sp, 48]
        bl      test
        ldp     x29, x30, [sp], 80
        ret


* [Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
From: rguenth at gcc dot gnu.org @ 2023-12-21  9:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
See my proposal on the mailing list to lift the restriction of sticking to a
single vector size; I think this is another example showing this.  If you
use BB-level vectorization by disabling loop vectorization but not SLP
vectorization, the code should improve?
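
(For reference, one way to get BB-level SLP without loop vectorization is to
disable only the loop vectorizer; the exact option set below is a sketch and is
not taken from this thread:)

  gcc -O3 -fno-tree-loop-vectorize -S foo.c   # BB SLP stays enabled at -O3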


* [Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
From: fxue at os dot amperecomputing.com @ 2023-12-21  9:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

--- Comment #2 from Feng Xue <fxue at os dot amperecomputing.com> ---
(In reply to Richard Biener from comment #1)
> See my proposal on the mailing list to lift the restriction of sticking to a
> single vector size, I think this is another example showing this.  If you
> use BB level vectorization by disabling loop vectorization but not SLP
> vectorization the code should improve?

Yes, the loop is fully unrolled, and BB SLP would improve the code.

I could not find the proposal; could you share a link? Thanks.


* [Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
From: rguenther at suse dot de @ 2023-12-21  9:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

--- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 21 Dec 2023, fxue at os dot amperecomputing.com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104
> 
> --- Comment #2 from Feng Xue <fxue at os dot amperecomputing.com> ---
> (In reply to Richard Biener from comment #1)
> > See my proposal on the mailing list to lift the restriction of sticking to a
> > single vector size, I think this is another example showing this.  If you
> > use BB level vectorization by disabling loop vectorization but not SLP
> > vectorization the code should improve?
> 
> Yes, the loop is fully unrolled, and BB SLP would improve the code.

I suspect the code would improve even when the loop isn't unrolled (just
increase the iteration count).

> I could not find the proposal; could you share a link? Thanks.

https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640476.html


* [Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
From: rsandifo at gcc dot gnu.org @ 2023-12-30 12:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

Richard Sandiford <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2023-12-30
     Ever confirmed|0                           |1
                 CC|                            |rsandifo at gcc dot gnu.org
           Assignee|unassigned at gcc dot gnu.org      |rsandifo at gcc dot gnu.org

--- Comment #4 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
FWIW, we do get the desired code with -march=armv8-a+sve (even though the test
doesn't use SVE).  This is because of:

  /* Consider enabling VECT_COMPARE_COSTS for SVE, both so that we
     can compare SVE against Advanced SIMD and so that we can compare
     multiple SVE vectorization approaches against each other.  There's
     not really any point doing this for Advanced SIMD only, since the
     first mode that works should always be the best.  */
  if (TARGET_SVE && aarch64_sve_compare_costs)
    flags |= VECT_COMPARE_COSTS;

The testcase in this PR is a counterexample to the claim in the final sentence.
 I think the comment might predate significant support for mixed-sized Advanced
SIMD vectorisation.

If we enable SVE (or uncomment the "if" line), the costs are 13 units per
vector iteration for 128-bit vectors and 4 units per vector iteration for
64-bit vectors (so 8 units per 128 bits on a parity basis).  The 64-bit version
is therefore seen as significantly cheaper and is chosen ahead of the 128-bit
version.

I think this PR is enough proof that we should enable VECT_COMPARE_COSTS even
without SVE.  Assigning to myself for that.
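
(For reference, a way to observe the SVE-enabled behaviour described above; only
the -march value comes from this comment, the remaining options are an
assumption:)

  gcc -O3 -march=armv8-a+sve -S foo.c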


* [Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
From: cvs-commit at gcc dot gnu.org @ 2024-01-05 16:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

--- Comment #5 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The trunk branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:7328faf89e9b4953baaff10e18262c70fbd3e578

commit r14-6961-g7328faf89e9b4953baaff10e18262c70fbd3e578
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Fri Jan 5 16:25:16 2024 +0000

    aarch64: Extend VECT_COMPARE_COSTS to !SVE [PR113104]

    When SVE is enabled, we try vectorising with multiple different SVE and
    Advanced SIMD approaches and use the cost model to pick the best one.
    Until now, we've not done that for Advanced SIMD, since "the first mode
    that works should always be the best".

    The testcase is a counterexample.  Each iteration of the scalar loop
    vectorises naturally with 64-bit input vectors and 128-bit output
    vectors.  We do try that for SVE, and choose it as the best approach.
    But the first approach we try is instead to use:

    - a vectorisation factor of 2
    - 1 128-bit vector for the inputs
    - 2 128-bit vectors for the outputs

    But since the stride is variable, the cost of marshalling the input
    vector from two iterations outweighs the benefit of doing two iterations
    at once.

    This patch therefore generalises aarch64-sve-compare-costs to
    aarch64-vect-compare-costs and applies it to non-SVE compilations.

    gcc/
            PR target/113104
            * doc/invoke.texi (aarch64-sve-compare-costs): Replace with...
            (aarch64-vect-compare-costs): ...this.
            * config/aarch64/aarch64.opt (-param=aarch64-sve-compare-costs=):
            Replace with...
            (-param=aarch64-vect-compare-costs=): ...this new param.
            * config/aarch64/aarch64.cc (aarch64_override_options_internal):
            Don't disable it when vectorizing for Advanced SIMD only.
            (aarch64_autovectorize_vector_modes): Apply VECT_COMPARE_COSTS
            whenever aarch64_vect_compare_costs is true.

    gcc/testsuite/
            PR target/113104
            * gcc.target/aarch64/pr113104.c: New test.
            * gcc.target/aarch64/sve/cond_arith_1.c: Update for new parameter
            names.
            * gcc.target/aarch64/sve/cond_arith_1_run.c: Likewise.
            * gcc.target/aarch64/sve/cond_arith_3.c: Likewise.
            * gcc.target/aarch64/sve/cond_arith_3_run.c: Likewise.
            * gcc.target/aarch64/sve/gather_load_6.c: Likewise.
            * gcc.target/aarch64/sve/gather_load_7.c: Likewise.
            * gcc.target/aarch64/sve/load_const_offset_2.c: Likewise.
            * gcc.target/aarch64/sve/load_const_offset_3.c: Likewise.
            * gcc.target/aarch64/sve/mask_gather_load_6.c: Likewise.
            * gcc.target/aarch64/sve/mask_gather_load_7.c: Likewise.
            * gcc.target/aarch64/sve/mask_load_slp_1.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_load_1.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_load_2.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_load_3.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_load_4.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_store_1.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_store_1_run.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_store_2.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_store_2_run.c: Likewise.
            * gcc.target/aarch64/sve/pack_1.c: Likewise.
            * gcc.target/aarch64/sve/reduc_4.c: Likewise.
            * gcc.target/aarch64/sve/scatter_store_6.c: Likewise.
            * gcc.target/aarch64/sve/scatter_store_7.c: Likewise.
            * gcc.target/aarch64/sve/strided_load_3.c: Likewise.
            * gcc.target/aarch64/sve/strided_store_3.c: Likewise.
            * gcc.target/aarch64/sve/unpack_fcvt_signed_1.c: Likewise.
            * gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c: Likewise.
            * gcc.target/aarch64/sve/unpack_signed_1.c: Likewise.
            * gcc.target/aarch64/sve/unpack_unsigned_1.c: Likewise.
            * gcc.target/aarch64/sve/unpack_unsigned_1_run.c: Likewise.
            * gcc.target/aarch64/sve/vcond_11.c: Likewise.
            * gcc.target/aarch64/sve/vcond_11_run.c: Likewise.
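
(For reference, after this change the comparison can presumably be toggled via
the renamed parameter; treating 0 as "use the first mode that works" is an
assumption carried over from the old aarch64-sve-compare-costs semantics:)

  gcc -O3 --param=aarch64-vect-compare-costs=0 -S foo.c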


* [Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
From: rsandifo at gcc dot gnu.org @ 2024-01-05 16:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

Richard Sandiford <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #6 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
Fixed.  Thanks for the report.


* [Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
From: fxue at os dot amperecomputing.com @ 2024-01-10  5:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

Feng Xue <fxue at os dot amperecomputing.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|FIXED                       |---
             Status|RESOLVED                    |REOPENED

--- Comment #7 from Feng Xue <fxue at os dot amperecomputing.com> ---
Partial permutation (especially extract-low/high) can incur inefficient
splicing in some situations even when the vector mode with the lowest cost is
used.

  int test(unsigned array[4][4]);

  int foo(unsigned short *a, unsigned long n)
  {
    unsigned array[4][4];

    for (unsigned i = 0; i < 4; i++, a += n)
      {
        array[i][0] = (a[0] << 3) - (a[4] << 6);
        array[i][1] = (a[1] << 3) - (a[5] << 6);
        array[i][2] = (a[2] << 3) - (a[6] << 6);
        array[i][3] = (a[3] << 3) - (a[7] << 6);
      }

    return test(array);
  }

        // After vect-compare-cost fix
        mov     x2, x0
        stp     x29, x30, [sp, -80]!
        add     x3, x2, x1, lsl 1
        lsl     x1, x1, 1
        mov     x29, sp
        add     x4, x3, x1
        add     x0, sp, 16
        ldr     q5, [x2]
        ldr     q31, [x4, x1]
        ldr     q0, [x3, x1]
        ldr     q1, [x2, x1]
        movi    v28.4s, 0                //
        zip1    v29.2d, v0.2d, v31.2d    //
        zip1    v2.2d, v5.2d, v1.2d      //
        zip2    v31.2d, v0.2d, v31.2d    //
        zip2    v1.2d, v5.2d, v1.2d      // 
        zip1    v30.8h, v29.8h, v28.8h   //
        zip1    v4.8h, v2.8h, v28.8h     //  superfluous
        zip1    v27.8h, v31.8h, v28.8h   //
        zip1    v3.8h, v1.8h, v28.8h     //
        zip2    v29.8h, v29.8h, v28.8h   //
        zip2    v31.8h, v31.8h, v28.8h   //
        zip2    v2.8h, v2.8h, v28.8h     //
        zip2    v1.8h, v1.8h, v28.8h     //
        shl     v30.4s, v30.4s, 3
        shl     v29.4s, v29.4s, 3
        shl     v4.4s, v4.4s, 3
        shl     v2.4s, v2.4s, 3
        shl     v27.4s, v27.4s, 6
        shl     v31.4s, v31.4s, 6
        shl     v3.4s, v3.4s, 6
        shl     v1.4s, v1.4s, 6
        sub     v27.4s, v30.4s, v27.4s
        sub     v31.4s, v29.4s, v31.4s
        sub     v3.4s, v4.4s, v3.4s
        sub     v1.4s, v2.4s, v1.4s
        stp     q27, q31, [sp, 48]
        stp     q3, q1, [sp, 16]
        bl      test
        ldp     x29, x30, [sp], 80
        ret


        // Expect it to be optimized as:
        lsl     x3, x1, 1
        mov     x2, x0
        stp     x29, x30, [sp, -80]!
        add     x1, x2, x1, lsl 1
        add     x4, x1, x3
        mov     x29, sp
        add     x0, sp, 16
        ldr     q30, [x2, x3]
        ldr     q0, [x2]
        ushll   v31.4s, v30.4h, 3
        ushll2  v30.4s, v30.8h, 6
        ushll   v29.4s, v0.4h, 3
        ushll2  v0.4s, v0.8h, 6
        sub     v30.4s, v31.4s, v30.4s
        sub     v0.4s, v29.4s, v0.4s
        str     q0, [sp, 16]
        ldr     q0, [x1, x3]
        str     q30, [sp, 32]
        ldr     q30, [x4, x3]
        ushll   v29.4s, v0.4h, 3
        ushll2  v0.4s, v0.8h, 6
        ushll   v31.4s, v30.4h, 3
        ushll2  v30.4s, v30.8h, 6
        sub     v0.4s, v29.4s, v0.4s
        sub     v30.4s, v31.4s, v30.4s
        stp     q0, q30, [sp, 48]
        bl      test
        ldp     x29, x30, [sp], 80
        ret

Based on the cost arising from splicing, we still need a way to select the
most profitable vector mode per slp node, rather than the single global vector
mode specified in vinfo.

