[Bug target/39821] 120% slowdown with vectorizer


* [Bug target/39821] 120% slowdown with vectorizer
       [not found] <bug-39821-4@http.gcc.gnu.org/bugzilla/>
@ 2021-07-26  6:19 ` pinskia at gcc dot gnu.org
  2021-07-27  7:24 ` rguenth at gcc dot gnu.org
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-07-26  6:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39821

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|tree-optimization           |target

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
The code generation for aarch64 looks fine:
dotproduct_order4:
.LFB1:
        .cfi_startproc
        ldr     q1, [x0]
        ldr     q2, [x1]
        smull   v0.2d, v2.2s, v1.2s
        smlal2  v0.2d, v2.4s, v1.4s
        addp    d0, v0.2d
        fmov    x0, d0
        ret
  vect__6.41_18 = MEM <vector(4) int> [(int32_t *)v1_2(D)];
  vect__10.44_13 = MEM <vector(4) int> [(int32_t *)v2_3(D)];
  vect_patt_25.45_8 = WIDEN_MULT_LO_EXPR <vect__10.44_13, vect__6.41_18>;
  vect_patt_25.45_4 = WIDEN_MULT_HI_EXPR <vect__10.44_13, vect__6.41_18>;
  vect_accum_14.46_31 = vect_patt_25.45_4 + vect_patt_25.45_8;
  _33 = .REDUC_PLUS (vect_accum_14.46_31); [tail call]
---- CUT ----
Even the gimple level for x86_64 looks ok:
  vect__6.41_18 = MEM <vector(4) int> [(int32_t *)v1_2(D)];
  vect__10.44_13 = MEM <vector(4) int> [(int32_t *)v2_3(D)];
  vect_patt_25.45_8 = WIDEN_MULT_LO_EXPR <vect__10.44_13, vect__6.41_18>;
  vect_patt_25.45_4 = WIDEN_MULT_HI_EXPR <vect__10.44_13, vect__6.41_18>;
  vect_accum_14.46_31 = vect_patt_25.45_4 + vect_patt_25.45_8;
  _33 = VEC_PERM_EXPR <vect_accum_14.46_31, { 0, 0 }, { 1, 2 }>;
  _34 = vect_accum_14.46_31 + _33;
  stmp_accum_14.47_35 = BIT_FIELD_REF <_34, 64, 0>;

But the expansion looks bad.
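
As a reading aid for the GIMPLE above, here is a minimal scalar sketch in C of what the vectorized sequence computes for an order-4 dot product. The function names are hypothetical and the scalar reference is modelled on the sdotproduct-style loop discussed in this PR, not taken from the original test case. Which pair of lanes counts as "lo" and which as "hi" is target dependent; the sketch assumes lanes 0-1 are "lo", which does not change the sum.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar reference: widen each 32-bit product to 64 bits
   and accumulate.  */
static int64_t dot4_scalar(const int32_t *v1, const int32_t *v2)
{
    int64_t accum = 0;
    for (int i = 0; i < 4; i++)
        accum += (int64_t) v1[i] * v2[i];
    return accum;
}

/* Scalar model of the GIMPLE: two widening multiplies, each producing
   two 64-bit lanes, one vector add, then a reduction of the two lanes.  */
static int64_t dot4_widen_model(const int32_t *v1, const int32_t *v2)
{
    int64_t lo[2], hi[2], sum[2];
    for (int i = 0; i < 2; i++) {
        lo[i] = (int64_t) v1[i] * v2[i];          /* WIDEN_MULT_LO_EXPR */
        hi[i] = (int64_t) v1[i + 2] * v2[i + 2];  /* WIDEN_MULT_HI_EXPR */
        sum[i] = hi[i] + lo[i];                   /* vect_accum = hi + lo */
    }
    /* .REDUC_PLUS on aarch64; VEC_PERM_EXPR + add + BIT_FIELD_REF on x86_64.  */
    return sum[0] + sum[1];
}

/* Both forms should agree on any input that does not overflow int64_t.  */
static void check_dot4(void)
{
    const int32_t a[4] = { 1, -2, 3, 2147483647 };
    const int32_t b[4] = { 5, 6, -7, 2 };
    assert(dot4_widen_model(a, b) == dot4_scalar(a, b));
}
```

The aarch64 and x86_64 dumps differ only in the last step, so the model covers both; the bad part on x86_64 is how that sequence is expanded, not what it computes.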


* [Bug target/39821] 120% slowdown with vectorizer
From: rguenth at gcc dot gnu.org @ 2021-07-27  7:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39821

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
0x398f310 _2 * _4 1 times scalar_stmt costs 12 in body
...
0x392b3f0 _1 w* _3 2 times vec_promote_demote costs 8 in body
...
t4.c:4:12: note:  Cost model analysis:
  Vector inside of loop cost: 40
  Vector prologue cost: 4
  Vector epilogue cost: 108
  Scalar iteration cost: 40
  Scalar outside cost: 32
  Vector outside cost: 112
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 3

so clearly the widening multiplication is not costed correctly.  With SSE 4.2
we can do better:

.L4:
        movdqu  (%rcx,%rax), %xmm0
        movdqu  (%rsi,%rax), %xmm1
        addq    $16, %rax
        movdqa  %xmm0, %xmm3
        movdqa  %xmm1, %xmm4
        punpckldq       %xmm0, %xmm3
        punpckldq       %xmm1, %xmm4
        punpckhdq       %xmm0, %xmm0
        pmuldq  %xmm4, %xmm3
        punpckhdq       %xmm1, %xmm1
        pmuldq  %xmm1, %xmm0
        paddq   %xmm3, %xmm2
        paddq   %xmm0, %xmm2
        cmpq    %rdi, %rax
        jne     .L4

but even there the costing is imprecise.  The vectorizer is unhelpful in
categorizing the widen mult as vec_promote_demote which then fails to
run into

        case MULT_EXPR:
        case WIDEN_MULT_EXPR:
        case MULT_HIGHPART_EXPR:
          stmt_cost = ix86_multiplication_cost (ix86_cost, mode);
          break;

fixing that yields

0x392b3f0 _1 w* _3 2 times vector_stmt costs 136 in body

for SSE2, SSE4.2 and AVX2 alike, so that's over-estimating the cost via

      /* V*DImode is emulated with 5-8 insns.  */
      else if (mode == V2DImode || mode == V4DImode)
        {
          if (TARGET_XOP && mode == V2DImode)
            return ix86_vec_cost (mode, cost->mulss * 2 + cost->sse_op * 3);
          else
            return ix86_vec_cost (mode, cost->mulss * 3 + cost->sse_op * 5);
        }

with cost->mulss == 16.  I suppose it is somehow failing to realize it's
doing a widening multiply.
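
The 136 figure can be reproduced by hand from the quoted branch. The sketch below is a stand-in written in C, not GCC code: `struct toy_costs` and `toy_v2di_mult_cost` are hypothetical names, `mulss == 16` is taken from the comment above, and `sse_op == 4` is an assumption (it is not stated in this thread) picked because it reproduces the reported number. The scaling done by `ix86_vec_cost` is assumed to be the identity for a 128-bit vector.

```c
#include <assert.h>

/* Hypothetical stand-in for the two cost fields the quoted branch reads.  */
struct toy_costs {
    int mulss;   /* 16, as stated above */
    int sse_op;  /* ASSUMPTION: 4, not stated in the thread */
};

/* Mirrors the quoted V2DImode/V4DImode branch.  ix86_vec_cost is
   treated as returning its second argument unchanged (assumption).  */
static int toy_v2di_mult_cost(const struct toy_costs *cost, int have_xop)
{
    if (have_xop)
        return cost->mulss * 2 + cost->sse_op * 3;
    return cost->mulss * 3 + cost->sse_op * 5;
}

/* Two widening multiplies (LO and HI) are costed in the loop body.  */
static void check_body_cost(void)
{
    const struct toy_costs c = { 16, 4 };
    assert(toy_v2di_mult_cost(&c, 0) == 68);      /* 16*3 + 4*5 */
    assert(2 * toy_v2di_mult_cost(&c, 0) == 136); /* matches the dump */
}
```

Under those assumptions each widening multiply is charged as a full emulated V2DI multiply (68), while the SSE4 loop above spends a single pmuldq plus a few shuffles per multiply, which is the over-estimate being described.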


* [Bug target/39821] 120% slowdown with vectorizer
From: cvs-commit at gcc dot gnu.org @ 2021-07-27  8:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39821

--- Comment #7 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:c8ce54c6e67295b70052d1b9f9a2f7ce9e2f8f0d

commit r12-2524-gc8ce54c6e67295b70052d1b9f9a2f7ce9e2f8f0d
Author: Richard Biener <rguenther@suse.de>
Date:   Tue Jul 27 09:24:57 2021 +0200

    tree-optimization/39821 - fix cost classification for widening arith

    This adjusts the vectorizer to cost vector_stmt for widening
    arithmetic instead of vec_promote_demote in the line of telling
    the target that stmt_info->stmt is the meaningful piece we cost.

    2021-07-27  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/39821
            * tree-vect-stmts.c (vect_model_promotion_demotion_cost): Use
            vector_stmt for widening arithmetic.
            (vectorizable_conversion): Adjust.


* [Bug target/39821] 120% slowdown with vectorizer
From: rguenth at gcc dot gnu.org @ 2021-07-27  8:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39821

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
I've pushed the change that makes us run into ix86_multiplication_cost but as
said that doesn't differentiate highpart or widening multiply yet and thus
we're now missing optimizations because of too conservative costing.


* [Bug target/39821] 120% slowdown with vectorizer
From: crazylht at gmail dot com @ 2021-07-28  5:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39821

--- Comment #9 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #8)
> I've pushed the change that makes us run into ix86_multiplication_cost but
> as said that doesn't differentiate highpart or widening multiply yet and
> thus we're now missing optimizations because of too conservative costing.

For MULT_HIGHPART_EXPR, x86 only has pmulhw, so it's probably OK to go into
ix86_multiplication_cost.

For WIDEN_MULT_EXPR, we need a separate cost function which should also accept
sign info since we have pmuludq under sse2 but pmuldq under sse4.1.


i.e. we should vectorize udotproduct under sse2, but sdotproduct under sse4.1

#include<stdint.h>
uint64_t udotproduct(uint32_t *v1, uint32_t *v2, int order)
{
    uint64_t accum = 0;
    while (order--)
        accum += (uint64_t) *v1++ * *v2++;
    return accum;
}

#include<stdint.h>
int64_t sdotproduct(int32_t *v1, int32_t *v2, int order)
{
    int64_t accum = 0;
    while (order--)
        accum += (int64_t) *v1++ * *v2++;
    return accum;
}
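
To illustrate why the cost function needs sign info, here is a small portable C model of the two instructions. It is a sketch of their documented lane behaviour, not intrinsics code, and the function names are hypothetical. Both instructions multiply 32-bit lanes 0 and 2 of each operand into two 64-bit results; pmuludq treats the lanes as unsigned, pmuldq as signed.

```c
#include <assert.h>
#include <stdint.h>

/* Model of pmuludq (SSE2): unsigned 32x32->64 multiply of lanes 0 and 2.  */
static void pmuludq_model(const uint32_t a[4], const uint32_t b[4],
                          uint64_t out[2])
{
    out[0] = (uint64_t) a[0] * b[0];
    out[1] = (uint64_t) a[2] * b[2];
}

/* Model of pmuldq (SSE4.1): signed 32x32->64 multiply of lanes 0 and 2.  */
static void pmuldq_model(const int32_t a[4], const int32_t b[4],
                         int64_t out[2])
{
    out[0] = (int64_t) a[0] * b[0];
    out[1] = (int64_t) a[2] * b[2];
}

/* Returns 1 if feeding the same bit patterns to the unsigned multiply
   yields the same 64-bit result as the signed multiply, else 0.  */
static int unsigned_mul_matches_signed(int32_t x, int32_t y)
{
    const uint32_t ua[4] = { (uint32_t) x, 0, 0, 0 };
    const uint32_t ub[4] = { (uint32_t) y, 0, 0, 0 };
    const int32_t sa[4] = { x, 0, 0, 0 };
    const int32_t sb[4] = { y, 0, 0, 0 };
    uint64_t u[2];
    int64_t s[2];

    pmuludq_model(ua, ub, u);
    pmuldq_model(sa, sb, s);
    return u[0] == (uint64_t) s[0];
}
```

For non-negative inputs such as 3 and 5 the two agree, but for -1 and 2 the unsigned form gives 0x1FFFFFFFE instead of -2, so with only pmuludq available the signed case needs extra fix-up work and should be costed higher.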


* [Bug target/39821] 120% slowdown with vectorizer
From: cvs-commit at gcc dot gnu.org @ 2021-07-29  1:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39821

--- Comment #10 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:231bcc77b953406b8381c7f55a3ec181da67d1e7

commit r12-2586-g231bcc77b953406b8381c7f55a3ec181da67d1e7
Author: liuhongt <hongtao.liu@intel.com>
Date:   Wed Jul 28 16:24:52 2021 +0800

    Add a separate function to calculate cost for WIDEN_MULT_EXPR.

    gcc/ChangeLog:

            PR target/39821
            * config/i386/i386.c (ix86_widen_mult_cost): New function.
            (ix86_add_stmt_cost): Use ix86_widen_mult_cost for
            WIDEN_MULT_EXPR.

    gcc/testsuite/ChangeLog:

            PR target/39821
            * gcc.target/i386/sse2-pr39821.c: New test.
            * gcc.target/i386/sse4-pr39821.c: New test.


* [Bug target/39821] 120% slowdown with vectorizer
From: crazylht at gmail dot com @ 2021-07-29  1:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39821

--- Comment #11 from Hongtao.liu <crazylht at gmail dot com> ---
Fixed in GCC 12.


* [Bug target/39821] 120% slowdown with vectorizer
From: rguenth at gcc dot gnu.org @ 2021-08-26 12:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39821

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
   Target Milestone|---                         |12.0
             Status|NEW                         |RESOLVED

--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixed.
