[Bug tree-optimization/113458] New: Missed SLP for reduction of multiplication/addition with promotion

public inbox for gcc-bugs@sourceware.org

* [Bug tree-optimization/113458] New: Missed SLP for reduction of multiplication/addition with promotion
@ 2024-01-17 20:29 pinskia at gcc dot gnu.org
  2024-01-17 21:08 ` [Bug tree-optimization/113458] " pinskia at gcc dot gnu.org
                   ` (10 more replies)
  0 siblings, 11 replies; 12+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-01-17 20:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113458

            Bug ID: 113458
           Summary: Missed SLP for reduction of multiplication/addition
                    with promotion
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pinskia at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64-*-*

Take:
```
int f(short *a, signed char *b)
{
        int sum = 0;
        sum += a[0]*b[0];
        sum += a[1]*b[1];
        sum += a[2]*b[2];
        sum += a[3]*b[3];
        return sum;
}
```

This is not SLPed with GCC.

With `-fno-vect-cost-model` it is SLP vectorized, but in a very inefficient way.

LLVM produces:
```
        ldr     s0, [x1]
        ldr     d1, [x0]
        sshll   v0.8h, v0.8b, #0 // promote to short
        smull   v0.4s, v0.4h, v1.4h //multiply 2 shorts to ints
        addv    s0, v0.4s // do the reduction
        fmov    w0, s0
```

GCC should be able to produce this too.
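
For reference, the computation behind the sequence above can be modelled in C with the GCC/Clang generic vector extensions. This is only a sketch: the typedef names, `f_ref` and `f_vec` are made up for illustration, `__builtin_convertvector` is assumed to be available (GCC 9 or later, or Clang), and both inputs are widened straight to int instead of using a widening multiply, which gives the same result.

```c
#include <string.h>

/* Hypothetical typedefs for illustration; not taken from GCC or LLVM.  */
typedef signed char v4qi_t __attribute__ ((vector_size (4)));
typedef short v4hi_t __attribute__ ((vector_size (8)));
typedef int v4si_t __attribute__ ((vector_size (16)));

/* Scalar reference, equivalent to f in the report.  */
int f_ref (const short *a, const signed char *b)
{
  int sum = 0;
  for (int i = 0; i < 4; i++)
    sum += a[i] * b[i];
  return sum;
}

/* Vector form: two loads, two promotions, one multiply, one reduction.  */
int f_vec (const short *a, const signed char *b)
{
  v4hi_t va;
  v4qi_t vb;
  memcpy (&va, a, sizeof va);   /* one 8-byte load, like ldr d1 */
  memcpy (&vb, b, sizeof vb);   /* one 4-byte load, like ldr s0 */
  v4si_t wa = __builtin_convertvector (va, v4si_t);   /* short -> int */
  v4si_t wb = __builtin_convertvector (vb, v4si_t);   /* char -> int */
  v4si_t prod = wa * wb;
  return prod[0] + prod[1] + prod[2] + prod[3];       /* the reduction */
}
```

Whether a given compiler turns `f_vec` into the LLVM sequence shown above is not guaranteed; the sketch only spells out the data flow.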


* [Bug tree-optimization/113458] Missed SLP for reduction of multiplication/addition with promotion
  2024-01-17 20:29 [Bug tree-optimization/113458] New: Missed SLP for reduction of multiplication/addition with promotion pinskia at gcc dot gnu.org
@ 2024-01-17 21:08 ` pinskia at gcc dot gnu.org
  2024-01-18  5:13 ` liuhongt at gcc dot gnu.org
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-01-17 21:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113458

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
The loop based vectorizer is able to do a decent job at:
```
int f(short *a, signed char *b, int n)
{
        int sum = 0;
        n = 8;
        for(int i = 0;i < n; i++)
          sum += a[i]*b[i];
        return sum;
}
```

But if we reduce n to 4, the loop based vectorizer is not able to handle it
either.
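
For comparison, the `n = 4` variant referred to above would presumably look like this (the name `f4` is made up here); it computes the same value as the straight-line function in the original report.

```c
/* Loop form with the trip count forced to 4 (hypothetical name f4).  */
int f4(short *a, signed char *b, int n)
{
        int sum = 0;
        n = 4;   /* overrides the argument, as n = 8 does above */
        for(int i = 0;i < n; i++)
          sum += a[i]*b[i];
        return sum;
}
```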


* [Bug tree-optimization/113458] Missed SLP for reduction of multiplication/addition with promotion
  2024-01-17 20:29 [Bug tree-optimization/113458] New: Missed SLP for reduction of multiplication/addition with promotion pinskia at gcc dot gnu.org
  2024-01-17 21:08 ` [Bug tree-optimization/113458] " pinskia at gcc dot gnu.org
@ 2024-01-18  5:13 ` liuhongt at gcc dot gnu.org
  2024-01-18  8:05 ` rguenth at gcc dot gnu.org
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-01-18  5:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113458

--- Comment #2 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---

> But if we reduce n to 4, the loop based vectorizer is not able to handle it
> either.

Do we support 1-element vectors (i.e. V1SI) in the vectorizer?
It also relies on backend support for dot_prodv4qi.
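
As a rough scalar model of what a four-element signed-byte dot product computes (an assumption about the intended semantics of `dot_prodv4qi`; the name `dot_prod_v4qi_model` is made up): four byte products are summed into a single 32-bit accumulator, which is why the result would be a one-element vector such as V1SI. In the testcase here one operand is `short` rather than `signed char`, so this is only an illustration.

```c
/* Scalar model (assumed semantics): acc plus the sum of four widened
   byte products.  */
int dot_prod_v4qi_model (const signed char a[4], const signed char b[4],
                         int acc)
{
  for (int i = 0; i < 4; i++)
    acc += (int) a[i] * (int) b[i];
  return acc;
}
```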


* [Bug tree-optimization/113458] Missed SLP for reduction of multiplication/addition with promotion
  2024-01-17 20:29 [Bug tree-optimization/113458] New: Missed SLP for reduction of multiplication/addition with promotion pinskia at gcc dot gnu.org
  2024-01-17 21:08 ` [Bug tree-optimization/113458] " pinskia at gcc dot gnu.org
  2024-01-18  5:13 ` liuhongt at gcc dot gnu.org
@ 2024-01-18  8:05 ` rguenth at gcc dot gnu.org
  2024-01-18  8:10 ` rguenth at gcc dot gnu.org
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-01-18  8:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113458

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
On x86_64 with -mavx2 we vectorize

t.c:7:13: note: Vectorizing SLP tree:
t.c:7:13: note: Root stmt: sum_26 = _20 + sum_25;
t.c:7:13: note: node 0x57386c0 (max_nunits=4, refcnt=1) vector(4) int
t.c:7:13: note: op template: _20 = _17 * _19;
t.c:7:13: note:         stmt 0 _5 = _2 * _4;
t.c:7:13: note:         stmt 1 _10 = _7 * _9;
t.c:7:13: note:         stmt 2 _15 = _12 * _14;
t.c:7:13: note:         stmt 3 _20 = _17 * _19;
t.c:7:13: note:         children 0x5738748 0x5738858
t.c:7:13: note: node 0x5738748 (max_nunits=4, refcnt=1) vector(4) int
t.c:7:13: note: op template: _17 = (int) _16;
t.c:7:13: note:         stmt 0 _2 = (int) _1;
t.c:7:13: note:         stmt 1 _7 = (int) _6;
t.c:7:13: note:         stmt 2 _12 = (int) _11;
t.c:7:13: note:         stmt 3 _17 = (int) _16;
t.c:7:13: note:         children 0x57387d0
t.c:7:13: note: node 0x57387d0 (max_nunits=4, refcnt=1) vector(4) short int
t.c:7:13: note: op template: _16 = MEM[(short int *)a_22(D) + 6B];
t.c:7:13: note:         stmt 0 _1 = *a_22(D);
t.c:7:13: note:         stmt 1 _6 = MEM[(short int *)a_22(D) + 2B];
t.c:7:13: note:         stmt 2 _11 = MEM[(short int *)a_22(D) + 4B];
t.c:7:13: note:         stmt 3 _16 = MEM[(short int *)a_22(D) + 6B];
t.c:7:13: note: node 0x5738858 (max_nunits=4, refcnt=1) vector(4) int
t.c:7:13: note: op template: patt_37 = (int) patt_36;
t.c:7:13: note:         stmt 0 patt_28 = (int) patt_27;
t.c:7:13: note:         stmt 1 patt_31 = (int) patt_30;
t.c:7:13: note:         stmt 2 patt_34 = (int) patt_33;
t.c:7:13: note:         stmt 3 patt_37 = (int) patt_36;
t.c:7:13: note:         children 0x57388e0
t.c:7:13: note: node 0x57388e0 (max_nunits=4, refcnt=1) vector(4) signed short
t.c:7:13: note: op template: patt_36 = (signed short) _18;
t.c:7:13: note:         stmt 0 patt_27 = (signed short) _3;
t.c:7:13: note:         stmt 1 patt_30 = (signed short) _8;
t.c:7:13: note:         stmt 2 patt_33 = (signed short) _13;
t.c:7:13: note:         stmt 3 patt_36 = (signed short) _18;
t.c:7:13: note:         children 0x5738968
t.c:7:13: note: node 0x5738968 (max_nunits=4, refcnt=1) vector(4) signed char
t.c:7:13: note: op template: _18 = MEM[(signed char *)b_23(D) + 3B];
t.c:7:13: note:         stmt 0 _3 = *b_23(D);
t.c:7:13: note:         stmt 1 _8 = MEM[(signed char *)b_23(D) + 1B];
t.c:7:13: note:         stmt 2 _13 = MEM[(signed char *)b_23(D) + 2B];
t.c:7:13: note:         stmt 3 _18 = MEM[(signed char *)b_23(D) + 3B];

thus

  vect__16.5_40 = MEM <vector(4) short int> [(short int *)a_22(D)];
  vect__17.6_41 = (vector(4) int) vect__16.5_40;
  vect__18.9_44 = MEM <vector(4) signed char> [(signed char *)b_23(D)];
  vect_patt_36.10_45 = (vector(4) signed short) vect__18.9_44;
  vect_patt_37.11_46 = (vector(4) int) vect_patt_36.10_45;
  vect__20.12_48 = vect__17.6_41 * vect_patt_37.11_46;
  _49 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect__20.12_48);
  _50 = .REDUC_PLUS (_49); [tail call]
  _51 = (int) _50;

f:
.LFB0:
        .cfi_startproc
        vpmovsxbd       (%rsi), %xmm1
        vpmovsxwd       (%rdi), %xmm0
        vpmulld %xmm1, %xmm0, %xmm0
        vpsrldq $8, %xmm0, %xmm1
        vpaddd  %xmm1, %xmm0, %xmm0
        vpsrldq $4, %xmm0, %xmm1
        vpaddd  %xmm1, %xmm0, %xmm0
        vmovd   %xmm0, %eax
        ret

similar with SSE4.
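
The two `vpsrldq`/`vpaddd` pairs above implement the `.REDUC_PLUS` as a shift-and-add tree. A scalar C model of those instructions, lane by lane, is sketched below (the function name is made up; this models the instruction semantics and is not compiler code).

```c
/* x holds the four 32-bit lanes of %xmm0, lane 0 being the lowest.
   vpsrldq shifts the whole register right by a byte count and fills
   with zeros.  Unsigned lanes are used, as in the GIMPLE above, so
   wraparound is well defined.  */
int reduc_plus_model (const int x[4])
{
  unsigned v[4] = { (unsigned) x[0], (unsigned) x[1],
                    (unsigned) x[2], (unsigned) x[3] };
  unsigned t[4];

  /* vpsrldq $8: shift right by two lanes; then vpaddd.  */
  t[0] = v[2]; t[1] = v[3]; t[2] = 0; t[3] = 0;
  for (int i = 0; i < 4; i++)
    v[i] += t[i];

  /* vpsrldq $4: shift right by one lane; then vpaddd.  */
  t[0] = v[1]; t[1] = v[2]; t[2] = v[3]; t[3] = 0;
  for (int i = 0; i < 4; i++)
    v[i] += t[i];

  /* vmovd: lane 0 now holds the sum of all four input lanes.  */
  return (int) v[0];
}
```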

We do recognize widening multiplies as patterns, but somehow we are not using
them. That is likely because reduction root detection does not look for
patterns (an issue for all of them): root detection is done before pattern
recognition here.  Interestingly enough, for x86 we end up doing

t.c:7:13: note: ------>vectorizing SLP node starting from: patt_38 = _16 w* patt_36;
t.c:7:13: note: vect_is_simple_use: operand MEM[(short int *)a_22(D) + 6B], type of def: internal
t.c:7:13: note: vect_is_simple_use: operand (signed short) _18, type of def: internal
t.c:7:13: note: transform conversion. ncopies = 1.
t.c:7:13: note: add new stmt: _44 = (vector(4) int) vect__16.5_40;
t.c:7:13: note: add new stmt: _45 = (vector(4) int) vect_patt_36.9_43;
t.c:7:13: note: add new stmt: vect_patt_38.10_46 = _44 * _45;

thus adding the very same code as without the pattern.

Does the following help for ARM?

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 086377a9ac0..c0626720651 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -3649,6 +3649,9 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
       for (unsigned i = 0; i < bb_vinfo->roots.length (); ++i)
        {
          vect_location = bb_vinfo->roots[i].roots[0]->stmt;
+         for (unsigned j = 0; j < bb_vinfo->roots[i].stmts.length (); ++j)
+           bb_vinfo->roots[i].stmts[j]
+             = vect_stmt_to_vectorize (bb_vinfo->roots[i].stmts[j]);
          if (vect_build_slp_instance (bb_vinfo, bb_vinfo->roots[i].kind,
                                       bb_vinfo->roots[i].stmts,
                                       bb_vinfo->roots[i].roots,


* [Bug tree-optimization/113458] Missed SLP for reduction of multiplication/addition with promotion
  2024-01-17 20:29 [Bug tree-optimization/113458] New: Missed SLP for reduction of multiplication/addition with promotion pinskia at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2024-01-18  8:05 ` rguenth at gcc dot gnu.org
@ 2024-01-18  8:10 ` rguenth at gcc dot gnu.org
  2024-01-18  8:36 ` [Bug target/113458] " pinskia at gcc dot gnu.org
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-01-18  8:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113458

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao Liu from comment #2)
> > But if we reduce n to 4, the loop based vectorizer is not able to handle it
> > either.
> 
> Do we support 1 element vector(i.e V1SI) in vectorizer?

Yes, but I'm not sure we'd try it here.

For SVE with -msve-vector-bits=128 we fail to elide the load permutation,
though it looks odd:

t.c:7:13: missed:   unsupported vect permute { 3 2 0 1 7 6 4 5 11 10 8 9 }
t.c:7:13: missed:   unsupported load permutation

the SLP tree is

t.c:7:13: note:   Final SLP tree for instance 0x45933a0:
t.c:7:13: note:   node 0x45b82a0 (max_nunits=4, refcnt=2) vector(4) int
t.c:7:13: note:   op template: patt_38 = _16 w* patt_36;
t.c:7:13: note:         stmt 0 patt_38 = _16 w* patt_36;
t.c:7:13: note:         stmt 1 patt_35 = _11 w* patt_33;
t.c:7:13: note:         stmt 2 patt_29 = _1 w* patt_27;
t.c:7:13: note:         stmt 3 patt_32 = _6 w* patt_30;
t.c:7:13: note:         children 0x45b8330 0x45b83c0
t.c:7:13: note:   node 0x45b8330 (max_nunits=4, refcnt=2) vector(4) short int
t.c:7:13: note:   op template: _16 = MEM[(short int *)a_22(D) + 6B];
t.c:7:13: note:         stmt 0 _16 = MEM[(short int *)a_22(D) + 6B];
t.c:7:13: note:         stmt 1 _11 = MEM[(short int *)a_22(D) + 4B];
t.c:7:13: note:         stmt 2 _1 = *a_22(D);
t.c:7:13: note:         stmt 3 _6 = MEM[(short int *)a_22(D) + 2B];
t.c:7:13: note:         load permutation { 3 2 0 1 }
t.c:7:13: note:   node 0x45b83c0 (max_nunits=4, refcnt=2) vector(4) signed short
t.c:7:13: note:   op template: patt_36 = (signed short) _18;
t.c:7:13: note:         stmt 0 patt_36 = (signed short) _18;
t.c:7:13: note:         stmt 1 patt_33 = (signed short) _13;
t.c:7:13: note:         stmt 2 patt_27 = (signed short) _3;
t.c:7:13: note:         stmt 3 patt_30 = (signed short) _8;
t.c:7:13: note:         children 0x45b8450
t.c:7:13: note:   node 0x45b8450 (max_nunits=4, refcnt=2) vector(4) signed char
t.c:7:13: note:   op template: _18 = MEM[(signed char *)b_23(D) + 3B];
t.c:7:13: note:         stmt 0 _18 = MEM[(signed char *)b_23(D) + 3B];
t.c:7:13: note:         stmt 1 _13 = MEM[(signed char *)b_23(D) + 2B];
t.c:7:13: note:         stmt 2 _3 = *b_23(D);
t.c:7:13: note:         stmt 3 _8 = MEM[(signed char *)b_23(D) + 1B];
t.c:7:13: note:         load permutation { 3 2 0 1 }

it looks like NEON doesn't have integer vectors(!?)
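
On why the load permutation { 3 2 0 1 } looks like it should be elidable: both loads use the same permutation and the lanes are only summed afterwards, so the lane order should not affect the result. A small C sketch of that argument (the function name is made up; it illustrates the arithmetic, not the vectorizer's logic):

```c
/* Dot product with the same lane permutation applied to both inputs.
   perm[i] says which source element lands in lane i.  */
int dot4_permuted (const short *a, const signed char *b, const int perm[4])
{
  int sum = 0;
  for (int i = 0; i < 4; i++)
    sum += a[perm[i]] * b[perm[i]];
  return sum;
}
```

Calling it with `{ 3, 2, 0, 1 }` and with the identity `{ 0, 1, 2, 3 }` gives the same sum for any inputs, since integer addition is commutative and associative here.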


* [Bug target/113458] Missed SLP for reduction of multiplication/addition with promotion
  2024-01-17 20:29 [Bug tree-optimization/113458] New: Missed SLP for reduction of multiplication/addition with promotion pinskia at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2024-01-18  8:10 ` rguenth at gcc dot gnu.org
@ 2024-01-18  8:36 ` pinskia at gcc dot gnu.org
  2024-01-19  0:33 ` liuhongt at gcc dot gnu.org
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-01-18  8:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113458

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2024-01-18
     Ever confirmed|0                           |1
          Component|tree-optimization           |target
             Status|UNCONFIRMED                 |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org      |pinskia at gcc dot gnu.org

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Hmm, I have to double-check, but I suspect the issue is that there is currently
no v4qi mode for aarch64. It could be added as an emulated mode: basically v8qi,
except that loads and stores use s0.

Let me see if that helps here. The main thing I have to make sure of is that
the ABI does not change for vector(4) char (the aarch64 backend still uses the
mode in some cases).
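
A sketch of the emulation idea in C with the GCC vector extensions, under the assumption that a 4-byte vector simply occupies the low half of an 8-byte one and only the memory accesses are 4 bytes wide (the type and function names are made up for illustration):

```c
#include <string.h>

typedef signed char v8qi_t __attribute__ ((vector_size (8)));

/* 4-byte load into the low half of an 8-byte vector; the upper half is
   zeroed, as a 32-bit load into s0 does for the rest of the register.  */
void load_v4qi (v8qi_t *dst, const signed char *src)
{
  v8qi_t v = { 0 };
  memcpy (&v, src, 4);
  *dst = v;
}

/* 4-byte store of the low half only.  */
void store_v4qi (signed char *dst, const v8qi_t *src)
{
  memcpy (dst, src, 4);
}

/* A lane-wise v8qi operation is reused as is; the upper four lanes are
   computed but never stored.  */
void add_v4qi (signed char *out, const signed char *x, const signed char *y)
{
  v8qi_t a, b, r;
  load_v4qi (&a, x);
  load_v4qi (&b, y);
  r = a + b;
  store_v4qi (out, &r);
}
```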


* [Bug target/113458] Missed SLP for reduction of multiplication/addition with promotion
  2024-01-17 20:29 [Bug tree-optimization/113458] New: Missed SLP for reduction of multiplication/addition with promotion pinskia at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2024-01-18  8:36 ` [Bug target/113458] " pinskia at gcc dot gnu.org
@ 2024-01-19  0:33 ` liuhongt at gcc dot gnu.org
  2024-01-19 21:41 ` pinskia at gcc dot gnu.org
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-01-19  0:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113458

--- Comment #6 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---

> thus
> 
>   vect__16.5_40 = MEM <vector(4) short int> [(short int *)a_22(D)];
>   vect__17.6_41 = (vector(4) int) vect__16.5_40;
>   vect__18.9_44 = MEM <vector(4) signed char> [(signed char *)b_23(D)];
>   vect_patt_36.10_45 = (vector(4) signed short) vect__18.9_44;
>   vect_patt_37.11_46 = (vector(4) int) vect_patt_36.10_45;
>   vect__20.12_48 = vect__17.6_41 * vect_patt_37.11_46;
>   _49 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect__20.12_48);
>   _50 = .REDUC_PLUS (_49); [tail call]
>   _51 = (int) _50;
> 
>
Ideally, it should be recognized as DOT_PROD_EXPR, but
vect_recog_dot_prod_pattern only works for the loop vectorizer. We could add a
pattern in match.pd that rewrites


vect__17.6_41 = (vector(4) int) vect__16.5_40;
vect_patt_37.11_46 = (vector(4) int) vect_patt_36.10_45;
vect__20.12_48 = vect__17.6_41 * vect_patt_37.11_46;
_49 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect__20.12_48);
_50 = .REDUC_PLUS (_49); [tail call]

to

vect__20.12_48 = DOT_PROD_EXPR (vect__16.5_40, vect_patt_36.10_45, 0);
_49 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect__20.12_48);
_50 = .REDUC_PLUS (_49); [tail call]


* [Bug target/113458] Missed SLP for reduction of multiplication/addition with promotion
  2024-01-17 20:29 [Bug tree-optimization/113458] New: Missed SLP for reduction of multiplication/addition with promotion pinskia at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2024-01-19  0:33 ` liuhongt at gcc dot gnu.org
@ 2024-01-19 21:41 ` pinskia at gcc dot gnu.org
  2024-01-19 22:56 ` pinskia at gcc dot gnu.org
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-01-19 21:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113458

--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I have a patch which implements V4QI for many operations (extends and
widen_<su>sum), though I still need to fix the cost model.

I am able to get:
```
        sshll   v30.4h, v30.8b, #0
        smull   v31.4s, v31.4h, v30.4h
        addv    s31, v31.4s

```

But the load is still using inserts and tbl. I have not figured out why though.


* [Bug target/113458] Missed SLP for reduction of multiplication/addition with promotion
  2024-01-17 20:29 [Bug tree-optimization/113458] New: Missed SLP for reduction of multiplication/addition with promotion pinskia at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2024-01-19 21:41 ` pinskia at gcc dot gnu.org
@ 2024-01-19 22:56 ` pinskia at gcc dot gnu.org
  2024-01-20  0:41 ` pinskia at gcc dot gnu.org
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-01-19 22:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113458

--- Comment #8 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #7) 
> But the load is still using inserts and tbl. I have not figured out why
> though.

Looks like I have to support const PERMs.


* [Bug target/113458] Missed SLP for reduction of multiplication/addition with promotion
  2024-01-17 20:29 [Bug tree-optimization/113458] New: Missed SLP for reduction of multiplication/addition with promotion pinskia at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2024-01-19 22:56 ` pinskia at gcc dot gnu.org
@ 2024-01-20  0:41 ` pinskia at gcc dot gnu.org
  2024-01-20  8:09 ` pinskia at gcc dot gnu.org
  2024-01-21 22:14 ` pinskia at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-01-20  0:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113458

--- Comment #9 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #8)
> (In reply to Andrew Pinski from comment #7) 
> > But the load is still using inserts and tbl. I have not figured out why
> > though.
> 
> Looks like I have to support const PERMs.

Which I now support well enough.  Now on to the cost model.
I do get some testsuite failures, which means I need to add support for more
instructions in V4QI mode, but it is a good start.


* [Bug target/113458] Missed SLP for reduction of multiplication/addition with promotion
  2024-01-17 20:29 [Bug tree-optimization/113458] New: Missed SLP for reduction of multiplication/addition with promotion pinskia at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2024-01-20  0:41 ` pinskia at gcc dot gnu.org
@ 2024-01-20  8:09 ` pinskia at gcc dot gnu.org
  2024-01-21 22:14 ` pinskia at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-01-20  8:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113458

--- Comment #10 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I will note that LLVM does not try to do V2HI (or rather its equivalent there);
maybe the cost is too high for most cores ...


* [Bug target/113458] Missed SLP for reduction of multiplication/addition with promotion
  2024-01-17 20:29 [Bug tree-optimization/113458] New: Missed SLP for reduction of multiplication/addition with promotion pinskia at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2024-01-20  8:09 ` pinskia at gcc dot gnu.org
@ 2024-01-21 22:14 ` pinskia at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-01-21 22:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113458

--- Comment #11 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #9)
> (In reply to Andrew Pinski from comment #8)
> > (In reply to Andrew Pinski from comment #7) 
> > > But the load is still using inserts and tbl. I have not figured out why
> > > though.
> > 
> > Looks like I have to support const PERMs.
> 
> Which I have enough supported.  Now on to the cost model.
> I do get some testsuite failures which means I need to add more support
> instruction for the V4QI mode but it is a good start too.

Actually, the cost model was not the issue. The problems were a missing
movmisalign implementation for the mode and PERM support that was not done
correctly. With those two issues fixed, V4QI seems sufficiently supported.

I am thinking about removing V2HI support from my patches though.
