[Bug tree-optimization/98772] New: Widening patterns causing missed vectorization

public inbox for gcc-bugs@sourceware.org

* [Bug tree-optimization/98772] New: Widening patterns causing missed vectorization
@ 2021-01-20 15:26 joelh at gcc dot gnu.org
  2021-01-20 15:36 ` [Bug tree-optimization/98772] " rguenth at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: joelh at gcc dot gnu.org @ 2021-01-20 15:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98772

            Bug ID: 98772
           Summary: Widening patterns causing missed vectorization
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: joelh at gcc dot gnu.org
  Target Milestone: ---

Disabling the widening patterns (widening_mult, widening_plus, widening_minus)
allows some testcases to be vectorized better. Currently a mix of scalar and
vector code is produced, because the patterns are recognized and substituted
but vectorization then fails with 'no optab'. When the patterns are
recognized, a 16 bytes -> 16 shorts conversion using a pair of
8 bytes -> 8 shorts instructions is presumed, while the datatypes chosen in
'vectorizable_conversion' are 'vectype_in' 8 bytes and 'vectype_out' 8 shorts.
This causes scalar code to be emitted where these patterns were recognized.


For the following testcase, compiled with gcc -O3:

#include <stdint.h>
extern void wdiff( int16_t d[16], uint8_t *restrict pix1, uint8_t *restrict pix2)
{
  for( int y = 0; y < 4; y++ )
  {
    for( int x = 0; x < 4; x++ )
      d[x + y*4] = pix1[x] * pix2[x];
    pix1 += 16;
    pix2 += 16;
  }
}

The following output is seen: 8 of the 16 elements are processed using scalar
instructions and the other 8 using vector instructions.

wdiff:
.LFB0:
        .cfi_startproc
        ldrb    w3, [x1, 32]
        ldrb    w6, [x2, 32]
        ldrb    w8, [x1, 33]
        ldrb    w5, [x2, 33]
        ldrb    w4, [x1, 34]
        mul     w3, w3, w6
        ldrb    w7, [x1, 35]
        fmov    s0, w3
        ldrb    w3, [x2, 34]
        mul     w8, w8, w5
        ldrb    w9, [x2, 35]
        ldrb    w6, [x2, 48]
        ldrb    w5, [x1, 49]
        ins     v0.h[1], w8
        mul     w3, w4, w3
        mul     w7, w7, w9
        ldrb    w4, [x1, 48]
        ldrb    w8, [x2, 49]
        ldrb    w9, [x2, 50]
        ins     v0.h[2], w3
        ldrb    w3, [x1, 51]
        mul     w6, w6, w4
        ldrb    w4, [x1, 50]
        mul     w5, w5, w8
        ldrb    w8, [x2, 51]
        ldr     d2, [x1]
        ins     v0.h[3], w7
        ldr     d1, [x2]
        mul     w4, w4, w9
        ldr     d4, [x1, 16]
        ldr     d3, [x2, 16]
        mul     w1, w3, w8
        ins     v0.h[4], w6
        zip1    v2.2s, v2.2s, v4.2s
        zip1    v1.2s, v1.2s, v3.2s
        ins     v0.h[5], w5
        umull   v1.8h, v1.8b, v2.8b
        ins     v0.h[6], w4
        ins     v0.h[7], w1
        stp     q1, q0, [x0]
        ret


If the widening multiply pattern is disabled, e.g.:

-  { vect_recog_widen_mult_pattern, "widen_mult" },
+  //{ vect_recog_widen_mult_pattern, "widen_mult" },
in tree-vect-patterns.c

then the same testcase processes all 16 elements using vector instructions.

wdiff:
.LFB0:
        .cfi_startproc
        ldr     b3, [x1, 33]
        ldr     b2, [x2, 33]
        ldr     b1, [x1, 32]
        ldr     b0, [x2, 32]
        ldr     b5, [x1, 34]
        ins     v1.b[1], v3.b[0]
        ldr     b4, [x2, 34]
        ins     v0.b[1], v2.b[0]
        ldr     b3, [x1, 35]
        ldr     b2, [x2, 35]
        ldr     b19, [x1, 48]
        ins     v1.b[2], v5.b[0]
        ldr     b17, [x2, 48]
        ins     v0.b[2], v4.b[0]
        ldr     b18, [x1, 49]
        ldr     b16, [x2, 49]
        ldr     b7, [x1, 50]
        ins     v1.b[3], v3.b[0]
        ldr     b6, [x2, 50]
        ins     v0.b[3], v2.b[0]
        ldr     b5, [x1, 51]
        ldr     b4, [x2, 51]
        ldr     d3, [x1]
        ins     v1.b[4], v19.b[0]
        ldr     d2, [x2]
        ins     v0.b[4], v17.b[0]
        ldr     d19, [x1, 16]
        ldr     d17, [x2, 16]
        ins     v1.b[5], v18.b[0]
        zip1    v3.2s, v3.2s, v19.2s
        ins     v0.b[5], v16.b[0]
        zip1    v2.2s, v2.2s, v17.2s
        ins     v1.b[6], v7.b[0]
        umull   v2.8h, v2.8b, v3.8b
        ins     v0.b[6], v6.b[0]
        ins     v1.b[7], v5.b[0]
        ins     v0.b[7], v4.b[0]
        umull   v0.8h, v0.8b, v1.8b
        stp     q2, q0, [x0]
        ret
        .cfi_endproc

Note the use of two umull instructions.



The same can be seen for widening plus and widening minus.
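
For reference, the plus and minus variants might look like the sketch below.
The function names wadd and wsub are hypothetical and do not come from the
report; the loops keep the shape of wdiff and change only the operator.

```c
#include <stdint.h>

/* Hypothetical widening-plus variant of wdiff: the uint8_t operands are
   promoted, added, and the result is stored as int16_t.  */
void wadd (int16_t d[16], const uint8_t *restrict pix1,
           const uint8_t *restrict pix2)
{
  for (int y = 0; y < 4; y++)
    {
      for (int x = 0; x < 4; x++)
        d[x + y * 4] = pix1[x] + pix2[x];
      pix1 += 16;
      pix2 += 16;
    }
}

/* Hypothetical widening-minus variant: same loop shape, subtraction
   instead of addition.  */
void wsub (int16_t d[16], const uint8_t *restrict pix1,
           const uint8_t *restrict pix2)
{
  for (int y = 0; y < 4; y++)
    {
      for (int x = 0; x < 4; x++)
        d[x + y * 4] = pix1[x] - pix2[x];
      pix1 += 16;
      pix2 += 16;
    }
}
```

Each call reads 4 bytes from each of 4 rows that are 16 bytes apart, so the
input buffers need to be at least 52 bytes long.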

It appears to be due to the way that vectype_in is chosen in
vectorizable_conversion.

In vectorizable_conversion (tree-vect-stmts.c:4626), vect_is_simple_use fills
the &vectype1_in argument, which in turn sets vectype_in.



During SLP vectorization, vect_is_simple_use uses the SLP tree vectype:

tree-vect-stmts.c:11369:
  if (slp_node)
    {
      slp_tree child = SLP_TREE_CHILDREN (slp_node)[operand];
      *slp_def = child;
      *vectype = SLP_TREE_VECTYPE (child);
      if (SLP_TREE_DEF_TYPE (child) == vect_internal_def)
        {
          *op = gimple_get_lhs (SLP_TREE_REPRESENTATIVE (child)->stmt);
          return vect_is_simple_use (*op, vinfo, dt, def_stmt_info_out);
        }



For 'vect' (loop) vectorization, the def_stmt_info is used instead.


* [Bug tree-optimization/98772] Widening patterns causing missed vectorization
From: rguenth at gcc dot gnu.org @ 2021-01-20 15:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98772

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Blocks|                            |53947
             Target|                            |arm
            Version|unknown                     |11.0

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Looks like arm assembly, so assuming an arm target.  The pattern recognizer is
supposed to a) fix the vector type and b) verify that the target supports the
op.  You need to trace where there's a disconnect between
vectorizable_conversion checking and pattern match checking.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations


* [Bug tree-optimization/98772] Widening patterns causing missed vectorization
From: joelh at gcc dot gnu.org @ 2021-01-21 11:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98772

--- Comment #2 from Joel Hutton <joelh at gcc dot gnu.org> ---
Yes, it is aarch64, I have updated the field.


* [Bug tree-optimization/98772] Widening patterns causing missed vectorization
From: rguenth at gcc dot gnu.org @ 2021-01-21 13:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98772

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
But the issue seems to be

t.c:3:22: note:   ==> examining statement: _34 = *pix1_19;
t.c:3:22: missed:   permutation requires at least three vectors _34 = *pix1_19;
t.c:3:22: missed:   unsupported load permutation
t.c:6:24: missed:   not vectorized: relevant stmt not supported: _34 = *pix1_19;
t.c:3:22: note:   removing SLP instance operations starting from: *_44 = _45;
t.c:3:22: missed:  unsupported SLP instances
t.c:3:22: note:  re-trying with SLP disabled

so SLP vectorization is failing because of unsupported permutes with the larger
vector size, and the non-SLP case is failing with

t.c:3:22: missed:  loop does not have enough iterations to support vectorization.
t.c:3:22: note:  ***** Analysis failed with vector mode V16QI

so I don't see the connection with the pattern.  Only for V8QI do I see it
remotely mentioned, but there we have _different_ patterns matched...

I think the permute issue is "old" and goes away if you make it
strided-slp by incrementing pix1/2 by a non-constant, then we can
load the vector by char[4] pieces.  We just don't consider that
possibility when instead trying "strided" (with gap at the end).
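
A sketch of what such a strided variant could look like is below.  The
function name wdiff_strided and the extra stride parameter are assumptions
made for illustration and are not part of the original testcase; whether the
vectorizer then handles it as described has not been checked here.

```c
#include <stddef.h>
#include <stdint.h>

/* Variant of wdiff in which the row step is a run-time value instead of
   the constant 16.  The "stride" parameter is hypothetical.  */
void wdiff_strided (int16_t d[16], const uint8_t *restrict pix1,
                    const uint8_t *restrict pix2, ptrdiff_t stride)
{
  for (int y = 0; y < 4; y++)
    {
      for (int x = 0; x < 4; x++)
        d[x + y * 4] = pix1[x] * pix2[x];
      /* Non-constant increment, unlike "pix1 += 16" in the report.  */
      pix1 += stride;
      pix2 += stride;
    }
}
```

Calling it with stride equal to 16 computes the same values as the original
wdiff.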

The widen patterns are a red herring here I think.


* [Bug tree-optimization/98772] Widening patterns causing missed vectorization
From: rsandifo at gcc dot gnu.org @ 2021-01-21 15:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98772

rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2021-01-21
                 CC|                            |rsandifo at gcc dot gnu.org
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

--- Comment #4 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
To try to summarise a conversation we had on IRC:

As things stand, codes like WIDEN_MULT_EXPR are intended
to be code-generated as a hi/lo pair, with both the hi
and lo operation being vector(N*2) → vector(N) operations.
This works for BB SLP if the SLP group size is ≥ N*2,
but (as things stand) is bound to fail otherwise.

On targets that operate on only a single vector size,
a hard failure is not a problem for group sizes < N*2,
since we would have failed in the same place even if
we hadn't matched a WIDEN_MULT_EXPR.  But it hurts on
aarch64 because we could vectorise the multiplication
and conversions using mixed vector sizes.

I think the conclusion was that:

(1) We should define vector(N) → vector(N) optabs for
    each current widening operation.  E.g. in the testcase
    aarch64 would provide v8qi → v8hi widening operations.

(2) We should add directly-mapped internal functions for the new optabs.

(3) We should make the modifier==NONE paths in vectorizable_conversion
    use the new internal functions for WIDEN_*_EXPRs.
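
To make the difference between the two shapes concrete, here is a scalar
model in plain C.  It is only a sketch: the function names are hypothetical,
it is not GCC code, and it assumes that "lo" selects the first eight elements
and "hi" the last eight, which in practice depends on the target.

```c
#include <stdint.h>

/* Model of the "lo" half of a hi/lo pair: 16 bytes in, 8 halfwords out,
   using elements 0..7 (assumed ordering).  */
void widen_mult_lo (uint16_t out[8], const uint8_t a[16], const uint8_t b[16])
{
  for (int i = 0; i < 8; i++)
    out[i] = (uint16_t) (a[i] * b[i]);
}

/* Model of the "hi" half: same inputs, elements 8..15.  */
void widen_mult_hi (uint16_t out[8], const uint8_t a[16], const uint8_t b[16])
{
  for (int i = 0; i < 8; i++)
    out[i] = (uint16_t) (a[i + 8] * b[i + 8]);
}

/* Model of the proposed vector(N) -> vector(N) form: 8 bytes in,
   8 halfwords out, with no partner operation needed.  */
void widen_mult_half (uint16_t out[8], const uint8_t a[8], const uint8_t b[8])
{
  for (int i = 0; i < 8; i++)
    out[i] = (uint16_t) (a[i] * b[i]);
}
```

With a group of only 8 elements there is no full 16-byte input for the hi/lo
pair to split, which is the case the single 8 -> 8 form is meant to cover.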


* [Bug tree-optimization/98772] Widening patterns causing missed vectorization
From: cvs-commit at gcc dot gnu.org @ 2021-02-11 15:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98772

--- Comment #5 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Joel Hutton <joelh@gcc.gnu.org>:

https://gcc.gnu.org/g:4af29981ab57ad7ef4467e371e4145cce9c16eaa

commit r11-7189-g4af29981ab57ad7ef4467e371e4145cce9c16eaa
Author: Joel Hutton <joel.hutton@arm.com>
Date:   Thu Feb 11 14:59:26 2021 +0000

    [aarch64][vect] Support V8QI->V8HI WIDEN_ patterns

    In the case where 8 out of every 16 elements are widened using a
    widening pattern and the next 8 are skipped, the patterns are not
    recognized. This is because they are normally used in a pair, such as
    VEC_WIDEN_MINUS_HI/LO, to achieve a v16qi->v16hi conversion for example.
    This patch adds support for V8QI->V8HI patterns.

    gcc/ChangeLog:

            PR tree-optimization/98772
            * optabs-tree.c (supportable_half_widening_operation): New function
            to check for supportable V8QI->V8HI widening patterns.
            * optabs-tree.h (supportable_half_widening_operation): New
            function.
            * tree-vect-stmts.c (vect_create_half_widening_stmts): New function
            to create promotion stmts for V8QI->V8HI widening patterns.
            (vectorizable_conversion): Add case for V8QI->V8HI.

    gcc/testsuite/ChangeLog:

            PR tree-optimization/98772
            * gcc.target/aarch64/pr98772.c: New test.


* [Bug tree-optimization/98772] Widening patterns causing missed vectorization
From: joelh at gcc dot gnu.org @ 2021-02-11 15:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98772

Joel Hutton <joelh at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #6 from Joel Hutton <joelh at gcc dot gnu.org> ---
fixed on trunk

