public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/98772] New: Widening patterns causing missed vectorization
@ 2021-01-20 15:26 joelh at gcc dot gnu.org
2021-01-20 15:36 ` [Bug tree-optimization/98772] " rguenth at gcc dot gnu.org
` (5 more replies)
0 siblings, 6 replies; 7+ messages in thread
From: joelh at gcc dot gnu.org @ 2021-01-20 15:26 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98772
Bug ID: 98772
Summary: Widening patterns causing missed vectorization
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: joelh at gcc dot gnu.org
Target Milestone: ---
Disabling the widening patterns (widening_mult, widening_plus, widening_minus)
allows some testcases to be vectorized better. Currently a mix of scalar and
vector code is produced, because the patterns are recognized and substituted
but vectorization then fails with 'no optab'. When the patterns are recognized,
a 16 bytes -> 16 shorts conversion using a pair of 8 bytes -> 8 shorts
instructions is presumed, but the datatypes chosen in 'vectorizable_conversion'
are 'vectype_in' 8 bytes and 'vectype_out' 8 shorts. This causes scalar code to
be emitted where these patterns were recognized.
For the following testcase, compiled with gcc -O3:
#include <stdint.h>
extern void wdiff( int16_t d[16], uint8_t *restrict pix1,
                   uint8_t *restrict pix2)
{
    for( int y = 0; y < 4; y++ )
    {
        for( int x = 0; x < 4; x++ )
            d[x + y*4] = pix1[x] * pix2[x];
        pix1 += 16;
        pix2 += 16;
    }
}
The following output is seen: 8 of the 16 elements are processed using scalar
instructions and the other 8 using vector instructions.
wdiff:
.LFB0:
.cfi_startproc
ldrb w3, [x1, 32]
ldrb w6, [x2, 32]
ldrb w8, [x1, 33]
ldrb w5, [x2, 33]
ldrb w4, [x1, 34]
mul w3, w3, w6
ldrb w7, [x1, 35]
fmov s0, w3
ldrb w3, [x2, 34]
mul w8, w8, w5
ldrb w9, [x2, 35]
ldrb w6, [x2, 48]
ldrb w5, [x1, 49]
ins v0.h[1], w8
mul w3, w4, w3
mul w7, w7, w9
ldrb w4, [x1, 48]
ldrb w8, [x2, 49]
ldrb w9, [x2, 50]
ins v0.h[2], w3
ldrb w3, [x1, 51]
mul w6, w6, w4
ldrb w4, [x1, 50]
mul w5, w5, w8
ldrb w8, [x2, 51]
ldr d2, [x1]
ins v0.h[3], w7
ldr d1, [x2]
mul w4, w4, w9
ldr d4, [x1, 16]
ldr d3, [x2, 16]
mul w1, w3, w8
ins v0.h[4], w6
zip1 v2.2s, v2.2s, v4.2s
zip1 v1.2s, v1.2s, v3.2s
ins v0.h[5], w5
umull v1.8h, v1.8b, v2.8b
ins v0.h[6], w4
ins v0.h[7], w1
stp q1, q0, [x0]
ret
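The split in the listing above lines up with the testcase's access pattern: output element i reads source byte (i / 4) * 16 + (i % 4). The small C sketch below spells out that mapping; the helper name wdiff_src_offset is made up for illustration and is not part of the report. Outputs 0-7 come from offsets 0-3 and 16-19, which the listing loads with ldr d and zip1, while outputs 8-15 come from offsets 32-35 and 48-51, which it loads one byte at a time with ldrb.

```c
#include <assert.h>

/* Hypothetical helper (not from the bug report): byte offset into
   pix1/pix2 that the wdiff testcase reads to compute d[i].  Each row
   contributes 4 bytes and consecutive rows are 16 bytes apart.  */
int wdiff_src_offset (int i)
{
  assert (i >= 0 && i < 16);
  return (i / 4) * 16 + (i % 4);
}
```

For example, wdiff_src_offset (8) is 32 and wdiff_src_offset (15) is 51, matching the first and last ldrb offsets in the scalar part of the listing.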
If the widening multiply pattern is disabled, e.g.:
- { vect_recog_widen_mult_pattern, "widen_mult" },
+ //{ vect_recog_widen_mult_pattern, "widen_mult" },
in tree-vect-patterns.c, then the same testcase processes all 16 elements
using vector instructions:
wdiff:
.LFB0:
.cfi_startproc
ldr b3, [x1, 33]
ldr b2, [x2, 33]
ldr b1, [x1, 32]
ldr b0, [x2, 32]
ldr b5, [x1, 34]
ins v1.b[1], v3.b[0]
ldr b4, [x2, 34]
ins v0.b[1], v2.b[0]
ldr b3, [x1, 35]
ldr b2, [x2, 35]
ldr b19, [x1, 48]
ins v1.b[2], v5.b[0]
ldr b17, [x2, 48]
ins v0.b[2], v4.b[0]
ldr b18, [x1, 49]
ldr b16, [x2, 49]
ldr b7, [x1, 50]
ins v1.b[3], v3.b[0]
ldr b6, [x2, 50]
ins v0.b[3], v2.b[0]
ldr b5, [x1, 51]
ldr b4, [x2, 51]
ldr d3, [x1]
ins v1.b[4], v19.b[0]
ldr d2, [x2]
ins v0.b[4], v17.b[0]
ldr d19, [x1, 16]
ldr d17, [x2, 16]
ins v1.b[5], v18.b[0]
zip1 v3.2s, v3.2s, v19.2s
ins v0.b[5], v16.b[0]
zip1 v2.2s, v2.2s, v17.2s
ins v1.b[6], v7.b[0]
umull v2.8h, v2.8b, v3.8b
ins v0.b[6], v6.b[0]
ins v1.b[7], v5.b[0]
ins v0.b[7], v4.b[0]
umull v0.8h, v0.8b, v1.8b
stp q2, q0, [x0]
ret
.cfi_endproc
Note the use of two umull instructions.
The same can be seen for widening plus and widening minus.
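For readers who want to try the plus and minus cases, the variants below are a guess at what such testcases could look like: they are the wdiff testcase with the multiplication swapped for an addition or a subtraction. The names wsum and wsub are hypothetical, the pointer asserts are only there as a sanity check, and whether a given GCC version matches the widening plus/minus patterns on them has not been verified here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical variant of wdiff: widening add of two uint8_t rows.  */
void wsum (int16_t d[16], uint8_t *restrict pix1, uint8_t *restrict pix2)
{
  assert (d && pix1 && pix2);
  for (int y = 0; y < 4; y++)
    {
      for (int x = 0; x < 4; x++)
        d[x + y * 4] = pix1[x] + pix2[x];
      pix1 += 16;
      pix2 += 16;
    }
}

/* Hypothetical variant of wdiff: widening subtract of two uint8_t rows.  */
void wsub (int16_t d[16], uint8_t *restrict pix1, uint8_t *restrict pix2)
{
  assert (d && pix1 && pix2);
  for (int y = 0; y < 4; y++)
    {
      for (int x = 0; x < 4; x++)
        d[x + y * 4] = pix1[x] - pix2[x];
      pix1 += 16;
      pix2 += 16;
    }
}
```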
It appears to be due to the way that vectype_in is chosen in
vectorizable_conversion. In vectorizable_conversion (tree-vect-stmts.c:4626),
vect_is_simple_use fills the &vectype1_in parameter, which fills the vectype_in
parameter.
During SLP vectorization, vect_is_simple_use uses the SLP tree vectype:
tree-vect-stmts.c:11369:
  if (slp_node)
    {
      slp_tree child = SLP_TREE_CHILDREN (slp_node)[operand];
      *slp_def = child;
      *vectype = SLP_TREE_VECTYPE (child);
      if (SLP_TREE_DEF_TYPE (child) == vect_internal_def)
        {
          *op = gimple_get_lhs (SLP_TREE_REPRESENTATIVE (child)->stmt);
          return vect_is_simple_use (*op, vinfo, dt, def_stmt_info_out);
        }
For 'vect' vectorization, the def_stmt_info is used.
* [Bug tree-optimization/98772] Widening patterns causing missed vectorization
From: rguenth at gcc dot gnu.org @ 2021-01-20 15:36 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98772
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Blocks|                            |53947
             Target|                            |arm
            Version|unknown                     |11.0
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Looks like Arm assembly, so I am assuming an arm target. The pattern recognizer
is supposed to a) fix the vector type and b) verify that the target supports
the op. You need to trace where there is a disconnect between the
vectorizable_conversion checks and the pattern match checks.
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
* [Bug tree-optimization/98772] Widening patterns causing missed vectorization
From: joelh at gcc dot gnu.org @ 2021-01-21 11:13 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98772
--- Comment #2 from Joel Hutton <joelh at gcc dot gnu.org> ---
Yes, it is aarch64, I have updated the field.
* [Bug tree-optimization/98772] Widening patterns causing missed vectorization
From: rguenth at gcc dot gnu.org @ 2021-01-21 13:18 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98772
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
But the issue seems to be
t.c:3:22: note: ==> examining statement: _34 = *pix1_19;
t.c:3:22: missed: permutation requires at least three vectors _34 = *pix1_19;
t.c:3:22: missed: unsupported load permutation
t.c:6:24: missed: not vectorized: relevant stmt not supported: _34 = *pix1_19;
t.c:3:22: note: removing SLP instance operations starting from: *_44 = _45;
t.c:3:22: missed: unsupported SLP instances
t.c:3:22: note: re-trying with SLP disabled
so SLP vectorization fails because of unsupported permutes with the larger
vector size, and the non-SLP case fails with
t.c:3:22: missed: loop does not have enough iterations to support vectorization.
t.c:3:22: note: ***** Analysis failed with vector mode V16QI
so I don't see the connection with the pattern. Only for V8QI do I see it
remotely mentioned, but there we have _different_ patterns matched...
I think the permute issue is "old" and goes away if you make it
strided-slp by incrementing pix1/2 by a non-constant, then we can
load the vector by char[4] pieces. We just don't consider that
possibility when instead trying "strided" (with gap at the end).
The widen patterns are a red herring here I think.
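One way to read the suggestion above is to make the row increment a run-time value instead of the constant 16. The sketch below does that with a hypothetical extra stride parameter; the name wdiff_strided and the parameter are illustrative assumptions, the pointer assert is only a sanity check, and whether this actually takes the strided-SLP path depends on the compiler version and has not been checked here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical variant of wdiff: pix1/pix2 advance by a non-constant
   stride, so each row is a separate group of 4 adjacent bytes.  */
void wdiff_strided (int16_t d[16], uint8_t *restrict pix1,
                    uint8_t *restrict pix2, int stride)
{
  assert (d && pix1 && pix2);
  for (int y = 0; y < 4; y++)
    {
      for (int x = 0; x < 4; x++)
        d[x + y * 4] = pix1[x] * pix2[x];
      pix1 += stride;
      pix2 += stride;
    }
}
```

Calling it with stride 16 computes the same values as the original testcase.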
* [Bug tree-optimization/98772] Widening patterns causing missed vectorization
From: rsandifo at gcc dot gnu.org @ 2021-01-21 15:30 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98772
rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2021-01-21
                 CC|                            |rsandifo at gcc dot gnu.org
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
--- Comment #4 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
To try to summarise a conversation we had on IRC:
As things stand, codes like WIDEN_MULT_EXPR are intended
to be code-generated as a hi/lo pair, with both the hi
and lo operation being vector(N*2) → vector(N) operations.
This works for BB SLP if the SLP group size is ≥ N*2,
but (as things stand) is bound to fail otherwise.
On targets that operate on only a single vector size,
a hard failure is not a problem for group sizes < N*2,
since we would have failed in the same place even if
we hadn't matched a WIDEN_MULT_EXPR. But it hurts on
aarch64 because we could vectorise the multiplication
and conversions using mixed vector sizes.
I think the conclusion was that:
(1) We should define vector(N) → vector(N) optabs for
each current widening operation. E.g. in the testcase
aarch64 would provide v8qi → v8hi widening operations.
(2) We should add directly-mapped internal functions for the new optabs.
(3) We should make the modifier==NONE paths in vectorizable_conversion
use the new internal functions for WIDEN_*_EXPRs.
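To make the difference between the existing hi/lo pair and the proposed vector(N) → vector(N) operation concrete, here is a scalar C model with N = 8. It is only an illustration of the data flow, not GCC code: the function names are invented, the pointer asserts are sanity checks, and treating "lo" as lanes 0..N-1 is an assumption (the real lane assignment depends on the target's endianness).

```c
#include <assert.h>
#include <stdint.h>

enum { N = 8 };  /* lanes in the narrower, 64-bit vector */

/* Model of the "lo" half of a hi/lo widening multiply: reads lanes
   0..N-1 of two 2N-lane byte vectors, produces N 16-bit lanes.  */
void model_widen_mult_lo (uint16_t out[N], const uint8_t a[2 * N],
                          const uint8_t b[2 * N])
{
  assert (out && a && b);
  for (int i = 0; i < N; i++)
    out[i] = (uint16_t) (a[i] * b[i]);
}

/* Model of the "hi" half: reads lanes N..2N-1 of the same inputs.  */
void model_widen_mult_hi (uint16_t out[N], const uint8_t a[2 * N],
                          const uint8_t b[2 * N])
{
  assert (out && a && b);
  for (int i = 0; i < N; i++)
    out[i] = (uint16_t) (a[N + i] * b[N + i]);
}

/* Model of a single vector(N) -> vector(N) widening multiply, in the
   spirit of point (1): only N input lanes are needed.  */
void model_widen_mult_half (uint16_t out[N], const uint8_t a[N],
                            const uint8_t b[N])
{
  assert (out && a && b);
  for (int i = 0; i < N; i++)
    out[i] = (uint16_t) (a[i] * b[i]);
}
```

The hi/lo pair only makes sense when 2N input lanes are available, which is the group-size condition described above; the half form needs just N lanes, corresponding to the v8qi → v8hi case.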
* [Bug tree-optimization/98772] Widening patterns causing missed vectorization
From: cvs-commit at gcc dot gnu.org @ 2021-02-11 15:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98772
--- Comment #5 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Joel Hutton <joelh@gcc.gnu.org>:
https://gcc.gnu.org/g:4af29981ab57ad7ef4467e371e4145cce9c16eaa
commit r11-7189-g4af29981ab57ad7ef4467e371e4145cce9c16eaa
Author: Joel Hutton <joel.hutton@arm.com>
Date: Thu Feb 11 14:59:26 2021 +0000
[aarch64][vect] Support V8QI->V8HI WIDEN_ patterns
In the case where 8 out of every 16 elements are widened using a
widening pattern and the next 8 are skipped, the patterns are not
recognized. This is because they are normally used in a pair, such as
VEC_WIDEN_MINUS_HI/LO, to achieve a v16qi->v16hi conversion for example.
This patch adds support for V8QI->V8HI patterns.
gcc/ChangeLog:
PR tree-optimization/98772
* optabs-tree.c (supportable_half_widening_operation): New function
to check for supportable V8QI->V8HI widening patterns.
* optabs-tree.h (supportable_half_widening_operation): New
function.
* tree-vect-stmts.c (vect_create_half_widening_stmts): New function
to create promotion stmts for V8QI->V8HI widening patterns.
(vectorizable_conversion): Add case for V8QI->V8HI.
gcc/testsuite/ChangeLog:
PR tree-optimization/98772
* gcc.target/aarch64/pr98772.c: New test.
* [Bug tree-optimization/98772] Widening patterns causing missed vectorization
From: joelh at gcc dot gnu.org @ 2021-02-11 15:07 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98772
Joel Hutton <joelh at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED
--- Comment #6 from Joel Hutton <joelh at gcc dot gnu.org> ---
fixed on trunk