public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/106346] New: Potential regression on vectorization of left shift with constants
@ 2022-07-18 18:28 manolis.tsamis at vrull dot eu
  2022-07-19  6:52 ` [Bug target/106346] " rguenth at gcc dot gnu.org
                   ` (10 more replies)
  0 siblings, 11 replies; 12+ messages in thread
From: manolis.tsamis at vrull dot eu @ 2022-07-18 18:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346

            Bug ID: 106346
           Summary: Potential regression on vectorization of left shift
                    with constants
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: manolis.tsamis at vrull dot eu
  Target Milestone: ---
            Target: aarch64

Created attachment 53317
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53317&action=edit
Does not vectorize on GCC > 10.3

The following test case:

  void foo (uint32_t dst[8], uint8_t src1[8], uint8_t src2[8])
  {
    uint16_t diff_e0 = src1[0] - src2[0];
    uint16_t diff_e1 = src1[1] - src2[1];
    uint16_t diff_e2 = src1[2] - src2[2];
    uint16_t diff_e3 = src1[3] - src2[3];
    uint16_t diff_e4 = src1[4] - src2[4];
    uint16_t diff_e5 = src1[5] - src2[5];
    uint16_t diff_e6 = src1[6] - src2[6];
    uint16_t diff_e7 = src1[7] - src2[7];

    uint32_t a0 = diff_e0 << 1;
    uint32_t a1 = diff_e1 << 3;
    uint32_t a2 = diff_e2 << 4;
    uint32_t a3 = diff_e3 << 2;
    uint32_t a4 = diff_e4 << 12;
    uint32_t a5 = diff_e5 << 11;
    uint32_t a6 = diff_e6 << 9;
    uint32_t a7 = diff_e7 << 3;

    dst[0] = a0;
    dst[1] = a1;
    dst[2] = a2;
    dst[3] = a3;
    dst[4] = a4;
    dst[5] = a5;
    dst[6] = a6;
    dst[7] = a7;
  }

Compiles at -O3 to nice vectorized code by loading the constants from memory in
GCC 10.3:

  ldr     d0, [x1]
  adrp    x3, .LC0
  ldr     d1, [x2]
  adrp    x1, .LC1
  ldr     q3, [x3, #:lo12:.LC0]
  usubl   v0.8h, v0.8b, v1.8b
  ldr     q2, [x1, #:lo12:.LC1]
  uxtl    v1.4s, v0.4h
  uxtl2   v0.4s, v0.8h
  sshl    v1.4s, v1.4s, v3.4s
  sshl    v0.4s, v0.4s, v2.4s
  stp     q1, q0, [x0]
  ret

But this has regressed in later releases, with GCC still loading the constants
from memory but also emitting a lot of scalar code before that. For example GCC
13 produces:

  adrp    x3, .LC0
  ldrb    w6, [x1, 4]
  fmov    d0, x6
  ldrb    w7, [x1]
  ldr     q5, [x3, #:lo12:.LC0]
  fmov    d1, x7
  ldrb    w3, [x1, 5]
  ldrb    w4, [x1, 1]
  ldrb    w8, [x2, 4]
  ldrb    w5, [x2, 5]
  ins     v0.h[1], w3
  ldrb    w6, [x2]
  fmov    d2, x8
  ldrb    w3, [x2, 1]
  fmov    d3, x6
  ins     v2.h[1], w5
  ins     v1.h[1], w4
  ldrb    w9, [x1, 2]
  ins     v3.h[1], w3
  ldrb    w8, [x1, 6]
  ldrb    w7, [x2, 2]
  ldrb    w6, [x2, 6]
  ins     v1.h[2], w9
  ins     v0.h[2], w8
  ldrb    w5, [x1, 3]
  ins     v3.h[2], w7
  ldrb    w4, [x1, 7]
  ins     v2.h[2], w6
  ldrb    w1, [x2, 7]
  ldrb    w3, [x2, 3]
  ins     v1.h[3], w5
  ins     v0.h[3], w4
  ins     v2.h[3], w1
  ins     v3.h[3], w3
  adrp    x1, .LC1
  ldr     q4, [x1, #:lo12:.LC1]
  sub     v1.4h, v1.4h, v3.4h
  sub     v0.4h, v0.4h, v2.4h
  uxtl    v1.4s, v1.4h
  uxtl    v0.4s, v0.4h
  sshl    v1.4s, v1.4s, v5.4s
  sshl    v0.4s, v0.4s, v4.4s
  stp     q1, q0, [x0]
  ret

Interestingly, this happens only with left shift and not with right shift.

GCC 10.3 vs trunk comparison: https://godbolt.org/z/xWbfGdfen

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug target/106346] Potential regression on vectorization of left shift with constants
  2022-07-18 18:28 [Bug tree-optimization/106346] New: Potential regression on vectorization of left shift with constants manolis.tsamis at vrull dot eu
@ 2022-07-19  6:52 ` rguenth at gcc dot gnu.org
  2022-07-19 13:09 ` manolis.tsamis at vrull dot eu
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-07-19  6:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |53947
           Keywords|                            |needs-bisection
          Component|tree-optimization           |target

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
We no longer seem to handle the widening subtract but instead build up vectors
from scalar widened src1/src2.

"fails" with 10.4 on x86_64, works with 11.3, 12.1 and trunk there (but needs
AVX2).


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug target/106346] Potential regression on vectorization of left shift with constants
  2022-07-18 18:28 [Bug tree-optimization/106346] New: Potential regression on vectorization of left shift with constants manolis.tsamis at vrull dot eu
  2022-07-19  6:52 ` [Bug target/106346] " rguenth at gcc dot gnu.org
@ 2022-07-19 13:09 ` manolis.tsamis at vrull dot eu
  2022-07-22 10:11 ` [Bug target/106346] Potential regression on vectorization of left shift with constants since r11-5160-g9fc9573f9a5e94 marxin at gcc dot gnu.org
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: manolis.tsamis at vrull dot eu @ 2022-07-19 13:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346

--- Comment #2 from manolis.tsamis at vrull dot eu ---
I bisected the changes from GCC 10.3 onwards and the first commit that results
in the "worse" version of the generated code is
9fc9573f9a5e9432e53c7de93985cfbd267f0309:

[2/3] [vect] Add widening add, subtract patterns

    Add widening add, subtract patterns to tree-vect-patterns. Update the
    widened code of patterns that detect PLUS_EXPR to also detect
    WIDEN_PLUS_EXPR. These patterns take 2 vectors with N elements of size
    S and perform an add/subtract on the elements, storing the results as N
    elements of size 2*S (in 2 result vectors). This is implemented in the
    aarch64 backend as addl,addl2 and subl,subl2 respectively. Add aarch64
    tests for patterns.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug target/106346] Potential regression on vectorization of left shift with constants since r11-5160-g9fc9573f9a5e94
  2022-07-18 18:28 [Bug tree-optimization/106346] New: Potential regression on vectorization of left shift with constants manolis.tsamis at vrull dot eu
  2022-07-19  6:52 ` [Bug target/106346] " rguenth at gcc dot gnu.org
  2022-07-19 13:09 ` manolis.tsamis at vrull dot eu
@ 2022-07-22 10:11 ` marxin at gcc dot gnu.org
  2022-07-22 11:00 ` tnfchris at gcc dot gnu.org
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: marxin at gcc dot gnu.org @ 2022-07-22 10:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346

Martin Liška <marxin at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|needs-bisection             |
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
                 CC|                            |joelh at gcc dot gnu.org,
                   |                            |marxin at gcc dot gnu.org
            Summary|Potential regression on     |Potential regression on
                   |vectorization of left shift |vectorization of left shift
                   |with constants              |with constants since
                   |                            |r11-5160-g9fc9573f9a5e94
   Last reconfirmed|                            |2022-07-22

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug target/106346] Potential regression on vectorization of left shift with constants since r11-5160-g9fc9573f9a5e94
  2022-07-18 18:28 [Bug tree-optimization/106346] New: Potential regression on vectorization of left shift with constants manolis.tsamis at vrull dot eu
                   ` (2 preceding siblings ...)
  2022-07-22 10:11 ` [Bug target/106346] Potential regression on vectorization of left shift with constants since r11-5160-g9fc9573f9a5e94 marxin at gcc dot gnu.org
@ 2022-07-22 11:00 ` tnfchris at gcc dot gnu.org
  2022-07-27  7:27 ` [Bug target/106346] [11/12/13 Regression] " tnfchris at gcc dot gnu.org
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2022-07-22 11:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tnfchris at gcc dot gnu.org

--- Comment #3 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Joel is no longer working on GCC, I'll fix the regression next week when back
from holidays.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug target/106346] [11/12/13 Regression] Potential regression on vectorization of left shift with constants since r11-5160-g9fc9573f9a5e94
  2022-07-18 18:28 [Bug tree-optimization/106346] New: Potential regression on vectorization of left shift with constants manolis.tsamis at vrull dot eu
                   ` (3 preceding siblings ...)
  2022-07-22 11:00 ` tnfchris at gcc dot gnu.org
@ 2022-07-27  7:27 ` tnfchris at gcc dot gnu.org
  2022-07-27  7:48 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2022-07-27  7:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Potential regression on     |[11/12/13 Regression]
                   |vectorization of left shift |Potential regression on
                   |with constants since        |vectorization of left shift
                   |r11-5160-g9fc9573f9a5e94    |with constants since
                   |                            |r11-5160-g9fc9573f9a5e94
           Priority|P3                          |P2
                 CC|                            |rguenth at gcc dot gnu.org
           Assignee|unassigned at gcc dot gnu.org      |tnfchris at gcc dot gnu.org
   Target Milestone|---                         |11.5
             Status|NEW                         |ASSIGNED

--- Comment #4 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
I believe the problem is actually g:27842e2a1eb26a7eae80b8efd98fb8c8bd74a68e

We added an optab for the widening left shift pattern there however the
operation requires a uniform shift constant to work. See
https://godbolt.org/z/4hqKc69Ke

The existing pattern that deals with this is vect_recog_widen_shift_pattern
which is a scalar pattern.  during build_slp it validates that constants are
the same and when they're not it aborts SLP.  This is why we lose
vectorization.  Eventually we hit V4HI for which we have no widening shift
optab for and it vectorizes using that low VF.

This example shows a number of things wrong:

1. The generic costing seems off, this sequence shouldn't have been generated,
as a vector sequence it's more inefficient than the scalar sequence. Using
-mcpu=neover-n1 or any other costing structure correctly only gives scalar.

2. vect_recog_widen_shift_pattern is implemented in the wrong place.  It
predates the existence of the SLP pattern matcher. Because of the uniform
requirements it's better to use the SLP pattern matcher where we have access to
all the constants to decide whether the pattern is a match or not.  That way we
don't abort SLP. Are you ok with this as a fix Richi?

3. The epilogue costing seems off..

This example https://godbolt.org/z/YoPcWv6Td ends up generating an
exceptionally high epilogue cost and so thinks vectorization at the higher VF
is not profitable.

*src1_18(D) 1 times vec_to_scalar costs 2 in epilogue
MEM[(uint16_t *)src1_18(D) + 2B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(uint16_t *)src1_18(D) + 4B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(uint16_t *)src1_18(D) + 6B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(uint16_t *)src1_18(D) + 8B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(uint16_t *)src1_18(D) + 10B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(uint16_t *)src1_18(D) + 12B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(uint16_t *)src1_18(D) + 14B] 1 times vec_to_scalar costs 2 in epilogue
/app/example.c:16:12: note: Cost model analysis for part in loop 0:
  Vector cost: 23
  Scalar cost: 17

For some reason it thinks it needs a scalar epilogue? using
-fno-vect-cost-model gives the desired codegen.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug target/106346] [11/12/13 Regression] Potential regression on vectorization of left shift with constants since r11-5160-g9fc9573f9a5e94
  2022-07-18 18:28 [Bug tree-optimization/106346] New: Potential regression on vectorization of left shift with constants manolis.tsamis at vrull dot eu
                   ` (4 preceding siblings ...)
  2022-07-27  7:27 ` [Bug target/106346] [11/12/13 Regression] " tnfchris at gcc dot gnu.org
@ 2022-07-27  7:48 ` rguenth at gcc dot gnu.org
  2022-07-27  8:26 ` tnfchris at gcc dot gnu.org
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-07-27  7:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #4)
> I believe the problem is actually g:27842e2a1eb26a7eae80b8efd98fb8c8bd74a68e
> 
> We added an optab for the widening left shift pattern there however the
> operation requires a uniform shift constant to work. See
> https://godbolt.org/z/4hqKc69Ke
> 
> The existing pattern that deals with this is vect_recog_widen_shift_pattern
> which is a scalar pattern.  during build_slp it validates that constants are
> the same and when they're not it aborts SLP.  This is why we lose
> vectorization.  Eventually we hit V4HI for which we have no widening shift
> optab for and it vectorizes using that low VF.
> 
> This example shows a number of things wrong:
> 
> 1. The generic costing seems off, this sequence shouldn't have been
> generated, as a vector sequence it's more inefficient than the scalar
> sequence. Using -mcpu=neover-n1 or any other costing structure correctly
> only gives scalar.
> 
> 2. vect_recog_widen_shift_pattern is implemented in the wrong place.  It
> predates the existence of the SLP pattern matcher. Because of the uniform
> requirements it's better to use the SLP pattern matcher where we have access
> to all the constants to decide whether the pattern is a match or not.  That
> way we don't abort SLP. Are you ok with this as a fix Richi?

patterns are difficult beasts - I think vect_recog_widen_shift_pattern is
at the correct place but instead what is lacking is SLP discovery support
for scrapping it - that is, ideally the vectorizer would take patterns as
a hint and ignore them when they are not helpful.

Now - in theory, for SLP vectorization, all patterns could be handled
as SLP patterns and scalar patterns disabled.  But that isn't easy to
do either.

I fear to fight this regression the easiest route is to pretend the
ISA can do widen shift by vector and fixup in the expander ...

> 3. The epilogue costing seems off..
> 
> This example https://godbolt.org/z/YoPcWv6Td ends up generating an
> exceptionally high epilogue cost and so thinks vectorization at the higher
> VF is not profitable.
> 
> *src1_18(D) 1 times vec_to_scalar costs 2 in epilogue
> MEM[(uint16_t *)src1_18(D) + 2B] 1 times vec_to_scalar costs 2 in epilogue
> MEM[(uint16_t *)src1_18(D) + 4B] 1 times vec_to_scalar costs 2 in epilogue
> MEM[(uint16_t *)src1_18(D) + 6B] 1 times vec_to_scalar costs 2 in epilogue
> MEM[(uint16_t *)src1_18(D) + 8B] 1 times vec_to_scalar costs 2 in epilogue
> MEM[(uint16_t *)src1_18(D) + 10B] 1 times vec_to_scalar costs 2 in epilogue
> MEM[(uint16_t *)src1_18(D) + 12B] 1 times vec_to_scalar costs 2 in epilogue
> MEM[(uint16_t *)src1_18(D) + 14B] 1 times vec_to_scalar costs 2 in epilogue
> /app/example.c:16:12: note: Cost model analysis for part in loop 0:
>   Vector cost: 23
>   Scalar cost: 17

I don't see any epilogue cost - the example doesn't have a loop.  With BB
vect you could see no epilogue costs?

> For some reason it thinks it needs a scalar epilogue? using
> -fno-vect-cost-model gives the desired codegen.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug target/106346] [11/12/13 Regression] Potential regression on vectorization of left shift with constants since r11-5160-g9fc9573f9a5e94
  2022-07-18 18:28 [Bug tree-optimization/106346] New: Potential regression on vectorization of left shift with constants manolis.tsamis at vrull dot eu
                   ` (5 preceding siblings ...)
  2022-07-27  7:48 ` rguenth at gcc dot gnu.org
@ 2022-07-27  8:26 ` tnfchris at gcc dot gnu.org
  2022-07-27  8:50 ` rguenther at suse dot de
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2022-07-27  8:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346

--- Comment #6 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #5)
> (In reply to Tamar Christina from comment #4)
> > I believe the problem is actually g:27842e2a1eb26a7eae80b8efd98fb8c8bd74a68e
> > 
> > We added an optab for the widening left shift pattern there however the
> > operation requires a uniform shift constant to work. See
> > https://godbolt.org/z/4hqKc69Ke
> > 
> > The existing pattern that deals with this is vect_recog_widen_shift_pattern
> > which is a scalar pattern.  during build_slp it validates that constants are
> > the same and when they're not it aborts SLP.  This is why we lose
> > vectorization.  Eventually we hit V4HI for which we have no widening shift
> > optab for and it vectorizes using that low VF.
> > 
> > This example shows a number of things wrong:
> > 
> > 1. The generic costing seems off, this sequence shouldn't have been
> > generated, as a vector sequence it's more inefficient than the scalar
> > sequence. Using -mcpu=neover-n1 or any other costing structure correctly
> > only gives scalar.
> > 
> > 2. vect_recog_widen_shift_pattern is implemented in the wrong place.  It
> > predates the existence of the SLP pattern matcher. Because of the uniform
> > requirements it's better to use the SLP pattern matcher where we have access
> > to all the constants to decide whether the pattern is a match or not.  That
> > way we don't abort SLP. Are you ok with this as a fix Richi?
> 
> patterns are difficult beasts - I think vect_recog_widen_shift_pattern is
> at the correct place but instead what is lacking is SLP discovery support
> for scrapping it - that is, ideally the vectorizer would take patterns as
> a hint and ignore them when they are not helpful.

Hmm, yes but the problem is that we've consumed additional related statements
which now need to be handled by build_slp as well. I suppose you could do an
in-place build_slp on the pattern stmt seq iterator.  Though that seems like
undoing a mistake.

> 
> Now - in theory, for SLP vectorization, all patterns could be handled
> as SLP patterns and scalar patterns disabled.  But that isn't easy to
> do either.

As long as we still have the non-SLP loop vect it's probably not a good idea
no?
since we then lose all patterns for it.  The widening shift was already
sufficiently
limited that it wouldn't really regress here.

> 
> I fear to fight this regression the easiest route is to pretend the
> ISA can do widen shift by vector and fixup in the expander ...

I can do this, but we're hiding the cost then. Or did you want me to fudge the
numbers in the costing hooks?

> 
> > 3. The epilogue costing seems off..
> > 
> > This example https://godbolt.org/z/YoPcWv6Td ends up generating an
> > exceptionally high epilogue cost and so thinks vectorization at the higher
> > VF is not profitable.
> > 
> > *src1_18(D) 1 times vec_to_scalar costs 2 in epilogue
> > MEM[(uint16_t *)src1_18(D) + 2B] 1 times vec_to_scalar costs 2 in epilogue
> > MEM[(uint16_t *)src1_18(D) + 4B] 1 times vec_to_scalar costs 2 in epilogue
> > MEM[(uint16_t *)src1_18(D) + 6B] 1 times vec_to_scalar costs 2 in epilogue
> > MEM[(uint16_t *)src1_18(D) + 8B] 1 times vec_to_scalar costs 2 in epilogue
> > MEM[(uint16_t *)src1_18(D) + 10B] 1 times vec_to_scalar costs 2 in epilogue
> > MEM[(uint16_t *)src1_18(D) + 12B] 1 times vec_to_scalar costs 2 in epilogue
> > MEM[(uint16_t *)src1_18(D) + 14B] 1 times vec_to_scalar costs 2 in epilogue
> > /app/example.c:16:12: note: Cost model analysis for part in loop 0:
> >   Vector cost: 23
> >   Scalar cost: 17
> 
> I don't see any epilogue cost - the example doesn't have a loop.  With BB
> vect you could see no epilogue costs?

That was my expectation, but see e.g. https://godbolt.org/z/MGEMYEe86 the SLP
shows the
above output.  I don't understand where the vec_to_scalar costs come from.

> 
> > For some reason it thinks it needs a scalar epilogue? using
> > -fno-vect-cost-model gives the desired codegen.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug target/106346] [11/12/13 Regression] Potential regression on vectorization of left shift with constants since r11-5160-g9fc9573f9a5e94
  2022-07-18 18:28 [Bug tree-optimization/106346] New: Potential regression on vectorization of left shift with constants manolis.tsamis at vrull dot eu
                   ` (6 preceding siblings ...)
  2022-07-27  8:26 ` tnfchris at gcc dot gnu.org
@ 2022-07-27  8:50 ` rguenther at suse dot de
  2023-07-31 15:30 ` [Bug target/106346] [11/12/13/14 " tnfchris at gcc dot gnu.org
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: rguenther at suse dot de @ 2022-07-27  8:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346

--- Comment #7 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 27 Jul 2022, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346
> 
> --- Comment #6 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #5)
> > (In reply to Tamar Christina from comment #4)
> > > I believe the problem is actually g:27842e2a1eb26a7eae80b8efd98fb8c8bd74a68e
> > > 
> > > We added an optab for the widening left shift pattern there however the
> > > operation requires a uniform shift constant to work. See
> > > https://godbolt.org/z/4hqKc69Ke
> > > 
> > > The existing pattern that deals with this is vect_recog_widen_shift_pattern
> > > which is a scalar pattern.  during build_slp it validates that constants are
> > > the same and when they're not it aborts SLP.  This is why we lose
> > > vectorization.  Eventually we hit V4HI for which we have no widening shift
> > > optab for and it vectorizes using that low VF.
> > > 
> > > This example shows a number of things wrong:
> > > 
> > > 1. The generic costing seems off, this sequence shouldn't have been
> > > generated, as a vector sequence it's more inefficient than the scalar
> > > sequence. Using -mcpu=neover-n1 or any other costing structure correctly
> > > only gives scalar.
> > > 
> > > 2. vect_recog_widen_shift_pattern is implemented in the wrong place.  It
> > > predates the existence of the SLP pattern matcher. Because of the uniform
> > > requirements it's better to use the SLP pattern matcher where we have access
> > > to all the constants to decide whether the pattern is a match or not.  That
> > > way we don't abort SLP. Are you ok with this as a fix Richi?
> > 
> > patterns are difficult beasts - I think vect_recog_widen_shift_pattern is
> > at the correct place but instead what is lacking is SLP discovery support
> > for scrapping it - that is, ideally the vectorizer would take patterns as
> > a hint and ignore them when they are not helpful.
> 
> Hmm, yes but the problem is that we've consumed additional related statements
> which now need to be handled by build_slp as well. I suppose you could do an
> in-place build_slp on the pattern stmt seq iterator.  Though that seems like
> undoing a mistake.

Yes.  But it would be actually undoing a mistake, so ... ;)  The issue
is of course that we have many ways to vectorize things and patterns
just add to that - and SLP discovery needs to somehow pick one and
we stick to it (and cost based selection is only done on the vector mode
choice).

I'm not sure what's the best thing to do in the very end - for patterns
one could think of adding alternate (sub-)SLP-graph overlays instead
of forking the whole SLP graph for every combination of 
pattern/non-pattern which would surely explode.  At analysis time
one could choose the locally cheapest variant assuming the local
choice does not influence analysis of the rest of the graph (which
some widening patterns do after all, at least in the loop case by
influencing the VF).

> > 
> > Now - in theory, for SLP vectorization, all patterns could be handled
> > as SLP patterns and scalar patterns disabled.  But that isn't easy to
> > do either.
> 
> As long as we still have the non-SLP loop vect it's probably not a good idea
> no?
> since we then lose all patterns for it.  The widening shift was already
> sufficiently
> limited that it wouldn't really regress here.

Well, we'd need both obviously.  But yes, if we get to the point (sorry
for it taking forever) that we have SLP data structures for everything
we could shelve the stmt based patterns.

> > 
> > I fear to fight this regression the easiest route is to pretend the
> > ISA can do widen shift by vector and fixup in the expander ...
> 
> I can do this, but we're hiding the cost then. Or did you want me to fudge the
> numbers in the costing hooks?

Yes.

> > 
> > > 3. The epilogue costing seems off..
> > > 
> > > This example https://godbolt.org/z/YoPcWv6Td ends up generating an
> > > exceptionally high epilogue cost and so thinks vectorization at the higher
> > > VF is not profitable.
> > > 
> > > *src1_18(D) 1 times vec_to_scalar costs 2 in epilogue
> > > MEM[(uint16_t *)src1_18(D) + 2B] 1 times vec_to_scalar costs 2 in epilogue
> > > MEM[(uint16_t *)src1_18(D) + 4B] 1 times vec_to_scalar costs 2 in epilogue
> > > MEM[(uint16_t *)src1_18(D) + 6B] 1 times vec_to_scalar costs 2 in epilogue
> > > MEM[(uint16_t *)src1_18(D) + 8B] 1 times vec_to_scalar costs 2 in epilogue
> > > MEM[(uint16_t *)src1_18(D) + 10B] 1 times vec_to_scalar costs 2 in epilogue
> > > MEM[(uint16_t *)src1_18(D) + 12B] 1 times vec_to_scalar costs 2 in epilogue
> > > MEM[(uint16_t *)src1_18(D) + 14B] 1 times vec_to_scalar costs 2 in epilogue
> > > /app/example.c:16:12: note: Cost model analysis for part in loop 0:
> > >   Vector cost: 23
> > >   Scalar cost: 17
> > 
> > I don't see any epilogue cost - the example doesn't have a loop.  With BB
> > vect you could see no epilogue costs?
> 
> That was my expectation, but see e.g. https://godbolt.org/z/MGEMYEe86 the SLP
> shows the
> above output.  I don't understand where the vec_to_scalar costs come from.
> 
> > 
> > > For some reason it thinks it needs a scalar epilogue? using
> > > -fno-vect-cost-model gives the desired codegen.
> 
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug target/106346] [11/12/13/14 Regression] Potential regression on vectorization of left shift with constants since r11-5160-g9fc9573f9a5e94
  2022-07-18 18:28 [Bug tree-optimization/106346] New: Potential regression on vectorization of left shift with constants manolis.tsamis at vrull dot eu
                   ` (7 preceding siblings ...)
  2022-07-27  8:50 ` rguenther at suse dot de
@ 2023-07-31 15:30 ` tnfchris at gcc dot gnu.org
  2023-08-04 12:50 ` cvs-commit at gcc dot gnu.org
  2023-08-04 13:58 ` tnfchris at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2023-07-31 15:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|11.5                        |14.0

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug target/106346] [11/12/13/14 Regression] Potential regression on vectorization of left shift with constants since r11-5160-g9fc9573f9a5e94
  2022-07-18 18:28 [Bug tree-optimization/106346] New: Potential regression on vectorization of left shift with constants manolis.tsamis at vrull dot eu
                   ` (8 preceding siblings ...)
  2023-07-31 15:30 ` [Bug target/106346] [11/12/13/14 " tnfchris at gcc dot gnu.org
@ 2023-08-04 12:50 ` cvs-commit at gcc dot gnu.org
  2023-08-04 13:58 ` tnfchris at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-08-04 12:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346

--- Comment #8 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:451391a6477f5b012faeca42cdba1bfb8e6eecc0

commit r14-2991-g451391a6477f5b012faeca42cdba1bfb8e6eecc0
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Fri Aug 4 13:49:23 2023 +0100

    AArch64: Undo vec_widen_<sur>shiftl optabs [PR106346]

    In GCC 11 we implemented the vectorizer optab for widening left shifts,
    however this optab is only supported for uniform shift constants.

    At the moment GCC still has two loop vectorization strategy (classical loop
and
    SLP based loop vec) and the optab is implemented as a scalar pattern.

    This means that when we apply it to a non-uniform constant inside a loop we
only
    find out during SLP build that the constants aren't uniform.  At this point
it's
    too late and we lose SLP entirely.

    Over the years I've tried various options but none of it works well:

    1. Dissolving patterns during SLP built (problematic, also dissolves them
for
    non-slp).
    2. Optionally ignoring patterns for SLP build (problematic, ends up
interfearing
    with relevancy detection).
    3. Relaxing contraint on SLP build to allow non-constant values and
dissolving
    them after SLP build using an SLP pattern.  (problematic, ends up breaking
    shift reassociation).

    As a result we've concluded that for now this pattern should just be
removed
    and formed during RTL.

    The plan is to move this to an SLP only pattern once we remove classical
loop
    vectorization support from GCC, at which time we can also properly support
SVE's
    Top and Bottom variants.

    This removes the optab and reworks the RTL to recognize both the vector
variant
    and the intrinsics variant.  Also just simplifies all these patterns.

    gcc/ChangeLog:

            PR target/106346
            * config/aarch64/aarch64-simd.md (vec_widen_<sur>shiftl_lo_<mode>,
            vec_widen_<sur>shiftl_hi_<mode>): Remove.
            (aarch64_<sur>shll<mode>_internal): Renamed to...
            (aarch64_<su>shll<mode>): .. This.
            (aarch64_<sur>shll2<mode>_internal): Renamed to...
            (aarch64_<su>shll2<mode>): .. This.
            (aarch64_<sur>shll_n<mode>, aarch64_<sur>shll2_n<mode>): Re-use new
            optabs.
            * config/aarch64/constraints.md (D2, DL): New.
            * config/aarch64/predicates.md (aarch64_simd_shll_imm_vec): New.

    gcc/testsuite/ChangeLog:

            PR target/106346
            * gcc.target/aarch64/pr98772.c: Adjust assembly.
            * gcc.target/aarch64/vect-widen-shift.c: New test.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug target/106346] [11/12/13/14 Regression] Potential regression on vectorization of left shift with constants since r11-5160-g9fc9573f9a5e94
  2022-07-18 18:28 [Bug tree-optimization/106346] New: Potential regression on vectorization of left shift with constants manolis.tsamis at vrull dot eu
                   ` (9 preceding siblings ...)
  2023-08-04 12:50 ` cvs-commit at gcc dot gnu.org
@ 2023-08-04 13:58 ` tnfchris at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2023-08-04 13:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED

--- Comment #9 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Fixed in GCC 14.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2023-08-04 13:58 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-18 18:28 [Bug tree-optimization/106346] New: Potential regression on vectorization of left shift with constants manolis.tsamis at vrull dot eu
2022-07-19  6:52 ` [Bug target/106346] " rguenth at gcc dot gnu.org
2022-07-19 13:09 ` manolis.tsamis at vrull dot eu
2022-07-22 10:11 ` [Bug target/106346] Potential regression on vectorization of left shift with constants since r11-5160-g9fc9573f9a5e94 marxin at gcc dot gnu.org
2022-07-22 11:00 ` tnfchris at gcc dot gnu.org
2022-07-27  7:27 ` [Bug target/106346] [11/12/13 Regression] " tnfchris at gcc dot gnu.org
2022-07-27  7:48 ` rguenth at gcc dot gnu.org
2022-07-27  8:26 ` tnfchris at gcc dot gnu.org
2022-07-27  8:50 ` rguenther at suse dot de
2023-07-31 15:30 ` [Bug target/106346] [11/12/13/14 " tnfchris at gcc dot gnu.org
2023-08-04 12:50 ` cvs-commit at gcc dot gnu.org
2023-08-04 13:58 ` tnfchris at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).