[Bug tree-optimization/98855] New: [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug tree-optimization/98855] New: [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e
@ 2021-01-27 13:44 marxin at gcc dot gnu.org
  2021-01-27 13:44 ` [Bug tree-optimization/98855] " marxin at gcc dot gnu.org
                   ` (10 more replies)
  0 siblings, 11 replies; 12+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-01-27 13:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98855

            Bug ID: 98855
           Summary: [11 Regression] botan XTEA is 100% slower on znver2
                    since r11-4428-g4a369d199bf2f34e
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: marxin at gcc dot gnu.org
  Target Milestone: ---

Since the revision the following is now slower:

$ make clean && ./configure.py --cxxflags="-Ofast -march=znver2" && make -j16
&& ./botan speed XTEA

as seen here:
https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=245.710.1&plot.1=171.710.1&

Algorithm is implemented here:
src/lib/block/xtea/xtea.cpp

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug tree-optimization/98855] [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e
  2021-01-27 13:44 [Bug tree-optimization/98855] New: [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e marxin at gcc dot gnu.org
@ 2021-01-27 13:44 ` marxin at gcc dot gnu.org
  2021-01-27 14:29 ` marxin at gcc dot gnu.org
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-01-27 13:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98855

Martin Liška <marxin at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |11.0
                 CC|                            |rguenth at gcc dot gnu.org
           See Also|                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=98854

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug tree-optimization/98855] [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e
  2021-01-27 13:44 [Bug tree-optimization/98855] New: [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e marxin at gcc dot gnu.org
  2021-01-27 13:44 ` [Bug tree-optimization/98855] " marxin at gcc dot gnu.org
@ 2021-01-27 14:29 ` marxin at gcc dot gnu.org
  2021-01-27 14:42 ` rguenth at gcc dot gnu.org
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-01-27 14:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98855

--- Comment #1 from Martin Liška <marxin at gcc dot gnu.org> ---
And likely something similar happens since the same revision:

botan/KASUMI decrypt
https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=245.694.1&plot.1=171.694.1

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug tree-optimization/98855] [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e
  2021-01-27 13:44 [Bug tree-optimization/98855] New: [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e marxin at gcc dot gnu.org
  2021-01-27 13:44 ` [Bug tree-optimization/98855] " marxin at gcc dot gnu.org
  2021-01-27 14:29 ` marxin at gcc dot gnu.org
@ 2021-01-27 14:42 ` rguenth at gcc dot gnu.org
  2021-01-29 12:12 ` marxin at gcc dot gnu.org
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-01-27 14:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98855

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, let's see whether the fix for 98854 makes a difference before investigating
closer.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug tree-optimization/98855] [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e
  2021-01-27 13:44 [Bug tree-optimization/98855] New: [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e marxin at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2021-01-27 14:42 ` rguenth at gcc dot gnu.org
@ 2021-01-29 12:12 ` marxin at gcc dot gnu.org
  2021-02-03 13:55 ` rguenth at gcc dot gnu.org
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-01-29 12:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98855

Martin Liška <marxin at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2021-01-29

--- Comment #3 from Martin Liška <marxin at gcc dot gnu.org> ---
So it's still there after the fix for PR98845.

I briefly looked at src/lib/block/xtea/xtea.cpp:

one can isolate the one problematic SLP:
-fdbg-cnt=vect_slp:4-4

build/include/botan/loadstor.h:470:15: note: Basic block will be vectorized
using SLP
build/include/botan/loadstor.h:470:15: note: Vectorizing SLP tree:
build/include/botan/loadstor.h:470:15: note: node 0x2cabac0 (max_nunits=4,
refcnt=1)
build/include/botan/loadstor.h:470:15: note: op template: MEM <unsigned int>
[(char * {ref-all})_66] = _133;
build/include/botan/loadstor.h:470:15: note:    stmt 0 MEM <unsigned int>
[(char * {ref-all})_66] = _133;
build/include/botan/loadstor.h:470:15: note:    stmt 1 MEM <unsigned int>
[(char * {ref-all})_66 + 4B] = _134;
build/include/botan/loadstor.h:470:15: note:    stmt 2 MEM <unsigned int>
[(char * {ref-all})_66 + 8B] = _135;
build/include/botan/loadstor.h:470:15: note:    stmt 3 MEM <unsigned int>
[(char * {ref-all})_66 + 12B] = _136;
build/include/botan/loadstor.h:470:15: note:    stmt 4 MEM <unsigned int>
[(char * {ref-all})_66 + 16B] = _137;
build/include/botan/loadstor.h:470:15: note:    stmt 5 MEM <unsigned int>
[(char * {ref-all})_66 + 20B] = _138;
build/include/botan/loadstor.h:470:15: note:    stmt 6 MEM <unsigned int>
[(char * {ref-all})_66 + 24B] = _139;
build/include/botan/loadstor.h:470:15: note:    stmt 7 MEM <unsigned int>
[(char * {ref-all})_66 + 28B] = _140;
build/include/botan/loadstor.h:470:15: note:    children 0x2cabb40
build/include/botan/loadstor.h:470:15: note: node 0x2cabb40 (max_nunits=4,
refcnt=1)
build/include/botan/loadstor.h:470:15: note: op template: _133 =
__builtin_bswap32 (_92);
build/include/botan/loadstor.h:470:15: note:    stmt 0 _133 = __builtin_bswap32
(_92);
build/include/botan/loadstor.h:470:15: note:    stmt 1 _134 = __builtin_bswap32
(_592);
build/include/botan/loadstor.h:470:15: note:    stmt 2 _135 = __builtin_bswap32
(_90);
build/include/botan/loadstor.h:470:15: note:    stmt 3 _136 = __builtin_bswap32
(_591);
build/include/botan/loadstor.h:470:15: note:    stmt 4 _137 = __builtin_bswap32
(_594);
build/include/botan/loadstor.h:470:15: note:    stmt 5 _138 = __builtin_bswap32
(_590);
build/include/botan/loadstor.h:470:15: note:    stmt 6 _139 = __builtin_bswap32
(_593);
build/include/botan/loadstor.h:470:15: note:    stmt 7 _140 = __builtin_bswap32
(_589);
build/include/botan/loadstor.h:470:15: note:    children 0x2cabbc0
build/include/botan/loadstor.h:470:15: note: node 0x2cabbc0 (max_nunits=4,
refcnt=1)
build/include/botan/loadstor.h:470:15: note: op template: _92 = PHI <_14(17)>
build/include/botan/loadstor.h:470:15: note:    stmt 0 _92 = PHI <_14(17)>
build/include/botan/loadstor.h:470:15: note:    stmt 1 _592 = PHI <_46(17)>
build/include/botan/loadstor.h:470:15: note:    stmt 2 _90 = PHI <_23(17)>
build/include/botan/loadstor.h:470:15: note:    stmt 3 _591 = PHI <_52(17)>
build/include/botan/loadstor.h:470:15: note:    stmt 4 _594 = PHI <_31(17)>
build/include/botan/loadstor.h:470:15: note:    stmt 5 _590 = PHI <_58(17)>
build/include/botan/loadstor.h:470:15: note:    stmt 6 _593 = PHI <_37(17)>
build/include/botan/loadstor.h:470:15: note:    stmt 7 _589 = PHI <_64(17)>
build/include/botan/loadstor.h:470:15: note:    children 0x2cabc40
build/include/botan/loadstor.h:470:15: note: node 0x2cabc40 (max_nunits=4,
refcnt=2)
build/include/botan/loadstor.h:470:15: note: op template: _14 = _13 + L0_224;
build/include/botan/loadstor.h:470:15: note:    stmt 0 _14 = _13 + L0_224;
build/include/botan/loadstor.h:470:15: note:    stmt 1 _46 = _45 + R0_225;
build/include/botan/loadstor.h:470:15: note:    stmt 2 _23 = _22 + L1_226;
build/include/botan/loadstor.h:470:15: note:    stmt 3 _52 = _51 + R1_227;
build/include/botan/loadstor.h:470:15: note:    stmt 4 _31 = _30 + L2_228;
build/include/botan/loadstor.h:470:15: note:    stmt 5 _58 = _57 + R2_229;
build/include/botan/loadstor.h:470:15: note:    stmt 6 _37 = _36 + L3_230;
build/include/botan/loadstor.h:470:15: note:    stmt 7 _64 = _63 + R3_231;
build/include/botan/loadstor.h:470:15: note:    children 0x2cabcc0 0x2cabdc0
build/include/botan/loadstor.h:470:15: note: node 0x2cabcc0 (max_nunits=4,
refcnt=1)
build/include/botan/loadstor.h:470:15: note: op template: _13 = _9 ^ _12;
build/include/botan/loadstor.h:470:15: note:    stmt 0 _13 = _9 ^ _12;
build/include/botan/loadstor.h:470:15: note:    stmt 1 _45 = _41 ^ _44;
build/include/botan/loadstor.h:470:15: note:    stmt 2 _22 = _12 ^ _18;
build/include/botan/loadstor.h:470:15: note:    stmt 3 _51 = _44 ^ _50;
build/include/botan/loadstor.h:470:15: note:    stmt 4 _30 = _12 ^ _27;
build/include/botan/loadstor.h:470:15: note:    stmt 5 _57 = _44 ^ _56;
build/include/botan/loadstor.h:470:15: note:    stmt 6 _36 = _12 ^ _35;
build/include/botan/loadstor.h:470:15: note:    stmt 7 _63 = _44 ^ _62;
build/include/botan/loadstor.h:470:15: note:    children 0x2cabd40 0x2cabfc0
build/include/botan/loadstor.h:470:15: note: node (external) 0x2cabd40
(max_nunits=1, refcnt=1)
build/include/botan/loadstor.h:470:15: note:    { _9, _41, _18, _50, _27, _56,
_35, _62 }
build/include/botan/loadstor.h:470:15: note: node (external) 0x2cabfc0
(max_nunits=4, refcnt=1)
build/include/botan/loadstor.h:470:15: note:    stmt 0 _12 = *_11;
build/include/botan/loadstor.h:470:15: note:    stmt 1 _44 = *_43;
build/include/botan/loadstor.h:470:15: note:    stmt 2 _12 = *_11;
build/include/botan/loadstor.h:470:15: note:    stmt 3 _44 = *_43;
build/include/botan/loadstor.h:470:15: note:    stmt 4 _12 = *_11;
build/include/botan/loadstor.h:470:15: note:    stmt 5 _44 = *_43;
build/include/botan/loadstor.h:470:15: note:    stmt 6 _12 = *_11;
build/include/botan/loadstor.h:470:15: note:    stmt 7 _44 = *_43;
build/include/botan/loadstor.h:470:15: note: node 0x2cabdc0 (max_nunits=4,
refcnt=1)
build/include/botan/loadstor.h:470:15: note: op template: L0_224 = PHI
<_14(28), _118(16)>
build/include/botan/loadstor.h:470:15: note:    stmt 0 L0_224 = PHI <_14(28),
_118(16)>
build/include/botan/loadstor.h:470:15: note:    stmt 1 R0_225 = PHI <_46(28),
_120(16)>
build/include/botan/loadstor.h:470:15: note:    stmt 2 L1_226 = PHI <_23(28),
_122(16)>
build/include/botan/loadstor.h:470:15: note:    stmt 3 R1_227 = PHI <_52(28),
_124(16)>
build/include/botan/loadstor.h:470:15: note:    stmt 4 L2_228 = PHI <_31(28),
_126(16)>
build/include/botan/loadstor.h:470:15: note:    stmt 5 R2_229 = PHI <_58(28),
_128(16)>
build/include/botan/loadstor.h:470:15: note:    stmt 6 L3_230 = PHI <_37(28),
_130(16)>
build/include/botan/loadstor.h:470:15: note:    stmt 7 R3_231 = PHI <_64(28),
_132(16)>
build/include/botan/loadstor.h:470:15: note:    children 0x2cabc40 0x2cac040
build/include/botan/loadstor.h:470:15: note: node 0x2cac040 (max_nunits=4,
refcnt=1)
build/include/botan/loadstor.h:470:15: note: op template: _118 =
__builtin_bswap32 (_117);
build/include/botan/loadstor.h:470:15: note:    stmt 0 _118 = __builtin_bswap32
(_117);
build/include/botan/loadstor.h:470:15: note:    stmt 1 _120 = __builtin_bswap32
(_119);
build/include/botan/loadstor.h:470:15: note:    stmt 2 _122 = __builtin_bswap32
(_121);
build/include/botan/loadstor.h:470:15: note:    stmt 3 _124 = __builtin_bswap32
(_123);
build/include/botan/loadstor.h:470:15: note:    stmt 4 _126 = __builtin_bswap32
(_125);
build/include/botan/loadstor.h:470:15: note:    stmt 5 _128 = __builtin_bswap32
(_127);
build/include/botan/loadstor.h:470:15: note:    stmt 6 _130 = __builtin_bswap32
(_129);
build/include/botan/loadstor.h:470:15: note:    stmt 7 _132 = __builtin_bswap32
(_131);
build/include/botan/loadstor.h:470:15: note:    children 0x2cac0c0
build/include/botan/loadstor.h:470:15: note: node 0x2cac0c0 (max_nunits=4,
refcnt=1)
build/include/botan/loadstor.h:470:15: note: op template: _117 = MEM <unsigned
int> [(char * {ref-all})_5];
build/include/botan/loadstor.h:470:15: note:    stmt 0 _117 = MEM <unsigned
int> [(char * {ref-all})_5];
build/include/botan/loadstor.h:470:15: note:    stmt 1 _119 = MEM <unsigned
int> [(char * {ref-all})_5 + 4B];
build/include/botan/loadstor.h:470:15: note:    stmt 2 _121 = MEM <unsigned
int> [(char * {ref-all})_5 + 8B];
build/include/botan/loadstor.h:470:15: note:    stmt 3 _123 = MEM <unsigned
int> [(char * {ref-all})_5 + 12B];
build/include/botan/loadstor.h:470:15: note:    stmt 4 _125 = MEM <unsigned
int> [(char * {ref-all})_5 + 16B];
build/include/botan/loadstor.h:470:15: note:    stmt 5 _127 = MEM <unsigned
int> [(char * {ref-all})_5 + 20B];
build/include/botan/loadstor.h:470:15: note:    stmt 6 _129 = MEM <unsigned
int> [(char * {ref-all})_5 + 24B];
build/include/botan/loadstor.h:470:15: note:    stmt 7 _131 = MEM <unsigned
int> [(char * {ref-all})_5 + 28B];
build/include/botan/loadstor.h:470:15: note: ------>vectorizing SLP node
starting from: _13 = _9 ^ _12;
build/include/botan/loadstor.h:470:15: note: transform binary/unary operation.
build/include/botan/loadstor.h:470:15: note: add new stmt: vect__13.606_578 =
_580 ^ _582;
build/include/botan/loadstor.h:470:15: note: add new stmt: vect__13.606_577 =
_579 ^ _581;
build/include/botan/loadstor.h:470:15: note: ------>vectorizing SLP node
starting from: _117 = MEM <unsigned int> [(char * {ref-all})_5];
build/include/botan/loadstor.h:470:15: note: transform load. ncopies = 1
build/include/botan/loadstor.h:470:15: note: create vector_type-pointer
variable to type: vector(4) unsigned int  vectorizing a pointer ref: MEM
<unsigned int> [(char * {ref-all})_5]
build/include/botan/loadstor.h:470:15: note: created vectp.608_575
build/include/botan/loadstor.h:470:15: note: add new stmt: vect__117.609_574 =
MEM <vector(4) unsigned int> [(char * {ref-all})vectp.608_575];
build/include/botan/loadstor.h:470:15: note: add new stmt: vectp.608_573 =
vectp.608_575 + 16;
build/include/botan/loadstor.h:470:15: note: add new stmt: vect__117.610_572 =
MEM <vector(4) unsigned int> [(char * {ref-all})vectp.608_573];
build/include/botan/loadstor.h:470:15: note: ------>vectorizing SLP node
starting from: _118 = __builtin_bswap32 (_117);
build/include/botan/loadstor.h:470:15: note: vect_is_simple_use: operand MEM
<unsigned int> [(char * {ref-all})_5], type of def: internal
build/include/botan/loadstor.h:470:15: note: add new stmt: _571 =
VIEW_CONVERT_EXPR<vector(16) char>(vect__117.609_574);
build/include/botan/loadstor.h:470:15: note: add new stmt: _570 = VEC_PERM_EXPR
<_571, _571, { 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12 }>;
build/include/botan/loadstor.h:470:15: note: add new stmt: _569 =
VIEW_CONVERT_EXPR<vector(4) unsigned int>(_570);
build/include/botan/loadstor.h:470:15: note: add new stmt: _568 =
VIEW_CONVERT_EXPR<vector(16) char>(vect__117.610_572);
build/include/botan/loadstor.h:470:15: note: add new stmt: _567 = VEC_PERM_EXPR
<_568, _568, { 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12 }>;
build/include/botan/loadstor.h:470:15: note: add new stmt: _566 =
VIEW_CONVERT_EXPR<vector(4) unsigned int>(_567);
build/include/botan/loadstor.h:470:15: note: ------>vectorizing SLP node
starting from: L0_224 = PHI <_14(28), _118(16)>
build/include/botan/loadstor.h:470:15: note: extracting lane for live stmt
R0_225 = PHI <_46(28), _120(16)>
build/include/botan/loadstor.h:470:15: note: extracting lane for live stmt
R1_227 = PHI <_52(28), _124(16)>
build/include/botan/loadstor.h:470:15: note: extracting lane for live stmt
R2_229 = PHI <_58(28), _128(16)>
build/include/botan/loadstor.h:470:15: note: extracting lane for live stmt
R3_231 = PHI <_64(28), _132(16)>
build/include/botan/loadstor.h:470:15: note: ------>vectorizing SLP node
starting from: _14 = _13 + L0_224;
build/include/botan/loadstor.h:470:15: note: vect_is_simple_use: operand _9 ^
_12, type of def: internal
build/include/botan/loadstor.h:470:15: note: vect_is_simple_use: operand L0_224
= PHI <_14(28), _118(16)>, type of def: internal
build/include/botan/loadstor.h:470:15: note: transform binary/unary operation.
build/include/botan/loadstor.h:470:15: note: add new stmt: vect__14.612_559 =
vect__13.606_578 + vect_L0_224.611_565;
build/include/botan/loadstor.h:470:15: note: add new stmt: vect__14.612_558 =
vect__13.606_577 + vect_L0_224.611_564;
build/include/botan/loadstor.h:470:15: note: ------>vectorizing SLP node
starting from: _92 = PHI <_14(17)>
build/include/botan/loadstor.h:470:15: note: ------>vectorizing SLP node
starting from: _133 = __builtin_bswap32 (_92);
build/include/botan/loadstor.h:470:15: note: vect_is_simple_use: operand _92 =
PHI <_14(17)>, type of def: internal
build/include/botan/loadstor.h:470:15: note: add new stmt: _555 =
VIEW_CONVERT_EXPR<vector(16) char>(vect__92.613_557);
build/include/botan/loadstor.h:470:15: note: add new stmt: _554 = VEC_PERM_EXPR
<_555, _555, { 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12 }>;
build/include/botan/loadstor.h:470:15: note: add new stmt: _553 =
VIEW_CONVERT_EXPR<vector(4) unsigned int>(_554);
build/include/botan/loadstor.h:470:15: note: add new stmt: _552 =
VIEW_CONVERT_EXPR<vector(16) char>(vect__92.613_556);
build/include/botan/loadstor.h:470:15: note: add new stmt: _551 = VEC_PERM_EXPR
<_552, _552, { 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12 }>;
build/include/botan/loadstor.h:470:15: note: add new stmt: _550 =
VIEW_CONVERT_EXPR<vector(4) unsigned int>(_551);
build/include/botan/loadstor.h:470:15: note: ------>vectorizing SLP node
starting from: MEM <unsigned int> [(char * {ref-all})_66] = _133;
build/include/botan/loadstor.h:470:15: note: vect_is_simple_use: operand
__builtin_bswap32 (_92), type of def: internal
build/include/botan/loadstor.h:470:15: note: vect_is_simple_use: operand
__builtin_bswap32 (_592), type of def: internal
build/include/botan/loadstor.h:470:15: note: vect_is_simple_use: operand
__builtin_bswap32 (_90), type of def: internal
build/include/botan/loadstor.h:470:15: note: vect_is_simple_use: operand
__builtin_bswap32 (_591), type of def: internal
build/include/botan/loadstor.h:470:15: note: vect_is_simple_use: operand
__builtin_bswap32 (_594), type of def: internal
build/include/botan/loadstor.h:470:15: note: vect_is_simple_use: operand
__builtin_bswap32 (_590), type of def: internal
build/include/botan/loadstor.h:470:15: note: vect_is_simple_use: operand
__builtin_bswap32 (_593), type of def: internal
build/include/botan/loadstor.h:470:15: note: vect_is_simple_use: operand
__builtin_bswap32 (_589), type of def: internal
build/include/botan/loadstor.h:470:15: note: transform store. ncopies = 1
build/include/botan/loadstor.h:470:15: note: create vector_type-pointer
variable to type: vector(4) unsigned int  vectorizing a pointer ref: MEM
<unsigned int> [(char * {ref-all})_66]
build/include/botan/loadstor.h:470:15: note: created vectp.615_549
build/include/botan/loadstor.h:470:15: note: add new stmt: MEM <vector(4)
unsigned int> [(char * {ref-all})vectp.615_549] = _553;
build/include/botan/loadstor.h:470:15: note: add new stmt: vectp.615_547 =
vectp.615_549 + 16;
build/include/botan/loadstor.h:470:15: note: add new stmt: MEM <vector(4)
unsigned int> [(char * {ref-all})vectp.615_547] = _550;
build/include/botan/loadstor.h:470:15: note: vectorizing stmts using SLP.
build/include/botan/loadstor.h:470:15: optimized: basic block part vectorized
using 16 byte vectors

I tried to isolate a self-contained test-case, but I was not lucky enough.

The original loop:

void XTEA::encrypt_n(const uint8_t in[], uint8_t out[], size_t blocks) const
   {
   verify_key_set(m_EK.empty() == false);

   const uint32_t* EK = &m_EK[0];

   const size_t blocks4 = blocks / 4;
   const size_t blocks_left = blocks % 4;

   BOTAN_PARALLEL_FOR(size_t i = 0; i < blocks4; i++)
      {
      uint32_t L0, R0, L1, R1, L2, R2, L3, R3;
      load_be(in + 4*BLOCK_SIZE*i, L0, R0, L1, R1, L2, R2, L3, R3);

      for(size_t r = 0; r != 32; ++r)
         {
         L0 += (((R0 << 4) ^ (R0 >> 5)) + R0) ^ EK[2*r];
         L1 += (((R1 << 4) ^ (R1 >> 5)) + R1) ^ EK[2*r];
         L2 += (((R2 << 4) ^ (R2 >> 5)) + R2) ^ EK[2*r];
         L3 += (((R3 << 4) ^ (R3 >> 5)) + R3) ^ EK[2*r];

         R0 += (((L0 << 4) ^ (L0 >> 5)) + L0) ^ EK[2*r+1];
         R1 += (((L1 << 4) ^ (L1 >> 5)) + L1) ^ EK[2*r+1];
         R2 += (((L2 << 4) ^ (L2 >> 5)) + L2) ^ EK[2*r+1];
         R3 += (((L3 << 4) ^ (L3 >> 5)) + L3) ^ EK[2*r+1];
         }

      store_be(out + 4*BLOCK_SIZE*i, L0, R0, L1, R1, L2, R2, L3, R3);
      }

   BOTAN_PARALLEL_FOR(size_t i = 0; i < blocks_left; ++i)
      {
      uint32_t L, R;
      load_be(in + BLOCK_SIZE*(4*blocks4+i), L, R);

      for(size_t r = 0; r != 32; ++r)
         {
         L += (((R << 4) ^ (R >> 5)) + R) ^ EK[2*r];
         R += (((L << 4) ^ (L >> 5)) + L) ^ EK[2*r+1];
         }

      store_be(out + BLOCK_SIZE*(4*blocks4+i), L, R);
      }
   }

====

   BOTAN_PARALLEL_FOR(size_t i = 0; i < blocks_left; ++i)
should not be executed, the benchmark runs it with block == 128

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug tree-optimization/98855] [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e
  2021-01-27 13:44 [Bug tree-optimization/98855] New: [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e marxin at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2021-01-29 12:12 ` marxin at gcc dot gnu.org
@ 2021-02-03 13:55 ` rguenth at gcc dot gnu.org
  2021-02-03 14:21 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-02-03 13:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98855

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
So we vectorized the bswap part after the loop before but now we connect it
to the reduction performed in the loop and catching the bswap at the top.
But the inner loop remains unvectorized apart from the XOR with the reduction
value so we have to decompose/compose the reduction value.

What's special here is that the "unprofitable" part of the vectorization
is in a deeper loop than the "profitable" part but the profitable part
outweights the unprofitable one.  Thus we miss weighting the costs like
we for example do (very crudely) in outer loop vectorization.

For the scalar inner loop we have 48

0x25e59f0 _39 ^ _42 1 times scalar_stmt costs 4 in body
0x25e59f0 _42 ^ _48 1 times scalar_stmt costs 4 in body
0x25e59f0 _42 ^ _54 1 times scalar_stmt costs 4 in body
0x25e59f0 _42 ^ _60 1 times scalar_stmt costs 4 in body
0x25e59f0 _43 + R0_204 1 times scalar_stmt costs 4 in body
0x25e59f0 _49 + R1_206 1 times scalar_stmt costs 4 in body
0x25e59f0 _55 + R2_208 1 times scalar_stmt costs 4 in body
0x25e59f0 _61 + R3_210 1 times scalar_stmt costs 4 in body
0x25e59f0 R0_204 = PHI <_44(18), _117(6)> 1 times scalar_stmt costs 4 in body
0x25e59f0 R1_206 = PHI <_50(18), _121(6)> 1 times scalar_stmt costs 4 in body
0x25e59f0 R2_208 = PHI <_56(18), _125(6)> 1 times scalar_stmt costs 4 in body
0x25e59f0 R3_210 = PHI <_62(18), _129(6)> 1 times scalar_stmt costs 4 in body

and the vector part 116

0x2728470 _7 ^ _10 1 times vector_stmt costs 4 in body
0x2728470 _11 + L0_203 1 times vector_stmt costs 4 in body
0x2728470 <unknown> 1 times vec_construct costs 44 in prologue
0x2728470 <unknown> 1 times vec_construct costs 44 in prologue
0x2728470 L0_203 = PHI <_12(18), _115(6)> 1 times vector_stmt costs 4 in body
0x2728470 _11 + L0_203 1 times vec_to_scalar costs 4 in epilogue
0x2728470 _20 + L1_205 1 times vec_to_scalar costs 4 in epilogue
0x2728470 _28 + L2_207 1 times vec_to_scalar costs 4 in epilogue
0x2728470 _34 + L3_209 1 times vec_to_scalar costs 4 in epilogue

for outer loop vect we scale the inner loop cost by 50.

We could also have done better in vectorizing.  We've detected

build/include/botan/mem_ops.h:148:15: note:   node 0x297eff8 (max_nunits=8,
refcnt=2)
build/include/botan/mem_ops.h:148:15: note:   op template: _10 = *_9;
build/include/botan/mem_ops.h:148:15: note:     stmt 0 _10 = *_9;
build/include/botan/mem_ops.h:148:15: note:     stmt 1 _42 = *_41;
build/include/botan/mem_ops.h:148:15: note:     stmt 2 _10 = *_9;
build/include/botan/mem_ops.h:148:15: note:     stmt 3 _42 = *_41;
build/include/botan/mem_ops.h:148:15: note:     stmt 4 _10 = *_9;
build/include/botan/mem_ops.h:148:15: note:     stmt 5 _42 = *_41;
build/include/botan/mem_ops.h:148:15: note:     stmt 6 _10 = *_9;
build/include/botan/mem_ops.h:148:15: note:     stmt 7 _42 = *_41;
build/include/botan/mem_ops.h:148:15: note:     load permutation { 0 1 0 1 0 1
0 1 }

but decided

build/include/botan/mem_ops.h:148:15: note:   ==> examining statement: _10 =
*_9;
build/include/botan/mem_ops.h:148:15: missed:   BB vectorization with gaps at
the end of a load is not supported
src/lib/block/xtea/xtea.cpp:32:55: missed:   not vectorized: relevant stmt not
supported: _10 = *_9;
build/include/botan/mem_ops.h:148:15: note:   Building vector operands of
0x297eff8 from scalars instead

I think we have a duplicate for this somewhere.  The desired vectorization
would be a HImode load and splat.

We also miss to represent the bswap nodes as VEC_PERM ones and to optimize
them away entirely.

And we fail to elide the bswap even though we vectorize it!?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug tree-optimization/98855] [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e
  2021-01-27 13:44 [Bug tree-optimization/98855] New: [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e marxin at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2021-02-03 13:55 ` rguenth at gcc dot gnu.org
@ 2021-02-03 14:21 ` rguenth at gcc dot gnu.org
  2021-02-04 12:02 ` cvs-commit at gcc dot gnu.org
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-02-03 14:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98855

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 50118
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50118&action=edit
testcase

Testcase reproducing the bad SLP

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug tree-optimization/98855] [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e
  2021-01-27 13:44 [Bug tree-optimization/98855] New: [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e marxin at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2021-02-03 14:21 ` rguenth at gcc dot gnu.org
@ 2021-02-04 12:02 ` cvs-commit at gcc dot gnu.org
  2021-02-04 12:51 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-02-04 12:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98855

--- Comment #6 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:7002a33d1ba81e4577d965fb9daaee146b31faa8

commit r11-7099-g7002a33d1ba81e4577d965fb9daaee146b31faa8
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Feb 4 12:08:47 2021 +0100

    tree-optimization/98855 - fix some vectorizer cost issues

    This fixes us not costing vectorized bswap for SLP as well as
    avoiding biasing to the vectorized side when costing single-argument
    PHIs.  Instead we assume coalescing here and cost them with zero cost
    for both the scalar and vectorized code.

    This doesn't fix the PR on its own.

    2021-02-04  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/98855
            * tree-vect-loop.c (vectorizable_phi): Do not cost
            single-argument PHIs.
            * tree-vect-slp.c (vect_bb_slp_scalar_cost): Likewise.
            * tree-vect-stmts.c (vectorizable_bswap): Also perform
            costing for SLP operation.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug tree-optimization/98855] [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e
  2021-01-27 13:44 [Bug tree-optimization/98855] New: [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e marxin at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2021-02-04 12:02 ` cvs-commit at gcc dot gnu.org
@ 2021-02-04 12:51 ` rguenth at gcc dot gnu.org
  2021-02-04 14:18 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-02-04 12:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98855

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
So with this change and prototyping a scale-cost-by-loop depth in a way to
very conservatively (or optimistically, depending on the view) estimate of
all loops iterating exactly twice (so cost scaling by 1 << loop_depth works)
this isn't enough to make the vectorization appear unprofitable.

Technically scaling by loop depth is going to be needed but it has some
impact throughout the code, esp. the cost vector entries do not have a
scale but only 'count' which I abused for the prototype but that should
also be an indication to the backend on how many stmts we emit so we
shouldn't overload it by scaling.

Another idea would be to constrain inner loop "parts" to be profitable
on their own so that the vectorization is profitable independent on the
number of iterations.  For a start that means at least annotating the
cost entries with the loop number for example.

ARM get's away cost-wise for this testcase because it seems to have a uniform
scalar cost of one and with a lower VF (we can only use neon) this is enough to
have the vectorization rejected in this case.

On x86 we have the target specific "issue" that load/store costs dominate
everything since stmt cost (scalar and vector) is generally 4 while
load/store cost (scalar and vector, aligned and unaligned) is 12.

So design-wise I'm leaning towards requiring that BB vectorization should
be profitable independent of the number of iterations of a loop which
would mean that we need to check profitability for each loop level we cover
from inner to outer loop (and including inner loops) but without applying
any scaling.  So assume we annotated the loop tree with the cost of stmts
belonging directly to them do

 FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
   {
     if (!profitable_p (loop->scalar_cost, loop->vector_cost))
       reject SLP;
     loop_outer (loop)->scalar_cost += loop->scalar_cost;
     loop_outer (loop)->vector_cost += loop->vector_cost; 
   }

(in the end we should do sth more efficient)

Thoughts?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug tree-optimization/98855] [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e
  2021-01-27 13:44 [Bug tree-optimization/98855] New: [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e marxin at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2021-02-04 12:51 ` rguenth at gcc dot gnu.org
@ 2021-02-04 14:18 ` rguenth at gcc dot gnu.org
  2021-02-05 13:03 ` cvs-commit at gcc dot gnu.org
  2021-02-05 13:03 ` rguenth at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-02-04 14:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98855

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 50127
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50127&action=edit
crude hack

So this is a crude attempt at doing this entirely in the BB vect costing
routine by using the stored stmt_info to nail the vectorized stmts place
("somewhat" unreliable for invariants).  The idea to further simplify things is
that we
have to consider zero trip loops as well which means individual loop body
(excluding contained loop bodies) parts should be profitable.  That makes
sorting after the owning loop possible and doing separate costing on that
portions.

(of course we know some bodies will be at least executed once or have a
constant niter amount ... well)

The chance is of course that any preheader code will now disable "BB loop"
vectorization.

As said, it's a crude hack.  But maybe it's good enough to polish it?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug tree-optimization/98855] [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e
  2021-01-27 13:44 [Bug tree-optimization/98855] New: [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e marxin at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2021-02-04 14:18 ` rguenth at gcc dot gnu.org
@ 2021-02-05 13:03 ` cvs-commit at gcc dot gnu.org
  2021-02-05 13:03 ` rguenth at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-02-05 13:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98855

--- Comment #9 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:63538886d1f7fc7cbf066b4c2d6d7fd4da537259

commit r11-7123-g63538886d1f7fc7cbf066b4c2d6d7fd4da537259
Author: Richard Biener <rguenther@suse.de>
Date:   Fri Feb 5 09:54:00 2021 +0100

    tree-optimization/98855 - redo BB vectorization costing

    The following attempts to account for the fact that BB vectorization
    regions now can span multiple loop levels and that an unprofitable
    inner loop vectorization shouldn't be offsetted by a profitable
    outer loop vectorization to make it overall profitable.

    For now I've implemented a heuristic based on the premise that
    vectorization should be profitable even if loops may not be entered
    or if they iterate any number of times.  Especially the first
    assumption then requires that stmts directly belonging to loop A
    need to be costed separately from stmts belonging to another loop
    which also simplifies the implementation.

    On x86 the added testcase has in the outer loop

    t.c:38:20: note: Cost model analysis for part in loop 1:
      Vector cost: 56
      Scalar cost: 192

    and the inner loop

    t.c:38:20: note: Cost model analysis for part in loop 2:
      Vector cost: 132
      Scalar cost: 48

    and thus the vectorization is considered not profitable
    (note the same would happen in case the 2nd cost were for
    a loop outer to the 1st costing).

    Future enhancements may consider static knowledge of whether
    a loop is always entered which would allow some inefficiency
    in the vectorization of its loop header.  Likewise stmts only
    reachable from a loop exit can be treated this way.

    2021-02-05  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/98855
            * tree-vectorizer.h (add_stmt_cost): New overload.
            * tree-vect-slp.c (li_cost_vec_cmp): New.
            (vect_bb_slp_scalar_cost): Cost individual loop regions
            separately.  Account for the scalar instance root stmt.

            * g++.dg/vect/slp-pr98855.cc: New testcase.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug tree-optimization/98855] [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e
  2021-01-27 13:44 [Bug tree-optimization/98855] New: [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e marxin at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2021-02-05 13:03 ` cvs-commit at gcc dot gnu.org
@ 2021-02-05 13:03 ` rguenth at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-02-05 13:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98855

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixed.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-02-05 13:03 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-27 13:44 [Bug tree-optimization/98855] New: [11 Regression] botan XTEA is 100% slower on znver2 since r11-4428-g4a369d199bf2f34e marxin at gcc dot gnu.org
2021-01-27 13:44 ` [Bug tree-optimization/98855] " marxin at gcc dot gnu.org
2021-01-27 14:29 ` marxin at gcc dot gnu.org
2021-01-27 14:42 ` rguenth at gcc dot gnu.org
2021-01-29 12:12 ` marxin at gcc dot gnu.org
2021-02-03 13:55 ` rguenth at gcc dot gnu.org
2021-02-03 14:21 ` rguenth at gcc dot gnu.org
2021-02-04 12:02 ` cvs-commit at gcc dot gnu.org
2021-02-04 12:51 ` rguenth at gcc dot gnu.org
2021-02-04 14:18 ` rguenth at gcc dot gnu.org
2021-02-05 13:03 ` cvs-commit at gcc dot gnu.org
2021-02-05 13:03 ` rguenth at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).