From mboxrd@z Thu Jan 1 00:00:00 1970
From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
Date: Fri, 05 Feb 2021 12:52:32 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #10 from Richard Biener ---
(In reply to Jakub Jelinek from comment #9)
> For arithmetic >> (element_precision - 1) one can just use
> {,v}pxor + {,v}pcmpgtq, as in, instead of return vec >> 63; do return vec < 0;
> (in a C++-ish way), aka VEC_COND_EXPR <vec < 0, { all ones }, { 0 }>
> For other arithmetic shifts by a scalar constant, perhaps one can replace
> return vec >> 17; with return (vectype) ((uvectype) vec >> 17) | ((vec < 0)
> << (64 - 17));
> - it will actually work even for non-constant scalar shift amounts because
> {,v}psllq treats shift counts > 63 as 0.
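Spelled out with intrinsics, the two suggested rewrites are roughly the
following (a sketch only, assuming SSE4.2 for pcmpgtq; the helper names
are made up for illustration, this is not what the vectorizer emits):

#include <immintrin.h>

/* vec >> 63 (arithmetic): each lane becomes all-ones if negative, else
   zero, i.e. simply 0 > vec via pcmpgtq.  */
static inline __m128i
sra63_epi64 (__m128i vec)
{
  return _mm_cmpgt_epi64 (_mm_setzero_si128 (), vec);
}

/* vec >> 17 (arithmetic): logical shift right plus the sign mask shifted
   into the vacated high bits, i.e.
   (uvectype) vec >> 17 | ((vec < 0) << (64 - 17)).  */
static inline __m128i
sra17_epi64 (__m128i vec)
{
  __m128i sign = _mm_cmpgt_epi64 (_mm_setzero_si128 (), vec);
  return _mm_or_si128 (_mm_srli_epi64 (vec, 17),
                       _mm_slli_epi64 (sign, 64 - 17));
}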
OK, so that yields

poly_double_le2:
.LFB0:
        .cfi_startproc
        vmovdqu (%rsi), %xmm0
        vpxor   %xmm1, %xmm1, %xmm1
        vpalignr        $8, %xmm0, %xmm0, %xmm2
        vpcmpgtq        %xmm2, %xmm1, %xmm1
        vpand   .LC0(%rip), %xmm1, %xmm1
        vpsllq  $1, %xmm0, %xmm0
        vpxor   %xmm1, %xmm0, %xmm0
        vmovdqu %xmm0, (%rdi)
        ret

when I feed the following to SLP2 directly:

void __GIMPLE (ssa,guessed_local(1073741824),startwith("slp"))
poly_double_le2 (unsigned char * out, const unsigned char * in)
{
  long unsigned int carry;
  long unsigned int _1;
  long unsigned int _2;
  long unsigned int _3;
  long unsigned int _4;
  long unsigned int _5;
  long unsigned int _6;
  __int128 unsigned _9;
  long unsigned int _14;
  long unsigned int _15;
  long int _18;
  long int _19;
  long unsigned int _20;

  __BB(2,guessed_local(1073741824)):
  _9 = __MEM <__int128 unsigned, 8> ((char *)in_8(D));
  _14 = __BIT_FIELD_REF <long unsigned int> (_9, 64u, 64u);
  _18 = (long int) _14;
  _1 = _18 < 0l ? _Literal (unsigned long) -1ul : 0ul;
  carry_10 = _1 & 135ul;
  _2 = _14 << 1;
  _15 = __BIT_FIELD_REF <long unsigned int> (_9, 64u, 0u);
  _19 = (long int) _15;
  _20 = _19 < 0l ? _Literal (unsigned long) -1ul : 0ul;
  _3 = _20 & 1ul;
  _4 = _2 ^ _3;
  _5 = _15 << 1;
  _6 = _5 ^ carry_10;
  __MEM <long unsigned int> ((char *)out_11(D)) = _6;
  __MEM <long unsigned int> ((char *)out_11(D) + _Literal (char *) 8) = _4;
  return;
}

with

  <bb 2> [local count: 1073741824]:
  _9 = MEM <__int128 unsigned> [(char *)in_8(D)];
  _12 = VIEW_CONVERT_EXPR<vector(2) long unsigned int>(_9);
  _7 = VEC_PERM_EXPR <_12, _12, { 1, 0 }>;
  vect__18.1_25 = VIEW_CONVERT_EXPR<vector(2) long int>(_7);
  vect_carry_10.3_28 = .VCOND (vect__18.1_25, { 0, 0 }, { 135, 1 }, { 0, 0 }, 108);
  vect__5.0_13 = _12 << 1;
  vect__6.4_29 = vect__5.0_13 ^ vect_carry_10.3_28;
  MEM <vector(2) long unsigned int> [(char *)out_11(D)] = vect__6.4_29;
  return;

in .optimized.

The data latency is at least 7 instructions that way, compared to 4 in the
non-vectorized code (I guess I could try Intel IACA on it).  So if that is
indeed the best we can do, then it is not profitable (btw, with the above the
vectorizer's conclusion is "not profitable", but only due to excessive costing
of the constants for the condition vectorization).

Simple asm replacement of the kernel results in

AES-128/XTS 292740 key schedule/sec; 0.00 ms/op 11571 cycles/op (2 ops in 0 ms)
AES-128/XTS encrypt buffer size 1024 bytes: 765.571 MiB/sec 4.62 cycles/byte (382.79 MiB in 500.00 ms)
AES-128/XTS decrypt buffer size 1024 bytes: 767.064 MiB/sec 4.61 cycles/byte (382.79 MiB in 499.03 ms)

compared to

AES-128/XTS 283527 key schedule/sec; 0.00 ms/op 11932 cycles/op (2 ops in 0 ms)
AES-128/XTS encrypt buffer size 1024 bytes: 768.446 MiB/sec 4.60 cycles/byte (384.22 MiB in 500.00 ms)
AES-128/XTS decrypt buffer size 1024 bytes: 769.292 MiB/sec 4.60 cycles/byte (384.22 MiB in 499.45 ms)

so that's indeed no improvement.  Bigger block sizes also contain vector code,
but that's not exercised by the botan speed measurement.
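For reference, a rough scalar equivalent of the kernel, reconstructed from the
GIMPLE above (not botan's actual source): it is the XTS tweak doubling in
GF(2^128), little-endian, with the reduction polynomial x^128 + x^7 + x^2 + x
+ 1 showing up as the 135 (0x87) constant.

#include <cstdint>
#include <cstring>

/* Rough scalar equivalent, reconstructed from the GIMPLE above for
   illustration only.  */
void
poly_double_le2 (unsigned char *out, const unsigned char *in)
{
  uint64_t lo, hi;
  std::memcpy (&lo, in, 8);       /* bits 0..63   */
  std::memcpy (&hi, in + 8, 8);   /* bits 64..127 */

  uint64_t carry = (hi >> 63) ? 135u : 0u;   /* reduce when bit 127 is set */
  uint64_t new_hi = (hi << 1) ^ (lo >> 63);  /* carry bit 63 of lo into hi */
  uint64_t new_lo = (lo << 1) ^ carry;

  std::memcpy (out, &new_lo, 8);
  std::memcpy (out + 8, &new_hi, 8);
}

Written this way the short per-half dependency chain of the scalar code
(shift, compare/select, xor) is visible, which is what the vectorized
sequence has to beat.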