From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
Date: Fri, 05 Feb 2021 12:52:32 +0000
Message-ID: <bug-98856-4-knIJiDyNSq@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-98856-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #9)
> For arithmetic >> (element_precision - 1) one can just use
> {,v}pxor + {,v}pcmpgtq, i.e. instead of return vec >> 63; do return vec < 0;
> (in a C++-ish way), aka VEC_COND_EXPR vec < 0, { all ones }, { 0 }.
> For other arithmetic shifts by a scalar constant, perhaps one can replace
> return vec >> 17; with return (vectype) ((uvectype) vec >> 17) | ((vec < 0)
> << (64 - 17));
> - it will actually work even for non-constant scalar shift amounts because
> {,v}psllq treats shift counts > 63 as 0.
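
Spelled out with GCC's generic vector extensions, the suggested rewrite is
roughly the following (a minimal sketch; the type and function names are
mine, not from the testcase):

typedef long long v2di __attribute__ ((vector_size (16)));
typedef unsigned long long v2du __attribute__ ((vector_size (16)));

/* vec >> 63 (arithmetic) just broadcasts the sign bit, so vec < 0
   produces the same all-ones/all-zeros mask and can be emitted as
   {,v}pxor + {,v}pcmpgtq.  */
v2di sign_mask (v2di vec)
{
  return vec < 0;
}

/* General arithmetic right shift by a constant: shift the bits
   logically, then OR the sign mask into the vacated top bits.  */
v2di ashr17 (v2di vec)
{
  return (v2di) ((v2du) vec >> 17) | ((vec < 0) << (64 - 17));
}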

OK, so that yields

poly_double_le2:
.LFB0:
        .cfi_startproc
        vmovdqu (%rsi), %xmm0
        vpxor   %xmm1, %xmm1, %xmm1
        vpalignr        $8, %xmm0, %xmm0, %xmm2
        vpcmpgtq        %xmm2, %xmm1, %xmm1
        vpand   .LC0(%rip), %xmm1, %xmm1
        vpsllq  $1, %xmm0, %xmm0
        vpxor   %xmm1, %xmm0, %xmm0
        vmovdqu %xmm0, (%rdi)
        ret
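
For reference, my reconstruction of that sequence with intrinsics
(illustrative only; it assumes .LC0 holds the { 135, 1 } constants from the
kernel below, and needs SSSE3 for palignr plus SSE4.2 for pcmpgtq):

#include <immintrin.h>

void poly_double_le2_sse (unsigned char *out, const unsigned char *in)
{
  const __m128i red  = _mm_set_epi64x (1, 135); /* assumed contents of .LC0 */
  __m128i v    = _mm_loadu_si128 ((const __m128i *) in);
  __m128i swap = _mm_alignr_epi8 (v, v, 8);     /* vpalignr: swap the halves */
  __m128i sign = _mm_cmpgt_epi64 (_mm_setzero_si128 (), swap); /* x < 0 mask */
  __m128i mask = _mm_and_si128 (sign, red);
  __m128i dbl  = _mm_slli_epi64 (v, 1);
  _mm_storeu_si128 ((__m128i *) out, _mm_xor_si128 (dbl, mask));
}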

That is what I get when I feed the following to SLP2 directly:

void __GIMPLE (ssa,guessed_local(1073741824),startwith("slp"))
poly_double_le2 (unsigned char * out, const unsigned char * in)
{
  long unsigned int carry;
  long unsigned int _1;
  long unsigned int _2;
  long unsigned int _3;
  long unsigned int _4;
  long unsigned int _5;
  long unsigned int _6;
  __int128 unsigned _9;
  long unsigned int _14;
  long unsigned int _15;
  long int _18;
  long int _19;
  long unsigned int _20;

  __BB(2,guessed_local(1073741824)):
  _9 = __MEM <__int128 unsigned, 8> ((char *)in_8(D));
  _14 = __BIT_FIELD_REF <long unsigned int> (_9, 64u, 64u);
  _18 = (long int) _14;
  _1 = _18 < 0l ? _Literal (unsigned long) -1ul : 0ul;
  carry_10 = _1 & 135ul;
  _2 = _14 << 1;
  _15 = __BIT_FIELD_REF <long unsigned int> (_9, 64u, 0u);
  _19 = (long int) _15;
  _20 = _19 < 0l ? _Literal (unsigned long) -1ul : 0ul;
  _3 = _20 & 1ul;
  _4 = _2 ^ _3;
  _5 = _15 << 1;
  _6 = _5 ^ carry_10;
  __MEM <long unsigned int, 8> ((char *)out_11(D)) = _6;
  __MEM <long unsigned int, 8> ((char *)out_11(D) + _Literal (char *) 8) = _4;
  return;

}
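
As an aside, the kernel above is the usual little-endian GF(2^128) doubling
used by XTS (135 == 0x87 is the reduction polynomial); in plain C it is
roughly the following (names are mine):

#include <stdint.h>
#include <string.h>

void poly_double_le2_ref (unsigned char *out, const unsigned char *in)
{
  uint64_t lo, hi;
  memcpy (&lo, in, 8);       /* bits 0..63   */
  memcpy (&hi, in + 8, 8);   /* bits 64..127 */

  uint64_t carry_lo = ((int64_t) hi < 0 ? -1ull : 0) & 135; /* wrap of x^128 */
  uint64_t carry_hi = ((int64_t) lo < 0 ? -1ull : 0) & 1;   /* lo -> hi carry */

  uint64_t new_lo = (lo << 1) ^ carry_lo;
  uint64_t new_hi = (hi << 1) ^ carry_hi;

  memcpy (out, &new_lo, 8);
  memcpy (out + 8, &new_hi, 8);
}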

In .optimized this ends up as:

  <bb 2> [local count: 1073741824]:
  _9 = MEM <__int128 unsigned> [(char *)in_8(D)];
  _12 = VIEW_CONVERT_EXPR<vector(2) long unsigned int>(_9);
  _7 = VEC_PERM_EXPR <_12, _12, { 1, 0 }>;
  vect__18.1_25 = VIEW_CONVERT_EXPR<vector(2) long int>(_7);
  vect_carry_10.3_28 = .VCOND (vect__18.1_25, { 0, 0 }, { 135, 1 }, { 0, 0 }, 108);
  vect__5.0_13 = _12 << 1;
  vect__6.4_29 = vect__5.0_13 ^ vect_carry_10.3_28;
  MEM <vector(2) long unsigned int> [(char *)out_11(D)] = vect__6.4_29;
  return;

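That is, the vectorized kernel computes roughly the following (GCC vector
extensions again; a sketch with names of my choosing):

typedef long long v2di __attribute__ ((vector_size (16)));
typedef unsigned long long v2du __attribute__ ((vector_size (16)));

void poly_double_le2_vec (unsigned char *out, const unsigned char *in)
{
  v2du v, sw, carry, r;
  __builtin_memcpy (&v, in, 16);
  sw = __builtin_shuffle (v, (v2du) { 1, 0 });        /* VEC_PERM <{ 1, 0 }> */
  carry = (v2du) ((v2di) sw < 0) & (v2du) { 135, 1 }; /* the .VCOND */
  r = (v << 1) ^ carry;                               /* shift and xor */
  __builtin_memcpy (out, &r, 16);
}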

The data latency is at least 7 instructions that way, compared to
4 in the non-vectorized code (I guess I could try Intel IACA on it).
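
(My reading of the critical path in the asm above: load -> vpalignr ->
vpcmpgtq -> vpand -> vpxor -> store, with the vpsllq running in parallel
off the load and joining at the vpxor; counting the shift as well gives
the seven instructions.)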

So if that's indeed the best we can do, then it's not profitable (btw,
with the above the vectorizer's conclusion is also "not profitable", but
only due to excessive costing of the constants for the condition
vectorization).

Simple asm replacement of the kernel results in

AES-128/XTS 292740 key schedule/sec; 0.00 ms/op 11571 cycles/op (2 ops in 0 ms)
AES-128/XTS encrypt buffer size 1024 bytes: 765.571 MiB/sec 4.62 cycles/byte
(382.79 MiB in 500.00 ms)
AES-128/XTS decrypt buffer size 1024 bytes: 767.064 MiB/sec 4.61 cycles/byte
(382.79 MiB in 499.03 ms)

compared to

AES-128/XTS 283527 key schedule/sec; 0.00 ms/op 11932 cycles/op (2 ops in 0 ms)
AES-128/XTS encrypt buffer size 1024 bytes: 768.446 MiB/sec 4.60 cycles/byte
(384.22 MiB in 500.00 ms)
AES-128/XTS decrypt buffer size 1024 bytes: 769.292 MiB/sec 4.60 cycles/byte
(384.22 MiB in 499.45 ms)

so that's indeed no improvement.  Bigger block sizes also contain vector
code, but that's not exercised by the botan speed measurement.

