public inbox for gcc-bugs@sourceware.org
From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
Date: Thu, 28 Jan 2021 11:03:00 +0000
Message-ID: <bug-98856-4-i4skgGd2Xi@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-98856-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Looks like STLF issues.  There's a ls_stlf counter, with SLP vectorization
disabled I see

  34.39%  1417  botan  libbotan-2.so.17  [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip
  32.27%  1333  botan  libbotan-2.so.17  [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip
   7.31%   306  botan  libbotan-2.so.17  [.] Botan::poly_double_n_le

while with SLP vectorization enabled there's

Samples: 4K of event 'ls_stlf:u', Event count (approx.): 723886942
Overhead  Samples  Command  Shared Object     Symbol
  32.41%  1320  botan  libbotan-2.so.17  [.] Botan::poly_double_n_le
  27.23%  1114  botan  libbotan-2.so.17  [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip
  27.06%  1107  botan  libbotan-2.so.17  [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip

but then the register docs suggest that the unnamed cpu/event=0x24,umask=0x2/u
is supposed to be the forwarding fails due to incomplete/misaligned data.

Unvectorized:

Samples: 4K of event 'cpu/event=0x24,umask=0x2/u', Event count (approx.): 1024347253
Overhead  Samples  Command  Shared Object     Symbol
  33.56%  1382  botan  libbotan-2.so.17  [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip
  30.32%  1246  botan  libbotan-2.so.17  [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip
  23.18%   953  botan  libbotan-2.so.17  [.] Botan::poly_double_n_le

vectorized:

Samples: 4K of event 'cpu/event=0x24,umask=0x2/u', Event count (approx.): 489384781
Overhead  Samples  Command  Shared Object     Symbol
  30.17%  1229  botan  libbotan-2.so.17  [.] Botan::poly_double_n_le
  29.40%  1203  botan  libbotan-2.so.17  [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip
  28.09%  1147  botan  libbotan-2.so.17  [.] Botan::Block_Cipher_Fixed_Params<16ul, 16ul, 0ul, 1ul, Botan::BlockCip

but the masking doesn't work as expected since I get hits for either bit on

  4.05 |       vmovdqa %xmm4,0x10(%rsp)                             #
       |     const uint64_t carry = POLY * (W[LIMBS-1] >> 63);      #
 12.24 |       mov     0x18(%rsp),%rdx                              #
       |     W[0] = (W[0] << 1) ^ carry;                            #
 24.00 |       vmovdqa 0x10(%rsp),%xmm5

which should only happen for bit 2 (data not ready).  Of course this code-gen
is weird since 0x10(%rsp) is available in %xmm4.  Well, changing the above
doesn't make a difference.  I guess the event hit is just quite delayed - that
makes perf quite useless here.

As a general optimization remark we fail to scalarize 'W' in poly_double_le
for the larger sizes, but the relevant differences likely appear for the
cases we expand the memcpy inline on GIMPLE, specifically

  <bb 10> [local count: 1431655747]:
  _60 = MEM <__int128 unsigned> [(char * {ref-all})in_6(D)];
  _61 = BIT_FIELD_REF <_60, 64, 64>;
  _62 = _61 >> 63;
  carry_63 = _62 * 135;
  _308 = _61 << 1;
  _228 = (long unsigned int) _60;
  _310 = _228 >> 63;
  _311 = _308 ^ _310;
  _71 = _228 << 1;
  _72 = carry_63 ^ _71;
  MEM <long unsigned int> [(char * {ref-all})out_5(D)] = _72;
  MEM <long unsigned int> [(char * {ref-all})out_5(D) + 8B] = _311;

this is turned into

  <bb 10> [local count: 1431655747]:
  _60 = MEM <__int128 unsigned> [(char * {ref-all})in_6(D)];
  _114 = VIEW_CONVERT_EXPR<vector(2) long unsigned int>(_60);
  vect__71.335_298 = _114 << 1;
  _61 = BIT_FIELD_REF <_60, 64, 64>;
  _62 = _61 >> 63;
  carry_63 = _62 * 135;
  _228 = (long unsigned int) _60;
  _310 = _228 >> 63;
  _147 = {carry_63, _310};
  vect__72.336_173 = _147 ^ vect__71.335_298;
  MEM <vector(2) long unsigned int> [(char * {ref-all})out_5(D)] = vect__72.336_173;

after the patch which is

build/include/botan/mem_ops.h:148:15: note: Basic block will be vectorized using SLP
build/include/botan/mem_ops.h:148:15: note: Vectorizing SLP tree:
build/include/botan/mem_ops.h:148:15: note: node 0x275d8e8 (max_nunits=2, refcnt=1)
build/include/botan/mem_ops.h:148:15: note: op template: MEM <long unsigned int> [(char * {ref-all})out_5(D)] = _72;
build/include/botan/mem_ops.h:148:15: note:     stmt 0 MEM <long unsigned int> [(char * {ref-all})out_5(D)] = _72;
build/include/botan/mem_ops.h:148:15: note:     stmt 1 MEM <long unsigned int> [(char * {ref-all})out_5(D) + 8B] = _311;
build/include/botan/mem_ops.h:148:15: note:     children 0x275d960
build/include/botan/mem_ops.h:148:15: note: node 0x275d960 (max_nunits=2, refcnt=1)
build/include/botan/mem_ops.h:148:15: note: op template: _72 = carry_63 ^ _71;
build/include/botan/mem_ops.h:148:15: note:     stmt 0 _72 = carry_63 ^ _71;
build/include/botan/mem_ops.h:148:15: note:     stmt 1 _311 = _308 ^ _310;
build/include/botan/mem_ops.h:148:15: note:     children 0x275d9d8 0x275da50
build/include/botan/mem_ops.h:148:15: note: node (external) 0x275d9d8 (max_nunits=1, refcnt=1)
build/include/botan/mem_ops.h:148:15: note:     { carry_63, _310 }
build/include/botan/mem_ops.h:148:15: note: node 0x275da50 (max_nunits=2, refcnt=1)
build/include/botan/mem_ops.h:148:15: note: op template: _71 = _228 << 1;
build/include/botan/mem_ops.h:148:15: note:     stmt 0 _71 = _228 << 1;
build/include/botan/mem_ops.h:148:15: note:     stmt 1 _308 = _61 << 1;
build/include/botan/mem_ops.h:148:15: note:     children 0x275dac8 0x275dbb8
build/include/botan/mem_ops.h:148:15: note: node 0x275dac8 (max_nunits=1, refcnt=1)
build/include/botan/mem_ops.h:148:15: note: op: VEC_PERM_EXPR
build/include/botan/mem_ops.h:148:15: note:     stmt 0 _228 = BIT_FIELD_REF <_60, 64, 0>;
build/include/botan/mem_ops.h:148:15: note:     stmt 1 _61 = BIT_FIELD_REF <_60, 64, 64>;
build/include/botan/mem_ops.h:148:15: note:     lane permutation { 0[0] 0[1] }
build/include/botan/mem_ops.h:148:15: note:     children 0x275db40
build/include/botan/mem_ops.h:148:15: note: node (external) 0x275db40 (max_nunits=1, refcnt=1)
build/include/botan/mem_ops.h:148:15: note:     { }
build/include/botan/mem_ops.h:148:15: note: node (constant) 0x275dbb8 (max_nunits=1, refcnt=1)
build/include/botan/mem_ops.h:148:15: note:     { 1, 1 }

with costs

build/include/botan/mem_ops.h:148:15: note: Cost model analysis:
  Vector inside of basic block cost: 24
  Vector prologue cost: 8
  Vector epilogue cost: 8
  Scalar cost of basic block: 52

the vectorization isn't too bad I think, it turns into

.L56:
        .cfi_restore_state
        vmovdqu (%rsi), %xmm4
        vmovdqa %xmm4, 16(%rsp)
        movq    24(%rsp), %rdx
        vmovdqa 16(%rsp), %xmm5
        shrq    $63, %rdx
        imulq   $135, %rdx, %rdi
        movq    16(%rsp), %rdx
        vmovq   %rdi, %xmm0
        vpsllq  $1, %xmm5, %xmm1
        shrq    $63, %rdx
        vpinsrq $1, %rdx, %xmm0, %xmm0
        vpxor   %xmm1, %xmm0, %xmm0
        vmovdqu %xmm0, (%rax)
        jmp     .L53

instead of

.L56:
        .cfi_restore_state
        movq    8(%rsi), %rdx
        movq    (%rsi), %rdi
        movq    %rdx, %rcx
        leaq    (%rdi,%rdi), %rsi
        addq    %rdx, %rdx
        shrq    $63, %rdi
        shrq    $63, %rcx
        xorq    %rdi, %rdx
        imulq   $135, %rcx, %rcx
        movq    %rdx, 8(%rax)
        xorq    %rsi, %rcx
        movq    %rcx, (%rax)
        jmp     .L53

but we see the 128bit move split when using GPRs possibly avoiding the STLF
issue.  I don't understand why we spill to extract the high part though.

Will see to create a small testcase for the above kernel.
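
For reference, a minimal sketch of what a standalone testcase for this kernel might look like, reconstructed only from the scalar GIMPLE above. The function name poly_double_128_le and its signature are hypothetical, not Botan's API; the constant 135 is taken from the dump, and a little-endian host is assumed so that memcpy matches the little-endian load. Whether this reduced form actually reproduces the spill has not been checked here.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical reduction of the two-limb (128-bit) poly_double case.
// Name and signature are illustrative only, not taken from Botan.
// Assumes a little-endian host, so memcpy acts as a little-endian load/store.
void poly_double_128_le(std::uint8_t out[16], const std::uint8_t in[16])
{
    // Constant seen in the GIMPLE dump (carry_63 = _62 * 135).
    const std::uint64_t POLY = 135;

    // This memcpy is the part expanded inline as a single 128-bit load.
    std::uint64_t W[2];
    std::memcpy(W, in, sizeof(W));

    // Top bit of the high limb selects whether the polynomial is folded in.
    const std::uint64_t carry = POLY * (W[1] >> 63);

    // Shift the 128-bit value left by one, limb by limb.
    W[1] = (W[1] << 1) ^ (W[0] >> 63);
    W[0] = (W[0] << 1) ^ carry;

    // These are the two adjacent 64-bit stores that SLP merges into one
    // vector store.
    std::memcpy(out, W, sizeof(W));
}
```

Compiling something like this with and without SLP vectorization and comparing the assembly should show whether the store to the stack followed by the 64-bit reload of the high half appears outside of the full Botan build.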

With the vectorization disabled for just this kernel I get

AES-128/XTS 280780 key schedule/sec; 0.00 ms/op 12122 cycles/op (2 ops in 0 ms)
AES-128/XTS encrypt buffer size 1024 bytes: 852.401 MiB/sec 4.14 cycles/byte (426.20 MiB in 500.00 ms)
AES-128/XTS decrypt buffer size 1024 bytes: 854.461 MiB/sec 4.13 cycles/byte (426.20 MiB in 498.80 ms)

compared to

AES-128/XTS 286409 key schedule/sec; 0.00 ms/op 11761 cycles/op (2 ops in 0 ms)
AES-128/XTS encrypt buffer size 1024 bytes: 765.736 MiB/sec 4.62 cycles/byte (382.87 MiB in 500.00 ms)
AES-128/XTS decrypt buffer size 1024 bytes: 766.612 MiB/sec 4.61 cycles/byte (382.87 MiB in 499.43 ms)

so that seems to be it.