Date: Sun, 12 Jul 2020 19:44:44 +0100
From: Chris Webb <chris@arachsys.com>
To: gcc-help@gcc.gnu.org
Subject: Puzzling auto-vectorization behaviour
Message-ID: <20200712184444.GB30327@arachsys.com>

I'm seeing 2x slower execution of the following cryptographic permutation
function when compiled with gcc 10.1.0 vs llvm/clang:

#include <stddef.h>
#include <stdint.h>

void gimli(uint32_t state[12]) {
  size_t round, column;
  uint32_t x, y, z;

  for (round = 24; round > 0; round--) {
    for (column = 0; column < 4; column++) {
      x = state[column] << 24 | state[column] >> 8;
      y = state[4 + column] << 9 | state[4 + column] >> 23;
      z = state[8 + column];

      state[8 + column] = x ^ (z << 1) ^ ((y & z) << 2);
      state[4 + column] = y ^ x ^ ((x | z) << 1);
      state[column] = z ^ y ^ ((x & y) << 3);
    }

    switch (round & 3) {
      case 0:
        x = state[0], state[0] = state[1], state[1] = x;
        x = state[2], state[2] = state[3], state[3] = x;
        state[0] ^= 0x9e377900 | round;
        break;
      case 2:
        x = state[0], state[0] = state[2], state[2] = x;
        x = state[1], state[1] = state[3], state[3] = x;
        break;
    }
  }
}

Compiled with LLVM clang at -O3 -march=native on an x86-64 box, it gets
vectorized with AVX instructions, taking about 110ns to run on my test
machine. This isn't too far off a hand-optimised version using vector
intrinsics. My gimli.c and the generated assembler from clang's -S (for
comparison) are at

https://gist.githubusercontent.com/arachsys/afff76fae190ce133da2ebd7b374c478/raw/156bc1240fd858b2088ad26354940b2be22e6188/gimli.c
https://gist.githubusercontent.com/arachsys/afff76fae190ce133da2ebd7b374c478/raw/156bc1240fd858b2088ad26354940b2be22e6188/gimli-llvm.s

respectively.

However, gcc 10.1.0 isn't vectorizing the inner loop, so the compiled
function takes 240ns on the same machine. gcc seems to examine the outer
loop, decide it can't be vectorized, and never consider the inner loop on
its own:

$ gcc -O3 -march=native -S -fopt-info-vec-all gimli.c
gimli.c:8:3: missed: couldn't vectorize loop
gimli.c:8:3: missed: not vectorized: control flow in loop.
gimli.c:4:6: note: vectorized 0 loops in function.
gimli.c:4:6: note: ***** Analysis failed with vector mode V8SI
gimli.c:4:6: note: ***** Skipping vector mode V32QI, which would repeat the analysis for V8SI
gimli.c:23:18: note: ***** Analysis failed with vector mode VOID
gimli.c:8:3: note: ***** Analysis failed with vector mode VOID
gimli.c:19:5: note: ***** Analysis failed with vector mode VOID
gimli.c:19:5: note: ***** Analysis failed with vector mode VOID
gimli.c:31:1: note: ***** Analysis succeeded with vector mode V8SI
gimli.c:31:1: note: SLPing BB part
gimli.c:31:1: note: ------>vectorizing SLP node starting from: *state_38(D) = state__lsm.6_276;
gimli.c:31:1: note: vect_is_simple_use: operand state__lsm.6_276 = PHI , type of def: external
gimli.c:31:1: note: vect_is_simple_use: operand state__lsm.3_35 = PHI , type of def: external
gimli.c:31:1: note: vect_is_simple_use: operand state__lsm.4_280 = PHI , type of def: external
gimli.c:31:1: note: vect_is_simple_use: operand state__lsm.5_279 = PHI , type of def: external
gimli.c:31:1: note: vect_is_simple_use: operand _281 = PHI <_83(4)>, type of def: external
gimli.c:31:1: note: vect_is_simple_use: operand _277 = PHI <_115(4)>, type of def: external
gimli.c:31:1: note: vect_is_simple_use: operand _274 = PHI <_147(4)>, type of def: external
gimli.c:31:1: note: vect_is_simple_use: operand _272 = PHI <_179(4)>, type of def: external
gimli.c:31:1: note: transform store. ncopies = 1
gimli.c:31:1: note: add new stmt: vect_cst__268 = _269;
gimli.c:31:1: note: created new init_stmt: vect_cst__268 = _269;
gimli.c:31:1: note: create vector_type-pointer variable to type: vector(8) unsigned int vectorizing a pointer ref: *state_38(D)
gimli.c:31:1: note: created vectp.17_267
gimli.c:31:1: note: add new stmt: MEM [(uint32_t *)vectp.17_267] = vect_cst__268;
gimli.c:31:1: note: vectorizing stmts using SLP.
gimli.c:31:1: optimized: basic block part vectorized using 32 byte vectors
gimli.c:31:1: note: ***** The result for vector mode V32QI would be the same
gimli.c:31:1: note: basic block vectorized

The resulting (much larger) gimli-gcc.s file is at

https://gist.githubusercontent.com/arachsys/afff76fae190ce133da2ebd7b374c478/raw/156bc1240fd858b2088ad26354940b2be22e6188/gimli-gcc.s

The puzzling/interesting thing here is that, if I pull out the inner loop
into a standalone function, it then seems to get vectorized into AVX
instructions just fine:

$ gcc -O3 -march=native -S -fopt-info-vec-all oneround.c
oneround.c:8:3: optimized: loop vectorized using 16 byte vectors
oneround.c:4:6: note: vectorized 1 loops in function.
oneround.c:17:1: note: ***** Analysis failed with vector mode VOID

https://gist.githubusercontent.com/arachsys/afff76fae190ce133da2ebd7b374c478/raw/156bc1240fd858b2088ad26354940b2be22e6188/oneround.c
https://gist.githubusercontent.com/arachsys/afff76fae190ce133da2ebd7b374c478/raw/156bc1240fd858b2088ad26354940b2be22e6188/oneround-gcc.s

(If it's called from the outer function, the inner one needs to be
non-static, or the optimiser will inline it and the vectorization fails
again. A rough sketch of this extraction is appended below.)

Is there something about the larger function that upsets gcc's vectorizer
or other optimisations in a way that the smaller one doesn't? Can I coax
gcc into generating comparable code to llvm here with a #pragma, flag or
other hint?

Many thanks in advance for any pointers or suggestions.
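P.S. For anyone who'd rather not click through to the gists, here's roughly
what the extraction looks like. This is reconstructed from the gimli() body
above rather than pasted from oneround.c, so treat the layout as approximate
and the function name as just mirroring the filename; both halves are shown
in one listing for brevity:

#include <stddef.h>
#include <stdint.h>

/* One round of the column operation, pulled straight out of gimli() above.
   It's deliberately non-static: as noted earlier, if the optimiser inlines
   it back into the caller, the inner loop stops being vectorized. */
void oneround(uint32_t state[12]) {
  size_t column;
  uint32_t x, y, z;

  for (column = 0; column < 4; column++) {
    x = state[column] << 24 | state[column] >> 8;
    y = state[4 + column] << 9 | state[4 + column] >> 23;
    z = state[8 + column];

    state[8 + column] = x ^ (z << 1) ^ ((y & z) << 2);
    state[4 + column] = y ^ x ^ ((x | z) << 1);
    state[column] = z ^ y ^ ((x & y) << 3);
  }
}

/* The outer loop is unchanged apart from replacing the column loop with a
   call to oneround(). */
void gimli(uint32_t state[12]) {
  size_t round;
  uint32_t x;

  for (round = 24; round > 0; round--) {
    oneround(state);

    switch (round & 3) {
      case 0:
        x = state[0], state[0] = state[1], state[1] = x;
        x = state[2], state[2] = state[3], state[3] = x;
        state[0] ^= 0x9e377900 | round;
        break;
      case 2:
        x = state[0], state[0] = state[2], state[2] = x;
        x = state[1], state[1] = state[3], state[3] = x;
        break;
    }
  }
}

The loop body is identical to the inner loop of the original gimli(); only
the function boundary differs, which is what makes the difference in gcc's
behaviour so surprising to me.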