Date: Sun, 12 Jul 2020 19:44:44 +0100
From: Chris Webb <chris@arachsys.com>
To: gcc-help@gcc.gnu.org
Subject: Puzzling auto-vectorization behaviour
Message-ID: <20200712184444.GB30327@arachsys.com>

I'm seeing 2x slower execution of the following cryptographic permutation
function when compiled with gcc 10.1.0 vs llvm/clang:

#include <stddef.h>
#include <stdint.h>

void gimli(uint32_t state[12]) {
  size_t round, column;
  uint32_t x, y, z;

  for (round = 24; round > 0; round--) {
    for (column = 0; column < 4; column++) {
      x = state[column] << 24 | state[column] >> 8;
      y = state[4 + column] << 9 | state[4 + column] >> 23;
      z = state[8 + column];

      state[8 + column] = x ^ (z << 1) ^ ((y & z) << 2);
      state[4 + column] = y ^ x ^ ((x | z) << 1);
      state[column] = z ^ y ^ ((x & y) << 3);
    }

    switch (round & 3) {
      case 0:
        x = state[0], state[0] = state[1], state[1] = x;
        x = state[2], state[2] = state[3], state[3] = x;
        state[0] ^= 0x9e377900 | round;
        break;
      case 2:
        x = state[0], state[0] = state[2], state[2] = x;
        x = state[1], state[1] = state[3], state[3] = x;
        break;
    }
  }
}

Compiled with LLVM clang at -O3 -march=native on an x86-64 box, it gets
vectorized with AVX instructions, taking about 110ns to run on my test
machine. This isn't too far off a hand-optimised version using vector
intrinsics. My gimli.c and the generated assembler from clang's -S (for
comparison) are at

https://gist.githubusercontent.com/arachsys/afff76fae190ce133da2ebd7b374c478/raw/156bc1240fd858b2088ad26354940b2be22e6188/gimli.c
https://gist.githubusercontent.com/arachsys/afff76fae190ce133da2ebd7b374c478/raw/156bc1240fd858b2088ad26354940b2be22e6188/gimli-llvm.s

respectively.

However, gcc 10.1.0 isn't vectorizing the inner loop, so the compiled
function takes 240ns on the same machine. gcc seems to examine the outer
loop, decide it can't be vectorized, and never consider the inner loop on
its own:

$ gcc -O3 -march=native -S -fopt-info-vec-all gimli.c
gimli.c:8:3: missed: couldn't vectorize loop
gimli.c:8:3: missed: not vectorized: control flow in loop.
gimli.c:4:6: note: vectorized 0 loops in function.
gimli.c:4:6: note: ***** Analysis failed with vector mode V8SI
gimli.c:4:6: note: ***** Skipping vector mode V32QI, which would repeat the analysis for V8SI
gimli.c:23:18: note: ***** Analysis failed with vector mode VOID
gimli.c:8:3: note: ***** Analysis failed with vector mode VOID
gimli.c:19:5: note: ***** Analysis failed with vector mode VOID
gimli.c:19:5: note: ***** Analysis failed with vector mode VOID
gimli.c:31:1: note: ***** Analysis succeeded with vector mode V8SI
gimli.c:31:1: note: SLPing BB part
gimli.c:31:1: note: ------>vectorizing SLP node starting from: *state_38(D) = state__lsm.6_276;
gimli.c:31:1: note: vect_is_simple_use: operand state__lsm.6_276 = PHI , type of def: external
gimli.c:31:1: note: vect_is_simple_use: operand state__lsm.3_35 = PHI , type of def: external
gimli.c:31:1: note: vect_is_simple_use: operand state__lsm.4_280 = PHI , type of def: external
gimli.c:31:1: note: vect_is_simple_use: operand state__lsm.5_279 = PHI , type of def: external
gimli.c:31:1: note: vect_is_simple_use: operand _281 = PHI <_83(4)>, type of def: external
gimli.c:31:1: note: vect_is_simple_use: operand _277 = PHI <_115(4)>, type of def: external
gimli.c:31:1: note: vect_is_simple_use: operand _274 = PHI <_147(4)>, type of def: external
gimli.c:31:1: note: vect_is_simple_use: operand _272 = PHI <_179(4)>, type of def: external
gimli.c:31:1: note: transform store. ncopies = 1
gimli.c:31:1: note: add new stmt: vect_cst__268 = _269;
gimli.c:31:1: note: created new init_stmt: vect_cst__268 = _269;
gimli.c:31:1: note: create vector_type-pointer variable to type: vector(8) unsigned int vectorizing a pointer ref: *state_38(D)
gimli.c:31:1: note: created vectp.17_267
gimli.c:31:1: note: add new stmt: MEM [(uint32_t *)vectp.17_267] = vect_cst__268;
gimli.c:31:1: note: vectorizing stmts using SLP.
gimli.c:31:1: optimized: basic block part vectorized using 32 byte vectors
gimli.c:31:1: note: ***** The result for vector mode V32QI would be the same
gimli.c:31:1: note: basic block vectorized

The resulting (much larger) gimli-gcc.s file is at

https://gist.githubusercontent.com/arachsys/afff76fae190ce133da2ebd7b374c478/raw/156bc1240fd858b2088ad26354940b2be22e6188/gimli-gcc.s

The puzzling/interesting thing here is that, if I pull out the inner loop
into a standalone function, it then seems to get vectorized into AVX
instructions just fine:

$ gcc -O3 -march=native -S -fopt-info-vec-all oneround.c
oneround.c:8:3: optimized: loop vectorized using 16 byte vectors
oneround.c:4:6: note: vectorized 1 loops in function.
oneround.c:17:1: note: ***** Analysis failed with vector mode VOID

https://gist.githubusercontent.com/arachsys/afff76fae190ce133da2ebd7b374c478/raw/156bc1240fd858b2088ad26354940b2be22e6188/oneround.c
https://gist.githubusercontent.com/arachsys/afff76fae190ce133da2ebd7b374c478/raw/156bc1240fd858b2088ad26354940b2be22e6188/oneround-gcc.s

(If it's called from the outer function, the inner one needs to be
non-static, or the optimiser will inline it and the vectorization fails
again. A rough sketch of this extraction is appended below.)

Is there something about the larger function that upsets gcc's vectorizer
or other optimisations in a way that the smaller one doesn't? Can I coax
gcc into generating comparable code to llvm here with a #pragma, flag or
other hint?

Many thanks in advance for any pointers or suggestions.
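P.S. For anyone who'd rather not click through to the gists, here's roughly
what the extraction looks like. This is reconstructed from the gimli() body
above rather than pasted from oneround.c, so treat the layout as approximate
and the function name as just mirroring the filename; both halves are shown
in one listing for brevity:

#include <stddef.h>
#include <stdint.h>

/* One round of the column operation, pulled straight out of gimli() above.
   It's deliberately non-static: as noted earlier, if the optimiser inlines
   it back into the caller, the inner loop stops being vectorized. */
void oneround(uint32_t state[12]) {
  size_t column;
  uint32_t x, y, z;

  for (column = 0; column < 4; column++) {
    x = state[column] << 24 | state[column] >> 8;
    y = state[4 + column] << 9 | state[4 + column] >> 23;
    z = state[8 + column];

    state[8 + column] = x ^ (z << 1) ^ ((y & z) << 2);
    state[4 + column] = y ^ x ^ ((x | z) << 1);
    state[column] = z ^ y ^ ((x & y) << 3);
  }
}

/* The outer loop is unchanged apart from replacing the column loop with a
   call to oneround(). */
void gimli(uint32_t state[12]) {
  size_t round;
  uint32_t x;

  for (round = 24; round > 0; round--) {
    oneround(state);

    switch (round & 3) {
      case 0:
        x = state[0], state[0] = state[1], state[1] = x;
        x = state[2], state[2] = state[3], state[3] = x;
        state[0] ^= 0x9e377900 | round;
        break;
      case 2:
        x = state[0], state[0] = state[2], state[2] = x;
        x = state[1], state[1] = state[3], state[3] = x;
        break;
    }
  }
}

The loop body is identical to the inner loop of the original gimli(); only
the function boundary differs, which is what makes the difference in gcc's
behaviour so surprising to me.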