Puzzling auto-vectorization behaviour

public inbox for gcc-help@gcc.gnu.org
 help / color / mirror / Atom feed

From: Chris Webb <chris@arachsys.com>
To: gcc-help@gcc.gnu.org
Subject: Puzzling auto-vectorization behaviour
Date: Sun, 12 Jul 2020 19:44:44 +0100	[thread overview]
Message-ID: <20200712184444.GB30327@arachsys.com> (raw)

I'm seeing 2x slower execution of the following cryptographic permutation
function when compiled with gcc 10.1.0 vs llvm/clang:

  void gimli(uint32_t state[12]) {
    size_t round, column;
    uint32_t x, y, z;

    for (round = 24; round > 0; round--) {
      for (column = 0; column < 4; column++) {
        x = state[column] << 24 | state[column] >> 8;
        y = state[4 + column] << 9 | state[4 + column] >> 23;
        z = state[8 + column];

        state[8 + column] = x ^ (z << 1) ^ ((y & z) << 2);
        state[4 + column] = y ^ x ^ ((x | z) << 1);
        state[column] = z ^ y ^ ((x & y) << 3);
      }

      switch (round & 3) {
        case 0:
          x = state[0], state[0] = state[1], state[1] = x;
          x = state[2], state[2] = state[3], state[3] = x;
          state[0] ^= 0x9e377900 | round;
          break;
        case 2:
          x = state[0], state[0] = state[2], state[2] = x;
          x = state[1], state[1] = state[3], state[3] = x;
          break;
      }
    }
  }

Compiled with LLVM clang at -O3 -march=native on an x86-64 box, it gets
vectorized with AVX instructions, taking about 110ns to run on my test machine.
This isn't too far off a hand-optimised version using vector intrinsics. My
gimli.c and the generated assembler from clang's -S (for comparison) are at

  https://gist.githubusercontent.com/arachsys/afff76fae190ce133da2ebd7b374c478/raw/156bc1240fd858b2088ad26354940b2be22e6188/gimli.c

  https://gist.githubusercontent.com/arachsys/afff76fae190ce133da2ebd7b374c478/raw/156bc1240fd858b2088ad26354940b2be22e6188/gimli-llvm.s

respectively.

However, gcc 10.1.0 isn't vectorizing the inner loop, so the compiled
function takes 240ns on the same machine. gcc seems to examine the outer
loop, decides it can't be vectorized but never considers the inner loop:

  $ gcc -O3 -march=native -S -fopt-info-vec-all gimli.c
  gimli.c:8:3: missed: couldn't vectorize loop
  gimli.c:8:3: missed: not vectorized: control flow in loop.
  gimli.c:4:6: note: vectorized 0 loops in function.
  gimli.c:4:6: note: ***** Analysis failed with vector mode V8SI
  gimli.c:4:6: note: ***** Skipping vector mode V32QI, which would repeat the analysis for V8SI
  gimli.c:23:18: note: ***** Analysis failed with vector mode VOID
  gimli.c:8:3: note: ***** Analysis failed with vector mode VOID
  gimli.c:19:5: note: ***** Analysis failed with vector mode VOID
  gimli.c:19:5: note: ***** Analysis failed with vector mode VOID
  gimli.c:31:1: note: ***** Analysis succeeded with vector mode V8SI
  gimli.c:31:1: note: SLPing BB part
  gimli.c:31:1: note: ------>vectorizing SLP node starting from: *state_38(D) = state__lsm.6_276;
  gimli.c:31:1: note: vect_is_simple_use: operand state__lsm.6_276 = PHI <state__lsm.6_131(4)>, type of def: external
  gimli.c:31:1: note: vect_is_simple_use: operand state__lsm.3_35 = PHI <state__lsm.3_67(4)>, type of def: external
  gimli.c:31:1: note: vect_is_simple_use: operand state__lsm.4_280 = PHI <state__lsm.4_94(4)>, type of def: external
  gimli.c:31:1: note: vect_is_simple_use: operand state__lsm.5_279 = PHI <state__lsm.5_104(4)>, type of def: external
  gimli.c:31:1: note: vect_is_simple_use: operand _281 = PHI <_83(4)>, type of def: external
  gimli.c:31:1: note: vect_is_simple_use: operand _277 = PHI <_115(4)>, type of def: external
  gimli.c:31:1: note: vect_is_simple_use: operand _274 = PHI <_147(4)>, type of def: external
  gimli.c:31:1: note: vect_is_simple_use: operand _272 = PHI <_179(4)>, type of def: external
  gimli.c:31:1: note: transform store. ncopies = 1
  gimli.c:31:1: note: add new stmt: vect_cst__268 = _269;
  gimli.c:31:1: note: created new init_stmt: vect_cst__268 = _269;
  gimli.c:31:1: note: create vector_type-pointer variable to type: vector(8) unsigned int  vectorizing a pointer ref: *state_38(D)
  gimli.c:31:1: note: created vectp.17_267
  gimli.c:31:1: note: add new stmt: MEM <vector(8) unsigned int> [(uint32_t *)vectp.17_267] = vect_cst__268;
  gimli.c:31:1: note: vectorizing stmts using SLP.
  gimli.c:31:1: optimized: basic block part vectorized using 32 byte vectors
  gimli.c:31:1: note: ***** The result for vector mode V32QI would be the same
  gimli.c:31:1: note: basic block vectorized

The resulting (much larger) gimli-gcc.s file is at

  https://gist.githubusercontent.com/arachsys/afff76fae190ce133da2ebd7b374c478/raw/156bc1240fd858b2088ad26354940b2be22e6188/gimli-gcc.s

The puzzling/interesting thing here is that, if I pull out the inner loop into
a standalone function, it then seems to get vectorized into AVX instructions
just fine:

$ gcc -O3 -march=native -S -fopt-info-vec-all oneround.c
oneround.c:8:3: optimized: loop vectorized using 16 byte vectors
oneround.c:4:6: note: vectorized 1 loops in function.
oneround.c:17:1: note: ***** Analysis failed with vector mode VOID

  https://gist.githubusercontent.com/arachsys/afff76fae190ce133da2ebd7b374c478/raw/156bc1240fd858b2088ad26354940b2be22e6188/oneround.c

  https://gist.githubusercontent.com/arachsys/afff76fae190ce133da2ebd7b374c478/raw/156bc1240fd858b2088ad26354940b2be22e6188/oneround-gcc.s

(If called from the outer function, the inner one needs to be non-static, or
the optimiser will inline it - and then the vectorization fails again.)

Is there something about the larger function that upsets gcc's vectorizer
or other optimisations in a way that the smaller one doesn't? Can I coax gcc
into generating comparable code to llvm here with a #pragma, flag or other
hint?

Many thanks in advance for any pointers or suggestions.

                 reply	other threads:[~2020-07-12 18:44 UTC|newest]

Thread overview: [no followups] expand[flat|nested]  mbox.gz  Atom feed

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200712184444.GB30327@arachsys.com \
    --to=chris@arachsys.com \
    --cc=gcc-help@gcc.gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).