public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/64844] New: Vectorization inhibited in gcc5 when loop starts with elem[1], aarch64 perf regression from 4.9.1
@ 2015-01-28 19:52 chris_s_jones at yahoo dot com
  2015-01-28 20:42 ` [Bug tree-optimization/64844] [5 Regression] " rguenth at gcc dot gnu.org
From: chris_s_jones at yahoo dot com @ 2015-01-28 19:52 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64844

            Bug ID: 64844
           Summary: Vectorization inhibited in gcc5 when loop starts with
                    elem[1], aarch64 perf regression from 4.9.1
           Product: gcc
           Version: 5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: chris_s_jones at yahoo dot com

Created attachment 34611
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=34611&action=edit
Simple test case

% ./trunk_aarch64/bin/aarch64-linux-gnu-gcc -v
Using built-in specs.
COLLECT_GCC=./trunk_aarch64/bin/aarch64-linux-gnu-gcc
COLLECT_LTO_WRAPPER=/local/trunk_aarch64/libexec/gcc/aarch64-linux-gnu/5.0.0/lto-wrapper
Target: aarch64-linux-gnu
Configured with: /local/src/gcc-trunk/configure --prefix=/local/trunk_aarch64
--target=aarch64-linux-gnu --with-sysroot=/local/trunk_aarch64/sysroot
--with-gmp=/local/trunk_aarch64 --with-mpc=/local/trunk_aarch64
--with-mpfr=/local/trunk_aarch64 --with-cloog=/local/trunk_aarch64
--with-isl=/local/trunk_aarch64 --enable-__cxa_atexit --with-gnu-as
--with-gnu-ld --enable-shared --disable-libssp --disable-libmudflap
--enable-languages=c,c++,fortran --disable-libsanitizer --disable-nls
Thread model: posix
gcc version 5.0.0 20150127 (experimental) (GCC)

With the command line shown below, GCC 5 appears to vectorize only the first
inlined call to compute() in the following code sample.  GCC 4.9.1 vectorizes
both calls.  As a result, the code built with the newer compiler takes a
nearly 50% performance hit.

File smdp.c:
#include <stdint.h>
#include <stdio.h>

inline double compute(size_t n,
                      double const * restrict a, double const * restrict b)
{
    double res = 0.0;
    for (size_t i = 0; i < n; ++i) {
        res += a[i] + b[i];
    }
    return res;
}


int
main(int argc, char **argv) {

    double ary1[1024];
    double ary2[1024];

    // Initialize arrays
    for (size_t i = 0; i < 1024; ++i) {
        ary1[i] = argc / (double)(i + 1);
        ary2[i] = argc + argc / (double) (i + 1);
    }

    // Compute two results using different starting elements
    printf("Result 0 is %f\n", compute(512, &ary1[0], &ary2[0]));
    printf("Result 1 is %f\n", compute(512, &ary1[1], &ary2[1]));

    return 0;
}

Command line:

% aarch64-linux-gnu-gcc -O3 -mcpu=cortex-a57 -ffast-math -g -std=c99 -o smdp.gcc5.test smdp.c

Code generated by GCC5:

Loop from first call to compute (vectorized):
  400460:       3ce06a60        ldr     q0, [x19,x0]
  400464:       3ce06a82        ldr     q2, [x20,x0]
  400468:       91004000        add     x0, x0, #0x10
  40046c:       f140041f        cmp     x0, #0x1, lsl #12
  400470:       4e62d400        fadd    v0.2d, v0.2d, v2.2d
  400474:       4e60d421        fadd    v1.2d, v1.2d, v0.2d
  400478:       54ffff41        b.ne    400460 <main+0x50>

Loop from second call to compute (not vectorized):
  400494:       fc607a81        ldr     d1, [x20,x0,lsl #3]
  400498:       fc607a62        ldr     d2, [x19,x0,lsl #3]
  40049c:       91000400        add     x0, x0, #0x1
  4004a0:       f108041f        cmp     x0, #0x201
  4004a4:       1e622821        fadd    d1, d1, d2
  4004a8:       1e612800        fadd    d0, d0, d1
  4004ac:       54ffff41        b.ne    400494 <main+0x84>

In GCC 4.9.1, I see the following code generated for the second call, following
a short prologue to handle the first data element:
  40048c:       3cc10402        ldr     q2, [x0],#16
  400490:       3cc10420        ldr     q0, [x1],#16
  400494:       eb13001f        cmp     x0, x19
  400498:       4e62d400        fadd    v0.2d, v0.2d, v2.2d
  40049c:       4e60d421        fadd    v1.2d, v1.2d, v0.2d
  4004a0:       54ffff61        b.ne    40048c <main+0xbc>
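
For readers less used to AArch64 assembly, the shape of the 4.9.1 code for
the second call (a short scalar prologue for one element, then a loop that
consumes two doubles per iteration) can be sketched in plain C.  This is only
an illustration, not compiler output: the name compute_peeled and the two
"lane" accumulators are hypothetical, and the idea that the prologue exists
to bring &ary1[1] to 16-byte alignment is an assumption.  Splitting the sum
into two partial sums changes the floating-point evaluation order, which is
only permitted here because the test is built with -ffast-math.

```c
#include <stddef.h>

/* Hypothetical illustration of a peeled, 2-lane reduction.
   Not part of the attached test case and not generated by GCC. */
static double compute_peeled(size_t n, const double *a, const double *b)
{
    double lane0 = 0.0;   /* stands in for vector lane 0 */
    double lane1 = 0.0;   /* stands in for vector lane 1 */
    size_t i = 0;

    /* Prologue: handle the first element on its own (assumed to be
       what lets the remaining accesses start on a 16-byte boundary). */
    if (n > 0) {
        lane0 = a[0] + b[0];
        i = 1;
    }

    /* Main body: two elements per iteration, like one 2 x double
       load and fadd per array. */
    for (; i + 1 < n; i += 2) {
        lane0 += a[i] + b[i];
        lane1 += a[i + 1] + b[i + 1];
    }

    /* Epilogue: at most one leftover element. */
    for (; i < n; ++i) {
        lane0 += a[i] + b[i];
    }

    /* Final horizontal add of the two partial sums. */
    return lane0 + lane1;
}
```

Calling compute_peeled(512, &ary1[1], &ary2[1]) in place of the second
compute() call should give the same result up to floating-point rounding
differences from the reordered additions.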



