public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/64844] New: Vectorization inhibited in gcc5 when loop starts with elem[1], aarch64 perf regression from 4.9.1
@ 2015-01-28 19:52 chris_s_jones at yahoo dot com
  2015-01-28 20:42 ` [Bug tree-optimization/64844] [5 Regression] " rguenth at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: chris_s_jones at yahoo dot com @ 2015-01-28 19:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64844

            Bug ID: 64844
           Summary: Vectorization inhibited in gcc5 when loop starts with
                    elem[1], aarch64 perf regression from 4.9.1
           Product: gcc
           Version: 5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: chris_s_jones at yahoo dot com

Created attachment 34611
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=34611&action=edit
Simple test case

% ./trunk_aarch64/bin/aarch64-linux-gnu-gcc -v
Using built-in specs.
COLLECT_GCC=./trunk_aarch64/bin/aarch64-linux-gnu-gcc
COLLECT_LTO_WRAPPER=/local/trunk_aarch64/libexec/gcc/aarch64-linux-gnu/5.0.0/lto-wrapper
Target: aarch64-linux-gnu
Configured with: /local/src/gcc-trunk/configure --prefix=/local/trunk_aarch64
--target=aarch64-linux-gnu --with-sysroot=/local/trunk_aarch64/sysroot
--with-gmp=/local/trunk_aarch64 --with-mpc=/local/trunk_aarch64
--with-mpfr=/local/trunk_aarch64 --with-cloog=/local/trunk_aarch64
--with-isl=/local/trunk_aarch64 --enable-__cxa_atexit --with-gnu-as
--with-gnu-ld --enable-shared --disable-libssp --disable-libmudflap
--enable-languages=c,c++,fortran --disable-libsanitizer --disable-nls
Thread model: posix
gcc version 5.0.0 20150127 (experimental) (GCC)

For the following code sample, only the first inlined call to compute() seems
to get vectorized by GCC5 using the command line shown below.  In GCC 4.9.1,
both calls get vectorized.  This results in a nearly 50% performance hit for
the newer compiler.

File smdp.c:
#include <stdint.h>
#include <stdio.h>

inline double compute(size_t n,
                      double const * restrict a, double const * restrict b)
{
    double res = 0.0;
    for (size_t i = 0; i < n; ++i) {
        res += a[i] + b[i];
    }
    return res;
}


int
main(int argc, char **argv) {

    double ary1[1024];
    double ary2[1024];

    // Initialize arrays
    for (size_t i = 0; i < 1024; ++i) {
        ary1[i] = argc / (double)(i + 1);
        ary2[i] = argc + argc / (double) (i + 1);
    }

    // Compute two results using different starting elements
    printf("Result 0 is %f\n", compute(512, &ary1[0], &ary2[0]));
    printf("Result 1 is %f\n", compute(512, &ary1[1], &ary2[1]));

    return 0;
}

Command line:

% aarch64-linux-gnu-gcc -O3 -mcpu=cortex-a57 -ffast-math -g -std=c99 -o smdp.gcc5.test smdp.c

Code generated by GCC5:

Loop from first call to compute (vectorized):
  400460:       3ce06a60        ldr     q0, [x19,x0]
  400464:       3ce06a82        ldr     q2, [x20,x0]
  400468:       91004000        add     x0, x0, #0x10
  40046c:       f140041f        cmp     x0, #0x1, lsl #12
  400470:       4e62d400        fadd    v0.2d, v0.2d, v2.2d
  400474:       4e60d421        fadd    v1.2d, v1.2d, v0.2d
  400478:       54ffff41        b.ne    400460 <main+0x50>

Loop from second call to compute (not vectorized):
  400494:       fc607a81        ldr     d1, [x20,x0,lsl #3]
  400498:       fc607a62        ldr     d2, [x19,x0,lsl #3]
  40049c:       91000400        add     x0, x0, #0x1
  4004a0:       f108041f        cmp     x0, #0x201
  4004a4:       1e622821        fadd    d1, d1, d2
  4004a8:       1e612800        fadd    d0, d0, d1
  4004ac:       54ffff41        b.ne    400494 <main+0x84>

In GCC 4.9.1, I see the following code generated for the second call, following
a short prologue to handle the first data element:
  40048c:       3cc10402        ldr     q2, [x0],#16
  400490:       3cc10420        ldr     q0, [x1],#16
  400494:       eb13001f        cmp     x0, x19
  400498:       4e62d400        fadd    v0.2d, v0.2d, v2.2d
  40049c:       4e60d421        fadd    v1.2d, v1.2d, v0.2d
  4004a0:       54ffff61        b.ne    40048c <main+0xbc>
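
For readers unfamiliar with the transformation, the "short prologue" plus
vector body seen in the 4.9.1 output can be sketched by hand in C. This is only
an illustration of peeling for alignment, not what GCC emits: the function name
compute_peeled and the two-accumulator layout are assumptions made for the
sketch, and the reassociated sum is only equivalent to the original under
-ffast-math style rules.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hand-peeled variant of compute(); illustration only. */
double compute_peeled(size_t n,
                      double const * restrict a, double const * restrict b)
{
    double res = 0.0;
    size_t i = 0;

    /* Prologue: scalar iterations until &a[i] is 16-byte aligned.
       For &ary1[1] on a 16-byte aligned array this peels one element. */
    while (i < n && ((uintptr_t)&a[i] % 16) != 0) {
        res += a[i] + b[i];
        ++i;
    }

    /* Main body: two doubles per iteration, kept in two partial sums,
       which corresponds to the two lanes of a v.2d accumulator. */
    double acc0 = 0.0, acc1 = 0.0;
    for (; i + 2 <= n; i += 2) {
        acc0 += a[i] + b[i];
        acc1 += a[i + 1] + b[i + 1];
    }

    /* Epilogue: at most one leftover element. */
    for (; i < n; ++i) {
        res += a[i] + b[i];
    }

    return res + acc0 + acc1;
}
```

The sketch only tests the alignment of a; in the test case both arrays are
offset by the same amount, so aligning one is assumed to align the other.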



* [Bug tree-optimization/64844] [5 Regression] Vectorization inhibited in gcc5 when loop starts with elem[1], aarch64 perf regression from 4.9.1
From: rguenth at gcc dot gnu.org @ 2015-01-28 20:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64844

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |aarch64-linux-gnu
             Status|UNCONFIRMED                 |ASSIGNED
           Keywords|                            |missed-optimization
   Last reconfirmed|                            |2015-01-28
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1
            Summary|Vectorization inhibited in  |[5 Regression]
                   |gcc5 when loop starts with  |Vectorization inhibited in
                   |elem[1], aarch64 perf       |gcc5 when loop starts with
                   |regression from 4.9.1       |elem[1], aarch64 perf
                   |                            |regression from 4.9.1
   Target Milestone|---                         |5.0

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
I will have a look.



* [Bug target/64844] [5 Regression] Vectorization inhibited in gcc5 when loop starts with elem[1], aarch64 perf regression from 4.9.1
From: pinskia at gcc dot gnu.org @ 2015-01-29  2:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64844

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|tree-optimization           |target

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
t.c:8:5: note: === vect_update_slp_costs_according_to_vf ===
t.c:8:5: note: cost model: the vector iteration cost = 26 divided by the scalar iteration cost = 10 is greater or equal to the vectorization factor = 2.
t.c:8:5: note: not vectorized: vectorization not profitable.
t.c:8:5: note: not vectorized: vector version will never be profitable.
t.c:8:5: note: bad operation or unsupported loop bound.


A cost model issue with cortex-a57.  The cost model changed in GCC 5 for
cortex-a57.  I think this is due to unaligned loads.
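
The rejection in the dump above comes down to one comparison. A minimal sketch
of that check, written directly from the wording of the dump message (the
function name vector_never_profitable is hypothetical, and this is not the GCC
source):

```c
#include <stdbool.h>

/* Hypothetical helper: true when "vector iteration cost divided by the
   scalar iteration cost is greater or equal to the vectorization factor".
   Multiplying out avoids the division. */
bool vector_never_profitable(int vec_iter_cost, int scalar_iter_cost, int vf)
{
    return vec_iter_cost >= scalar_iter_cost * vf;
}
```

With the values from the dump, vector_never_profitable(26, 10, 2) is true
because 26 >= 20, so the loop is rejected. Any vector iteration cost below 20
would pass this particular check.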



* [Bug target/64844] [5 Regression] Vectorization inhibited in gcc5 when loop starts with elem[1], aarch64 perf regression from 4.9.1
From: rguenth at gcc dot gnu.org @ 2015-01-29  9:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64844

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #2)
> t.c:8:5: note: === vect_update_slp_costs_according_to_vf ===
> t.c:8:5: note: cost model: the vector iteration cost = 26 divided by the scalar iteration cost = 10 is greater or equal to the vectorization factor = 2.
> t.c:8:5: note: not vectorized: vectorization not profitable.
> t.c:8:5: note: not vectorized: vector version will never be profitable.
> t.c:8:5: note: bad operation or unsupported loop bound.
> 
> 
> A cost model issue with cortex-a57.  The cost model changed in GCC 5 for
> cortex-a57.  I think this is due to unaligned loads.

But we (should) have known misalignment here and thus peeling for alignment
should be able to arrange for aligned vectors.  That holds only if aarch64
can align the stack properly.  If not, then the first loop should behave the
same as the 2nd...

Right:

t.c:7:5: note: vect_model_load_cost: aligned.
t.c:7:5: note: vect_get_data_access_cost: inside_cost = 5, outside_cost = 0.
t.c:7:5: note: vect_model_load_cost: aligned.
t.c:7:5: note: vect_get_data_access_cost: inside_cost = 10, outside_cost = 0.
t.c:7:5: note: Try peeling by 1
t.c:7:5: note: Alignment of access forced using peeling.
t.c:7:5: note: Peeling for alignment will be applied.

but the costs are odd.

For the first (aligned) loop we get

t.c:7:5: note: Cost model analysis:
  Vector inside of loop cost: 16
  Vector prologue cost: 8
  Vector epilogue cost: 11
  Scalar iteration cost: 10
  Scalar outside cost: 0
  Vector outside cost: 19
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 10

while for the 2nd:

t.c:7:5: note: Cost model analysis:
  Vector inside of loop cost: 26
  Vector prologue cost: 18
  Vector epilogue cost: 21
  Scalar iteration cost: 10
  Scalar outside cost: 0
  Vector outside cost: 39
  prologue iterations: 1
  epilogue iterations: 1

while the vector inside of loop cost should be the same.

The issue is that both vect_enhance_data_refs_alignment at analysis time
and vectorizable_load at transform time account for the cost via the
add_stmt_cost hook.
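
The numbers in the dumps are consistent with this reading: the inflated inside
cost of 26 equals the aligned loop's 16 plus the inside_cost = 10 reported by
vect_get_data_access_cost. A toy model of that double registration follows. The
struct and function names are hypothetical stand-ins for the cost hook, and the
assumption that exactly the data access cost is counted a second time is
inferred from the dumps, not taken from the GCC source.

```c
/* Hypothetical accumulator standing in for the target cost hook. */
struct loop_cost {
    int inside;
};

static void record_cost(struct loop_cost *c, int cost)
{
    c->inside += cost;
}

/* base_cost: inside-of-loop cost when every statement is counted once.
   access_cost: cost of the loads, registered extra_registrations more times. */
int inside_cost_with_duplicates(int base_cost, int access_cost,
                                int extra_registrations)
{
    struct loop_cost c = { 0 };

    record_cost(&c, base_cost);
    for (int k = 0; k < extra_registrations; ++k) {
        record_cost(&c, access_cost);
    }
    return c.inside;
}
```

Under these assumptions, inside_cost_with_duplicates(16, 10, 1) gives the 26
seen for the second loop, and inside_cost_with_duplicates(16, 10, 0) gives the
16 seen for the first.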

With that fixed we get

t.c:7:5: note: Cost model analysis:
  Vector inside of loop cost: 16
  Vector prologue cost: 18
  Vector epilogue cost: 21
  Scalar iteration cost: 10
  Scalar outside cost: 0
  Vector outside cost: 39
  prologue iterations: 1
  epilogue iterations: 1
  Calculated minimum iters for profitability: 12

which is more reasonable and vectorizes both loops as expected.
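
The "Calculated minimum iters for profitability" values can be reproduced from
the other fields of the dumps. The sketch below is an assumption about the
arithmetic, chosen because it matches both the 10 and the 12 shown above. The
function name and parameter names are hypothetical, special cases are omitted,
and the formula should be checked against vect_estimate_min_profitable_iters
in tree-vect-loop.c before relying on it.

```c
/* Hypothetical reconstruction of the break-even estimate.
   Returns -1 when the vector loop can never be profitable. */
int min_profitable_iters(int vec_inside, int vec_outside, int scalar_outside,
                         int scalar_iter, int vf,
                         int peel_prologue, int peel_epilogue)
{
    /* Cost saved by one vector iteration over vf scalar iterations. */
    int gain = scalar_iter * vf - vec_inside;
    if (gain <= 0)
        return -1;

    /* One-off overhead of the vector version, scaled by vf. */
    int overhead = (vec_outside - scalar_outside) * vf;

    int iters = (overhead
                 - vec_inside * peel_prologue
                 - vec_inside * peel_epilogue) / gain;

    /* Round up if the scalar loop is still no more expensive at this count. */
    if (scalar_iter * vf * iters <= vec_inside * iters + overhead)
        iters++;

    return iters;
}
```

For the aligned loop, min_profitable_iters(16, 19, 0, 10, 2, 0, 0) yields 10.
For the fixed second loop, min_profitable_iters(16, 39, 0, 10, 2, 1, 1) yields
12. With the double-counted inside cost of 26 the gain is negative and the
sketch returns -1, which corresponds to the "never be profitable" message.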



* [Bug target/64844] [5 Regression] Vectorization inhibited in gcc5 when loop starts with elem[1], aarch64 perf regression from 4.9.1
From: rguenth at gcc dot gnu.org @ 2015-01-29  9:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64844

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 34613
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=34613&action=edit
patch

Patch I am testing.



* [Bug target/64844] [5 Regression] Vectorization inhibited in gcc5 when loop starts with elem[1], aarch64 perf regression from 4.9.1
From: rguenth at gcc dot gnu.org @ 2015-01-29 12:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64844

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Author: rguenth
Date: Thu Jan 29 12:53:39 2015
New Revision: 220244

URL: https://gcc.gnu.org/viewcvs?rev=220244&root=gcc&view=rev
Log:
2015-01-29  Richard Biener  <rguenther@suse.de>

    PR tree-optimization/64844
    * tree-vect-loop.c (vect_estimate_min_profitable_iters): Always
    dump cost model analysis.
    * tree-vect-data-refs.c (vect_enhance_data_refs_alignment):
    Do not register adjusted load/store costs here.

    * gcc.dg/vect/pr64844.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/pr64844.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-data-refs.c
    trunk/gcc/tree-vect-loop.c



* [Bug target/64844] [5 Regression] Vectorization inhibited in gcc5 when loop starts with elem[1], aarch64 perf regression from 4.9.1
From: rguenth at gcc dot gnu.org @ 2015-01-29 12:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64844

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixed.

