[Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized

public inbox for gcc-bugs@sourceware.org

* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
       [not found] <bug-37021-4@http.gcc.gnu.org/bugzilla/>
@ 2011-03-25 11:49 ` sebastian.hegler@tu-dresden.de
  2011-03-25 12:27 ` sebastian.hegler@tu-dresden.de
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: sebastian.hegler@tu-dresden.de @ 2011-03-25 11:49 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021

sebastian.hegler@tu-dresden.de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sebastian.hegler@tu-dresden.de

--- Comment #10 from sebastian.hegler@tu-dresden.de 2011-03-25 10:45:47 UTC ---
This one, as well as PR 33133, should be handled by "-floop-interchange". 

Fortran is column-major, so interchanging inner and outer loop would allow the
loops to be coalesced into one, which in turn should be easily vectorized (if
complex numbers can be vectorized, see PR 40770). 

Can you please give me some hints on how to find out if "-floop-interchange"
actually does that? Thanks!
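For readers following along, here is a minimal sketch of what interchanging the two loops means for this kind of reduction. It is an illustration only, not what GCC emits: it uses real doubles instead of complex values, the size N and the function names are made up, and the Fortran array a(i,j) is emulated by a C array at[j][i] so that the i index is the contiguous one.

```c
#include <stddef.h>

enum { N = 4 };  /* illustrative size, not taken from the report */

/* Original order: the inner loop walks j, so consecutive iterations
   touch elements of at[][] that are N doubles apart. */
void reduce_ij(double self[N], double at[N][N], const double b[N])
{
  for (size_t i = 0; i < N; i++)
    for (size_t j = 0; j < N; j++)
      self[i] += at[j][i] * b[j];
}

/* Interchanged order: the inner loop walks i, which is contiguous in
   at[j][i].  Each self[i] still receives its j terms in the same order,
   so the result is unchanged. */
void reduce_ji(double self[N], double at[N][N], const double b[N])
{
  for (size_t j = 0; j < N; j++)
    for (size_t i = 0; i < N; i++)
      self[i] += at[j][i] * b[j];
}
```

After the interchange the inner loop is no longer a reduction into a single element, but a unit-stride update of self[0..N-1].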


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
       [not found] <bug-37021-4@http.gcc.gnu.org/bugzilla/>
  2011-03-25 11:49 ` [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized sebastian.hegler@tu-dresden.de
@ 2011-03-25 12:27 ` sebastian.hegler@tu-dresden.de
  2011-03-25 13:13 ` rguenther at suse dot de
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: sebastian.hegler@tu-dresden.de @ 2011-03-25 12:27 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021

--- Comment #11 from sebastian.hegler@tu-dresden.de 2011-03-25 11:38:37 UTC ---
Forget that about folding stuff into one loop, I didn't have my morning coffee
yet. However, the rest still applies. 

I'm looking forward to some help in that regard.

Thanks.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
       [not found] <bug-37021-4@http.gcc.gnu.org/bugzilla/>
  2011-03-25 11:49 ` [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized sebastian.hegler@tu-dresden.de
  2011-03-25 12:27 ` sebastian.hegler@tu-dresden.de
@ 2011-03-25 13:13 ` rguenther at suse dot de
  2012-07-13  8:46 ` rguenth at gcc dot gnu.org
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: rguenther at suse dot de @ 2011-03-25 13:13 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021

--- Comment #12 from rguenther at suse dot de <rguenther at suse dot de> 2011-03-25 12:40:10 UTC ---
On Fri, 25 Mar 2011, sebastian.hegler@tu-dresden.de wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021
> 
> --- Comment #11 from sebastian.hegler@tu-dresden.de 2011-03-25 11:38:37 UTC ---
> Forget that about folding stuff into one loop, I didn't have my morning coffee
> yet. However, the rest still applies. 
> 
> I'm looking forward to some help in that regard.

Look at dump files (-fdump-tree-all-details).

Richard.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
       [not found] <bug-37021-4@http.gcc.gnu.org/bugzilla/>
                   ` (2 preceding siblings ...)
  2011-03-25 13:13 ` rguenther at suse dot de
@ 2012-07-13  8:46 ` rguenth at gcc dot gnu.org
  2013-02-13 15:58 ` rguenth at gcc dot gnu.org
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-07-13  8:46 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |53947

--- Comment #13 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-07-13 08:45:12 UTC ---
Link to vectorizer missed-optimization meta-bug.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
       [not found] <bug-37021-4@http.gcc.gnu.org/bugzilla/>
                   ` (3 preceding siblings ...)
  2012-07-13  8:46 ` rguenth at gcc dot gnu.org
@ 2013-02-13 15:58 ` rguenth at gcc dot gnu.org
  2013-03-27 10:39 ` rguenth at gcc dot gnu.org
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: rguenth at gcc dot gnu.org @ 2013-02-13 15:58 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021

--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> 2013-02-13 15:58:31 UTC ---
The following testcase shows the issue well:

_Complex double self[1024];
_Complex double a[1024][1024];
_Complex double b[1024];

void foo (void)
{
  int i, j;
  for (i = 0; i < 1024; i+=3)
    for (j = 0; j < 1024; j+=3)
      self[i] = self[i] + a[i][j]*b[j];
}

we have to get the complex multiplication pattern recognized by SLP
which looks like (without PRE):

  <bb 3>:

  <bb 4>:
  # j_21 = PHI <j_13(3), 0(7)>
  # self_I_RE_lsm.2_12 = PHI <_26(3), self_I_RE_lsm.2_7(7)>
  # self_I_IM_lsm.3_28 = PHI <_27(3), self_I_IM_lsm.3_8(7)>
  # ivtmp_16 = PHI <ivtmp_1(3), 342(7)>
  _2 = REALPART_EXPR <a[i_20][j_21]>;
  _18 = IMAGPART_EXPR <a[i_20][j_21]>;
  _19 = REALPART_EXPR <b[j_21]>;
  _17 = IMAGPART_EXPR <b[j_21]>;
  _4 = _19 * _2;
  _3 = _18 * _17;
  _6 = _17 * _2;
  _23 = _19 * _18;
  _24 = _4 - _3;
  _25 = _23 + _6;
  _26 = _24 + self_I_RE_lsm.2_12;
  _27 = _25 + self_I_IM_lsm.3_28;
  j_13 = j_21 + 3;
  ivtmp_1 = ivtmp_16 - 1;
  if (ivtmp_1 != 0)
    goto <bb 3>;

we fail to build the SLP tree for _25 = _23 + _6 because the matching
stmt is _24 = _4 - _3, which has a different operation (SSE3 addsub
would support vectorizing this).  I don't see how we can easily
make this supported with the current pattern support ... the
support doesn't allow tying together two SLP group members.
Simply allowing analysis to proceed here reveals the fact that
the interleaving has a gap of 6, which makes the analysis fail.
Allowing it to proceed for ncopies == 1 (thus no actual interleaving
required) reveals that the next check is slightly bogus in that case.
Fixing that ends us with

t.c:9: note: Load permutation 0 0 1 0 1 1 0 1
t.c:9: note: Build SLP failed: unsupported load permutation _27 = _25 +
self_I_IM_lsm.3_28;

... (to be continued)
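As a reading aid for the dump above, the same multiply-accumulate can be written out in plain C. The struct and function names are hypothetical; the comments map each line to the SSA name in the dump, assuming x stands for a[i][j] and y for b[j]. The point of interest is that the real lane ends in a subtraction while the imaginary lane ends in an addition.

```c
/* Hypothetical helper type; the GIMPLE dump works on separate
   REALPART/IMAGPART scalars rather than on a struct. */
struct cplx { double re, im; };

/* Returns acc + x * y, spelled out the way the dump does it. */
struct cplx cmul_acc(struct cplx acc, struct cplx x, struct cplx y)
{
  double rr = y.re * x.re;   /* _4  = _19 * _2  */
  double ii = x.im * y.im;   /* _3  = _18 * _17 */
  double ir = y.im * x.re;   /* _6  = _17 * _2  */
  double ri = y.re * x.im;   /* _23 = _19 * _18 */
  double re = rr - ii;       /* _24 = _4 - _3   (MINUS in the real lane) */
  double im = ri + ir;       /* _25 = _23 + _6  (PLUS in the imaginary lane) */
  struct cplx r = { re + acc.re, im + acc.im };  /* _26 and _27 */
  return r;
}
```

An SLP group made of the two statements computing re and im therefore mixes two different operation codes, which is the mismatch described above.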


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
       [not found] <bug-37021-4@http.gcc.gnu.org/bugzilla/>
                   ` (4 preceding siblings ...)
  2013-02-13 15:58 ` rguenth at gcc dot gnu.org
@ 2013-03-27 10:39 ` rguenth at gcc dot gnu.org
  2013-03-27 10:40 ` rguenth at gcc dot gnu.org
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: rguenth at gcc dot gnu.org @ 2013-03-27 10:39 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021

--- Comment #15 from Richard Biener <rguenth at gcc dot gnu.org> 2013-03-27 10:39:00 UTC ---
Author: rguenth
Date: Wed Mar 27 10:38:29 2013
New Revision: 197158

URL: http://gcc.gnu.org/viewcvs?rev=197158&root=gcc&view=rev
Log:
2013-03-27  Richard Biener  <rguenther@suse.de>

    PR tree-optimization/37021
    * tree-vect-data-refs.c (vect_check_strided_load): Allow
    REALPART/IMAGPART_EXPRs around the supported refs.
    * tree-ssa-structalias.c (find_func_aliases): Assume that
    floating-point values are not used to transfer pointers.

    * gfortran.dg/vect/fast-math-pr37021.f90: New testcase.

Added:
    trunk/gcc/testsuite/gfortran.dg/vect/fast-math-pr37021.f90
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-structalias.c
    trunk/gcc/tree-vect-data-refs.c


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
       [not found] <bug-37021-4@http.gcc.gnu.org/bugzilla/>
                   ` (5 preceding siblings ...)
  2013-03-27 10:39 ` rguenth at gcc dot gnu.org
@ 2013-03-27 10:40 ` rguenth at gcc dot gnu.org
  2013-04-07 13:18 ` dominiq at lps dot ens.fr
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: rguenth at gcc dot gnu.org @ 2013-03-27 10:40 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021

--- Comment #16 from Richard Biener <rguenth at gcc dot gnu.org> 2013-03-27 10:40:40 UTC ---
We now vectorize this testcase by means of using strided loads, relying on
store motion turning the store to self(i) in the innermost loop into a
reduction (no support for vectorized strided stores).
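A hand-written sketch of the effect of store motion on such a loop nest follows. It only illustrates the transformation on real doubles; the sizes and function names are invented for the example and this is not the code GCC generates.

```c
enum { ROWS = 3, COLS = 4 };  /* illustrative sizes */

/* As written in the source: every inner iteration loads and stores
   self[i] in memory. */
void accumulate_in_memory(double self[ROWS], double a[ROWS][COLS],
                          const double b[COLS])
{
  for (int i = 0; i < ROWS; i++)
    for (int j = 0; j < COLS; j++)
      self[i] = self[i] + a[i][j] * b[j];
}

/* After store motion, done by hand: the running value lives in a local
   scalar and is stored once after the inner loop, which turns the inner
   loop into a plain reduction. */
void accumulate_in_register(double self[ROWS], double a[ROWS][COLS],
                            const double b[COLS])
{
  for (int i = 0; i < ROWS; i++) {
    double acc = self[i];
    for (int j = 0; j < COLS; j++)
      acc += a[i][j] * b[j];
    self[i] = acc;
  }
}
```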


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
       [not found] <bug-37021-4@http.gcc.gnu.org/bugzilla/>
                   ` (6 preceding siblings ...)
  2013-03-27 10:40 ` rguenth at gcc dot gnu.org
@ 2013-04-07 13:18 ` dominiq at lps dot ens.fr
  2015-05-12 11:56 ` rguenth at gcc dot gnu.org
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: dominiq at lps dot ens.fr @ 2013-04-07 13:18 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021

--- Comment #17 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2013-04-07 13:18:27 UTC ---
The test gfortran.dg/vect/fast-math-pr37021.f90 fails on powerpc*-* (see
http://gcc.gnu.org/ml/gcc-testresults/2013-04/msg00677.html ). Isn't it
expected? AFAICT doubles are not vectorized, at least on a G5.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
       [not found] <bug-37021-4@http.gcc.gnu.org/bugzilla/>
                   ` (7 preceding siblings ...)
  2013-04-07 13:18 ` dominiq at lps dot ens.fr
@ 2015-05-12 11:56 ` rguenth at gcc dot gnu.org
  2015-06-10 10:45 ` rguenth at gcc dot gnu.org
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-05-12 11:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021

--- Comment #18 from Richard Biener <rguenth at gcc dot gnu.org> ---
Author: rguenth
Date: Tue May 12 11:55:40 2015
New Revision: 223059

URL: https://gcc.gnu.org/viewcvs?rev=223059&root=gcc&view=rev
Log:
2015-05-12  Richard Biener  <rguenther@suse.de>

        PR tree-optimization/37021
        * tree-vectorizer.h (struct _slp_tree): Add two_operators flag.
        (SLP_TREE_TWO_OPERATORS): New define.
        * tree-vect-slp.c (vect_create_new_slp_node): Initialize
        SLP_TREE_TWO_OPERATORS.
        (vect_build_slp_tree_1): Allow two mixing plus/minus in an
        SLP node.
        (vect_build_slp_tree): Adjust.
        (vect_analyze_slp_cost_1): Likewise.
        (vect_schedule_slp_instance): Vectorize mixing plus/minus by
        emitting two vector stmts and mixing the results.

        * gcc.target/i386/vect-addsub.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.target/i386/vect-addsub.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-slp.c
    trunk/gcc/tree-vectorizer.h
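The last ChangeLog entry ("emitting two vector stmts and mixing the results") can be pictured with a two-lane toy model. This is a sketch under stated assumptions: struct v2 and the helper names are made up for illustration and do not correspond to anything in tree-vect-slp.c.

```c
/* Two-lane stand-in for a vector of two doubles (illustrative type). */
struct v2 { double lane[2]; };

struct v2 v2_add(struct v2 x, struct v2 y)
{
  struct v2 r = { { x.lane[0] + y.lane[0], x.lane[1] + y.lane[1] } };
  return r;
}

struct v2 v2_sub(struct v2 x, struct v2 y)
{
  struct v2 r = { { x.lane[0] - y.lane[0], x.lane[1] - y.lane[1] } };
  return r;
}

/* Mixed minus/plus node: compute both full-width results, then take
   lane 0 from the subtraction and lane 1 from the addition. */
struct v2 v2_sub_add(struct v2 x, struct v2 y)
{
  struct v2 s = v2_sub(x, y);
  struct v2 a = v2_add(x, y);
  struct v2 r = { { s.lane[0], a.lane[1] } };
  return r;
}
```

On targets with a fused add/sub instruction the three steps could in principle collapse into one operation; otherwise the mix is a permute or blend of the two results.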


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
       [not found] <bug-37021-4@http.gcc.gnu.org/bugzilla/>
                   ` (8 preceding siblings ...)
  2015-05-12 11:56 ` rguenth at gcc dot gnu.org
@ 2015-06-10 10:45 ` rguenth at gcc dot gnu.org
  2015-08-25  8:11 ` rguenth at gcc dot gnu.org
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-06-10 10:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Depends on|                            |56766
         Resolution|---                         |FIXED
   Target Milestone|---                         |6.0

--- Comment #19 from Richard Biener <rguenth at gcc dot gnu.org> ---
We now vectorize the original testcase with SLP.  There is still PR56766 which
causes us to fail to use addsubpd on x86_64 with SSE3.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56766
[Bug 56766] Fails to combine (vec_select (vec_concat ...)) to (vec_merge ...)


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
       [not found] <bug-37021-4@http.gcc.gnu.org/bugzilla/>
                   ` (9 preceding siblings ...)
  2015-06-10 10:45 ` rguenth at gcc dot gnu.org
@ 2015-08-25  8:11 ` rguenth at gcc dot gnu.org
  2015-08-27 22:09 ` wschmidt at gcc dot gnu.org
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-08-25  8:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021

--- Comment #21 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Bill Schmidt from comment #20)
> We still don't vectorize the original code example on Power.  It appears
> that this is being disabled because of an alignment issue.  The data
> references are being rejected by:
> 
> product.f:9:0: note: can't force alignment of ref: REALPART_EXPR
> <*a.0_24[_50]>
> 
> and similar for the other three DRs.  This happens due to this code in
> vect_compute_data_ref_alignment:
> 
>   if (base_alignment < TYPE_ALIGN (vectype))
>     {
>       /* Strip an inner MEM_REF to a bare decl if possible.  */
>       if (TREE_CODE (base) == MEM_REF
>           && integer_zerop (TREE_OPERAND (base, 1))
>           && TREE_CODE (TREE_OPERAND (base, 0)) == ADDR_EXPR)
>         base = TREE_OPERAND (TREE_OPERAND (base, 0), 0);
> 
>       if (!vect_can_force_dr_alignment_p (base, TYPE_ALIGN (vectype)))
>         {
>           if (dump_enabled_p ())
>             {
>               dump_printf_loc (MSG_NOTE, vect_location,
>                                "can't force alignment of ref: ");
>               dump_generic_expr (MSG_NOTE, TDF_SLIM, ref);
>               dump_printf (MSG_NOTE, "\n");
>             }
>           return true;
>         }
> 
> Here TYPE_ALIGN (vectype) is 128 (Power vectors are normally aligned on a
> 128-bit value), and base_alignment is 64.  a.0 is defined as:
> 
> complex(kind=8) [0:D.1831] * restrict a.0;
> 
> In both ELFv1 and ELFv2 ABIs for Power, a complex type is defined to have
> the same alignment as the underlying type.  So "complex double" has 8-byte
> alignment.
> 
> On earlier versions of Power, the decision is fine, because unaligned
> accesses are expensive prior to POWER8.  With POWER8, though, an unaligned
> access will (most of the time) perform as well as an aligned access.  So
> ideally we would like to teach the vectorizer to allow vectorization here.
> 
> It seems like vect_supportable_dr_alignment ought to be considered as part
> of the SLP vectorization decision here, rather than just comparing the base
> alignment with the vector type alignment.  Adding a check for that allows
> things to get a little further, but we still don't vectorize the block.  (I
> haven't yet looked into why, but I assume more needs to be done downstream
> to handle this case.)
> 
> My understanding of the vectorizer is not yet very deep, so before going too
> far down the wrong path, I'd like your opinion on the best approach to
> fixing the problem.  Thanks!

I see it only failing due to cost issues (tried ppc64le and -mcpu=power8).
The unaligned loads cost 3 and we end up with

t.f90:8:0: note: Cost model analysis:
  Vector inside of loop cost: 40
  Vector prologue cost: 8
  Vector epilogue cost: 4
  Scalar iteration cost: 12
  Scalar outside cost: 6
  Vector outside cost: 12
  prologue iterations: 0
  epilogue iterations: 0
t.f90:8:0: note: cost model: the vector iteration cost = 40 divided by the
scalar iteration cost = 12 is greater or equal to the vectorization factor = 1.

Note that we are (still) not very good at estimating the SLP cost, as we
account for 4 vector loads here (because we essentially will end up with
4 different permutations used), so the "unaligned" part is accounted for
too much and likely the permutation cost as well.  Both are a limitation
of the SLP data structures and not easily fixable.  With
-fvect-cost-model=unlimited I see both loops vectorized.

> Bill
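The final line of the cost dump boils down to one comparison. The function below is a simplified reading of that message only, with a hypothetical name; it ignores the prologue, epilogue and outside-of-loop costs that the full cost model also considers.

```c
#include <assert.h>

/* Nonzero when one vector iteration is cheaper than the vf scalar
   iterations it replaces.  Simplified sketch of the check reported in
   the dump, not the vectorizer's actual code. */
int vector_body_pays_off(int vec_inside_cost, int scalar_iter_cost, int vf)
{
  assert(scalar_iter_cost > 0 && vf > 0);
  return vec_inside_cost < scalar_iter_cost * vf;
}
```

With the numbers from the dump, vector_body_pays_off(40, 12, 1) is false: one vector iteration costs 40 but replaces only a single scalar iteration costing 12.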


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
       [not found] <bug-37021-4@http.gcc.gnu.org/bugzilla/>
                   ` (10 preceding siblings ...)
  2015-08-25  8:11 ` rguenth at gcc dot gnu.org
@ 2015-08-27 22:09 ` wschmidt at gcc dot gnu.org
  2015-08-28  7:46 ` rguenther at suse dot de
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: wschmidt at gcc dot gnu.org @ 2015-08-27 22:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021

--- Comment #23 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
Created attachment 36261
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36261&action=edit
tree-slp-details dump

Ah, I was looking at the code in the test suite this time, rather than the raw
posted code, so the line numbers changed for the dejagnu commands.  The
statement number is now 12.

Attaching the details dump for SLP.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
       [not found] <bug-37021-4@http.gcc.gnu.org/bugzilla/>
                   ` (11 preceding siblings ...)
  2015-08-27 22:09 ` wschmidt at gcc dot gnu.org
@ 2015-08-28  7:46 ` rguenther at suse dot de
  2015-08-28 13:20 ` wschmidt at gcc dot gnu.org
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: rguenther at suse dot de @ 2015-08-28  7:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021

--- Comment #24 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 27 Aug 2015, wschmidt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021
> 
> --- Comment #22 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #21)
> > (In reply to Bill Schmidt from comment #20)
> 
> ...<snip>...
> > 
> > I see it only failing due to cost issues (tried ppc64le and -mcpu=power8).
> > The unaligned loads cost 3 and we end up with
> > 
> > t.f90:8:0: note: Cost model analysis:
> >   Vector inside of loop cost: 40
> >   Vector prologue cost: 8
> >   Vector epilogue cost: 4
> >   Scalar iteration cost: 12
> >   Scalar outside cost: 6
> >   Vector outside cost: 12
> >   prologue iterations: 0
> >   epilogue iterations: 0
> > t.f90:8:0: note: cost model: the vector iteration cost = 40 divided by the
> > scalar iteration cost = 12 is greater or equal to the vectorization factor =
> > 1.
> > 
> > Note that we are (still) not very good in estimating the SLP cost as we
> > account 4 vector loads here (because we essentially will end up with
> > 4 different permutations used), so the "unaligned" part is accounted for
> > too much and likely the permutation cost as well.  Both are a limitation
> > of the SLP data structures and not easily fixable.  With
> > -fvect-cost-model=unlimited I see both loops vectorized.
> 
> Yes, I get these same results for the loop vectorizer (using -O2
> -ftree-vectorize -mcpu=power8 -ffast-math).  But I was looking at the failure
> to do SLP vectorization.  In comment 19 you indicated this was now working,
> presumably on x86, but for Power we fail to SLP-vectorize
> fast-math-pr37021.f90:9:0.

Err, I meant loop SLP vectorization as opposed to loop vectorization
with interleaving...  Basic-block SLP doesn't work because (at least)
it does not handle reductions yet (I have done some early work here
but wasn't able to finish it).


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
       [not found] <bug-37021-4@http.gcc.gnu.org/bugzilla/>
                   ` (12 preceding siblings ...)
  2015-08-28  7:46 ` rguenther at suse dot de
@ 2015-08-28 13:20 ` wschmidt at gcc dot gnu.org
  2015-08-28 13:31 ` wschmidt at gcc dot gnu.org
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: wschmidt at gcc dot gnu.org @ 2015-08-28 13:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021

--- Comment #25 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
Ah, thank you for the clarification.  So does this require
-fvect-cost-model=unlimited on all targets?  If so, then I'll move on;
otherwise I'll have a look at the Power-specific cost issues.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
       [not found] <bug-37021-4@http.gcc.gnu.org/bugzilla/>
                   ` (13 preceding siblings ...)
  2015-08-28 13:20 ` wschmidt at gcc dot gnu.org
@ 2015-08-28 13:31 ` wschmidt at gcc dot gnu.org
  2015-10-22 10:03 ` rguenth at gcc dot gnu.org
  2023-07-21 12:28 ` rguenth at gcc dot gnu.org
  16 siblings, 0 replies; 26+ messages in thread
From: wschmidt at gcc dot gnu.org @ 2015-08-28 13:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021

--- Comment #26 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
(In reply to Bill Schmidt from comment #25)
> Ah, thank you for the clarification.  So does this require
> -fvect-cost-model=unlimited on all targets?  If so, then I'll move on;
> otherwise I'll have a look at the Power-specific cost issues.

Though, reading back, I see your comment on this not being easily fixable, so I
guess I know the answer.  Thanks again for your help.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
       [not found] <bug-37021-4@http.gcc.gnu.org/bugzilla/>
                   ` (14 preceding siblings ...)
  2015-08-28 13:31 ` wschmidt at gcc dot gnu.org
@ 2015-10-22 10:03 ` rguenth at gcc dot gnu.org
  2023-07-21 12:28 ` rguenth at gcc dot gnu.org
  16 siblings, 0 replies; 26+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-10-22 10:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021
Bug 37021 depends on bug 56902, which changed state.

Bug 56902 Summary: Fails to SLP with mismatched +/- and negatable constants
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56902

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
       [not found] <bug-37021-4@http.gcc.gnu.org/bugzilla/>
                   ` (15 preceding siblings ...)
  2015-10-22 10:03 ` rguenth at gcc dot gnu.org
@ 2023-07-21 12:28 ` rguenth at gcc dot gnu.org
  16 siblings, 0 replies; 26+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-07-21 12:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021
Bug 37021 depends on bug 54939, which changed state.

Bug 54939 Summary: Very poor vectorization of loops with complex arithmetic
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54939

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
  2008-08-04 17:57 [Bug tree-optimization/37021] New: " rguenth at gcc dot gnu dot org
                   ` (7 preceding siblings ...)
  2009-01-25 12:17 ` irar at il dot ibm dot com
@ 2009-01-27 12:40 ` dorit at gcc dot gnu dot org
  8 siblings, 0 replies; 26+ messages in thread
From: dorit at gcc dot gnu dot org @ 2009-01-27 12:40 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #9 from dorit at gcc dot gnu dot org  2009-01-27 12:40 -------
(In reply to comment #4)
> The testcase should be
> subroutine to_product_of(self,a,b,a1,a2)
>   complex(kind=8) :: self (:)
>   complex(kind=8), intent(in) :: a(:,:)
>   complex(kind=8), intent(in) :: b(:)
>   integer a1,a2
>   do i = 1,a1
>     do j = 1,a2
>       self(i) = self(i) + a(j,i)*b(j)
>     end do
>   end do
> end subroutine
> to be meaningful - otherwise we are accessing a in non-contiguous ways in the
> inner loop which would prevent vectorization.

this change from a(i,j) to a(j,i) is not required if we try to vectorize the
outer-loop, where the stride is 1. It's also a better way to vectorize the
reduction. A few limitations on the way though are:

1) somehow don't let gcc create guard code around the innermost loop to check
that it executes more than zero iterations. This creates a complicated control
flow structure within the outer-loop. For now you have to have a constant number
of iterations for the inner-loop because of that, or insert a statement like
"if (a2<=0) return;" before the loop...

2) use -fno-tree-sink because otherwise it moves the loop iv increment to the
latch block and the vectorizer likes to have the latch block empty...

(see also PR33113 for related reference).
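To make the outer-loop idea concrete, here is a hand-written sketch with a vectorization factor of two. It is an illustration under assumptions: real doubles instead of complex values, a hypothetical function name, the Fortran array a(i,j) emulated by a flat C array at[j * n_i + i] (so i is the contiguous index), and an even n_i so that no epilogue is needed.

```c
#include <assert.h>

/* Two consecutive outer iterations (i and i + 1) are processed
   together; their a elements are adjacent in memory for every j. */
void outer_by_two(double *self, const double *at, const double *b,
                  int n_i, int n_j)
{
  assert(n_i >= 0 && n_i % 2 == 0);  /* sketch only: no epilogue for odd n_i */
  for (int i = 0; i < n_i; i += 2) {
    double acc0 = self[i];
    double acc1 = self[i + 1];
    for (int j = 0; j < n_j; j++) {
      acc0 += at[j * n_i + i] * b[j];
      acc1 += at[j * n_i + i + 1] * b[j];
    }
    self[i] = acc0;
    self[i + 1] = acc1;
  }
}
```

Each accumulator is an independent reduction, so no cross-lane summation is needed at the end of the inner loop.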


> With the versioning for stride == 1 I get then
> .L13:
>         movupd  16(%rax), %xmm1
>         movupd  (%rax), %xmm3
>         incl    %ecx
>         movupd  (%rdx), %xmm4
>         addq    $32, %rax
>         movapd  %xmm3, %xmm0
>         unpckhpd        %xmm1, %xmm3
>         unpcklpd        %xmm1, %xmm0
>         movupd  16(%rdx), %xmm1
>         movapd  %xmm4, %xmm2
>         addq    $32, %rdx
>         movapd  %xmm3, %xmm9
>         cmpl    %ecx, %r8d
>         unpcklpd        %xmm1, %xmm2
>         unpckhpd        %xmm1, %xmm4
>         movapd  %xmm4, %xmm1
>         movapd  %xmm2, %xmm4
>         mulpd   %xmm1, %xmm9
>         mulpd   %xmm0, %xmm4
>         mulpd   %xmm3, %xmm2
>         mulpd   %xmm1, %xmm0
>         subpd   %xmm9, %xmm4
>         addpd   %xmm2, %xmm0
>         addpd   %xmm4, %xmm6
>         addpd   %xmm0, %xmm5
>         ja      .L13
>         haddpd  %xmm5, %xmm5
>         cmpl    %r15d, %edi
>         movl    -4(%rsp), %ecx
>         haddpd  %xmm6, %xmm6
>         addsd   %xmm5, %xmm8
>         addsd   %xmm6, %xmm7
>         jne     .L12
>         jmp     .L14
> for the innermost loop, followed by a tail loop (peel for niters).  This is
> about 15% faster on AMD K10 than the non-vectorized loop (if you disable
> the cost-model and make sure to have enough iterations in the inner loop
> to pay back for the extra guarding conditions).


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
  2008-08-04 17:57 [Bug tree-optimization/37021] New: " rguenth at gcc dot gnu dot org
                   ` (6 preceding siblings ...)
  2009-01-25 11:04 ` rguenther at suse dot de
@ 2009-01-25 12:17 ` irar at il dot ibm dot com
  2009-01-27 12:40 ` dorit at gcc dot gnu dot org
  8 siblings, 0 replies; 26+ messages in thread
From: irar at il dot ibm dot com @ 2009-01-25 12:17 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #8 from irar at il dot ibm dot com  2009-01-25 12:17 -------
(In reply to comment #7)
> > > Q1: does SLP work with reductions at all?
> > 
> > No. SLP currently originates from groups of strided stores.
> Ah, I see.  In this loop we have two reductions, so to apply SLP
> we would need to see that we can use a group of reductions for SLP?

Yes, I think this will work.

> > > Q2: does SLP do pattern recognition?
> > 
> > Pattern recognition is done before SLP, and SLP handles stmts that were marked
> > as a part of a pattern. There is no SLP specific pattern recognition.
> Ok, but with a reduction it won't help me here.
> Can a loop be vectorized with just pattern recognition?  Hm, if I
> remember correctly we detect scalar patterns and then vectorize them.
> We don't support detecting "vector patterns" from scalar code, correct?

Yes, if I understand you correctly, we detect scalar patterns, but adding
vector pattern detection does not seem to be complicated.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
  2008-08-04 17:57 [Bug tree-optimization/37021] New: " rguenth at gcc dot gnu dot org
                   ` (5 preceding siblings ...)
  2009-01-25  9:13 ` irar at il dot ibm dot com
@ 2009-01-25 11:04 ` rguenther at suse dot de
  2009-01-25 12:17 ` irar at il dot ibm dot com
  2009-01-27 12:40 ` dorit at gcc dot gnu dot org
  8 siblings, 0 replies; 26+ messages in thread
From: rguenther at suse dot de @ 2009-01-25 11:04 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #7 from rguenther at suse dot de  2009-01-25 11:04 -------
Subject: Re:  Fortran Complex reduction /
 multiplication not vectorized

On Sun, 25 Jan 2009, irar at il dot ibm dot com wrote:

> 
> 
> ------- Comment #6 from irar at il dot ibm dot com  2009-01-25 09:12 -------
> (In reply to comment #5)
> > So,
> >  4) The vectorized version sucks because we have to use peeling for niters
> >     because we need to unroll the loop once and cannot apply SLP here.
> 
> What do you mean by "unroll the loop once"?

The vectorization factor is two, so we need two copies of the loop body
(which means unrolling it once and creating an epilogue loop because
niter may be odd)
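In scalar terms, the shape being described looks roughly like the sketch below: a main loop that handles two iterations per trip, followed by an epilogue for a possible leftover iteration. The function is a made-up example (a plain sum rather than the complex product) and only illustrates the loop structure.

```c
#include <assert.h>

/* Sum of x[0..n-1] with the loop body duplicated once (factor two). */
double sum_unrolled_by_two(const double *x, int n)
{
  assert(n >= 0);
  double acc0 = 0.0, acc1 = 0.0;
  int i = 0;
  /* Main loop: two copies of the body per trip. */
  for (; i + 1 < n; i += 2) {
    acc0 += x[i];
    acc1 += x[i + 1];
  }
  double sum = acc0 + acc1;
  /* Epilogue: runs at most once, when n is odd. */
  for (; i < n; i++)
    sum += x[i];
  return sum;
}
```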

> > Q1: does SLP work with reductions at all?
> 
> No. SLP currently originates from groups of strided stores.

Ah, I see.  In this loop we have two reductions, so to apply SLP
we would need to see that we can use a group of reductions for SLP?

> > Q2: does SLP do pattern recognition?
> 
> Pattern recognition is done before SLP, and SLP handles stmts that were marked
> as a part of a pattern. There is no SLP specific pattern recognition.

Ok, but with a reduction it won't help me here.

Can a loop be vectorized with just pattern recognition?  Hm, if I
remember correctly we detect scalar patterns and then vectorize them.
We don't support detecting "vector patterns" from scalar code, correct?

Thanks,
Richard.



* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
  2008-08-04 17:57 [Bug tree-optimization/37021] New: " rguenth at gcc dot gnu dot org
                   ` (4 preceding siblings ...)
  2009-01-23 15:36 ` rguenth at gcc dot gnu dot org
@ 2009-01-25  9:13 ` irar at il dot ibm dot com
  2009-01-25 11:04 ` rguenther at suse dot de
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: irar at il dot ibm dot com @ 2009-01-25  9:13 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #6 from irar at il dot ibm dot com  2009-01-25 09:12 -------
(In reply to comment #5)
> So,
>  4) The vectorized version sucks because we have to use peeling for niters
>     because we need to unroll the loop once and cannot apply SLP here.

What do you mean by "unroll the loop once"?

> Q1: does SLP work with reductions at all?

No. SLP currently originates from groups of strided stores.

> Q2: does SLP do pattern recognition?

Pattern recognition is done before SLP, and SLP handles stmts that were marked
as a part of a pattern. There is no SLP-specific pattern recognition.

> First of all we would need to recognize a complex reduction as a single
> vectorized reduction.  Second we need to vectorize the complex multiplication
> with SLP, feeding the reduction with one resulting complex vector.



* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
  2008-08-04 17:57 [Bug tree-optimization/37021] New: " rguenth at gcc dot gnu dot org
                   ` (3 preceding siblings ...)
  2009-01-23 15:33 ` rguenth at gcc dot gnu dot org
@ 2009-01-23 15:36 ` rguenth at gcc dot gnu dot org
  2009-01-25  9:13 ` irar at il dot ibm dot com
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-01-23 15:36 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #5 from rguenth at gcc dot gnu dot org  2009-01-23 15:36 -------
So,

 4) The vectorized version sucks because we have to use peeling for niters
    because we need to unroll the loop once and cannot apply SLP here.

Q1: does SLP work with reductions at all?
Q2: does SLP do pattern recognition?

First of all we would need to recognize a complex reduction as a single
vectorized reduction.  Second we need to vectorize the complex multiplication
with SLP, feeding the reduction with one resulting complex vector.
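
A rough C sketch of that idea, treating one complex number as a two-lane vector {re, im} so that no unrolling and no epilogue loop are needed. The name cdot_slp and the interleaved array layout are assumptions made for this illustration; this is not how the vectorizer is implemented.

```c
#include <stddef.h>

/* Interleaved complex doubles: z[2*k] is the real part, z[2*k+1] the
   imaginary part.  acc models a single 2-lane vector accumulator
   holding {re, im}.  cdot_slp is a hypothetical name. */
void cdot_slp(const double *a, const double *b, size_t n, double acc[2])
{
    acc[0] = 0.0;
    acc[1] = 0.0;
    for (size_t k = 0; k < n; ++k) {
        double ar = a[2 * k], ai = a[2 * k + 1];
        double br = b[2 * k], bi = b[2 * k + 1];

        /* a*b = ar * {br, bi} + ai * {-bi, br}: two 2-lane multiplies. */
        double t0[2] = { ar * br,  ar * bi };
        double t1[2] = { -ai * bi, ai * br };

        /* One 2-lane reduction update per scalar iteration. */
        acc[0] += t0[0] + t1[0];
        acc[1] += t0[1] + t1[1];
    }
}
```

Here the reduction variable is a single vector, so the final horizontal add across lanes disappears as well.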



* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
  2008-08-04 17:57 [Bug tree-optimization/37021] New: " rguenth at gcc dot gnu dot org
                   ` (2 preceding siblings ...)
  2009-01-21 15:43 ` rguenth at gcc dot gnu dot org
@ 2009-01-23 15:33 ` rguenth at gcc dot gnu dot org
  2009-01-23 15:36 ` rguenth at gcc dot gnu dot org
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-01-23 15:33 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #4 from rguenth at gcc dot gnu dot org  2009-01-23 15:33 -------
The testcase should be

subroutine to_product_of(self,a,b,a1,a2)
  complex(kind=8) :: self (:)
  complex(kind=8), intent(in) :: a(:,:)
  complex(kind=8), intent(in) :: b(:)
  integer a1,a2
  do i = 1,a1
    do j = 1,a2
      self(i) = self(i) + a(j,i)*b(j)
    end do
  end do
end subroutine

to be meaningful; otherwise the inner loop accesses the array a
non-contiguously, which would prevent vectorization.
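
For readers who think in C, a hypothetical model of this testcase shows why the index order matters. Fortran arrays are column-major, so a(j,i) sits at offset (j-1) + (i-1)*lda and the first index is the contiguous one. The explicit lda parameter and the zero-based indices are assumptions of this sketch.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical C model of the Fortran testcase above.
   a is stored column-major with leading dimension lda, so the inner
   index j walks through memory with unit stride. */
void to_product_of(double complex *self, const double complex *a,
                   const double complex *b, size_t lda,
                   size_t a1, size_t a2)
{
    for (size_t i = 0; i < a1; ++i)
        for (size_t j = 0; j < a2; ++j)
            self[i] += a[j + i * lda] * b[j];   /* unit stride in j */
}
```

Swapping the subscripts to a(i,j) would turn the inner access into a[i + j * lda], a stride of lda elements per iteration.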

With the versioning for stride == 1 I get then

.L13:
        movupd  16(%rax), %xmm1
        movupd  (%rax), %xmm3
        incl    %ecx
        movupd  (%rdx), %xmm4
        addq    $32, %rax
        movapd  %xmm3, %xmm0
        unpckhpd        %xmm1, %xmm3
        unpcklpd        %xmm1, %xmm0
        movupd  16(%rdx), %xmm1
        movapd  %xmm4, %xmm2
        addq    $32, %rdx
        movapd  %xmm3, %xmm9
        cmpl    %ecx, %r8d
        unpcklpd        %xmm1, %xmm2
        unpckhpd        %xmm1, %xmm4
        movapd  %xmm4, %xmm1
        movapd  %xmm2, %xmm4
        mulpd   %xmm1, %xmm9
        mulpd   %xmm0, %xmm4
        mulpd   %xmm3, %xmm2
        mulpd   %xmm1, %xmm0
        subpd   %xmm9, %xmm4
        addpd   %xmm2, %xmm0
        addpd   %xmm4, %xmm6
        addpd   %xmm0, %xmm5
        ja      .L13
        haddpd  %xmm5, %xmm5
        cmpl    %r15d, %edi
        movl    -4(%rsp), %ecx
        haddpd  %xmm6, %xmm6
        addsd   %xmm5, %xmm8
        addsd   %xmm6, %xmm7
        jne     .L12
        jmp     .L14

for the innermost loop, followed by a tail loop (peel for niters).  This is
about 15% faster on AMD K10 than the non-vectorized loop (if you disable
the cost-model and make sure to have enough iterations in the inner loop
to pay back for the extra guarding conditions).
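
The structure of that loop can be modelled in scalar C roughly as follows: split real and imaginary parts into separate lanes, keep one accumulator vector for each, add horizontally after the loop, then run the tail loop. This is a reading aid under my own naming (cdot_vf2, interleaved layout), not compiler output.

```c
#include <stddef.h>

/* Interleaved complex doubles: z[2*k] real, z[2*k+1] imaginary.
   Computes sum_k a[k]*b[k] into out = {re, im}.
   cdot_vf2 is a hypothetical name used only for this sketch. */
void cdot_vf2(const double *a, const double *b, size_t n, double out[2])
{
    double re[2] = {0.0, 0.0};   /* models the real-part accumulator */
    double im[2] = {0.0, 0.0};   /* models the imaginary-part accumulator */
    size_t k = 0;

    for (; k + 1 < n; k += 2) {
        for (int l = 0; l < 2; ++l) {            /* the two vector lanes */
            /* Deinterleave (the unpcklpd/unpckhpd step). */
            double ar = a[2 * (k + l)], ai = a[2 * (k + l) + 1];
            double br = b[2 * (k + l)], bi = b[2 * (k + l) + 1];
            re[l] += ar * br - ai * bi;          /* mulpd, mulpd, subpd, addpd */
            im[l] += ar * bi + ai * br;          /* mulpd, mulpd, addpd, addpd */
        }
    }

    out[0] = re[0] + re[1];                      /* haddpd */
    out[1] = im[0] + im[1];                      /* haddpd */

    for (; k < n; ++k) {                         /* tail loop for odd n */
        double ar = a[2 * k], ai = a[2 * k + 1];
        double br = b[2 * k], bi = b[2 * k + 1];
        out[0] += ar * br - ai * bi;
        out[1] += ar * bi + ai * br;
    }
}
```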



* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
  2008-08-04 17:57 [Bug tree-optimization/37021] New: " rguenth at gcc dot gnu dot org
  2008-08-04 17:59 ` [Bug tree-optimization/37021] " rguenth at gcc dot gnu dot org
  2008-08-19 15:31 ` rguenth at gcc dot gnu dot org
@ 2009-01-21 15:43 ` rguenth at gcc dot gnu dot org
  2009-01-23 15:33 ` rguenth at gcc dot gnu dot org
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-01-21 15:43 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #3 from rguenth at gcc dot gnu dot org  2009-01-21 15:43 -------
Mine.  I am working on adding versioning for non-constant strides.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |irar at il dot ibm dot com
         AssignedTo|unassigned at gcc dot gnu   |rguenth at gcc dot gnu dot
                   |dot org                     |org
             Status|UNCONFIRMED                 |ASSIGNED
     Ever Confirmed|0                           |1
   Last reconfirmed|0000-00-00 00:00:00         |2009-01-21 15:43:08
               date|                            |



* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
  2008-08-04 17:57 [Bug tree-optimization/37021] New: " rguenth at gcc dot gnu dot org
  2008-08-04 17:59 ` [Bug tree-optimization/37021] " rguenth at gcc dot gnu dot org
@ 2008-08-19 15:31 ` rguenth at gcc dot gnu dot org
  2009-01-21 15:43 ` rguenth at gcc dot gnu dot org
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2008-08-19 15:31 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #2 from rguenth at gcc dot gnu dot org  2008-08-19 15:29 -------
3) fails because the data-ref analysis requires a constant step

  else if (!simple_iv (loop, stmt, poffset, &offset_iv, false))
    {
      if (dump_file && (dump_flags & TDF_DETAILS))
        fprintf (dump_file, "failed: evolution of offset is not affine.\n");
      return;

but the step is (<unnamed-signed:64>) ((<unnamed-unsigned:64>) stride.3_36 *
16)

as we are dealing with general incoming arrays which are arbitrarily strided.
Fixing this requires, for example, versioning for a constant stride.
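
At source level, versioning for a constant stride amounts to something like the following C sketch. The function sum_strided and its signature are invented for illustration; the actual transformation would be done by the compiler on its internal representation.

```c
#include <stddef.h>

/* The stride is only known at run time, as for an assumed-shape
   Fortran array.  sum_strided is a hypothetical example function. */
double sum_strided(const double *x, size_t n, ptrdiff_t stride)
{
    double sum = 0.0;

    if (stride == 1) {
        /* Versioned copy: the step is the constant 1 here, so the
           access has a constant step and is a vectorization candidate. */
        for (size_t i = 0; i < n; ++i)
            sum += x[i];
    } else {
        /* Generic fallback for arbitrary strides. */
        for (size_t i = 0; i < n; ++i)
            sum += x[(ptrdiff_t)i * stride];
    }
    return sum;
}
```

The runtime check is one of the extra guarding conditions that the loop needs enough iterations to amortize.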



* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
  2008-08-04 17:57 [Bug tree-optimization/37021] New: " rguenth at gcc dot gnu dot org
@ 2008-08-04 17:59 ` rguenth at gcc dot gnu dot org
  2008-08-19 15:31 ` rguenth at gcc dot gnu dot org
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2008-08-04 17:59 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #1 from rguenth at gcc dot gnu dot org  2008-08-04 17:58 -------
Patch for 1) http://gcc.gnu.org/ml/gcc-patches/2008-08/msg00221.html
Patch for 2) http://gcc.gnu.org/ml/gcc-patches/2008-08/msg00226.html

I haven't started on 3) yet, so 4) is still unknown (as are 5, 6, ... ;))



end of thread, other threads:[~2023-07-21 12:28 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <bug-37021-4@http.gcc.gnu.org/bugzilla/>
2011-03-25 11:49 ` [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized sebastian.hegler@tu-dresden.de
2011-03-25 12:27 ` sebastian.hegler@tu-dresden.de
2011-03-25 13:13 ` rguenther at suse dot de
2012-07-13  8:46 ` rguenth at gcc dot gnu.org
2013-02-13 15:58 ` rguenth at gcc dot gnu.org
2013-03-27 10:39 ` rguenth at gcc dot gnu.org
2013-03-27 10:40 ` rguenth at gcc dot gnu.org
2013-04-07 13:18 ` dominiq at lps dot ens.fr
2015-05-12 11:56 ` rguenth at gcc dot gnu.org
2015-06-10 10:45 ` rguenth at gcc dot gnu.org
2015-08-25  8:11 ` rguenth at gcc dot gnu.org
2015-08-27 22:09 ` wschmidt at gcc dot gnu.org
2015-08-28  7:46 ` rguenther at suse dot de
2015-08-28 13:20 ` wschmidt at gcc dot gnu.org
2015-08-28 13:31 ` wschmidt at gcc dot gnu.org
2015-10-22 10:03 ` rguenth at gcc dot gnu.org
2023-07-21 12:28 ` rguenth at gcc dot gnu.org
2008-08-04 17:57 [Bug tree-optimization/37021] New: " rguenth at gcc dot gnu dot org
2008-08-04 17:59 ` [Bug tree-optimization/37021] " rguenth at gcc dot gnu dot org
2008-08-19 15:31 ` rguenth at gcc dot gnu dot org
2009-01-21 15:43 ` rguenth at gcc dot gnu dot org
2009-01-23 15:33 ` rguenth at gcc dot gnu dot org
2009-01-23 15:36 ` rguenth at gcc dot gnu dot org
2009-01-25  9:13 ` irar at il dot ibm dot com
2009-01-25 11:04 ` rguenther at suse dot de
2009-01-25 12:17 ` irar at il dot ibm dot com
2009-01-27 12:40 ` dorit at gcc dot gnu dot org
