From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-474927-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 114965 invoked by alias); 19 Mar 2018 09:11:17 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 114946 invoked by uid 89); 19 Mar 2018 09:11:16 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-26.4 required=5.0 tests=BAYES_00,GIT_PATCH_0,GIT_PATCH_1,GIT_PATCH_2,GIT_PATCH_3,KAM_NUMSUBJECT,SPF_PASS,T_RP_MATCHES_RCVD autolearn=ham version=3.3.2 spammy=
X-HELO: mx2.suse.de
Received: from mx2.suse.de (HELO mx2.suse.de) (195.135.220.15) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Mon, 19 Mar 2018 09:11:14 +0000
Received: from relay1.suse.de (charybdis-ext.suse.de [195.135.220.254])	by mx2.suse.de (Postfix) with ESMTP id F2958AD95;	Mon, 19 Mar 2018 09:11:11 +0000 (UTC)
Date: Mon, 19 Mar 2018 11:55:00 -0000
From: Richard Biener <rguenther@suse.de>
To: Tom de Vries <Tom_deVries@mentor.com>
cc: gcc-patches@gcc.gnu.org, Jan Hubicka <jh@suse.de>
Subject: Re: [PATCH] Fix PR84512
In-Reply-To: <b880705a-5e9f-7d57-13a7-6b90ac14d9cf@mentor.com>
Message-ID: <alpine.LSU.2.20.1803191010080.18265@zhemvz.fhfr.qr>
References: <alpine.LSU.2.20.1802271339480.18265@zhemvz.fhfr.qr> <c7abe956-1c76-d987-8a25-a852d2feabaf@mentor.com> <alpine.LSU.2.20.1803161254060.18265@zhemvz.fhfr.qr> <b880705a-5e9f-7d57-13a7-6b90ac14d9cf@mentor.com>
User-Agent: Alpine 2.20 (LSU 67 2015-01-07)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
X-SW-Source: 2018-03/txt/msg00863.txt.bz2

On Fri, 16 Mar 2018, Tom de Vries wrote:

> On 03/16/2018 12:55 PM, Richard Biener wrote:
> > On Fri, 16 Mar 2018, Tom de Vries wrote:
> > 
> > > On 02/27/2018 01:42 PM, Richard Biener wrote:
> > > > Index: gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
> > > > ===================================================================
> > > > --- gcc/testsuite/gcc.dg/tree-ssa/pr84512.c	(nonexistent)
> > > > +++ gcc/testsuite/gcc.dg/tree-ssa/pr84512.c	(working copy)
> > > > @@ -0,0 +1,15 @@
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-O3 -fdump-tree-optimized" } */
> > > > +
> > > > +int foo()
> > > > +{
> > > > +  int a[10];
> > > > +  for(int i = 0; i < 10; ++i)
> > > > +    a[i] = i*i;
> > > > +  int res = 0;
> > > > +  for(int i = 0; i < 10; ++i)
> > > > +    res += a[i];
> > > > +  return res;
> > > > +}
> > > > +
> > > > +/* { dg-final { scan-tree-dump "return 285;" "optimized" } } */
> > > 
> > > This fails for nvptx, because it doesn't have the required vector
> > > operations.
> > > To fix the failure, I've added a requirement for effective target
> > > vect_int_mult.
> > 
> > On targets that do not vectorize you should see the scalar loops unrolled
> > instead.  Or do you have only one loop vectorized?
> 
> Sort of. Loop vectorization has no effect, and the scalar loops are completely
> unrolled. But then SLP vectorization vectorizes the stores.
> 
> So at optimized we have:
> ...
>   MEM[(int *)&a] = { 0, 1 };
>   MEM[(int *)&a + 8B] = { 4, 9 };
>   MEM[(int *)&a + 16B] = { 16, 25 };
>   MEM[(int *)&a + 24B] = { 36, 49 };
>   MEM[(int *)&a + 32B] = { 64, 81 };
>   _6 = a[0];
>   _28 = a[1];
>   res_29 = _6 + _28;
>   _35 = a[2];
>   res_36 = res_29 + _35;
>   _42 = a[3];
>   res_43 = res_36 + _42;
>   _49 = a[4];
>   res_50 = res_43 + _49;
>   _56 = a[5];
>   res_57 = res_50 + _56;
>   _63 = a[6];
>   res_64 = res_57 + _63;
>   _70 = a[7];
>   res_71 = res_64 + _70;
>   _77 = a[8];
>   res_78 = res_71 + _77;
>   _2 = a[9];
>   res_11 = _2 + res_78;
>   a ={v} {CLOBBER};
>   return res_11;
> ...
> 
> The stores and loads are eliminated by dse1 in the RTL phase, and in the end
> we have:
> ...
> .visible .func (.param.u32 %value_out) foo
> {
>         .reg.u32 %value;
>         .local .align 16 .b8 %frame_ar[48];
>         .reg.u64 %frame;
>         cvta.local.u64 %frame, %frame_ar;
>         mov.u32 %value, 285;
>         st.param.u32    [%value_out], %value;
>         ret;
> }
> ...
> 
> > That's precisely
> > what the PR was about...  which means it isn't fixed for nvptx :/
> 
> Indeed the assembly is not optimal. It would be optimal if we had optimal
> code at optimized.
> 
> FWIW, using this patch we generate optimal code at optimized:
> ...
> diff --git a/gcc/passes.def b/gcc/passes.def
> index 3ebcfc30349..6b64f600c4a 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -325,6 +325,7 @@ along with GCC; see the file COPYING3.  If not see
>        NEXT_PASS (pass_tracer);
>        NEXT_PASS (pass_thread_jumps);
>        NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
> +      NEXT_PASS (pass_fre);
>        NEXT_PASS (pass_strlen);
>        NEXT_PASS (pass_thread_jumps);
>        NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */);
> ...
> 
> and we get:
> ...
> .visible .func (.param.u32 %value_out) foo
> {
>         .reg.u32 %value;
>         mov.u32 %value, 285;
>         st.param.u32    [%value_out], %value;
>         ret;
> }
> ...
> 
> I could file a missed-optimization PR for nvptx, but I'm not sure where this
> should be fixed.

Ah, yeah... the usual issue then.

Can you please XFAIL the test on nvptx instead of requiring vect_int_mult?
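
Something along these lines should do.  This is an untested sketch: the
nvptx-*-* triplet in the xfail selector is an assumption, and the rest of
the test is unchanged from the patch.

```c
/* { dg-do compile } */
/* { dg-options "-O3 -fdump-tree-optimized" } */

int foo()
{
  int a[10];
  for(int i = 0; i < 10; ++i)
    a[i] = i*i;
  int res = 0;
  for(int i = 0; i < 10; ++i)
    res += a[i];
  return res;
}

/* Expected to fail on nvptx for now, because the SLP-vectorized stores
   are not folded at the optimized dump (target triplet assumed).  */
/* { dg-final { scan-tree-dump "return 285;" "optimized" { xfail nvptx-*-* } } } */
```

That keeps the scan running on every target, so an XPASS will show up
once the nvptx case is handled.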

Thanks,
Richard.