From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-return-180418-listarch-gcc=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 26407 invoked by alias); 11 Oct 2013 15:05:33 -0000
Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc/>
List-Post: <mailto:gcc@gcc.gnu.org>
List-Help: <http://gcc.gnu.org/ml/>
Sender: gcc-owner@gcc.gnu.org
Received: (qmail 26395 invoked by uid 89); 11 Oct 2013 15:05:33 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-4.2 required=5.0 tests=AWL,BAYES_00,RP_MATCHES_RCVD,SPF_HELO_PASS,SPF_PASS autolearn=ham version=3.3.2
X-HELO: mx1.redhat.com
Received: from mx1.redhat.com (HELO mx1.redhat.com) (209.132.183.28) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Fri, 11 Oct 2013 15:05:32 +0000
Received: from int-mx02.intmail.prod.int.phx2.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12])	by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id r9BF5Seg010768	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK);	Fri, 11 Oct 2013 11:05:28 -0400
Received: from tucnak.zalov.cz (vpn1-4-130.ams2.redhat.com [10.36.4.130])	by int-mx02.intmail.prod.int.phx2.redhat.com (8.13.8/8.13.8) with ESMTP id r9BF5QTP020623	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);	Fri, 11 Oct 2013 11:05:28 -0400
Received: from tucnak.zalov.cz (localhost [127.0.0.1])	by tucnak.zalov.cz (8.14.7/8.14.7) with ESMTP id r9BF5PYg015539;	Fri, 11 Oct 2013 17:05:25 +0200
Received: (from jakub@localhost)	by tucnak.zalov.cz (8.14.7/8.14.7/Submit) id r9BF5ODE015538;	Fri, 11 Oct 2013 17:05:24 +0200
Date: Fri, 11 Oct 2013 15:05:00 -0000
From: Jakub Jelinek <jakub@redhat.com>
To: Vidya Praveen <vidyapraveen@arm.com>
Cc: Richard Biener <rguenther@suse.de>, "gcc@gcc.gnu.org" <gcc@gcc.gnu.org>,        "ook@ucw.cz" <ook@ucw.cz>, marc.glisse@inria.fr
Subject: Re: [RFC] Vectorization of indexed elements
Message-ID: <20131011150524.GX30970@tucnak.zalov.cz>
Reply-To: Jakub Jelinek <jakub@redhat.com>
References: <alpine.DEB.2.10.1309091949090.3565@laptop-mg.saclay.inria.fr> <20130924150425.GE22907@e103625-lin.cambridge.arm.com> <alpine.LNX.2.00.1309251123490.29411@zhemvz.fhfr.qr> <20130927145008.GA861@e103625-lin.cambridge.arm.com> <20130927151945.GB861@e103625-lin.cambridge.arm.com> <20130930125454.GD3460@e103625-lin.cambridge.arm.com> <alpine.LNX.2.00.1309301504120.5759@zhemvz.fhfr.qr> <20130930140001.GF3460@e103625-lin.cambridge.arm.com> <alpine.LNX.2.00.1310011022420.5759@zhemvz.fhfr.qr> <20131011145408.GB23850@e103625-lin.cambridge.arm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20131011145408.GB23850@e103625-lin.cambridge.arm.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-IsSubscribed: yes
X-SW-Source: 2013-10/txt/msg00117.txt.bz2

On Fri, Oct 11, 2013 at 03:54:08PM +0100, Vidya Praveen wrote:
> Here's a compilable example:
> 
> void 
> foo (int *__restrict__ a,
>      int *__restrict__ b,
>      int *__restrict__ c)
> {
>   int i;
> 
>   for (i = 0; i < 8; i++)
>     a[i] = b[i] * c[2];
> }
> 
> This is vectorized by duplicating c[2] now. But I'm trying to take advantage
> of target instructions that can take a vector register as second argument but
> use only one element (by using the same value for all the lanes) of the 
> vector register.
> 
> Eg. mul <vec-reg>, <vec-reg>, <vec-reg>[index]
>     mla <vec-reg>, <vec-reg>, <vec-reg>[index] // multiply and add
> 
> But for a loop like the one in the C example given, I will instead have to
> load c[2] into one element of the vector register (leaving the remaining
> elements unused). This is why I was proposing to load just one element into
> a vector register (what I meant by "lane specific load"). The benefit of
> doing this is that we avoid explicit duplication; however, such a
> simplification can only be done where such support is available, which is
> why I was thinking in terms of an optional standard pattern name. Another
> benefit is that we will also be able to support scalars in the expression,
> as in the following example:
> 
> void
> foo (int *__restrict__ a,
>      int *__restrict__ b,
>      int c)
> {
>   int i;
> 
>   for (i = 0; i < 8; i++)
>     a[i] = b[i] * c;
> }

So, could the broadcast operation simply be combined with the arithmetic
during combine?  The Intel AVX512 ISA has a similar feature, though I'm not
sure exactly what they are doing for it.  That said, the broadcast is likely
going to be hoisted before the loop, and in that case is it really cheaper
to keep the value unbroadcast in a vector register than to broadcast it once
before the loop and just use the result there?
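
To make the two alternatives concrete, here is a rough sketch in plain C.
It is not real vector code and not what GCC emits: the inner loops over
LANES merely model one vector instruction, and the names foo_broadcast,
foo_by_lane and LANES are made up for illustration.  The first variant
broadcasts c[2] once before the loop; the second places c[2] in a single
lane and lets every multiply pick that lane, as an indexed mul would.

```c
enum { LANES = 4 };   /* assumed vector width: 4 ints per register */

/* Variant 1: broadcast c[2] into all lanes once, before the loop,
   then use an ordinary vector * vector multiply inside it.  */
void
foo_broadcast (int *restrict a, const int *restrict b, const int *restrict c)
{
  int splat[LANES];
  int i, l;

  for (l = 0; l < LANES; l++)
    splat[l] = c[2];                    /* hoisted broadcast */

  for (i = 0; i < 8; i += LANES)
    for (l = 0; l < LANES; l++)         /* models one vector multiply */
      a[i + l] = b[i + l] * splat[l];
}

/* Variant 2: load c[2] into one lane only and let the multiply select
   that lane, modelling mul <vec-reg>, <vec-reg>, <vec-reg>[index].  */
void
foo_by_lane (int *restrict a, const int *restrict b, const int *restrict c)
{
  int vc[LANES] = { 0 };                /* remaining lanes are unused */
  const int lane = 2;                   /* hypothetical lane choice */
  int i, l;

  vc[lane] = c[2];                      /* "lane specific load" */

  for (i = 0; i < 8; i += LANES)
    for (l = 0; l < LANES; l++)         /* models one indexed multiply */
      a[i + l] = b[i + l] * vc[lane];
}
```

Both variants do one setup operation outside the loop and one multiply per
vector inside it, which is the point of the question: once the broadcast is
hoisted, the loop body does the same amount of work either way in this model.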

	Jakub