From mboxrd@z Thu Jan  1 00:00:00 1970
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
Date: Wed, 05 Jun 2013 20:24:00 -0000
From: Michael Meissner <meissner@linux.vnet.ibm.com>
To: Segher Boessenkool <segher@kernel.crashing.org>
Cc: Michael Meissner <meissner@linux.vnet.ibm.com>,        David Edelsohn <dje.gcc@gmail.com>,        GCC Patches <gcc-patches@gcc.gnu.org>,        Pat Haugen <pthaugen@us.ibm.com>, Peter Bergner <bergner@vnet.ibm.com>
Subject: Re: [PATCH, rs6000] power8 patches, patch #4 (revised), new power8 builtins
Message-ID: <20130605202420.GA8002@ibm-tiger.the-meissners.org>
Mail-Followup-To: Michael Meissner <meissner@linux.vnet.ibm.com>,	Segher Boessenkool <segher@kernel.crashing.org>,	David Edelsohn <dje.gcc@gmail.com>,	GCC Patches <gcc-patches@gcc.gnu.org>,	Pat Haugen <pthaugen@us.ibm.com>,	Peter Bergner <bergner@vnet.ibm.com>
References: <20130520204053.GA21090@ibm-tiger.the-meissners.org> <20130521234717.GA27879@ibm-tiger.the-meissners.org> <20130604184853.GA12768@ibm-tiger.the-meissners.org> <CAGWvnyni0U+749Y-_aS1mB9vMQ1pKHhiikGhxJmV26KVP2Rbtw@mail.gmail.com> <1A6C76BF-AB76-4471-9F80-462FC0EBCB60@kernel.crashing.org> <20130605160427.GA5774@ibm-tiger.the-meissners.org> <1389948D-B28E-45B4-ADB4-50E2DA258331@kernel.crashing.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1389948D-B28E-45B4-ADB4-50E2DA258331@kernel.crashing.org>
User-Agent: Mutt/1.5.20 (2009-12-10)
X-SW-Source: 2013-06/txt/msg00279.txt.bz2

On Wed, Jun 05, 2013 at 10:06:08PM +0200, Segher Boessenkool wrote:
> >I also wonder whether it would be useful to have 32-bit do the vector
> >logical ops in gprs as well.  At the moment, the patches don't allow it
> >(vector types must be done in the altivec/vsx registers, an TImode is
> >done by splitting the operation into 4 separate categories).  On the
> >64-bit side, having __int128_t passed in GPRs, means you want to avoid
> >ping-ponging between the GPRs and VSX registers.  In addition, the
> >atomic quad word support (patch #7) has to run in GPRs, so we need
> >add/subtract/logical to have versions that run in GPRs.
> 
> It might work better if you added a mode V1TI for TI in vector
> regs, and then used plain TI only for GPRs.  It certainly will
> make things a lot more regular; whether it actually works better,
> I have no idea.
> 
> The way you have things now, only after reload the vector patterns
> are split to GPR patterns; much too late to do most optimisations
> on it.  On the other hand, deciding early what register set some
> op should go to isn't too pleasant either; is it always the best
> choice to use the vector regs when possible?

It depends.  For example, consider:

#ifndef TYPE
#define TYPE __int128_t
#endif

TYPE a_and (TYPE p, TYPE q) { return p & q; }
void p_and (TYPE *p, TYPE *q, TYPE *r) { *p = *q & *r; }

In a_and, p and q are passed in GPRs, so you want to use the GPR-based
instructions.  In p_and, both operands come from memory, so it is simpler to
do the operation in the VSX registers.
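
To make the a_and case concrete, here is a minimal C sketch (not part of the
patch) of why two GPR instructions are enough: a 128-bit AND is just two
independent 64-bit ANDs, one per doubleword.  The type u128_pair and the
helpers split128, join128 and and128_by_halves are hypothetical names used
only for this illustration, and the code assumes a compiler that provides the
__int128 extension.

```c
#include <stdint.h>

/* Hypothetical helper type: a 128-bit value viewed as two 64-bit
   doublewords, the way it sits in a pair of GPRs.  */
typedef struct { uint64_t hi, lo; } u128_pair;

/* Split a 128-bit value into its high and low doublewords.  */
static u128_pair split128 (unsigned __int128 x)
{
  u128_pair r = { (uint64_t) (x >> 64), (uint64_t) x };
  return r;
}

/* Reassemble a 128-bit value from its two doublewords.  */
static unsigned __int128 join128 (u128_pair p)
{
  return ((unsigned __int128) p.hi << 64) | p.lo;
}

/* AND done one doubleword at a time: two independent 64-bit ANDs,
   with no data flowing between the halves.  */
static unsigned __int128 and128_by_halves (unsigned __int128 p,
                                           unsigned __int128 q)
{
  u128_pair a = split128 (p), b = split128 (q);
  u128_pair r = { a.hi & b.hi, a.lo & b.lo };
  return join128 (r);
}
```

Because the halves do not depend on each other, nothing is gained by moving
register-resident arguments into a vector register first.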

This is what my code from patch 4 generates:

.L.a_and:
        and 3,3,5
        and 4,4,6
        blr

.L.p_and:
        lxvd2x 12,0,4
        lxvd2x 0,0,5
        xxland 0,12,0
        stxvd2x 0,0,3
        blr

Unfortunately, when I added support for TImode in VSX registers, I didn't
notice that a_and now bounces its GPR arguments through memory into the VSX
registers and back.  The current code generates:

.L.a_and:
        addi 9,1,-16
        std 3,0(9)
        std 4,8(9)
        ori 2,2,0
        lxvd2x 12,0,9
        std 5,0(9)
        std 6,8(9)
        ori 2,2,0
        lxvd2x 0,0,9
        xxland 0,12,0
        stxvd2x 0,0,9
        ori 2,2,0
        ld 3,0(9)
        ld 4,8(9)
        blr

.L.p_and:
        lxvd2x 12,0,4
        lxvd2x 0,0,5
        xxland 0,12,0
        stxvd2x 0,0,3
        blr

Previous versions (and -mno-vsx-timode) generate:

.L.a_and:
        and 3,3,5
        and 4,4,6
        blr

.L.p_and:
        ld 10,0(4)
        ld 9,0(5)
        and 9,10,9
        std 9,0(3)
        ld 10,8(4)
        ld 9,8(5)
        and 9,10,9
        std 9,8(3)
        blr

Note that the scheduler does not interleave the loads and the and's; instead,
it does ld/ld/and/std for each doubleword in turn.

This bouncing back and forth will get somewhat worse when the support for doing
__int128_t add/subtract in the vector registers is added.  We don't want to hard
wire doing all of TImode in vector registers, because that would break the
16-byte (quad word) atomic fetch_and_add functions, which have to run in GPRs
(unless we use an UNSPEC to hide the add).
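
For reference, here is a rough C sketch (again, not code from the patch) of
what a GPR-based 128-bit add has to compute.  Unlike the logical operations,
the two halves are not independent: the carry out of the low doubleword feeds
the high doubleword, which on PowerPC corresponds to an addc/adde pair.  The
names u128_halves and add128_by_halves are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper type: a 128-bit value held as two 64-bit
   doublewords, as it would be in a pair of GPRs.  */
typedef struct { uint64_t hi, lo; } u128_halves;

/* 128-bit add done as a two-step carry chain.  */
static u128_halves add128_by_halves (u128_halves a, u128_halves b)
{
  u128_halves r;
  r.lo = a.lo + b.lo;             /* low doublewords; may wrap around */
  uint64_t carry = r.lo < a.lo;   /* 1 if the low add wrapped, else 0 */
  r.hi = a.hi + b.hi + carry;     /* high doublewords plus the carry */
  return r;
}
```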

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460, USA
email: meissner@linux.vnet.ibm.com, phone: +1 (978) 899-4797