Subject: Re: [PATCH][1/2] Fix PR68553
To: Ramana Radhakrishnan, Richard Biener, gcc-patches@gcc.gnu.org
From: Alan Lawrence
Date: Fri, 04 Dec 2015 18:23:00 -0000
Message-ID: <5661DA01.7080006@arm.com>
In-Reply-To: <5661D165.6080406@foss.arm.com>
References: <5661B211.8000402@arm.com> <5661D165.6080406@foss.arm.com>

On 04/12/15 17:46, Ramana Radhakrishnan wrote:
>
> On 04/12/15 16:04, Richard Biener wrote:
>> On December 4, 2015 4:32:33 PM GMT+01:00, Alan Lawrence wrote:
>>> On 27/11/15 08:30, Richard Biener wrote:
>>>>
>>>> This is part 1 of a fix for PR68533
>>>> which shows that some targets fail
>>>> can_vec_perm_p on an identity permutation.  I chose to fix
>>>> this in the vectorizer by detecting the identity itself, but with
>>>> the current structure of vect_transform_slp_perm_load this is
>>>> somewhat awkward.  Thus the following no-op patch simplifies it
>>>> greatly (from the times it was restricted to interleaving-kind
>>>> permutes).  It turned out not to be a 100% no-op, as we can now
>>>> handle non-adjacent source operands, so I split it out from the
>>>> actual fix.
>>>>
>>>> The two adjusted testcases no longer fail to vectorize because
>>>> of "need three vectors", but unadjusted they would fail because
>>>> there are simply not enough scalar iterations in the loop.  I
>>>> adjusted that and now we vectorize them just fine (running into
>>>> PR68559, which I filed).
>>>>
>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.
>>>>
>>>> Richard.
>>>>
>>>> 2015-11-27  Richard Biener
>>>>
>>>> 	PR tree-optimization/68553
>>>> 	* tree-vect-slp.c (vect_get_mask_element): Remove.
>>>> 	(vect_transform_slp_perm_load): Implement in a simpler way.
>>>>
>>>> 	* gcc.dg/vect/pr45752.c: Adjust.
>>>> 	* gcc.dg/vect/slp-perm-4.c: Likewise.
>>>
>>> On aarch64 and ARM targets, this causes
>>>
>>> PASS->FAIL: gcc.dg/vect/O3-pr36098.c scan-tree-dump-times vect
>>> "vectorizing stmts using SLP" 0
>>>
>>> That is, we now vectorize using SLP, when previously we did not.
>>>
>>> On aarch64 (and I expect ARM too), previously we used a
>>> VEC_LOAD_LANES, without unrolling, but now we unroll * 4, and
>>> vectorize using 3 loads and permutes:
>>
>> Happens on x86_64 as well with at least SSE4.1.  Unfortunately we'll
>> have to start introducing much more fine-grained target-supports for
>> vect_perm to reliably guard all targets.
>
> I don't know enough about SSE4.1 to know whether it's a problem there
> or not.
> This is an actual regression on AArch64 and ARM and not just a
> testism; you now get:
>
> .L5:
>         ldr     q0, [x5, 16]
>         add     x4, x4, 48
>         ldr     q1, [x5, 32]
>         add     w6, w6, 1
>         ldr     q4, [x5, 48]
>         cmp     w3, w6
>         ldr     q2, [x5], 64
>         orr     v3.16b, v0.16b, v0.16b
>         orr     v5.16b, v4.16b, v4.16b
>         orr     v4.16b, v1.16b, v1.16b
>         tbl     v0.16b, {v0.16b - v1.16b}, v6.16b
>         tbl     v2.16b, {v2.16b - v3.16b}, v7.16b
>         tbl     v4.16b, {v4.16b - v5.16b}, v16.16b
>         str     q0, [x4, -32]
>         str     q2, [x4, -48]
>         str     q4, [x4, -16]
>         bhi     .L5
>
> instead of
>
> .L5:
>         ld4     {v4.4s - v7.4s}, [x7], 64
>         add     w4, w4, 1
>         cmp     w3, w4
>         orr     v1.16b, v4.16b, v4.16b
>         orr     v2.16b, v5.16b, v5.16b
>         orr     v3.16b, v6.16b, v6.16b
>         st3     {v1.4s - v3.4s}, [x6], 48
>         bhi     .L5
>
> LD4 and ST3 do all the permutes without needing actual permute
> instructions - a strategy that favours generic permutes over the
> load_lanes case is likely to be more expensive on most
> implementations.  I think this is worth a PR at least.
>
> regards
> Ramana
>

Yes, quite right.  PR 68707.

--Alan