Subject: Re: [PATCH][1/2] Fix PR68553
To: Ramana Radhakrishnan, Richard Biener, gcc-patches@gcc.gnu.org
From: Alan Lawrence
Date: Fri, 04 Dec 2015 18:23:00 -0000
Message-ID: <5661DA01.7080006@arm.com>
In-Reply-To: <5661D165.6080406@foss.arm.com>
References: <5661B211.8000402@arm.com> <5661D165.6080406@foss.arm.com>

On 04/12/15 17:46, Ramana Radhakrishnan wrote:
>
> On 04/12/15 16:04, Richard Biener wrote:
>> On December 4, 2015 4:32:33 PM GMT+01:00, Alan Lawrence wrote:
>>> On 27/11/15 08:30, Richard Biener wrote:
>>>>
>>>> This is part 1 of a fix for PR68533
>>>> which shows that some targets fail
>>>> can_vec_perm_p on an identity permutation.  I chose to fix
>>>> this in the vectorizer by detecting the identity itself, but with
>>>> the current structure of vect_transform_slp_perm_load this is
>>>> somewhat awkward.  Thus the following no-op patch simplifies it
>>>> greatly (from the times it was restricted to interleaving-kind
>>>> permutes).  It turned out not to be a 100% no-op, as we can now
>>>> handle non-adjacent source operands, so I split it out from the
>>>> actual fix.
>>>>
>>>> The two adjusted testcases no longer fail to vectorize because
>>>> of "need three vectors", but unadjusted they would fail because
>>>> there are simply not enough scalar iterations in the loop.  I
>>>> adjusted that and now we vectorize them just fine (running into
>>>> PR68559, which I filed).
>>>>
>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.
>>>>
>>>> Richard.
>>>>
>>>> 2015-11-27  Richard Biener
>>>>
>>>> 	PR tree-optimization/68553
>>>> 	* tree-vect-slp.c (vect_get_mask_element): Remove.
>>>> 	(vect_transform_slp_perm_load): Implement in a simpler way.
>>>>
>>>> 	* gcc.dg/vect/pr45752.c: Adjust.
>>>> 	* gcc.dg/vect/slp-perm-4.c: Likewise.
>>>
>>> On aarch64 and ARM targets, this causes
>>>
>>> PASS->FAIL: gcc.dg/vect/O3-pr36098.c scan-tree-dump-times vect
>>> "vectorizing stmts using SLP" 0
>>>
>>> That is, we now vectorize using SLP, when previously we did not.
>>>
>>> On aarch64 (and I expect ARM too), previously we used a
>>> VEC_LOAD_LANES, without unrolling, but now we unroll * 4, and
>>> vectorize using 3 loads and permutes:
>>
>> Happens on x86_64 as well with at least SSE4.1.  Unfortunately we'll
>> have to start introducing much more fine-grained target-supports for
>> vect_perm to reliably guard all targets.
>
> I don't know enough about SSE4.1 to know whether it's a problem there
> or not.
> This is an actual regression on AArch64 and ARM and not just a
> testism; you now get:
>
> .L5:
>         ldr     q0, [x5, 16]
>         add     x4, x4, 48
>         ldr     q1, [x5, 32]
>         add     w6, w6, 1
>         ldr     q4, [x5, 48]
>         cmp     w3, w6
>         ldr     q2, [x5], 64
>         orr     v3.16b, v0.16b, v0.16b
>         orr     v5.16b, v4.16b, v4.16b
>         orr     v4.16b, v1.16b, v1.16b
>         tbl     v0.16b, {v0.16b - v1.16b}, v6.16b
>         tbl     v2.16b, {v2.16b - v3.16b}, v7.16b
>         tbl     v4.16b, {v4.16b - v5.16b}, v16.16b
>         str     q0, [x4, -32]
>         str     q2, [x4, -48]
>         str     q4, [x4, -16]
>         bhi     .L5
>
> instead of
>
> .L5:
>         ld4     {v4.4s - v7.4s}, [x7], 64
>         add     w4, w4, 1
>         cmp     w3, w4
>         orr     v1.16b, v4.16b, v4.16b
>         orr     v2.16b, v5.16b, v5.16b
>         orr     v3.16b, v6.16b, v6.16b
>         st3     {v1.4s - v3.4s}, [x6], 48
>         bhi     .L5
>
> LD4 and ST3 do all the permutes without needing actual permute
> instructions - a strategy that favours generic permutes over the
> load_lanes case is likely to be more expensive on most
> implementations.  I think this is worth a PR at least.
>
> regards
> Ramana
>

Yes, quite right.  PR 68707.

--Alan