From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 59549 invoked by alias); 14 May 2019 09:13:59 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 59541 invoked by uid 89); 14 May 2019 09:13:59 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=-4.4 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,GIT_PATCH_2,RCVD_IN_DNSWL_NONE,SPF_PASS autolearn=ham version=3.3.1 spammy=PACK X-HELO: mail-lf1-f67.google.com Received: from mail-lf1-f67.google.com (HELO mail-lf1-f67.google.com) (209.85.167.67) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 14 May 2019 09:13:56 +0000 Received: by mail-lf1-f67.google.com with SMTP id l26so5618664lfh.13 for ; Tue, 14 May 2019 02:13:55 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=wWtJ9bKBbAw59HvQAvGgiczdc0HF5r4CK1SpQIlKSvg=; b=LRV4/cw+7hrBarLUmLAurtiIjDPXR+hEF58pKwvV9jx/eDBcy7gXuTaBMFMWT3lm4w 0649QietxIbTB+Hyrt0O8ukhgmOse9Y624ey1dmnwFIY3SyBn+EaNCNIctahwR3Xy1vY vjVy/O+k4/e7r9J7HL5GBKWW4u6ThsBbp2tREkRlU1ezIRhqTApv0HrB9JiCYz2AFCbd s37jBO0ezSFZAHTIOnX5Kfi5PzC28XL7wBWWjSErJjJuqX1snnSXqzpyMDrjDTEQlyO+ ImC3Av3iUQQQ0GqGnEv5CDKdyFCFFW7Dhy/6MrAfghTXNwWfVYCe0++h8T09bBzcOxaE zSLw== MIME-Version: 1.0 References: <20190303143230.19742-1-hjl.tools@gmail.com> <20190303211325.GA24057@gmail.com> <20190304174625.GA25613@gmail.com> In-Reply-To: From: Richard Biener Date: Tue, 14 May 2019 09:13:00 -0000 Message-ID: Subject: Re: V6 [PATCH] Optimize vector constructor To: "H.J. Lu" Cc: Hongtao Liu , Andrew Pinski , GCC Patches Content-Type: text/plain; charset="UTF-8" X-IsSubscribed: yes X-SW-Source: 2019-05/txt/msg00674.txt.bz2 On Wed, May 8, 2019 at 2:04 PM Richard Biener wrote: > > On Fri, May 3, 2019 at 6:54 PM H.J. Lu wrote: > > > > On Thu, May 2, 2019 at 10:53 AM H.J. Lu wrote: > > > > > > On Thu, May 2, 2019 at 7:55 AM Richard Biener > > > wrote: > > > > > > > > On Thu, May 2, 2019 at 4:54 PM Richard Biener > > > > wrote: > > > > > > > > > > On Mon, Mar 11, 2019 at 8:03 AM H.J. Lu wrote: > > > > > > > > > > > > On Fri, Mar 8, 2019 at 7:03 PM Richard Biener > > > > > > wrote: > > > > > > > > > > > > > > On Fri, Mar 8, 2019 at 9:49 AM H.J. Lu wrote: > > > > > > > > > > > > > > > > On Thu, Mar 7, 2019 at 9:51 AM H.J. Lu wrote: > > > > > > > > > > > > > > > > > > On Wed, Mar 6, 2019 at 8:33 PM Richard Biener > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > On Wed, Mar 6, 2019 at 8:46 AM H.J. Lu wrote: > > > > > > > > > > > > > > > > > > > > > > On Tue, Mar 5, 2019 at 1:46 AM H.J. Lu wrote: > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Mar 04, 2019 at 12:55:04PM +0100, Richard Biener wrote: > > > > > > > > > > > > > On Sun, Mar 3, 2019 at 10:13 PM H.J. Lu wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Sun, Mar 03, 2019 at 06:40:09AM -0800, Andrew Pinski wrote: > > > > > > > > > > > > > > > ) > > > > > > > > > > > > > > > ,On Sun, Mar 3, 2019 at 6:32 AM H.J. Lu wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > For vector init constructor: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --- > > > > > > > > > > > > > > > > typedef float __v4sf __attribute__ ((__vector_size__ (16))); > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > __v4sf > > > > > > > > > > > > > > > > foo (__v4sf x, float f) > > > > > > > > > > > > > > > > { > > > > > > > > > > > > > > > > __v4sf y = { f, x[1], x[2], x[3] }; > > > > > > > > > > > > > > > > return y; > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > --- > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > we can optimize vector init constructor with vector copy or permute > > > > > > > > > > > > > > > > followed by a single scalar insert: > > > > > > > > > > > > > > > > > > > > > > > > > and you want to advance to the _1 = BIT_INSERT_EXPR here. The easiest way > > > > > > > > > > > > > is to emit a new stmt for _2 = copy ...; and do the set_rhs with the > > > > > > > > > > > > > BIT_INSERT_EXPR. > > > > > > > > > > > > > > > > > > > > > > > > Thanks for BIT_INSERT_EXPR suggestion. I am testing this patch. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > H.J. > > > > > > > > > > > > --- > > > > > > > > > > > > We can optimize vector constructor with vector copy or permute followed > > > > > > > > > > > > by a single scalar insert: > > > > > > > > > > > > > > > > > > > > > > > > __v4sf y; > > > > > > > > > > > > __v4sf D.1930; > > > > > > > > > > > > float _1; > > > > > > > > > > > > float _2; > > > > > > > > > > > > float _3; > > > > > > > > > > > > > > > > > > > > > > > > : > > > > > > > > > > > > _1 = BIT_FIELD_REF ; > > > > > > > > > > > > _2 = BIT_FIELD_REF ; > > > > > > > > > > > > _3 = BIT_FIELD_REF ; > > > > > > > > > > > > y_6 = {f_5(D), _3, _2, _1}; > > > > > > > > > > > > return y_6; > > > > > > > > > > > > > > > > > > > > > > > > with > > > > > > > > > > > > > > > > > > > > > > > > __v4sf y; > > > > > > > > > > > > __v4sf D.1930; > > > > > > > > > > > > float _1; > > > > > > > > > > > > float _2; > > > > > > > > > > > > float _3; > > > > > > > > > > > > vector(4) float _8; > > > > > > > > > > > > > > > > > > > > > > > > : > > > > > > > > > > > > _1 = BIT_FIELD_REF ; > > > > > > > > > > > > _2 = BIT_FIELD_REF ; > > > > > > > > > > > > _3 = BIT_FIELD_REF ; > > > > > > > > > > > > _8 = x_9(D); > > > > > > > > > > > > y_6 = BIT_INSERT_EXPR ; > > > > > > > > > > > > return y_6; > > > > > > > > > > > > > > > > > > > > > > > > gcc/ > > > > > > > > > > > > > > > > > > > > > > > > PR tree-optimization/88828 > > > > > > > > > > > > * tree-ssa-forwprop.c (simplify_vector_constructor): Optimize > > > > > > > > > > > > vector init constructor with vector copy or permute followed > > > > > > > > > > > > by a single scalar insert. > > > > > > > > > > > > > > > > > > > > > > > > gcc/testsuite/ > > > > > > > > > > > > > > > > > > > > > > > > PR tree-optimization/88828 > > > > > > > > > > > > * gcc.target/i386/pr88828-1a.c: New test. > > > > > > > > > > > > * gcc.target/i386/pr88828-2b.c: Likewise. > > > > > > > > > > > > * gcc.target/i386/pr88828-2.c: Likewise. > > > > > > > > > > > > * gcc.target/i386/pr88828-3a.c: Likewise. > > > > > > > > > > > > * gcc.target/i386/pr88828-3b.c: Likewise. > > > > > > > > > > > > * gcc.target/i386/pr88828-3c.c: Likewise. > > > > > > > > > > > > * gcc.target/i386/pr88828-3d.c: Likewise. > > > > > > > > > > > > * gcc.target/i386/pr88828-4a.c: Likewise. > > > > > > > > > > > > * gcc.target/i386/pr88828-4b.c: Likewise. > > > > > > > > > > > > * gcc.target/i386/pr88828-5a.c: Likewise. > > > > > > > > > > > > * gcc.target/i386/pr88828-5b.c: Likewise. > > > > > > > > > > > > * gcc.target/i386/pr88828-6a.c: Likewise. > > > > > > > > > > > > * gcc.target/i386/pr88828-6b.c: Likewise. > > > > > > > > > > > > > > > > > > > > > > Here is the updated patch with run-time tests. > > > > > > > > > > > > > > > > > > > > - if (TREE_CODE (elt->value) != SSA_NAME) > > > > > > > > > > + if (TREE_CODE (ce->value) != SSA_NAME) > > > > > > > > > > return false; > > > > > > > > > > > > > > > > > > > > hmm, so it doesn't allow { 0, v[1], v[2], v[3] }? I think the single > > > > > > > > > > scalar value can be a constant as well. > > > > > > > > > > > > > > > > > > Fixed. > > > > > > > > > > > > > > > > > > > if (!def_stmt) > > > > > > > > > > - return false; > > > > > > > > > > + { > > > > > > > > > > + if (gimple_nop_p (SSA_NAME_DEF_STMT (ce->value))) > > > > > > > > > > > > > > > > > > > > if (SSA_NAME_IS_DEFAULT_DEF (ce->value)) > > > > > > > > > > > > > > > > > > > > + { > > > > > > > > > > > > > > > > > > > > also you seem to disallow > > > > > > > > > > > > > > > > > > > > { i + 1, v[1], v[2], v[3] } > > > > > > > > > > > > > > > > > > Fixed by > > > > > > > > > > > > > > > > > > if (code != BIT_FIELD_REF) > > > > > > > > > { > > > > > > > > > /* Only allow one scalar insert. */ > > > > > > > > > if (nscalars != 0) > > > > > > > > > return false; > > > > > > > > > > > > > > > > > > nscalars = 1; > > > > > > > > > insert = true; > > > > > > > > > scalar_idx = i; > > > > > > > > > sel.quick_push (i); > > > > > > > > > scalar_element = ce->value; > > > > > > > > > continue; > > > > > > > > > } > > > > > > > > > > > > > > > > > > > because get_prop_source_stmt will return the definition computing > > > > > > > > > > i + 1 in this case and your code will be skipped? > > > > > > > > > > > > > > > > > > > > I think you can simplify the code by treating scalar_element != NULL > > > > > > > > > > as nscalars == 1 and eliding nscalars. > > > > > > > > > > > > > > > > > > It works only if > > > > > > > > > > > > > > > > > > TYPE_VECTOR_SUBPARTS (type).to_constant () == (nscalars + nvectors) > > > > > > > > > > > > > > > > > > We need to check both nscalars and nvectors. Elide nscalar > > > > > > > > > check doesn't help much here. > > > > > > > > > > > > > > > > > > > - if (conv_code == ERROR_MARK) > > > > > > > > > > + if (conv_code == ERROR_MARK && !insert) > > > > > > > > > > gimple_assign_set_rhs_with_ops (gsi, VEC_PERM_EXPR, orig[0], > > > > > > > > > > orig[1], op2); > > > > > > > > > > else > > > > > > > > > > @@ -2148,10 +2198,25 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi) > > > > > > > > > > VEC_PERM_EXPR, orig[0], orig[1], op2); > > > > > > > > > > orig[0] = gimple_assign_lhs (perm); > > > > > > > > > > gsi_insert_before (gsi, perm, GSI_SAME_STMT); > > > > > > > > > > - gimple_assign_set_rhs_with_ops (gsi, conv_code, orig[0], > > > > > > > > > > + gimple_assign_set_rhs_with_ops (gsi, > > > > > > > > > > + (conv_code != ERROR_MARK > > > > > > > > > > + ? conv_code > > > > > > > > > > + : NOP_EXPR), > > > > > > > > > > + orig[0], > > > > > > > > > > NULL_TREE, NULL_TREE); > > > > > > > > > > > > > > > > > > > > I believe you should elide the last stmt for conv_code == ERROR_MARK, > > > > > > > > > > that is, why did you need to add the && !insert check in the guarding condition > > > > > > > > > > > > > > > > > > When conv_code == ERROR_MARK, we still need > > > > > > > > > > > > > > > > > > gimple *perm > > > > > > > > > = gimple_build_assign (make_ssa_name (TREE_TYPE (orig[0])), > > > > > > > > > VEC_PERM_EXPR, orig[0], orig[1], op2); > > > > > > > > > orig[0] = gimple_assign_lhs (perm); > > > > > > > > > gsi_insert_before (gsi, perm, GSI_SAME_STMT); > > > > > > > > > gimple_assign_set_rhs_with_ops (gsi, NOP_EXPR, > > > > > > > > > orig[0], > > > > > > > > > NULL_TREE, NULL_TREE); > > > > > > > > > > > > > > > > > > Otherwise, scalar insert won't work. > > > > > > > > > > > > > > > > > > > (this path should already do the correct thing?). Note that in all > > > > > > > > > > cases it looks > > > > > > > > > > that with conv_code != ERROR_MARK you may end up doing a float->int > > > > > > > > > > or int->float conversion on a value it wasn't done on before which might > > > > > > > > > > raise exceptions? That is, do we need to make sure we permute a > > > > > > > > > > value we already do convert into the place we're going to insert to? > > > > > > > > > > > > > > > > > > This couldn't happen: > > > > > > > > > > > > > > > > > > if (type == TREE_TYPE (ref)) > > > > > > > > > { > > > > > > > > > /* The RHS vector has the same type as LHS. */ > > > > > > > > > if (rhs_vector == NULL) > > > > > > > > > rhs_vector = ref; > > > > > > > > > /* Check if all RHS vector elements come fome the same > > > > > > > > > vector. */ > > > > > > > > > if (rhs_vector == ref) > > > > > > > > > nvectors++; > > > > > > > > > } > > > > > > > > > ... > > > > > > > > > if (insert > > > > > > > > > && (nvectors == 0 > > > > > > > > > || (TYPE_VECTOR_SUBPARTS (type).to_constant () > > > > > > > > > != (nscalars + nvectors)))) > > > > > > > > > return false; > > > > > > > > > > > > > > I see - that looks like a missed case then? > > > > > > > > > > > > > > { 1., (float)v[1], (float)v[2], (float)v[3] } > > > > > > > > > > > > > > with integer vector v? > > > > > > > > > > > > True. > > > > > > > > > > > > > I'll have a look at the full patch next week (it's GCC 10 material in any case). > > > > > > > > > > > > > > > > > Now looking again. I still don't like the new "structure" of the loop > > > > > very much. > > > > > A refactoring like the attached should make it easier to clearly separate the > > > > > cases where we reach a vector def and where not. > > > > > > > > Now attached. > > > > > > > > > Do you want me to take over the patch? > > > > > > > > > > > > Here is the updated patch on top of your patch plus my fix. > > Thanks - when doing the constant vector I was thinking of the following patch > to handle your cases. It doesn't use insertion but eventually leaves that > to a separate transform. Instead it handles non-constants similar to constants > by permuting from a uniform vector. Thus > > __attribute__((noinline, noclone)) > __v4sf > foo2 (__v4sf x, float f) > { > __v4sf y = { f, x[1], x[2], x[3] }; > return y; > } > > becomes > > _4 = {f_2(D), f_2(D), f_2(D), f_2(D)}; > y_3 = VEC_PERM_EXPR ; > return y_3; > > this allows us to handle an arbitrary number of inserts of this > single value. It also ensures we can actually perform the > permutation while for the insertion we currently do not have > a convenient way to query whether the target can perform > it efficiently (IIRC x86 needs AVX to insert to arbitrary lanes > with a single instruction?). Similarly if the user writes the above > in source using __builtin_shuffle we'd want to optimize it as well. > > The patch as attached only passes some of your testcases, > the following FAIL: > > FAIL: gcc.target/i386/pr88828-2a.c scan-assembler movss > FAIL: gcc.target/i386/pr88828-2a.c scan-assembler-not movaps > FAIL: gcc.target/i386/pr88828-2a.c scan-assembler-not movlhps > FAIL: gcc.target/i386/pr88828-2a.c scan-assembler-not unpcklps > FAIL: gcc.target/i386/pr88828-2b.c scan-assembler-times vpermilps 1 > FAIL: gcc.target/i386/pr88828-2b.c scan-assembler-times vmovss 1 > FAIL: gcc.target/i386/pr88828-2c.c scan-assembler movss > FAIL: gcc.target/i386/pr88828-2c.c scan-assembler-not movaps > FAIL: gcc.target/i386/pr88828-2c.c scan-assembler-not movlhps > FAIL: gcc.target/i386/pr88828-2c.c scan-assembler-not unpcklps > FAIL: gcc.target/i386/pr88828-2d.c scan-assembler-times vpermilps 1 > FAIL: gcc.target/i386/pr88828-2d.c scan-assembler-times vmovss 1 > FAIL: gcc.target/i386/pr88828-3a.c scan-assembler movss > FAIL: gcc.target/i386/pr88828-3a.c scan-assembler-times shufps 2 > FAIL: gcc.target/i386/pr88828-3a.c scan-assembler-times movaps 1 > FAIL: gcc.target/i386/pr88828-3a.c scan-assembler-not movlhps > FAIL: gcc.target/i386/pr88828-3a.c scan-assembler-not unpcklps > FAIL: gcc.target/i386/pr88828-3b.c scan-assembler-times vpermilps 1 > FAIL: gcc.target/i386/pr88828-3b.c scan-assembler-times vinsertps 1 > FAIL: gcc.target/i386/pr88828-3b.c scan-assembler-not vshufps > FAIL: gcc.target/i386/pr88828-3c.c scan-assembler-times vpermilps 1 > FAIL: gcc.target/i386/pr88828-3c.c scan-assembler-times vinsertps 1 > FAIL: gcc.target/i386/pr88828-3c.c scan-assembler-not vshufps > > Making the patch emit inserts for single insert locations is of course > still possible but you get to arrive at heuristics like your choice > of permuting the original lane into the later overwritten lane which > might be a choice making the permute impossible or more expensive? > > The original purpose of simplify_vector_constructor was to simplify > the IL, not so much optimal code-generation in the end but I wonder > if we can rely on RTL expansion or later RTL optimization to do > the optimal choices here? > > I guess simplify_permutation could perform a VEC_PERM > into an insert if the remaining permutation would be a no-op > but RTL optimization handles this case well already. > > Whether code-generation for a one vector permute plus insert or > a two-vector permute is better in the end I don't know - at least > the permute expansion has a chance to see the combined > instruction. > > Do you think the remaining cases above can be handled in the > backend? > > Comments? I have now applied this after bootstrap / regtest on x86_64-unknown-linux-gnu together with the part of the testcases that PASS. Note I had to add -fexcess-precision=standard to the -7.c one as it otherwise fails to execute both patched and unpatched. r271153. I didn't check whether the remaining testcases simply need adjustments (thus their code-gen is OK) or if there's something to do on the target or in a GIMPLE transform. That needs to be evaluated still. There's also still the missed optimization of using VEC_UNPACK/PACK codes for conversions. Richard. > Thanks, > Richard. > > 2019-05-08 Richard Biener > > PR tree-optimization/88828 > * tree-ssa-forwprop.c (simplify_vector_constructor): Handle > permuting in a single non-constant element not extracted > from a vector. > > > > -- > > H.J.