From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-500621-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 59549 invoked by alias); 14 May 2019 09:13:59 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 59541 invoked by uid 89); 14 May 2019 09:13:59 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-4.4 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,GIT_PATCH_2,RCVD_IN_DNSWL_NONE,SPF_PASS autolearn=ham version=3.3.1 spammy=PACK
X-HELO: mail-lf1-f67.google.com
Received: from mail-lf1-f67.google.com (HELO mail-lf1-f67.google.com) (209.85.167.67) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 14 May 2019 09:13:56 +0000
Received: by mail-lf1-f67.google.com with SMTP id l26so5618664lfh.13        for <gcc-patches@gcc.gnu.org>; Tue, 14 May 2019 02:13:55 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;        d=gmail.com; s=20161025;        h=mime-version:references:in-reply-to:from:date:message-id:subject:to         :cc;        bh=wWtJ9bKBbAw59HvQAvGgiczdc0HF5r4CK1SpQIlKSvg=;        b=LRV4/cw+7hrBarLUmLAurtiIjDPXR+hEF58pKwvV9jx/eDBcy7gXuTaBMFMWT3lm4w         0649QietxIbTB+Hyrt0O8ukhgmOse9Y624ey1dmnwFIY3SyBn+EaNCNIctahwR3Xy1vY         vjVy/O+k4/e7r9J7HL5GBKWW4u6ThsBbp2tREkRlU1ezIRhqTApv0HrB9JiCYz2AFCbd         s37jBO0ezSFZAHTIOnX5Kfi5PzC28XL7wBWWjSErJjJuqX1snnSXqzpyMDrjDTEQlyO+         ImC3Av3iUQQQ0GqGnEv5CDKdyFCFFW7Dhy/6MrAfghTXNwWfVYCe0++h8T09bBzcOxaE         zSLw==
MIME-Version: 1.0
References: <20190303143230.19742-1-hjl.tools@gmail.com> <CA+=Sn1mpV+MX6u5ERi_4yjG1+pr2EBh8ft-2QJSUBGKWPY6O7w@mail.gmail.com> <20190303211325.GA24057@gmail.com> <CAFiYyc17ECEUCXzV4-rVupBjwWhCuVQfTRGgmGtt1uThh12Prg@mail.gmail.com> <20190304174625.GA25613@gmail.com> <CAMe9rOrCAJv663ontj0sBtn7p1vuH0i-+mmcLyzzFv0xszfLJA@mail.gmail.com> <CAFiYyc2d0zaOMK1nAqvwvGxaiYTHjN56S21rRrsE=_10TG6Dqg@mail.gmail.com> <CAMe9rOrnfZVr=6ZEiNvgCZVhUOcbt0G2Ob_-QyXpj_u8-qzJpA@mail.gmail.com> <CAMe9rOpQehpx=sxjvE1wgZYd+NrQUaVn0eEPgDcr5TuPTEjj5w@mail.gmail.com> <CAFiYyc3BD=EXejU3MYs0E583FfFeQ6s+tLOwVWTxDmKVfwwx8A@mail.gmail.com> <CAMe9rOrr6Mb7RdYb_W7gsDxH0bmRCG1BO8JJ-5eUTrXmvFT+jA@mail.gmail.com> <CAFiYyc2iK51C47Kisj5kC3q6B9v-4Sb6jeR5Qt3EgT81b_U3cw@mail.gmail.com> <CAFiYyc2LA51_Tj2CfAXUqXVaQYmRpV+Hp-L4aoePpSNSFSXfBQ@mail.gmail.com> <CAMe9rOoq2VHdguD_ZHH3wiF43-aJkj4BXuW6GUD5wKA06M+yxg@mail.gmail.com> <CAMe9rOqWNH6kxAsXekjh5Bbr69FOUDAgyyZsciOXZXJexoijQw@mail.gmail.com> <CAFiYyc1tgJ2SuWPr4CPjccGskLig8axsRF=EDn1Ke0QrOOSUqA@mail.gmail.com>
In-Reply-To: <CAFiYyc1tgJ2SuWPr4CPjccGskLig8axsRF=EDn1Ke0QrOOSUqA@mail.gmail.com>
From: Richard Biener <richard.guenther@gmail.com>
Date: Tue, 14 May 2019 09:13:00 -0000
Message-ID: <CAFiYyc25_cbm6CRwzShYoengz57V69O_iFZ57oF5ZcCmdyxPig@mail.gmail.com>
Subject: Re: V6 [PATCH] Optimize vector constructor
To: "H.J. Lu" <hjl.tools@gmail.com>
Cc: Hongtao Liu <crazylht@gmail.com>, Andrew Pinski <pinskia@gmail.com>, 	GCC Patches <gcc-patches@gcc.gnu.org>
Content-Type: text/plain; charset="UTF-8"
X-IsSubscribed: yes
X-SW-Source: 2019-05/txt/msg00674.txt.bz2

On Wed, May 8, 2019 at 2:04 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Fri, May 3, 2019 at 6:54 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Thu, May 2, 2019 at 10:53 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > On Thu, May 2, 2019 at 7:55 AM Richard Biener
> > > <richard.guenther@gmail.com> wrote:
> > > >
> > > > On Thu, May 2, 2019 at 4:54 PM Richard Biener
> > > > <richard.guenther@gmail.com> wrote:
> > > > >
> > > > > On Mon, Mar 11, 2019 at 8:03 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Mar 8, 2019 at 7:03 PM Richard Biener
> > > > > > <richard.guenther@gmail.com> wrote:
> > > > > > >
> > > > > > > On Fri, Mar 8, 2019 at 9:49 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Mar 7, 2019 at 9:51 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Mar 6, 2019 at 8:33 PM Richard Biener
> > > > > > > > > <richard.guenther@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, Mar 6, 2019 at 8:46 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Mar 5, 2019 at 1:46 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Mar 04, 2019 at 12:55:04PM +0100, Richard Biener wrote:
> > > > > > > > > > > > > On Sun, Mar 3, 2019 at 10:13 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Sun, Mar 03, 2019 at 06:40:09AM -0800, Andrew Pinski wrote:
> > > > > > > > > > > > > > > )
> > > > > > > > > > > > > > > ,On Sun, Mar 3, 2019 at 6:32 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > For vector init constructor:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > typedef float __v4sf __attribute__ ((__vector_size__ (16)));
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > __v4sf
> > > > > > > > > > > > > > > > foo (__v4sf x, float f)
> > > > > > > > > > > > > > > > {
> > > > > > > > > > > > > > > >   __v4sf y = { f, x[1], x[2], x[3] };
> > > > > > > > > > > > > > > >   return y;
> > > > > > > > > > > > > > > > }
> > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > we can optimize vector init constructor with vector copy or permute
> > > > > > > > > > > > > > > > followed by a single scalar insert:
> > > > > > > > > > > >
> > > > > > > > > > > > > and you want to advance to the _1 = BIT_INSERT_EXPR here.  The easiest way
> > > > > > > > > > > > > is to emit a new stmt for _2 = copy ...; and do the set_rhs with the
> > > > > > > > > > > > > BIT_INSERT_EXPR.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for BIT_INSERT_EXPR suggestion.  I am testing this patch.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > H.J.
> > > > > > > > > > > > ---
> > > > > > > > > > > > We can optimize vector constructor with vector copy or permute followed
> > > > > > > > > > > > by a single scalar insert:
> > > > > > > > > > > >
> > > > > > > > > > > >   __v4sf y;
> > > > > > > > > > > >   __v4sf D.1930;
> > > > > > > > > > > >   float _1;
> > > > > > > > > > > >   float _2;
> > > > > > > > > > > >   float _3;
> > > > > > > > > > > >
> > > > > > > > > > > >   <bb 2> :
> > > > > > > > > > > >   _1 = BIT_FIELD_REF <x_9(D), 32, 96>;
> > > > > > > > > > > >   _2 = BIT_FIELD_REF <x_9(D), 32, 64>;
> > > > > > > > > > > >   _3 = BIT_FIELD_REF <x_9(D), 32, 32>;
> > > > > > > > > > > >   y_6 = {f_5(D), _3, _2, _1};
> > > > > > > > > > > >   return y_6;
> > > > > > > > > > > >
> > > > > > > > > > > > with
> > > > > > > > > > > >
> > > > > > > > > > > >  __v4sf y;
> > > > > > > > > > > >   __v4sf D.1930;
> > > > > > > > > > > >   float _1;
> > > > > > > > > > > >   float _2;
> > > > > > > > > > > >   float _3;
> > > > > > > > > > > >   vector(4) float _8;
> > > > > > > > > > > >
> > > > > > > > > > > >   <bb 2> :
> > > > > > > > > > > >   _1 = BIT_FIELD_REF <x_9(D), 32, 96>;
> > > > > > > > > > > >   _2 = BIT_FIELD_REF <x_9(D), 32, 64>;
> > > > > > > > > > > >   _3 = BIT_FIELD_REF <x_9(D), 32, 32>;
> > > > > > > > > > > >   _8 = x_9(D);
> > > > > > > > > > > >   y_6 = BIT_INSERT_EXPR <x_9(D), f_5(D), 0 (32 bits)>;
> > > > > > > > > > > >   return y_6;
> > > > > > > > > > > >
> > > > > > > > > > > > gcc/
> > > > > > > > > > > >
> > > > > > > > > > > >         PR tree-optimization/88828
> > > > > > > > > > > >         * tree-ssa-forwprop.c (simplify_vector_constructor): Optimize
> > > > > > > > > > > >         vector init constructor with vector copy or permute followed
> > > > > > > > > > > >         by a single scalar insert.
> > > > > > > > > > > >
> > > > > > > > > > > > gcc/testsuite/
> > > > > > > > > > > >
> > > > > > > > > > > >         PR tree-optimization/88828
> > > > > > > > > > > >         * gcc.target/i386/pr88828-1a.c: New test.
> > > > > > > > > > > >         * gcc.target/i386/pr88828-2b.c: Likewise.
> > > > > > > > > > > >         * gcc.target/i386/pr88828-2.c: Likewise.
> > > > > > > > > > > >         * gcc.target/i386/pr88828-3a.c: Likewise.
> > > > > > > > > > > >         * gcc.target/i386/pr88828-3b.c: Likewise.
> > > > > > > > > > > >         * gcc.target/i386/pr88828-3c.c: Likewise.
> > > > > > > > > > > >         * gcc.target/i386/pr88828-3d.c: Likewise.
> > > > > > > > > > > >         * gcc.target/i386/pr88828-4a.c: Likewise.
> > > > > > > > > > > >         * gcc.target/i386/pr88828-4b.c: Likewise.
> > > > > > > > > > > >         * gcc.target/i386/pr88828-5a.c: Likewise.
> > > > > > > > > > > >         * gcc.target/i386/pr88828-5b.c: Likewise.
> > > > > > > > > > > >         * gcc.target/i386/pr88828-6a.c: Likewise.
> > > > > > > > > > > >         * gcc.target/i386/pr88828-6b.c: Likewise.
> > > > > > > > > > >
> > > > > > > > > > > Here is the updated patch with run-time tests.
> > > > > > > > > >
> > > > > > > > > > -      if (TREE_CODE (elt->value) != SSA_NAME)
> > > > > > > > > > +      if (TREE_CODE (ce->value) != SSA_NAME)
> > > > > > > > > >         return false;
> > > > > > > > > >
> > > > > > > > > > hmm, so it doesn't allow { 0, v[1], v[2], v[3] }?  I think the single
> > > > > > > > > > scalar value can be a constant as well.
> > > > > > > > >
> > > > > > > > > Fixed.
> > > > > > > > >
> > > > > > > > > >        if (!def_stmt)
> > > > > > > > > > -       return false;
> > > > > > > > > > +       {
> > > > > > > > > > +         if (gimple_nop_p (SSA_NAME_DEF_STMT (ce->value)))
> > > > > > > > > >
> > > > > > > > > > if (SSA_NAME_IS_DEFAULT_DEF (ce->value))
> > > > > > > > > >
> > > > > > > > > > +           {
> > > > > > > > > >
> > > > > > > > > > also you seem to disallow
> > > > > > > > > >
> > > > > > > > > >   { i + 1, v[1], v[2], v[3] }
> > > > > > > > >
> > > > > > > > > Fixed by
> > > > > > > > >
> > > > > > > > >      if (code != BIT_FIELD_REF)
> > > > > > > > >         {
> > > > > > > > >           /* Only allow one scalar insert.  */
> > > > > > > > >           if (nscalars != 0)
> > > > > > > > >             return false;
> > > > > > > > >
> > > > > > > > >           nscalars = 1;
> > > > > > > > >           insert = true;
> > > > > > > > >           scalar_idx = i;
> > > > > > > > >           sel.quick_push (i);
> > > > > > > > >           scalar_element = ce->value;
> > > > > > > > >           continue;
> > > > > > > > >         }
> > > > > > > > >
> > > > > > > > > > because get_prop_source_stmt will return the definition computing
> > > > > > > > > > i + 1 in this case and your code will be skipped?
> > > > > > > > > >
> > > > > > > > > > I think you can simplify the code by treating scalar_element != NULL
> > > > > > > > > > as nscalars == 1 and eliding nscalars.
> > > > > > > > >
> > > > > > > > > It works only if
> > > > > > > > >
> > > > > > > > > TYPE_VECTOR_SUBPARTS (type).to_constant ()  == (nscalars + nvectors)
> > > > > > > > >
> > > > > > > > > We need to check both nscalars and nvectors.  Elide nscalar
> > > > > > > > > check doesn't help much here.
> > > > > > > > >
> > > > > > > > > > -      if (conv_code == ERROR_MARK)
> > > > > > > > > > +      if (conv_code == ERROR_MARK && !insert)
> > > > > > > > > >         gimple_assign_set_rhs_with_ops (gsi, VEC_PERM_EXPR, orig[0],
> > > > > > > > > >                                         orig[1], op2);
> > > > > > > > > >        else
> > > > > > > > > > @@ -2148,10 +2198,25 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi)
> > > > > > > > > >                                    VEC_PERM_EXPR, orig[0], orig[1], op2);
> > > > > > > > > >           orig[0] = gimple_assign_lhs (perm);
> > > > > > > > > >           gsi_insert_before (gsi, perm, GSI_SAME_STMT);
> > > > > > > > > > -         gimple_assign_set_rhs_with_ops (gsi, conv_code, orig[0],
> > > > > > > > > > +         gimple_assign_set_rhs_with_ops (gsi,
> > > > > > > > > > +                                         (conv_code != ERROR_MARK
> > > > > > > > > > +                                          ? conv_code
> > > > > > > > > > +                                          : NOP_EXPR),
> > > > > > > > > > +                                         orig[0],
> > > > > > > > > >                                           NULL_TREE, NULL_TREE);
> > > > > > > > > >
> > > > > > > > > > I believe you should elide the last stmt for conv_code == ERROR_MARK,
> > > > > > > > > > that is, why did you need to add the && !insert check in the guarding condition
> > > > > > > > >
> > > > > > > > > When conv_code == ERROR_MARK, we still need
> > > > > > > > >
> > > > > > > > >        gimple *perm
> > > > > > > > >             = gimple_build_assign (make_ssa_name (TREE_TYPE (orig[0])),
> > > > > > > > >                                    VEC_PERM_EXPR, orig[0], orig[1], op2);
> > > > > > > > >           orig[0] = gimple_assign_lhs (perm);
> > > > > > > > >           gsi_insert_before (gsi, perm, GSI_SAME_STMT);
> > > > > > > > >           gimple_assign_set_rhs_with_ops (gsi,  NOP_EXPR,
> > > > > > > > >                                           orig[0],
> > > > > > > > >                                           NULL_TREE, NULL_TREE);
> > > > > > > > >
> > > > > > > > > Otherwise, scalar insert won't work.
> > > > > > > > >
> > > > > > > > > > (this path should already do the correct thing?).  Note that in all
> > > > > > > > > > cases it looks
> > > > > > > > > > that with conv_code != ERROR_MARK you may end up doing a float->int
> > > > > > > > > > or int->float conversion on a value it wasn't done on before which might
> > > > > > > > > > raise exceptions?  That is, do we need to make sure we permute a
> > > > > > > > > > value we already do convert into the place we're going to insert to?
> > > > > > > > >
> > > > > > > > > This couldn't happen:
> > > > > > > > >
> > > > > > > > >       if (type == TREE_TYPE (ref))
> > > > > > > > >          {
> > > > > > > > >            /* The RHS vector has the same type as LHS.  */
> > > > > > > > >            if (rhs_vector == NULL)
> > > > > > > > >              rhs_vector = ref;
> > > > > > > > >            /* Check if all RHS vector elements come fome the same
> > > > > > > > >               vector.  */
> > > > > > > > >            if (rhs_vector == ref)
> > > > > > > > >              nvectors++;
> > > > > > > > >          }
> > > > > > > > > ...
> > > > > > > > >   if (insert
> > > > > > > > >       && (nvectors == 0
> > > > > > > > >           || (TYPE_VECTOR_SUBPARTS (type).to_constant ()
> > > > > > > > >               != (nscalars + nvectors))))
> > > > > > > > >     return false;
> > > > > > >
> > > > > > > I see - that looks like a missed case then?
> > > > > > >
> > > > > > >  { 1., (float)v[1], (float)v[2], (float)v[3] }
> > > > > > >
> > > > > > > with integer vector v?
> > > > > >
> > > > > > True.
> > > > > >
> > > > > > > I'll have a look at the full patch next week (it's GCC 10 material in any case).
> > > > > > >
> > > > >
> > > > > Now looking again.  I still don't like the new "structure" of the loop
> > > > > very much.
> > > > > A refactoring like the attached should make it easier to clearly separate the
> > > > > cases where we reach a vector def and where not.
> > > >
> > > > Now attached.
> > > >
> > > > > Do you want me to take over the patch?
> > > > >
> > >
> >
> > Here is the updated patch on top of your patch plus my fix.
>
> Thanks - when doing the constant vector I was thinking of the following patch
> to handle your cases.  It doesn't use insertion but eventually leaves that
> to a separate transform.  Instead it handles non-constants similar to constants
> by permuting from a uniform vector.  Thus
>
> __attribute__((noinline, noclone))
> __v4sf
> foo2 (__v4sf x, float f)
> {
>   __v4sf y = { f, x[1], x[2], x[3] };
>   return y;
> }
>
> becomes
>
>   _4 = {f_2(D), f_2(D), f_2(D), f_2(D)};
>   y_3 = VEC_PERM_EXPR <x_5(D), _4, { 4, 1, 2, 3 }>;
>   return y_3;
>
> this allows us to handle an arbitrary number of inserts of this
> single value.  It also ensures we can actually perform the
> permutation while for the insertion we currently do not have
> a convenient way to query whether the target can perform
> it efficiently (IIRC x86 needs AVX to insert to arbitrary lanes
> with a single instruction?).  Similarly if the user writes the above
> in source using __builtin_shuffle we'd want to optimize it as well.
>
> The patch as attached only passes some of your testcases,
> the following FAIL:
>
> FAIL: gcc.target/i386/pr88828-2a.c scan-assembler movss
> FAIL: gcc.target/i386/pr88828-2a.c scan-assembler-not movaps
> FAIL: gcc.target/i386/pr88828-2a.c scan-assembler-not movlhps
> FAIL: gcc.target/i386/pr88828-2a.c scan-assembler-not unpcklps
> FAIL: gcc.target/i386/pr88828-2b.c scan-assembler-times vpermilps 1
> FAIL: gcc.target/i386/pr88828-2b.c scan-assembler-times vmovss 1
> FAIL: gcc.target/i386/pr88828-2c.c scan-assembler movss
> FAIL: gcc.target/i386/pr88828-2c.c scan-assembler-not movaps
> FAIL: gcc.target/i386/pr88828-2c.c scan-assembler-not movlhps
> FAIL: gcc.target/i386/pr88828-2c.c scan-assembler-not unpcklps
> FAIL: gcc.target/i386/pr88828-2d.c scan-assembler-times vpermilps 1
> FAIL: gcc.target/i386/pr88828-2d.c scan-assembler-times vmovss 1
> FAIL: gcc.target/i386/pr88828-3a.c scan-assembler movss
> FAIL: gcc.target/i386/pr88828-3a.c scan-assembler-times shufps 2
> FAIL: gcc.target/i386/pr88828-3a.c scan-assembler-times movaps 1
> FAIL: gcc.target/i386/pr88828-3a.c scan-assembler-not movlhps
> FAIL: gcc.target/i386/pr88828-3a.c scan-assembler-not unpcklps
> FAIL: gcc.target/i386/pr88828-3b.c scan-assembler-times vpermilps 1
> FAIL: gcc.target/i386/pr88828-3b.c scan-assembler-times vinsertps 1
> FAIL: gcc.target/i386/pr88828-3b.c scan-assembler-not vshufps
> FAIL: gcc.target/i386/pr88828-3c.c scan-assembler-times vpermilps 1
> FAIL: gcc.target/i386/pr88828-3c.c scan-assembler-times vinsertps 1
> FAIL: gcc.target/i386/pr88828-3c.c scan-assembler-not vshufps
>
> Making the patch emit inserts for single insert locations is of course
> still possible but you get to arrive at heuristics like your choice
> of permuting the original lane into the later overwritten lane which
> might be a choice making the permute impossible or more expensive?
>
> The original purpose of simplify_vector_constructor was to simplify
> the IL, not so much optimal code-generation in the end but I wonder
> if we can rely on RTL expansion or later RTL optimization to do
> the optimal choices here?
>
> I guess simplify_permutation could perform a VEC_PERM
> into an insert if the remaining permutation would be a no-op
> but RTL optimization handles this case well already.
>
> Whether code-generation for a one vector permute plus insert or
> a two-vector permute is better in the end I don't know - at least
> the permute expansion has a chance to see the combined
> instruction.
>
> Do you think the remaining cases above can be handled in the
> backend?
>
> Comments?

I have now applied this after bootstrap / regtest on x86_64-unknown-linux-gnu
together with the part of the testcases that PASS.  Note I had to
add -fexcess-precision=standard to the -7.c one as it otherwise fails to
execute both patched and unpatched.

r271153.

I didn't check whether the remaining testcases simply need adjustments
(thus their code-gen is OK) or if there's something to do on the target
or in a GIMPLE transform.  That needs to be evaluated still.

There's also still the missed optimization of using VEC_UNPACK/PACK
codes for conversions.

Richard.

> Thanks,
> Richard.
>
> 2019-05-08  Richard Biener  <rguenther@suse.de>
>
>         PR tree-optimization/88828
>         * tree-ssa-forwprop.c (simplify_vector_constructor): Handle
>         permuting in a single non-constant element not extracted
>         from a vector.
>
>
> > --
> > H.J.