On Mon, 27 Dec 2021 at 15:54, Prathamesh Kulkarni wrote:
>
> On Fri, 17 Dec 2021 at 17:03, Richard Sandiford
> wrote:
> >
> > Prathamesh Kulkarni writes:
> > > Hi,
> > > The patch folds:
> > > lhs = svld1rq ({-1, -1, -1, ...}, &v[0])
> > > into:
> > > lhs = vec_perm_expr<v, v, {0, 1, 2, 3}>
> > > and expands the above vec_perm_expr using aarch64_expand_sve_dupq.
> > >
> > > With the patch, for the following test:
> > > #include <arm_neon.h>
> > > #include <arm_sve.h>
> > >
> > > svint32_t
> > > foo (int32x4_t x)
> > > {
> > >   return svld1rq (svptrue_b8 (), &x[0]);
> > > }
> > >
> > > it generates the following code:
> > > foo:
> > > .LFB4350:
> > >         dup     z0.q, z0.q[0]
> > >         ret
> > >
> > > and passes bootstrap+test on aarch64-linux-gnu.
> > > But I am not sure if the changes to aarch64_evpc_sve_tbl
> > > are correct.
> >
> > Just in case: I was only using int32x4_t in the PR as an example.
> > The same thing should work for all element types.
> >
> > > Thanks,
> > > Prathamesh
> > >
> > > diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> > > index 02e42a71e5e..e21bbec360c 100644
> > > --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> > > +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> > > @@ -1207,6 +1207,56 @@ public:
> > >      insn_code icode = code_for_aarch64_sve_ld1rq (e.vector_mode (0));
> > >      return e.use_contiguous_load_insn (icode);
> > >    }
> > > +
> > > +  gimple *
> > > +  fold (gimple_folder &f) const OVERRIDE
> > > +  {
> > > +    tree arg0 = gimple_call_arg (f.call, 0);
> > > +    tree arg1 = gimple_call_arg (f.call, 1);
> > > +
> > > +    /* Transform:
> > > +       lhs = svld1rq ({-1, -1, ... }, &v[0])
> > > +       into:
> > > +       lhs = vec_perm_expr<v, v, {0, 1, 2, 3}>
> > > +       on little-endian targets.  */
> > > +
> > > +    if (!BYTES_BIG_ENDIAN
> > > +        && integer_all_onesp (arg0)
> > > +        && TREE_CODE (arg1) == ADDR_EXPR)
> > > +      {
> > > +        tree t = TREE_OPERAND (arg1, 0);
> > > +        if (TREE_CODE (t) == ARRAY_REF)
> > > +          {
> > > +            tree index = TREE_OPERAND (t, 1);
> > > +            t = TREE_OPERAND (t, 0);
> > > +            if (integer_zerop (index) && TREE_CODE (t) == VIEW_CONVERT_EXPR)
> > > +              {
> > > +                t = TREE_OPERAND (t, 0);
> > > +                tree vectype = TREE_TYPE (t);
> > > +                if (VECTOR_TYPE_P (vectype)
> > > +                    && known_eq (TYPE_VECTOR_SUBPARTS (vectype), 4u)
> > > +                    && wi::to_wide (TYPE_SIZE (vectype)) == 128)
> > > +                  {
> >
> > Since this is quite a specific pattern match, and since we now lower
> > arm_neon.h vld1* to normal gimple accesses, I think we should try the
> > “more generally” approach mentioned in the PR and see what the fallout
> > is.  That is, keep:
> >
> >   if (!BYTES_BIG_ENDIAN
> >       && integer_all_onesp (arg0)
> >
> > If those conditions pass, create an Advanced SIMD access at address arg1,
> > using similar code to the handling of:
> >
> >   BUILTIN_VALL_F16 (LOAD1, ld1, 0, LOAD)
> >   BUILTIN_VDQ_I (LOAD1_U, ld1, 0, LOAD)
> >   BUILTIN_VALLP_NO_DI (LOAD1_P, ld1, 0, LOAD)
> >
> > in aarch64_general_gimple_fold_builtin.  (Would be good to move the
> > common code to aarch64.c so that both files can use it.)
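For concreteness, here is a minimal sketch of the fold being suggested:
drop the ADDR_EXPR pattern match, load one 128-bit Advanced SIMD vector
directly from arg1, and feed it to the VEC_PERM_EXPR.  It is an
illustration only, not the committed patch; the use of aarch64_vq_mode
to pick the Advanced SIMD mode and the element-aligned, may-alias
access type are assumptions about how such a fold could be written.

  gimple *
  fold (gimple_folder &f) const OVERRIDE
  {
    tree arg0 = gimple_call_arg (f.call, 0);
    tree arg1 = gimple_call_arg (f.call, 1);

    if (BYTES_BIG_ENDIAN || !integer_all_onesp (arg0))
      return NULL;

    tree lhs = gimple_call_lhs (f.call);
    tree lhs_type = TREE_TYPE (lhs);
    poly_uint64 lhs_len = TYPE_VECTOR_SUBPARTS (lhs_type);
    tree eltype = TREE_TYPE (lhs_type);

    /* The 128-bit Advanced SIMD type with the same element type as
       the SVE result, e.g. int32x4_t for an svint32_t lhs.  */
    scalar_mode elmode = GET_MODE_INNER (TYPE_MODE (lhs_type));
    machine_mode vq_mode = aarch64_vq_mode (elmode).require ();
    tree vectype = build_vector_type_for_mode (eltype, vq_mode);

    /* Emit tmp = MEM_REF <vectype> [(eltype *)arg1] before the call.
       Use a may-alias pointer type and element alignment, since
       svld1rq only assumes an element-aligned source.  */
    tree elt_ptr_type = build_pointer_type_for_mode (eltype, VOIDmode,
                                                     true);
    tree zero = build_zero_cst (elt_ptr_type);
    tree access_type = build_aligned_type (vectype, TYPE_ALIGN (eltype));
    tree tmp = make_ssa_name (access_type);
    gimple *load
      = gimple_build_assign (tmp, fold_build2 (MEM_REF, access_type,
                                               arg1, zero));
    gsi_insert_before (f.gsi, load, GSI_SAME_STMT);

    /* lhs = VEC_PERM_EXPR <tmp, tmp, {0, 1, ..., source_nelts - 1}>,
       i.e. a broadcast of the Advanced SIMD vector.  */
    int source_nelts = TYPE_VECTOR_SUBPARTS (access_type).to_constant ();
    vec_perm_builder sel (lhs_len, source_nelts, 1);
    for (int i = 0; i < source_nelts; i++)
      sel.quick_push (i);
    vec_perm_indices indices (sel, 1, source_nelts);
    gcc_checking_assert (can_vec_perm_const_p (TYPE_MODE (lhs_type),
                                               indices));
    tree mask_type = build_vector_type (ssizetype, lhs_len);
    tree mask = vec_perm_indices_to_tree (mask_type, indices);
    return gimple_build_assign (lhs, VEC_PERM_EXPR, tmp, tmp, mask);
  }

The can_vec_perm_const_p test appears here as the gcc_checking_assert
suggested below, on the expectation that the target always handles
this dup permutation once a dedicated expander exists.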
> >
> > > +                    tree lhs = gimple_call_lhs (f.call);
> > > +                    tree lhs_type = TREE_TYPE (lhs);
> > > +                    int source_nelts = TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
> > > +                    vec_perm_builder sel (TYPE_VECTOR_SUBPARTS (lhs_type), source_nelts, 1);
> > > +                    for (int i = 0; i < source_nelts; i++)
> > > +                      sel.quick_push (i);
> > > +
> > > +                    vec_perm_indices indices (sel, 1, source_nelts);
> > > +                    if (!can_vec_perm_const_p (TYPE_MODE (lhs_type), indices))
> > > +                      return NULL;
> >
> > I don't think we need to check this: it should always be true.
> > Probably worth keeping as a gcc_checking_assert though.
> >
> > > +
> > > +                    tree mask = vec_perm_indices_to_tree (lhs_type, indices);
> > > +                    return gimple_build_assign (lhs, VEC_PERM_EXPR, t, t, mask);
> > > +                  }
> > > +              }
> > > +          }
> > > +      }
> > > +
> > > +    return NULL;
> > > +  }
> > > };
> > >
> > > class svld1ro_impl : public load_replicate
> > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > > index f07330cff4f..af27f550be3 100644
> > > --- a/gcc/config/aarch64/aarch64.c
> > > +++ b/gcc/config/aarch64/aarch64.c
> > > @@ -23002,8 +23002,32 @@ aarch64_evpc_sve_tbl (struct expand_vec_perm_d *d)
> > >
> > >    machine_mode sel_mode = related_int_vector_mode (d->vmode).require ();
> > >    rtx sel = vec_perm_indices_to_rtx (sel_mode, d->perm);
> > > +
> > >    if (d->one_vector_p)
> > > -    emit_unspec2 (d->target, UNSPEC_TBL, d->op0, force_reg (sel_mode, sel));
> > > +    {
> > > +      bool use_dupq = false;
> > > +      /* Check if sel is a dup vector with encoded elements
> > > +         {0, 1, 2, ..., nelts - 1}.  */
> > > +      if (GET_CODE (sel) == CONST_VECTOR
> > > +          && !GET_MODE_NUNITS (GET_MODE (sel)).is_constant ()
> > > +          && CONST_VECTOR_DUPLICATE_P (sel))
> > > +        {
> > > +          unsigned nelts = const_vector_encoded_nelts (sel);
> > > +          unsigned i;
> > > +          for (i = 0; i < nelts; i++)
> > > +            {
> > > +              rtx elem = CONST_VECTOR_ENCODED_ELT (sel, i);
> > > +              if (!(CONST_INT_P (elem) && INTVAL (elem) == i))
> > > +                break;
> > > +            }
> > > +          if (i == nelts)
> > > +            use_dupq = true;
> > > +        }
> > > +
> > > +      if (use_dupq)
> > > +        aarch64_expand_sve_dupq (d->target, GET_MODE (d->target), d->op0);
> > > +      else
> > > +        emit_unspec2 (d->target, UNSPEC_TBL, d->op0, force_reg (sel_mode, sel));
> > > +    }
> >
> > This shouldn't be a TBL but a new operation, handled by its own
> > aarch64_evpc_sve_* routine.  The check for the mask should then
> > be done on d->perm, to detect whether the permutation is one
> > that the new routine supports.
> >
> > I think the requirements are:
> >
> > - !BYTES_BIG_ENDIAN
> > - the source must be an Advanced SIMD vector
> > - the destination must be an SVE vector
> > - the permutation must be a duplicate (tested in the code above)
> > - the number of “patterns” in the permutation must equal the number of
> >   source elements
> > - element X of the permutation must equal X (tested in the code above)
> >
> > The existing aarch64_evpc_* routines expect the source and target modes
> > to be the same, so we should only call them when that's true.
> Hi Richard,
> Thanks for the suggestions, and sorry for the late reply.
> Does the following patch look OK (sans the refactoring of building the
> mem_ref)?
> Passes bootstrap+test on aarch64-linux-gnu.
Hi Richard,
Since stage-1 has reopened, does the attached patch look OK to commit?

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Richard
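To make the checklist above concrete, a new routine along the lines of
the sketch below could sit alongside the other aarch64_evpc_* helpers.
This is an illustration, not the committed patch: the d->op_mode and
d->op_vec_flags fields describing the source operand are assumptions,
since today's expand_vec_perm_d has no such fields (the existing
routines require matching source and target modes), so the struct and
its callers would need extending first.

/* Sketch: recognize a permutation that duplicates a 128-bit Advanced
   SIMD operand across an SVE vector and expand it with DUP Zd.Q.
   d->op_mode and d->op_vec_flags are assumed extensions of
   expand_vec_perm_d recording the source operand's mode.  */

static bool
aarch64_evpc_sve_dup (struct expand_vec_perm_d *d)
{
  if (BYTES_BIG_ENDIAN
      || !d->one_vector_p
      || d->vec_flags != VEC_SVE_DATA      /* Destination is SVE.  */
      || d->op_vec_flags != VEC_ADVSIMD    /* Source is Advanced SIMD.  */
      || d->perm.encoding ().nelts_per_pattern () != 1
      /* The number of patterns must equal the number of source
         elements, so the permutation as a whole reads
         {0, 1, ..., n-1, 0, 1, ...}.  */
      || !known_eq (d->perm.encoding ().npatterns (),
                    GET_MODE_NUNITS (d->op_mode)))
    return false;

  /* Element X of the permutation must equal X.  */
  int npatterns = d->perm.encoding ().npatterns ();
  for (int i = 0; i < npatterns; i++)
    if (!known_eq (d->perm[i], i))
      return false;

  if (d->testing_p)
    return true;

  aarch64_expand_sve_dupq (d->target, GET_MODE (d->target), d->op0);
  return true;
}

aarch64_expand_vec_perm_const_1 would then try this routine before the
generic TBL fallback, leaving aarch64_evpc_sve_tbl itself unchanged.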