On Fri, 17 Dec 2021 at 17:03, Richard Sandiford wrote:
>
> Prathamesh Kulkarni writes:
> > Hi,
> > The patch folds:
> > lhs = svld1rq ({-1, -1, -1, ...}, &v[0])
> > into:
> > lhs = vec_perm_expr
> > and expands the above vec_perm_expr using aarch64_expand_sve_dupq.
> >
> > With the patch, for the following test:
> > #include <arm_neon.h>
> > #include <arm_sve.h>
> >
> > svint32_t
> > foo (int32x4_t x)
> > {
> >   return svld1rq (svptrue_b8 (), &x[0]);
> > }
> >
> > it generates the following code:
> > foo:
> > .LFB4350:
> >         dup     z0.q, z0.q[0]
> >         ret
> >
> > and passes bootstrap+test on aarch64-linux-gnu.
> > But I am not sure if the changes to aarch64_evpc_sve_tbl
> > are correct.
>
> Just in case: I was only using int32x4_t in the PR as an example.
> The same thing should work for all element types.
>
> >
> > Thanks,
> > Prathamesh
> >
> > diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> > index 02e42a71e5e..e21bbec360c 100644
> > --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> > +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> > @@ -1207,6 +1207,56 @@ public:
> >      insn_code icode = code_for_aarch64_sve_ld1rq (e.vector_mode (0));
> >      return e.use_contiguous_load_insn (icode);
> >    }
> > +
> > +  gimple *
> > +  fold (gimple_folder &f) const OVERRIDE
> > +  {
> > +    tree arg0 = gimple_call_arg (f.call, 0);
> > +    tree arg1 = gimple_call_arg (f.call, 1);
> > +
> > +    /* Transform:
> > +       lhs = svld1rq ({-1, -1, ... }, &v[0])
> > +       into:
> > +       lhs = vec_perm_expr.
> > +       on little endian target.  */
> > +
> > +    if (!BYTES_BIG_ENDIAN
> > +        && integer_all_onesp (arg0)
> > +        && TREE_CODE (arg1) == ADDR_EXPR)
> > +      {
> > +        tree t = TREE_OPERAND (arg1, 0);
> > +        if (TREE_CODE (t) == ARRAY_REF)
> > +          {
> > +            tree index = TREE_OPERAND (t, 1);
> > +            t = TREE_OPERAND (t, 0);
> > +            if (integer_zerop (index) && TREE_CODE (t) == VIEW_CONVERT_EXPR)
> > +              {
> > +                t = TREE_OPERAND (t, 0);
> > +                tree vectype = TREE_TYPE (t);
> > +                if (VECTOR_TYPE_P (vectype)
> > +                    && known_eq (TYPE_VECTOR_SUBPARTS (vectype), 4u)
> > +                    && wi::to_wide (TYPE_SIZE (vectype)) == 128)
> > +                  {
>
> Since this is quite a specific pattern match, and since we now lower
> arm_neon.h vld1* to normal gimple accesses, I think we should try the
> “more generally” approach mentioned in the PR and see what the fallout
> is.  That is, keep:
>
>   if (!BYTES_BIG_ENDIAN
>       && integer_all_onesp (arg0)
>
> If those conditions pass, create an Advanced SIMD access at address arg1,
> using similar code to the handling of:
>
>   BUILTIN_VALL_F16 (LOAD1, ld1, 0, LOAD)
>   BUILTIN_VDQ_I (LOAD1_U, ld1, 0, LOAD)
>   BUILTIN_VALLP_NO_DI (LOAD1_P, ld1, 0, LOAD)
>
> in aarch64_general_gimple_fold_builtin.  (Would be good to move the
> common code to aarch64.c so that both files can use it.)
>
> > +                    tree lhs = gimple_call_lhs (f.call);
> > +                    tree lhs_type = TREE_TYPE (lhs);
> > +                    int source_nelts = TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
> > +                    vec_perm_builder sel (TYPE_VECTOR_SUBPARTS (lhs_type), source_nelts, 1);
> > +                    for (int i = 0; i < source_nelts; i++)
> > +                      sel.quick_push (i);
> > +
> > +                    vec_perm_indices indices (sel, 1, source_nelts);
> > +                    if (!can_vec_perm_const_p (TYPE_MODE (lhs_type), indices))
> > +                      return NULL;
>
> I don't think we need to check this: it should always be true.
> Probably worth keeping as a gcc_checking_assert though.
>
> > +
> > +                    tree mask = vec_perm_indices_to_tree (lhs_type, indices);
> > +                    return gimple_build_assign (lhs, VEC_PERM_EXPR, t, t, mask);
> > +                  }
> > +              }
> > +          }
> > +      }
> > +
> > +    return NULL;
> > +  }
> >  };
> >
> >  class svld1ro_impl : public load_replicate
> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > index f07330cff4f..af27f550be3 100644
> > --- a/gcc/config/aarch64/aarch64.c
> > +++ b/gcc/config/aarch64/aarch64.c
> > @@ -23002,8 +23002,32 @@ aarch64_evpc_sve_tbl (struct expand_vec_perm_d *d)
> >
> >    machine_mode sel_mode = related_int_vector_mode (d->vmode).require ();
> >    rtx sel = vec_perm_indices_to_rtx (sel_mode, d->perm);
> > +
> >    if (d->one_vector_p)
> > -    emit_unspec2 (d->target, UNSPEC_TBL, d->op0, force_reg (sel_mode, sel));
> > +    {
> > +      bool use_dupq = false;
> > +      /* Check if sel is dup vector with encoded elements {0, 1, 2, ... nelts}  */
> > +      if (GET_CODE (sel) == CONST_VECTOR
> > +          && !GET_MODE_NUNITS (GET_MODE (sel)).is_constant ()
> > +          && CONST_VECTOR_DUPLICATE_P (sel))
> > +        {
> > +          unsigned nelts = const_vector_encoded_nelts (sel);
> > +          unsigned i;
> > +          for (i = 0; i < nelts; i++)
> > +            {
> > +              rtx elem = CONST_VECTOR_ENCODED_ELT(sel, i);
> > +              if (!(CONST_INT_P (elem) && INTVAL(elem) == i))
> > +                break;
> > +            }
> > +          if (i == nelts)
> > +            use_dupq = true;
> > +        }
> > +
> > +      if (use_dupq)
> > +        aarch64_expand_sve_dupq (d->target, GET_MODE (d->target), d->op0);
> > +      else
> > +        emit_unspec2 (d->target, UNSPEC_TBL, d->op0, force_reg (sel_mode, sel));
> > +    }
>
> This shouldn't be a TBL but a new operation, handled by its own
> aarch64_evpc_sve_* routine.  The check for the mask should then
> be done on d->perm, to detect whether the permutation is one
> that the new routine supports.
>
> I think the requirements are:
>
> - !BYTES_BIG_ENDIAN
> - the source must be an Advanced SIMD vector
> - the destination must be an SVE vector
> - the permutation must be a duplicate (tested in the code above)
> - the number of “patterns” in the permutation must equal the number of
>   source elements
> - element X of the permutation must equal X (tested in the code above)
>
> The existing aarch64_evpc_* routines expect the source and target modes
> to be the same, so we should only call them when that's true.

Hi Richard,
Thanks for the suggestions, and sorry for the late reply.
Does the following patch look OK (sans the refactoring of building mem_ref)?
Passes bootstrap+test on aarch64-linux-gnu.

Thanks,
Prathamesh

>
> Thanks,
> Richard
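
For readers following the thread, here is a rough sketch of the shape Richard
is suggesting for the gimple fold: build an element-aligned Advanced SIMD
MEM_REF at arg1 (mirroring the LOAD1 handling in
aarch64_general_gimple_fold_builtin) and duplicate it with a VEC_PERM_EXPR.
This is illustrative only -- it is neither the draft quoted above nor the
revised patch Prathamesh refers to; the function name is invented, and the use
of aarch64_vq_mode to obtain the 128-bit vector mode is an assumption.

  /* Illustrative sketch only.  Assumes the !BYTES_BIG_ENDIAN and
     integer_all_onesp (arg0) checks have already passed, that F.GSI
     points at the svld1rq call, and that the middle end accepts a
     VEC_PERM_EXPR whose inputs are fixed-length Advanced SIMD vectors
     and whose result is an SVE vector.  */
  gimple *
  fold_ld1rq_to_vec_perm (gimple_folder &f)
  {
    tree arg1 = gimple_call_arg (f.call, 1);
    tree lhs = gimple_call_lhs (f.call);
    tree lhs_type = TREE_TYPE (lhs);
    tree elt_type = TREE_TYPE (lhs_type);
    poly_uint64 lhs_len = TYPE_VECTOR_SUBPARTS (lhs_type);

    /* The 128-bit Advanced SIMD vector type with the same element type
       as the SVE result.  */
    machine_mode vq_mode
      = aarch64_vq_mode (SCALAR_TYPE_MODE (elt_type)).require ();
    tree vq_type = build_vector_type_for_mode (elt_type, vq_mode);

    /* Build the memory access the way the LOAD1 handling does:
       a MEM_REF at ARG1 whose access type only carries the element
       alignment.  */
    tree elt_ptr_type
      = build_pointer_type_for_mode (elt_type, VOIDmode, true);
    tree zero = build_zero_cst (elt_ptr_type);
    tree access_type = build_aligned_type (vq_type, TYPE_ALIGN (elt_type));
    tree tmp = make_ssa_name (access_type);
    gimple *load
      = gimple_build_assign (tmp, fold_build2 (MEM_REF, access_type,
                                               arg1, zero));
    gsi_insert_before (f.gsi, load, GSI_SAME_STMT);

    /* Duplicate the 128-bit vector across the SVE result with a
       VEC_PERM_EXPR whose selector repeats {0, 1, ..., nelts - 1}.  */
    int src_nelts = TYPE_VECTOR_SUBPARTS (vq_type).to_constant ();
    vec_perm_builder sel (lhs_len, src_nelts, 1);
    for (int i = 0; i < src_nelts; i++)
      sel.quick_push (i);
    vec_perm_indices indices (sel, 1, src_nelts);
    gcc_checking_assert (can_vec_perm_const_p (TYPE_MODE (lhs_type), indices));

    tree mask_type = build_vector_type (ssizetype, lhs_len);
    tree mask = vec_perm_indices_to_tree (mask_type, indices);
    return gimple_build_assign (lhs, VEC_PERM_EXPR, tmp, tmp, mask);
  }

Compared with the ADDR_EXPR/ARRAY_REF pattern match in the draft above, this
shape handles any pointer argument rather than only &v[0] of a 128-bit vector,
which is the "more general" direction Richard asks for.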
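
On the expander side, the new aarch64_evpc_sve_* routine Richard describes
might be shaped roughly as below.  Again, a sketch only: the routine name is
invented, and since expand_vec_perm_d carried a single mode at the time of
this exchange, the way the Advanced SIMD source mode reaches the routine (the
op_mode parameter) is an assumption.

  /* Hypothetical sketch: recognize a permutation that duplicates a
     128-bit Advanced SIMD vector (of mode OP_MODE) across an SVE
     destination and expand it with aarch64_expand_sve_dupq.  The checks
     mirror the requirements listed above, applied to d->perm rather
     than to the expanded rtx selector.  */
  static bool
  aarch64_evpc_sve_dupq (struct expand_vec_perm_d *d, machine_mode op_mode)
  {
    /* Little-endian only, one input, 128-bit fixed-length source,
       SVE destination.  */
    if (BYTES_BIG_ENDIAN
        || !d->one_vector_p
        || d->vec_flags != VEC_SVE_DATA
        || !known_eq (GET_MODE_BITSIZE (op_mode), 128))
      return false;

    /* The permutation must be a duplicate: one element per pattern and
       as many patterns as there are source elements.  */
    unsigned int src_nelts = GET_MODE_NUNITS (op_mode).to_constant ();
    if (d->perm.encoding ().nelts_per_pattern () != 1
        || d->perm.encoding ().npatterns () != src_nelts)
      return false;

    /* Element X of the permutation must equal X.  */
    for (unsigned int i = 0; i < src_nelts; i++)
      if (!known_eq (d->perm[i], i))
        return false;

    if (d->testing_p)
      return true;

    aarch64_expand_sve_dupq (d->target, GET_MODE (d->target), d->op0);
    return true;
  }

The dispatcher (aarch64_expand_vec_perm_const_1) would presumably try such a
routine before the TBL fallback, leaving aarch64_evpc_sve_tbl itself
unchanged, in line with Richard's request that the mask be checked on d->perm
rather than on the expanded selector rtx.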