Hi Jeff & Richard,

> If you can turn that example into a test, even if it's just in the
> aarch64 directory, that would be helpful

The second patch (2/2) has various tests for this, as the cost model had to
be made more accurate for it to work.

> 
> As mentioned in the 2/2 thread, I think we should use subregs for
> the case where they're canonical.  It'd probably be worth adding a
> simplify-rtx.c helper to extract one element from a vector, e.g.:
> 
>   rtx simplify_gen_vec_select (rtx op, unsigned int index);
> 
> so that this is easier to do.
> 
> Does making the loop above per-element mean that, for 128-bit Advanced
> SIMD, the optimisation “only” kicks in for 64-bit element sizes?
> Perhaps for other element sizes we could do “top” and “bottom” halves.
> (There's obviously no need to do that as part of this work, was just
> wondering.)
> 

It should handle extraction of any element size, so it's able to use a value
in any arbitrary location.  CSE already handles low/high re-use optimally.  So e.g.

#include <arm_neon.h>

extern int16x8_t bar (int16x8_t, int16x8_t);

int16x8_t foo ()
{
    int16_t s[4] = {1,2,3,4};
    int16_t d[8] = {1,2,3,4,5,6,7,8};

    int16x4_t r1 = vld1_s16 (s);
    int16x8_t r2 = vcombine_s16 (r1, r1);
    int16x8_t r3 = vld1q_s16 (d);
    return bar (r2, r3);
}

is a case it can handle, but our cost model currently blocks it because we
never costed vec_consts.  Without the 2/2 patch we generate:

foo:
        stp     x29, x30, [sp, -48]!
        adrp    x0, .LC0
        mov     x29, sp
        ldr     q1, [x0, #:lo12:.LC0]
        adrp    x0, .LC1
        ldr     q0, [x0, #:lo12:.LC1]
        adrp    x0, .LC2
        str     q1, [sp, 32]
        ldr     d2, [x0, #:lo12:.LC2]
        str     d2, [sp, 24]
        bl      bar
        ldp     x29, x30, [sp], 48
        ret
.LC0:
        .hword  1
        .hword  2
        .hword  3
        .hword  4
        .hword  5
        .hword  6
        .hword  7
        .hword  8
.LC1:
        .hword  1
        .hword  2
        .hword  3
        .hword  4
        .hword  1
        .hword  2
        .hword  3
        .hword  4

but with the 2/2 patch:

foo:
        stp     x29, x30, [sp, -48]!
        adrp    x0, .LC0
        mov     x29, sp
        ldr     d2, [x0, #:lo12:.LC0]
        adrp    x0, .LC1
        ldr     q1, [x0, #:lo12:.LC1]
        str     d2, [sp, 24]
        dup     d0, v2.d[0]
        str     q1, [sp, 32]
        ins     v0.d[1], v2.d[0]
        bl      bar
        ldp     x29, x30, [sp], 48
        ret
.LC1:
        .hword  1
        .hword  2
        .hword  3
        .hword  4
        .hword  5
        .hword  6
        .hword  7
        .hword  8

It's not entirely optimal of course, but it is a step forward.  I think when we
fix the vld's this should then become optimal, as currently the MEMs are causing
it to not re-use those values.

> >        else
> >  	sets[n_sets++].rtl = x;
> >      }
> > @@ -4513,7 +4533,14 @@ cse_insn (rtx_insn *insn)
> >    struct set *sets = (struct set *) 0;
> >  
> >    if (GET_CODE (x) == SET)
> > -    sets = XALLOCA (struct set);
> > +    {
> > +      /* For CONST_VECTOR we wants to be able to CSE the vector itself along with
> > +	 elements inside the vector if the target says it's cheap.  */
> > +      if (GET_CODE (SET_SRC (x)) == CONST_VECTOR)
> > +	sets = XALLOCAVEC (struct set, const_vector_encoded_nelts (SET_SRC (x)) + 1);
> > +      else
> > +	sets = XALLOCA (struct set);
> > +    }
> >    else if (GET_CODE (x) == PARALLEL)
> >      sets = XALLOCAVEC (struct set, XVECLEN (x, 0));
> 
> I think this would be easier if “sets” was first converted to an
> auto_vec, say auto_vec<struct set, 8>.  We then wouldn't need to
> predict in advance how many elements are needed.
> 

Done.
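
For anyone following along, the shape of that change is roughly the following.
This is only a stand-alone sketch: std::vector stands in for auto_vec, and
set_model and collect_sets are made-up names for illustration, not what is in
cse.c.

```cpp
#include <vector>

// Simplified stand-in for cse.c's "struct set"; only one field is modelled.
struct set_model
{
  int rtl;
};

// Hypothetical helper: record the whole constant vector plus each of its
// elements as separate sets.  The container grows on demand, so the caller
// does not have to predict the number of sets before allocating.
std::vector<set_model>
collect_sets (int whole, const std::vector<int> &elts)
{
  std::vector<set_model> sets;  // would be auto_vec<struct set, 8> in GCC
  sets.push_back ({whole});
  for (int e : elts)
    sets.push_back ({e});
  return sets;
}
```

The point is just that the XALLOCA/XALLOCAVEC size computation goes away once
the container can grow.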

> > @@ -4997,6 +5024,26 @@ cse_insn (rtx_insn *insn)
> >  	  src_related_is_const_anchor = src_related != NULL_RTX;
> >  	}
> >  
> > +      /* Try to re-materialize a vec_dup with an existing constant.   */
> > +      if (GET_CODE (src) == CONST_VECTOR
> > +	  && const_vector_encoded_nelts (src) == 1)
> > +	{
> > +	   rtx const_rtx = CONST_VECTOR_ELT (src, 0);
> 
> Would be simpler as:
> 
>   rtx src_elt;
>   if (const_vec_duplicate_p (src, &src_elt))
> 
> I think we should also check !src_eqv_here, or perhaps:
> 
>   (!src_eqv_here || CONSTANT_P (src_eqv_here))
> 
> so that we don't override any existing reg notes, which could have more
> chance of succeeding.
> 

Done.
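
As a rough illustration of the guard, here is a stand-alone model of the check.
It assumes the second form suggested above (allowing a constant existing
equivalence); is_duplicate and should_try_vec_dup are hypothetical names and
the vectors are plain ints rather than RTL.

```cpp
#include <vector>

// Hypothetical stand-in for const_vec_duplicate_p: returns true and stores
// the repeated value in *elt when every lane of VEC holds the same value.
bool
is_duplicate (const std::vector<int> &vec, int *elt)
{
  if (vec.empty ())
    return false;
  for (int v : vec)
    if (v != vec[0])
      return false;
  *elt = vec[0];
  return true;
}

// Only try the vec_dup route when the source is a duplicate and there is
// either no recorded equivalence or the recorded one is itself a constant,
// so that a potentially more useful existing note is not overridden.
bool
should_try_vec_dup (const std::vector<int> &src, bool has_eqv,
		    bool eqv_is_constant, int *elt)
{
  return is_duplicate (src, elt) && (!has_eqv || eqv_is_constant);
}
```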

> > +	   machine_mode const_mode = GET_MODE_INNER (GET_MODE (src));
> > +	   struct table_elt *related_elt
> > +		= lookup (const_rtx, HASH (const_rtx, const_mode), const_mode);
> > +	   if (related_elt)
> > +	    {
> > +	      for (related_elt = related_elt->first_same_value;
> > +		   related_elt; related_elt = related_elt->next_same_value)
> > +		if (REG_P (related_elt->exp))
> > +		  {
> > +		    src_eqv_here
> > +			= gen_rtx_VEC_DUPLICATE (GET_MODE (src),
> > +						 related_elt->exp);
> > +		  }
> 
> Other similar loops seem to break after the first match, instead of
> picking the last match.
> 

Done.
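
For clarity, the loop now behaves like this simplified model, which stops at
the first register in the equivalence chain.  elt_model and first_reg are
illustrative names only; the real code walks struct table_elt.

```cpp
// Simplified model of one entry in a same-value chain.
struct elt_model
{
  bool is_reg;                       // models REG_P (elt->exp)
  int regno;                         // which register, if is_reg
  const elt_model *next_same_value;  // next equivalent expression
};

// Return the first entry in the chain that is a register, or nullptr if
// there is none.  Returning immediately is the "break after the first
// match" behaviour; the earlier version kept going and used the last match.
const elt_model *
first_reg (const elt_model *first)
{
  for (const elt_model *e = first; e; e = e->next_same_value)
    if (e->is_reg)
      return e;
  return nullptr;
}
```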

Bootstrapped and regtested on aarch64-none-linux-gnu and x86_64-pc-linux-gnu
with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* cse.c (add_to_set): New.
	(find_sets_in_insn): Register constants in sets.
	(canonicalize_insn): Use auto_vec instead.
	(cse_insn): Try materializing using vec_dup.
	* rtl.h (simplify_context::simplify_gen_vec_select,
	simplify_gen_vec_select): New.
	* simplify-rtx.c (simplify_context::simplify_gen_vec_select): New.

> Thanks,
> Richard
> 
> > +	    }
> > +	}
> >  
> >        if (src == src_folded)
> >  	src_folded = 0;

--