public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
* Help with ivopts
@ 2011-07-06 11:35 Richard Sandiford
  2011-07-06 13:10 ` Michael Matz
  0 siblings, 1 reply; 10+ messages in thread
From: Richard Sandiford @ 2011-07-06 11:35 UTC (permalink / raw)
  To: gcc

Consider:

  void
  loop (unsigned char *restrict ydst,
        unsigned char *restrict udst,
        unsigned char *restrict vdst,
        unsigned char *restrict src,
        int n)
  {
    int i;

    for (i = 0; i < n; i++)
      {
        ydst[2*i+0] = src[4*i+0];
        udst[i] = src[4*i+1];
        ydst[2*i+1] = src[4*i+2];
        vdst[i] = src[4*i+3];
      }
  }

(based on libav's yuy2toyv12 code).  Compiled for ARM with:

  -std=c99 -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp

the loop gets vectorised.  The code before ivopts looks like this:

<bb 4>:
  vect_p.31_116 = src_10(D);
  vect_p.40_123 = udst_14(D);
  vect_p.44_126 = ydst_6(D);
  vect_p.49_129 = vdst_31(D);
  [...]

<bb 5>:
  # vect_p.28_117 = PHI <vect_p.28_118(10), vect_p.31_116(4)>
  # vect_p.37_124 = PHI <vect_p.37_125(10), vect_p.40_123(4)>
  # vect_p.41_127 = PHI <vect_p.41_128(10), vect_p.44_126(4)>
  # vect_p.46_130 = PHI <vect_p.46_131(10), vect_p.49_129(4)>
  # ivtmp.50_132 = PHI <ivtmp.50_133(10), 0(4)>
  vect_array.32 = LOAD_LANES (MEM[(unsigned char *)vect_p.28_117]);
  vect_var_.33_119 = vect_array.32_I_lsm0.54_47;
  vect_var_.34_120 = vect_array.32_I_lsm0.53_109;
  vect_var_.35_121 = vect_array.32_I_lsm0.52_114;
  vect_var_.36_122 = vect_array.32_I_lsm0.51_113;
  MEM[(unsigned char *)vect_p.37_124] = vect_var_.34_120;
  vect_array.45[0] = vect_var_.33_119;
  vect_array.45[1] = vect_var_.35_121;
  MEM[(unsigned char *)vect_p.41_127] = STORE_LANES (vect_array.45);
  MEM[(unsigned char *)vect_p.46_130] = vect_var_.36_122;
  vect_p.28_118 = vect_p.28_117 + 32;
  vect_p.37_125 = vect_p.37_124 + 8;
  vect_p.41_128 = vect_p.41_127 + 16;
  vect_p.46_131 = vect_p.46_130 + 8;
  ivtmp.50_133 = ivtmp.50_132 + 1;
  if (ivtmp.50_133 < bnd.25_70)
    goto <bb 10>;
  else
    goto <bb 6>;
[...]
<bb 10>:
  goto <bb 5>;

We record these uses:

use 1
  generic
  in statement vect_p.37_124 = PHI <vect_p.37_125(10), vect_p.40_123(4)>

  at position 
  type vector(8) unsigned char * restrict
  base (vector(8) unsigned char * restrict) udst_14(D)
  step 8
  base object (void *) udst_14(D)
  is a biv
  related candidates 
use 3
  generic
  in statement vect_p.46_130 = PHI <vect_p.46_131(10), vect_p.49_129(4)>

  at position 
  type vector(8) unsigned char * restrict
  base (vector(8) unsigned char * restrict) vdst_31(D)
  step 8
  base object (void *) vdst_31(D)
  is a biv
  related candidates 

Note how they are "generic" rather than "address"es.

We also have the candidates:

candidate 5 (important)
  var_before ivtmp.81
  var_after ivtmp.81
  incremented before exit test
  type unsigned int
  base (unsigned int) udst_14(D)
  step 8
  base object (void *) udst_14(D)
[...]
candidate 7 (important)
  original biv
  type unsigned int
  base (unsigned int) udst_14(D)
  step 8
  base object (void *) udst_14(D)
[...]
candidate 13 (important)
  var_before ivtmp.87
  var_after ivtmp.87
  incremented before exit test
  type unsigned int
  base (unsigned int) vdst_31(D)
  step 8
  base object (void *) vdst_31(D)
candidate 14 (important)
  original biv
  type unsigned int
  base (unsigned int) vdst_31(D)
  step 8
  base object (void *) vdst_31(D)

The problem (from an ARM POV) is that we end using candidate 5 for use 3:

<bb 5>:
  [...]
  D.3678_95 = (unsigned int) vdst_31(D);
  D.3679_93 = (unsigned int) udst_14(D);
  D.3680_89 = D.3678_95 - D.3679_93;
  D.3681_87 = D.3680_89 + ivtmp.81_94;
  D.3682_83 = (vector(8) unsigned char * restrict) D.3681_87;
[vdst + i:]
  vect_p.46_130 = D.3682_83;
[udst + i:]
  vect_p.37_124 = (vector(8) unsigned char * restrict) ivtmp.81_94;
  [...]

This is based on the following costs:

Use 3:
  cand	cost	compl.	depends on
  0	8	0	 inv_expr:18
  5	4	0	 inv_expr:19
  6	4	0	 inv_expr:18
  7	4	0	 inv_expr:19
  8	4	1	 inv_expr:20
  13	0	0	
  14	0	0	
  15	4	1	
  16	8	0	 inv_expr:18
  17	8	1	 inv_expr:21

The cost of using candidate 5 for use 3 is calculated as:

  /* use = ubase + ratio * (var - cbase).  If either cbase is a constant
     or ratio == 1, it is better to handle this like

     ubase - ratio * cbase + ratio * var

     (also holds in the case ratio == -1, TODO.  */
  [...]
  else if (ratio == 1)
    {
      tree real_cbase = cbase;

      /* Check to see if any adjustment is needed.  */
      if (cstepi == 0 && stmt_is_after_inc)
        {
          aff_tree real_cbase_aff;
          aff_tree cstep_aff;

          tree_to_aff_combination (cbase, TREE_TYPE (real_cbase),
                                   &real_cbase_aff);
          tree_to_aff_combination (cstep, TREE_TYPE (cstep), &cstep_aff);

          aff_combination_add (&real_cbase_aff, &cstep_aff);
          real_cbase = aff_combination_to_tree (&real_cbase_aff);
        }

      cost = difference_cost (data,
			      ubase, real_cbase,
			      &symbol_present, &var_present, &offset,
			      depends_on);
      cost.cost /= avg_loop_niter (data->current_loop);
    }
  [...]
  cost.cost += add_cost (TYPE_MODE (ctype), speed);

The individual difference_cost and add_cost seem reasonable (4 in each case).
I don't understand the reasoning behind the division though.  Is the idea
that this should be hoisted?  If so, then:

(a) That doesn't happen at the tree level.  The subtraction is still inside
    the loop at RTL generation time.

(b) What's the advantage of introducing a new hoisted subtraction that
    is going to be live throughout the loop, and then adding another IV
    to it inside the loop, over using the original IV and incrementing it
    in the normal way?

There is code to stop this happening for uses that are marked as addresses:

  if (address_p)
    {
      /* Do not try to express address of an object with computation based
	 on address of a different object.  This may cause problems in rtl
	 level alias analysis (that does not expect this to be happening,
	 as this is illegal in C), and would be unlikely to be useful
	 anyway.  */
      if (use->iv->base_object
	  && cand->iv->base_object
	  && !operand_equal_p (use->iv->base_object, cand->iv->base_object, 0))
	return infinite_cost;
    }

but it doesn't trigger for these uses because we treat them as "generic".
That could fixed by, e.g., treating them as addresses, or by using
POINTER_TYPE_P instead of or as well as address_p in the condition above.
But it seems like a slightly odd transformation in any case.

The reason I'm interested is that NEON has auto-increment instructions.
This transformation stops us using them for vdst and (because of
unfortunate instruction ordering) on udst as well.

Richard

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2011-07-06 17:55 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-06 11:35 Help with ivopts Richard Sandiford
2011-07-06 13:10 ` Michael Matz
2011-07-06 13:32   ` Richard Sandiford
2011-07-06 13:43     ` Richard Sandiford
2011-07-06 14:40       ` Michael Matz
2011-07-06 15:04         ` Richard Sandiford
2011-07-06 15:17           ` Michael Matz
2011-07-06 15:34             ` Richard Sandiford
2011-07-06 15:56               ` Michael Matz
2011-07-06 17:55                 ` Richard Sandiford

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).