From: Vladimir Makarov <vmakarov@redhat.com>
To: GCC Patches <gcc-patches@gcc.gnu.org>, rdsandiford@googlemail.com
Subject: Re: RFC: LRA for x86/x86-64 [7/9] -- continuation
Date: Thu, 11 Oct 2012 01:04:00 -0000
Message-ID: <507615D7.10601@redhat.com>
In-Reply-To: <871uhfvmgw.fsf@richards-thinkpad-2.stglab.manchester.uk.ibm.com>

On 12-10-03 7:11 AM, Richard Sandiford wrote:
> Hi Vlad,
>
> Some comments on lra-spills.c and lra-coalesce.c.
>
>> +   The pass creates necessary stack slots and assign spilled pseudos
>> +   to the stack slots in following way:
> s/assign/assigns/
Fixed.
>> +   (or insn memory constraints) might be not satisfied any more.
> s/might be not/might not be/
Fixed.
>> +   For some targets, the pass can spill some pseudos into hard
>> +   registers of different class (usually into vector registers)
>> +   instead of spilling them into memory if it is possible and
>> +   profitable.	Spilling GENERAL_REGS pseudo into SSE registers for
>> +   modern Intel x86/x86-64 processors is an example of such
>> +   optimization.  And this is actually recommended by Intel
>> +   optimization guide.
> Maybe mention core i7 specifically?  "Modern" is a bit dangerous
> in code that'll live a long time.
Yes, right.  Fixed.  Another bad thing would be using the word "new".
>> +/* The structure describes a stack slot which can be used for several
>> +   spilled pseudos.  */
>> +struct slot
>> +{
> Looks like this describes "a register or stack slot" given the hard_regno case.
Fixed.
>> +/* Array containing info about the stack slots.	 The array element is
>> +   indexed by the stack slot number in the range [0..slost_num).  */
> Typo: slots_num
Fixed.
>> +  /* Each pseudo has an inherent size which comes from its own mode,
>> +     and a total size which provides room for paradoxical subregs
>> +     which refer to the pseudo reg in wider modes.
>> +
>> +     We can use a slot already allocated if it provides both enough
>> +     inherent space and enough total space.  Otherwise, we allocate a
>> +     new slot, making sure that it has no less inherent space, and no
>> +     less total space, then the previous slot.	*/
> The second part of the comment seems a bit misplaced, since the following
> code doesn't reuse stack slots.  This is done elsewhere instead.
> Maybe the first part would be better above the inherent_size assignment.
Right.  I've changed the comment to reflect the current state of the code.
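For reference, the first half of the comment now sits directly above the
size computation, roughly like this (a sketch, not necessarily the exact
committed text):

  /* Each pseudo has an inherent size which comes from its own mode,
     and a total size which provides room for paradoxical subregs
     which refer to the pseudo reg in wider modes.  */
  inherent_size = PSEUDO_REGNO_BYTES (i);
  total_size = MAX (inherent_size,
		    GET_MODE_SIZE (lra_reg_info[i].biggest_mode));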
>> +  /* If we have any adjustment to make, or if the stack slot is the
>> +     wrong mode, make a new stack slot.	 */
>> +  x = adjust_address_nv (x, GET_MODE (regno_reg_rtx[i]), adjust);
> We don't make a new slot here.
I removed the comment.  The same comment is present in reload1.c and 
should probably be removed there too.
>> +/* Sort pseudos according their slot numbers putting ones with smaller
>> +   numbers first, or last when the frame pointer is not needed.	 So
>> +   pseudos with the first slot will be finally addressed with smaller
>> +   address displacement.  */
>> +static int
>> +pseudo_reg_slot_compare (const void *v1p, const void *v2p)
>> +{
>> +  const int regno1 = *(const int *) v1p;
>> +  const int regno2 = *(const int *) v2p;
>> +  int diff, slot_num1, slot_num2;
>> +  int total_size1, total_size2;
>> +
>> +  slot_num1 = pseudo_slots[regno1].slot_num;
>> +  slot_num2 = pseudo_slots[regno2].slot_num;
>> +  if ((diff = slot_num1 - slot_num2) != 0)
>> +    return (frame_pointer_needed
>> +	    || !FRAME_GROWS_DOWNWARD == STACK_GROWS_DOWNWARD ? diff : -diff);
> The comment doesn't quite describe the condition.  Maybe:
>
> /* Sort pseudos according to their slots, putting the slots in the order
>     that they should be allocated.  Slots with lower numbers have the highest
>     priority and should get the smallest displacement from the stack or
>     frame pointer (whichever is being used).
>
>     The first allocated slot is always closest to the frame pointer,
>     so prefer lower slot numbers when frame_pointer_needed.  If the stack
>     and frame grow in the same direction, then the first allocated slot is
>     always closest to the initial stack pointer and furthest away from the
>     final stack pointer, so allocate higher numbers first when using the
>     stack pointer in that case.  The reverse is true if the stack and
>     frame grow in opposite directions.  */
I used your comment.  Thanks.
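One note for readers of the archive: since ! binds tighter than ==, the
condition parses as (!FRAME_GROWS_DOWNWARD) == STACK_GROWS_DOWNWARD,
i.e. it is true exactly when the stack and frame grow in opposite
directions.  Parenthesizing would make that explicit, e.g. (a sketch):

  if ((diff = slot_num1 - slot_num2) != 0)
    return (frame_pointer_needed
	    /* True when the stack and frame grow in opposite
	       directions.  */
	    || (!FRAME_GROWS_DOWNWARD) == STACK_GROWS_DOWNWARD
	    ? diff : -diff);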
>> +  total_size1 = MAX (PSEUDO_REGNO_BYTES (regno1),
>> +		     GET_MODE_SIZE (lra_reg_info[regno1].biggest_mode));
>> +  total_size2 = MAX (PSEUDO_REGNO_BYTES (regno2),
>> +		     GET_MODE_SIZE (lra_reg_info[regno2].biggest_mode));
>> +  if ((diff = total_size2 - total_size1) != 0)
>> +    return diff;
> I think this could do with a bit more commentary.  When is biggest_mode
> ever smaller than PSEUDO_REGNO_BYTES?  Is that for pseudos that are only
> ever referenced as lowpart subregs?  If so, why does PSEUDO_REGNO_BYTES
> matter for those registers here but not when calculating biggest_mode?
The MAX makes no sense to me either (it was probably wrongly adapted 
from somewhere).  So I removed the MAX.
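The comparison now simply uses biggest_mode (a sketch of the simplified
code; biggest_mode is expected to cover the pseudo's own mode here):

  total_size1 = GET_MODE_SIZE (lra_reg_info[regno1].biggest_mode);
  total_size2 = GET_MODE_SIZE (lra_reg_info[regno2].biggest_mode);
  if ((diff = total_size2 - total_size1) != 0)
    return diff;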
>> +/* Assign spill hard registers to N pseudos in PSEUDO_REGNOS.  Put the
>> +   pseudos which did not get a spill hard register at the beginning of
>> +   array PSEUDO_REGNOS.	 Return the number of such pseudos.  */
> It'd be worth saying that PSEUDO_REGNOS is sorted in order of highest
> frequency first.
Fixed.
>> +  bitmap set_jump_crosses = regstat_get_setjmp_crosses ();
> I notice you use "set_jump" here and "setjump" in parts of 7a.patch.
> Probably better to use setjmp across the board.
Fixed.  setjump is also used in other parts of GCC.
>> +  /* Hard registers which can not be used for any purpose at given
>> +     program point because they are unallocatable or already allocated
>> +     for other pseudos.	 */
>> +  HARD_REG_SET *reserved_hard_regs;
>> +
>> +  if (! lra_reg_spill_p)
>> +    return n;
>> +  /* Set up reserved hard regs for every program point.	 */
>> +  reserved_hard_regs = (HARD_REG_SET *) xmalloc (sizeof (HARD_REG_SET)
>> +						 * lra_live_max_point);
>> +  for (p = 0; p < lra_live_max_point; p++)
>> +    COPY_HARD_REG_SET (reserved_hard_regs[p], lra_no_alloc_regs);
>> +  for (i = FIRST_PSEUDO_REGISTER; i < regs_num; i++)
>> +    if (lra_reg_info[i].nrefs != 0
>> +	&& (hard_regno = lra_get_regno_hard_regno (i)) >= 0)
>> +      for (r = lra_reg_info[i].live_ranges; r != NULL; r = r->next)
>> +	for (p = r->start; p <= r->finish; p++)
>> +	  lra_add_hard_reg_set (hard_regno, lra_reg_info[i].biggest_mode,
>> +				&reserved_hard_regs[p]);
> Since compilation time seems to be all the rage, I wonder if it would be
> quicker to have one live range list per hard register.  Then:
>> +      for (r = lra_reg_info[regno].live_ranges; r != NULL; r = r->next)
>> +	for (p = r->start; p <= r->finish; p++)
>> +	  IOR_HARD_REG_SET (conflict_hard_regs, reserved_hard_regs[p]);
> would just be checking for live range intersection and:
>
>> +      /* Update reserved_hard_regs.  */
>> +      for (r = lra_reg_info[regno].live_ranges; r != NULL; r = r->next)
>> +	for (p = r->start; p <= r->finish; p++)
>> +	  lra_add_hard_reg_set (hard_regno, lra_reg_info[regno].biggest_mode,
>> +				&reserved_hard_regs[p]);
> would again be a merge.
>
> Just an idea, not a merge requirement.  If you've already tried this and
> found it to be worse, that might be worth a comment.
I checked profiles and coverage for different tests (including huge 
ones) and did not see that this code is critical.  But it is probably 
worth trying.  It might be a bit more complicated for multi-register 
pseudos.  I've just used the same pattern as in the IRA fast allocation.  
On the other hand, stack slots are allocated in the way you propose.  
It might be good to unify the code.  I'll put it on my (long) todo list.
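If somebody picks it up, I would expect the idea to look roughly like
this (an untested sketch; hard_reg_ranges is a hypothetical
per-hard-register array, and lra_copy_live_range_list /
lra_merge_live_ranges are the existing range-list helpers):

  /* Hypothetical: one live range list per hard register instead of
     a HARD_REG_SET per program point.  */
  lra_live_range_t hard_reg_ranges[FIRST_PSEUDO_REGISTER];

  /* The conflict check becomes a live range intersection test.  */
  if (! lra_intersected_live_ranges_p (hard_reg_ranges[hard_regno],
				       lra_reg_info[regno].live_ranges))
    {
      /* Assign HARD_REGNO to REGNO, then record its ranges.
	 Multi-register pseudos would need this for each covered
	 hard register.  */
      hard_reg_ranges[hard_regno]
	= lra_merge_live_ranges
	  (hard_reg_ranges[hard_regno],
	   lra_copy_live_range_list (lra_reg_info[regno].live_ranges));
    }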
>> +      first = pseudo_slots[regno].first = &pseudo_slots[slots[slot_num].regno];
>> +      pseudo_slots[regno].next = pseudo_slots[slots[slot_num].regno].next;
>> +      first->next = &pseudo_slots[regno];
> Very minor nit, but I think this would be easier to read if the middle
> line also used "first->next".
Fixed.
>> +/* Assign spill hard registers to N pseudos in PSEUDO_REGNOS.  Put the
>> +   pseudos which did not get a spill hard register at the beginning of
>> +   array PSEUDO_REGNOS.	 Return the number of such pseudos.  */
> Here too I think it's worth mentioning that PSEUDO_REGNOS is sorted
> with highest frequency first.
Fixed.
>> +/* Recursively process LOC in INSN and change spilled pseudos to the
>> +   corresponding memory or spilled hard reg.  Ignore spilled pseudos
>> +   created from the scratches.	*/
>> +static bool
>> +remove_pseudos (rtx *loc, rtx insn)
> The return value is now ignored -- we know in advance which insns need
> changing -- so this could be simplified.
Fixed.
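The function is now simply (sketch of the new signature):

  /* Recursively process LOC in INSN and change spilled pseudos to
     the corresponding memory or spilled hard reg.  */
  static void
  remove_pseudos (rtx *loc, rtx insn);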
>> +/* Change spilled pseudos into memory or spill hard regs.  The
>> +   function put changed insns on the constraint stack (these insns
>> +   will be considered on the next constraint pass).  The changed insns
>> +   are all insns in which pseudos were changed.	 */
> s/The function put/Put/
Fixed.
>> +/* Set up REMOVED_PSEUDOS_BITMAP and USED_PSEUDOS_BITMAP, and update
>> +   LR_BITMAP (a BB live info bitmap).  */
>> +static void
>> +update_live_info (bitmap lr_bitmap)
>> +{
>> +  unsigned int j;
>> +  bitmap_iterator bi;
>> +
>> +  bitmap_clear (&removed_pseudos_bitmap);
>> +  bitmap_clear (&used_pseudos_bitmap);
>> +  EXECUTE_IF_AND_IN_BITMAP (&coalesced_pseudos_bitmap, lr_bitmap,
>> +			    FIRST_PSEUDO_REGISTER, j, bi)
>> +    {
>> +      bitmap_set_bit (&removed_pseudos_bitmap, j);
>> +      bitmap_set_bit (&used_pseudos_bitmap, first_coalesced_pseudo[j]);
>> +    }
>> +  if (! bitmap_empty_p (&removed_pseudos_bitmap))
>> +    {
>> +      bitmap_and_compl_into (lr_bitmap, &removed_pseudos_bitmap);
>> +      bitmap_ior_into (lr_bitmap, &used_pseudos_bitmap);
>> +    }
>> +}
> Might be wrong, but it looks like nothing really uses removed_pseudos_bitmap
> outside this function.  I think this could simply be:
> /* Set up REMOVED_PSEUDOS_BITMAP and USED_PSEUDOS_BITMAP, and update
>     LR_BITMAP (a BB live info bitmap).  */
> static void
> update_live_info (bitmap lr_bitmap)
> {
>    unsigned int j;
>    bitmap_iterator bi;
>
>    bitmap_clear (&used_pseudos_bitmap);
>    EXECUTE_IF_AND_IN_BITMAP (&coalesced_pseudos_bitmap, lr_bitmap,
> 			    FIRST_PSEUDO_REGISTER, j, bi)
>      bitmap_set_bit (&used_pseudos_bitmap, first_coalesced_pseudo[j]);
>    if (! bitmap_empty_p (&used_pseudos_bitmap))
>      {
>        bitmap_and_compl_into (lr_bitmap, &coalesced_pseudos_bitmap);
>        bitmap_ior_into (lr_bitmap, &used_pseudos_bitmap);
>      }
> }
Yes.  Thanks for finding such a nontrivial change.  I fixed it.
>> +	    && mem_move_p (sregno, dregno)
>> +	    /* Don't coalesce inheritance pseudos because spilled
>> +	       inheritance pseudos will be removed in subsequent 'undo
>> +	       inheritance' pass.  */
>> +	    && lra_reg_info[sregno].restore_regno < 0
>> +	    && lra_reg_info[dregno].restore_regno < 0
>> +	    /* We undo splits for spilled pseudos whose live ranges
>> +	       were split.  So don't coalesce them, it is not
>> +	       necessary and the undo transformations would be
>> +	       wrong.  */
>> +	    && ! bitmap_bit_p (&split_origin_bitmap, sregno)
>> +	    && ! bitmap_bit_p (&split_origin_bitmap, dregno)
>> +	    && ! side_effects_p (set)
>> +	    /* Don't coalesces bound pseudos.  Bound pseudos has own
>> +	       rules for finding live ranges.  It is hard to maintain
>> +	       this info with coalescing and it is not worth to do
>> +	       it.  */
>> +	    && ! bitmap_bit_p (&lra_bound_pseudos, sregno)
>> +	    && ! bitmap_bit_p (&lra_bound_pseudos, dregno)
>> +	    /* We don't want to coalesce regnos with equivalences,
>> +	       at least without updating this info.  */
>> +	    && ira_reg_equiv[sregno].constant == NULL_RTX
>> +	    && ira_reg_equiv[sregno].memory == NULL_RTX
>> +	    && ira_reg_equiv[sregno].invariant == NULL_RTX
>> +	    && ira_reg_equiv[dregno].constant == NULL_RTX
>> +	    && ira_reg_equiv[dregno].memory == NULL_RTX
>> +	    && ira_reg_equiv[dregno].invariant == NULL_RTX
> Probably personal preference, but I think this would be easier
> to read as:
>
> 	    && coalescable_reg_p (sregno)
> 	    && coalescable_reg_p (dregno)
> 	    && !side_effects_p (set)
>
> with coalescable_reg_p checking reg_renumber (from mem_move_p)
> and the open-coded stuff in the quote above.
Ok.  Fixed.
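The new predicate gathers the open-coded tests in one place, roughly
like this (a sketch; the committed version may differ in details):

/* Return true if pseudo REGNO can be potentially coalesced.  */
static bool
coalescable_pseudo_p (int regno)
{
  lra_assert (regno >= FIRST_PSEUDO_REGISTER);
  /* Don't coalesce inheritance pseudos: spilled inheritance pseudos
     will be removed in the subsequent undo inheritance pass.  */
  return (lra_reg_info[regno].restore_regno < 0
	  /* Don't coalesce split or bound pseudos (see above).  */
	  && ! bitmap_bit_p (&split_origin_bitmap, regno)
	  && ! bitmap_bit_p (&lra_bound_pseudos, regno)
	  /* We don't want to coalesce regnos with equivalences, at
	     least without updating this info.  */
	  && ira_reg_equiv[regno].constant == NULL_RTX
	  && ira_reg_equiv[regno].memory == NULL_RTX
	  && ira_reg_equiv[regno].invariant == NULL_RTX);
}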
>> +  for (; mv_num != 0;)
>> +    {
>> +      for (i = 0; i < mv_num; i++)
>> +	{
>> +	  mv = sorted_moves[i];
>> +	  set = single_set (mv);
>> +	  lra_assert (set != NULL && REG_P (SET_SRC (set))
>> +		      && REG_P (SET_DEST (set)));
>> +	  sregno = REGNO (SET_SRC (set));
>> +	  dregno = REGNO (SET_DEST (set));
>> +	  if (! lra_intersected_live_ranges_p
>> +		(lra_reg_info[first_coalesced_pseudo[sregno]].live_ranges,
>> +		 lra_reg_info[first_coalesced_pseudo[dregno]].live_ranges))
>> +	    {
>> +	      coalesced_moves++;
>> +	      if (lra_dump_file != NULL)
>> +		fprintf
>> +		  (lra_dump_file,
>> +		   "	  Coalescing move %i:r%d(%d)-r%d(%d) (freq=%d)\n",
>> +		   INSN_UID (mv), sregno, ORIGINAL_REGNO (SET_SRC (set)),
>> +		   dregno, ORIGINAL_REGNO (SET_DEST (set)),
>> +		   BLOCK_FOR_INSN (mv)->frequency);
>> +	      bitmap_ior_into (&involved_insns_bitmap,
>> +			       &lra_reg_info[sregno].insn_bitmap);
>> +	      bitmap_ior_into (&involved_insns_bitmap,
>> +			       &lra_reg_info[dregno].insn_bitmap);
>> +	      merge_pseudos (sregno, dregno);
>> +	      i++;
>> +	      break;
>> +	    }
>> +	}
>> +      /* Collect the rest of copies.  */
>> +      for (n = 0; i < mv_num; i++)
>> +	{
>> +	  mv = sorted_moves[i];
>> +	  set = single_set (mv);
>> +	  lra_assert (set != NULL && REG_P (SET_SRC (set))
>> +		      && REG_P (SET_DEST (set)));
>> +	  sregno = REGNO (SET_SRC (set));
>> +	  dregno = REGNO (SET_DEST (set));
>> +	  if (first_coalesced_pseudo[sregno] != first_coalesced_pseudo[dregno])
>> +	    sorted_moves[n++] = mv;
>> +	  else if (lra_dump_file != NULL)
>> +	    {
>> +	      coalesced_moves++;
>> +	      fprintf
>> +		(lra_dump_file, "      Coalescing move %i:r%d-r%d (freq=%d)\n",
>> +		 INSN_UID (mv), sregno, dregno,
>> +		 BLOCK_FOR_INSN (mv)->frequency);
>> +	    }
>> +	}
>> +      mv_num = n;
> I'm probably being dense here, sorry, but why the nested loops?
> Why can't we have one loop along the lines of:
>
>        for (i = 0; i < mv_num; i++)
> 	{
> 	  mv = sorted_moves[i];
> 	  set = single_set (mv);
> 	  lra_assert (set != NULL && REG_P (SET_SRC (set))
> 		      && REG_P (SET_DEST (set)));
> 	  sregno = REGNO (SET_SRC (set));
> 	  dregno = REGNO (SET_DEST (set));
> 	  if (first_coalesced_pseudo[sregno] == first_coalesced_pseudo[dregno])
> 	    {
> 	      coalesced_moves++;
> 	      fprintf
> 		(lra_dump_file, "      Coalescing move %i:r%d-r%d (freq=%d)\n",
> 		 INSN_UID (mv), sregno, dregno,
> 		 BLOCK_FOR_INSN (mv)->frequency);
> 	      /* We updated involved_insns_bitmap when doing the merge.  */
> 	    }
> 	  else if (!(lra_intersected_live_ranges_p
> 		     (lra_reg_info[first_coalesced_pseudo[sregno]].live_ranges,
> 		      lra_reg_info[first_coalesced_pseudo[dregno]].live_ranges)))
> 	    {
> 	      coalesced_moves++;
> 	      if (lra_dump_file != NULL)
> 		fprintf
> 		  (lra_dump_file,
> 		   "	  Coalescing move %i:r%d(%d)-r%d(%d) (freq=%d)\n",
> 		   INSN_UID (mv), sregno, ORIGINAL_REGNO (SET_SRC (set)),
> 		   dregno, ORIGINAL_REGNO (SET_DEST (set)),
> 		   BLOCK_FOR_INSN (mv)->frequency);
> 	      bitmap_ior_into (&involved_insns_bitmap,
> 			       &lra_reg_info[sregno].insn_bitmap);
> 	      bitmap_ior_into (&involved_insns_bitmap,
> 			       &lra_reg_info[dregno].insn_bitmap);
> 	      merge_pseudos (sregno, dregno);
> 	    }
> 	}
>
> (completely untested)
As I remember, it was originally a more complicated coalescing 
algorithm, where sorting was done on each iteration after each move 
was coalesced.

I changed the code.
>
>> +	    if ((set = single_set (insn)) != NULL_RTX
>> +		&& REG_P (SET_DEST (set)) && REG_P (SET_SRC (set))
>> +		&& REGNO (SET_SRC (set)) == REGNO (SET_DEST (set))
>> +		&& ! side_effects_p (set))
> Maybe use set_noop_p here?
>
Ok, why not.  The code is rarely executed, so the more general check 
can be used.  I changed it to set_noop_p.
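The test is now just (sketch):

	    if ((set = single_set (insn)) != NULL_RTX
		&& set_noop_p (set))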
