public inbox for gcc-patches@gcc.gnu.org
* PATCH: PR target/46519: Missing vzeroupper
@ 2010-12-17 19:45 H.J. Lu
  2010-12-18 19:36 ` Uros Bizjak
  0 siblings, 1 reply; 46+ messages in thread
From: H.J. Lu @ 2010-12-17 19:45 UTC (permalink / raw)
  To: gcc-patches; +Cc: Uros Bizjak

Hi,

This patch fixes another missing vzeroupper.  OK for trunk?

Thanks.


H.J.
---
gcc/

2010-12-17  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* config/i386/i386.c (rescan_move_or_delete_vzeroupper): Stop
	rescanning predecessor edges if one of them uses upper 128bits.

gcc/testsuite/

2010-12-17  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* gfortran.dg/pr46519-2.f90: New.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index a5603e6..99ba823 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -349,7 +349,10 @@ rescan_move_or_delete_vzeroupper (basic_block block)
       rescan_move_or_delete_vzeroupper (e->src);
       /* For rescan, UKKNOWN state is treated as UNUSED.  */
       if (BLOCK_INFO (e->src)->state == used)
-	state = used;
+	{
+	  state = used;
+	  break;
+	}
     }
 
   /* Rescan this block only if there are vzerouppers or the upper
diff --git a/gcc/testsuite/gfortran.dg/pr46519-2.f90 b/gcc/testsuite/gfortran.dg/pr46519-2.f90
new file mode 100644
index 0000000..b4d6980
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/pr46519-2.f90
@@ -0,0 +1,31 @@
+! { dg-do compile { target i?86-*-* x86_64-*-* } }
+! { dg-options "-O3 -mavx -mvzeroupper -mtune=generic -dp" }
+
+  SUBROUTINE func(kts, kte, qrz, qiz, rho)
+  IMPLICIT NONE
+  INTEGER, INTENT(IN)               :: kts, kte
+  REAL,    DIMENSION(kts:kte), INTENT(INOUT) :: qrz, qiz, rho
+  INTEGER                              :: k
+  REAL, DIMENSION(kts:kte)    ::  praci, vtiold
+  REAL                          :: fluxout
+  INTEGER                       :: min_q, max_q, var
+  do k=kts,kte
+    praci(k)=0.0
+  enddo
+  min_q=kte
+  max_q=kts-1
+  DO var=1,20
+    do k=max_q,min_q,-1
+       fluxout=rho(k)*qrz(k)
+    enddo
+    qrz(min_q-1)=qrz(min_q-1)+fluxout
+  ENDDO
+  DO var=1,20
+    do k=kts,kte-1
+      vtiold(k)= (rho(k))**0.16
+    enddo
+  ENDDO
+  STOP
+  END SUBROUTINE func
+
+! { dg-final { scan-assembler "avx_vzeroupper" } }

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2010-12-17 19:45 PATCH: PR target/46519: Missing vzeroupper H.J. Lu
@ 2010-12-18 19:36 ` Uros Bizjak
  2010-12-18 20:11   ` H.J. Lu
  0 siblings, 1 reply; 46+ messages in thread
From: Uros Bizjak @ 2010-12-18 19:36 UTC (permalink / raw)
  To: H.J. Lu; +Cc: gcc-patches

On Fri, Dec 17, 2010 at 8:03 PM, H.J. Lu <hongjiu.lu@intel.com> wrote:

> This patch fixes another missing vzeroupper.  OK for trunk?
>
> Thanks.
>
>
> H.J.
> ---
> gcc/
>
> 2010-12-17  H.J. Lu  <hongjiu.lu@intel.com>
>
>        PR target/46519
>        * config/i386/i386.c (rescan_move_or_delete_vzeroupper): Stop
>        rescanning predecessor edges if one of them uses upper 128bits.
>
> gcc/testsuite/
>
> 2010-12-17  H.J. Lu  <hongjiu.lu@intel.com>
>
>        PR target/46519
>        * gfortran.dg/pr46519-2.f90: New.

OK.

Thanks,
Uros.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2010-12-18 19:36 ` Uros Bizjak
@ 2010-12-18 20:11   ` H.J. Lu
  2010-12-29 11:03     ` Uros Bizjak
  0 siblings, 1 reply; 46+ messages in thread
From: H.J. Lu @ 2010-12-18 20:11 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1619 bytes --]

On Sat, Dec 18, 2010 at 9:48 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Fri, Dec 17, 2010 at 8:03 PM, H.J. Lu <hongjiu.lu@intel.com> wrote:
>
>> This patch fixes another missing vzeroupper.  OK for trunk?
>>
>> Thanks.
>>
>>
>> H.J.
>> ---
>> gcc/
>>
>> 2010-12-17  H.J. Lu  <hongjiu.lu@intel.com>
>>
>>        PR target/46519
>>        * config/i386/i386.c (rescan_move_or_delete_vzeroupper): Stop
>>        rescanning predecessor edges if one of them uses upper 128bits.
>>
>> gcc/testsuite/
>>
>> 2010-12-17  H.J. Lu  <hongjiu.lu@intel.com>
>>
>>        PR target/46519
>>        * gfortran.dg/pr46519-2.f90: New.
>
> OK.
>
> Thanks,
> Uros.
>

I'd like to apply this patch instead.  It removes
rescan_move_or_delete_vzeroupper and rewrites move_or_delete_vzeroupper_1
to avoid recursive calls.  It first scans all basic blocks repeatedly
until no basic block changes the upper 128bits of AVX registers to used
at exit.  Then it rescans all basic blocks with unknown upper 128bit
state.  OK for trunk?
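
In rough pseudo-C, the new driver loop is (a condensed sketch of the
full patch below, not the exact code):

  count = 0;
  do
    {
      cfun->machine->rescan_vzeroupper_p = 0;
      FOR_EACH_BB (bb)
        move_or_delete_vzeroupper_1 (bb, false);
    }
  while (cfun->machine->rescan_vzeroupper_p && count++ < 20);

  /* Final pass: treat any remaining unknown state as unused.  */
  FOR_EACH_BB (bb)
    move_or_delete_vzeroupper_1 (bb, true);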

Thanks.

-- 
H.J.
---
gcc/

2010-12-18  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* config/i386/i386.c (block_info_def): Remove referenced, count
	and rescanned.
	(move_or_delete_vzeroupper_2): Updated.
	(move_or_delete_vzeroupper_1): Rewritten to avoid recursive call.
	(rescan_move_or_delete_vzeroupper): Removed.
	(move_or_delete_vzeroupper): Repeat processing all basic blocks
	until no basic block state is changed to used at exit.

gcc/testsuite/

2010-12-18  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* gfortran.dg/pr46519-2.f90: New.

[-- Attachment #2: gcc-pr46519-8.patch --]
[-- Type: text/plain, Size: 9203 bytes --]

gcc/

2010-12-18  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* config/i386/i386.c (block_info_def): Remove referenced, count
	and rescanned.
	(move_or_delete_vzeroupper_2): Updated.
	(move_or_delete_vzeroupper_1): Rewritten to avoid recursive call.
	(rescan_move_or_delete_vzeroupper): Removed.
	(move_or_delete_vzeroupper): Repeat processing all basic blocks
	until no basic block state is changed to used at exit.

gcc/testsuite/

2010-12-18  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* gfortran.dg/pr46519-2.f90: New.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 40999c8..28b26ef 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -68,14 +68,8 @@ typedef struct block_info_def
 {
   /* State of the upper 128bits of any AVX registers at exit.  */
   enum upper_128bits_state state;
-  /* If the upper 128bits of any AVX registers are referenced.  */
-  enum upper_128bits_state referenced;
-  /* Number of vzerouppers in this block.  */
-  unsigned int count;
   /* TRUE if block has been processed.  */
   bool processed;
-  /* TRUE if block has been rescanned.  */
-  bool rescanned;
 } *block_info;
 
 #define BLOCK_INFO(B)   ((block_info) (B)->aux)
@@ -127,8 +121,6 @@ move_or_delete_vzeroupper_2 (basic_block bb,
   rtx vzeroupper_insn = NULL_RTX;
   rtx pat;
   int avx256;
-  enum upper_128bits_state referenced = BLOCK_INFO (bb)->referenced;
-  int count = BLOCK_INFO (bb)->count;
 
   if (dump_file)
     fprintf (dump_file, " [bb %i] entry: upper 128bits: %d\n",
@@ -191,24 +183,20 @@ move_or_delete_vzeroupper_2 (basic_block bb,
 	      /* Delete pending vzeroupper insertion.  */
 	      if (vzeroupper_insn)
 		{
-		  count--;
 		  delete_insn (vzeroupper_insn);
 		  vzeroupper_insn = NULL_RTX;
 		}
 	    }
-	  else if (state != used && referenced != unused)
+	  else if (state != used)
 	    {
 	      /* No need to call note_stores if the upper 128bits of
 		 AVX registers are never referenced.  */
 	      note_stores (pat, check_avx256_stores, &state);
-	      if (state == used)
-		referenced = used;
 	    }
 	  continue;
 	}
 
       /* Process vzeroupper intrinsic.  */
-      count++;
       avx256 = INTVAL (XVECEXP (pat, 0, 0));
 
       if (state == unused)
@@ -226,7 +214,6 @@ move_or_delete_vzeroupper_2 (basic_block bb,
 	      fprintf (dump_file, "Delete redundant vzeroupper:\n");
 	      print_rtl_single (dump_file, insn);
 	    }
-	  count--;
 	  delete_insn (insn);
 	}
       else
@@ -246,7 +233,6 @@ move_or_delete_vzeroupper_2 (basic_block bb,
 		  fprintf (dump_file, "Delete callee pass vzeroupper:\n");
 		  print_rtl_single (dump_file, insn);
 		}
-	      count--;
 	      delete_insn (insn);
 	    }
 	  else
@@ -256,30 +242,22 @@ move_or_delete_vzeroupper_2 (basic_block bb,
 
   BLOCK_INFO (bb)->state = state;
 
-  if (BLOCK_INFO (bb)->referenced == unknown)
-    {
-      /* The upper 128bits of AVX registers are never referenced if
-	 REFERENCED isn't updated.  */
-      if (referenced == unknown)
-	referenced = unused;
-      BLOCK_INFO (bb)->referenced = referenced;
-      BLOCK_INFO (bb)->count = count;
-    }
-
   if (dump_file)
     fprintf (dump_file, " [bb %i] exit: upper 128bits: %d\n",
 	     bb->index, state);
 }
 
 /* Helper function for move_or_delete_vzeroupper.  Process vzeroupper
-   in BLOCK and its predecessor blocks recursively.  */
+   in BLOCK and check its predecessor blocks.  Treat UNKNOWN state
+   as USED if UNKNOWN_IS_UNUSED is true.  */
 
 static void
-move_or_delete_vzeroupper_1 (basic_block block)
+move_or_delete_vzeroupper_1 (basic_block block, bool unknown_is_unused)
 {
   edge e;
   edge_iterator ei;
-  enum upper_128bits_state state;
+  enum upper_128bits_state state, old_state, new_state;
+  bool seen_unknown;
 
   if (dump_file)
     fprintf (dump_file, " Process [bb %i]: status: %d\n",
@@ -288,83 +266,42 @@ move_or_delete_vzeroupper_1 (basic_block block)
   if (BLOCK_INFO (block)->processed)
     return;
 
-  BLOCK_INFO (block)->processed = true;
-
-  state = unknown;
+  state = unused;
 
-  /* Process all predecessor edges of this block.  */
+  /* Check all predecessor edges of this block.  */
+  seen_unknown = false;
   FOR_EACH_EDGE (e, ei, block->preds)
     {
       if (e->src == block)
 	continue;
-      move_or_delete_vzeroupper_1 (e->src);
       switch (BLOCK_INFO (e->src)->state)
 	{
 	case unknown:
-	  if (state == unused)
-	    state = unknown;
+	  if (!unknown_is_unused)
+	    seen_unknown = true;
+	case unused:
 	  break;
 	case used:
 	  state = used;
-	  break;
-	case unused:
-	  break;
+	  goto done;
 	}
     }
 
-  /* If state of any predecessor edges is unknown, we need to rescan.  */
-  if (state == unknown)
-    cfun->machine->rescan_vzeroupper_p = 1;
+  if (seen_unknown)
+    state = unknown;
 
-  /* Process this block.  */
+done:
+  old_state = BLOCK_INFO (block)->state;
   move_or_delete_vzeroupper_2 (block, state);
-}
-
-/* Helper function for move_or_delete_vzeroupper.  Rescan vzeroupper
-   in BLOCK and its predecessor blocks recursively.  */
-
-static void
-rescan_move_or_delete_vzeroupper (basic_block block)
-{
-  edge e;
-  edge_iterator ei;
-  enum upper_128bits_state state;
-
-  if (dump_file)
-    fprintf (dump_file, " Rescan [bb %i]: status: %d\n",
-	     block->index, BLOCK_INFO (block)->rescanned);
-
-  if (BLOCK_INFO (block)->rescanned)
-    return;
-
-  BLOCK_INFO (block)->rescanned = true;
-
-  state = unused;
+  new_state = BLOCK_INFO (block)->state;
 
-  /* Rescan all predecessor edges of this block.  */
-  FOR_EACH_EDGE (e, ei, block->preds)
-    {
-      if (e->src == block)
-	continue;
-      rescan_move_or_delete_vzeroupper (e->src);
-      /* For rescan, UKKNOWN state is treated as UNUSED.  */
-      if (BLOCK_INFO (e->src)->state == used)
-	state = used;
-    }
+  if (state != unknown || new_state == used)
+    BLOCK_INFO (block)->processed = true;
 
-  /* Rescan this block only if there are vzerouppers or the upper
-     128bits of AVX registers are referenced.  */
-  if (BLOCK_INFO (block)->count == 0
-      && (state == used || BLOCK_INFO (block)->referenced != used))
-    {
-      if (state == used)
-	BLOCK_INFO (block)->state = state;
-      if (dump_file)
-	fprintf (dump_file, " [bb %i] exit: upper 128bits: %d\n",
-		 block->index, BLOCK_INFO (block)->state);
-    }
-  else
-    move_or_delete_vzeroupper_2 (block, state);
+  /* Need to rescan if the upper 128bits of AVX registers are changed
+     to USED at exit.  */
+  if (new_state != old_state && new_state == used)
+    cfun->machine->rescan_vzeroupper_p = 1;
 }
 
 /* Go through the instruction stream looking for vzeroupper.  Delete
@@ -377,7 +314,7 @@ move_or_delete_vzeroupper (void)
   edge e;
   edge_iterator ei;
   basic_block bb;
-  unsigned int count = 0;
+  unsigned int count;
 
   /* Set up block info for each basic block.  */
   alloc_aux_for_blocks (sizeof (struct block_info_def));
@@ -392,28 +329,30 @@ move_or_delete_vzeroupper (void)
 				   cfun->machine->caller_pass_avx256_p
 				   ? used : unused);
       BLOCK_INFO (e->dest)->processed = true;
-      BLOCK_INFO (e->dest)->rescanned = true;
     }
 
   /* Process all basic blocks.  */
-  if (dump_file)
-    fprintf (dump_file, "Process all basic blocks\n");
-
-  FOR_EACH_BB (bb)
-    {
-      move_or_delete_vzeroupper_1 (bb);
-      count += BLOCK_INFO (bb)->count;
-    }
-
-  /* Rescan all basic blocks if needed.  */
-  if (count && cfun->machine->rescan_vzeroupper_p)
+  count = 0;
+  do
     {
       if (dump_file)
-	fprintf (dump_file, "Rescan all basic blocks\n");
-
+	fprintf (dump_file, "Process all basic blocks: trip %d\n",
+		 count);
+      cfun->machine->rescan_vzeroupper_p = 0;
       FOR_EACH_BB (bb)
-	rescan_move_or_delete_vzeroupper (bb);
+	move_or_delete_vzeroupper_1 (bb, false);
     }
+  while (cfun->machine->rescan_vzeroupper_p && count++ < 20);
+
+  /* FIXME: Is 20 big enough?  */
+  if (count >= 20)
+    gcc_unreachable ();
+
+  if (dump_file)
+    fprintf (dump_file, "Process all basic blocks\n");
+
+  FOR_EACH_BB (bb)
+    move_or_delete_vzeroupper_1 (bb, true);
 
   free_aux_for_blocks ();
 }
diff --git a/gcc/testsuite/gfortran.dg/pr46519-2.f90 b/gcc/testsuite/gfortran.dg/pr46519-2.f90
new file mode 100644
index 0000000..b4d6980
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/pr46519-2.f90
@@ -0,0 +1,31 @@
+! { dg-do compile { target i?86-*-* x86_64-*-* } }
+! { dg-options "-O3 -mavx -mvzeroupper -mtune=generic -dp" }
+
+  SUBROUTINE func(kts, kte, qrz, qiz, rho)
+  IMPLICIT NONE
+  INTEGER, INTENT(IN)               :: kts, kte
+  REAL,    DIMENSION(kts:kte), INTENT(INOUT) :: qrz, qiz, rho
+  INTEGER                              :: k
+  REAL, DIMENSION(kts:kte)    ::  praci, vtiold
+  REAL                          :: fluxout
+  INTEGER                       :: min_q, max_q, var
+  do k=kts,kte
+    praci(k)=0.0
+  enddo
+  min_q=kte
+  max_q=kts-1
+  DO var=1,20
+    do k=max_q,min_q,-1
+       fluxout=rho(k)*qrz(k)
+    enddo
+    qrz(min_q-1)=qrz(min_q-1)+fluxout
+  ENDDO
+  DO var=1,20
+    do k=kts,kte-1
+      vtiold(k)= (rho(k))**0.16
+    enddo
+  ENDDO
+  STOP
+  END SUBROUTINE func
+
+! { dg-final { scan-assembler "avx_vzeroupper" } }

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2010-12-18 20:11   ` H.J. Lu
@ 2010-12-29 11:03     ` Uros Bizjak
  2010-12-29 16:23       ` H.J. Lu
  0 siblings, 1 reply; 46+ messages in thread
From: Uros Bizjak @ 2010-12-29 11:03 UTC (permalink / raw)
  To: H.J. Lu; +Cc: gcc-patches, Richard Guenther, Jakub Jelinek, Mark Mitchell

On Sat, Dec 18, 2010 at 7:10 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Sat, Dec 18, 2010 at 9:48 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
>> On Fri, Dec 17, 2010 at 8:03 PM, H.J. Lu <hongjiu.lu@intel.com> wrote:
>>
>>> This patch fixes another missing vzeroupper.  OK for trunk?

> I'd like to apply this patch instead. It removes escan_move_or_delete_vzeroupper
> and rewrites move_or_delete_vzeroupper_1 to avoid recursive call. It first scans
> all basic blocks repeatedly until no basic block changes the upper
> 128bits of AVX
> to used at exit.  Then it rescans all basic blocks with unknown upper
> 128bit state.
> OK for trunk?

H.J. explained to me in a private mail the importance of this
patch.  I think that the quote below explains it:

<quote>
> I'm not sure that the algorithm is correct (and I don't have enough
> experience in this area), so I'd rather leave the review to someone
> else. AFAICS, there can be 20 passes, and from comments, it is
> questionable if this is enough.

I tried several benchmarks which failed before my patch.  The most passes
I saw was 2.  I can change the limit to 2 and re-run SPEC CPU 2K/2006 to
find out what the smallest pass count should be.

> I propose that you commit your previous (simple) patch, since IMO this

My simple patch doesn't work on SPEC CPU 2K/2006. It isn't very
useful for 4.6.

> one is too invasive for this development stage. However, I still think

The old algorithm is obviously incorrect. The new algorithm removes the
recursive calls and is simpler/faster than the old one.  vzeroupper optimization
is a very important new feature for AVX. The current implementation is
incorrect.  I'd like to fix it before 4.6 is released.

> that LCM infrastructure (see lcm.c) should be used to place
> vzerouppers at optimum points.

We will investigate LCM for 4.7.
</quote>

I think that for these reasons, the patch should be committed to
SVN even at this development stage.  Even if the algorithm is not
optimal, the patch demonstrably produces substantially better code.
This feature has no impact on generic code without the -mvzeroupper /
-mavx switches, and since there are currently very few AVX users, the
overall impact is negligible.

> gcc/
>
> 2010-12-18  H.J. Lu  <hongjiu.lu@intel.com>
>
>        PR target/46519
>        * config/i386/i386.c (block_info_def): Remove referenced, count
>        and rescanned.
>        (move_or_delete_vzeroupper_2): Updated.
>        (move_or_delete_vzeroupper_1): Rewritten to avoid recursive call.
>        (rescan_move_or_delete_vzeroupper): Removed.
>        (move_or_delete_vzeroupper): Repeat processing all basic blocks
>        until no basic block state is changed to used at exit.
>
> gcc/testsuite/
>
> 2010-12-18  H.J. Lu  <hongjiu.lu@intel.com>
>
>        PR target/46519
>        * gfortran.dg/pr46519-2.f90: New.
>

The patch is OK, but please allow a day or two for RMs (CC'd) to
eventually comment.

Thanks,
Uros.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2010-12-29 11:03     ` Uros Bizjak
@ 2010-12-29 16:23       ` H.J. Lu
  2010-12-30 12:42         ` Uros Bizjak
  0 siblings, 1 reply; 46+ messages in thread
From: H.J. Lu @ 2010-12-29 16:23 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: gcc-patches, Richard Guenther, Jakub Jelinek, Mark Mitchell

[-- Attachment #1: Type: text/plain, Size: 3675 bytes --]

On Wed, Dec 29, 2010 at 1:10 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Sat, Dec 18, 2010 at 7:10 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Sat, Dec 18, 2010 at 9:48 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
>>> On Fri, Dec 17, 2010 at 8:03 PM, H.J. Lu <hongjiu.lu@intel.com> wrote:
>>>
>>>> This patch fixes another missing vzeroupper.  OK for trunk?
>
>> I'd like to apply this patch instead. It removes escan_move_or_delete_vzeroupper
>> and rewrites move_or_delete_vzeroupper_1 to avoid recursive call. It first scans
>> all basic blocks repeatedly until no basic block changes the upper
>> 128bits of AVX
>> to used at exit.  Then it rescans all basic blocks with unknown upper
>> 128bit state.
>> OK for trunk?
>
> H.J. explained me in a private mail about the importance of this
> patch. I think that the quote below explains it:
>
> <quote>
>> I'm not sure that the algorithm is correct (and I don't have enough
>> experience in this area), so I'd rather leave the review to someone
>> else. AFAICS, there can be 20 passes, and from comments, it is
>> questionable if this is enough.
>
> I tried several benchmarks which failed before my patch.  The most pass
> I saw is 2. I can change it to 2 and re-run SPEC CPU 2K/2006 to find
> out what the smallest pass should be.
>
>> I propose that you commit your previous (simple) patch, since IMO this
>
> My simple patch doesn't work on SPEC CPU 2K/2006. It isn't very
> useful for 4.6.
>
>> one is too invasive for this development stage. However, I still think
>
> The old algorithm is obviously incorrect. The new algorithm removes the
> recursive calls and is simpler/faster than the old one.  vzeroupper optimization
> is a very important new feature for AVX. The current implementation is
> incorrect.  I'd like to fix it before 4.6 is released.
>
>> that LCM infrastructure (see lcm.c) should be used to place
>> vzerouppers at optimum points.
>
> We will investigate LCM for 4.7.
> </qoute>
>
> I think that due to these reasons, the patch should be committed to
> SVN even in this development stage. Even if the algorithm is not
> optimal, the patch demonstrably produces substantially better code.
> This feature has no impact on generic code without -mvzeroupper /
> -mavx switch, and since there are currently very few AVX users,
> negligible overall impact.
>
>> gcc/
>>
>> 2010-12-18  H.J. Lu  <hongjiu.lu@intel.com>
>>
>>        PR target/46519
>>        * config/i386/i386.c (block_info_def): Remove referenced, count
>>        and rescanned.
>>        (move_or_delete_vzeroupper_2): Updated.
>>        (move_or_delete_vzeroupper_1): Rewritten to avoid recursive call.
>>        (rescan_move_or_delete_vzeroupper): Removed.
>>        (move_or_delete_vzeroupper): Repeat processing all basic blocks
>>        until no basic block state is changed to used at exit.
>>
>> gcc/testsuite/
>>
>> 2010-12-18  H.J. Lu  <hongjiu.lu@intel.com>
>>
>>        PR target/46519
>>        * gfortran.dg/pr46519-2.f90: New.
>>
>
> The patch is OK, but please allow a day or two for RMs (CC'd) to
> eventually comment.

We will investigate LCM for 4.7.  In the meantime, here is a small patch
on top of the current one.  If the upper 128bits are never changed in a basic
block, we can skip it in the later passes.  OK for trunk together with the
current patch?
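
The core of the change is just an early exit at the top of
move_or_delete_vzeroupper_2 (condensed from the patch below, not new
code):

  if (BLOCK_INFO (bb)->unchanged)
    {
      /* Nothing in this block can change the state; only refresh its
         recorded exit state from the new entry state.  */
      BLOCK_INFO (bb)->state = state;
      return;
    }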

Thanks.


-- 
H.J.
---
2010-12-29  H.J. Lu  <hongjiu.lu@intel.com>

	* config/i386/i386.c (upper_128bits_state): Update comments.
	(block_info_def): Add unchanged.
	(move_or_delete_vzeroupper_2): Short circuit if upper 128bits
	are unchanged in the block.

[-- Attachment #2: gcc-pr46519-9.patch --]
[-- Type: text/plain, Size: 3719 bytes --]

2010-12-29  H.J. Lu  <hongjiu.lu@intel.com>

	* config/i386/i386.c (upper_128bits_state): Update comments.
	(block_info_def): Add unchanged.
	(move_or_delete_vzeroupper_2): Short circuit if upper 128bits
	are unchanged in the block.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 28b26ef..2d06c04 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -60,14 +60,17 @@ along with GCC; see the file COPYING3.  If not see
 enum upper_128bits_state
 {
   unknown = 0,		/* Unknown.  */
-  unused,		/* Not used or not referenced.  */
-  used			/* Used or referenced.  */
+  unused,		/* Not used.  */
+  used			/* Used.  */
 };
 
 typedef struct block_info_def
 {
-  /* State of the upper 128bits of any AVX registers at exit.  */
+  /* State of the upper 128bits of AVX registers at exit.  */
   enum upper_128bits_state state;
+  /* TRUE if state of the upper 128bits of AVX registers is unchanged
+     in this block.  */
+  bool unchanged;
   /* TRUE if block has been processed.  */
   bool processed;
 } *block_info;
@@ -110,8 +113,7 @@ check_avx256_stores (rtx dest, const_rtx set, void *data)
    in basic block BB.  Delete it if upper 128bit AVX registers are
    unused.  If it isn't deleted, move it to just before a jump insn.
    
-   UPPER_128BITS_LIVE is TRUE if the upper 128bits of any AVX registers
-   are live at entry.  */
+   STATE is state of the upper 128bits of AVX registers at entry.  */
 
 static void
 move_or_delete_vzeroupper_2 (basic_block bb,
@@ -121,11 +123,24 @@ move_or_delete_vzeroupper_2 (basic_block bb,
   rtx vzeroupper_insn = NULL_RTX;
   rtx pat;
   int avx256;
+  bool unchanged;
+
+  if (BLOCK_INFO (bb)->unchanged)
+    {
+      if (dump_file)
+	fprintf (dump_file, " [bb %i] unchanged: upper 128bits: %d\n",
+		 bb->index, state);
+
+      BLOCK_INFO (bb)->state = state;
+      return;
+    }
 
   if (dump_file)
     fprintf (dump_file, " [bb %i] entry: upper 128bits: %d\n",
 	     bb->index, state);
 
+  unchanged = true;
+
   /* BB_END changes when it is deleted.  */
   bb_end = BB_END (bb);
   insn = BB_HEAD (bb);
@@ -179,6 +194,7 @@ move_or_delete_vzeroupper_2 (basic_block bb,
 	      && XINT (XVECEXP (pat, 0, 0), 1) == UNSPECV_VZEROALL)
 	    {
 	      state = unused;
+	      unchanged = false;
 
 	      /* Delete pending vzeroupper insertion.  */
 	      if (vzeroupper_insn)
@@ -189,9 +205,9 @@ move_or_delete_vzeroupper_2 (basic_block bb,
 	    }
 	  else if (state != used)
 	    {
-	      /* No need to call note_stores if the upper 128bits of
-		 AVX registers are never referenced.  */
 	      note_stores (pat, check_avx256_stores, &state);
+	      if (state == used)
+		unchanged = false;
 	    }
 	  continue;
 	}
@@ -205,7 +221,10 @@ move_or_delete_vzeroupper_2 (basic_block bb,
 	     256bit AVX register.  We only need to check if callee
 	     returns 256bit AVX register.  */
 	  if (avx256 == callee_return_avx256)
-	    state = used;
+	    {
+	      state = used;
+	      unchanged = false;
+	    }
 
 	  /* Remove unnecessary vzeroupper since upper 128bits are
 	     cleared.  */
@@ -236,15 +255,20 @@ move_or_delete_vzeroupper_2 (basic_block bb,
 	      delete_insn (insn);
 	    }
 	  else
-	    vzeroupper_insn = insn;
+	    {
+	      vzeroupper_insn = insn;
+	      unchanged = false;
+	    }
 	}
     }
 
   BLOCK_INFO (bb)->state = state;
+  BLOCK_INFO (bb)->unchanged = unchanged;
 
   if (dump_file)
-    fprintf (dump_file, " [bb %i] exit: upper 128bits: %d\n",
-	     bb->index, state);
+    fprintf (dump_file, " [bb %i] exit: %s: upper 128bits: %d\n",
+	     bb->index, unchanged ? "unchanged" : "changed",
+	     state);
 }
 
 /* Helper function for move_or_delete_vzeroupper.  Process vzeroupper

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2010-12-29 16:23       ` H.J. Lu
@ 2010-12-30 12:42         ` Uros Bizjak
  2011-01-01  1:05           ` Mark Mitchell
  0 siblings, 1 reply; 46+ messages in thread
From: Uros Bizjak @ 2010-12-30 12:42 UTC (permalink / raw)
  To: H.J. Lu; +Cc: gcc-patches, Richard Guenther, Jakub Jelinek, Mark Mitchell

On Wed, Dec 29, 2010 at 4:32 PM, H.J. Lu <hjl.tools@gmail.com> wrote:

>> I think that due to these reasons, the patch should be committed to
>> SVN even in this development stage. Even if the algorithm is not
>> optimal, the patch demonstrably produces substantially better code.
>> This feature has no impact on generic code without -mvzeroupper /
>> -mavx switch, and since there are currently very few AVX users,
>> negligible overall impact.
>>
>>> gcc/
>>>
>>> 2010-12-18  H.J. Lu  <hongjiu.lu@intel.com>
>>>
>>>        PR target/46519
>>>        * config/i386/i386.c (block_info_def): Remove referenced, count
>>>        and rescanned.
>>>        (move_or_delete_vzeroupper_2): Updated.
>>>        (move_or_delete_vzeroupper_1): Rewritten to avoid recursive call.
>>>        (rescan_move_or_delete_vzeroupper): Removed.
>>>        (move_or_delete_vzeroupper): Repeat processing all basic blocks
>>>        until no basic block state is changed to used at exit.
>>>
>>> gcc/testsuite/
>>>
>>> 2010-12-18  H.J. Lu  <hongjiu.lu@intel.com>
>>>
>>>        PR target/46519
>>>        * gfortran.dg/pr46519-2.f90: New.
>>>
>>
>> The patch is OK, but please allow a day or two for RMs (CC'd) to
>> eventually comment.
>
> We will investigate LCM for 4.7.  In the meantime, here is  a small patch
> on top of the current one. If the upper 128bits are never changed in a basic
> block, we can skip it in the later passes.  OK for trunk together with the
> current patch?
>
> 2010-12-29  H.J. Lu  <hongjiu.lu@intel.com>
>
>        * config/i386/i386.c (upper_128bits_state): Update comments.
>        (block_info_def): Add unchanged.
>        (move_or_delete_vzeroupper_2): Short circuit if upper 128bits
>        are unchanged in the block.
>

OK, but please remove the now-redundant comments in

@@ -60,14 +60,17 @@ along with GCC; see the file COPYING3.  If not see
 enum upper_128bits_state
 {
   unknown = 0,		/* Unknown.  */
-  unused,		/* Not used or not referenced.  */
-  used			/* Used or referenced.  */
+  unused,		/* Not used.  */
+  used			/* Used.  */
 };

Thanks,
Uros.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2010-12-30 12:42         ` Uros Bizjak
@ 2011-01-01  1:05           ` Mark Mitchell
  2011-01-01  1:38             ` H.J. Lu
  0 siblings, 1 reply; 46+ messages in thread
From: Mark Mitchell @ 2011-01-01  1:05 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: H.J. Lu, gcc-patches, Richard Guenther, Jakub Jelinek

On 12/30/2010 3:19 AM, Uros Bizjak wrote:

>> We will investigate LCM for 4.7.  In the meantime, here is  a small patch
>> on top of the current one. If the upper 128bits are never changed in a basic
>> block, we can skip it in the later passes.  OK for trunk together with the
>> current patch?

For the avoidance of doubt, since Uros explicitly asked for RM comments: I
have no objections to the patch, if the x86 maintainers are happy with it.

However, this comment:

>> I'm not sure that the algorithm is correct (and I don't have enough
>> experience in this area), so I'd rather leave the review to someone
>> else. AFAICS, there can be 20 passes, and from comments, it is
>> questionable if this is enough.

concerns me.

Does someone have confidence that this algorithm is correct, in the sense
that we will not generate wrong code?  And are we talking about 20
passes over the complete set of basic blocks?  That sounds pretty expensive.

Thank you,

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-01  1:05           ` Mark Mitchell
@ 2011-01-01  1:38             ` H.J. Lu
  2011-01-01  1:39               ` Mark Mitchell
  0 siblings, 1 reply; 46+ messages in thread
From: H.J. Lu @ 2011-01-01  1:38 UTC (permalink / raw)
  To: Mark Mitchell; +Cc: Uros Bizjak, gcc-patches, Richard Guenther, Jakub Jelinek

On Fri, Dec 31, 2010 at 5:05 PM, Mark Mitchell <mark@codesourcery.com> wrote:

>>> I'm not sure that the algorithm is correct (and I don't have enough
>>> experience in this area), so I'd rather leave the review to someone
>>> else. AFAICS, there can be 20 passes, and from comments, it is
>>> questionable if this is enough.
>
> concerns me.
>
> Do someone have confidence that this algorithm is correct, in the sense
> that we will not generate wrong code?  And are we talking about 20
> passes over the complete set of basic blocks?  That sounds pretty expensive.
>

I believe the algorithm is correct, but probably not optimal.  What we want
to know is the precise state of the upper 128bits at the exit of each basic
block.  It rescans a basic block only if the exit state is unknown and the
upper 128bits may be modified in the basic block.


-- 
H.J.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-01  1:38             ` H.J. Lu
@ 2011-01-01  1:39               ` Mark Mitchell
  2011-01-01  2:08                 ` H.J. Lu
  0 siblings, 1 reply; 46+ messages in thread
From: Mark Mitchell @ 2011-01-01  1:39 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Uros Bizjak, gcc-patches, Richard Guenther, Jakub Jelinek

On 12/31/2010 5:38 PM, H.J. Lu wrote:

> I believe algorithm is correct, but probably not optimal. What we want to know
> is the precise state of the upper 128bits at exit of basic block.  It rescans a
> basic block only if the exit state is unknown and the upper 128bits may be
> modified in the basic block.

What is the limit, then, on the number of iterations required for a
function with N basic blocks?

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-01  1:39               ` Mark Mitchell
@ 2011-01-01  2:08                 ` H.J. Lu
  2011-01-01  2:17                   ` Mark Mitchell
  0 siblings, 1 reply; 46+ messages in thread
From: H.J. Lu @ 2011-01-01  2:08 UTC (permalink / raw)
  To: Mark Mitchell; +Cc: Uros Bizjak, gcc-patches, Richard Guenther, Jakub Jelinek

On Fri, Dec 31, 2010 at 5:39 PM, Mark Mitchell <mark@codesourcery.com> wrote:
> On 12/31/2010 5:38 PM, H.J. Lu wrote:
>
>> I believe algorithm is correct, but probably not optimal. What we want to know
>> is the precise state of the upper 128bits at exit of basic block.  It rescans a
>> basic block only if the exit state is unknown and the upper 128bits may be
>> modified in the basic block.
>
> What is the limit, then, on the number of iterations required for a
> function with N basic blocks?
>

The limit depends on how complex the CFG is.  I don't think it will go into
an infinite loop.  However, I limit it to 20 just in case.  I have built SPEC
CPU 2K/2006 using -mavx -O2/-O3 without problems.


-- 
H.J.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-01  2:08                 ` H.J. Lu
@ 2011-01-01  2:17                   ` Mark Mitchell
  2011-01-01 16:01                     ` H.J. Lu
  0 siblings, 1 reply; 46+ messages in thread
From: Mark Mitchell @ 2011-01-01  2:17 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Uros Bizjak, gcc-patches, Richard Guenther, Jakub Jelinek

On 12/31/2010 6:08 PM, H.J. Lu wrote:

> The limit depends on how complex CFG is. I don't think it will go into an
> infinite loop.

HJ, I think you should think about the algorithm in enough detail that
you can be sure whether or not it's a terminating algorithm.  Even if it
is terminating, a limit is probably a good idea to contain compile-time
cost for pathological cases, but you're not providing me with confidence
that you've got an algorithm that you really understand if you aren't
sure whether or not it terminates.

> However, I limit it to 20 just in case.

20 what?  20 passes over the entire BB tree?  If so, that seems like an
awful lot.  Please provide a statement like "the run time is at most 20
* (N + E) where N is the number of basic blocks and E is the maximum
number of outgoing edges from any basic block", i.e., in the standard
form given in computer science texts.

Thank you,

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-01  2:17                   ` Mark Mitchell
@ 2011-01-01 16:01                     ` H.J. Lu
  2011-01-04  1:15                       ` Mark Mitchell
  0 siblings, 1 reply; 46+ messages in thread
From: H.J. Lu @ 2011-01-01 16:01 UTC (permalink / raw)
  To: Mark Mitchell; +Cc: Uros Bizjak, gcc-patches, Richard Guenther, Jakub Jelinek

On Fri, Dec 31, 2010 at 6:17 PM, Mark Mitchell <mark@codesourcery.com> wrote:
> On 12/31/2010 6:08 PM, H.J. Lu wrote:
>
>> The limit depends on how complex CFG is. I don't think it will go into an
>> infinite loop.
>
> HJ, I think you should think about the algorithm in enough detail that
> you can be sure whether or not it's a terminating algorithm.  Even if it
> is terminating, a limit is probably a good idea to contain compile-time
> cost for pathological cases, but you're not providing me with confidence
> that you've got an algorithm that you really understand if you aren't
> sure whether or not in terminates.

We can remove a vzeroupper insn if we can determine that the upper 128bits
of AVX registers are unused at that location.  We track the upper 128bits
of AVX registers by examining each insn.  To determine the state of the
upper 128bits of AVX registers at basic block entry, we need to know the
exit states of its incoming edges.  We start at the function entry with
the known state of the upper 128bits of AVX registers.  Whenever the upper
128bits of AVX registers of any basic block change to used at exit, we
repeat the basic blocks whose entry state is still unknown and in which
the upper 128bits of AVX registers may be changed.  A block's exit state
can change to used at most once, after which the block is never revisited.
Since the number of basic blocks is fixed and we don't repeat basic blocks
with a known entry state, the algorithm should terminate.
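
In pseudo-C, the entry state of a block is just the meet of its
predecessors' exit states (a condensed restatement of
move_or_delete_vzeroupper_1 from the committed patch, ignoring
self-loops and the final unknown-treated-as-unused pass):

  state = unused;
  seen_unknown = false;
  FOR_EACH_EDGE (e, ei, block->preds)
    switch (BLOCK_INFO (e->src)->state)
      {
      case used:
        state = used;          /* Any "used" predecessor wins.  */
        break;
      case unknown:
        seen_unknown = true;   /* Taints the entry state...  */
        break;
      case unused:
        break;                 /* ...unless a "used" one shows up.  */
      }
  if (state != used && seen_unknown)
    state = unknown;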

>> However, I limit it to 20 just in case.
>
> 20 what?  20 passes over the entire BB tree?  If so, that seems like an
> awful lot.  Please provide a statement like "the run time is at most 20
> * (N + E) where N is the number of basic blocks and E is maximum number
> of outgoing edges from any basic block", i.e., in the standard form
> given in computer science texts?
>

The run-time is at most 20 * N where N is the number of basic blocks,
since we only look at the exit states of the incoming edges.  We repeat
the scan until all incoming edges have stabilized.

-- 
H.J.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-01 16:01                     ` H.J. Lu
@ 2011-01-04  1:15                       ` Mark Mitchell
  2011-01-04  3:59                         ` H.J. Lu
  0 siblings, 1 reply; 46+ messages in thread
From: Mark Mitchell @ 2011-01-04  1:15 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Uros Bizjak, gcc-patches, Richard Guenther, Jakub Jelinek, deVries, Tom

On 1/1/2011 8:01 AM, H.J. Lu wrote:

> We start at the function entry with
> the known state of the upper 128bits of AVX registers.  We repeat all basic
> blocks, in which the upper 128bits of AVX registers are changed, with the
> unknown entry state, if the upper 128bits of AVX registers of any basic blocks
> are changed to used at exit.  Since the number of basic blocks is fixed and
> we don't repeat basic blocks with the known entry state, the algorithm should
> terminate.

OK, that's a good argument.  Isn't this just a standard forward
data-flow problem?  Are we using the machinery we have for such
problems?  And, in fact, isn't this essentially similar to getting rid
of unnecessary sign extensions, which Tom de Vries has been working on?

>>> However, I limit it to 20 just in case.

> The run-tme is at most 20 * N where N is the number of basic blocks
> since we only look at the exit state of the incoming edges. We repeat
> the scan until all incoming edges are stabilized.

20 sounds high to me, but I'll leave that to the x86 maintainers since
this is an x86-specific pass.  If you solve this as a forward data-flow
problem (visiting the blocks in the appropriate order), it seems
unlikely to me that you'd need more than two or three iterations for the
vast majority of non-pathological cases.

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-04  1:15                       ` Mark Mitchell
@ 2011-01-04  3:59                         ` H.J. Lu
  2011-01-04  5:54                           ` Mark Mitchell
  0 siblings, 1 reply; 46+ messages in thread
From: H.J. Lu @ 2011-01-04  3:59 UTC (permalink / raw)
  To: Mark Mitchell
  Cc: Uros Bizjak, gcc-patches, Richard Guenther, Jakub Jelinek, deVries, Tom

On Mon, Jan 3, 2011 at 4:47 PM, Mark Mitchell <mark@codesourcery.com> wrote:
> On 1/1/2011 8:01 AM, H.J. Lu wrote:
>
>> We start at the function entry with
>> the known state of the upper 128bits of AVX registers.  We repeat all basic
>> blocks, in which the upper 128bits of AVX registers are changed, with the
>> unknown entry state, if the upper 128bits of AVX registers of any basic blocks
>> are changed to used at exit.  Since the number of basic blocks is fixed and
>> we don't repeat basic blocks with the known entry state, the algorithm should
>> terminate.
>
> OK, that's a good argument.  Isn't this just a standard forward
> data-flow problem?  Are we using the machinery we have for such
> problems?  And, in fact, isn't this essentially similar to getting rid
> of unnecessary sign extensions, which Tom de Vries has been working on?

My problem is whether a portion of any vector register is live.  My
understanding is that data-flow is expensive and overkill, and I couldn't
find a way to get the answer to my question from data-flow.

There are 2 different questions:

1. Whether sign/zero extension on a given GPR is redundant.
2. Whether the upper 128bits of any vector register are live, zero or non-zero.

There are few overlaps.  I would appreciate any suggestions.

>>>> However, I limit it to 20 just in case.
>
>> The run-tme is at most 20 * N where N is the number of basic blocks
>> since we only look at the exit state of the incoming edges. We repeat
>> the scan until all incoming edges are stabilized.
>
> 20 sounds high to me, but I'll leave that to the x86 maintainers since
> this is an x86-specific pass.  If you solve this as a forward data-flow
> problem (visiting the blocks in the appropriate order), it seems
> unlikely to me that you'd need more than two or three iterations for the
> vast majority of non-pathological cases.
>

I counted 8 iterations in some SPEC CPU 2K/2006 benchmarks.

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-04  3:59                         ` H.J. Lu
@ 2011-01-04  5:54                           ` Mark Mitchell
  2011-01-04 22:17                             ` H.J. Lu
  0 siblings, 1 reply; 46+ messages in thread
From: Mark Mitchell @ 2011-01-04  5:54 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Uros Bizjak, gcc-patches, Richard Guenther, Jakub Jelinek, deVries, Tom

On 1/3/2011 7:33 PM, H.J. Lu wrote:

>> OK, that's a good argument.  Isn't this just a standard forward
>> data-flow problem?  Are we using the machinery we have for such
>> problems?  And, in fact, isn't this essentially similar to getting rid
>> of unnecessary sign extensions, which Tom de Vries has been working on?
> 
> My problem is if the portion of any vector register is live. My understandings
> are data-flow is expensive and overkill.

I don't understand this statement.  Data flow problems are ones where
the output of a basic block is dependent on its inputs along incoming
edges and on the behavior of the block itself.  That sounds like exactly
what you have here.  There are standard work-list algorithms for walking
through the basic blocks in the right order and iterating only where
necessary.  Why is that more expensive than walking over all of the blocks?

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-04  5:54                           ` Mark Mitchell
@ 2011-01-04 22:17                             ` H.J. Lu
  2011-01-04 23:53                               ` Mark Mitchell
  0 siblings, 1 reply; 46+ messages in thread
From: H.J. Lu @ 2011-01-04 22:17 UTC (permalink / raw)
  To: Mark Mitchell
  Cc: Uros Bizjak, gcc-patches, Richard Guenther, Jakub Jelinek, deVries, Tom

On Mon, Jan 3, 2011 at 7:41 PM, Mark Mitchell <mark@codesourcery.com> wrote:
> On 1/3/2011 7:33 PM, H.J. Lu wrote:
>
>>> OK, that's a good argument.  Isn't this just a standard forward
>>> data-flow problem?  Are we using the machinery we have for such
>>> problems?  And, in fact, isn't this essentially similar to getting rid
>>> of unnecessary sign extensions, which Tom de Vries has been working on?
>>
>> My problem is if the portion of any vector register is live. My understandings
>> are data-flow is expensive and overkill.
>
> I don't understand this statement.  Data flow problems are ones where
> the output of a basic block is dependent on its inputs along incoming
> edges and on the behavior of the block itself.  That sounds like exactly
> what you have here.  There are standard work-list algorithms for walking
> through the basic blocks in the right order and iterating only where
> necessary.  Why is that more expensive than walking over all of the blocks?
>

I tried to use DF.  DF can tell me if a register is live or dead at the basic
block entry.  But what I want to know is whether the upper 128bits of a vector
register are zero, dead or live at the basic block entry.  Can DF tell me
that?

Thanks.



-- 
H.J.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-04 22:17                             ` H.J. Lu
@ 2011-01-04 23:53                               ` Mark Mitchell
  2011-01-05  0:06                                 ` H.J. Lu
  0 siblings, 1 reply; 46+ messages in thread
From: Mark Mitchell @ 2011-01-04 23:53 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Uros Bizjak, gcc-patches, Richard Guenther, Jakub Jelinek, deVries, Tom

On 1/4/2011 2:09 PM, H.J. Lu wrote:

> I tried to use DF.  DF can tell me if a register is live or dead at the basic
> block entry.  But what I want to know is if the upper 128bit of a vector
> register is zero, dead or live, at the basic block entry.  Can DF tell me
> that?

DF, in the sense of math, can certainly tell you that.  I'm not sure
about DF, in the sense of code presently in GCC.  But, if not, it could
be made to do so; what you have meets the requirements for dataflow
abstraction.  If our infrastructure isn't powerful enough, we should
probably make it more powerful.

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-04 23:53                               ` Mark Mitchell
@ 2011-01-05  0:06                                 ` H.J. Lu
  2011-01-05  0:08                                   ` Mark Mitchell
  0 siblings, 1 reply; 46+ messages in thread
From: H.J. Lu @ 2011-01-05  0:06 UTC (permalink / raw)
  To: Mark Mitchell
  Cc: Uros Bizjak, gcc-patches, Richard Guenther, Jakub Jelinek, deVries, Tom

On Tue, Jan 4, 2011 at 3:35 PM, Mark Mitchell <mark@codesourcery.com> wrote:
> On 1/4/2011 2:09 PM, H.J. Lu wrote:
>
>> I tried to use DF.  DF can tell me if a register is live or dead at the basic
>> block entry.  But what I want to know is if the upper 128bit of a vector
>> register is zero, dead or live, at the basic block entry.  Can DF tell me
>> that?
>
> DF, in the sense of math, can certainly tell you that.  I'm not sure
> about DF, in the sense of code presently in GCC.  But, if not, it could
> be made to do so; what you have meets the requirements for dataflow
> abstraction.  If our infrastructure isn't powerful enough, we should
> probably make it more powerful.
>

Missing info in the current DF infrastructure:

1. Used/unused registers at basic block boundaries.
2. The accessing modes of dead/live/used/unused registers at basic
block boundaries.
3. Zero/sign extension info of live registers at the basic block entry.


-- 
H.J.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-05  0:06                                 ` H.J. Lu
@ 2011-01-05  0:08                                   ` Mark Mitchell
  2011-01-05  0:09                                     ` H.J. Lu
  0 siblings, 1 reply; 46+ messages in thread
From: Mark Mitchell @ 2011-01-05  0:08 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Uros Bizjak, gcc-patches, Richard Guenther, Jakub Jelinek, deVries, Tom

On 1/4/2011 3:50 PM, H.J. Lu wrote:

> 1. Used/unused registers at basic block boundaries.
> 2. The accessing modes of dead/live/used/unused registers at basic
> block boundaries.
> 3. Zero/sign extension info of live registers at the basic block entry.

HJ, I'm not sure what point you're trying to make.  My point is that
using standard data-flow techniques to solve this problem seems correct.
 You seem to be saying that our current infrastructure doesn't have
everything you need.  Presuming that you agree that using standard
data-flow techniques is appropriate, that leaves you with two viable
options: (a) enhance the infrastructure, (b) use the algorithms, but not
the infrastructure.

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-05  0:08                                   ` Mark Mitchell
@ 2011-01-05  0:09                                     ` H.J. Lu
  2011-01-05  0:24                                       ` Mark Mitchell
  0 siblings, 1 reply; 46+ messages in thread
From: H.J. Lu @ 2011-01-05  0:09 UTC (permalink / raw)
  To: Mark Mitchell
  Cc: Uros Bizjak, gcc-patches, Richard Guenther, Jakub Jelinek, deVries, Tom

On Tue, Jan 4, 2011 at 3:53 PM, Mark Mitchell <mark@codesourcery.com> wrote:
> On 1/4/2011 3:50 PM, H.J. Lu wrote:
>
>> 1. Used/unused registers at basic block boundaries.
>> 2. The accessing modes of dead/live/used/unused registers at basic
>> block boundaries.
>> 3. Zero/sign extension info of live registers at the basic block entry.
>
> HJ, I'm not sure what point you're trying to make.  My point is that
> using standard data-flow techniques to solve this problem seems correct.
>  You seem to be saying that our current infrastructure doesn't have
> everything you need.  Presuming that you agree that using standard

That is correct.

> data-flow techniques is appropriate, that leaves you with two viable
> options: (a) enhance the infrastructure, (b) use the algorithms, but not
> the infrastructure.
>

Enhancing the DF infrastructure is beyond my resources.  I
will take a look at the DF algorithm.

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-05  0:09                                     ` H.J. Lu
@ 2011-01-05  0:24                                       ` Mark Mitchell
  2011-01-05 16:44                                         ` H.J. Lu
  0 siblings, 1 reply; 46+ messages in thread
From: Mark Mitchell @ 2011-01-05  0:24 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Uros Bizjak, gcc-patches, Richard Guenther, Jakub Jelinek, deVries, Tom

On 1/4/2011 4:06 PM, H.J. Lu wrote:

> Enhance the DF infrastructure is beyond my resources.  I
> will take a look at the DF algorithm.

Wikipedia (or any good compiler book) should have a good description of
the appropriate work-list based algorithms.  The basic idea is that you
walk the BB tree in the right order (starting at the entry blocks),
adding successor blocks to the worklist whenever you change a block.
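
Concretely, a throw-away toy of such a forward data-flow work-list loop
might look like the C below.  It is purely illustrative: the CFG, the
transfer function and all the names are made up, and none of it is GCC
infrastructure; it just uses the same three-state idea as your pass.

  /* Toy forward data-flow work-list loop over a hard-wired 4-block CFG:
     bb0 -> bb1 -> bb2 -> bb3 -> bb1 (a loop).  */
  #include <stdio.h>

  enum state { UNKNOWN = 0, UNUSED, USED };
  #define NBLOCKS 4

  /* Predecessors/successors of each block, -1 terminated.  */
  static const int preds[NBLOCKS][3] = {
    { -1 }, { 0, 3, -1 }, { 1, -1 }, { 2, -1 }
  };
  static const int succs[NBLOCKS][3] = {
    { 1, -1 }, { 2, -1 }, { 3, -1 }, { 1, -1 }
  };

  /* Toy transfer function: bb2 dirties the upper bits, the rest pass
     their entry state through.  */
  static enum state
  transfer (int b, enum state in)
  {
    return b == 2 ? USED : in;
  }

  int
  main (void)
  {
    enum state out[NBLOCKS] = { UNUSED, UNKNOWN, UNKNOWN, UNKNOWN };
    int worklist[64], wl_len = 0, i;
    const int *p;

    /* Seed the work list with everything except the entry block.  */
    for (i = 1; i < NBLOCKS; i++)
      worklist[wl_len++] = i;

    while (wl_len > 0)
      {
        int b = worklist[--wl_len];
        enum state in = UNUSED, new_out;

        /* Meet over predecessors: any USED wins, else UNKNOWN taints.  */
        for (p = preds[b]; *p != -1; p++)
          if (out[*p] == USED)
            in = USED;
          else if (out[*p] == UNKNOWN && in != USED)
            in = UNKNOWN;

        new_out = transfer (b, in);
        if (new_out != out[b])
          {
            out[b] = new_out;
            /* Revisit successors whose input just changed.  */
            for (p = succs[b]; *p != -1; p++)
              worklist[wl_len++] = *p;
          }
      }

    for (i = 0; i < NBLOCKS; i++)
      printf ("bb%d exit state: %d\n", i, out[i]);
    return 0;
  }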

Thank you,

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-05  0:24                                       ` Mark Mitchell
@ 2011-01-05 16:44                                         ` H.J. Lu
  2011-01-05 17:12                                           ` Jakub Jelinek
  2011-01-13 18:04                                           ` Richard Henderson
  0 siblings, 2 replies; 46+ messages in thread
From: H.J. Lu @ 2011-01-05 16:44 UTC (permalink / raw)
  To: Mark Mitchell
  Cc: Uros Bizjak, gcc-patches, Richard Guenther, Jakub Jelinek, deVries, Tom

On Tue, Jan 4, 2011 at 4:09 PM, Mark Mitchell <mark@codesourcery.com> wrote:
> On 1/4/2011 4:06 PM, H.J. Lu wrote:
>
>> Enhance the DF infrastructure is beyond my resources.  I
>> will take a look at the DF algorithm.
>
> Wikipedia (or any good compiler book) should have a good description of
> the appropriate work-list based algorithms.  The basic idea is that you
> walk the BB tree in the right order (starting at the entry blocks),
> adding successor blocks to the worklist whenever you change a block.
>

Are there any existing GCC passes which implement their own data-flow
analysis?

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-05 16:44                                         ` H.J. Lu
@ 2011-01-05 17:12                                           ` Jakub Jelinek
  2011-01-05 23:01                                             ` H.J. Lu
  2011-01-13 17:19                                             ` H.J. Lu
  2011-01-13 18:04                                           ` Richard Henderson
  1 sibling, 2 replies; 46+ messages in thread
From: Jakub Jelinek @ 2011-01-05 17:12 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Mark Mitchell, Uros Bizjak, gcc-patches, Richard Guenther, deVries, Tom

On Wed, Jan 05, 2011 at 08:39:51AM -0800, H.J. Lu wrote:
> On Tue, Jan 4, 2011 at 4:09 PM, Mark Mitchell <mark@codesourcery.com> wrote:
> > On 1/4/2011 4:06 PM, H.J. Lu wrote:
> >
> >> Enhance the DF infrastructure is beyond my resources.  I
> >> will take a look at the DF algorithm.
> >
> > Wikipedia (or any good compiler book) should have a good description of
> > the appropriate work-list based algorithms.  The basic idea is that you
> > walk the BB tree in the right order (starting at the entry blocks),
> > adding successor blocks to the worklist whenever you change a block.
> >
> 
> Are there any existing GCC passes which implement their own data-flow
> analysis?

E.g. var-tracking.c (vt_find_locations).

	Jakub

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-05 17:12                                           ` Jakub Jelinek
@ 2011-01-05 23:01                                             ` H.J. Lu
  2011-01-13 17:19                                             ` H.J. Lu
  1 sibling, 0 replies; 46+ messages in thread
From: H.J. Lu @ 2011-01-05 23:01 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Mark Mitchell, Uros Bizjak, gcc-patches, Richard Guenther, deVries, Tom

On Wed, Jan 5, 2011 at 8:46 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Wed, Jan 05, 2011 at 08:39:51AM -0800, H.J. Lu wrote:
>> On Tue, Jan 4, 2011 at 4:09 PM, Mark Mitchell <mark@codesourcery.com> wrote:
>> > On 1/4/2011 4:06 PM, H.J. Lu wrote:
>> >
>> >> Enhance the DF infrastructure is beyond my resources.  I
>> >> will take a look at the DF algorithm.
>> >
>> > Wikipedia (or any good compiler book) should have a good description of
>> > the appropriate work-list based algorithms.  The basic idea is that you
>> > walk the BB tree in the right order (starting at the entry blocks),
>> > adding successor blocks to the worklist whenever you change a block.
>> >
>>
>> Are there any existing GCC passes which implement their own data-flow
>> analysis?
>
> E.g. var-tracking.c (vt_find_locations).
>

Thanks.


-- 
H.J.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-05 17:12                                           ` Jakub Jelinek
  2011-01-05 23:01                                             ` H.J. Lu
@ 2011-01-13 17:19                                             ` H.J. Lu
  2011-01-13 17:25                                               ` Mark Mitchell
  2011-01-13 18:16                                               ` Richard Henderson
  1 sibling, 2 replies; 46+ messages in thread
From: H.J. Lu @ 2011-01-13 17:19 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Mark Mitchell, Uros Bizjak, gcc-patches, Richard Guenther, deVries, Tom

[-- Attachment #1: Type: text/plain, Size: 1745 bytes --]

On Wed, Jan 5, 2011 at 8:46 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Wed, Jan 05, 2011 at 08:39:51AM -0800, H.J. Lu wrote:
>> On Tue, Jan 4, 2011 at 4:09 PM, Mark Mitchell <mark@codesourcery.com> wrote:
>> > On 1/4/2011 4:06 PM, H.J. Lu wrote:
>> >
>> >> Enhance the DF infrastructure is beyond my resources.  I
>> >> will take a look at the DF algorithm.
>> >
>> > Wikipedia (or any good compiler book) should have a good description of
>> > the appropriate work-list based algorithms.  The basic idea is that you
>> > walk the BB tree in the right order (starting at the entry blocks),
>> > adding successor blocks to the worklist whenever you change a block.
>> >
>>
>> Are there any existing GCC passes which implement their own data-flow
>> analysis?
>
> E.g. var-tracking.c (vt_find_locations).
>

Thanks for Mark's suggestion and Jakub's pointer.  This patch implements
the work-list based algorithm.  I built SPEC CPU 2K/2006 with it.  Only 2
benchmarks are different with and without the patch.  It removes 4 extra
vzeroupper insns from 416.gamess in SPEC CPU 2006 and 1 extra vzeroupper
from 177.mesa in SPEC CPU 2K.  The compile time difference is about 0.3%.

There are no AVX-SSE transition penalties in SPEC CPU 2K 32bit/64bit
and SPEC CPU 2006 64bit.  I am running SPEC CPU 2006 32bit.  I am not
expecting any problems.  OK for trunk?

Thanks.


-- 
H.J.
---
2011-01-12  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* config/i386/i386.c: Include sbitmap.h and fibheap.h.
	(move_or_delete_vzeroupper): Visit basic blocks using the
	work-list based algorithm based on vt_find_locations in
	var-tracking.c.

	* config/i386/t-i386: Also depend on sbitmap.h and $(FIBHEAP_H).

[-- Attachment #2: gcc-vzero-df-1.patch --]
[-- Type: text/plain, Size: 5301 bytes --]

2011-01-12  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* config/i386/i386.c: Include sbitmap.h and fibheap.h.
	(move_or_delete_vzeroupper): Visit basic blocks using the
	work-list based algorithm based on vt_find_locations in
	var-tracking.c.

	* config/i386/t-i386: Also depend on sbitmap.h and $(FIBHEAP_H).

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index a26314b..5afc1ae 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -56,6 +56,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "debug.h"
 #include "dwarf2out.h"
 #include "sched-int.h"
+#include "sbitmap.h"
+#include "fibheap.h"
 
 enum upper_128bits_state
 {
@@ -338,14 +340,18 @@ move_or_delete_vzeroupper (void)
   edge e;
   edge_iterator ei;
   basic_block bb;
-  unsigned int count;
+  fibheap_t worklist, pending, fibheap_swap;
+  sbitmap visited, in_worklist, in_pending, sbitmap_swap;
+  int *bb_order;
+  int *rc_order;
+  int i;
 
   /* Set up block info for each basic block.  */
   alloc_aux_for_blocks (sizeof (struct block_info_def));
 
-  /* Process successor blocks of all entry points.  */
+  /* Process outgoing edges of entry point.  */
   if (dump_file)
-    fprintf (dump_file, "Process all entry points\n");
+    fprintf (dump_file, "Process outgoing edges of entry point\n");
 
   FOR_EACH_EDGE (e, ei, ENTRY_BLOCK_PTR->succs)
     {
@@ -355,25 +361,100 @@ move_or_delete_vzeroupper (void)
       BLOCK_INFO (e->dest)->processed = true;
     }
 
-  /* Process all basic blocks.  */
-  count = 0;
-  do
+  /* Compute reverse completion order of depth first search of the CFG
+     so that the data-flow runs faster.  */
+  rc_order = XNEWVEC (int, n_basic_blocks - NUM_FIXED_BLOCKS);
+  bb_order = XNEWVEC (int, last_basic_block);
+  pre_and_rev_post_order_compute (NULL, rc_order, false);
+  for (i = 0; i < n_basic_blocks - NUM_FIXED_BLOCKS; i++)
+    bb_order[rc_order[i]] = i;
+  free (rc_order);
+
+  worklist = fibheap_new ();
+  pending = fibheap_new ();
+  visited = sbitmap_alloc (last_basic_block);
+  in_worklist = sbitmap_alloc (last_basic_block);
+  in_pending = sbitmap_alloc (last_basic_block);
+  sbitmap_zero (in_worklist);
+
+  /* Don't check outgoing edges of entry point.  */
+  sbitmap_ones (in_pending);
+  FOR_EACH_BB (bb)
+    if (BLOCK_INFO (bb)->processed)
+      RESET_BIT (in_pending, bb->index);
+    else
+      fibheap_insert (pending, bb_order[bb->index], bb);
+
+  if (dump_file)
+    fprintf (dump_file, "Check remaining basic blocks\n");
+
+  while (!fibheap_empty (pending))
     {
-      if (dump_file)
-	fprintf (dump_file, "Process all basic blocks: trip %d\n",
-		 count);
+      fibheap_swap = pending;
+      pending = worklist;
+      worklist = fibheap_swap;
+      sbitmap_swap = in_pending;
+      in_pending = in_worklist;
+      in_worklist = sbitmap_swap;
+
+      sbitmap_zero (visited);
+
       cfun->machine->rescan_vzeroupper_p = 0;
-      FOR_EACH_BB (bb)
-	move_or_delete_vzeroupper_1 (bb, false);
+
+      while (!fibheap_empty (worklist))
+	{
+	  bb = (basic_block) fibheap_extract_min (worklist);
+	  RESET_BIT (in_worklist, bb->index);
+	  gcc_assert (!TEST_BIT (visited, bb->index));
+	  if (!TEST_BIT (visited, bb->index))
+	    {
+	      edge_iterator ei;
+
+	      SET_BIT (visited, bb->index);
+
+	      move_or_delete_vzeroupper_1 (bb, false);
+
+	      FOR_EACH_EDGE (e, ei, bb->succs)
+		{
+		  if (e->dest == EXIT_BLOCK_PTR
+		      || BLOCK_INFO (e->dest)->processed)
+		    continue;
+
+		  if (TEST_BIT (visited, e->dest->index))
+		    {
+		      if (!TEST_BIT (in_pending, e->dest->index))
+			{
+			  /* Send E->DEST to next round.  */
+			  SET_BIT (in_pending, e->dest->index);
+			  fibheap_insert (pending,
+					  bb_order[e->dest->index],
+					  e->dest);
+			}
+		    }
+		  else if (!TEST_BIT (in_worklist, e->dest->index))
+		    {
+		      /* Add E->DEST to current round.  */
+		      SET_BIT (in_worklist, e->dest->index);
+		      fibheap_insert (worklist, bb_order[e->dest->index],
+				      e->dest);
+		    }
+		}
+	    }
+	}
+
+      if (!cfun->machine->rescan_vzeroupper_p)
+	break;
     }
-  while (cfun->machine->rescan_vzeroupper_p && count++ < 20);
 
-  /* FIXME: Is 20 big enough?  */
-  if (count >= 20)
-    gcc_unreachable ();
+  free (bb_order);
+  fibheap_delete (worklist);
+  fibheap_delete (pending);
+  sbitmap_free (visited);
+  sbitmap_free (in_worklist);
+  sbitmap_free (in_pending);
 
   if (dump_file)
-    fprintf (dump_file, "Process all basic blocks\n");
+    fprintf (dump_file, "Process remaining basic blocks\n");
 
   FOR_EACH_BB (bb)
     move_or_delete_vzeroupper_1 (bb, true);
diff --git a/gcc/config/i386/t-i386 b/gcc/config/i386/t-i386
index 6c801a5..1c658a1 100644
--- a/gcc/config/i386/t-i386
+++ b/gcc/config/i386/t-i386
@@ -23,7 +23,7 @@ i386.o: $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) \
   $(RECOG_H) $(EXPR_H) $(OPTABS_H) toplev.h $(BASIC_BLOCK_H) \
   $(GGC_H) $(TARGET_H) $(TARGET_DEF_H) langhooks.h $(CGRAPH_H) \
   $(TREE_GIMPLE_H) $(DWARF2_H) $(DF_H) tm-constrs.h $(PARAMS_H) \
-  i386-builtin-types.inc debug.h dwarf2out.h
+  i386-builtin-types.inc debug.h dwarf2out.h sbitmap.h $(FIBHEAP_H)
 
 i386-c.o: $(srcdir)/config/i386/i386-c.c \
   $(srcdir)/config/i386/i386-protos.h $(CONFIG_H) $(SYSTEM_H) coretypes.h \

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-13 17:19                                             ` H.J. Lu
@ 2011-01-13 17:25                                               ` Mark Mitchell
  2011-01-13 18:16                                               ` Richard Henderson
  1 sibling, 0 replies; 46+ messages in thread
From: Mark Mitchell @ 2011-01-13 17:25 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Jakub Jelinek, Uros Bizjak, gcc-patches, Richard Guenther, deVries, Tom

On 1/13/2011 9:13 AM, H.J. Lu wrote:

> Thanks for Mark's suggestion and Jakub's pointer.  This patch implements
> the work-list based algorithm.  I built SPEC CPU 2K/2006 with it.  Only 2
> benchmarks are different with and without the patch.  It removes 4 extra
> vzeroupper insns from 416.gamess in SPEC CPU 2006 and 1 extra vzeroupper
> from 177.mesa in SPEC CPU 2K.  The compile time difference is about 0.3%.

HJ, thank you for implementing the worklist algorithm.  I have no
further comments on this patch; I would defer to the x86 maintainers.

Thank you,

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-05 16:44                                         ` H.J. Lu
  2011-01-05 17:12                                           ` Jakub Jelinek
@ 2011-01-13 18:04                                           ` Richard Henderson
  2011-01-13 18:09                                             ` H.J. Lu
  1 sibling, 1 reply; 46+ messages in thread
From: Richard Henderson @ 2011-01-13 18:04 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Mark Mitchell, Uros Bizjak, gcc-patches, Richard Guenther,
	Jakub Jelinek, deVries, Tom

On 01/05/2011 08:39 AM, H.J. Lu wrote:
> Are there any existing GCC passes which implement their own data-flow
> analysis?

See also walk_dominator_tree and in general domwalk.c.

I don't believe you can iterate with walk_dominator_tree, but
examining its implementation will show you (1) how to walk the
blocks in an optimal order and (2) how to use a worklist to
queue blocks for (re-)processing.

Alternately, one pass via walk_dominator_tree might give results
that are good enough such that you may drop iteration entirely.


r~

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-13 18:04                                           ` Richard Henderson
@ 2011-01-13 18:09                                             ` H.J. Lu
  0 siblings, 0 replies; 46+ messages in thread
From: H.J. Lu @ 2011-01-13 18:09 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Mark Mitchell, Uros Bizjak, gcc-patches, Richard Guenther,
	Jakub Jelinek, deVries, Tom

On Thu, Jan 13, 2011 at 9:49 AM, Richard Henderson <rth@redhat.com> wrote:
> On 01/05/2011 08:39 AM, H.J. Lu wrote:
>> Are there any existing GCC passes which implement their own data-flow
>> analysis?
>
> See also walk_dominator_tree and in general domwalk.c.
>
> I don't believe you can iterate with walk_dominator_tree, but
> examining its implementation will show you (1) how to walk the
> blocks in an optimal order and (2) how to use a worklist to
> queue blocks for (re-)processing.
>
> Alternately, one pass via walk_dominator_tree might give results
> that are good enough such that you may drop iteration entirely.
>

Hi Richard,

I implemented the similar work-list based algorithm based on
vt_find_locations in var-tracking.c:

http://gcc.gnu.org/ml/gcc-patches/2011-01/msg00846.html

Does it it look OK?

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-13 17:19                                             ` H.J. Lu
  2011-01-13 17:25                                               ` Mark Mitchell
@ 2011-01-13 18:16                                               ` Richard Henderson
  2011-01-13 18:51                                                 ` H.J. Lu
  1 sibling, 1 reply; 46+ messages in thread
From: Richard Henderson @ 2011-01-13 18:16 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Jakub Jelinek, Mark Mitchell, Uros Bizjak, gcc-patches,
	Richard Guenther, deVries, Tom

On 01/13/2011 09:13 AM, H.J. Lu wrote:
> +      sbitmap_zero (visited);
> +
>        cfun->machine->rescan_vzeroupper_p = 0;
> +
> +      while (!fibheap_empty (worklist))
> +	{
> +	  bb = (basic_block) fibheap_extract_min (worklist);
> +	  RESET_BIT (in_worklist, bb->index);
> +	  gcc_assert (!TEST_BIT (visited, bb->index));
> +	  if (!TEST_BIT (visited, bb->index))
> +	    {
> +	      edge_iterator ei;
> +
> +	      SET_BIT (visited, bb->index);
> +
> +	      move_or_delete_vzeroupper_1 (bb, false);
> +
> +	      FOR_EACH_EDGE (e, ei, bb->succs)

Hum.  Your use of the worklist appears to be totally superficial.

This is a very complicated way to write

  for (i = 0; i < n; ++i)
    move_or_delete_vzeroupper_1 (BASIC_BLOCK (bb_order[i]), false);

Which is still an improvement, mind, since you're now scanning
the blocks in a more intelligent order.

Proper use of a worklist would do something like

  changed = move_or_delete_vzeroupper_1 (bb, false);
  if (changed)
    FOR_EACH_EDGE (...)
      ...

I.e. not queue successors if nothing has changed in the current block.
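
Spelled out as a stand-alone sketch (plain C over a toy CFG; the names are
made up and this is not the GCC pass itself), that discipline would look
roughly like:

  #include <stdbool.h>
  #include <stdio.h>

  #define NBLOCKS 4

  /* Toy CFG: 0 -> {1, 2}, 1 -> 3, 2 -> 3.  */
  static const int nsucc[NBLOCKS]   = { 2, 1, 1, 0 };
  static const int succ[NBLOCKS][2] = { { 1, 2 }, { 3, 0 }, { 3, 0 }, { 0, 0 } };
  static const int npred[NBLOCKS]   = { 0, 1, 1, 2 };
  static const int pred[NBLOCKS][2] = { { 0, 0 }, { 0, 0 }, { 0, 0 }, { 1, 2 } };

  /* Whether the block itself dirties the tracked state, e.g. writes a
     256-bit register.  */
  static const bool dirties[NBLOCKS] = { false, true, false, false };

  static int exit_state[NBLOCKS];   /* 0 = clean, 1 = dirty */

  /* Exit state is a function of the entry state (meet over predecessors)
     and the block's own effect.  Return true iff the exit state changed.  */
  static bool
  scan_block (int bb)
  {
    int entry = 0, new_state, i;
    for (i = 0; i < npred[bb]; i++)
      entry |= exit_state[pred[bb][i]];
    new_state = entry | dirties[bb];
    if (new_state == exit_state[bb])
      return false;
    exit_state[bb] = new_state;
    return true;
  }

  int
  main (void)
  {
    int worklist[64], head = 0, tail = 0, bb, i;
    bool queued[NBLOCKS] = { false, false, false, false };

    /* Seed the worklist with every block so each one is scanned once.  */
    for (bb = 0; bb < NBLOCKS; bb++)
      {
        worklist[tail++] = bb;
        queued[bb] = true;
      }

    while (head < tail)
      {
        bb = worklist[head++];
        queued[bb] = false;

        /* Re-queue successors only if BB's exit state changed.  */
        if (scan_block (bb))
          for (i = 0; i < nsucc[bb]; i++)
            if (!queued[succ[bb][i]])
              {
                queued[succ[bb][i]] = true;
                worklist[tail++] = succ[bb][i];
              }
      }

    for (bb = 0; bb < NBLOCKS; bb++)
      printf ("bb %d: exit state %d\n", bb, exit_state[bb]);
    return 0;
  }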


r~

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-13 18:16                                               ` Richard Henderson
@ 2011-01-13 18:51                                                 ` H.J. Lu
  2011-01-14 16:06                                                   ` Richard Henderson
  0 siblings, 1 reply; 46+ messages in thread
From: H.J. Lu @ 2011-01-13 18:51 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Jakub Jelinek, Mark Mitchell, Uros Bizjak, gcc-patches,
	Richard Guenther, deVries, Tom

On Thu, Jan 13, 2011 at 10:03 AM, Richard Henderson <rth@redhat.com> wrote:
> On 01/13/2011 09:13 AM, H.J. Lu wrote:
>> +      sbitmap_zero (visited);
>> +
>>        cfun->machine->rescan_vzeroupper_p = 0;
>> +
>> +      while (!fibheap_empty (worklist))
>> +     {
>> +       bb = (basic_block) fibheap_extract_min (worklist);
>> +       RESET_BIT (in_worklist, bb->index);
>> +       gcc_assert (!TEST_BIT (visited, bb->index));
>> +       if (!TEST_BIT (visited, bb->index))
>> +         {
>> +           edge_iterator ei;
>> +
>> +           SET_BIT (visited, bb->index);
>> +
>> +           move_or_delete_vzeroupper_1 (bb, false);
>> +
>> +           FOR_EACH_EDGE (e, ei, bb->succs)
>
> Hum.  Your use of the worklist appears to be totally superficial.
>
> This is a very complicated way to write
>
>  for (i = 0; i < n; ++i)
>    move_or_delete_vzeroupper_1 (BASIC_BLOCK (bb_order[i]), false);
>
> Which is still an improvement, mind, since you're now scanning
> the blocks in a more intelligent order.
>
> Proper use of a worklist would do something like
>
>  changed = move_or_delete_vzeroupper_1 (bb, false);
>  if (changed)
>    FOR_EACH_EDGE (...)
>      ...
>
> I.e. not queue successors if nothing has changed in the current block.
>

We have to scan its successors even if the exit state of the current block
is unchanged since one predecessor exit state is just one factor which
impacts the exit state and vzeroupper optimization. We must repeatedly
propagate the exit states and scan basic blocks until the exit states of
all basic blocks have stabilized.

We can skip a basic block only if it has been processed or no insns
inside the block will touch the upper 128bits.
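
In terms of the fields the pass already tracks, that skip test is roughly
(a sketch of the idea, not a literal hunk from the patch):

  if (BLOCK_INFO (bb)->processed
      || BLOCK_INFO (bb)->referenced == unused)
    /* Either the block is already handled, or no insn in it can touch
       the upper 128 bits, so rescanning it cannot change anything.  */
    continue;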


-- 
H.J.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-13 18:51                                                 ` H.J. Lu
@ 2011-01-14 16:06                                                   ` Richard Henderson
  2011-01-14 16:08                                                     ` H.J. Lu
  0 siblings, 1 reply; 46+ messages in thread
From: Richard Henderson @ 2011-01-14 16:06 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Jakub Jelinek, Mark Mitchell, Uros Bizjak, gcc-patches,
	Richard Guenther, deVries, Tom

On 01/13/2011 10:20 AM, H.J. Lu wrote:
> We have to scan its successors even if the exit state of the current block
> is unchanged since one predecessor exit state is just one factor which
> impacts the exit state and vzeroupper optimization.

Then you're not interpreting the "state" of a block broadly enough.

You certainly can consider the exit state of a block to be a function
of its input state and whatever it does locally.  Indeed, this is
exactly the formulation of the problem that you will need in order to
re-use the LCM infrastructure for 4.7.  But it doesn't hurt to think
about the problem in those terms now.
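
In code terms (an illustrative sketch only, with made-up names -- not the
actual pass), that formulation separates a per-block local effect from a
pure transfer function:

  #include <stdbool.h>
  #include <stdio.h>

  enum ymm_state { YMM_UNUSED, YMM_USED };

  /* Summary of what a block does locally, computed once per block.  */
  struct local_effect
  {
    bool clears_upper;   /* block ends with a vzeroupper/vzeroall effect */
    bool dirties_upper;  /* block writes a 256-bit register after any clear */
  };

  /* Exit state as a pure function of the entry state and the local
     effect; a transparent block simply passes the state through.  */
  static enum ymm_state
  transfer (enum ymm_state entry, struct local_effect local)
  {
    if (local.dirties_upper)
      return YMM_USED;
    if (local.clears_upper)
      return YMM_UNUSED;
    return entry;
  }

  int
  main (void)
  {
    struct local_effect clears = { true, false };
    /* A block that only clears leaves the upper halves unused at exit,
       whatever the entry state was.  Prints 0 (YMM_UNUSED).  */
    printf ("%d\n", (int) transfer (YMM_USED, clears));
    return 0;
  }

With the local effect precomputed, the propagation never has to look inside
a block again, which is roughly the shape the infrastructure in gcc/lcm.c
expects: per-block local properties plus a global propagation step.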


r~

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-14 16:06                                                   ` Richard Henderson
@ 2011-01-14 16:08                                                     ` H.J. Lu
  2011-01-16  8:04                                                       ` H.J. Lu
  0 siblings, 1 reply; 46+ messages in thread
From: H.J. Lu @ 2011-01-14 16:08 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Jakub Jelinek, Mark Mitchell, Uros Bizjak, gcc-patches,
	Richard Guenther, deVries, Tom

On Fri, Jan 14, 2011 at 7:36 AM, Richard Henderson <rth@redhat.com> wrote:
> On 01/13/2011 10:20 AM, H.J. Lu wrote:
>> We have to scan its successors even if the exit state of the current block
>> is unchanged since one predecessor exit state is just one factor which
>> impacts the exit state and vzeroupper optimization.
>
> Then you're not interpreting the "state" of a block broadly enough.
>
> You certainly can consider the exit state of a block to be a function
> of its input state and whatever it does locally.  Indeed, this is
> exactly the formulation of the problem that you will need in order to
> re-use the LCM infrastructure for 4.7.  But it doesn't hurt to think
> about the problem in those terms now.
>

vzeroupper works this way:

1. We add vzeroupper where there may be AVX->SSE transition
at function call and return RTL expansion.  It won't affect
correctness of the program.
2. We also add a special vzeroupper to function call which returns
or passes AVX registers. This will affect correctness of the program
if it isn't removed.

In vzeroupper optimization pass, we use the special vzeroupper to
propagate the AVX register state and remove it afterward.  We
also remove any redundant vzeroupper, which depends on the
AVX register state at basic block entry.

Because of #2, we have to scan a basic block at least once.
I can add more checks to avoid unnecessary rescans.
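
For example, the redundancy check boils down to something like this (a rough
sketch; insn_is_vzeroupper is a made-up helper, not the pass's actual code):

  if (entry_state == unused && insn_is_vzeroupper (insn))
    /* The upper halves are already clear on entry, so this vzeroupper
       is redundant and can simply be deleted.  */
    delete_insn (insn);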


-- 
H.J.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-14 16:08                                                     ` H.J. Lu
@ 2011-01-16  8:04                                                       ` H.J. Lu
  2011-01-24 18:00                                                         ` Richard Henderson
  0 siblings, 1 reply; 46+ messages in thread
From: H.J. Lu @ 2011-01-16  8:04 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Jakub Jelinek, Mark Mitchell, Uros Bizjak, gcc-patches,
	Richard Guenther, deVries, Tom

[-- Attachment #1: Type: text/plain, Size: 2370 bytes --]

On Fri, Jan 14, 2011 at 8:02 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Fri, Jan 14, 2011 at 7:36 AM, Richard Henderson <rth@redhat.com> wrote:
>> On 01/13/2011 10:20 AM, H.J. Lu wrote:
>>> We have to scan its successors even if the exit state of the current block
>>> is unchanged since one predecessor exit state is just one factor which
>>> impacts the exit state and vzeroupper optimization.
>>
>> Then you're not interpreting the "state" of a block broadly enough.
>>
>> You certainly can consider the exit state of a block to be a function
>> of its input state and whatever it does locally.  Indeed, this is
>> exactly the formulation of the problem that you will need in order to
>> re-use the LCM infrastructure for 4.7.  But it doesn't hurt to think
>> about the problem in those terms now.
>>
>
> vzeroupper works this way:
>
> 1. We add vzeroupper where there may be AVX->SSE transition
> at function call and return RTL expansion.  It won't affect
> correctness of the program.
> 2. We also add a special vzeroupper to function call which returns
> or passes AVX registers. This will affect correctness of the program
> if it isn't removed.
>
> In vzeroupper optimization pass, we use the special vzeroupper to
> propagate the AVX register state and remove it afterward.  We
> also remove any redundant vzeroupper, which depends on the
> AVX register state at basic block entry.
>
> Because of #2, we have to scan a basic block at least once.
> I can add more checks to avoid unnecessary rescans.
>

Here is the new patch.  It rechecks outgoing edges only if the exit
state is changed.  It generates the same SPEC CPU 2K/2006
executables as trunk.  The compile time difference is less than 0.3%.
OK for trunk?

Thanks.

-- 
H.J.
---
2011-01-15  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* config/i386/i386.c: Include sbitmap.h and fibheap.h.
	(block_info): Add scanned and prev.
	(move_or_delete_vzeroupper_2): Return if the basic block
	has been scanned and the upper 128bit state is unchanged
	from the last scan.
	(move_or_delete_vzeroupper_1): Return true if the exit
	state is changed.
	(move_or_delete_vzeroupper): Visit basic blocks using the
	work-list based algorithm based on vt_find_locations in
	var-tracking.c.

	* config/i386/t-i386: Also depend on sbitmap.h and $(FIBHEAP_H).

[-- Attachment #2: gcc-vzero-df-2.patch --]
[-- Type: text/plain, Size: 11401 bytes --]

2011-01-15  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* config/i386/i386.c: Include sbitmap.h and fibheap.h.
	(block_info): Add scanned and prev.
	(move_or_delete_vzeroupper_2): Return if the basic block
	has been scanned and the upper 128bit state is unchanged
	from the last scan.
	(move_or_delete_vzeroupper_1): Return true if the exit
	state is changed.
	(move_or_delete_vzeroupper): Visit basic blocks using the
	work-list based algorithm based on vt_find_locations in
	var-tracking.c.

	* config/i386/t-i386: Also depend on sbitmap.h and $(FIBHEAP_H).

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index a26314b..450cddf 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -56,6 +56,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "debug.h"
 #include "dwarf2out.h"
 #include "sched-int.h"
+#include "sbitmap.h"
+#include "fibheap.h"
 
 enum upper_128bits_state
 {
@@ -73,6 +75,10 @@ typedef struct block_info_def
   bool unchanged;
   /* TRUE if block has been processed.  */
   bool processed;
+  /* TRUE if block has been scanned.  */
+  bool scanned;
+  /* Previous state of the upper 128bits of AVX registers at entry.  */
+  enum upper_128bits_state prev;
 } *block_info;
 
 #define BLOCK_INFO(B)   ((block_info) (B)->aux)
@@ -135,6 +141,16 @@ move_or_delete_vzeroupper_2 (basic_block bb,
       return;
     }
 
+  if (BLOCK_INFO (bb)->scanned && BLOCK_INFO (bb)->prev == state)
+    {
+      if (dump_file)
+	fprintf (dump_file, " [bb %i] scanned: upper 128bits: %d\n",
+		 bb->index, BLOCK_INFO (bb)->state);
+      return;
+    }
+
+  BLOCK_INFO (bb)->prev = state;
+
   if (dump_file)
     fprintf (dump_file, " [bb %i] entry: upper 128bits: %d\n",
 	     bb->index, state);
@@ -264,6 +280,7 @@ move_or_delete_vzeroupper_2 (basic_block bb,
 
   BLOCK_INFO (bb)->state = state;
   BLOCK_INFO (bb)->unchanged = unchanged;
+  BLOCK_INFO (bb)->scanned = true;
 
   if (dump_file)
     fprintf (dump_file, " [bb %i] exit: %s: upper 128bits: %d\n",
@@ -273,9 +290,10 @@ move_or_delete_vzeroupper_2 (basic_block bb,
 
 /* Helper function for move_or_delete_vzeroupper.  Process vzeroupper
    in BLOCK and check its predecessor blocks.  Treat UNKNOWN state
-   as USED if UNKNOWN_IS_UNUSED is true.  */
+   as USED if UNKNOWN_IS_UNUSED is true.  Return TRUE if the exit
+   state is changed.  */
 
-static void
+static bool
 move_or_delete_vzeroupper_1 (basic_block block, bool unknown_is_unused)
 {
   edge e;
@@ -288,7 +306,7 @@ move_or_delete_vzeroupper_1 (basic_block block, bool unknown_is_unused)
 	     block->index, BLOCK_INFO (block)->processed);
 
   if (BLOCK_INFO (block)->processed)
-    return;
+    return false;
 
   state = unused;
 
@@ -324,8 +342,14 @@ done:
 
   /* Need to rescan if the upper 128bits of AVX registers are changed
      to USED at exit.  */
-  if (new_state != old_state && new_state == used)
-    cfun->machine->rescan_vzeroupper_p = 1;
+  if (new_state != old_state)
+    {
+      if (new_state == used)
+	cfun->machine->rescan_vzeroupper_p = 1;
+      return true;
+    }
+  else
+    return false;
 }
 
 /* Go through the instruction stream looking for vzeroupper.  Delete
@@ -338,14 +362,18 @@ move_or_delete_vzeroupper (void)
   edge e;
   edge_iterator ei;
   basic_block bb;
-  unsigned int count;
+  fibheap_t worklist, pending, fibheap_swap;
+  sbitmap visited, in_worklist, in_pending, sbitmap_swap;
+  int *bb_order;
+  int *rc_order;
+  int i;
 
   /* Set up block info for each basic block.  */
   alloc_aux_for_blocks (sizeof (struct block_info_def));
 
-  /* Process successor blocks of all entry points.  */
+  /* Process outgoing edges of entry point.  */
   if (dump_file)
-    fprintf (dump_file, "Process all entry points\n");
+    fprintf (dump_file, "Process outgoing edges of entry point\n");
 
   FOR_EACH_EDGE (e, ei, ENTRY_BLOCK_PTR->succs)
     {
@@ -355,25 +383,102 @@ move_or_delete_vzeroupper (void)
       BLOCK_INFO (e->dest)->processed = true;
     }
 
-  /* Process all basic blocks.  */
-  count = 0;
-  do
+  /* Compute reverse completion order of depth first search of the CFG
+     so that the data-flow runs faster.  */
+  rc_order = XNEWVEC (int, n_basic_blocks - NUM_FIXED_BLOCKS);
+  bb_order = XNEWVEC (int, last_basic_block);
+  pre_and_rev_post_order_compute (NULL, rc_order, false);
+  for (i = 0; i < n_basic_blocks - NUM_FIXED_BLOCKS; i++)
+    bb_order[rc_order[i]] = i;
+  free (rc_order);
+
+  worklist = fibheap_new ();
+  pending = fibheap_new ();
+  visited = sbitmap_alloc (last_basic_block);
+  in_worklist = sbitmap_alloc (last_basic_block);
+  in_pending = sbitmap_alloc (last_basic_block);
+  sbitmap_zero (in_worklist);
+
+  /* Don't check outgoing edges of entry point.  */
+  sbitmap_ones (in_pending);
+  FOR_EACH_BB (bb)
+    if (BLOCK_INFO (bb)->processed)
+      RESET_BIT (in_pending, bb->index);
+    else
+      {
+	move_or_delete_vzeroupper_1 (bb, false);
+	fibheap_insert (pending, bb_order[bb->index], bb);
+      }
+
+  if (dump_file)
+    fprintf (dump_file, "Check remaining basic blocks\n");
+
+  while (!fibheap_empty (pending))
     {
-      if (dump_file)
-	fprintf (dump_file, "Process all basic blocks: trip %d\n",
-		 count);
+      fibheap_swap = pending;
+      pending = worklist;
+      worklist = fibheap_swap;
+      sbitmap_swap = in_pending;
+      in_pending = in_worklist;
+      in_worklist = sbitmap_swap;
+
+      sbitmap_zero (visited);
+
       cfun->machine->rescan_vzeroupper_p = 0;
-      FOR_EACH_BB (bb)
-	move_or_delete_vzeroupper_1 (bb, false);
+
+      while (!fibheap_empty (worklist))
+	{
+	  bb = (basic_block) fibheap_extract_min (worklist);
+	  RESET_BIT (in_worklist, bb->index);
+	  gcc_assert (!TEST_BIT (visited, bb->index));
+	  if (!TEST_BIT (visited, bb->index))
+	    {
+	      edge_iterator ei;
+
+	      SET_BIT (visited, bb->index);
+
+	      if (move_or_delete_vzeroupper_1 (bb, false))
+		FOR_EACH_EDGE (e, ei, bb->succs)
+		  {
+		    if (e->dest == EXIT_BLOCK_PTR
+			|| BLOCK_INFO (e->dest)->processed)
+		      continue;
+
+		    if (TEST_BIT (visited, e->dest->index))
+		      {
+			if (!TEST_BIT (in_pending, e->dest->index))
+			  {
+			    /* Send E->DEST to next round.  */
+			    SET_BIT (in_pending, e->dest->index);
+			    fibheap_insert (pending,
+					    bb_order[e->dest->index],
+					    e->dest);
+			  }
+		      }
+		    else if (!TEST_BIT (in_worklist, e->dest->index))
+		      {
+			/* Add E->DEST to current round.  */
+			SET_BIT (in_worklist, e->dest->index);
+			fibheap_insert (worklist, bb_order[e->dest->index],
+					e->dest);
+		      }
+		  }
+	    }
+	}
+
+      if (!cfun->machine->rescan_vzeroupper_p)
+	break;
     }
-  while (cfun->machine->rescan_vzeroupper_p && count++ < 20);
 
-  /* FIXME: Is 20 big enough?  */
-  if (count >= 20)
-    gcc_unreachable ();
+  free (bb_order);
+  fibheap_delete (worklist);
+  fibheap_delete (pending);
+  sbitmap_free (visited);
+  sbitmap_free (in_worklist);
+  sbitmap_free (in_pending);
 
   if (dump_file)
-    fprintf (dump_file, "Process all basic blocks\n");
+    fprintf (dump_file, "Process remaining basic blocks\n");
 
   FOR_EACH_BB (bb)
     move_or_delete_vzeroupper_1 (bb, true);
diff --git a/gcc/config/i386/t-i386 b/gcc/config/i386/t-i386
index 6c801a5..1c658a1 100644
--- a/gcc/config/i386/t-i386
+++ b/gcc/config/i386/t-i386
@@ -23,7 +23,7 @@ i386.o: $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) \
   $(RECOG_H) $(EXPR_H) $(OPTABS_H) toplev.h $(BASIC_BLOCK_H) \
   $(GGC_H) $(TARGET_H) $(TARGET_DEF_H) langhooks.h $(CGRAPH_H) \
   $(TREE_GIMPLE_H) $(DWARF2_H) $(DF_H) tm-constrs.h $(PARAMS_H) \
-  i386-builtin-types.inc debug.h dwarf2out.h
+  i386-builtin-types.inc debug.h dwarf2out.h sbitmap.h $(FIBHEAP_H)
 
 i386-c.o: $(srcdir)/config/i386/i386-c.c \
   $(srcdir)/config/i386/i386-protos.h $(CONFIG_H) $(SYSTEM_H) coretypes.h \
diff --git a/gcc/testsuite/ChangeLog.vzero b/gcc/testsuite/ChangeLog.vzero
new file mode 100644
index 0000000..fedac70
--- /dev/null
+++ b/gcc/testsuite/ChangeLog.vzero
@@ -0,0 +1,115 @@
+2010-12-17  H.J. Lu  <hongjiu.lu@intel.com>
+
+	* gfortran.dg/pr46519-2.f90: New.
+
+2010-11-24  H.J. Lu  <hongjiu.lu@intel.com>
+
+	* gfortran.dg/pr46519-1.f: Replace -mtune=generic with
+	-mvzeroupper.
+
+2010-11-24  H.J. Lu  <hongjiu.lu@intel.com>
+
+	* gfortran.dg/pr46519-1.f: New.
+
+2010-11-20  H.J. Lu  <hongjiu.lu@intel.com>
+
+	* gcc.target/i386/avx-vzeroupper-14.c: Replace -O0 with -O2.
+	* gcc.target/i386/avx-vzeroupper-15.c: Likewise.
+	* gcc.target/i386/avx-vzeroupper-16.c: Likewise.
+	* gcc.target/i386/avx-vzeroupper-17.c: Likewise.
+
+	* gcc.target/i386/avx-vzeroupper-25.c: New.
+	* gcc.target/i386/avx-vzeroupper-26.c: Likewise.
+
+2010-11-18  H.J. Lu  <hongjiu.lu@intel.com>
+
+	* gcc.target/i386/avx-vzeroupper-24.c: New.
+
+2010-11-18  H.J. Lu  <hongjiu.lu@intel.com>
+
+	* gcc.target/i386/avx-vzeroupper-21.c: New.
+	* gcc.target/i386/avx-vzeroupper-22.c: Likewise.
+	* gcc.target/i386/avx-vzeroupper-23.c: Likewise.
+
+2010-11-17  H.J. Lu  <hongjiu.lu@intel.com>
+
+	* gcc.target/i386/avx-vzeroupper-10.c: Expect no avx_vzeroupper.
+	* gcc.target/i386/avx-vzeroupper-11.c: Likewise.
+
+2010-11-16  H.J. Lu  <hongjiu.lu@intel.com>
+
+	* gcc.target/i386/avx-vzeroupper-20.c: New.
+
+2010-11-04  H.J. Lu  <hongjiu.lu@intel.com>
+
+	* gcc.target/i386/avx-vzeroupper-19.c: New.
+
+2010-11-03  H.J. Lu  <hongjiu.lu@intel.com>
+
+	PR target/46295
+	* gcc.target/i386/pr46295.c: New.
+
+2010-11-03  H.J. Lu  <hongjiu.lu@intel.com>
+
+	* gcc.target/i386/pr46285.c: Remove -dp.
+
+2010-11-02  H.J. Lu  <hongjiu.lu@intel.com>
+
+	PR target/46285
+	* gcc.target/i386/pr46285.c: New.
+
+2010-11-02  H.J. Lu  <hongjiu.lu@intel.com>
+
+	PR target/46253
+	* gcc.target/i386/pr46253.c: New.
+
+2010-11-02  H.J. Lu  <hongjiu.lu@intel.com>
+
+	* gcc.target/i386/avx-vzeroupper-16.c: New.
+	* gcc.target/i386/avx-vzeroupper-17.c: Likewise.
+	* gcc.target/i386/avx-vzeroupper-18.c: Likewise.
+
+2010-11-02  H.J. Lu  <hongjiu.lu@intel.com>
+
+	* gcc.target/i386/avx-vzeroupper-15.c: New.
+
+2010-10-26  H.J. Lu  <hongjiu.lu@intel.com>
+
+	* gcc.target/i386/avx-vzeroupper-4.c: Don't scan
+	avx_vzeroupper_nop.  Scan avx_vzeroupper instead of
+	*avx_vzeroupper.
+	* gcc.target/i386/avx-vzeroupper-10.c: Likewise.
+	* gcc.target/i386/avx-vzeroupper-11.c: Likewise.
+	* gcc.target/i386/avx-vzeroupper-12.c: Likewise.
+	* gcc.target/i386/avx-vzeroupper-13.c: Likewise.
+	* gcc.target/i386/avx-vzeroupper-14.c: Likewise.
+
+2010-10-19  H.J. Lu  <hongjiu.lu@intel.com>
+
+	* gcc.target/i386/avx-vzeroupper-1.c: Add -mtune=generic.
+	* gcc.target/i386/avx-vzeroupper-1.c: Likewise.
+
+	* gcc.target/i386/avx-vzeroupper-13.c: New.
+	* gcc.target/i386/avx-vzeroupper-14.c: Likewise.
+
+2010-10-12  H.J. Lu  <hongjiu.lu@intel.com>
+
+	* gcc.target/i386/avx-vzeroupper-9.c: New.
+	* gcc.target/i386/avx-vzeroupper-10.c: Likewise.
+	* gcc.target/i386/avx-vzeroupper-11.c: Likewise.
+	* gcc.target/i386/avx-vzeroupper-12.c: Likewise.
+
+2010-10-12  H.J. Lu  <hongjiu.lu@intel.com>
+
+	* gcc.target/i386/avx-vzeroupper-5.c: New.
+	* gcc.target/i386/avx-vzeroupper-6.c: Likewise.
+	* gcc.target/i386/avx-vzeroupper-7.c: Likewise.
+	* gcc.target/i386/avx-vzeroupper-8.c: Likewise.
+
+2010-10-12  H.J. Lu  <hongjiu.lu@intel.com>
+
+	* gcc.target/i386/avx-vzeroupper-4.c: New.
+
+2010-06-15  H.J. Lu  <hongjiu.lu@intel.com>
+
+	* gcc.target/i386/avx-vzeroupper-3.c: New.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-16  8:04                                                       ` H.J. Lu
@ 2011-01-24 18:00                                                         ` Richard Henderson
  2011-01-24 18:12                                                           ` H.J. Lu
  0 siblings, 1 reply; 46+ messages in thread
From: Richard Henderson @ 2011-01-24 18:00 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Jakub Jelinek, Mark Mitchell, Uros Bizjak, gcc-patches,
	Richard Guenther, deVries, Tom

On 01/15/2011 05:08 PM, H.J. Lu wrote:
> Here is the new patch.  It rechecks outgoing edges only if the exit
> state is changed.  It generates the same SPEC CPU 2K/2006
> executables as trunk.  The compile time difference is less than 0.3%.
> OK for trunk?

This is ok.  I'm looking forward to seeing this cleaned up
for 4.7 though, I must say...


r~

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2011-01-24 18:00                                                         ` Richard Henderson
@ 2011-01-24 18:12                                                           ` H.J. Lu
  0 siblings, 0 replies; 46+ messages in thread
From: H.J. Lu @ 2011-01-24 18:12 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Jakub Jelinek, Mark Mitchell, Uros Bizjak, gcc-patches,
	Richard Guenther, deVries, Tom

On Mon, Jan 24, 2011 at 9:17 AM, Richard Henderson <rth@redhat.com> wrote:
> On 01/15/2011 05:08 PM, H.J. Lu wrote:
>> Here is the new patch.  It rechecks outgoing edges only if the exit
>> state is changed.  It generates the same SPEC CPU 2K/2006
>> executables as trunk.  The compile time difference is less than 0.3%.
>> OK for trunk?
>
> This is ok.  I'm looking forward to seeing this cleaned up
> for 4.7 though, I must say...
>

I opened:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47440

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2010-11-24 19:57             ` Uros Bizjak
@ 2010-11-24 21:41               ` H.J. Lu
  0 siblings, 0 replies; 46+ messages in thread
From: H.J. Lu @ 2010-11-24 21:41 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Richard Guenther, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 971 bytes --]

On Wed, Nov 24, 2010 at 10:53 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Wed, Nov 24, 2010 at 7:41 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>
>> Here is a follow up  patch.  Fortran intrinsic may set TREE_THIS_VOLATILE
>> even if it does return. This patch removes the TREE_THIS_VOLATILE
>> optimization.  OK for trunk?
>
> Looking at the testcase, I guess it is OK. Perhaps a fortran person
> can comment from the fortran POV?
>
> Uros.
>

Fortran is OK. The problem is

if (TARGET_VZEROUPPER && !TREE_THIS_VOLATILE (cfun->decl))

Even if the caller, which is MAIN__, never returns, we should issue vzeroupper
when making a library call.  I am checking this one.

-- 
H.J.
---
gcc/

2010-11-24  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* config/i386/i386.c (ix86_expand_call): Don't check
	TREE_THIS_VOLATILE.

gcc/testsuite/

2010-11-24  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* gfortran.dg/pr46519-1.f: New.

[-- Attachment #2: gcc-pr46519-6.patch --]
[-- Type: text/plain, Size: 1955 bytes --]

gcc/

2010-11-24  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* config/i386/i386.c (ix86_expand_call): Don't check
	TREE_THIS_VOLATILE.

gcc/testsuite/

2010-11-24  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* gfortran.dg/pr46519-1.f: New.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 2a46f1a..582639c 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -21755,7 +21754,7 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx callarg1,
     }
 
   /* Add UNSPEC_CALL_NEEDS_VZEROUPPER decoration.  */
-  if (TARGET_VZEROUPPER && !TREE_THIS_VOLATILE (cfun->decl))
+  if (TARGET_VZEROUPPER)
     {
       rtx unspec;
       int avx256;
diff --git a/gcc/testsuite/gfortran.dg/pr46519-1.f b/gcc/testsuite/gfortran.dg/pr46519-1.f
new file mode 100644
index 0000000..7b1775e
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/pr46519-1.f
@@ -0,0 +1,46 @@
+! { dg-do compile { target i?86-*-* x86_64-*-* } }
+! { dg-options "-O3 -mavx -mvzeroupper -dp" }
+
+      PROGRAM MG3XDEMO 
+      INTEGER LM, NM, NV, NR, NIT
+
+
+      PARAMETER( LM=7 )
+C      PARAMETER( NIT=40 )
+      PARAMETER( NM=2+2**LM, NV=NM**3 )
+      PARAMETER( NR = (8*(NM**3+NM**2+5*NM-23+7*LM))/7 )
+C
+C
+C If commented line is used then there is no penalty
+C	COMMON /X/ U, V, R, A, C, IR, MM
+	COMMON /X/ A, C, IR, MM
+      REAL*8 A(0:3),C(0:3)
+
+      INTEGER IT, N
+      INTEGER LMI, MTIME, NTIMES
+C
+      READ *,LMI
+      READ *,NIT
+      READ *,NTIMES
+      READ *,U0
+
+      READ 9004, A
+      READ 9004, C
+9004  FORMAT (4D8.0)
+
+      DO I = 0, 3
+	A(I) = A(I)/3.0D0
+	C(I) = C(I)/64.0D0
+      ENDDO
+C
+      N  = 2 + 2**LMI
+
+      WRITE(6,7)N-2,N-2,N-2,NIT
+ 6    FORMAT( I4, 2E19.12)
+ 7    FORMAT(/,' KERNEL B:  SOLVING A POISSON PROBLEM ON A ',I6,' BY ',
+     > I6,' BY ',I6,' GRID,',/,' USING ',I6,' MULTIGRID ITERATIONS.',/)
+C
+      STOP
+      END
+
+! { dg-final { scan-assembler-times "avx_vzeroupper" 1 } }

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2010-11-24 19:53           ` H.J. Lu
@ 2010-11-24 19:57             ` Uros Bizjak
  2010-11-24 21:41               ` H.J. Lu
  0 siblings, 1 reply; 46+ messages in thread
From: Uros Bizjak @ 2010-11-24 19:57 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Richard Guenther, gcc-patches

On Wed, Nov 24, 2010 at 7:41 PM, H.J. Lu <hjl.tools@gmail.com> wrote:

> Here is a follow up  patch.  Fortran intrinsic may set TREE_THIS_VOLATILE
> even if it does return. This patch removes the TREE_THIS_VOLATILE
> optimization.  OK for trunk?

Looking at the testcase, I guess it is OK. Perhaps a fortran person
can comment from the fortran POV?

Uros.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2010-11-24 19:48         ` Uros Bizjak
@ 2010-11-24 19:53           ` H.J. Lu
  2010-11-24 19:57             ` Uros Bizjak
  0 siblings, 1 reply; 46+ messages in thread
From: H.J. Lu @ 2010-11-24 19:53 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Richard Guenther, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 3908 bytes --]

On Wed, Nov 24, 2010 at 10:18 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Sat, Nov 20, 2010 at 3:11 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>
>>>>>>>> Here is the patch for
>>>>>>>>
>>>>>>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46519
>>>>>>>>
>>>>>>>> We have 2 blocks pointing to each others. This patch first scans
>>>>>>>> all blocks without moving vzeroupper so that we can have accurate
>>>>>>>> information about upper 128bits at block entry.
>>>>>>>
>>>>>>> This introduces another insn scanning pass, almost the same as
>>>>>>> existing vzeroupper pass (modulo CALL_INSN/JUMP_INSN handling).
>>>>>>>
>>>>>>> So, if I understand correctly:
>>>>>>> - The patch removes the detection if the function ever touches AVX registers.
>>>>>>> - Due to this, all call_insn RTXes have to be decorated with
>>>>>>> CALL_NEEDS_VZEROUPPER.
>>>>>>> - A new pre-pass is required that scans all functions in order to
>>>>>>> detect functions with live AVX registers at exit, and at the same time
>>>>>>> marks the functions that *do not* use AVX registers.
>>>>>>> - Existing pass then re-scans everything to again detect functions
>>>>>>> with live AVX registers at exit and handles vzeroupper emission.
>>>>>>>
>>>>>>> I don't think this approach is acceptable. Maybe a LCM infrastructure
>>>>>>> can be used to handle this case?
>>>>>>>
>>>>>>
>>>>>> Here is the rewrite of the vzeroupper optimization pass.
>>>>>> To avoid circular dependency, it has 2 passes.  It
>>>>>> delays the circular dependency to the second pass
>>>>>> and avoid rescan as much as possible.
>>>>>>
>>>>>> I compared the bootstrap times with/without this patch
>>>>>> on 64bit Sandy Bridge with multilib and --with-fpmath=avx.
>>>>>> I enabled c,c++,fortran,java,lto,objc
>>>>>>
>>>>>> Without patch:
>>>>>>
>>>>>> 12378.70user 573.02system 41:54.21elapsed 515%CPU
>>>>>>
>>>>>> With patch
>>>>>>
>>>>>> 12580.56user 578.07system 42:25.41elapsed 516%CPU
>>>>>>
>>>>>> The overhead is about 1.6%.
>>>>>
>>>>> That's a quite big overhead for something that doesn't use FP
>>>>> math (and thus no AVX).
>>>>
>>>> AVX256 vector insns are independent of FP math.  They can be
>>>> generated by vectorizer as well as loop unroll.  We can limit
>>>> it to -O2 or -O3 if overhead is a big concern.
>>>
>>> Limiting it to -fexpensive-optimizations would be a good start.  Btw,
>>> how is code-size affected?  Does it make sense to disable it when
>>> optimizing a function for size?  As it affects performance of callees
>>> whether the caller is optimized for size or speed probably isn't the
>>> best thing to check.
>>>
>>
>> We pay penalty at SSE<->AVX transition, not exactly in callee/caller.
>> We can just check optimize_size.
>>
>> Here is the updated patch to limit vzeroupper optimization to
>> -fexpensive-optimizations and not optimizing for size.  OK for trunk?
>
> ATM, I have no other (obvious) solution to the two-pass problem, although
> I think LCM (please look at gcc/lcm.c) should be investigated to see
> whether it fits this job.

I will investigate it for 4.7.

> The patch demonstrates better generated code, so I propose to proceed
> with the patch, although IMO the non-optimal solution depends on
> TARGET_VZEROUPPER and -fexpensive-optimizations.
>
> So, since it looks like there are no other objections, the patch is OK
> for mainline.
>

Here is a follow-up patch.  A Fortran intrinsic may set TREE_THIS_VOLATILE
even if it does return. This patch removes the TREE_THIS_VOLATILE
optimization.  OK for trunk?

Thanks.

-- 
H.J.
---
gcc/

2010-11-24  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* config/i386/i386.c (ix86_expand_epilogue): Don't check
	TREE_THIS_VOLATILE.
	(ix86_expand_call): Likewise.

gcc/testsuite/

2010-11-24  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* gfortran.dg/pr46519-1.f: New.

[-- Attachment #2: gcc-pr46519-5.patch --]
[-- Type: text/plain, Size: 2271 bytes --]

gcc/

2010-11-24  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* config/i386/i386.c (ix86_expand_epilogue): Don't check
	TREE_THIS_VOLATILE.
	(ix86_expand_call): Likewise.

gcc/testsuite/

2010-11-24  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* gfortran.dg/pr46519-1.f: New.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 2a46f1a..582639c 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -11076,7 +11076,6 @@ ix86_expand_epilogue (int style)
 
   /* Emit vzeroupper if needed.  */
   if (TARGET_VZEROUPPER
-      && !TREE_THIS_VOLATILE (cfun->decl)
       && !cfun->machine->caller_return_avx256_p)
     emit_insn (gen_avx_vzeroupper (GEN_INT (call_no_avx256))); 
 
@@ -21755,7 +21754,7 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx callarg1,
     }
 
   /* Add UNSPEC_CALL_NEEDS_VZEROUPPER decoration.  */
-  if (TARGET_VZEROUPPER && !TREE_THIS_VOLATILE (cfun->decl))
+  if (TARGET_VZEROUPPER)
     {
       rtx unspec;
       int avx256;
diff --git a/gcc/testsuite/gfortran.dg/pr46519-1.f b/gcc/testsuite/gfortran.dg/pr46519-1.f
new file mode 100644
index 0000000..7b1775e
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/pr46519-1.f
@@ -0,0 +1,46 @@
+! { dg-do compile { target i?86-*-* x86_64-*-* } }
+! { dg-options "-O3 -mavx -mvzeroupper -dp" }
+
+      PROGRAM MG3XDEMO 
+      INTEGER LM, NM, NV, NR, NIT
+
+
+      PARAMETER( LM=7 )
+C      PARAMETER( NIT=40 )
+      PARAMETER( NM=2+2**LM, NV=NM**3 )
+      PARAMETER( NR = (8*(NM**3+NM**2+5*NM-23+7*LM))/7 )
+C
+C
+C If commented line is used then there is no penalty
+C	COMMON /X/ U, V, R, A, C, IR, MM
+	COMMON /X/ A, C, IR, MM
+      REAL*8 A(0:3),C(0:3)
+
+      INTEGER IT, N
+      INTEGER LMI, MTIME, NTIMES
+C
+      READ *,LMI
+      READ *,NIT
+      READ *,NTIMES
+      READ *,U0
+
+      READ 9004, A
+      READ 9004, C
+9004  FORMAT (4D8.0)
+
+      DO I = 0, 3
+	A(I) = A(I)/3.0D0
+	C(I) = C(I)/64.0D0
+      ENDDO
+C
+      N  = 2 + 2**LMI
+
+      WRITE(6,7)N-2,N-2,N-2,NIT
+ 6    FORMAT( I4, 2E19.12)
+ 7    FORMAT(/,' KERNEL B:  SOLVING A POISSON PROBLEM ON A ',I6,' BY ',
+     > I6,' BY ',I6,' GRID,',/,' USING ',I6,' MULTIGRID ITERATIONS.',/)
+C
+      STOP
+      END
+
+! { dg-final { scan-assembler-times "avx_vzeroupper" 1 } }

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2010-11-20 18:20       ` H.J. Lu
@ 2010-11-24 19:48         ` Uros Bizjak
  2010-11-24 19:53           ` H.J. Lu
  0 siblings, 1 reply; 46+ messages in thread
From: Uros Bizjak @ 2010-11-24 19:48 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Richard Guenther, gcc-patches

On Sat, Nov 20, 2010 at 3:11 PM, H.J. Lu <hjl.tools@gmail.com> wrote:

>>>>>>> Here is the patch for
>>>>>>>
>>>>>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46519
>>>>>>>
>>>>>>> We have 2 blocks pointing to each others. This patch first scans
>>>>>>> all blocks without moving vzeroupper so that we can have accurate
>>>>>>> information about upper 128bits at block entry.
>>>>>>
>>>>>> This introduces another insn scanning pass, almost the same as
>>>>>> existing vzeroupper pass (modulo CALL_INSN/JUMP_INSN handling).
>>>>>>
>>>>>> So, if I understand correctly:
>>>>>> - The patch removes the detection if the function ever touches AVX registers.
>>>>>> - Due to this, all call_insn RTXes have to be decorated with
>>>>>> CALL_NEEDS_VZEROUPPER.
>>>>>> - A new pre-pass is required that scans all functions in order to
>>>>>> detect functions with live AVX registers at exit, and at the same time
>>>>>> marks the functions that *do not* use AVX registers.
>>>>>> - Existing pass then re-scans everything to again detect functions
>>>>>> with live AVX registers at exit and handles vzeroupper emission.
>>>>>>
>>>>>> I don't think this approach is acceptable. Maybe a LCM infrastructure
>>>>>> can be used to handle this case?
>>>>>>
>>>>>
>>>>> Here is the rewrite of the vzeroupper optimization pass.
>>>>> To avoid circular dependency, it has 2 passes.  It
>>>>> delays the circular dependency to the second pass
>>>>> and avoid rescan as much as possible.
>>>>>
>>>>> I compared the bootstrap times with/without this patch
>>>>> on 64bit Sandy Bridge with multilib and --with-fpmath=avx.
>>>>> I enabled c,c++,fortran,java,lto,objc
>>>>>
>>>>> Without patch:
>>>>>
>>>>> 12378.70user 573.02system 41:54.21elapsed 515%CPU
>>>>>
>>>>> With patch
>>>>>
>>>>> 12580.56user 578.07system 42:25.41elapsed 516%CPU
>>>>>
>>>>> The overhead is about 1.6%.
>>>>
>>>> That's a quite big overhead for something that doesn't use FP
>>>> math (and thus no AVX).
>>>
>>> AVX256 vector insns are independent of FP math.  They can be
>>> generated by vectorizer as well as loop unroll.  We can limit
>>> it to -O2 or -O3 if overhead is a big concern.
>>
>> Limiting it to -fexpensive-optimizations would be a good start.  Btw,
>> how is code-size affected?  Does it make sense to disable it when
>> optimizing a function for size?  As it affects performance of callees
>> whether the caller is optimized for size or speed probably isn't the
>> best thing to check.
>>
>
> We pay penalty at SSE<->AVX transition, not exactly in callee/caller.
> We can just check optimize_size.
>
> Here is the updated patch to limit vzeroupper optimization to
> -fexpensive-optimizations and not optimizing for size.  OK for trunk?

ATM, I have no other (obvious) solution to the two-pass problem, although
I think LCM (please look at gcc/lcm.c) should be investigated to see
whether it fits this job.

The patch demonstrates better generated code, so I propose to proceed
with the patch, although IMO the non-optimal solution depends on
TARGET_VZEROUPPER and -fexpensive-optimizations.

So, since it looks like there are no other objections, the patch is OK
for mainline.

Uros.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2010-11-20 12:11     ` Richard Guenther
@ 2010-11-20 18:20       ` H.J. Lu
  2010-11-24 19:48         ` Uros Bizjak
  0 siblings, 1 reply; 46+ messages in thread
From: H.J. Lu @ 2010-11-20 18:20 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Uros Bizjak, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 5191 bytes --]

On Sat, Nov 20, 2010 at 2:53 AM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Sat, Nov 20, 2010 at 12:31 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Fri, Nov 19, 2010 at 2:48 PM, Richard Guenther
>> <richard.guenther@gmail.com> wrote:
>>> On Fri, Nov 19, 2010 at 10:30 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>> On Thu, Nov 18, 2010 at 1:11 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
>>>>> On Thu, Nov 18, 2010 at 12:36 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>>
>>>>>> Here is the patch for
>>>>>>
>>>>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46519
>>>>>>
>>>>>> We have 2 blocks pointing to each others. This patch first scans
>>>>>> all blocks without moving vzeroupper so that we can have accurate
>>>>>> information about upper 128bits at block entry.
>>>>>
>>>>> This introduces another insn scanning pass, almost the same as
>>>>> existing vzeroupper pass (modulo CALL_INSN/JUMP_INSN handling).
>>>>>
>>>>> So, if I understand correctly:
>>>>> - The patch removes the detection if the function ever touches AVX registers.
>>>>> - Due to this, all call_insn RTXes have to be decorated with
>>>>> CALL_NEEDS_VZEROUPPER.
>>>>> - A new pre-pass is required that scans all functions in order to
>>>>> detect functions with live AVX registers at exit, and at the same time
>>>>> marks the functions that *do not* use AVX registers.
>>>>> - Existing pass then re-scans everything to again detect functions
>>>>> with live AVX registers at exit and handles vzeroupper emission.
>>>>>
>>>>> I don't think this approach is acceptable. Maybe a LCM infrastructure
>>>>> can be used to handle this case?
>>>>>
>>>>
>>>> Here is the rewrite of the vzeroupper optimization pass.
>>>> To avoid circular dependency, it has 2 passes.  It
>>>> delays the circular dependency to the second pass
>>>> and avoid rescan as much as possible.
>>>>
>>>> I compared the bootstrap times with/without this patch
>>>> on 64bit Sandy Bridge with multilib and --with-fpmath=avx.
>>>> I enabled c,c++,fortran,java,lto,objc
>>>>
>>>> Without patch:
>>>>
>>>> 12378.70user 573.02system 41:54.21elapsed 515%CPU
>>>>
>>>> With patch
>>>>
>>>> 12580.56user 578.07system 42:25.41elapsed 516%CPU
>>>>
>>>> The overhead is about 1.6%.
>>>
>>> That's a quite big overhead for something that doesn't use FP
>>> math (and thus no AVX).
>>
>> AVX256 vector insns are independent of FP math.  They can be
>> generated by vectorizer as well as loop unroll.  We can limit
>> it to -O2 or -O3 if overhead is a big concern.
>
> Limiting it to -fexpensive-optimizations would be a good start.  Btw,
> how is code-size affected?  Does it make sense to disable it when
> optimizing a function for size?  As it affects performance of callees
> whether the caller is optimized for size or speed probably isn't the
> best thing to check.
>

We pay penalty at SSE<->AVX transition, not exactly in callee/caller.
We can just check optimize_size.
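
In other words, the gate amounts to roughly this (a sketch of the condition
only, not the literal hunk):

  /* Run the vzeroupper optimization only when expensive optimizations
     are enabled and we are not optimizing for size.  */
  bool enable_vzeroupper_opt = (TARGET_VZEROUPPER
                                && flag_expensive_optimizations
                                && !optimize_size);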

Here is the updated patch to limit vzeroupper optimization to
-fexpensive-optimizations and not optimizing for size.  OK for trunk?

Thanks.


-- 
H.J.
---
gcc/

2010-11-20  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* config/i386/i386.c (upper_128bits_state): New.
	(block_info_def): Remove upper_128bits_set and done.  Add state,
	referenced, count, processed and rescanned.
	(check_avx256_stores): Updated.
	(move_or_delete_vzeroupper_2): Updated. Handle deleted BB_END.
	Call note_stores only if needed.  Set referenced and count.
	(move_or_delete_vzeroupper_1): Updated.  Set rescan_vzeroupper_p.
	(rescan_move_or_delete_vzeroupper): New.
	(move_or_delete_vzeroupper): Process and rescan all basic
	blocks instead of predecessor blocks of all exit points.
	(ix86_option_override_internal): Enable vzeroupper optimization
	only for -fexpensive-optimizations and not optimizing for size.
	(use_avx256_p): Removed.
	(init_cumulative_args): Don't set use_avx256_p.
	(ix86_function_arg): Likewise.
	(ix86_expand_move): Likewise.
	(ix86_expand_vector_move_misalign): Likewise.
	(ix86_local_alignment): Likewise.
	(ix86_minimum_alignment): Likewise.
	(ix86_expand_epilogue): Don't check use_avx256_p when generating
	vzeroupper.
	(ix86_expand_call): Likewise.

	* config/i386/i386.h (machine_function): Remove use_vzeroupper_p
	and use_avx256_p.  Add rescan_vzeroupper_p.

gcc/testsuite/

2010-11-20  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* gcc.target/i386/avx-vzeroupper-10.c: Expect no avx_vzeroupper.
	* gcc.target/i386/avx-vzeroupper-11.c: Likewise.

	* gcc.target/i386/avx-vzeroupper-14.c: Replace -O0 with -O2.
	* gcc.target/i386/avx-vzeroupper-15.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-16.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-17.c: Likewise.

	* gcc.target/i386/avx-vzeroupper-20.c: New.
	* gcc.target/i386/avx-vzeroupper-21.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-22.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-23.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-24.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-25.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-26.c: Likewise.

[-- Attachment #2: gcc-pr46519-4.patch --]
[-- Type: text/plain, Size: 24296 bytes --]

gcc/

2010-11-20  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* config/i386/i386.c (upper_128bits_state): New.
	(block_info_def): Remove upper_128bits_set and done.  Add state,
	referenced, count, processed and rescanned. 
	(check_avx256_stores): Updated.
	(move_or_delete_vzeroupper_2): Updated. Handle deleted BB_END.
	Call note_stores only if needed.  Set referenced and count.
	(move_or_delete_vzeroupper_1): Updated.  Set rescan_vzeroupper_p.
	(rescan_move_or_delete_vzeroupper): New.
	(move_or_delete_vzeroupper): Process and rescan all basic
	blocks instead of predecessor blocks of all exit points.
	(ix86_option_override_internal): Enable vzeroupper optimization
	only for -fexpensive-optimizations and not optimizing for size.
	(use_avx256_p): Removed.
	(init_cumulative_args): Don't set use_avx256_p.
	(ix86_function_arg): Likewise.
	(ix86_expand_move): Likewise.
	(ix86_expand_vector_move_misalign): Likewise.
	(ix86_local_alignment): Likewise.
	(ix86_minimum_alignment): Likewise.
	(ix86_expand_epilogue): Don't check use_avx256_p when generating
	vzeroupper.
	(ix86_expand_call): Likewise.

	* config/i386/i386.h (machine_function): Remove use_vzeroupper_p
	and use_avx256_p.  Add rescan_vzeroupper_p.

gcc/testsuite/

2010-11-20  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* gcc.target/i386/avx-vzeroupper-10.c: Expect no avx_vzeroupper.
	* gcc.target/i386/avx-vzeroupper-11.c: Likewise.

	* gcc.target/i386/avx-vzeroupper-14.c: Replace -O0 with -O2.
	* gcc.target/i386/avx-vzeroupper-15.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-16.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-17.c: Likewise.

	* gcc.target/i386/avx-vzeroupper-20.c: New.
	* gcc.target/i386/avx-vzeroupper-21.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-22.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-23.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-24.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-25.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-26.c: Likewise.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index d5f097d..2d95744 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -57,12 +57,25 @@ along with GCC; see the file COPYING3.  If not see
 #include "dwarf2out.h"
 #include "sched-int.h"
 
+enum upper_128bits_state
+{
+  unknown = 0,		/* Unknown.  */
+  unused,		/* Not used or not referenced.  */
+  used			/* Used or referenced.  */
+};
+
 typedef struct block_info_def
 {
-  /* TRUE if the upper 128bits of any AVX registers are live at exit.  */
-  bool upper_128bits_set;
+  /* State of the upper 128bits of any AVX registers at exit.  */
+  enum upper_128bits_state state;
+  /* If the upper 128bits of any AVX registers are referenced.  */
+  enum upper_128bits_state referenced;
+  /* Number of vzerouppers in this block.  */
+  unsigned int count;
   /* TRUE if block has been processed.  */
-  bool done;
+  bool processed;
+  /* TRUE if block has been rescanned.  */
+  bool rescanned;
 } *block_info;
 
 #define BLOCK_INFO(B)   ((block_info) (B)->aux)
@@ -93,8 +106,9 @@ check_avx256_stores (rtx dest, const_rtx set, void *data)
 	  && REG_P (SET_SRC (set))
 	  && VALID_AVX256_REG_MODE (GET_MODE (SET_SRC (set)))))
     {
-      bool *upper_128bits_set = (bool *) data;
-      *upper_128bits_set = true;
+      enum upper_128bits_state *state
+	= (enum upper_128bits_state *) data;
+      *state = used;
     }
 }
 
@@ -106,19 +120,24 @@ check_avx256_stores (rtx dest, const_rtx set, void *data)
    are live at entry.  */
 
 static void
-move_or_delete_vzeroupper_2 (basic_block bb, bool upper_128bits_set)
+move_or_delete_vzeroupper_2 (basic_block bb,
+			     enum upper_128bits_state state)
 {
-  rtx insn;
+  rtx insn, bb_end;
   rtx vzeroupper_insn = NULL_RTX;
   rtx pat;
   int avx256;
+  enum upper_128bits_state referenced = BLOCK_INFO (bb)->referenced;
+  int count = BLOCK_INFO (bb)->count;
 
   if (dump_file)
     fprintf (dump_file, " BB [%i] entry: upper 128bits: %d\n",
-	     bb->index, upper_128bits_set);
+	     bb->index, state);
 
+  /* BB_END changes when it is deleted.  */
+  bb_end = BB_END (bb);
   insn = BB_HEAD (bb);
-  while (insn != BB_END (bb))
+  while (insn != bb_end)
     {
       insn = NEXT_INSN (insn);
 
@@ -167,67 +186,89 @@ move_or_delete_vzeroupper_2 (basic_block bb, bool upper_128bits_set)
 	      && GET_CODE (XVECEXP (pat, 0, 0)) == UNSPEC_VOLATILE
 	      && XINT (XVECEXP (pat, 0, 0), 1) == UNSPECV_VZEROALL)
 	    {
-	      upper_128bits_set = false;
+	      state = unused;
 
 	      /* Delete pending vzeroupper insertion.  */
 	      if (vzeroupper_insn)
 		{
+		  count--;
 		  delete_insn (vzeroupper_insn);
 		  vzeroupper_insn = NULL_RTX;
 		}
 	    }
-	  else if (!upper_128bits_set)
-	    note_stores (pat, check_avx256_stores, &upper_128bits_set);
+	  else if (state != used && referenced != unused)
+	    {
+	      /* No need to call note_stores if the upper 128bits of
+		 AVX registers are never referenced.  */
+	      note_stores (pat, check_avx256_stores, &state);
+	      if (state == used)
+		referenced = used;
+	    }
 	  continue;
 	}
 
       /* Process vzeroupper intrinsic.  */
+      count++;
       avx256 = INTVAL (XVECEXP (pat, 0, 0));
 
-      if (!upper_128bits_set)
+      if (state == unused)
 	{
 	  /* Since the upper 128bits are cleared, callee must not pass
 	     256bit AVX register.  We only need to check if callee
 	     returns 256bit AVX register.  */
-	  upper_128bits_set = (avx256 == callee_return_avx256);
+	  if (avx256 == callee_return_avx256)
+	    state = used;
 
-	  /* Remove unnecessary vzeroupper since
-	     upper 128bits are cleared.  */
+	  /* Remove unnecessary vzeroupper since upper 128bits are
+	     cleared.  */
 	  if (dump_file)
 	    {
 	      fprintf (dump_file, "Delete redundant vzeroupper:\n");
 	      print_rtl_single (dump_file, insn);
 	    }
+	  count--;
 	  delete_insn (insn);
 	}
-      else if (avx256 == callee_return_pass_avx256
-	       || avx256 == callee_pass_avx256)
+      else
 	{
-	  /* Callee passes 256bit AVX register.  Check if callee
-	     returns 256bit AVX register.  */
-	  upper_128bits_set = (avx256 == callee_return_pass_avx256);
+	  /* Set state to UNUSED if callee doesn't return 256bit AVX
+	     register.  */
+	  if (avx256 != callee_return_pass_avx256)
+	    state = unused;
 
-	  /* Must remove vzeroupper since
-	     callee passes in 256bit AVX register.  */
-	  if (dump_file)
+	  if (avx256 == callee_return_pass_avx256
+	      || avx256 == callee_pass_avx256)
 	    {
-	      fprintf (dump_file, "Delete callee pass vzeroupper:\n");
-	      print_rtl_single (dump_file, insn);
+	      /* Must remove vzeroupper since callee passes in 256bit
+		 AVX register.  */
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "Delete callee pass vzeroupper:\n");
+		  print_rtl_single (dump_file, insn);
+		}
+	      count--;
+	      delete_insn (insn);
 	    }
-	  delete_insn (insn);
-	}
-      else
-	{
-	  upper_128bits_set = false;
-	  vzeroupper_insn = insn;
+	  else
+	    vzeroupper_insn = insn;
 	}
     }
 
-  BLOCK_INFO (bb)->upper_128bits_set = upper_128bits_set;
+  BLOCK_INFO (bb)->state = state;
+
+  if (BLOCK_INFO (bb)->referenced == unknown)
+    {
+      /* The upper 128bits of AVX registers are never referenced if
+	 REFERENCED isn't updated.  */
+      if (referenced == unknown)
+	referenced = unused;
+      BLOCK_INFO (bb)->referenced = referenced;
+      BLOCK_INFO (bb)->count = count;
+    }
 
   if (dump_file)
     fprintf (dump_file, " BB [%i] exit: upper 128bits: %d\n",
-	     bb->index, upper_128bits_set);
+	     bb->index, state);
 }
 
 /* Helper function for move_or_delete_vzeroupper.  Process vzeroupper
@@ -238,18 +279,18 @@ move_or_delete_vzeroupper_1 (basic_block block)
 {
   edge e;
   edge_iterator ei;
-  bool upper_128bits_set;
+  enum upper_128bits_state state;
 
   if (dump_file)
     fprintf (dump_file, " Process BB [%i]: status: %d\n",
-	     block->index, BLOCK_INFO (block)->done);
+	     block->index, BLOCK_INFO (block)->processed);
 
-  if (BLOCK_INFO (block)->done)
+  if (BLOCK_INFO (block)->processed)
     return;
 
-  BLOCK_INFO (block)->done = true;
+  BLOCK_INFO (block)->processed = true;
 
-  upper_128bits_set = false;
+  state = unknown;
 
   /* Process all predecessor edges of this block.  */
   FOR_EACH_EDGE (e, ei, block->preds)
@@ -257,12 +298,70 @@ move_or_delete_vzeroupper_1 (basic_block block)
       if (e->src == block)
 	continue;
       move_or_delete_vzeroupper_1 (e->src);
-      if (BLOCK_INFO (e->src)->upper_128bits_set)
-	upper_128bits_set = true;
+      switch (BLOCK_INFO (e->src)->state)
+	{
+	case unknown:
+	  if (state == unused)
+	    state = unknown;
+	  break;
+	case used:
+	  state = used;
+	  break;
+	case unused:
+	  break;
+	}
     }
 
+  /* If the state of any predecessor edge is unknown, we need to rescan.  */
+  if (state == unknown)
+    cfun->machine->rescan_vzeroupper_p = 1;
+
   /* Process this block.  */
-  move_or_delete_vzeroupper_2 (block, upper_128bits_set);
+  move_or_delete_vzeroupper_2 (block, state);
+}
+
+/* Helper function for move_or_delete_vzeroupper.  Rescan vzeroupper
+   in BLOCK and its predecessor blocks recursively.  */
+
+static void
+rescan_move_or_delete_vzeroupper (basic_block block)
+{
+  edge e;
+  edge_iterator ei;
+  enum upper_128bits_state state;
+
+  if (dump_file)
+    fprintf (dump_file, " Rescan BB [%i]: status: %d\n",
+	     block->index, BLOCK_INFO (block)->rescanned);
+
+  if (BLOCK_INFO (block)->rescanned)
+    return;
+
+  BLOCK_INFO (block)->rescanned = true;
+
+  state = unused;
+
+  /* Rescan all predecessor edges of this block.  */
+  FOR_EACH_EDGE (e, ei, block->preds)
+    {
+      if (e->src == block)
+	continue;
+      rescan_move_or_delete_vzeroupper (e->src);
+      /* For rescan, UNKNOWN state is treated as UNUSED.  */
+      if (BLOCK_INFO (e->src)->state == used)
+	state = used;
+    }
+
+  /* Rescan this block only if there are vzerouppers or the upper
+     128bits of AVX registers are referenced.  */
+  if (BLOCK_INFO (block)->count == 0
+      && (state == used || BLOCK_INFO (block)->referenced != used))
+    {
+      if (state == used)
+	BLOCK_INFO (block)->state = state;
+    }
+  else
+    move_or_delete_vzeroupper_2 (block, state);
 }
 
 /* Go through the instruction stream looking for vzeroupper.  Delete
@@ -274,6 +373,8 @@ move_or_delete_vzeroupper (void)
 {
   edge e;
   edge_iterator ei;
+  basic_block bb;
+  unsigned int count = 0;
 
   /* Set up block info for each basic block.  */
   alloc_aux_for_blocks (sizeof (struct block_info_def));
@@ -285,16 +386,31 @@ move_or_delete_vzeroupper (void)
   FOR_EACH_EDGE (e, ei, ENTRY_BLOCK_PTR->succs)
     {
       move_or_delete_vzeroupper_2 (e->dest,
-				   cfun->machine->caller_pass_avx256_p);
-      BLOCK_INFO (e->dest)->done = true;
+				   cfun->machine->caller_pass_avx256_p
+				   ? used : unused);
+      BLOCK_INFO (e->dest)->processed = true;
+      BLOCK_INFO (e->dest)->rescanned = true;
     }
 
-  /* Process predecessor blocks of all exit points.  */
+  /* Process all basic blocks.  */
   if (dump_file)
-    fprintf (dump_file, "Process all exit points\n");
+    fprintf (dump_file, "Process all basic blocks\n");
 
-  FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR->preds)
-    move_or_delete_vzeroupper_1 (e->src);
+  FOR_EACH_BB (bb)
+    {
+      move_or_delete_vzeroupper_1 (bb);
+      count += BLOCK_INFO (bb)->count;
+    }
+
+  /* Rescan all basic blocks if needed.  */
+  if (count && cfun->machine->rescan_vzeroupper_p)
+    {
+      if (dump_file)
+	fprintf (dump_file, "Rescan all basic blocks\n");
+
+      FOR_EACH_BB (bb)
+	rescan_move_or_delete_vzeroupper (bb);
+    }
 
   free_aux_for_blocks ();
 }
@@ -4051,8 +4167,11 @@ ix86_option_override_internal (bool main_args_p)
 
   if (TARGET_AVX)
     {
-      /* Enable vzeroupper pass by default for TARGET_AVX.  */
-      if (!(target_flags_explicit & MASK_VZEROUPPER))
+      /* When not optimizing for size, enable vzeroupper optimization for
+	 TARGET_AVX with -fexpensive-optimizations.  */
+      if (!optimize_size
+	  && flag_expensive_optimizations
+	  && !(target_flags_explicit & MASK_VZEROUPPER))
 	target_flags |= MASK_VZEROUPPER;
     }
   else 
@@ -4062,17 +4181,6 @@ ix86_option_override_internal (bool main_args_p)
     }
 }
 
-/* Return TRUE if type TYPE and mode MODE use 256bit AVX modes.  */
-
-static bool
-use_avx256_p (enum machine_mode mode, const_tree type)
-{
-  return (VALID_AVX256_REG_MODE (mode)
-	  || (type
-	      && TREE_CODE (type) == VECTOR_TYPE
-	      && int_size_in_bytes (type) == 32));
-}
-
 /* Return TRUE if VAL is passed in register with 256bit AVX modes.  */
 
 static bool
@@ -5687,7 +5795,6 @@ init_cumulative_args (CUMULATIVE_ARGS *cum,  /* Argument info to initialize */
       if (function_pass_avx256_p (fnret_value))
 	{
 	  /* The return value of this function uses 256bit AVX modes.  */
-	  cfun->machine->use_avx256_p = true;
 	  if (caller)
 	    cfun->machine->callee_return_avx256_p = true;
 	  else
@@ -6956,7 +7063,6 @@ ix86_function_arg (CUMULATIVE_ARGS *cum, enum machine_mode omode,
   if (TARGET_VZEROUPPER && function_pass_avx256_p (arg))
     {
       /* This argument uses 256bit AVX modes.  */
-      cfun->machine->use_avx256_p = true;
       if (cum->caller)
 	cfun->machine->callee_pass_avx256_p = true;
       else
@@ -10970,12 +11076,9 @@ ix86_expand_epilogue (int style)
 
   /* Emit vzeroupper if needed.  */
   if (TARGET_VZEROUPPER
-      && cfun->machine->use_avx256_p
+      && !TREE_THIS_VOLATILE (cfun->decl)
       && !cfun->machine->caller_return_avx256_p)
-    {
-      cfun->machine->use_vzeroupper_p = 1;
-      emit_insn (gen_avx_vzeroupper (GEN_INT (call_no_avx256))); 
-    }
+    emit_insn (gen_avx_vzeroupper (GEN_INT (call_no_avx256))); 
 
   if (crtl->args.pops_args && crtl->args.size)
     {
@@ -15130,9 +15233,6 @@ ix86_expand_move (enum machine_mode mode, rtx operands[])
   rtx op0, op1;
   enum tls_model model;
 
-  if (VALID_AVX256_REG_MODE (mode))
-    cfun->machine->use_avx256_p = true;
-
   op0 = operands[0];
   op1 = operands[1];
 
@@ -15277,9 +15377,6 @@ ix86_expand_vector_move (enum machine_mode mode, rtx operands[])
   rtx op0 = operands[0], op1 = operands[1];
   unsigned int align = GET_MODE_ALIGNMENT (mode);
 
-  if (VALID_AVX256_REG_MODE (mode))
-    cfun->machine->use_avx256_p = true;
-
   /* Force constants other than zero into memory.  We do not know how
      the instructions used to build constants modify the upper 64 bits
      of the register, once we have that information we may be able
@@ -15386,9 +15483,6 @@ ix86_expand_vector_move_misalign (enum machine_mode mode, rtx operands[])
 {
   rtx op0, op1, m;
 
-  if (VALID_AVX256_REG_MODE (mode))
-    cfun->machine->use_avx256_p = true;
-
   op0 = operands[0];
   op1 = operands[1];
 
@@ -21661,12 +21755,11 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx callarg1,
     }
 
   /* Add UNSPEC_CALL_NEEDS_VZEROUPPER decoration.  */
-  if (TARGET_VZEROUPPER && cfun->machine->use_avx256_p)
+  if (TARGET_VZEROUPPER && !TREE_THIS_VOLATILE (cfun->decl))
     {
       rtx unspec;
       int avx256;
 
-      cfun->machine->use_vzeroupper_p = 1;
       if (cfun->machine->callee_pass_avx256_p)
 	{
 	  if (cfun->machine->callee_return_avx256_p)
@@ -22763,9 +22856,6 @@ ix86_local_alignment (tree exp, enum machine_mode mode,
       decl = NULL;
     }
 
-  if (use_avx256_p (mode, type))
-    cfun->machine->use_avx256_p = true;
-
   /* Don't do dynamic stack realignment for long long objects with
      -mpreferred-stack-boundary=2.  */
   if (!TARGET_64BIT
@@ -22872,9 +22962,6 @@ ix86_minimum_alignment (tree exp, enum machine_mode mode,
       decl = NULL;
     }
 
-  if (use_avx256_p (mode, type))
-    cfun->machine->use_avx256_p = true;
-
   if (TARGET_64BIT || align != 64 || ix86_preferred_stack_boundary >= 64)
     return align;
 
@@ -29782,7 +29869,7 @@ ix86_reorg (void)
     }
 
   /* Run the vzeroupper optimization if needed.  */
-  if (cfun->machine->use_vzeroupper_p)
+  if (TARGET_VZEROUPPER)
     move_or_delete_vzeroupper ();
 }
 
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 170ad50..6e7db03 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -2296,12 +2296,6 @@ struct GTY(()) machine_function {
      stack below the return address.  */
   BOOL_BITFIELD static_chain_on_stack : 1;
 
-  /* Nonzero if the current function uses vzeroupper.  */
-  BOOL_BITFIELD use_vzeroupper_p : 1;
-
-  /* Nonzero if the current function uses 256bit AVX regisers.  */
-  BOOL_BITFIELD use_avx256_p : 1;
-
   /* Nonzero if caller passes 256bit AVX modes.  */
   BOOL_BITFIELD caller_pass_avx256_p : 1;
 
@@ -2314,6 +2308,9 @@ struct GTY(()) machine_function {
   /* Nonzero if the current callee returns 256bit AVX modes.  */
   BOOL_BITFIELD callee_return_avx256_p : 1;
 
+  /* Nonzero if rescanning vzerouppers in the current function is needed.  */
+  BOOL_BITFIELD rescan_vzeroupper_p : 1;
+
   /* During prologue/epilogue generation, the current frame state.
      Otherwise, the frame state at the end of the prologue.  */
   struct machine_frame_state fs;
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-10.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-10.c
index 5007753..667bb17 100644
--- a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-10.c
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-10.c
@@ -14,4 +14,4 @@ foo ()
   _mm256_zeroupper ();
 }
 
-/* { dg-final { scan-assembler-times "avx_vzeroupper" 3 } } */
+/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-11.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-11.c
index 507f945..d98ceb9 100644
--- a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-11.c
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-11.c
@@ -16,4 +16,4 @@ foo ()
 }
 
 /* { dg-final { scan-assembler-times "\\*avx_vzeroall" 1 } } */
-/* { dg-final { scan-assembler-times "avx_vzeroupper" 3 } } */
+/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-14.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-14.c
index e74bc24..a31b4a2 100644
--- a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-14.c
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-14.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O0 -mavx -mtune=generic -dp" } */
+/* { dg-options "-O2 -mavx -mtune=generic -dp" } */
 
 #include <immintrin.h>
 
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-15.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-15.c
index 134a3dd..803936e 100644
--- a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-15.c
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-15.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O0 -mavx -mtune=generic -dp" } */
+/* { dg-options "-O2 -mavx -mtune=generic -dp" } */
 
 #include <immintrin.h>
 
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-16.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-16.c
index 3fb099d..ad46d35 100644
--- a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-16.c
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-16.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 /* { dg-require-effective-target lp64 } */
-/* { dg-options "-O0 -mavx -mabi=ms -mtune=generic -dp" } */
+/* { dg-options "-O2 -mavx -mabi=ms -mtune=generic -dp" } */
 
 typedef float __m256 __attribute__ ((__vector_size__ (32), __may_alias__));
 
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-17.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-17.c
index 2f3cfd2..5b5c64b 100644
--- a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-17.c
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-17.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 /* { dg-require-effective-target lp64 } */
-/* { dg-options "-O0 -mavx -mabi=ms -mtune=generic -dp" } */
+/* { dg-options "-O2 -mavx -mabi=ms -mtune=generic -dp" } */
 
 typedef float __m256 __attribute__ ((__vector_size__ (32), __may_alias__));
 
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c
new file mode 100644
index 0000000..3301083
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -mavx -mtune=generic -dp" } */
+
+extern void free (void *);
+void
+bar (void *ncstrp)
+{
+  if(ncstrp==((void *)0))
+    return;
+  free(ncstrp);
+}
+
+/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-21.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-21.c
new file mode 100644
index 0000000..6dea055
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-21.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx -mtune=generic -dp" } */
+
+extern void exit (int) __attribute__ ((__noreturn__));
+
+int
+foo (int i)
+{
+  if (i == 0)
+    exit (1);
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-22.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-22.c
new file mode 100644
index 0000000..b4e4a58
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-22.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx -mtune=generic -dp" } */
+
+extern void exit (int) __attribute__ ((__noreturn__));
+extern void bar (void);
+
+int
+foo (int i)
+{
+  if (i == 0)
+    {
+      bar ();
+      exit (1);
+    }
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-23.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-23.c
new file mode 100644
index 0000000..66df800
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-23.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx -mtune=generic -dp" } */
+
+extern void fatal (void) __attribute__ ((__noreturn__));
+extern void exit (int) __attribute__ ((__noreturn__));
+
+void
+fatal (void)
+{
+  exit (1);
+}
+
+/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-24.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-24.c
new file mode 100644
index 0000000..4fdd374
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-24.c
@@ -0,0 +1,38 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx -mtune=generic -dp" } */
+
+typedef struct bitmap_element_def {
+  struct bitmap_element_def *next;
+  unsigned int indx;
+} bitmap_element;
+typedef struct bitmap_head_def {
+  bitmap_element *first;
+  bitmap_element *current;
+  unsigned int indx;
+} bitmap_head;
+typedef struct bitmap_head_def *bitmap;
+typedef const struct bitmap_head_def *const_bitmap;
+extern void bar (void) __attribute__ ((__noreturn__));
+unsigned char
+bitmap_and_compl_into (bitmap a, const_bitmap b)
+{
+  bitmap_element *a_elt = a->first;
+  const bitmap_element *b_elt = b->first;
+  if (a == b)
+    {
+      if ((!(a)->first))
+	return 0;
+      else
+	return 1;
+    }
+  while (a_elt && b_elt)
+    {
+      if (a_elt->indx < b_elt->indx)
+	a_elt = a_elt->next;
+    }
+  if (a->indx == a->current->indx)
+    bar ();
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-25.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-25.c
new file mode 100644
index 0000000..5ef49c7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-25.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O0 -mavx -mtune=generic -dp" } */
+
+#include <immintrin.h>
+
+extern __m256 x, y;
+
+void
+foo ()
+{
+  x = y;
+}
+
+/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-26.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-26.c
new file mode 100644
index 0000000..96e9190
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-26.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-Os -mavx -mtune=generic -dp" } */
+
+#include <immintrin.h>
+
+extern __m256 x, y;
+extern void (*bar) (void);
+
+void
+foo ()
+{
+  x = y;
+  bar ();
+}
+
+/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2010-11-20  1:48   ` H.J. Lu
@ 2010-11-20 12:11     ` Richard Guenther
  2010-11-20 18:20       ` H.J. Lu
  0 siblings, 1 reply; 46+ messages in thread
From: Richard Guenther @ 2010-11-20 12:11 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Uros Bizjak, gcc-patches

On Sat, Nov 20, 2010 at 12:31 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Fri, Nov 19, 2010 at 2:48 PM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
>> On Fri, Nov 19, 2010 at 10:30 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>> On Thu, Nov 18, 2010 at 1:11 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
>>>> On Thu, Nov 18, 2010 at 12:36 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>
>>>>> Here is the patch for
>>>>>
>>>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46519
>>>>>
>>>>> We have 2 blocks pointing to each other. This patch first scans
>>>>> all blocks without moving vzeroupper so that we can have accurate
>>>>> information about upper 128bits at block entry.
>>>>
>>>> This introduces another insn scanning pass, almost the same as
>>>> existing vzeroupper pass (modulo CALL_INSN/JUMP_INSN handling).
>>>>
>>>> So, if I understand correctly:
>>>> - The patch removes the detection if the function ever touches AVX registers.
>>>> - Due to this, all call_insn RTXes have to be decorated with
>>>> CALL_NEEDS_VZEROUPPER.
>>>> - A new pre-pass is required that scans all functions in order to
>>>> detect functions with live AVX registers at exit, and at the same time
>>>> marks the functions that *do not* use AVX registers.
>>>> - Existing pass then re-scans everything to again detect functions
>>>> with live AVX registers at exit and handles vzeroupper emission.
>>>>
>>>> I don't think this approach is acceptable. Maybe a LCM infrastructure
>>>> can be used to handle this case?
>>>>
>>>
>>> Here is the rewrite of the vzeroupper optimization pass.
>>> To avoid a circular dependency, it has 2 passes.  It
>>> delays the circular dependency to the second pass
>>> and avoids rescanning as much as possible.
>>>
>>> I compared the bootstrap times with/without this patch
>>> on 64bit Sandy Bridge with multilib and --with-fpmath=avx.
>>> I enabled c,c++,fortran,java,lto,objc
>>>
>>> Without patch:
>>>
>>> 12378.70user 573.02system 41:54.21elapsed 515%CPU
>>>
>>> With patch
>>>
>>> 12580.56user 578.07system 42:25.41elapsed 516%CPU
>>>
>>> The overhead is about 1.6%.
>>
>> That's a quite big overhead for something that doesn't use FP
>> math (and thus no AVX).
>
> AVX256 vector insns are independent of FP math.  They can be
> generated by the vectorizer as well as by loop unrolling.  We can limit
> it to -O2 or -O3 if the overhead is a big concern.

Limiting it to -fexpensive-optimizations would be a good start.  Btw,
how is code-size affected?  Does it make sense to disable it when
optimizing a function for size?  As it affects performance of callees,
whether the caller is optimized for size or speed probably isn't the
best thing to check.
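
For reference, a sketch of the kind of gating this suggests (an
illustration only, not part of the quoted patch; a later revision in
this thread adds an equivalent check in ix86_option_override_internal):

/* Only enable the vzeroupper pass by default for TARGET_AVX when
   expensive optimizations are on and we are not optimizing for size;
   an explicit -m[no-]vzeroupper still takes precedence.  */
if (TARGET_AVX)
  {
    if (!optimize_size
        && flag_expensive_optimizations
        && !(target_flags_explicit & MASK_VZEROUPPER))
      target_flags |= MASK_VZEROUPPER;
  }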

Richard.

> H.J.
> ---
>> Richard.
>>
>>>
>>> --
>>> H.J.
>>> ---
>>> gcc/
>>>
>>> 2010-11-19  H.J. Lu  <hongjiu.lu@intel.com>
>>>
>>>        PR target/46519
>>>        * config/i386/i386.c (upper_128bits_state): New.
>>>        (block_info_def): Remove upper_128bits_set and done.  Add state,
>>>        referenced, count, processed and rescanned.
>>>        (check_avx256_stores): Updated.
>>>        (move_or_delete_vzeroupper_2): Updated. Handle deleted BB_END.
>>>        Call note_stores only if needed.  Set referenced and count.
>>>        (move_or_delete_vzeroupper_1): Updated.  Set rescan_vzeroupper_p.
>>>        (rescan_move_or_delete_vzeroupper): New.
>>>        (move_or_delete_vzeroupper): Process and rescan all basic
>>>        blocks instead of predecessor blocks of all exit points.
>>>        (use_avx256_p): Removed.
>>>        (init_cumulative_args): Don't set use_avx256_p.
>>>        (ix86_function_arg): Likewise.
>>>        (ix86_expand_move): Likewise.
>>>        (ix86_expand_vector_move_misalign): Likewise.
>>>        (ix86_local_alignment): Likewise.
>>>        (ix86_minimum_alignment): Likewise.
>>>        (ix86_expand_epilogue): Don't check use_avx256_p when generating
>>>        vzeroupper.
>>>        (ix86_expand_call): Likewise.
>>>
>>>        * config/i386/i386.h (machine_function): Remove use_vzeroupper_p
>>>        and use_avx256_p.  Add rescan_vzeroupper_p.
>>>
>>> gcc/testsuite/
>>>
>>> 2010-11-17  H.J. Lu  <hongjiu.lu@intel.com>
>>>
>>>        PR target/46519
>>>        * gcc.target/i386/avx-vzeroupper-10.c: Expect no avx_vzeroupper.
>>>        * gcc.target/i386/avx-vzeroupper-11.c: Likewise.
>>>
>>>        * gcc.target/i386/avx-vzeroupper-20.c: New.
>>>        * gcc.target/i386/avx-vzeroupper-21.c: Likewise.
>>>        * gcc.target/i386/avx-vzeroupper-22.c: Likewise.
>>>        * gcc.target/i386/avx-vzeroupper-23.c: Likewise.
>>>        * gcc.target/i386/avx-vzeroupper-24.c: Likewise.
>>>
>>
>
>
>
> --
> H.J.
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2010-11-20  0:24 ` Richard Guenther
@ 2010-11-20  1:48   ` H.J. Lu
  2010-11-20 12:11     ` Richard Guenther
  0 siblings, 1 reply; 46+ messages in thread
From: H.J. Lu @ 2010-11-20  1:48 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Uros Bizjak, gcc-patches

On Fri, Nov 19, 2010 at 2:48 PM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Fri, Nov 19, 2010 at 10:30 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Thu, Nov 18, 2010 at 1:11 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
>>> On Thu, Nov 18, 2010 at 12:36 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>
>>>> Here is the patch for
>>>>
>>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46519
>>>>
>>>> We have 2 blocks pointing to each other. This patch first scans
>>>> all blocks without moving vzeroupper so that we can have accurate
>>>> information about upper 128bits at block entry.
>>>
>>> This introduces another insn scanning pass, almost the same as
>>> existing vzeroupper pass (modulo CALL_INSN/JUMP_INSN handling).
>>>
>>> So, if I understand correctly:
>>> - The patch removes the detection if the function ever touches AVX registers.
>>> - Due to this, all call_insn RTXes have to be decorated with
>>> CALL_NEEDS_VZEROUPPER.
>>> - A new pre-pass is required that scans all functions in order to
>>> detect functions with live AVX registers at exit, and at the same time
>>> marks the functions that *do not* use AVX registers.
>>> - Existing pass then re-scans everything to again detect functions
>>> with live AVX registers at exit and handles vzeroupper emission.
>>>
>>> I don't think this approach is acceptable. Maybe a LCM infrastructure
>>> can be used to handle this case?
>>>
>>
>> Here is the rewrite of the vzeroupper optimization pass.
>> To avoid a circular dependency, it has 2 passes.  It
>> delays the circular dependency to the second pass
>> and avoids rescanning as much as possible.
>>
>> I compared the bootstrap times with/without this patch
>> on 64bit Sandy Bridge with multilib and --with-fpmath=avx.
>> I enabled c,c++,fortran,java,lto,objc
>>
>> Without patch:
>>
>> 12378.70user 573.02system 41:54.21elapsed 515%CPU
>>
>> With patch
>>
>> 12580.56user 578.07system 42:25.41elapsed 516%CPU
>>
>> The overhead is about 1.6%.
>
> That's a quite big overhead for something that doesn't use FP
> math (and thus no AVX).

AVX256 vector insns are independent of FP math.  They can be
generated by the vectorizer as well as by loop unrolling.  We can limit
it to -O2 or -O3 if the overhead is a big concern.
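
As a hypothetical sketch of the situation (not a testcase from this
thread; whether ymm registers are actually used depends on the cost
model and tuning):

/* Compile with "-O3 -mavx -mtune=generic".  The user writes no vector
   code at all, yet the vectorizer may pick 256bit ymm registers for
   the loop, and the call then needs a vzeroupper in front of it to
   avoid the AVX/SSE transition penalty inside bar.  */
extern void bar (void);

void
scale (float *x, int n)
{
  int i;
  for (i = 0; i < n; i++)
    x[i] = x[i] * 2.0f;
  bar ();
}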

H.J.
---
> Richard.
>
>>
>> --
>> H.J.
>> ---
>> gcc/
>>
>> 2010-11-19  H.J. Lu  <hongjiu.lu@intel.com>
>>
>>        PR target/46519
>>        * config/i386/i386.c (upper_128bits_state): New.
>>        (block_info_def): Remove upper_128bits_set and done.  Add state,
>>        referenced, count, processed and rescanned.
>>        (check_avx256_stores): Updated.
>>        (move_or_delete_vzeroupper_2): Updated. Handle deleted BB_END.
>>        Call note_stores only if needed.  Set referenced and count.
>>        (move_or_delete_vzeroupper_1): Updated.  Set rescan_vzeroupper_p.
>>        (rescan_move_or_delete_vzeroupper): New.
>>        (move_or_delete_vzeroupper): Process and rescan all basic
>>        blocks instead of predecessor blocks of all exit points.
>>        (use_avx256_p): Removed.
>>        (init_cumulative_args): Don't set use_avx256_p.
>>        (ix86_function_arg): Likewise.
>>        (ix86_expand_move): Likewise.
>>        (ix86_expand_vector_move_misalign): Likewise.
>>        (ix86_local_alignment): Likewise.
>>        (ix86_minimum_alignment): Likewise.
>>        (ix86_expand_epilogue): Don't check use_avx256_p when generating
>>        vzeroupper.
>>        (ix86_expand_call): Likewise.
>>
>>        * config/i386/i386.h (machine_function): Remove use_vzeroupper_p
>>        and use_avx256_p.  Add rescan_vzeroupper_p.
>>
>> gcc/testsuite/
>>
>> 2010-11-17  H.J. Lu  <hongjiu.lu@intel.com>
>>
>>        PR target/46519
>>        * gcc.target/i386/avx-vzeroupper-10.c: Expect no avx_vzeroupper.
>>        * gcc.target/i386/avx-vzeroupper-11.c: Likewise.
>>
>>        * gcc.target/i386/avx-vzeroupper-20.c: New.
>>        * gcc.target/i386/avx-vzeroupper-21.c: Likewise.
>>        * gcc.target/i386/avx-vzeroupper-22.c: Likewise.
>>        * gcc.target/i386/avx-vzeroupper-23.c: Likewise.
>>        * gcc.target/i386/avx-vzeroupper-24.c: Likewise.
>>
>



-- 
H.J.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2010-11-19 21:58 H.J. Lu
@ 2010-11-20  0:24 ` Richard Guenther
  2010-11-20  1:48   ` H.J. Lu
  0 siblings, 1 reply; 46+ messages in thread
From: Richard Guenther @ 2010-11-20  0:24 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Uros Bizjak, gcc-patches

On Fri, Nov 19, 2010 at 10:30 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Thu, Nov 18, 2010 at 1:11 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
>> On Thu, Nov 18, 2010 at 12:36 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>
>>> Here is the patch for
>>>
>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46519
>>>
>>> We have 2 blocks pointing to each other. This patch first scans
>>> all blocks without moving vzeroupper so that we can have accurate
>>> information about upper 128bits at block entry.
>>
>> This introduces another insn scanning pass, almost the same as
>> existing vzeroupper pass (modulo CALL_INSN/JUMP_INSN handling).
>>
>> So, if I understand correctly:
>> - The patch removes the detection if the function ever touches AVX registers.
>> - Due to this, all call_insn RTXes have to be decorated with
>> CALL_NEEDS_VZEROUPPER.
>> - A new pre-pass is required that scans all functions in order to
>> detect functions with live AVX registers at exit, and at the same time
>> marks the functions that *do not* use AVX registers.
>> - Existing pass then re-scans everything to again detect functions
>> with live AVX registers at exit and handles vzeroupper emission.
>>
>> I don't think this approach is acceptable. Maybe a LCM infrastructure
>> can be used to handle this case?
>>
>
> Here is the rewrite of the vzeroupper optimization pass.
> To avoid a circular dependency, it has 2 passes.  It
> delays the circular dependency to the second pass
> and avoids rescanning as much as possible.
>
> I compared the bootstrap times with/without this patch
> on 64bit Sandy Bridge with multilib and --with-fpmath=avx.
> I enabled c,c++,fortran,java,lto,objc
>
> Without patch:
>
> 12378.70user 573.02system 41:54.21elapsed 515%CPU
>
> With patch
>
> 12580.56user 578.07system 42:25.41elapsed 516%CPU
>
> The overhead is about 1.6%.

That's a quite big overhead for something that doesn't use FP
math (and thus no AVX).

Richard.

>
> --
> H.J.
> ---
> gcc/
>
> 2010-11-19  H.J. Lu  <hongjiu.lu@intel.com>
>
>        PR target/46519
>        * config/i386/i386.c (upper_128bits_state): New.
>        (block_info_def): Remove upper_128bits_set and done.  Add state,
>        referenced, count, processed and rescanned.
>        (check_avx256_stores): Updated.
>        (move_or_delete_vzeroupper_2): Updated. Handle deleted BB_END.
>        Call note_stores only if needed.  Set referenced and count.
>        (move_or_delete_vzeroupper_1): Updated.  Set rescan_vzeroupper_p.
>        (rescan_move_or_delete_vzeroupper): New.
>        (move_or_delete_vzeroupper): Process and rescan all basic
>        blocks instead of predecessor blocks of all exit points.
>        (use_avx256_p): Removed.
>        (init_cumulative_args): Don't set use_avx256_p.
>        (ix86_function_arg): Likewise.
>        (ix86_expand_move): Likewise.
>        (ix86_expand_vector_move_misalign): Likewise.
>        (ix86_local_alignment): Likewise.
>        (ix86_minimum_alignment): Likewise.
>        (ix86_expand_epilogue): Don't check use_avx256_p when generating
>        vzeroupper.
>        (ix86_expand_call): Likewise.
>
>        * config/i386/i386.h (machine_function): Remove use_vzeroupper_p
>        and use_avx256_p.  Add rescan_vzeroupper_p.
>
> gcc/testsuite/
>
> 2010-11-17  H.J. Lu  <hongjiu.lu@intel.com>
>
>        PR target/46519
>        * gcc.target/i386/avx-vzeroupper-10.c: Expect no avx_vzeroupper.
>        * gcc.target/i386/avx-vzeroupper-11.c: Likewise.
>
>        * gcc.target/i386/avx-vzeroupper-20.c: New.
>        * gcc.target/i386/avx-vzeroupper-21.c: Likewise.
>        * gcc.target/i386/avx-vzeroupper-22.c: Likewise.
>        * gcc.target/i386/avx-vzeroupper-23.c: Likewise.
>        * gcc.target/i386/avx-vzeroupper-24.c: Likewise.
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* PATCH: PR target/46519: Missing vzeroupper
@ 2010-11-19 21:58 H.J. Lu
  2010-11-20  0:24 ` Richard Guenther
  0 siblings, 1 reply; 46+ messages in thread
From: H.J. Lu @ 2010-11-19 21:58 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 3259 bytes --]

On Thu, Nov 18, 2010 at 1:11 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Thu, Nov 18, 2010 at 12:36 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
>
>> Here is the patch for
>>
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46519
>>
>> We have 2 blocks pointing to each other. This patch first scans
>> all blocks without moving vzeroupper so that we can have accurate
>> information about upper 128bits at block entry.
>
> This introduces another insn scanning pass, almost the same as
> existing vzeroupper pass (modulo CALL_INSN/JUMP_INSN handling).
>
> So, if I understand correctly:
> - The patch removes the detection if the function ever touches AVX registers.
> - Due to this, all call_insn RTXes have to be decorated with
> CALL_NEEDS_VZEROUPPER.
> - A new pre-pass is required that scans all functions in order to
> detect functions with live AVX registers at exit, and at the same time
> marks the functions that *do not* use AVX registers.
> - Existing pass then re-scans everything to again detect functions
> with live AVX registers at exit and handles vzeroupper emission.
>
> I don't think this approach is acceptable. Maybe a LCM infrastructure
> can be used to handle this case?
>

Here is the rewrite of the vzeroupper optimization pass.
To avoid a circular dependency, it has 2 passes.  It
delays the circular dependency to the second pass
and avoids rescanning as much as possible.
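
In outline, the driver ends up as below (simplified from the attached
patch; aux-block setup, the entry-block handling and dump output are
omitted):

static void
move_or_delete_vzeroupper (void)
{
  basic_block bb;
  unsigned int count = 0;

  /* Pass 1: visit every block once.  A predecessor that is part of a
     cycle may still have an "unknown" exit state; that sets
     rescan_vzeroupper_p.  */
  FOR_EACH_BB (bb)
    {
      move_or_delete_vzeroupper_1 (bb);
      count += BLOCK_INFO (bb)->count;
    }

  /* Pass 2: only run if some vzeroupper survived pass 1 and an unknown
     state was seen; here "unknown" is treated as "unused", which breaks
     the circular dependency between blocks.  */
  if (count && cfun->machine->rescan_vzeroupper_p)
    FOR_EACH_BB (bb)
      rescan_move_or_delete_vzeroupper (bb);
}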

I compared the bootstrap times with/without this patch
on 64bit Sandy Bridge with multilib and --with-fpmath=avx.
I enabled c,c++,fortran,java,lto,objc

Without patch:

12378.70user 573.02system 41:54.21elapsed 515%CPU

With patch

12580.56user 578.07system 42:25.41elapsed 516%CPU

The overhead is about 1.6%.


-- 
H.J.
---
gcc/

2010-11-19  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* config/i386/i386.c (upper_128bits_state): New.
	(block_info_def): Remove upper_128bits_set and done.  Add state,
	referenced, count, processed and rescanned.
	(check_avx256_stores): Updated.
	(move_or_delete_vzeroupper_2): Updated. Handle deleted BB_END.
	Call note_stores only if needed.  Set referenced and count.
	(move_or_delete_vzeroupper_1): Updated.  Set rescan_vzeroupper_p.
	(rescan_move_or_delete_vzeroupper): New.
	(move_or_delete_vzeroupper): Process and rescan all basic
	blocks instead of predecessor blocks of all exit points.
	(use_avx256_p): Removed.
	(init_cumulative_args): Don't set use_avx256_p.
	(ix86_function_arg): Likewise.
	(ix86_expand_move): Likewise.
	(ix86_expand_vector_move_misalign): Likewise.
	(ix86_local_alignment): Likewise.
	(ix86_minimum_alignment): Likewise.
	(ix86_expand_epilogue): Don't check use_avx256_p when generating
	vzeroupper.
	(ix86_expand_call): Likewise.

	* config/i386/i386.h (machine_function): Remove use_vzeroupper_p
	and use_avx256_p.  Add rescan_vzeroupper_p.

gcc/testsuite/

2010-11-17  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* gcc.target/i386/avx-vzeroupper-10.c: Expect no avx_vzeroupper.
	* gcc.target/i386/avx-vzeroupper-11.c: Likewise.

	* gcc.target/i386/avx-vzeroupper-20.c: New.
	* gcc.target/i386/avx-vzeroupper-21.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-22.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-23.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-24.c: Likewise.

[-- Attachment #2: gcc-pr46519-3.patch --]
[-- Type: text/plain, Size: 20416 bytes --]

gcc/

2010-11-19  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* config/i386/i386.c (upper_128bits_state): New.
	(block_info_def): Remove upper_128bits_set and done.  Add state,
	referenced, count, processed and rescanned. 
	(check_avx256_stores): Updated.
	(move_or_delete_vzeroupper_2): Updated. Handle deleted BB_END.
	Call note_stores only if needed.  Set referenced and count.
	(move_or_delete_vzeroupper_1): Updated.  Set rescan_vzeroupper_p.
	(rescan_move_or_delete_vzeroupper): New.
	(move_or_delete_vzeroupper): Process and rescan all basic
	blocks instead of predecessor blocks of all exit points.
	(use_avx256_p): Removed.
	(init_cumulative_args): Don't set use_avx256_p.
	(ix86_function_arg): Likewise.
	(ix86_expand_move): Likewise.
	(ix86_expand_vector_move_misalign): Likewise.
	(ix86_local_alignment): Likewise.
	(ix86_minimum_alignment): Likewise.
	(ix86_expand_epilogue): Don't check use_avx256_p when generating
	vzeroupper.
	(ix86_expand_call): Likewise.

	* config/i386/i386.h (machine_function): Remove use_vzeroupper_p
	and use_avx256_p.  Add rescan_vzeroupper_p.

gcc/testsuite/

2010-11-17  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* gcc.target/i386/avx-vzeroupper-10.c: Expect no avx_vzeroupper.
	* gcc.target/i386/avx-vzeroupper-11.c: Likewise.

	* gcc.target/i386/avx-vzeroupper-20.c: New.
	* gcc.target/i386/avx-vzeroupper-21.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-22.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-23.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-24.c: Likewise.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index d5f097d..625fa12 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -57,12 +57,25 @@ along with GCC; see the file COPYING3.  If not see
 #include "dwarf2out.h"
 #include "sched-int.h"
 
+enum upper_128bits_state
+{
+  unknown = 0,		/* Unknown.  */
+  unused,		/* Not used or not referenced.  */
+  used			/* Used or referenced.  */
+};
+
 typedef struct block_info_def
 {
-  /* TRUE if the upper 128bits of any AVX registers are live at exit.  */
-  bool upper_128bits_set;
+  /* State of the upper 128bits of any AVX registers at exit.  */
+  enum upper_128bits_state state;
+  /* If the upper 128bits of any AVX registers are referenced.  */
+  enum upper_128bits_state referenced;
+  /* Number of vzerouppers in this block.  */
+  unsigned int count;
   /* TRUE if block has been processed.  */
-  bool done;
+  bool processed;
+  /* TRUE if block has been rescanned.  */
+  bool rescanned;
 } *block_info;
 
 #define BLOCK_INFO(B)   ((block_info) (B)->aux)
@@ -93,8 +106,9 @@ check_avx256_stores (rtx dest, const_rtx set, void *data)
 	  && REG_P (SET_SRC (set))
 	  && VALID_AVX256_REG_MODE (GET_MODE (SET_SRC (set)))))
     {
-      bool *upper_128bits_set = (bool *) data;
-      *upper_128bits_set = true;
+      enum upper_128bits_state *state
+	= (enum upper_128bits_state *) data;
+      *state = used;
     }
 }
 
@@ -106,19 +120,24 @@ check_avx256_stores (rtx dest, const_rtx set, void *data)
    are live at entry.  */
 
 static void
-move_or_delete_vzeroupper_2 (basic_block bb, bool upper_128bits_set)
+move_or_delete_vzeroupper_2 (basic_block bb,
+			     enum upper_128bits_state state)
 {
-  rtx insn;
+  rtx insn, bb_end;
   rtx vzeroupper_insn = NULL_RTX;
   rtx pat;
   int avx256;
+  enum upper_128bits_state referenced = BLOCK_INFO (bb)->referenced;
+  int count = BLOCK_INFO (bb)->count;
 
   if (dump_file)
     fprintf (dump_file, " BB [%i] entry: upper 128bits: %d\n",
-	     bb->index, upper_128bits_set);
+	     bb->index, state);
 
+  /* BB_END changes when it is deleted.  */
+  bb_end = BB_END (bb);
   insn = BB_HEAD (bb);
-  while (insn != BB_END (bb))
+  while (insn != bb_end)
     {
       insn = NEXT_INSN (insn);
 
@@ -167,67 +186,89 @@ move_or_delete_vzeroupper_2 (basic_block bb, bool upper_128bits_set)
 	      && GET_CODE (XVECEXP (pat, 0, 0)) == UNSPEC_VOLATILE
 	      && XINT (XVECEXP (pat, 0, 0), 1) == UNSPECV_VZEROALL)
 	    {
-	      upper_128bits_set = false;
+	      state = unused;
 
 	      /* Delete pending vzeroupper insertion.  */
 	      if (vzeroupper_insn)
 		{
+		  count--;
 		  delete_insn (vzeroupper_insn);
 		  vzeroupper_insn = NULL_RTX;
 		}
 	    }
-	  else if (!upper_128bits_set)
-	    note_stores (pat, check_avx256_stores, &upper_128bits_set);
+	  else if (state != used && referenced != unused)
+	    {
+	      /* No need to call note_stores if the upper 128bits of
+		 AVX registers are never referenced.  */
+	      note_stores (pat, check_avx256_stores, &state);
+	      if (state == used)
+		referenced = used;
+	    }
 	  continue;
 	}
 
       /* Process vzeroupper intrinsic.  */
+      count++;
       avx256 = INTVAL (XVECEXP (pat, 0, 0));
 
-      if (!upper_128bits_set)
+      if (state == unused)
 	{
 	  /* Since the upper 128bits are cleared, callee must not pass
 	     256bit AVX register.  We only need to check if callee
 	     returns 256bit AVX register.  */
-	  upper_128bits_set = (avx256 == callee_return_avx256);
+	  if (avx256 == callee_return_avx256)
+	    state = used;
 
-	  /* Remove unnecessary vzeroupper since
-	     upper 128bits are cleared.  */
+	  /* Remove unnecessary vzeroupper since upper 128bits are
+	     cleared.  */
 	  if (dump_file)
 	    {
 	      fprintf (dump_file, "Delete redundant vzeroupper:\n");
 	      print_rtl_single (dump_file, insn);
 	    }
+	  count--;
 	  delete_insn (insn);
 	}
-      else if (avx256 == callee_return_pass_avx256
-	       || avx256 == callee_pass_avx256)
+      else
 	{
-	  /* Callee passes 256bit AVX register.  Check if callee
-	     returns 256bit AVX register.  */
-	  upper_128bits_set = (avx256 == callee_return_pass_avx256);
+	  /* Set state to UNUSED if callee doesn't return 256bit AVX
+	     register.  */
+	  if (avx256 != callee_return_pass_avx256)
+	    state = unused;
 
-	  /* Must remove vzeroupper since
-	     callee passes in 256bit AVX register.  */
-	  if (dump_file)
+	  if (avx256 == callee_return_pass_avx256
+	      || avx256 == callee_pass_avx256)
 	    {
-	      fprintf (dump_file, "Delete callee pass vzeroupper:\n");
-	      print_rtl_single (dump_file, insn);
+	      /* Must remove vzeroupper since callee passes in 256bit
+		 AVX register.  */
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "Delete callee pass vzeroupper:\n");
+		  print_rtl_single (dump_file, insn);
+		}
+	      count--;
+	      delete_insn (insn);
 	    }
-	  delete_insn (insn);
-	}
-      else
-	{
-	  upper_128bits_set = false;
-	  vzeroupper_insn = insn;
+	  else
+	    vzeroupper_insn = insn;
 	}
     }
 
-  BLOCK_INFO (bb)->upper_128bits_set = upper_128bits_set;
+  BLOCK_INFO (bb)->state = state;
+
+  if (BLOCK_INFO (bb)->referenced == unknown)
+    {
+      /* The upper 128bits of AVX registers are never referenced if
+	 REFERENCED isn't updated.  */
+      if (referenced == unknown)
+	referenced = unused;
+      BLOCK_INFO (bb)->referenced = referenced;
+      BLOCK_INFO (bb)->count = count;
+    }
 
   if (dump_file)
     fprintf (dump_file, " BB [%i] exit: upper 128bits: %d\n",
-	     bb->index, upper_128bits_set);
+	     bb->index, state);
 }
 
 /* Helper function for move_or_delete_vzeroupper.  Process vzeroupper
@@ -238,18 +279,18 @@ move_or_delete_vzeroupper_1 (basic_block block)
 {
   edge e;
   edge_iterator ei;
-  bool upper_128bits_set;
+  enum upper_128bits_state state;
 
   if (dump_file)
     fprintf (dump_file, " Process BB [%i]: status: %d\n",
-	     block->index, BLOCK_INFO (block)->done);
+	     block->index, BLOCK_INFO (block)->processed);
 
-  if (BLOCK_INFO (block)->done)
+  if (BLOCK_INFO (block)->processed)
     return;
 
-  BLOCK_INFO (block)->done = true;
+  BLOCK_INFO (block)->processed = true;
 
-  upper_128bits_set = false;
+  state = unknown;
 
   /* Process all predecessor edges of this block.  */
   FOR_EACH_EDGE (e, ei, block->preds)
@@ -257,12 +298,70 @@ move_or_delete_vzeroupper_1 (basic_block block)
       if (e->src == block)
 	continue;
       move_or_delete_vzeroupper_1 (e->src);
-      if (BLOCK_INFO (e->src)->upper_128bits_set)
-	upper_128bits_set = true;
+      switch (BLOCK_INFO (e->src)->state)
+	{
+	case unknown:
+	  if (state == unused)
+	    state = unknown;
+	  break;
+	case used:
+	  state = used;
+	  break;
+	case unused:
+	  break;
+	}
     }
 
+  /* If the state of any predecessor edge is unknown, we need to rescan.  */
+  if (state == unknown)
+    cfun->machine->rescan_vzeroupper_p = 1;
+
   /* Process this block.  */
-  move_or_delete_vzeroupper_2 (block, upper_128bits_set);
+  move_or_delete_vzeroupper_2 (block, state);
+}
+
+/* Helper function for move_or_delete_vzeroupper.  Rescan vzeroupper
+   in BLOCK and its predecessor blocks recursively.  */
+
+static void
+rescan_move_or_delete_vzeroupper (basic_block block)
+{
+  edge e;
+  edge_iterator ei;
+  enum upper_128bits_state state;
+
+  if (dump_file)
+    fprintf (dump_file, " Rescan BB [%i]: status: %d\n",
+	     block->index, BLOCK_INFO (block)->rescanned);
+
+  if (BLOCK_INFO (block)->rescanned)
+    return;
+
+  BLOCK_INFO (block)->rescanned = true;
+
+  state = unused;
+
+  /* Rescan all predecessor edges of this block.  */
+  FOR_EACH_EDGE (e, ei, block->preds)
+    {
+      if (e->src == block)
+	continue;
+      rescan_move_or_delete_vzeroupper (e->src);
+      /* For rescan, UNKNOWN state is treated as UNUSED.  */
+      if (BLOCK_INFO (e->src)->state == used)
+	state = used;
+    }
+
+  /* Rescan this block only if there are vzerouppers or the upper
+     128bits of AVX registers are referenced.  */
+  if (BLOCK_INFO (block)->count == 0
+      && (state == used || BLOCK_INFO (block)->referenced != used))
+    {
+      if (state == used)
+	BLOCK_INFO (block)->state = state;
+    }
+  else
+    move_or_delete_vzeroupper_2 (block, state);
 }
 
 /* Go through the instruction stream looking for vzeroupper.  Delete
@@ -274,6 +373,8 @@ move_or_delete_vzeroupper (void)
 {
   edge e;
   edge_iterator ei;
+  basic_block bb;
+  unsigned int count = 0;
 
   /* Set up block info for each basic block.  */
   alloc_aux_for_blocks (sizeof (struct block_info_def));
@@ -285,16 +386,31 @@ move_or_delete_vzeroupper (void)
   FOR_EACH_EDGE (e, ei, ENTRY_BLOCK_PTR->succs)
     {
       move_or_delete_vzeroupper_2 (e->dest,
-				   cfun->machine->caller_pass_avx256_p);
-      BLOCK_INFO (e->dest)->done = true;
+				   cfun->machine->caller_pass_avx256_p
+				   ? used : unused);
+      BLOCK_INFO (e->dest)->processed = true;
+      BLOCK_INFO (e->dest)->rescanned = true;
     }
 
-  /* Process predecessor blocks of all exit points.  */
+  /* Process all basic blocks.  */
   if (dump_file)
-    fprintf (dump_file, "Process all exit points\n");
+    fprintf (dump_file, "Process all basic blocks\n");
 
-  FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR->preds)
-    move_or_delete_vzeroupper_1 (e->src);
+  FOR_EACH_BB (bb)
+    {
+      move_or_delete_vzeroupper_1 (bb);
+      count += BLOCK_INFO (bb)->count;
+    }
+
+  /* Rescan all basic blocks if needed.  */
+  if (count && cfun->machine->rescan_vzeroupper_p)
+    {
+      if (dump_file)
+	fprintf (dump_file, "Rescan all basic blocks\n");
+
+      FOR_EACH_BB (bb)
+	rescan_move_or_delete_vzeroupper (bb);
+    }
 
   free_aux_for_blocks ();
 }
@@ -4062,17 +4178,6 @@ ix86_option_override_internal (bool main_args_p)
     }
 }
 
-/* Return TRUE if type TYPE and mode MODE use 256bit AVX modes.  */
-
-static bool
-use_avx256_p (enum machine_mode mode, const_tree type)
-{
-  return (VALID_AVX256_REG_MODE (mode)
-	  || (type
-	      && TREE_CODE (type) == VECTOR_TYPE
-	      && int_size_in_bytes (type) == 32));
-}
-
 /* Return TRUE if VAL is passed in register with 256bit AVX modes.  */
 
 static bool
@@ -5687,7 +5792,6 @@ init_cumulative_args (CUMULATIVE_ARGS *cum,  /* Argument info to initialize */
       if (function_pass_avx256_p (fnret_value))
 	{
 	  /* The return value of this function uses 256bit AVX modes.  */
-	  cfun->machine->use_avx256_p = true;
 	  if (caller)
 	    cfun->machine->callee_return_avx256_p = true;
 	  else
@@ -6956,7 +7060,6 @@ ix86_function_arg (CUMULATIVE_ARGS *cum, enum machine_mode omode,
   if (TARGET_VZEROUPPER && function_pass_avx256_p (arg))
     {
       /* This argument uses 256bit AVX modes.  */
-      cfun->machine->use_avx256_p = true;
       if (cum->caller)
 	cfun->machine->callee_pass_avx256_p = true;
       else
@@ -10970,12 +11073,9 @@ ix86_expand_epilogue (int style)
 
   /* Emit vzeroupper if needed.  */
   if (TARGET_VZEROUPPER
-      && cfun->machine->use_avx256_p
+      && !TREE_THIS_VOLATILE (cfun->decl)
       && !cfun->machine->caller_return_avx256_p)
-    {
-      cfun->machine->use_vzeroupper_p = 1;
-      emit_insn (gen_avx_vzeroupper (GEN_INT (call_no_avx256))); 
-    }
+    emit_insn (gen_avx_vzeroupper (GEN_INT (call_no_avx256))); 
 
   if (crtl->args.pops_args && crtl->args.size)
     {
@@ -15130,9 +15230,6 @@ ix86_expand_move (enum machine_mode mode, rtx operands[])
   rtx op0, op1;
   enum tls_model model;
 
-  if (VALID_AVX256_REG_MODE (mode))
-    cfun->machine->use_avx256_p = true;
-
   op0 = operands[0];
   op1 = operands[1];
 
@@ -15277,9 +15374,6 @@ ix86_expand_vector_move (enum machine_mode mode, rtx operands[])
   rtx op0 = operands[0], op1 = operands[1];
   unsigned int align = GET_MODE_ALIGNMENT (mode);
 
-  if (VALID_AVX256_REG_MODE (mode))
-    cfun->machine->use_avx256_p = true;
-
   /* Force constants other than zero into memory.  We do not know how
      the instructions used to build constants modify the upper 64 bits
      of the register, once we have that information we may be able
@@ -15386,9 +15480,6 @@ ix86_expand_vector_move_misalign (enum machine_mode mode, rtx operands[])
 {
   rtx op0, op1, m;
 
-  if (VALID_AVX256_REG_MODE (mode))
-    cfun->machine->use_avx256_p = true;
-
   op0 = operands[0];
   op1 = operands[1];
 
@@ -21661,12 +21752,11 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx callarg1,
     }
 
   /* Add UNSPEC_CALL_NEEDS_VZEROUPPER decoration.  */
-  if (TARGET_VZEROUPPER && cfun->machine->use_avx256_p)
+  if (TARGET_VZEROUPPER && !TREE_THIS_VOLATILE (cfun->decl))
     {
       rtx unspec;
       int avx256;
 
-      cfun->machine->use_vzeroupper_p = 1;
       if (cfun->machine->callee_pass_avx256_p)
 	{
 	  if (cfun->machine->callee_return_avx256_p)
@@ -22763,9 +22853,6 @@ ix86_local_alignment (tree exp, enum machine_mode mode,
       decl = NULL;
     }
 
-  if (use_avx256_p (mode, type))
-    cfun->machine->use_avx256_p = true;
-
   /* Don't do dynamic stack realignment for long long objects with
      -mpreferred-stack-boundary=2.  */
   if (!TARGET_64BIT
@@ -22872,9 +22959,6 @@ ix86_minimum_alignment (tree exp, enum machine_mode mode,
       decl = NULL;
     }
 
-  if (use_avx256_p (mode, type))
-    cfun->machine->use_avx256_p = true;
-
   if (TARGET_64BIT || align != 64 || ix86_preferred_stack_boundary >= 64)
     return align;
 
@@ -29782,7 +29866,7 @@ ix86_reorg (void)
     }
 
   /* Run the vzeroupper optimization if needed.  */
-  if (cfun->machine->use_vzeroupper_p)
+  if (TARGET_VZEROUPPER)
     move_or_delete_vzeroupper ();
 }
 
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 170ad50..6e7db03 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -2296,12 +2296,6 @@ struct GTY(()) machine_function {
      stack below the return address.  */
   BOOL_BITFIELD static_chain_on_stack : 1;
 
-  /* Nonzero if the current function uses vzeroupper.  */
-  BOOL_BITFIELD use_vzeroupper_p : 1;
-
-  /* Nonzero if the current function uses 256bit AVX regisers.  */
-  BOOL_BITFIELD use_avx256_p : 1;
-
   /* Nonzero if caller passes 256bit AVX modes.  */
   BOOL_BITFIELD caller_pass_avx256_p : 1;
 
@@ -2314,6 +2308,9 @@ struct GTY(()) machine_function {
   /* Nonzero if the current callee returns 256bit AVX modes.  */
   BOOL_BITFIELD callee_return_avx256_p : 1;
 
+  /* Nonzero if rescanning vzerouppers in the current function is needed.  */
+  BOOL_BITFIELD rescan_vzeroupper_p : 1;
+
   /* During prologue/epilogue generation, the current frame state.
      Otherwise, the frame state at the end of the prologue.  */
   struct machine_frame_state fs;
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-10.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-10.c
index 5007753..667bb17 100644
--- a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-10.c
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-10.c
@@ -14,4 +14,4 @@ foo ()
   _mm256_zeroupper ();
 }
 
-/* { dg-final { scan-assembler-times "avx_vzeroupper" 3 } } */
+/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-11.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-11.c
index 507f945..d98ceb9 100644
--- a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-11.c
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-11.c
@@ -16,4 +16,4 @@ foo ()
 }
 
 /* { dg-final { scan-assembler-times "\\*avx_vzeroall" 1 } } */
-/* { dg-final { scan-assembler-times "avx_vzeroupper" 3 } } */
+/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c
new file mode 100644
index 0000000..3301083
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -mavx -mtune=generic -dp" } */
+
+extern void free (void *);
+void
+bar (void *ncstrp)
+{
+  if(ncstrp==((void *)0))
+    return;
+  free(ncstrp);
+}
+
+/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-21.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-21.c
new file mode 100644
index 0000000..6dea055
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-21.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx -mtune=generic -dp" } */
+
+extern void exit (int) __attribute__ ((__noreturn__));
+
+int
+foo (int i)
+{
+  if (i == 0)
+    exit (1);
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-22.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-22.c
new file mode 100644
index 0000000..b4e4a58
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-22.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx -mtune=generic -dp" } */
+
+extern void exit (int) __attribute__ ((__noreturn__));
+extern void bar (void);
+
+int
+foo (int i)
+{
+  if (i == 0)
+    {
+      bar ();
+      exit (1);
+    }
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-23.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-23.c
new file mode 100644
index 0000000..66df800
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-23.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx -mtune=generic -dp" } */
+
+extern void fatal (void) __attribute__ ((__noreturn__));
+extern void exit (int) __attribute__ ((__noreturn__));
+
+void
+fatal (void)
+{
+  exit (1);
+}
+
+/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-24.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-24.c
new file mode 100644
index 0000000..4fdd374
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-24.c
@@ -0,0 +1,38 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx -mtune=generic -dp" } */
+
+typedef struct bitmap_element_def {
+  struct bitmap_element_def *next;
+  unsigned int indx;
+} bitmap_element;
+typedef struct bitmap_head_def {
+  bitmap_element *first;
+  bitmap_element *current;
+  unsigned int indx;
+} bitmap_head;
+typedef struct bitmap_head_def *bitmap;
+typedef const struct bitmap_head_def *const_bitmap;
+extern void bar (void) __attribute__ ((__noreturn__));
+unsigned char
+bitmap_and_compl_into (bitmap a, const_bitmap b)
+{
+  bitmap_element *a_elt = a->first;
+  const bitmap_element *b_elt = b->first;
+  if (a == b)
+    {
+      if ((!(a)->first))
+	return 0;
+      else
+	return 1;
+    }
+  while (a_elt && b_elt)
+    {
+      if (a_elt->indx < b_elt->indx)
+	a_elt = a_elt->next;
+    }
+  if (a->indx == a->current->indx)
+    bar ();
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: PATCH: PR target/46519: Missing vzeroupper
  2010-11-18  7:29 H.J. Lu
@ 2010-11-18  8:34 ` H.J. Lu
  0 siblings, 0 replies; 46+ messages in thread
From: H.J. Lu @ 2010-11-18  8:34 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: gcc-patches

On Wed, Nov 17, 2010 at 8:15 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Wed, Nov 17, 2010 at 8:11 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Wed, Nov 17, 2010 at 3:36 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>> On Wed, Nov 17, 2010 at 11:23 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
>>>> On Wed, Nov 17, 2010 at 2:33 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>> On Tue, Nov 16, 2010 at 11:51 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
>>>>>> On Wed, Nov 17, 2010 at 7:44 AM, H.J. Lu <hongjiu.lu@intel.com> wrote:
>>>>>>
>>>>>>> insn != BB_END (bb) && NEXT_INSN (insn) == NEXT_INSN (BB_END (bb))
>>>>>>>
>>>>>>> We should check NEXT_INSN (insn) != NEXT_INSN (BB_END (bb)) in
>>>>>>> move_or_delete_vzeroupper_2.  This patch does it.
>>>>>>
>>>>>> Huh? The loop does simple linear scan of all insns in the bb, so it
>>>>>> can't miss BB_END. IIUC, in your case the bb does not have BB_END
>>>>>> (bb), but it has NEXT_INSN (BB_END (bb))?
>>>>>
>>>>> It has BB_END, but it won't be visited by NEXT_INSN starting from
>>>>> BB_HEAD. insn != NEXT_INSN (BB_END (bb)) is used to check the
>>>>> end of the BB everywhere in gcc.
>>>>>
>>>>>> Can you please provide a test case that illustrates this?
>>>>>>
>>>>>
>>>>> I am enclosing a work in progress.  We noticed that we are
>>>>> missing a few vzerouppers at -O3 on SPEC CPU 2K/2006.
> >>>>> One issue is we may have
>>>>>
>>>>> foo:
>>>>>
>>>>>       call bar <<<<< Missing vzeroupper
>>>>>
>>>>>       256bit vectorized insn
>>>>>       goto foo
>>>>>
>>>>> We miss vzeroupper before call bar.  We don't have a small testcase.
>>>>> But this patch fixes this case by inspection. We are checking other
>>>>> cases.
>>>>
>>>> @@ -118,9 +118,12 @@ move_or_delete_vzeroupper_2 (basic_block bb, bool
>>>> upper_128bits_set)
>>>>             bb->index, upper_128bits_set);
>>>>
>>>>   insn = BB_HEAD (bb);
>>>> +  last = NEXT_INSN (BB_END (bb));
>>>>   while (insn != BB_END (bb))
>>>>     {
>>>>       insn = NEXT_INSN (insn);
>>>> +      if (insn == last)
>>>> +       break;
>>>>
>>>>       if (!NONDEBUG_INSN_P (insn))
>>>>        continue;
>>>>
>>>> The change above is not needed. The new check is never triggered - the
>>>> loop terminates when "insn == BB_END (bb)" at "while", so I fail to
>>>> see why additional termination for "NEXT_INSN (insn) == NEXT_INSN
>>>> (BB_END (bb))" is needed.
>>>
>>> Here is the patch for
>>>
>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46519
>>>
>>> We have 2 blocks pointing to each other. This patch first scans
>>> all blocks without moving vzeroupper so that we can have accurate
>>> information about upper 128bits at block entry.
>>>
>>>> (The BB_HEAD (bb) is either a NOTE or CODE_LABEL so it can be skipped
>>>> with NEXT_INSN.)
>>>
>>> Please try gcc.target/i386/avx-vzeroupper-20.c.  It will
>>> trigger this condition.
>>>
>>>> @@ -10970,7 +10973,7 @@ ix86_expand_epilogue (int style)
>>>>
>>>>   /* Emit vzeroupper if needed.  */
>>>>   if (TARGET_VZEROUPPER
>>>> -      && cfun->machine->use_avx256_p
>>>> +      && (cfun->machine->use_avx256_p || flag_tree_vectorize)
>>>>       && !cfun->machine->caller_return_avx256_p)
>>>>     {
>>>>       cfun->machine->use_vzeroupper_p = 1;
>>>> @@ -21661,7 +21664,8 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx callarg1,
>>>>     }
>>>>
>>>>   /* Add UNSPEC_CALL_NEEDS_VZEROUPPER decoration.  */
>>>> -  if (TARGET_VZEROUPPER && cfun->machine->use_avx256_p)
>>>> +  if (TARGET_VZEROUPPER
>>>> +      && (cfun->machine->use_avx256_p || flag_tree_vectorize))
>>>>
>>>> Decorate *ALL* calls with CALL_NEEDS_VZEROUPPER with
>>>> -ftree-vectorize?! It looks that parts (or state machine) that set
>>>> ...->use_avx256_p flag should be fixed.
>>>
>>> There are:
>>>
>>> foo:
>>>
>>>      call bar <<<<< Missing vzeroupper
>>>
>>>      256bit vectorized insn
>>>      goto foo
>>>
>>> I couldn't find a hook to set use_avx256_p before RTL expansion
>>> starts.
>>>
>>>>     {
>>>>       rtx unspec;
>>>>       int avx256;
>>>> diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c
>>>> b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c
>>>> new file mode 100644
>>>> index 0000000..3301083
>>>> --- /dev/null
>>>> +++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c
>>>> @@ -0,0 +1,13 @@
>>>> +/* { dg-do compile } */
>>>> +/* { dg-options "-O3 -mavx -mtune=generic -dp" } */
>>>> +
>>>> +extern void free (void *);
>>>> +void
>>>> +bar (void *ncstrp)
>>>> +{
>>>> +  if(ncstrp==((void *)0))
>>>> +    return;
>>>> +  free(ncstrp);
>>>> +}
>>>> +
>>>> +/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */
>>>>
>>>> Hm, this testcase doesn't go together with the above change. There is
>>>> no vectorization involved, and the scan checks that vzeroupper is NOT
>>>> emitted.
>>>>
>>>
>>> This testcase is for
>>>
>>> insn != BB_END (bb) && NEXT_INSN (insn) == NEXT_INSN (BB_END (bb))
>>>
>
> I sent the patch without comments too soon.
>
> As discussed in the PR, setting and checking use_avx256_p isn't reliable.
> This patch removes use_avx256_p.  Any comments?
>
> Thanks.
>
>
> --
> H.J.
> ---
> gcc/
>
> 2010-11-17  H.J. Lu  <hongjiu.lu@intel.com>
>
>        PR target/46519
>        * config/i386/i386.c (block_info_def): Add scanned and no_avx256.
>        (move_or_delete_vzeroupper_2): Properly check the end of basic
>        block.  Call note_stores only if no_avx256 is false.
>        (scan_live_upper_128bits_2): New.
>        (scan_live_upper_128bits_1): Likewise.
>        (move_or_delete_vzeroupper): Call scan_live_upper_128bits_1 to
>        scan predecessor blocks of all exit points.
>        (use_avx256_p): Removed.
>        (init_cumulative_args): Don't set use_avx256_p.
>        (ix86_function_arg): Likewise.
>        (ix86_expand_move): Likewise.
>        (ix86_expand_vector_move_misalign): Likewise.
>        (ix86_local_alignment): Likewise.
>        (ix86_minimum_alignment): Likewise.
>        (ix86_expand_epilogue): Don't check use_avx256_p when generating
>        vzeroupper.
>        (ix86_expand_call): Likewise.
>
>        * config/i386/i386.h (machine_function): Remove use_avx256_p.
>
> gcc/testsuite/
>
> 2010-11-17  H.J. Lu  <hongjiu.lu@intel.com>
>
>        PR target/46519
>        * gcc.target/i386/avx-vzeroupper-10.c: Expect no avx_vzeroupper.
>        * gcc.target/i386/avx-vzeroupper-11.c: Likewise.
>
>        * gcc.target/i386/avx-vzeroupper-20.c: New.
>

Small optimization.  Don't emit vzeroupper if callee doesn't return.


-- 
H.J.
---
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 7553db0..cb43620 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -21798,7 +21798,9 @@ void
 ix86_split_call_vzeroupper (rtx insn, rtx vzeroupper)
 {
   rtx call = XVECEXP (PATTERN (insn), 0, 0);
-  emit_insn (gen_avx_vzeroupper (vzeroupper));
+  /* Don't emit vzeroupper if callee doesn't return.  */
+  if (!find_reg_note (insn, REG_NORETURN, NULL))
+    emit_insn (gen_avx_vzeroupper (vzeroupper));
   emit_call_insn (call);
 }
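
A minimal sketch of the kind of caller this helps (purely illustrative; the
function, arrays and options below are my assumptions, not taken from the PR):

/* { dg-do compile } */
/* { dg-options "-O3 -mavx -mtune=generic -dp" } */

extern void abort (void) __attribute__ ((__noreturn__));
extern float a[256], b[256], c[256];

void
foo (int bad)
{
  int i;

  /* With -O3 -mavx this loop may be vectorized with 256bit AVX insns,
     leaving the upper 128bits in use when the call below is reached.  */
  for (i = 0; i < 256; i++)
    a[i] = b[i] + c[i];

  /* abort is noreturn, so its call should carry a REG_NORETURN note and,
     with the change above, no vzeroupper should be emitted before it.  */
  if (bad)
    abort ();
}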

^ permalink raw reply	[flat|nested] 46+ messages in thread

* PATCH: PR target/46519: Missing vzeroupper
@ 2010-11-18  7:29 H.J. Lu
  2010-11-18  8:34 ` H.J. Lu
  0 siblings, 1 reply; 46+ messages in thread
From: H.J. Lu @ 2010-11-18  7:29 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 6133 bytes --]

On Wed, Nov 17, 2010 at 8:11 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Wed, Nov 17, 2010 at 3:36 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Wed, Nov 17, 2010 at 11:23 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
>>> On Wed, Nov 17, 2010 at 2:33 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>> On Tue, Nov 16, 2010 at 11:51 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
>>>>> On Wed, Nov 17, 2010 at 7:44 AM, H.J. Lu <hongjiu.lu@intel.com> wrote:
>>>>>
>>>>>> insn != BB_END (bb) && NEXT_INSN (insn) == NEXT_INSN (BB_END (bb))
>>>>>>
>>>>>> We should check NEXT_INSN (insn) != NEXT_INSN (BB_END (bb)) in
>>>>>> move_or_delete_vzeroupper_2.  This patch does it.
>>>>>
>>>>> Huh? The loop does simple linear scan of all insns in the bb, so it
>>>>> can't miss BB_END. IIUC, in your case the bb does not have BB_END
>>>>> (bb), but it has NEXT_INSN (BB_END (bb))?
>>>>
>>>> It has BB_END, but it won't be visited by NEXT_INSN starting from
>>>> BB_HEAD. insn != NEXT_INSN (BB_END (bb)) is used to check the
>>>> end of the BB everywhere in gcc.
>>>>
>>>>> Can you please provide a test case that illustrates this?
>>>>>
>>>>
>>>> I am enclosing a work in progress.  We noticed that we are
>>>> missing a few vzerouppers at -O3 on SPEC CPU 2K/2006.
> >>>> One issue is we may have
>>>>
>>>> foo:
>>>>
>>>>       call bar <<<<< Missing vzeroupper
>>>>
>>>>       256bit vectorized insn
>>>>       goto foo
>>>>
>>>> We miss vzeroupper before call bar.  We don't have a small testcase.
>>>> But this patch fixes this case by inspection. We are checking other
>>>> cases.
>>>
>>> @@ -118,9 +118,12 @@ move_or_delete_vzeroupper_2 (basic_block bb, bool
>>> upper_128bits_set)
>>>             bb->index, upper_128bits_set);
>>>
>>>   insn = BB_HEAD (bb);
>>> +  last = NEXT_INSN (BB_END (bb));
>>>   while (insn != BB_END (bb))
>>>     {
>>>       insn = NEXT_INSN (insn);
>>> +      if (insn == last)
>>> +       break;
>>>
>>>       if (!NONDEBUG_INSN_P (insn))
>>>        continue;
>>>
>>> The change above is not needed. The new check is never triggered - the
>>> loop terminates when "insn == BB_END (bb)" at "while", so I fail to
>>> see why additional termination for "NEXT_INSN (insn) == NEXT_INSN
>>> (BB_END (bb))" is needed.
>>
>> Here is the patch for
>>
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46519
>>
>> We have 2 blocks pointing to each other. This patch first scans
>> all blocks without moving vzeroupper so that we can have accurate
>> information about upper 128bits at block entry.
>>
>>> (The BB_HEAD (bb) is either a NOTE or CODE_LABEL so it can be skipped
>>> with NEXT_INSN.)
>>
>> Please try gcc.target/i386/avx-vzeroupper-20.c.  It will
>> trigger this condition.
>>
>>> @@ -10970,7 +10973,7 @@ ix86_expand_epilogue (int style)
>>>
>>>   /* Emit vzeroupper if needed.  */
>>>   if (TARGET_VZEROUPPER
>>> -      && cfun->machine->use_avx256_p
>>> +      && (cfun->machine->use_avx256_p || flag_tree_vectorize)
>>>       && !cfun->machine->caller_return_avx256_p)
>>>     {
>>>       cfun->machine->use_vzeroupper_p = 1;
>>> @@ -21661,7 +21664,8 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx callarg1,
>>>     }
>>>
>>>   /* Add UNSPEC_CALL_NEEDS_VZEROUPPER decoration.  */
>>> -  if (TARGET_VZEROUPPER && cfun->machine->use_avx256_p)
>>> +  if (TARGET_VZEROUPPER
>>> +      && (cfun->machine->use_avx256_p || flag_tree_vectorize))
>>>
>>> Decorate *ALL* calls with CALL_NEEDS_VZEROUPPER with
>>> -ftree-vectorize?! It looks that parts (or state machine) that set
>>> ...->use_avx256_p flag should be fixed.
>>
>> There are:
>>
>> foo:
>>
>>      call bar <<<<< Missing vzeroupper
>>
>>      256bit vectorized insn
>>      goto foo
>>
>> I couldn't find a hook to set use_avx256_p before RTL expansion
>> starts.
>>
>>>     {
>>>       rtx unspec;
>>>       int avx256;
>>> diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c
>>> b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c
>>> new file mode 100644
>>> index 0000000..3301083
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c
>>> @@ -0,0 +1,13 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-options "-O3 -mavx -mtune=generic -dp" } */
>>> +
>>> +extern void free (void *);
>>> +void
>>> +bar (void *ncstrp)
>>> +{
>>> +  if(ncstrp==((void *)0))
>>> +    return;
>>> +  free(ncstrp);
>>> +}
>>> +
>>> +/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */
>>>
>>> Hm, this testcase doesn't go together with the above change. There is
>>> no vectorization involved, and the scan checks that vzeroupper is NOT
>>> emitted.
>>>
>>
>> This testcase is for
>>
>> insn != BB_END (bb) && NEXT_INSN (insn) == NEXT_INSN (BB_END (bb))
>>

I sent the patch without comments too soon.

As discussed in the PR, setting and checking use_avx256_p isn't reliable.
This patch removes use_avx256_p.  Any comments?
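
To make it concrete, the pattern described above looks roughly like the
hypothetical reduction below (names made up, not taken from SPEC): the call
is expanded before any 256bit AVX access has been seen, so the old
use_avx256_p flag was still clear at that point, yet the vectorized loop
leaves the upper 128bits in use and that state reaches the call again along
the back edge.

extern void bar (void);
extern float a[256], b[256];

void
foo (void)
{
  int i;

  for (;;)
    {
      /* Expanded while use_avx256_p was still false, so the call was
         never decorated and the vzeroupper before it went missing.  */
      bar ();

      /* May be vectorized with 256bit AVX insns; the upper 128bits are
         left in use on the path back to the call above.  */
      for (i = 0; i < 256; i++)
        a[i] += b[i];
    }
}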

Thanks.


-- 
H.J.
---
gcc/

2010-11-17  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* config/i386/i386.c (block_info_def): Add scanned and no_avx256.
	(move_or_delete_vzeroupper_2): Properly check the end of basic
	block.  Call note_stores only if no_avx256 is false.
	(scan_live_upper_128bits_2): New.
	(scan_live_upper_128bits_1): Likewise.
	(move_or_delete_vzeroupper): Call scan_live_upper_128bits_1 to
	scan predecessor blocks of all exit points.
	(use_avx256_p): Removed.
	(init_cumulative_args): Don't set use_avx256_p.
	(ix86_function_arg): Likewise.
	(ix86_expand_move): Likewise.
	(ix86_expand_vector_move_misalign): Likewise.
	(ix86_local_alignment): Likewise.
	(ix86_minimum_alignment): Likewise.
	(ix86_expand_epilogue): Don't check use_avx256_p when generating
	vzeroupper.
	(ix86_expand_call): Likewise.

	* config/i386/i386.h (machine_function): Remove use_avx256_p.

gcc/testsuite/

2010-11-17  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* gcc.target/i386/avx-vzeroupper-10.c: Expect no avx_vzeroupper.
	* gcc.target/i386/avx-vzeroupper-11.c: Likewise.

	* gcc.target/i386/avx-vzeroupper-20.c: New.

[-- Attachment #2: gcc-pr46519-2.patch --]
[-- Type: text/plain, Size: 12434 bytes --]

gcc/

2010-11-17  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* config/i386/i386.c (block_info_def): Add scanned and no_avx256.
	(move_or_delete_vzeroupper_2): Properly check the end of basic
	block.  Call note_stores only if no_avx256 is false.
	(scan_live_upper_128bits_2): New.
	(scan_live_upper_128bits_1): Likewise.
	(move_or_delete_vzeroupper): Call scan_live_upper_128bits_1 to
	scan predecessor blocks of all exit points.
	(use_avx256_p): Removed.
	(init_cumulative_args): Don't set use_avx256_p.
	(ix86_function_arg): Likewise.
	(ix86_expand_move): Likewise.
	(ix86_expand_vector_move_misalign): Likewise.
	(ix86_local_alignment): Likewise.
	(ix86_minimum_alignment): Likewise.
	(ix86_expand_epilogue): Don't check use_avx256_p when generating
	vzeroupper.
	(ix86_expand_call): Likewise.

	* config/i386/i386.h (machine_function): Remove use_avx256_p.

gcc/testsuite/

2010-11-17  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* gcc.target/i386/avx-vzeroupper-10.c: Expect no avx_vzeroupper.
	* gcc.target/i386/avx-vzeroupper-11.c: Likewise.

	* gcc.target/i386/avx-vzeroupper-20.c: New.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 11820cf..4b450a3 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -63,6 +63,10 @@ typedef struct block_info_def
   bool upper_128bits_set;
   /* TRUE if block has been processed.  */
   bool done;
+  /* TRUE if block has been scanned.  */
+  bool scanned;
+  /* TRUE if no 256bit AVX register is referenced in the block.  */
+  bool no_avx256;
 } *block_info;
 
 #define BLOCK_INFO(B)   ((block_info) (B)->aux)
@@ -108,19 +112,23 @@ check_avx256_stores (rtx dest, const_rtx set, void *data)
 static void
 move_or_delete_vzeroupper_2 (basic_block bb, bool upper_128bits_set)
 {
-  rtx insn;
+  rtx insn, last;
   rtx vzeroupper_insn = NULL_RTX;
   rtx pat;
   int avx256;
+  bool no_avx256 = BLOCK_INFO (bb)->no_avx256;
 
   if (dump_file)
     fprintf (dump_file, " BB [%i] entry: upper 128bits: %d\n",
 	     bb->index, upper_128bits_set);
 
   insn = BB_HEAD (bb);
+  last = NEXT_INSN (BB_END (bb));
   while (insn != BB_END (bb))
     {
       insn = NEXT_INSN (insn);
+      if (insn == last)
+	break;
 
       if (!NONDEBUG_INSN_P (insn))
 	continue;
@@ -176,7 +184,7 @@ move_or_delete_vzeroupper_2 (basic_block bb, bool upper_128bits_set)
 		  vzeroupper_insn = NULL_RTX;
 		}
 	    }
-	  else if (!upper_128bits_set)
+	  else if (!upper_128bits_set && !no_avx256)
 	    note_stores (pat, check_avx256_stores, &upper_128bits_set);
 	  continue;
 	}
@@ -191,8 +199,8 @@ move_or_delete_vzeroupper_2 (basic_block bb, bool upper_128bits_set)
 	     returns 256bit AVX register.  */
 	  upper_128bits_set = (avx256 == callee_return_avx256);
 
-	  /* Remove unnecessary vzeroupper since
-	     upper 128bits are cleared.  */
+	  /* Remove unnecessary vzeroupper since upper 128bits are
+	     cleared.  */
 	  if (dump_file)
 	    {
 	      fprintf (dump_file, "Delete redundant vzeroupper:\n");
@@ -207,8 +215,8 @@ move_or_delete_vzeroupper_2 (basic_block bb, bool upper_128bits_set)
 	     returns 256bit AVX register.  */
 	  upper_128bits_set = (avx256 == callee_return_pass_avx256);
 
-	  /* Must remove vzeroupper since
-	     callee passes in 256bit AVX register.  */
+	  /* Must remove vzeroupper since callee passes in 256bit
+	     AVX register.  */
 	  if (dump_file)
 	    {
 	      fprintf (dump_file, "Delete callee pass vzeroupper:\n");
@@ -265,6 +273,109 @@ move_or_delete_vzeroupper_1 (basic_block block)
   move_or_delete_vzeroupper_2 (block, upper_128bits_set);
 }
 
+/* Helper function for scan_live_upper_128bits_1.  Scan BB to check
+   if the upper 128bits of any AVX register are live at exit of BB.  */
+
+static void
+scan_live_upper_128bits_2 (basic_block bb, bool upper_128bits_set)
+{
+  rtx insn, pat;
+  int avx256;
+  bool no_avx256 = true;
+
+  if (dump_file)
+    fprintf (dump_file, " BB [%i] entry: upper 128bits: %d\n",
+	     bb->index, upper_128bits_set);
+
+  FOR_BB_INSNS (bb, insn)
+    if (NONJUMP_INSN_P (insn))
+      {
+	pat = PATTERN (insn);
+
+	/* Check insn for vzeroupper intrinsic.  */
+	if (GET_CODE (pat) == UNSPEC_VOLATILE
+	    && XINT (pat, 1) == UNSPECV_VZEROUPPER)
+	  {
+	    /* Process vzeroupper intrinsic.  */
+	    avx256 = INTVAL (XVECEXP (pat, 0, 0));
+	    if (!upper_128bits_set)
+	      {
+		/* Since the upper 128bits are cleared, callee must
+		   not pass 256bit AVX register.  We only need to check
+		   if callee returns 256bit AVX register.  */
+		upper_128bits_set = (avx256 == callee_return_avx256);
+	      }
+	    else if (avx256 == callee_return_pass_avx256
+		     || avx256 == callee_pass_avx256)
+	      {
+		/* Callee passes 256bit AVX register.  Check if callee
+		   returns 256bit AVX register.  */
+		upper_128bits_set = (avx256 == callee_return_pass_avx256);
+	      }
+	    else
+	      upper_128bits_set = false;
+	  }
+	else
+	  {
+	    /* Check insn for vzeroall intrinsic.  */
+	    if (GET_CODE (pat) == PARALLEL
+		&& GET_CODE (XVECEXP (pat, 0, 0)) == UNSPEC_VOLATILE
+		&& XINT (XVECEXP (pat, 0, 0), 1) == UNSPECV_VZEROALL)
+	      upper_128bits_set = false;
+	    else if (!upper_128bits_set)
+	      {
+		note_stores (pat, check_avx256_stores,
+			     &upper_128bits_set);
+		if (upper_128bits_set)
+		  no_avx256 = false;
+	      }
+	  }
+      }
+
+  BLOCK_INFO (bb)->upper_128bits_set = upper_128bits_set;
+  BLOCK_INFO (bb)->no_avx256 = no_avx256;
+
+  if (dump_file)
+    fprintf (dump_file, " BB [%i] exit: upper 128bits: %d\n",
+	     bb->index, upper_128bits_set);
+}
+
+/* Helper function for move_or_delete_vzeroupper.  Scan BLOCK and its
+   predecessor blocks recursively to check if the upper 128bits of any
+   AVX register are live at exit of BLOCK.  */
+
+static void
+scan_live_upper_128bits_1 (basic_block block)
+{
+  edge e;
+  edge_iterator ei;
+  bool upper_128bits_set;
+
+  if (dump_file)
+    fprintf (dump_file, " Scan BB [%i]: status: %d\n",
+	     block->index, BLOCK_INFO (block)->scanned);
+
+  if (BLOCK_INFO (block)->scanned)
+    return;
+
+  BLOCK_INFO (block)->scanned = true;
+
+  upper_128bits_set = false;
+
+  /* Process all predecessor edges of this block.  */
+  FOR_EACH_EDGE (e, ei, block->preds)
+    {
+      if (e->src == block)
+	continue;
+      scan_live_upper_128bits_1 (e->src);
+      if (BLOCK_INFO (e->src)->upper_128bits_set)
+	upper_128bits_set = true;
+    }
+
+  /* Scan this block.  */
+  scan_live_upper_128bits_2 (block, upper_128bits_set);
+}
+
 /* Go through the instruction stream looking for vzeroupper.  Delete
    it if upper 128bit AVX registers are unused.  If it isn't deleted,
    move it to just before a jump insn.  */
@@ -287,8 +398,16 @@ move_or_delete_vzeroupper (void)
       move_or_delete_vzeroupper_2 (e->dest,
 				   cfun->machine->caller_pass_avx256_p);
       BLOCK_INFO (e->dest)->done = true;
+      BLOCK_INFO (e->dest)->scanned = true;
     }
 
+  /* Scan predecessor blocks of all exit points.  */
+  if (dump_file)
+    fprintf (dump_file, "Scan all exit points\n");
+
+  FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR->preds)
+    scan_live_upper_128bits_1 (e->src);
+
   /* Process predecessor blocks of all exit points.  */
   if (dump_file)
     fprintf (dump_file, "Process all exit points\n");
@@ -4062,17 +4181,6 @@ ix86_option_override_internal (bool main_args_p)
     }
 }
 
-/* Return TRUE if type TYPE and mode MODE use 256bit AVX modes.  */
-
-static bool
-use_avx256_p (enum machine_mode mode, const_tree type)
-{
-  return (VALID_AVX256_REG_MODE (mode)
-	  || (type
-	      && TREE_CODE (type) == VECTOR_TYPE
-	      && int_size_in_bytes (type) == 32));
-}
-
 /* Return TRUE if VAL is passed in register with 256bit AVX modes.  */
 
 static bool
@@ -5687,7 +5795,6 @@ init_cumulative_args (CUMULATIVE_ARGS *cum,  /* Argument info to initialize */
       if (function_pass_avx256_p (fnret_value))
 	{
 	  /* The return value of this function uses 256bit AVX modes.  */
-	  cfun->machine->use_avx256_p = true;
 	  if (caller)
 	    cfun->machine->callee_return_avx256_p = true;
 	  else
@@ -6956,7 +7063,6 @@ ix86_function_arg (CUMULATIVE_ARGS *cum, enum machine_mode omode,
   if (TARGET_VZEROUPPER && function_pass_avx256_p (arg))
     {
       /* This argument uses 256bit AVX modes.  */
-      cfun->machine->use_avx256_p = true;
       if (cum->caller)
 	cfun->machine->callee_pass_avx256_p = true;
       else
@@ -10970,7 +11076,6 @@ ix86_expand_epilogue (int style)
 
   /* Emit vzeroupper if needed.  */
   if (TARGET_VZEROUPPER
-      && cfun->machine->use_avx256_p
       && !cfun->machine->caller_return_avx256_p)
     {
       cfun->machine->use_vzeroupper_p = 1;
@@ -15130,9 +15235,6 @@ ix86_expand_move (enum machine_mode mode, rtx operands[])
   rtx op0, op1;
   enum tls_model model;
 
-  if (VALID_AVX256_REG_MODE (mode))
-    cfun->machine->use_avx256_p = true;
-
   op0 = operands[0];
   op1 = operands[1];
 
@@ -15277,9 +15379,6 @@ ix86_expand_vector_move (enum machine_mode mode, rtx operands[])
   rtx op0 = operands[0], op1 = operands[1];
   unsigned int align = GET_MODE_ALIGNMENT (mode);
 
-  if (VALID_AVX256_REG_MODE (mode))
-    cfun->machine->use_avx256_p = true;
-
   /* Force constants other than zero into memory.  We do not know how
      the instructions used to build constants modify the upper 64 bits
      of the register, once we have that information we may be able
@@ -15386,9 +15485,6 @@ ix86_expand_vector_move_misalign (enum machine_mode mode, rtx operands[])
 {
   rtx op0, op1, m;
 
-  if (VALID_AVX256_REG_MODE (mode))
-    cfun->machine->use_avx256_p = true;
-
   op0 = operands[0];
   op1 = operands[1];
 
@@ -21661,7 +21757,7 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx callarg1,
     }
 
   /* Add UNSPEC_CALL_NEEDS_VZEROUPPER decoration.  */
-  if (TARGET_VZEROUPPER && cfun->machine->use_avx256_p)
+  if (TARGET_VZEROUPPER)
     {
       rtx unspec;
       int avx256;
@@ -22763,9 +22859,6 @@ ix86_local_alignment (tree exp, enum machine_mode mode,
       decl = NULL;
     }
 
-  if (use_avx256_p (mode, type))
-    cfun->machine->use_avx256_p = true;
-
   /* Don't do dynamic stack realignment for long long objects with
      -mpreferred-stack-boundary=2.  */
   if (!TARGET_64BIT
@@ -22872,9 +22965,6 @@ ix86_minimum_alignment (tree exp, enum machine_mode mode,
       decl = NULL;
     }
 
-  if (use_avx256_p (mode, type))
-    cfun->machine->use_avx256_p = true;
-
   if (TARGET_64BIT || align != 64 || ix86_preferred_stack_boundary >= 64)
     return align;
 
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 170ad50..f7c38e5 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -2299,9 +2299,6 @@ struct GTY(()) machine_function {
   /* Nonzero if the current function uses vzeroupper.  */
   BOOL_BITFIELD use_vzeroupper_p : 1;
 
-  /* Nonzero if the current function uses 256bit AVX regisers.  */
-  BOOL_BITFIELD use_avx256_p : 1;
-
   /* Nonzero if caller passes 256bit AVX modes.  */
   BOOL_BITFIELD caller_pass_avx256_p : 1;
 
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-10.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-10.c
index 5007753..667bb17 100644
--- a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-10.c
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-10.c
@@ -14,4 +14,4 @@ foo ()
   _mm256_zeroupper ();
 }
 
-/* { dg-final { scan-assembler-times "avx_vzeroupper" 3 } } */
+/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-11.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-11.c
index 507f945..d98ceb9 100644
--- a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-11.c
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-11.c
@@ -16,4 +16,4 @@ foo ()
 }
 
 /* { dg-final { scan-assembler-times "\\*avx_vzeroall" 1 } } */
-/* { dg-final { scan-assembler-times "avx_vzeroupper" 3 } } */
+/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */
diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c
new file mode 100644
index 0000000..3301083
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -mavx -mtune=generic -dp" } */
+
+extern void free (void *);
+void
+bar (void *ncstrp)
+{
+  if(ncstrp==((void *)0))
+    return;
+  free(ncstrp);
+}
+
+/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2011-01-24 17:26 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-12-17 19:45 PATCH: PR target/46519: Missing vzeroupper H.J. Lu
2010-12-18 19:36 ` Uros Bizjak
2010-12-18 20:11   ` H.J. Lu
2010-12-29 11:03     ` Uros Bizjak
2010-12-29 16:23       ` H.J. Lu
2010-12-30 12:42         ` Uros Bizjak
2011-01-01  1:05           ` Mark Mitchell
2011-01-01  1:38             ` H.J. Lu
2011-01-01  1:39               ` Mark Mitchell
2011-01-01  2:08                 ` H.J. Lu
2011-01-01  2:17                   ` Mark Mitchell
2011-01-01 16:01                     ` H.J. Lu
2011-01-04  1:15                       ` Mark Mitchell
2011-01-04  3:59                         ` H.J. Lu
2011-01-04  5:54                           ` Mark Mitchell
2011-01-04 22:17                             ` H.J. Lu
2011-01-04 23:53                               ` Mark Mitchell
2011-01-05  0:06                                 ` H.J. Lu
2011-01-05  0:08                                   ` Mark Mitchell
2011-01-05  0:09                                     ` H.J. Lu
2011-01-05  0:24                                       ` Mark Mitchell
2011-01-05 16:44                                         ` H.J. Lu
2011-01-05 17:12                                           ` Jakub Jelinek
2011-01-05 23:01                                             ` H.J. Lu
2011-01-13 17:19                                             ` H.J. Lu
2011-01-13 17:25                                               ` Mark Mitchell
2011-01-13 18:16                                               ` Richard Henderson
2011-01-13 18:51                                                 ` H.J. Lu
2011-01-14 16:06                                                   ` Richard Henderson
2011-01-14 16:08                                                     ` H.J. Lu
2011-01-16  8:04                                                       ` H.J. Lu
2011-01-24 18:00                                                         ` Richard Henderson
2011-01-24 18:12                                                           ` H.J. Lu
2011-01-13 18:04                                           ` Richard Henderson
2011-01-13 18:09                                             ` H.J. Lu
  -- strict thread matches above, loose matches on Subject: below --
2010-11-19 21:58 H.J. Lu
2010-11-20  0:24 ` Richard Guenther
2010-11-20  1:48   ` H.J. Lu
2010-11-20 12:11     ` Richard Guenther
2010-11-20 18:20       ` H.J. Lu
2010-11-24 19:48         ` Uros Bizjak
2010-11-24 19:53           ` H.J. Lu
2010-11-24 19:57             ` Uros Bizjak
2010-11-24 21:41               ` H.J. Lu
2010-11-18  7:29 H.J. Lu
2010-11-18  8:34 ` H.J. Lu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).