On Wed, Nov 24, 2010 at 10:18 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Sat, Nov 20, 2010 at 3:11 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>
>>>>>>>> Here is the patch for
>>>>>>>>
>>>>>>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46519
>>>>>>>>
>>>>>>>> We have 2 blocks pointing to each others. This patch first scans
>>>>>>>> all blocks without moving vzeroupper so that we can have accurate
>>>>>>>> information about upper 128bits at block entry.
>>>>>>>
>>>>>>> This introduces another insn scanning pass, almost the same as
>>>>>>> existing vzeroupper pass (modulo CALL_INSN/JUMP_INSN handling).
>>>>>>>
>>>>>>> So, if I understand correctly:
>>>>>>> - The patch removes the detection if the function ever touches AVX registers.
>>>>>>> - Due to this, all call_insn RTXes have to be decorated with
>>>>>>> CALL_NEEDS_VZEROUPPER.
>>>>>>> - A new pre-pass is required that scans all functions in order to
>>>>>>> detect functions with live AVX registers at exit, and at the same time
>>>>>>> marks the functions that *do not* use AVX registers.
>>>>>>> - Existing pass then re-scans everything to again detect functions
>>>>>>> with live AVX registers at exit and handles vzeroupper emission.
>>>>>>>
>>>>>>> I don't think this approach is acceptable. Maybe a LCM infrastructure
>>>>>>> can be used to handle this case?
>>>>>>>
>>>>>>
>>>>>> Here is the rewrite of the vzeroupper optimization pass.
>>>>>> To avoid circular dependency, it has 2 passes.  It
>>>>>> delays the circular dependency to the second pass
>>>>>> and avoid rescan as much as possible.
>>>>>>
>>>>>> I compared the bootstrap times with/wthout this patch
>>>>>> on 64bit Sandy Bridge with multilib and --with-fpmath=avx.
>>>>>> I enabled c,c++,fortran,java,lto,objc
>>>>>>
>>>>>> Without patch:
>>>>>>
>>>>>> 12378.70user 573.02system 41:54.21elapsed 515%CPU
>>>>>>
>>>>>> With patch
>>>>>>
>>>>>> 12580.56user 578.07system 42:25.41elapsed 516%CPU
>>>>>>
>>>>>> The overhead is about 1.6%.
>>>>>
>>>>> That's a quite big overhead for something that doesn't use FP
>>>>> math (and thus no AVX).
>>>>
>>>> AVX256 vector insns are independent of FP math.  They can be
>>>> generated by vectorizer as well as loop unroll.  We can limit
>>>> it to -O2 or -O3 if overhead is a big concern.
>>>
>>> Limiting it to -fexpensive-optimizations would be a good start.  Btw,
>>> how is code-size affected?  Does it make sense to disable it when
>>> optimizing a function for size?  As it affects performance of callees
>>> whether the caller is optimized for size or speed probably isn't the
>>> best thing to check.
>>>
>>
>> We pay penalty at SSE<->AVX transition, not exactly in callee/caller.
>> We can just check optimize_size.
>>
>> Here is the updated patch to limit vzeroupper optimization to
>> -fexpensive-optimizations and not optimizing for size.  OK for trunk?
>
> ATM, I have no other (obvious) solution to two-pass problem, although
> I think LCM (please look at gcc/lcm.c) should be investigated if it
> fits this job.

I will investigate it for 4.7.

> The patch demonstrates better generated code, so I propose to proceed
> with the patch. Although IMO non-optimal solution depends on
> TARGET_VZEROUPPER and -fexpensive-optimizations.
>
> So, since it looks that there are no other objections, the patch is OK
> for mainline.
>

Here is a follow up  patch.  Fortran intrinsic may set TREE_THIS_VOLATILE
even if it does return. This patch removes the TREE_THIS_VOLATILE
optimization.  OK for trunk?

Thanks.

-- 
H.J.
---
gcc/

2010-11-24  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* config/i386/i386.c (ix86_expand_epilogue): Don't check
	TREE_THIS_VOLATILE.
	(ix86_expand_call): Likewise.

gcc/testsuite/

2010-11-24  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* gfortran.dg/pr46519-1.f: New.