On Wed, Nov 24, 2010 at 10:18 AM, Uros Bizjak wrote:
> On Sat, Nov 20, 2010 at 3:11 PM, H.J. Lu wrote:
>
>>>>>>>> Here is the patch for
>>>>>>>>
>>>>>>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46519
>>>>>>>>
>>>>>>>> We have 2 blocks pointing to each others. This patch first scans
>>>>>>>> all blocks without moving vzeroupper so that we can have accurate
>>>>>>>> information about upper 128bits at block entry.
>>>>>>>
>>>>>>> This introduces another insn scanning pass, almost the same as
>>>>>>> existing vzeroupper pass (modulo CALL_INSN/JUMP_INSN handling).
>>>>>>>
>>>>>>> So, if I understand correctly:
>>>>>>> - The patch removes the detection if the function ever touches AVX registers.
>>>>>>> - Due to this, all call_insn RTXes have to be decorated with
>>>>>>> CALL_NEEDS_VZEROUPPER.
>>>>>>> - A new pre-pass is required that scans all functions in order to
>>>>>>> detect functions with live AVX registers at exit, and at the same time
>>>>>>> marks the functions that *do not* use AVX registers.
>>>>>>> - Existing pass then re-scans everything to again detect functions
>>>>>>> with live AVX registers at exit and handles vzeroupper emission.
>>>>>>>
>>>>>>> I don't think this approach is acceptable. Maybe a LCM infrastructure
>>>>>>> can be used to handle this case?
>>>>>>>
>>>>>>
>>>>>> Here is the rewrite of the vzeroupper optimization pass.
>>>>>> To avoid circular dependency, it has 2 passes.  It
>>>>>> delays the circular dependency to the second pass
>>>>>> and avoid rescan as much as possible.
>>>>>>
>>>>>> I compared the bootstrap times with/without this patch
>>>>>> on 64bit Sandy Bridge with multilib and --with-fpmath=avx.
>>>>>> I enabled c,c++,fortran,java,lto,objc
>>>>>>
>>>>>> Without patch:
>>>>>>
>>>>>> 12378.70user 573.02system 41:54.21elapsed 515%CPU
>>>>>>
>>>>>> With patch
>>>>>>
>>>>>> 12580.56user 578.07system 42:25.41elapsed 516%CPU
>>>>>>
>>>>>> The overhead is about 1.6%.
>>>>>
>>>>> That's a quite big overhead for something that doesn't use FP
>>>>> math (and thus no AVX).
>>>>
>>>> AVX256 vector insns are independent of FP math.  They can be
>>>> generated by vectorizer as well as loop unroll.  We can limit
>>>> it to -O2 or -O3 if overhead is a big concern.
>>>
>>> Limiting it to -fexpensive-optimizations would be a good start.  Btw,
>>> how is code-size affected?  Does it make sense to disable it when
>>> optimizing a function for size?  As it affects performance of callees
>>> whether the caller is optimized for size or speed probably isn't the
>>> best thing to check.
>>>
>>
>> We pay penalty at SSE<->AVX transition, not exactly in callee/caller.
>> We can just check optimize_size.
>>
>> Here is the updated patch to limit vzeroupper optimization to
>> -fexpensive-optimizations and not optimizing for size.  OK for trunk?
>
> ATM, I have no other (obvious) solution to two-pass problem, although
> I think LCM (please look at gcc/lcm.c) should be investigated if it
> fits this job. I will investigate it for 4.7.
> The patch demonstrates better generated code, so I propose to proceed
> with the patch. Although IMO non-optimal solution depends on
> TARGET_VZEROUPPER and -fexpensive-optimizations.
>
> So, since it looks that there are no other objections, the patch is OK
> for mainline.
>

Here is a follow-up patch.  A Fortran intrinsic may have
TREE_THIS_VOLATILE set even if it does return.  This patch removes
the optimization based on TREE_THIS_VOLATILE.  OK for trunk?

Thanks.

-- 
H.J.
---
gcc/

2010-11-24  H.J. Lu

	PR target/46519
	* config/i386/i386.c (ix86_expand_epilogue): Don't check
	TREE_THIS_VOLATILE.
	(ix86_expand_call): Likewise.

gcc/testsuite/

2010-11-24  H.J. Lu

	PR target/46519
	* gfortran.dg/pr46519-1.f: New.
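
For readers following the two-pass discussion quoted above, here is a
minimal sketch in C of the general idea: scan each block's insns once
to get its local effect on the upper 128 bits, then propagate entry
states over the edges until nothing changes, so that two blocks
pointing to each other are resolved without rescanning insns.  This is
not the code from the patch.  The types, the function names
(scan_blocks, propagate, demo_cycle) and the one-character insn
encoding are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PREDS 4

/* Net effect of one block on the upper 128 bits of the AVX registers.
   Hypothetical names, not taken from i386.c.  */
enum effect { PASS_THROUGH, SETS_USED, SETS_UNUSED };

struct block {
  const char *insns;     /* 'A' = 256-bit AVX insn, 'Z' = vzeroupper,
                            anything else = neutral (toy encoding) */
  int n_preds;
  int preds[MAX_PREDS];  /* indices of predecessor blocks */
  enum effect effect;    /* filled in by pass 1 */
  bool entry_used;       /* filled in by pass 2 */
  bool exit_used;        /* filled in by pass 2 */
};

/* Pass 1: scan each block's insns exactly once.  The last 'A' or 'Z'
   decides the block's effect; a block with neither passes its entry
   state through unchanged.  */
static void scan_blocks(struct block *b, int n)
{
  for (int i = 0; i < n; i++) {
    b[i].effect = PASS_THROUGH;
    for (const char *p = b[i].insns; *p; p++) {
      if (*p == 'A')
        b[i].effect = SETS_USED;
      else if (*p == 'Z')
        b[i].effect = SETS_UNUSED;
    }
    b[i].entry_used = false;
    b[i].exit_used = (b[i].effect == SETS_USED);
  }
}

/* Pass 2: propagate over edges until a fixed point is reached.  Only
   the per-block summaries are consulted, never the insns.  States only
   ever change from "unused" to "used", so the loop terminates.  */
static void propagate(struct block *b, int n)
{
  bool changed = true;
  while (changed) {
    changed = false;
    for (int i = 0; i < n; i++) {
      bool in = false;
      assert(b[i].n_preds <= MAX_PREDS);
      for (int k = 0; k < b[i].n_preds; k++)
        in = in || b[b[i].preds[k]].exit_used;
      bool out = b[i].effect == SETS_USED ? true
               : b[i].effect == SETS_UNUSED ? false : in;
      if (in != b[i].entry_used || out != b[i].exit_used) {
        b[i].entry_used = in;
        b[i].exit_used = out;
        changed = true;
      }
    }
  }
}

/* Toy CFG: 0 -> 1 <-> 2 -> 3 -> 4.  Blocks 1 and 2 point to each
   other, block 3 ends with a vzeroupper.  Returns a bitmask of the
   blocks whose upper 128 bits may be in use at entry.  */
static unsigned demo_cycle(const char *entry_insns)
{
  struct block b[5] = {
    { .insns = entry_insns, .n_preds = 0 },
    { .insns = "n",  .n_preds = 2, .preds = { 0, 2 } },
    { .insns = "n",  .n_preds = 1, .preds = { 1 } },
    { .insns = "nZ", .n_preds = 1, .preds = { 2 } },
    { .insns = "n",  .n_preds = 1, .preds = { 3 } },
  };
  unsigned mask = 0;

  scan_blocks(b, 5);
  propagate(b, 5);
  for (int i = 0; i < 5; i++)
    if (b[i].entry_used)
      mask |= 1u << i;
  return mask;
}
```

In this toy model, demo_cycle("nA") marks blocks 1, 2 and 3 as "used
at entry", while demo_cycle("n") marks none.  The sketch assumes a
block with no predecessors starts with the upper halves unused.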