From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-446135-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 36993 invoked by alias); 13 Jan 2017 18:48:56 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 36426 invoked by uid 89); 13 Jan 2017 18:48:56 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-5.1 required=5.0 tests=BAYES_00,RP_MATCHES_RCVD,SPF_HELO_PASS autolearn=ham version=3.3.2 spammy=Wont, Won't, our
X-HELO: mx1.redhat.com
Received: from mx1.redhat.com (HELO mx1.redhat.com) (209.132.183.28) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Fri, 13 Jan 2017 18:48:45 +0000
Received: from int-mx11.intmail.prod.int.phx2.redhat.com (int-mx11.intmail.prod.int.phx2.redhat.com [10.5.11.24])	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))	(No client certificate requested)	by mx1.redhat.com (Postfix) with ESMTPS id 7BA4E445E0;	Fri, 13 Jan 2017 18:48:45 +0000 (UTC)
Received: from abulafia.quesejoda.com (ovpn-116-205.ams2.redhat.com [10.36.116.205])	by int-mx11.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id v0DImhY3009811;	Fri, 13 Jan 2017 13:48:44 -0500
Subject: Re: [PR tree-optimization/71691] Fix unswitching in presence of maybe-undef SSA_NAMEs (take 2)
To: Richard Biener <richard.guenther@gmail.com>
References: <5e79620e-8adb-7768-a34e-32fd286995f0@redhat.com> <CAFiYyc1C9mvUbZCmg1kEsBZveowVQTL7p3DOU-KhQyMXP7uAhg@mail.gmail.com> <b8393005-a73c-26e4-b76e-04ee62608775@redhat.com> <CAFiYyc0jJDV3VXE+nX5U=wujd_qpCG87dW9qoYnbxDX4zZ+faw@mail.gmail.com> <1e426a06-270b-07fe-0e95-ef3382b614c2@redhat.com> <CAFiYyc2vY-gBo_VDGC=VMYWanyhC_+gRYRGOzcV_fwMVUK40LQ@mail.gmail.com>
Cc: gcc-patches <gcc-patches@gcc.gnu.org>, Jeff Law <law@redhat.com>
From: Aldy Hernandez <aldyh@redhat.com>
Message-ID: <0ff3a68e-3e2c-c0df-8c48-b02dac40d621@redhat.com>
Date: Fri, 13 Jan 2017 18:48:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.6.0
MIME-Version: 1.0
In-Reply-To: <CAFiYyc2vY-gBo_VDGC=VMYWanyhC_+gRYRGOzcV_fwMVUK40LQ@mail.gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-SW-Source: 2017-01/txt/msg00989.txt.bz2

[Sorry for the delay, I was sick.]

On 01/09/2017 04:30 AM, Richard Biener wrote:
> On Sat, Jan 7, 2017 at 1:54 PM, Aldy Hernandez <aldyh@redhat.com> wrote:
>> On 01/04/2017 07:11 AM, Richard Biener wrote:
>>>
>>> On Tue, Jan 3, 2017 at 6:36 PM, Aldy Hernandez <aldyh@redhat.com> wrote:
>>>>
>>>> On 12/20/2016 09:16 AM, Richard Biener wrote:
>>>>>
>>>>>
>>>>> On Fri, Dec 16, 2016 at 3:41 PM, Aldy Hernandez <aldyh@redhat.com>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Hi folks.
>>>>>>
>>>>>> This is a follow-up on Jeff and Richi's interaction on the
>>>>>> aforementioned
>>>>>> PR
>>>>>> here:
>>>>>>
>>>>>>         https://gcc.gnu.org/ml/gcc-patches/2016-08/msg01397.html
>>>>>>
>>>>>> I decided to explore the idea of analyzing may-undefness on-demand,
>>>>>> which
>>>>>> actually looks rather cheap.
>>>>>>
>>>>>> BTW, I don't understand why we don't have auto_bitmap's, as we already
>>>>>> have
>>>>>> auto_sbitmap's.  I've implemented the former based on auto_sbitmap's
>>>>>> code
>>>>>> we
>>>>>> already have.
>>>>>>
>>>>>> The attached patch fixes the bug without introducing any regressions.
>>>>>>
>>>>>> I also tested the patch by compiling 242 .ii files with -O3.  These
>>>>>> were
>>>>>> gathered from a stage1 build with -save-temps.  There is a slight time
>>>>>> degradation of 4 seconds within 27 minutes of user time:
>>>>>>
>>>>>>         tainted:        26:52
>>>>>>         orig:           26:48
>>>>>>
>>>>>> This was the average aggregate time of two runs compiling all 242 .ii
>>>>>> files.
>>>>>> IMO, this looks reasonable.  It is after all, -O3.    Is it acceptable?
>>>>>
>>>>>
>>>>>
>>>>> +  while (!worklist.is_empty ())
>>>>> +    {
>>>>> +      name = worklist.pop ();
>>>>> +      gcc_assert (TREE_CODE (name) == SSA_NAME);
>>>>> +
>>>>> +      if (ssa_undefined_value_p (name, true))
>>>>> +       return true;
>>>>> +
>>>>> +      bitmap_set_bit (visited_ssa, SSA_NAME_VERSION (name));
>>>>>
>>>>> it should be already set as we use visited_ssa as "was it ever on the
>>>>> worklist",
>>>>> so maybe renaming it would be a good thing as well.
>>>>
>>>>
>>>>
>>>> I don't understand what you would prefer here.
>>>
>>>
>>> Set the bit when you put name on the worklist (and do not do that if the
>>> bit is set).  Thus simply remove the above and add a bitmap_set_bit
>>> for the initial name you put on the worklist.
>>>
>>>>>
>>>>> +             if (TREE_CODE (name) == SSA_NAME)
>>>>> +               {
>>>>> +                 /* If an SSA has already been seen, this may be a
>>>>> loop.
>>>>> +                    Fail conservatively.  */
>>>>> +                 if (bitmap_bit_p (visited_ssa, SSA_NAME_VERSION
>>>>> (name)))
>>>>> +                   return false;
>>>>>
>>>>> so to me "conservative" is returning true, not false.
>>>>
>>>>
>>>>
>>>> OK
>>>>
>>>>>
>>>>> +                 bitmap_set_bit (visited_ssa, SSA_NAME_VERSION (name));
>>>>> +                 worklist.safe_push (name);
>>>>>
>>>>> but for loops we can just continue and ignore this use.  And
>>>>> bitmap_set_bit
>>>>> returns whether it set a bit, thus
>>>>>
>>>>>                 if (bitmap_set_bit (visited_ssa, SSA_NAME_VERSION
>>>>> (name)))
>>>>>                   worklist.safe_push (name);
>>>>>
>>>>> should work?
>>>>
>>>>
>>>>
>>>> Fixed.
>>>>
>>>>>
>>>>> +      /* Check that any SSA names used to define NAME is also fully
>>>>> +        defined.  */
>>>>> +      use_operand_p use_p;
>>>>> +      ssa_op_iter iter;
>>>>> +      FOR_EACH_SSA_USE_OPERAND (use_p, def, iter, SSA_OP_USE)
>>>>> +       {
>>>>> +         name = USE_FROM_PTR (use_p);
>>>>> +         if (TREE_CODE (name) == SSA_NAME)
>>>>>
>>>>> always true.
>>>>>
>>>>> +           {
>>>>> +             /* If an SSA has already been seen, this may be a loop.
>>>>> +                Fail conservatively.  */
>>>>> +             if (bitmap_bit_p (visited_ssa, SSA_NAME_VERSION (name)))
>>>>> +               return false;
>>>>> +             bitmap_set_bit (visited_ssa, SSA_NAME_VERSION (name));
>>>>> +             worklist.safe_push (name);
>>>>>
>>>>> See above.
>>>>>
>>>>> In principle the thing is sound but I'd like to be able to pass in a
>>>>> bitmap of
>>>>> known maybe-undefined/must-defined SSA names to have a cache for
>>>>> multiple invocations from within a pass (like this unswitching case).
>>>>
>>>>
>>>>
>>>> Done, though bitmaps are now calculated as part of the instantiation.
>>>>
>>>>>
>>>>> Also once you hit defs that are in a post-dominated region of the loop
>>>>> entry
>>>>> you can treat them as not undefined (as their use invokes undefined
>>>>> behavior).  This is also how you treat function parameters (well,
>>>>> ssa_undefined_value_p does), where the call site invokes undefined
>>>>> behavior
>>>>> when passing in undefined values.  So we need an extra parameter
>>>>> specifying
>>>>> the post-dominance region.
>>>>
>>>>
>>>>
>>>> Done.
>>>>
>>>>>
>>>>> You do not handle memory or calls conservatively which means the
>>>>> existing
>>>>> testcase only needs some obfuscation to become a problem again.  To fix
>>>>> that before /* Check that any SSA names used to define NAME is also
>>>>> fully
>>>>> defined.  */ bail out conservatively, like
>>>>>
>>>>>    if (! is_gimple_assign (def)
>>>>>       || gimple_assign_single_p (def))
>>>>>     return true;
>>>>
>>>>
>>>>
>>>> As I asked previously, I understand the !is_gimple_assign, which will
>>>> skip
>>>> over GIMPLE_CALLs setting a value, but the "gimple_assign_single_p(def)
>>>> ==
>>>> true"??
>>>>
>>>> Won't this last one bail on things like e.3_7 = e, or x_77 = y_88? Don't
>>>> we
>>>> want to follow up the def chain precisely on these?
>>>>
>>>> The attached implementation uses a cache, and a pre-computed
>>>> post-dominance
>>>> region.  A previous incantation of this patch (sans the post-dominance
>>>> stuff) had performance within the noise of the previous implementation.
>>>>
>>>> I am testing again, and will do some performance comparisons in a bit,
>>>> but
>>>> for now-- are we on the same page here?  Is this what you wanted?
>>>
>>>
>>> +      /* DEFs in the post-dominated region of the loop entry invoke
>>> +        undefined behavior.  Adding any use won't make things any
>>> +        worse.  */
>>> +      for (unsigned i = 0; i < postdom.length (); ++i)
>>> +       if (def->bb == postdom[i])
>>>
>>> gimple_bb (def)
>>>
>>> +         {
>>> +           set_defined (t);
>>> +           return false;
>>> +         }
>>>
>>> I think you can't really return false here but need to continue processing
>>> the worklist.  I'd say rather than a linear walk you can as well use
>>> dominated_by_p (CDI_POST_DOMINATORS, ...) and record the entry
>>> block instead?
>>>
>>> Note that the way you compute post-dominators doesn't work -- nothing
>>> keeps them up-to-date when unswitching does CFG manipulations :/
>>> Fortunately we have a way to compute them per region, see how
>>> tree-if-conv.c
>>> does that (build_region, calculate_dominance_info_for_region).
>>
>>
>> If I understand correctly, we could compute them per region in
>> tree_unswitch_single_loop() for each individual loop with what
>> tree-if-conv.c uses.
>>
>> I could certainly abstract build_region/calculate_dominance_info_for_region
>> into something new, say calculate_dominance_info_for_loop_region().  Then we
>> could use dominated_by_p(CDI_POST_DOMINATORS, ...) in my class as you
>> suggest.
>>
>> Question... computing the post-dom tree per region as I've just described
>> will only create post-dom information for the loop region, which means that
>> any definition outside of the loop will have zero dominance info.
>
> Yes.
>
>> What if the definition in question post-dominates the loop entry but is
>> outside/after of the loop?
>
> How would we ever arrive at such def?  Once we reach a def before the
> region/loop
> we know it's fine to use (the missing "stop" point I pointed out).

Hmmm, it looks like we can't even build the post-dom region 
appropriately in the presence of infinite loops.

First, build_region() fails if it can't find a loop post-header:

   /* The last element is loop post-header.  */
   gcc_assert (exit_bb);

which we won't have in an infinite loop like:

bb2: //loop header
|
V
loop 1:
bb3:		<-------+
	bar();		|
|			|
V			|
bb4:			|
	goto bb3;   >---+


We could just build the region without the post-header, but then we fail 
while building the DFS dominance region:

dom_info::calc_dfs_tree_nonrec (basic_block bb)
....
	  gcc_assert (bn != m_start_block);

Presumably because we've looped back to the start block (bb4 in a post 
dominator world).  This doesn't happen calculating going forward because 
we always have a start block outside of the region (the function entry 
block).

I tried to fake edge my way out of it, but things get really messed up 
while putting/removing fake edges in the presence of loop info.  I'm 
assuming all this is by design.

Would you prefer me to continue down this path, trying to build a 
post-dom region with infinite loops and fixing build_region / 
calc_dfs_tree_nonrec, or could we do without this dominance 
optimization?  Pretty please?

>
>>  We will have no post-dom information for this.
>> In this case, could we just ignore and continue evaluating the SSA_NAME with
>> the rest of our heuristics, or did you have something else in mind?  Another
>> option would be to calculate the post-dom information for the entire
>> function on every loop (calculate_dominance_info(CDI_POST_DOMINATORS)), but
>> that seems rather expensive.
>>
>> As a side note, at this point I just want to fix/close the PR in an
>> acceptable manner, not come up with the end-all catch-all most-efficient
>> patch for an unlikely scenario ;-).
>
> Yeah, which is why I wonder whether the caching is worth the trouble (in it's
> current form) for the unswitching use (given it's other restrictions
> on conditions
> to unswitch).

We could go back to my original, no caching version (with the other 
suggestions).  That solves the problem quite simply, with no regressions.

Thanks again.
Aldy