From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-462283-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 126204 invoked by alias); 15 Sep 2017 16:50:34 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 125939 invoked by uid 89); 15 Sep 2017 16:50:33 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.9 required=5.0 tests=BAYES_00,RP_MATCHES_RCVD,SPF_HELO_PASS autolearn=ham version=3.3.2 spammy=pursuit, stronger
X-HELO: mx1.redhat.com
Received: from mx1.redhat.com (HELO mx1.redhat.com) (209.132.183.28) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Fri, 15 Sep 2017 16:50:31 +0000
Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com [10.5.11.13])	(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))	(No client certificate requested)	by mx1.redhat.com (Postfix) with ESMTPS id E1F8D81E03;	Fri, 15 Sep 2017 16:50:29 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com E1F8D81E03
Authentication-Results: ext-mx01.extmail.prod.ext.phx2.redhat.com; dmarc=none (p=none dis=none) header.from=redhat.com
Authentication-Results: ext-mx01.extmail.prod.ext.phx2.redhat.com; spf=fail smtp.mailfrom=law@redhat.com
Received: from localhost.localdomain (ovpn-112-55.rdu2.redhat.com [10.10.112.55])	by smtp.corp.redhat.com (Postfix) with ESMTP id 6F25F17CDD;	Fri, 15 Sep 2017 16:50:27 +0000 (UTC)
Subject: Re: [PATCH] Factor out division by squares and remove division around comparisons (1/2)
To: Wilco Dijkstra <Wilco.Dijkstra@arm.com>, Jackson Woodruff <jackson.woodruff@foss.arm.com>, Richard Biener <richard.guenther@gmail.com>
Cc: "kyrylo.tkachov@foss.arm.com" <kyrylo.tkachov@foss.arm.com>, "Joseph S. Myers" <joseph@codesourcery.com>, GCC Patches <gcc-patches@gcc.gnu.org>, nd <nd@arm.com>
References: <375649d4-3c43-0c37-3e4d-3913c7213993@foss.arm.com> <CAFiYyc3W59mDXn-pEKNO1S3Uk=2XpowRtT65ZgJb_E8AGnCK-Q@mail.gmail.com> <DB6PR0801MB205392EDB2B58B757167CC26838D0@DB6PR0801MB2053.eurprd08.prod.outlook.com> <CAFiYyc0wGaBuujfTUSVrsAOUoOkoO6SCYB9Mh+kPJ7DP68pSUA@mail.gmail.com> <DB6PR0801MB2053EA8A9D0D899A21DF5CF183830@DB6PR0801MB2053.eurprd08.prod.outlook.com> <41575a02-f31a-3944-9efa-c4db14e1606d@redhat.com> <2c8f1237-fdb2-b7de-0f7d-6ba15c1368fe@foss.arm.com> <CAFiYyc3ggiuEoPKG5e_KNU0aMzT4TS+N4RkXcnynt+dUS6ZXtQ@mail.gmail.com> <dbeb5c74-3035-71d6-e2d8-49e8b0c46b7f@foss.arm.com> <CAFiYyc0Su3E_qu=pucVnPfH2O+9pz+=AJAtGdkgFvgiwLA=7mQ@mail.gmail.com> <e4218e1c-5c2f-56d3-4682-1eab065fdd5c@foss.arm.com> <d278713e-3075-51fb-af1a-27ca35825876@redhat.com> <DB6PR0801MB2053A5F95768973B9D5421CF836E0@DB6PR0801MB2053.eurprd08.prod.outlook.com>
From: Jeff Law <law@redhat.com>
Message-ID: <a1e736d4-180c-184b-8bdc-c0659df359b1@redhat.com>
Date: Fri, 15 Sep 2017 16:50:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.2.1
MIME-Version: 1.0
In-Reply-To: <DB6PR0801MB2053A5F95768973B9D5421CF836E0@DB6PR0801MB2053.eurprd08.prod.outlook.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-IsSubscribed: yes
X-SW-Source: 2017-09/txt/msg01045.txt.bz2

On 09/13/2017 03:20 PM, Wilco Dijkstra wrote:
> Jeff Law wrote:
>> On 09/06/2017 03:55 AM, Jackson Woodruff wrote:
>>> On 08/30/2017 01:46 PM, Richard Biener wrote:
> 
>>>>    rdivtmp = 1 / (y*C);
>>>>    tem = x *rdivtmp;
>>>>    tem2= z * rdivtmp;
>>>>
>>>> instead of
>>>>
>>>>    rdivtmp = 1/y;
>>>>    tem = x * 1/C * rdivtmp;
>>>>    tem2 = z * 1/C * rdivtmp;
>>>
>>> Ideally we would be able to CSE that into
>>>
>>> rdivtmp = 1/y * 1/C;
>>> tem = x * rdivtmp;
>>> tem2 = z * rdivtmp;
>> So why is your sequence significantly better than Richi's desired
>> seqeuence?  They both seem to need 3 mults and a division (which in both
>> cases might be a reciprocal estimation).    In Richi's sequence we have
>> to mult and feed the result as an operand into the reciprocal insn.  In
>> yours we feed the result of the reciprocal into the multiply.
> 
> Basically this stuff happens a lot in real code, which is exactly why I proposed it.
> I even provided counts of how many divisions each transformation avoids:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026. 
I don't doubt that it happens in real code.  There's a reason why MIPS
IV added recip and rsqrt 20 years ago.  Our ability to exploit them has
always been fairly limited though.

What I'm trying to avoid is a transformation where the two forms are
roughly equal in terms of what they expose vs what they hide.

If pulling the 1/c out is consistently better, then that's obviously
good.  If it's essentially a toss-up because of the other interactions
with CSE, then we need to think about it more deeply.

I _suspect_ that pulling the 1/c out is generally better, but something
more concrete than my intuition would be helpful.


> 
> Note this transformation is a canonicalization - if you can't merge a
> constant somehow, moving it out to the LHS can expose more
> opportunities, like in (C1 * x) / (C2 * y) -> (C1 * x * C2) / y -> (C3 * x) / y.
> Same for negation as it behaves like * -1.
FWIW, I generally agree.

> 
> The key here is that it is at least an order of magnitude worse if you have
> to execute an extra division than if you have an extra multiply.
No doubt.  I'll trade a division for a multiply any day.  Similarly 1/C
is just a constant, so I consider it essentially free.


> 
>> ISTM in the end if  y*C or 1/(y*C) is a CSE, then Richi's sequence wins.
>>  Similarly if 1/y is a CSE, then yours wins.  Is there some reason to
>> believe that one is a more likely CSE than the other?
> 
> The idea is that 1/y is much more likely a CSE than 1/(y*C).
Do we have anything other than intuition to back that up?
> 
> We could make the pattern only fire in single use cases and see whether
> that makes a difference. It would be easy to test old vs new vs single-use
> new and count how many divisions we end up with. We could add 1/ (y * C)
> to the reciprocal phase if it is unacceptable as a canonicalization, but then
> you won't be able to optimize (C1 * x * C2) / y.
We could.  I think the question would then become does restricting to
the single-use case kill too many opportunities.

Sigh.  I think the 4 of us could go round and round on this forever in
the pursuit of the perfect code.  Though in reality I'm happy with a
clear improvement.


> 
>> I think there's a fundamental phase ordering problem here.  You want to
>> CSE stuff as much as possible, then expose reciprocals, then CSE again
>> because exposing reciprocals can expose new CSE opportunities.
> 
> I agree there are phase ordering issues and various problems in
> reassociation, CSE and division optimizations not being able to find and
> optimize complex cases that are worthwhile.
> 
> However I don't agree doing CSE before reciprocals is a good idea. We
> want to expose reciprocals early on, even if that means we find fewer
> CSEs as a result - again because division is so much more expensive than
> any other operation. CSE is generally not smart enough to CSE a * x in
> a * b * x and a * c * x, something which is likely to happen quite frequently -
> unlike the made up division examples here.
We have much stronger reassociation capabilities for multiplication --
it's a well understood problem and if we fail to pick up a * x across
those two kind of expressions, I'd consider our reassociation code as
failing pretty badly, particularly for integer types.

BUt yes, division is expensive.  And I'm all for a tranformation that
turns a division into a multiplication.  That's almost always a win.

> 
> The first question is whether you see it as a canonicalization. If so, then
> match.pd should be fine.
Pulling the constant part out of the denominator and turning it into a
multiplication by the recip constant should likely be considered
canonicalization.  I think Richi largely agreed to this in the thread as
well and asked for that hunk of the patch to be extracted and submitted
independent of the other changes so that it could go ahead and move
forward while we figure out the rest.

Note that tree-ssa-reassoc.c has a fair amount of this kind of
canonicalization.  In fact, I think that's where we handle things like
pulling out negations.  You may actually do better handling it there.


Jeff