Subject: Re: PR80155: Code hoisting and register pressure
To: Richard Biener, "Bin.Cheng", Prathamesh Kulkarni
Cc: GCC Development, Thomas Preudhomme
From: Jeff Law <law@redhat.com>
Date: Fri, 25 May 2018 19:26:00 -0000

On 05/25/2018 11:54 AM, Richard Biener wrote:
> On May 25, 2018 6:57:13 PM GMT+02:00, Jeff Law wrote:
>> On 05/25/2018 03:49 AM, Bin.Cheng wrote:
>>> On Fri, May 25, 2018 at 10:23 AM, Prathamesh Kulkarni wrote:
>>>> On 23 May 2018 at 18:37, Jeff Law wrote:
>>>>> On 05/23/2018 03:20 AM, Prathamesh Kulkarni wrote:
>>>>>> On 23 May 2018 at 13:58, Richard Biener
>>>>>> wrote:
>>>>>>> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> I am trying to work on PR80155, which exposes a problem with code
>>>>>>>> hoisting and register pressure on a leading embedded benchmark for
>>>>>>>> ARM cortex-m7, where code hoisting causes an extra register spill.
>>>>>>>>
>>>>>>>> I have attached two test-cases which (hopefully) are representative
>>>>>>>> of the original test-case.
>>>>>>>> The first one (trans_dfa.c) is bigger and somewhat similar to the
>>>>>>>> original test-case, and trans_dfa_2.c is a hand-reduced version of
>>>>>>>> trans_dfa.c. There are two spills caused with trans_dfa.c and one
>>>>>>>> spill with trans_dfa_2.c, due to its smaller number of cases.
>>>>>>>> The test-cases in the PR are probably not relevant.
>>>>>>>>
>>>>>>>> Initially I thought the spill was happening because of "too many
>>>>>>>> hoistings" taking place in the original test-case, thus increasing
>>>>>>>> the register pressure, but it seems the spill is possibly caused
>>>>>>>> because an expression gets hoisted out of a block that is on a loop
>>>>>>>> exit.
>>>>>>>>
>>>>>>>> For example, the following hoistings take place with trans_dfa_2.c:
>>>>>>>>
>>>>>>>> (1) Inserting expression in block 4 for code hoisting:
>>>>>>>> {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
>>>>>>>>
>>>>>>>> (2) Inserting expression in block 4 for code hoisting:
>>>>>>>> {plus_expr,_4,1} (0006)
>>>>>>>>
>>>>>>>> (3) Inserting expression in block 4 for code hoisting:
>>>>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>>>>
>>>>>>>> (4) Inserting expression in block 3 for code hoisting:
>>>>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>>>>
>>>>>>>> The issue seems to be the hoisting of (*tab + 1), which consists of
>>>>>>>> the first two hoistings into block 4 from blocks 5 and 9, and which
>>>>>>>> causes the extra spill. I verified that by disabling hoisting into
>>>>>>>> block 4, which resulted in no extra spills.
>>>>>>>>
>>>>>>>> I wonder if that's because the expression (*tab + 1) is getting
>>>>>>>> hoisted from blocks 5 and 9, which are on loop exits? So the
>>>>>>>> expression that was previously computed in a block on a loop exit
>>>>>>>> gets hoisted outside that block, which possibly makes the allocator
>>>>>>>> more defensive? Similarly, disabling hoisting of expressions which
>>>>>>>> appeared in blocks on loop exits in the original test-case prevented
>>>>>>>> the extra spill. The other hoistings didn't seem to matter.
>>>>>>>
>>>>>>> I think that's simply coincidence. The only thing that makes a block
>>>>>>> that also exits from the loop special is that an expression could be
>>>>>>> sunk out of the loop, and hoisting (commoning with another path)
>>>>>>> could prevent that. But that isn't what is happening here, and it
>>>>>>> would be a pass-ordering issue, as the sinking pass runs only after
>>>>>>> hoisting (no idea why exactly, but I guess there are cases where we
>>>>>>> want to prefer CSE over sinking). So you could try whether
>>>>>>> re-ordering PRE and sinking helps your testcase.
>>>>>> Thanks for the suggestions. Placing the sink pass before PRE works for
>>>>>> both these test-cases! Sadly it still causes the spill for the
>>>>>> benchmark :-(
>>>>>> I will try to create a better approximation of the original test-case.
>>>>>>>
>>>>>>> What I do see is a missed opportunity to merge the successors
>>>>>>> of BB 4.
>>>>>>> After PRE we have
>>>>>>>
>>>>>>>   [local count: 159303558]:
>>>>>>>   pretmp_123 = *tab_37(D);
>>>>>>>   _87 = pretmp_123 + 1;
>>>>>>>   if (c_36 == 65)
>>>>>>>     goto ; [34.00%]
>>>>>>>   else
>>>>>>>     goto ; [66.00%]
>>>>>>>
>>>>>>>   [local count: 54163210]:
>>>>>>>   *tab_37(D) = _87;
>>>>>>>   _96 = MEM[(char *)s_57 + 1B];
>>>>>>>   if (_96 != 0)
>>>>>>>     goto ; [89.00%]
>>>>>>>   else
>>>>>>>     goto ; [11.00%]
>>>>>>>
>>>>>>>   [local count: 105140348]:
>>>>>>>   *tab_37(D) = _87;
>>>>>>>   _56 = MEM[(char *)s_57 + 1B];
>>>>>>>   if (_56 != 0)
>>>>>>>     goto ; [89.00%]
>>>>>>>   else
>>>>>>>     goto ; [11.00%]
>>>>>>>
>>>>>>> here at least the stores and loads can be hoisted. Note this may
>>>>>>> also point at the real issue of the code hoisting, which is tearing
>>>>>>> apart the RMW operation?
>>>>>> Indeed, this possibility seems much more likely than the block being
>>>>>> on a loop exit. I will try to "hardcode" the load/store hoists into
>>>>>> block 4 for this specific test-case to check if that prevents the
>>>>>> spill.
>>>>> Even if it prevents the spill in this case, it's likely a good thing
>>>>> to do. The statements prior to the conditional in bb5 and bb8 should
>>>>> be hoisted, leaving bb5 and bb8 with just their conditionals.
>>>> Hi,
>>>> It seems disabling forwprop somehow works, causing no extra spills on
>>>> the original test-case.
>>>>
>>>> For instance,
>>>> Hoisting without forwprop:
>>>>
>>>> bb 3:
>>>>   _1 = tab_1(D) + 8
>>>>   pretmp_268 = MEM[tab_1(D) + 8B];
>>>>   _2 = pretmp_268 + 1;
>>>>   goto or
>>>>
>>>> bb 4:
>>>>   *_1 = _2
>>>>
>>>> bb 5:
>>>>   *_1 = _2
>>>>
>>>> Hoisting with forwprop:
>>>>
>>>> bb 3:
>>>>   pretmp_164 = MEM[tab_1(D) + 8B];
>>>>   _2 = pretmp_164 + 1
>>>>   goto or
>>>>
>>>> bb 4:
>>>>   MEM[tab_1(D) + 8] = _2;
>>>>
>>>> bb 5:
>>>>   MEM[tab_1(D) + 8] = _2;
>>>>
>>>> Although in both cases we aren't hoisting stores, the issue with
>>>> forwprop for this case seems to be the folding of
>>>>   *_1 = _2
>>>> into
>>>>   MEM[tab_1(D) + 8] = _2 ?
>>>
>>> This isn't an issue, right? IIUC, tab_1(D) is used all over the loop,
>>> thus propagating _1 using (tab_1(D) + 8) actually removes one live
>>> range.
>>>
>>>> Disabling folding to mem_ref[base + offset] in forwprop "works" in the
>>>> sense that it creates the same set of hoistings as without forwprop;
>>>> however, it still results in additional spills (albeit of different
>>>> registers).
>>>>
>>>> That's because forwprop seems to be increasing the live range of
>>>> prephitmp_217 by substituting _221 + 1 with prephitmp_217 + 2 (_221 is
>>>> defined as prephitmp_217 + 1).
>>> Hmm, it's hard to discuss private benchmarks; I'm not sure in which
>>> dump I should find the prephitmp_221/prephitmp_217 stuff.
>>>
>>>> On the other hand, Bin pointed out to me in private that forwprop also
>>>> helps to restrict register pressure by propagating "tab + const_int"
>>>> for the same test-case.
>>>>
>>>> So I am not really sure if there's an easier fix than having heuristics
>>>> for estimating register pressure at the TREE level? I would be
>>> An easy fix, maybe not. OTOH, I am more convinced that passes like
>>> forwprop/sink/hoisting can be improved by taking live ranges into
>>> consideration -- specifically, to direct such passes when moving code
>>> around different basic blocks, because inter-block register pressure is
>>> hard to resolve afterwards.
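To make the live-range trade-off concrete, here is a C-level analogue of the two GIMPLE forms above. This is purely illustrative (the function and variable names are invented, and it is not the benchmark source): keeping the address in a temporary corresponds to the un-propagated `*_1 = _2` form, where the temporary dies at the last store, while folding the offset into each reference corresponds to the forwprop'd `MEM[tab_1(D) + 8] = _2` form, where the base pointer must stay live into both stores.

```c
/* Hypothetical C analogue of the two GIMPLE forms (illustration only).
   'MEM[tab_1(D) + 8B]' corresponds roughly to tab[2] with 4-byte int.  */

/* Un-propagated form: _1 = tab_1(D) + 8; ... *_1 = _2.
   The address lives in one temporary; tab itself can die early.  */
void bump_with_temp(int *tab, int cond)
{
    int *p = tab + 2;       /* one live range for the address */
    int v = *p + 1;
    if (cond)
        *p = v;             /* like bb 4: *_1 = _2 */
    else
        *p = v;             /* like bb 5: *_1 = _2 */
}

/* forwprop'd form: MEM[tab_1(D) + 8B] = _2 on each path.
   tab stays live into both stores, but the temporary _1 is gone.  */
void bump_folded(int *tab, int cond)
{
    int v = tab[2] + 1;
    if (cond)
        tab[2] = v;         /* like bb 4 */
    else
        tab[2] = v;         /* like bb 5 */
}
```

Both functions are semantically identical; the only difference is which value (the computed address or the base pointer) the register allocator has to keep live across the branch.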
>>>
>>> As suggested by Jeff and Richi, I guess the first step would be doing
>>> experiments, collecting more benchmark data for reordering sink before
>>> PRE? It enables code sinking, as well as decreasing register pressure,
>>> in the original reduced cases, IIRC.
>> We might even consider re-evaluating Bernd's work on what is effectively
>> a gimple scheduler to minimize register pressure.
>
> Sure. The main issue I see here is the interaction with TER, which we
> unfortunately still rely on. Enough GIMPLE instruction selection might
> help to get rid of the remaining pieces...

I really wonder how bad it would be to walk over expr.c and change the
expanders to be able to walk SSA_NAME_DEF_STMT to potentially get at the
more complex statements rather than relying on TER. That's really all TER
is supposed to be doing anyway.

Jeff
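The pattern the whole thread revolves around can be reproduced at the source level with a small sketch. This is a hypothetical reduction in the spirit of trans_dfa_2.c (the real attachments are not part of this message): both branches perform the same read-modify-write of *tab, GVN-PRE can hoist the load and the add above the conditional, and only the stores remain in the successors, so the incremented value stays live across the branch.

```c
/* Hypothetical reduction in the spirit of trans_dfa_2.c (illustration
   only; not the attached test-case).  Each branch performs the same
   *tab = *tab + 1 read-modify-write.  Code hoisting can common the
   load and the add above the conditional, leaving only the stores in
   the successor blocks and tearing the RMW apart, which extends the
   live range of the incremented value.  */
int scan(int *tab, const char *s)
{
    int state = 0;
    while (*s) {
        if (*s == 'A') {        /* analogous to one successor (bb 5) */
            *tab = *tab + 1;
            state = 1;
        } else {                /* analogous to the other (bb 8) */
            *tab = *tab + 1;
            state = 2;
        }
        s++;                    /* like {pointer_plus_expr,s_33,1} */
    }
    return state;
}
```

Compiling a shape like this with -O2 and inspecting the -fdump-tree-pre dump should show the hoisted load/add pair above the conditional, mirroring the dump Richard quoted.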