From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linkw@linux.ibm.com>
Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com
 [148.163.156.1])
 by sourceware.org (Postfix) with ESMTPS id A37C73857C4E
 for <gcc@gcc.gnu.org>; Mon,  5 Jul 2021 02:29:49 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org A37C73857C4E
Received: from pps.filterd (m0098394.ppops.net [127.0.0.1])
 by mx0a-001b2d01.pphosted.com (8.16.0.43/8.16.0.43) with SMTP id
 16524XOQ159931; Sun, 4 Jul 2021 22:29:45 -0400
Received: from pps.reinject (localhost [127.0.0.1])
 by mx0a-001b2d01.pphosted.com with ESMTP id 39kjt6ept2-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Sun, 04 Jul 2021 22:29:45 -0400
Received: from m0098394.ppops.net (m0098394.ppops.net [127.0.0.1])
 by pps.reinject (8.16.0.43/8.16.0.43) with SMTP id 16526EER167635;
 Sun, 4 Jul 2021 22:29:45 -0400
Received: from ppma03ams.nl.ibm.com (62.31.33a9.ip4.static.sl-reverse.com
 [169.51.49.98])
 by mx0a-001b2d01.pphosted.com with ESMTP id 39kjt6epsj-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Sun, 04 Jul 2021 22:29:44 -0400
Received: from pps.filterd (ppma03ams.nl.ibm.com [127.0.0.1])
 by ppma03ams.nl.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 1652IH5g017427;
 Mon, 5 Jul 2021 02:29:42 GMT
Received: from b06cxnps4075.portsmouth.uk.ibm.com
 (d06relay12.portsmouth.uk.ibm.com [9.149.109.197])
 by ppma03ams.nl.ibm.com with ESMTP id 39jfh8rhrn-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Mon, 05 Jul 2021 02:29:42 +0000
Received: from d06av24.portsmouth.uk.ibm.com (mk.ibm.com [9.149.105.60])
 by b06cxnps4075.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id
 1652TejT33423618
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
 Mon, 5 Jul 2021 02:29:40 GMT
Received: from d06av24.portsmouth.uk.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 2E9394203F;
 Mon,  5 Jul 2021 02:29:40 +0000 (GMT)
Received: from d06av24.portsmouth.uk.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id E826E42047;
 Mon,  5 Jul 2021 02:29:38 +0000 (GMT)
Received: from kewenlins-mbp.cn.ibm.com (unknown [9.200.147.34])
 by d06av24.portsmouth.uk.ibm.com (Postfix) with ESMTP;
 Mon,  5 Jul 2021 02:29:38 +0000 (GMT)
Subject: Re: Question on tree LIM
To: Richard Biener <richard.guenther@gmail.com>
Cc: GCC Development <gcc@gcc.gnu.org>,
 "Andre Vieira (lists)" <Andre.SimoesDiasVieira@arm.com>,
 Xiong Hu Luo <luoxhu@linux.ibm.com>
References: <1338ef7b-57f4-a376-5827-c85392ed53a8@linux.ibm.com>
 <CAFiYyc15i7ErH6K+Cptq4Z+23r3iqLW6pGstQvZLix6KnjWi5g@mail.gmail.com>
 <0fd24c58-bcd4-ce7d-d986-bee82d2b7ff5@linux.ibm.com>
 <CAFiYyc0=ZejPUTPYQKp+5Xrn7oBYvYzf2CFdwtD0foqzs9X0tQ@mail.gmail.com>
From: "Kewen.Lin" <linkw@linux.ibm.com>
Message-ID: <657d2ed0-fab1-47f3-a426-93f9e8bba116@linux.ibm.com>
Date: Mon, 5 Jul 2021 10:29:37 +0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0)
 Gecko/20100101 Thunderbird/78.10.0
In-Reply-To: <CAFiYyc0=ZejPUTPYQKp+5Xrn7oBYvYzf2CFdwtD0foqzs9X0tQ@mail.gmail.com>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
X-TM-AS-GCONF: 00
X-Proofpoint-GUID: NONCsXL9kaZ_KsBJObsTU23NY-HCIAE0
X-Proofpoint-ORIG-GUID: wBorE32kh4SfYiNtb9_FL-R-WZvTNgC5
Content-Transfer-Encoding: 8bit
X-Proofpoint-UnRewURL: 0 URL was un-rewritten
MIME-Version: 1.0
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391, 18.0.790
 definitions=2021-07-04_17:2021-07-02,
 2021-07-04 signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
 adultscore=0 mlxlogscore=999
 mlxscore=0 bulkscore=0 lowpriorityscore=0 malwarescore=0 impostorscore=0
 phishscore=0 clxscore=1015 spamscore=0 suspectscore=0 priorityscore=1501
 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2104190000
 definitions=main-2107050010
X-Spam-Status: No, score=-5.0 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_EF, KAM_SHORT, NICE_REPLY_A, RCVD_IN_MSPIKE_H4,
 RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc mailing list <gcc.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc>,
 <mailto:gcc-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc/>
List-Post: <mailto:gcc@gcc.gnu.org>
List-Help: <mailto:gcc-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc>,
 <mailto:gcc-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Mon, 05 Jul 2021 02:29:51 -0000

on 2021/7/2 7:28 PM, Richard Biener wrote:
> On Fri, Jul 2, 2021 at 11:05 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi Richard,
>>
>> on 2021/7/2 4:07 PM, Richard Biener wrote:
>>> On Fri, Jul 2, 2021 at 5:34 AM Kewen.Lin via Gcc <gcc@gcc.gnu.org> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am investigating a degradation in SPEC2017 exchange2_r: with loop
>>>> vectorization enabled at -O2, it degrades by 6%.  After some
>>>> isolation, I found that it isn't directly caused by vectorization
>>>> itself but is exposed by it: some statements for the vectorization
>>>> condition checks are hoisted out and increase the register
>>>> pressure, which finally results in more spilling than before.  If I
>>>> simply disable tree lim4, the gap becomes smaller (just 40%+ of
>>>> the original); if I further disable rtl lim, it drops to 30% of
>>>> the original.  This seems to indicate there is some room for
>>>> improvement in both LIMs.
>>>>
>>>> From a quick scan of tree LIM, I noticed that it does not seem to
>>>> consider register pressure at all.  Is that intentional?  I am
>>>> wondering what the design philosophy behind it is.  Is it because
>>>> it's hard to model register pressure well here?  If so, it seems to
>>>> put the burden on the later RA, which needs to have good
>>>> rematerialization support.
>>>
>>> Yes, it is "intentional" in that doing any kind of prioritization based
>>> on register pressure is hard on the GIMPLE level since most
>>> high-level transforms try to expose followup transforms which you'd
>>> somehow have to anticipate.  Note that LIM's "cost model" (if you can
>>> call it such...) is too simplistic to be a good base to decide which
>>> 10 of the 20 candidates you want to move (and I've repeatedly pondered
>>> removing it completely).
>>>
>>
>> Thanks for the explanation!  Do you really want to remove it completely
>> rather than just improve it with a better one?  :-\
> 
> ;)  For example, the LIM cost model makes it not hoist an invariant (int)x,
> but then PRE, which detects invariant motion opportunities as partial
> redundancies, happily does (because PRE has no cost model at all - heh).
> 

Got it, thanks for further clarification. :)
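
For anyone following along, here is a minimal C sketch of the kind of
loop discussed above, where (long) x is loop invariant but cheap.  The
function names are made up for illustration, and the second function
only shows by hand what hoisting the conversion would look like; it is
not a claim about what any particular GCC version emits.

```c
#include <stddef.h>

/* The conversion (long) x yields the same value on every iteration, so
   it is loop invariant, but it is also very cheap to recompute.  */
long
sum_scaled (const long *a, size_t n, int x)
{
  long sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += a[i] * (long) x;
  return sum;
}

/* Hand-hoisted form: the converted value is computed once and then has
   to stay live for the whole loop.  */
long
sum_scaled_hoisted (const long *a, size_t n, int x)
{
  long lx = (long) x;
  long sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += a[i] * lx;
  return sum;
}
```

Both forms compute the same result; the difference is only where the
conversion happens and how long its result stays live.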

>> There are some PRs (PR96825, PR98782) related to exchange2_r which
>> seem to suffer from high register pressure and bad spilling.  I'm not
>> sure whether they are also related to the pressure caused by LIM, but
>> the trigger is commit
>> 1118a3ff9d3ad6a64bba25dc01e7703325e23d92, which adjusts prediction
>> frequency.  Maybe it's worth revisiting the idea of considering
>> BB frequency in the LIM cost model:
>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
> 
> Note most "problems", and those which are harder to undo, stem from
> LIM's store-motion, which increases register pressure inside loops by
> adding loop-carried dependences.  The BB frequency might be a way
> to order candidates when we have a way to set a better cap on the
> number of refs to move.  Note the current "cost" model is rather a
> benefit model and causes us to not move cheap things (like the above
> conversion) because it seems not worth the trouble.
> 

Yeah, I noticed it at least excludes "cheap" ones.
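
To make the store-motion point concrete, here is a small C sketch of
the transform written by hand.  It assumes `total` does not point into
`a[]`; the function names are hypothetical and this only illustrates
the idea, not GCC's actual implementation.

```c
#include <stddef.h>

/* Before: every iteration loads and stores *total through memory.  */
void
accumulate (int *total, const int *a, size_t n)
{
  for (size_t i = 0; i < n; i++)
    *total += a[i];
}

/* After store motion, written by hand (assumes total does not alias
   a[]): the load is hoisted before the loop and the store is sunk after
   it.  The temporary t is now carried from one iteration to the next
   (a PHI at the GIMPLE level), so it needs a register for the whole
   loop.  */
void
accumulate_sm (int *total, const int *a, size_t n)
{
  int t = *total;
  for (size_t i = 0; i < n; i++)
    t += a[i];
  *total = t;
}
```

Each reference moved this way adds one more value that is live across
the loop, which is why a cap on the number of loop-carried dependences
is one option being considered.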

> Note a very simple way would be to have a --param specifying a
> maximum number of refs to move (but note there are several
> LIM/store-motion passes so any such static limit would have
> surprising effects).  For store-motion I considered a hard limit on
> the number of loop carried dependences (PHIs) and counting both
> existing and added ones (to avoid the surprise).
> 
> Note how such limits or other cost models should consider inner and
> outer loop behavior remains to be determined - at least LIM works
> at the level of whole loop nests and there's a rough idea of dependent
> transforms but simply gathering candidates and stripping some isn't
> going to work without major surgery in that area I think.
> 

Thanks for all the notes and thoughts.  I had better look into RA remat
first.  Xionghu is interested in investigating how to take BB frequency into
account in LIM; I will check its effect and follow up on these ideas if needed.

BR,
Kewen

>>> As to putting the burden on RA - yes, that's one possibility.  The other
>>> possibility is to use the register-pressure aware scheduler, though not
>>> sure if that will ever move things into loop bodies.
>>>
>>
>> Brand new idea!  IIUC it requires a global scheduler, and I'm not sure
>> how well GCC's global scheduler performs.  Generally speaking, the
>> register-pressure aware scheduler prefers the insn in which more
>> registers die (for the pressure-intensive regclass).  For this problem
>> the modeling seems a bit different: it has to care about the total
>> interference between two "equivalent" blocks (src/dest).  I'm not sure
>> whether that's easier to do than rematerialization.
> 
> No idea either but as said above undoing store-motion is harder than
> scheduling or RA remat.
> 
>>>> BTW, the example loop is at line 1150 of exchange2.fppized.f90:
>>>>
>>>>    1150 block(rnext:9, 7, i7) = block(rnext:9, 7, i7) + 10
>>>>
>>>> The extra hoisted statements after the vectorization on this loop
>>>> (cheap cost model btw) are:
>>>>
>>>>     _686 = (integer(kind=8)) rnext_679;
>>>>     _1111 = (sizetype) _19;
>>>>     _1112 = _1111 * 12;
>>>>     _1927 = _1112 + 12;
>>>>   * _1895 = _1927 - _2650;
>>>>     _1113 = (unsigned long) rnext_679;
>>>>   * niters.6220_1128 = 10 - _1113;
>>>>   * _1021 = 9 - _1113;
>>>>   * bnd.6221_940 = niters.6220_1128 >> 2;
>>>>   * niters_vector_mult_vf.6222_939 = niters.6220_1128 & 18446744073709551612;
>>>>     _144 = niters_vector_mult_vf.6222_939 + _1113;
>>>>     tmp.6223_934 = (integer(kind=8)) _144;
>>>>     S.823_1004 = _1021 <= 2 ? _686 : tmp.6223_934;
>>>>   * ivtmp.6410_289 = (unsigned long) S.823_1004;
>>>>
>>>> PS: * marks a statement whose result has a long live interval.
>>>
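
The arithmetic in the hoisted statements above can be replayed in plain
C.  This sketch assumes a vectorization factor of 4 (inferred from the
`>> 2` and the `& 18446744073709551612` mask, which is ~3 in 64 bits);
the struct and function names are made up for illustration, and only
the statements that depend on rnext are modeled.

```c
#include <stdint.h>

/* Hypothetical container for the values computed before the loop.  */
struct vect_setup
{
  uint64_t niters;                /* niters.6220_1128 */
  uint64_t bnd;                   /* bnd.6221_940 */
  uint64_t niters_vector_mult_vf; /* niters_vector_mult_vf.6222_939 */
  int64_t scalar_start;           /* S.823_1004 */
};

struct vect_setup
compute_setup (int64_t rnext)
{
  struct vect_setup s;
  uint64_t u = (uint64_t) rnext;          /* _1113 */
  uint64_t niters_m1 = 9 - u;             /* _1021 */

  s.niters = 10 - u;                      /* iterations of rnext:9 */
  s.bnd = s.niters >> 2;                  /* vector iterations, VF = 4 */
  s.niters_vector_mult_vf = s.niters & ~(uint64_t) 3;

  /* tmp.6223_934: first index after the vectorized part.  */
  int64_t tmp = (int64_t) (s.niters_vector_mult_vf + u);

  /* With at most 3 iterations the start stays at rnext; otherwise it
     is the first index after the vectorized part.  */
  s.scalar_start = niters_m1 <= 2 ? rnext : tmp;
  return s;
}
```

For rnext = 1 this gives niters = 9, bnd = 2, niters_vector_mult_vf = 8
and a start index of 9.  Several of these values must stay live at the
same time, which matches the long live intervals marked with * above.
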
>>> Note for the vectorizer generated conditions there's quite some room for
>>> improvements to reduce the amount of semi-redundant computations.  I've
>>> pointed out some to Andre, in particular suggesting to maintain a single
>>> "remaining scalar iterations" IV across all the checks to avoid keeping
>>> 'niters' live and doing all the above masking & shifting repeatedly before
>>> the prologue/main/vectorized epilogue/epilogue loops.  Not sure how far
>>> he got with that idea.
>>>
>>
>> Great, that would definitely help mitigate this problem.  Thanks for the information.
>>
>>
>> BR,
>> Kewen