Subject: Re: Question on tree LIM
To: Richard Biener
Cc: GCC Development, "Andre Vieira (lists)", Xiong Hu Luo
References: <1338ef7b-57f4-a376-5827-c85392ed53a8@linux.ibm.com> <0fd24c58-bcd4-ce7d-d986-bee82d2b7ff5@linux.ibm.com>
From: "Kewen.Lin"
Message-ID: <657d2ed0-fab1-47f3-a426-93f9e8bba116@linux.ibm.com>
Date: Mon, 5 Jul 2021 10:29:37 +0800
List-Id: Gcc mailing list

on 2021/7/2 7:28 PM, Richard Biener wrote:
> On Fri, Jul 2, 2021 at 11:05 AM Kewen.Lin wrote:
>>
>> Hi Richard,
>>
>> on 2021/7/2 4:07 PM, Richard Biener wrote:
>>> On Fri, Jul 2, 2021 at 5:34 AM Kewen.Lin via Gcc wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am investigating a degradation in SPEC2017 exchange2_r: with
>>>> loop vectorization enabled at -O2, it degrades by 6%.  After some
>>>> isolation, I found that it isn't directly caused by vectorization
>>>> itself but is exposed by it: some statements for the vectorization
>>>> condition checks are hoisted out of the loop and increase register
>>>> pressure, which finally results in more spills than before.  If I
>>>> simply disable tree lim4, the gap becomes smaller (just 40%+ of
>>>> the original); if I further disable RTL LIM, it drops to 30% of
>>>> the original.  This seems to indicate that there is some room for
>>>> improvement in both LIMs.
>>>>
>>>> From a quick scan of tree LIM, I noticed that it doesn't seem to
>>>> take register pressure into consideration at all.  Is that
>>>> intentional?  I am wondering what the design philosophy behind it
>>>> is.  Is it because register pressure is hard to model well here?
>>>> If so, it seems to put the burden onto the late RA, which then
>>>> needs good rematerialization support.
>>>
>>> Yes, it is "intentional" in that doing any kind of prioritization based
>>> on register pressure is hard on the GIMPLE level since most
>>> high-level transforms try to expose followup transforms which you'd
>>> somehow have to anticipate.  Note that LIM's "cost model" (if you can
>>> call it such...) is too simplistic to be a good base to decide which
>>> 10 of the 20 candidates you want to move (and I've repeatedly pondered
>>> removing it completely).
>>>
>>
>> Thanks for the explanation!  Do you really want to remove it completely
>> rather than just improve it with a better one?
>> :-\
>
> ;)  For example the LIM cost model makes it not hoist an invariant (int)x
> but then PRE, which detects invariant motion opportunities as partial
> redundancies, happily does (because PRE has no cost model at all - heh).
>

Got it, thanks for the further clarification.  :)

>> There are some PRs (PR96825, PR98782) related to exchange2_r which
>> seem to suffer from high register pressure and bad spilling.  Not sure
>> whether they are also somehow related to the pressure coming from LIM,
>> but the trigger is commit
>> 1118a3ff9d3ad6a64bba25dc01e7703325e23d92 which adjusts prediction
>> frequency; maybe it's worth revisiting this idea about considering
>> BB frequency in the LIM cost model:
>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>
> Note most "problems", and those which are harder to undo, stem from
> LIM's store-motion which increases register pressure inside loops by
> adding loop-carried dependences.  The BB frequency might be a way
> to order candidates when we have a way to set a better cap on the
> number of refs to move.  Note the current "cost" model is rather a
> benefit model and causes us to not move cheap things (like the above
> conversion) because it seems not worth the trouble.
>

Yeah, I noticed it at least excludes "cheap" ones.

> Note a very simple way would be to have a --param specifying a
> maximum number of refs to move (but note there are several
> LIM/store-motion passes so any such static limit would have
> surprising effects).  For store-motion I considered a hard limit on
> the number of loop carried dependences (PHIs) and counting both
> existing and added ones (to avoid the surprise).
>
> Note how such limits or other cost models should consider inner and
> outer loop behavior remains to be determined - at least LIM works
> at the level of whole loop nests and there's a rough idea of dependent
> transforms but simply gathering candidates and stripping some isn't
> going to work without major surgery in that area I think.
>

Thanks for all the notes and thoughts.  I had better look at RA remat
first.  Xionghu is interested in investigating how to consider BB
frequency in the LIMs; I will check its effect and then look further
into these ideas if needed.

BR,
Kewen

>>> As to putting the burden on RA - yes, that's one possibility.  The other
>>> possibility is to use the register-pressure aware scheduler, though not
>>> sure if that will ever move things into loop bodies.
>>>
>>
>> Brand new idea!  IIUC it requires a global scheduler, and I am not sure
>> how well GCC's global scheduler performs.  Generally speaking, the
>> register-pressure aware scheduler prefers the insn with more register
>> deaths (for the pressure-intensive regclass).  For this problem the
>> modeling seems a bit different: it has to care about the total number
>> of interferences between two "equivalent" blocks (src/dest).  Not sure
>> if that is easier to do than rematerialization.
>
> No idea either but as said above undoing store-motion is harder than
> scheduling or RA remat.
>
>>>> btw, the example loop is at line 1150 of exchange2.fppized.f90:
>>>>
>>>> 1150 block(rnext:9, 7, i7) = block(rnext:9, 7, i7) + 10
>>>>
>>>> The extra hoisted statements after the vectorization on this loop
>>>> (cheap cost model btw) are:
>>>>
>>>>     _686 = (integer(kind=8)) rnext_679;
>>>>     _1111 = (sizetype) _19;
>>>>     _1112 = _1111 * 12;
>>>>     _1927 = _1112 + 12;
>>>>   * _1895 = _1927 - _2650;
>>>>     _1113 = (unsigned long) rnext_679;
>>>>   * niters.6220_1128 = 10 - _1113;
>>>>   * _1021 = 9 - _1113;
>>>>   * bnd.6221_940 = niters.6220_1128 >> 2;
>>>>   * niters_vector_mult_vf.6222_939 = niters.6220_1128 & 18446744073709551612;
>>>>     _144 = niters_vector_mult_vf.6222_939 + _1113;
>>>>     tmp.6223_934 = (integer(kind=8)) _144;
>>>>     S.823_1004 = _1021 <= 2 ? _686 : tmp.6223_934;
>>>>   * ivtmp.6410_289 = (unsigned long) S.823_1004;
>>>>
>>>> PS: * marks a statement whose result has a long live interval.
>>>
>>> Note for the vectorizer-generated conditions there's quite some room for
>>> improvement to reduce the amount of semi-redundant computations.  I've
>>> pointed out some to Andre, in particular suggesting to maintain a single
>>> "remaining scalar iterations" IV across all the checks to avoid keeping
>>> 'niters' live and doing all the above masking & shifting repeatedly before
>>> the prologue/main/vectorized epilogue/epilogue loops.  Not sure how far
>>> he got with that idea.
>>>
>>
>> Great, that would definitely help to mitigate this problem.  Thanks for
>> the information.
>>
>>
>> BR,
>> Kewen
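
To make the arithmetic in the quoted dump concrete, here is a minimal C
sketch (not GCC code).  It recomputes the hoisted values by hand and
then shows, purely as an illustration, what a single "remaining scalar
iterations" counter could look like for the same loop.  The
vectorization factor of 4 is an assumption inferred from the ">> 2" and
the "& 18446744073709551612" (that is, ~3) in the dump; the names
hoisted_values and add10_remaining and the 9-element column are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define VF 4u  /* assumed vectorization factor, inferred from the dump */

/* The loop-bound values from the dump, recomputed by hand. */
struct hoisted {
    uint64_t niters;        /* niters.6220_1128 = 10 - rnext       */
    uint64_t bnd;           /* bnd.6221_940 = niters >> 2          */
    uint64_t niters_vf;     /* niters_vector_mult_vf = niters & ~3 */
    uint64_t scalar_start;  /* S.823_1004: first index for the scalar loop */
};

static struct hoisted hoisted_values(uint64_t rnext)
{
    assert(rnext >= 1 && rnext <= 9);
    struct hoisted h;
    h.niters = 10 - rnext;
    h.bnd = h.niters >> 2;
    h.niters_vf = h.niters & ~(uint64_t)(VF - 1);
    /* "_1021 <= 2" means fewer than VF iterations: no vector iteration. */
    h.scalar_start = (9 - rnext <= 2) ? rnext : h.niters_vf + rnext;
    return h;
}

/* Hypothetical alternative: one counter drives both the vector-width
   body and the scalar epilogue, so niters, bnd and niters_vf do not
   all have to stay live around the loops. */
static void add10_remaining(int32_t col[9], uint64_t rnext)
{
    assert(rnext >= 1 && rnext <= 9);
    uint64_t remaining = 10 - rnext;  /* the only bound value kept live */
    int32_t *p = &col[rnext - 1];     /* Fortran index rnext is C index rnext - 1 */

    for (; remaining >= VF; remaining -= VF, p += VF)  /* "vector" body */
        for (unsigned k = 0; k < VF; k++)
            p[k] += 10;
    for (; remaining > 0; remaining--, p++)            /* scalar epilogue */
        *p += 10;
}
```

For rnext = 3 the first function gives niters = 7, bnd = 1 and
niters_vf = 4, so the scalar loop starts at index 7; the second function
touches the same elements while carrying only one bound value.  Whether
this actually lowers register pressure in the real pass is exactly the
open question in the thread.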