* LTO slows down calculix by more than 10% on aarch64
From: Prathamesh Kulkarni @ 2020-08-26 10:32 UTC
To: GCC Development
[-- Attachment #1: Type: text/plain, Size: 2982 bytes --]
Hi,
We're seeing a consistent regression of more than 10% on calculix with
-O2 -flto vs. -O2 on aarch64 in our validation CI. I investigated this a
bit, and the regression seems to come from inlining orthonl into e_c3d:
disabling that inlining brings back the performance. However, inlining
orthonl increases the size of e_c3d only from 3187 to 3837, i.e. by 650,
which is about 20% of the original size (16.9% of the inlined size), and
that doesn't seem too large.
I have attached two test cases: e_c3d.f, which has orthonl manually
inlined into e_c3d to "simulate" LTO's inlining, and e_c3d-orig.f,
which contains the unmodified function (gauss.f is included by
e_c3d.f). To reproduce, passing -O2 is sufficient.
It seems that inlining orthonl causes 20 hoistings into block 181,
which are then hoisted further into block 173, in particular the
hoistings of w(1, 1) ... w(3, 3), which weren't possible without
inlining. The hoistings happen because w(1, 1) ... w(3, 3) are used
both in the basic block that computes orthonl (line 672) and in the
following block at line 1035 in e_c3d.f:
      senergy=
     &     (s11*w(1,1)+s12*(w(1,2)+w(2,1))
     &     +s13*(w(1,3)+w(3,1))+s22*w(2,2)
     &     +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
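To make the effect concrete, here is a minimal source-level sketch of
what hoisting does to a value such as w(1,1) that is used both inside a
conditional branch and in the block that follows it. It is a
hand-written illustration in Python, not GCC output and not the real
control flow of e_c3d; the function names, the is_ortho flag and the
coefficients are hypothetical.

```python
def stiffness_before(w, coeff, s11, is_ortho):
    """Each use reads w[0][0] at the point where it is needed."""
    acc = 0.0
    if is_ortho:
        # Stand-in for the inlined orthonl body: first use of w(1,1).
        acc += coeff * w[0][0]
    # Stand-in for the senergy block that follows: second use of w(1,1).
    senergy = s11 * w[0][0]
    return acc + senergy


def stiffness_after(w, coeff, s11, is_ortho):
    """Same computation with the read of w[0][0] hoisted above the branch."""
    w11 = w[0][0]  # single read; w11 now stays live across the whole branch
    acc = 0.0
    if is_ortho:
        acc += coeff * w11
    senergy = s11 * w11
    return acc + senergy


# Hypothetical inputs, only used to show both variants agree.
w = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 4.0]]
for flag in (True, False):
    assert stiffness_before(w, 5.0, 7.0, flag) == stiffness_after(w, 5.0, 7.0, flag)
```

Both variants compute the same result; the difference is that the
hoisted value has to be kept somewhere from the top of the branch to its
last use. With all of w(1,1) ... w(3,3) hoisted at once, nine such values
are live across the branch, which is one way hoisting could raise
register pressure.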
Disabling hoisting into blocks 173 and 181 brings back most of the
performance. However, I am not able to understand why (or whether) these
hoistings of w(1, 1) ... w(3, 3) cause the slowdown. Looking at the
assembly, the hot code path in e_c3d reported by perf shows the
following code-gen difference.
For the inlined version:
.L122:
        ldr     d15, [x1, -248]
        add     w0, w0, 1
        add     x2, x2, 24
        add     x1, x1, 72
        fmul    d15, d17, d15
        fmul    d15, d15, d18
        fmul    d14, d15, d14
        fmadd   d16, d14, d31, d16
        cmp     w0, 4
        beq     .L121
        ldr     d14, [x2, -8]
        b       .L122
and for the non-inlined version:
.L118:
        ldr     d0, [x1, -248]
        add     w0, w0, 1
        ldr     d2, [x2, -8]
        add     x1, x1, 72
        add     x2, x2, 24
        fmul    d0, d3, d0
        fmul    d0, d0, d5
        fmul    d0, d0, d2
        fmadd   d1, d4, d0, d1
        cmp     w0, 4
        bne     .L118
which corresponds to the following loop at line 1014:
      do n1=1,3
         s(iii1,jjj1)=s(iii1,jjj1)
     &        +anisox(m1,k1,n1,l1)
     &        *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
     &        *weight
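One difference between the two listings that does not depend on register
numbering is the loop shape. The sketch below tallies instructions,
loads and branches per iteration. The listings are copied from this
message; the helper tally and its crude mnemonic classification are
illustrative assumptions, not a performance model.

```python
INLINED = """
.L122:
        ldr     d15, [x1, -248]
        add     w0, w0, 1
        add     x2, x2, 24
        add     x1, x1, 72
        fmul    d15, d17, d15
        fmul    d15, d15, d18
        fmul    d14, d15, d14
        fmadd   d16, d14, d31, d16
        cmp     w0, 4
        beq     .L121
        ldr     d14, [x2, -8]
        b       .L122
"""

NON_INLINED = """
.L118:
        ldr     d0, [x1, -248]
        add     w0, w0, 1
        ldr     d2, [x2, -8]
        add     x1, x1, 72
        add     x2, x2, 24
        fmul    d0, d3, d0
        fmul    d0, d0, d5
        fmul    d0, d0, d2
        fmadd   d1, d4, d0, d1
        cmp     w0, 4
        bne     .L118
"""

# Only the branch mnemonics that occur in the two listings above.
BRANCHES = {"b", "beq", "bne"}


def tally(listing):
    """Count instructions, loads and branches in one loop listing."""
    mnemonics = [
        line.split()[0]
        for line in listing.splitlines()
        if line.strip() and not line.strip().endswith(":")  # skip blanks and labels
    ]
    return {
        "instructions": len(mnemonics),
        "loads": sum(m == "ldr" for m in mnemonics),
        "branches": sum(m in BRANCHES for m in mnemonics),
    }


print("inlined:    ", tally(INLINED))
print("non-inlined:", tally(NON_INLINED))
```

By this count the inlined loop has 12 instructions with two branches,
against 11 instructions with one branch for the non-inlined loop: the
load of d14 sits after the exit test and is followed by an unconditional
b .L122. Whether that accounts for the measured slowdown is not
established here.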
I am not sure why hoisting would have any direct effect on this loop,
except perhaps that the hoisted values occupy more registers and so
increase register pressure. Perhaps that's why the inlined version uses
higher-numbered registers in this loop? However, disabling hoisting in
blocks 173 and 181 also leads to 6 extra spills overall (counted by
grepping for str to sp), so hoisting is also helping here? I am not sure
how to proceed further, and would be grateful for suggestions.
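The spill count mentioned above can be reproduced with a small script
along these lines. It is a sketch under assumptions: it treats every str
or stp whose address is based on sp as a spill, so it also counts
prologue saves of callee-saved registers, much as a plain grep would.
The function name count_sp_stores and the sample text are made up for
illustration.

```python
import re

# A store (str/stp) whose memory operand is based on sp, for example
#   str d8, [sp, 16]      or      stp x29, x30, [sp, -64]!
# Narrower stores such as strb/strh are deliberately not matched.
_SP_STORE = re.compile(r"^\s*(str|stp)\b[^\[]*\[\s*sp\b")


def count_sp_stores(asm_text):
    """Count lines of assembly that store to an sp-based address."""
    return sum(1 for line in asm_text.splitlines() if _SP_STORE.match(line))


# Hypothetical sample, not taken from the e_c3d assembly.
sample = """\
        stp     x29, x30, [sp, -64]!
        str     d8, [sp, 16]
        ldr     d8, [sp, 16]
        str     d0, [x1, 8]
        strb    w0, [sp, 3]
"""

print(count_sp_stores(sample))
```

Running it over the assembly generated with and without the hoistings
and comparing the two counts would give the difference in sp-based
stores, under the same caveat about prologue saves.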
Thanks,
Prathamesh
[-- Attachment #2: e_c3d.f --]
[-- Type: application/octet-stream, Size: 47165 bytes --]
!
! CalculiX - A 3-dimensional finite element program
! Copyright (C) 1998 Guido Dhondt
!
! This program is free software; you can redistribute it and/or
! modify it under the terms of the GNU General Public License as
! published by the Free Software Foundation(version 2);
!
!
! This program is distributed in the hope that it will be useful,
! but WITHOUT ANY WARRANTY; without even the implied warranty of
! MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
! GNU General Public License for more details.
!
! You should have received a copy of the GNU General Public License
! along with this program; if not, write to the Free Software
! Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
!
subroutine e_c3d(co,nk,konl,lakonl,p1,p2,omx,bodyfx,ibod,s,sm,f,
& ff,nelem,nmethod,elcon,nelcon,rhcon,nrhcon,alcon,nalcon,alzero,
& ielmat,ielorien,norien,orab,ntmat_,
& t0,t1,ithermal,prestr,iprestr,vold,iperturb,nelemload,
& sideload,xload,nload,idist,sti,stx,eei,iexpl,plicon,
& nplicon,plkcon,nplkcon,xstiff,npmat_,dtime,
& matname,mint_,ncmat_,mass,stiffness,buckling,rhs,intscheme)
!
! computation of the element matrix and rhs for the element with
! the topology in konl
!
! f: rhs with temperature and eigenstress contribution: for linear
! calculations only
! ff: rhs without temperature and eigenstress contribution
!
! nmethod=0: check for positive Jacobian
! nmethod=1: stiffness matrix + right hand side
! nmethod=2: stiffness matrix + mass matrix
! nmethod=3: static stiffness + buckling stiffness
! nmethod=4: right hand side (linear, iperturb=0)
!
implicit none
!
logical mass,stiffness,buckling,rhs
!
character*5 sideload(*)
character*8 lakonl
character*20 matname(*),amat
!
integer konl(20),ifaceq(8,6),nelemload(2,*),nk,ibod,nelem,nmethod,
& mattyp,ithermal,iprestr,iperturb,nload,idist,i,j,k,l,i1,i2,j1,
& k1,l1,ii,jj,ii1,jj1,id,ipointer,ig,m1,m2,m3,m4,kk,
& nelcon(2,*),nrhcon(*),nalcon(2,*),ielmat(*),ielorien(*),
& ntmat_,nope,nopes,norien,icmdl,ihyper,iexpl,kode,imat,mint2d,
& mint3d,mint_,ifacet(6,4),nopev,iorien,istiff,ncmat_,
& ifacew(8,5),intscheme,n,ipointeri,ipointerj,iii1,jjj1,n1
!
integer nplicon(0:ntmat_,*),nplkcon(0:ntmat_,*),npmat_
!
real*8 co(3,*),xl(3,20),shp(4,20),
& s(60,60),w(3,3),p1(3),p2(3),f(60),bodyf(3),bodyfx(3),ff(60),
& bf(3),q(3),shpj(4,20),elcon(0:ncmat_,ntmat_,*),
& rhcon(0:1,ntmat_,*),xkl(3,3),
& alcon(0:6,ntmat_,*),alzero(*),orab(7,*),t0(*),t1(*),
& anisox(3,3,3,3),beta(6),prestr(6,*),voldl(3,20),vo(3,3),
& xl2(3,8),xsj2(3),shp2(4,8),vold(0:3,*),xload(2,*),v(3,3,3,3),
& om,omx,e,un,al,um,xi,et,ze,tt,const,xsj,xsjj,sm(60,60),
& sti(6,mint_,*),stx(6,mint_,*),s11,s22,s33,s12,s13,s23,s11b,
& s22b,s33b,s12b,s13b,s23b,eei(6,mint_,*),t0l,t1l,stre(6),
& senergy,senergyb,rho,elas(21),alph(6),summass,summ,
& sume,factorm,factore,alp,elconloc(21),eth(6),exx,eyy,ezz,
& exy,exz,eyz,am1,weight,pgauss(3),dmass,xl1(3,8),term
!
real*8 gauss2d1(2,1),gauss2d2(2,4),gauss2d3(2,9),gauss2d4(2,1),
& gauss2d5(2,3),gauss3d1(3,1),gauss3d2(3,8),gauss3d3(3,27),
& gauss3d4(3,1),gauss3d5(3,4),gauss3d6(3,15),gauss3d7(3,2),
& gauss3d8(3,9),gauss3d9(3,18),weight2d1(1),weight2d2(4),
& weight2d3(9),weight2d4(1),weight2d5(3),weight3d1(1),
& weight3d2(8),weight3d3(27),weight3d4(1),weight3d5(4),
& weight3d6(15),weight3d7(2),weight3d8(9),weight3d9(18)
!
real*8 plicon(0:2*npmat_,ntmat_,*),plkcon(0:2*npmat_,ntmat_,*),
& xstiff(21,mint_,*),
& plconloc(82),dtime
!
include "gauss.f"
!
data ifaceq /4,3,2,1,11,10,9,12,
& 5,6,7,8,13,14,15,16,
& 1,2,6,5,9,18,13,17,
& 2,3,7,6,10,19,14,18,
& 3,4,8,7,11,20,15,19,
& 4,1,5,8,12,17,16,20/
data ifacet /1,3,2,7,6,5,
& 1,2,4,5,9,8,
& 2,3,4,6,10,9,
& 1,4,3,8,10,7/
data ifacew /1,3,2,9,8,7,0,0,
& 4,5,6,10,11,12,0,0,
& 1,2,5,4,7,14,10,13,
& 2,3,6,5,8,15,11,14,
& 4,6,3,1,12,15,9,13/
!
c if(nmethod.eq.5) then
c intscheme=1
c nmethod=2
c else
c intscheme=0
c endif
c!
c mass=.false.
c stiffness=.false.
c buckling=.false.
c rhs=.false.
c!
c if(nmethod.eq.1) then
c stiffness=.true.
c rhs=.true.
c elseif(nmethod.eq.2) then
c mass=.true.
c stiffness=.true.
c elseif(nmethod.eq.3) then
c stiffness=.true.
c buckling=.true.
c elseif(nmethod.eq.4) then
c rhs=.true.
c endif
!
summass=0.d0
!
imat=ielmat(nelem)
amat=matname(imat)
if(norien.gt.0) then
iorien=ielorien(nelem)
else
iorien=0
endif
!
if(lakonl(4:4).eq.'2') then
nope=20
nopev=8
nopes=8
elseif(lakonl(4:4).eq.'8') then
nope=8
nopev=8
nopes=4
elseif(lakonl(4:5).eq.'10') then
nope=10
nopev=4
nopes=6
elseif(lakonl(4:4).eq.'4') then
nope=4
nopev=4
nopes=3
elseif(lakonl(4:5).eq.'15') then
nope=15
nopev=6
else
nope=6
nopev=6
endif
!
if(intscheme.eq.0) then
if(lakonl(4:5).eq.'8R') then
mint2d=1
mint3d=1
elseif((lakonl(4:4).eq.'8').or.(lakonl(4:6).eq.'20R')) then
mint2d=4
mint3d=8
elseif(lakonl(4:4).eq.'2') then
mint2d=9
mint3d=27
elseif(lakonl(4:5).eq.'10') then
mint2d=3
mint3d=4
elseif(lakonl(4:4).eq.'4') then
mint2d=1
mint3d=1
elseif(lakonl(4:5).eq.'15') then
mint3d=9
else
mint3d=2
endif
else
if((lakonl(4:4).eq.'8').or.(lakonl(4:4).eq.'2')) then
mint3d=27
elseif((lakonl(4:5).eq.'10').or.(lakonl(4:4).eq.'4')) then
mint3d=15
else
mint3d=18
endif
endif
!
! computation of the coordinates of the local nodes
!
do i=1,nope
do j=1,3
xl(j,i)=co(j,konl(i))
enddo
enddo
!
if(nelcon(1,imat).lt.0) then
ihyper=1
else
ihyper=0
endif
!
! initialisation for distributed forces
!
if(rhs) then
if(idist.ne.0) then
do i=1,3*nope
f(i)=0.d0
ff(i)=0.d0
enddo
endif
endif
!
! displacements for 2nd order static and modal theory
!
if((iperturb.ne.0).and.stiffness.and.(.not.buckling)) then
do i1=1,nope
do i2=1,3
voldl(i2,i1)=vold(i2,konl(i1))
enddo
enddo
endif
!
! initialisation of sm
!
if(mass.or.buckling) then
do i=1,3*nope
do j=1,3*nope
sm(i,j)=0.d0
enddo
enddo
endif
!
! initialisation of s
!
do i=1,3*nope
do j=1,3*nope
s(i,j)=0.d0
enddo
enddo
!
! computation of the matrix: loop over the Gauss points
!
do kk=1,mint3d
if(intscheme.eq.0) then
if(lakonl(4:5).eq.'8R') then
xi=gauss3d1(1,kk)
et=gauss3d1(2,kk)
ze=gauss3d1(3,kk)
weight=weight3d1(kk)
elseif((lakonl(4:4).eq.'8').or.(lakonl(4:6).eq.'20R'))
& then
xi=gauss3d2(1,kk)
c if(nope.eq.20) xi=xi+1.d0
et=gauss3d2(2,kk)
ze=gauss3d2(3,kk)
weight=weight3d2(kk)
elseif(lakonl(4:4).eq.'2') then
c xi=gauss3d3(1,kk)+1.d0
xi=gauss3d3(1,kk)
et=gauss3d3(2,kk)
ze=gauss3d3(3,kk)
weight=weight3d3(kk)
elseif(lakonl(4:5).eq.'10') then
xi=gauss3d5(1,kk)
et=gauss3d5(2,kk)
ze=gauss3d5(3,kk)
weight=weight3d5(kk)
elseif(lakonl(4:4).eq.'4') then
xi=gauss3d4(1,kk)
et=gauss3d4(2,kk)
ze=gauss3d4(3,kk)
weight=weight3d4(kk)
elseif(lakonl(4:5).eq.'15') then
xi=gauss3d8(1,kk)
et=gauss3d8(2,kk)
ze=gauss3d8(3,kk)
weight=weight3d8(kk)
else
xi=gauss3d7(1,kk)
et=gauss3d7(2,kk)
ze=gauss3d7(3,kk)
weight=weight3d7(kk)
endif
else
if((lakonl(4:4).eq.'8').or.(lakonl(4:4).eq.'2')) then
c xi=gauss3d3(1,kk)+1.d0
xi=gauss3d3(1,kk)
et=gauss3d3(2,kk)
ze=gauss3d3(3,kk)
weight=weight3d3(kk)
elseif((lakonl(4:5).eq.'10').or.(lakonl(4:4).eq.'4')) then
xi=gauss3d6(1,kk)
et=gauss3d6(2,kk)
ze=gauss3d6(3,kk)
weight=weight3d6(kk)
else
xi=gauss3d9(1,kk)
et=gauss3d9(2,kk)
ze=gauss3d9(3,kk)
weight=weight3d9(kk)
endif
endif
!
! calculation of the shape functions and their derivatives
! in the gauss point
!
if(nope.eq.20) then
call shape20h(xi,et,ze,xl,xsj,shp)
elseif(nope.eq.8) then
call shape8h(xi,et,ze,xl,xsj,shp)
elseif(nope.eq.10) then
call shape10tet(xi,et,ze,xl,xsj,shp)
elseif(nope.eq.4) then
call shape4tet(xi,et,ze,xl,xsj,shp)
elseif(nope.eq.15) then
call shape15w(xi,et,ze,xl,xsj,shp)
else
call shape6w(xi,et,ze,xl,xsj,shp)
endif
!
! check the jacobian determinant
!
if(xsj.lt.1.d-20) then
write(*,*) '*WARNING in e_c3d: nonpositive jacobian'
write(*,*) ' determinant in element',nelem
write(*,*)
xsj=dabs(xsj)
nmethod=0
endif
!
if((iperturb.ne.0).and.stiffness.and.(.not.buckling))
& then
!
! stresses for 2nd order static and modal theory
!
s11=sti(1,kk,nelem)
s22=sti(2,kk,nelem)
s33=sti(3,kk,nelem)
s12=sti(4,kk,nelem)
s13=sti(5,kk,nelem)
s23=sti(6,kk,nelem)
endif
!
! calculating the temperature in the integration
! point
!
t0l=0.d0
t1l=0.d0
if(ithermal.eq.1) then
if(lakonl(4:5).eq.'8 ') then
do i1=1,nope
t0l=t0l+t0(konl(i1))/8.d0
t1l=t1l+t1(konl(i1))/8.d0
enddo
elseif(lakonl(4:6).eq.'20 ') then
call lintemp(t0,t1,konl,nope,kk,t0l,t1l)
else
do i1=1,nope
t0l=t0l+shp(4,i1)*t0(konl(i1))
t1l=t1l+shp(4,i1)*t1(konl(i1))
enddo
endif
elseif(ithermal.ge.2) then
if(lakonl(4:5).eq.'8 ') then
do i1=1,nope
t0l=t0l+t0(konl(i1))/8.d0
t1l=t1l+vold(0,konl(i1))/8.d0
enddo
elseif(lakonl(4:6).eq.'20 ') then
call lintemp_th(t0,vold,konl,nope,kk,t0l,t1l)
else
do i1=1,nope
t0l=t0l+shp(4,i1)*t0(konl(i1))
t1l=t1l+shp(4,i1)*vold(0,konl(i1))
enddo
endif
endif
tt=t1l-t0l
!
! calculating the coordinates of the integration point
! for material orientation purposes (for cylindrical
! coordinate systems)
!
if(iorien.gt.0) then
do j=1,3
pgauss(j)=0.d0
do i1=1,nope
pgauss(j)=pgauss(j)+shp(4,i1)*co(j,konl(i1))
enddo
enddo
endif
!
! for deformation plasticity: calculating the Jacobian
! and the inverse of the deformation gradient
! needed to convert the stiffness matrix in the spatial
! frame of reference to the material frame
!
kode=nelcon(1,imat)
!
! material data and local stiffness matrix
!
istiff=1
call materialdata(elcon,nelcon,rhcon,nrhcon,alcon,nalcon,
& imat,amat,iorien,pgauss,orab,ntmat_,elas,alph,rho,
& nelem,ithermal,alzero,mattyp,t0l,t1l,
& ihyper,istiff,elconloc,eth,kode,plicon,
& nplicon,plkcon,nplkcon,npmat_,
& plconloc,mint_,dtime,nelem,kk,
& xstiff,ncmat_)
!
if(mattyp.eq.1) then
e=elas(1)
un=elas(2)
um=e/(1.d0+un)
al=un*um/(1.d0-2.d0*un)
um=um/2.d0
elseif(mattyp.eq.2) then
call orthotropic(elas,anisox)
else
call anisotropic(elas,anisox)
endif
!
! initialisation for the body forces
!
om=omx*rho
if(rhs) then
if(ibod.ne.0) then
do ii=1,3
bodyf(ii)=bodyfx(ii)*rho
enddo
endif
endif
!
if(rhs) then
!
! information for the rhs
!
! residual stresses
!
if((iprestr.eq.1).or.(ithermal.eq.1)) then
if(iprestr.eq.0) then
do ii=1,6
beta(ii)=0.d0
enddo
else
do ii=1,6
beta(ii)=-prestr(ii,nelem)
enddo
endif
endif
!
! calculation of the thermal stresses in an undeformed body
! assumption; beta corresponds to initial stresses.
!
if(ithermal.eq.1) then
if(ihyper.eq.0) then
icmdl=2
call linel(ithermal,mattyp,beta,al,um,am1,alph,tt,
& elas,icmdl,exx,eyy,ezz,exy,exz,eyz,stre,
& anisox)
endif
endif
!
elseif(buckling) then
!
! buckling stresses
!
s11b=stx(1,kk,nelem)
s22b=stx(2,kk,nelem)
s33b=stx(3,kk,nelem)
s12b=stx(4,kk,nelem)
s13b=stx(5,kk,nelem)
s23b=stx(6,kk,nelem)
!
endif
!
! incorporating the jacobian determinant in the shape
! functions
!
xsjj=dsqrt(xsj)
do i1=1,nope
shpj(1,i1)=shp(1,i1)*xsjj
shpj(2,i1)=shp(2,i1)*xsjj
shpj(3,i1)=shp(3,i1)*xsjj
shpj(4,i1)=shp(4,i1)*xsj
enddo
!
! determination of the stiffness, and/or mass and/or
! buckling matrix
!
if(stiffness.or.mass.or.buckling) then
!
if((iperturb.eq.0).or.buckling)
& then
jj1=1
do jj=1,nope
!
ii1=1
do ii=1,jj
!
! all products of the shape functions for a given ii
! and jj
!
do i1=1,3
do j1=1,3
w(i1,j1)=shpj(i1,ii)*shpj(j1,jj)
enddo
enddo
!
! the following section calculates the static
! part of the stiffness matrix which, for buckling
! calculations, is done in a preliminary static
! call
!
if(.not.buckling) then
!
if(mattyp.eq.1) then
!
s(ii1,jj1)=s(ii1,jj1)+(al*w(1,1)+
& um*(2.d0*w(1,1)+w(2,2)+w(3,3)))*weight
s(ii1,jj1+1)=s(ii1,jj1+1)+(al*w(1,2)+
& um*w(2,1))*weight
s(ii1,jj1+2)=s(ii1,jj1+2)+(al*w(1,3)+
& um*w(3,1))*weight
s(ii1+1,jj1)=s(ii1+1,jj1)+(al*w(2,1)+
& um*w(1,2))*weight
s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+(al*w(2,2)+
& um*(2.d0*w(2,2)+w(1,1)+w(3,3)))*weight
s(ii1+1,jj1+2)=s(ii1+1,jj1+2)+(al*w(2,3)+
& um*w(3,2))*weight
s(ii1+2,jj1)=s(ii1+2,jj1)+(al*w(3,1)+
& um*w(1,3))*weight
s(ii1+2,jj1+1)=s(ii1+2,jj1+1)+(al*w(3,2)+
& um*w(2,3))*weight
s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+(al*w(3,3)+
& um*(2.d0*w(3,3)+w(2,2)+w(1,1)))*weight
!
elseif(mattyp.eq.2) then
!
s(ii1,jj1)=s(ii1,jj1)+(elas(1)*w(1,1)+
& elas(7)*w(2,2)+elas(8)*w(3,3))*weight
s(ii1,jj1+1)=s(ii1,jj1+1)+(elas(2)*w(1,2)+
& elas(7)*w(2,1))*weight
s(ii1,jj1+2)=s(ii1,jj1+2)+(elas(4)*w(1,3)+
& elas(8)*w(3,1))*weight
s(ii1+1,jj1)=s(ii1+1,jj1)+(elas(7)*w(1,2)+
& elas(2)*w(2,1))*weight
s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+
& (elas(7)*w(1,1)+
& elas(3)*w(2,2)+elas(9)*w(3,3))*weight
s(ii1+1,jj1+2)=s(ii1+1,jj1+2)+
& (elas(5)*w(2,3)+
& elas(9)*w(3,2))*weight
s(ii1+2,jj1)=s(ii1+2,jj1)+
& (elas(8)*w(1,3)+
& elas(4)*w(3,1))*weight
s(ii1+2,jj1+1)=s(ii1+2,jj1+1)+
& (elas(9)*w(2,3)+
& elas(5)*w(3,2))*weight
s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+
& (elas(8)*w(1,1)+
& elas(9)*w(2,2)+elas(6)*w(3,3))*weight
!
else
!
do i1=1,3
do j1=1,3
do k1=1,3
do l1=1,3
s(ii1+i1-1,jj1+j1-1)=
& s(ii1+i1-1,jj1+j1-1)
& +anisox(i1,k1,j1,l1)
& *w(k1,l1)*weight
enddo
enddo
enddo
enddo
!
endif
!
! mass matrix
!
if(mass) then
sm(ii1,jj1)=sm(ii1,jj1)
& +rho*shpj(4,ii)*shp(4,jj)*weight
sm(ii1+1,jj1+1)=sm(ii1,jj1)
sm(ii1+2,jj1+2)=sm(ii1,jj1)
endif
!
else
!
! buckling matrix
!
senergyb=
& (s11b*w(1,1)+s12b*(w(1,2)+w(2,1))
& +s13b*(w(1,3)+w(3,1))+s22b*w(2,2)
& +s23b*(w(2,3)+w(3,2))+s33b*w(3,3))*weight
sm(ii1,jj1)=sm(ii1,jj1)-senergyb
sm(ii1+1,jj1+1)=sm(ii1+1,jj1+1)-senergyb
sm(ii1+2,jj1+2)=sm(ii1+2,jj1+2)-senergyb
!
endif
!
ii1=ii1+3
enddo
jj1=jj1+3
enddo
else
!
! stiffness matrix for static and modal
! 2nd order calculations
!
! large displacement stiffness
!
do i1=1,3
do j1=1,3
vo(i1,j1)=0.d0
do k1=1,nope
vo(i1,j1)=vo(i1,j1)+shp(j1,k1)*voldl(i1,k1)
enddo
enddo
enddo
!
if(mattyp.eq.1) then
call wcoef(v,vo,al,um)
endif
!
! calculating the total mass of the element for
! lumping purposes: only for explicit nonlinear
! dynamic calculations
!
if(mass.and.(iexpl.eq.1)) then
summass=summass+rho*xsj
endif
!
jj1=1
do jj=1,nope
!
ii1=1
do ii=1,jj
!
! all products of the shape functions for a given ii
! and jj
!
do i1=1,3
do j1=1,3
w(i1,j1)=shpj(i1,ii)*shpj(j1,jj)
enddo
enddo
!
if(mattyp.eq.1) then
!
do m1=1,3
do m2=1,3
do m3=1,3
do m4=1,3
s(ii1+m2-1,jj1+m1-1)=
& s(ii1+m2-1,jj1+m1-1)
& +v(m4,m3,m2,m1)*w(m4,m3)*weight
enddo
enddo
enddo
enddo
!
elseif(mattyp.eq.2) then
!
! call orthonl(w,vo,elas,s,ii1,jj1,weight)
s(ii1,jj1)=s(ii1,jj1)+((elas( 1)+elas( 1)*vo(1,1)
&+(elas( 1)+elas( 1)*vo(1,1)
&)*vo(1,1)+(elas( 7)*vo(1,2))*vo(1,2)
&+(elas( 8)*vo(1,3))
&*vo(1,3))*w(1,1)
&+(elas( 2)*vo(1,2)+(elas( 2)*vo(1,2))*vo(1,1)+(elas( 7)
&+elas( 7)*vo(1,1))*vo(1,2)
&)*w(1,2)
&+(elas( 4)*vo(1,3)+(elas( 4)*vo(1,3))*vo(1,1)
&+(elas( 8)+elas( 8)*vo(1,1))
&*vo(1,3))*w(1,3)
&+(elas( 7)*vo(1,2)+(elas( 7)*vo(1,2))*vo(1,1)+(elas( 2)
&+elas( 2)*vo(1,1))*vo(1,2)
&)*w(2,1)
&+(elas( 7)+elas( 7)*vo(1,1)
&+(elas( 7)+elas( 7)*vo(1,1)
&)*vo(1,1)+(elas( 3)*vo(1,2))*vo(1,2)
&+(elas( 9)*vo(1,3))
&*vo(1,3))*w(2,2)
&+((elas( 5)*vo(1,3))*vo(1,2)
&+(elas( 9)*vo(1,2))
&*vo(1,3))*w(2,3)
&+(elas( 8)*vo(1,3)+(elas( 8)*vo(1,3))*vo(1,1)
&+(elas( 4)+elas( 4)*vo(1,1))
&*vo(1,3))*w(3,1)
&+((elas( 9)*vo(1,3))*vo(1,2)
&+(elas( 5)*vo(1,2))
&*vo(1,3))*w(3,2)
&+(elas( 8)+elas( 8)*vo(1,1)
&+(elas( 8)+elas( 8)*vo(1,1)
&)*vo(1,1)+(elas( 9)*vo(1,2))*vo(1,2)
&+(elas( 6)*vo(1,3))
&*vo(1,3))*w(3,3))*weight
s(ii1,jj1+1)=s(ii1,jj1+1)+((elas( 1)*vo(2,1)
&+(elas( 1)*vo(2,1)
&)*vo(1,1)+(elas( 7)
&+elas( 7)*vo(2,2))*vo(1,2)
&+(elas( 8)*vo(2,3))
&*vo(1,3))*w(1,1)
&+(elas( 2)
&+elas( 2)*vo(2,2)+(elas( 2)
&+elas( 2)*vo(2,2))*vo(1,1)+(elas( 7)*vo(2,1))*vo(1,2)
&)*w(1,2)
&+(elas( 4)*vo(2,3)+(elas( 4)*vo(2,3))*vo(1,1)
&+(elas( 8)*vo(2,1))
&*vo(1,3))*w(1,3)
&+(elas( 7)
&+elas( 7)*vo(2,2)+(elas( 7)
&+elas( 7)*vo(2,2))*vo(1,1)+(elas( 2)*vo(2,1))*vo(1,2)
&)*w(2,1)
&+(elas( 7)*vo(2,1)
&+(elas( 7)*vo(2,1)
&)*vo(1,1)+(elas( 3)
&+elas( 3)*vo(2,2))*vo(1,2)
&+(elas( 9)*vo(2,3))
&*vo(1,3))*w(2,2)
&+((elas( 5)*vo(2,3))*vo(1,2)
&+(elas( 9)+elas( 9)*vo(2,2))
&*vo(1,3))*w(2,3)
&+(elas( 8)*vo(2,3)+(elas( 8)*vo(2,3))*vo(1,1)
&+(elas( 4)*vo(2,1))
&*vo(1,3))*w(3,1)
&+((elas( 9)*vo(2,3))*vo(1,2)
&+(elas( 5)+elas( 5)*vo(2,2))
&*vo(1,3))*w(3,2)
&+(elas( 8)*vo(2,1)
&+(elas( 8)*vo(2,1)
&)*vo(1,1)+(elas( 9)
&+elas( 9)*vo(2,2))*vo(1,2)
&+(elas( 6)*vo(2,3))
&*vo(1,3))*w(3,3))*weight
s(ii1,jj1+2)=s(ii1,jj1+2)+((elas( 1)*vo(3,1)
&+(elas( 1)*vo(3,1)
&)*vo(1,1)+(elas( 7)*vo(3,2))*vo(1,2)
&+(elas( 8)+elas( 8)*vo(3,3))
&*vo(1,3))*w(1,1)
&+(elas( 2)*vo(3,2)
&+(elas( 2)*vo(3,2))*vo(1,1)+(elas( 7)*vo(3,1))*vo(1,2)
&)*w(1,2)
&+(elas( 4)
&+elas( 4)*vo(3,3)+(elas( 4)
&+elas( 4)*vo(3,3))*vo(1,1)
&+(elas( 8)*vo(3,1))
&*vo(1,3))*w(1,3)
&+(elas( 7)*vo(3,2)+(elas( 7)*vo(3,2))*vo(1,1)
&+(elas( 2)*vo(3,1))*vo(1,2)
&)*w(2,1)
&+(elas( 7)*vo(3,1)
&+(elas( 7)*vo(3,1)
&)*vo(1,1)+(elas( 3)*vo(3,2))*vo(1,2)
&+(elas( 9)+elas( 9)*vo(3,3))
&*vo(1,3))*w(2,2)
&+((elas( 5)
&+elas( 5)*vo(3,3))*vo(1,2)
&+(elas( 9)*vo(3,2))
&*vo(1,3))*w(2,3)
&+(elas( 8)
&+elas( 8)*vo(3,3)+(elas( 8)
&+elas( 8)*vo(3,3))*vo(1,1)
&+(elas( 4)*vo(3,1))
&*vo(1,3))*w(3,1)
&+((elas( 9)
&+elas( 9)*vo(3,3))*vo(1,2)
&+(elas( 5)*vo(3,2))
&*vo(1,3))*w(3,2)
&+(elas( 8)*vo(3,1)
&+(elas( 8)*vo(3,1)
&)*vo(1,1)+(elas( 9)*vo(3,2))*vo(1,2)
&+(elas( 6)+elas( 6)*vo(3,3))
&*vo(1,3))*w(3,3))*weight
s(ii1+1,jj1)=s(ii1+1,jj1)+((elas( 7)*vo(1,2)
&+(elas( 1)+elas( 1)*vo(1,1)
&)*vo(2,1)+(elas( 7)*vo(1,2))*vo(2,2)
&+(elas( 8)*vo(1,3))
&*vo(2,3))*w(1,1)
&+(elas( 7)+elas( 7)*vo(1,1)
&+(elas( 2)*vo(1,2))*vo(2,1)+(elas( 7)
&+elas( 7)*vo(1,1))*vo(2,2)
&)*w(1,2)
&+((elas( 4)*vo(1,3))*vo(2,1)
&+(elas( 8)+elas( 8)*vo(1,1))
&*vo(2,3))*w(1,3)
&+(elas( 2)+elas( 2)*vo(1,1)
&+(elas( 7)*vo(1,2))*vo(2,1)+(elas( 2)
&+elas( 2)*vo(1,1))*vo(2,2)
&)*w(2,1)
&+(elas( 3)*vo(1,2)+(elas( 7)+elas( 7)*vo(1,1)
&)*vo(2,1)+(elas( 3)*vo(1,2))*vo(2,2)
&+(elas( 9)*vo(1,3))
&*vo(2,3))*w(2,2)
&+(elas( 5)*vo(1,3)+(elas( 5)*vo(1,3))*vo(2,2)
&+(elas( 9)*vo(1,2))
&*vo(2,3))*w(2,3)
&+((elas( 8)*vo(1,3))*vo(2,1)
&+(elas( 4)+elas( 4)*vo(1,1))
&*vo(2,3))*w(3,1)
&+(elas( 9)*vo(1,3)+(elas( 9)*vo(1,3))*vo(2,2)
&+(elas( 5)*vo(1,2))
&*vo(2,3))*w(3,2)
&+(elas( 9)*vo(1,2)+(elas( 8)+elas( 8)*vo(1,1)
&)*vo(2,1)+(elas( 9)*vo(1,2))*vo(2,2)
&+(elas( 6)*vo(1,3))
&*vo(2,3))*w(3,3))*weight
s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+((elas( 7)
&+elas( 7)*vo(2,2)+(elas( 1)*vo(2,1)
&)*vo(2,1)+(elas( 7)
&+elas( 7)*vo(2,2))*vo(2,2)
&+(elas( 8)*vo(2,3))
&*vo(2,3))*w(1,1)
&+(elas( 7)*vo(2,1)
&+(elas( 2)
&+elas( 2)*vo(2,2))*vo(2,1)+(elas( 7)*vo(2,1))*vo(2,2)
&)*w(1,2)
&+((elas( 4)*vo(2,3))*vo(2,1)
&+(elas( 8)*vo(2,1))
&*vo(2,3))*w(1,3)
&+(elas( 2)*vo(2,1)
&+(elas( 7)
&+elas( 7)*vo(2,2))*vo(2,1)+(elas( 2)*vo(2,1))*vo(2,2)
&)*w(2,1)
&+(elas( 3)
&+elas( 3)*vo(2,2)+(elas( 7)*vo(2,1)
&)*vo(2,1)+(elas( 3)
&+elas( 3)*vo(2,2))*vo(2,2)
&+(elas( 9)*vo(2,3))
&*vo(2,3))*w(2,2)
&+(elas( 5)*vo(2,3)+(elas( 5)*vo(2,3))*vo(2,2)
&+(elas( 9)+elas( 9)*vo(2,2))
&*vo(2,3))*w(2,3)
&+((elas( 8)*vo(2,3))*vo(2,1)
&+(elas( 4)*vo(2,1))
&*vo(2,3))*w(3,1)
&+(elas( 9)*vo(2,3)+(elas( 9)*vo(2,3))*vo(2,2)
&+(elas( 5)+elas( 5)*vo(2,2))
&*vo(2,3))*w(3,2)
&+(elas( 9)
&+elas( 9)*vo(2,2)+(elas( 8)*vo(2,1)
&)*vo(2,1)+(elas( 9)
&+elas( 9)*vo(2,2))*vo(2,2)
&+(elas( 6)*vo(2,3))
&*vo(2,3))*w(3,3))*weight
s(ii1+1,jj1+2)=s(ii1+1,jj1+2)+((elas( 7)*vo(3,2)+(elas( 1)*vo(3,1)
&)*vo(2,1)+(elas( 7)*vo(3,2))*vo(2,2)
&+(elas( 8)+elas( 8)*vo(3,3))
&*vo(2,3))*w(1,1)
&+(elas( 7)*vo(3,1)
&+(elas( 2)*vo(3,2))*vo(2,1)+(elas( 7)*vo(3,1))*vo(2,2)
&)*w(1,2)
&+((elas( 4)
&+elas( 4)*vo(3,3))*vo(2,1)
&+(elas( 8)*vo(3,1))
&*vo(2,3))*w(1,3)
&+(elas( 2)*vo(3,1)
&+(elas( 7)*vo(3,2))*vo(2,1)+(elas( 2)*vo(3,1))*vo(2,2)
&)*w(2,1)
&+(elas( 3)*vo(3,2)+(elas( 7)*vo(3,1)
&)*vo(2,1)+(elas( 3)*vo(3,2))*vo(2,2)
&+(elas( 9)+elas( 9)*vo(3,3))
&*vo(2,3))*w(2,2)
&+(elas( 5)
&+elas( 5)*vo(3,3)+(elas( 5)
&+elas( 5)*vo(3,3))*vo(2,2)
&+(elas( 9)*vo(3,2))
&*vo(2,3))*w(2,3)
&+((elas( 8)
&+elas( 8)*vo(3,3))*vo(2,1)
&+(elas( 4)*vo(3,1))
&*vo(2,3))*w(3,1)
&+(elas( 9)
&+elas( 9)*vo(3,3)+(elas( 9)
&+elas( 9)*vo(3,3))*vo(2,2)
&+(elas( 5)*vo(3,2))
&*vo(2,3))*w(3,2)
&+(elas( 9)*vo(3,2)+(elas( 8)*vo(3,1)
&)*vo(2,1)+(elas( 9)*vo(3,2))*vo(2,2)
&+(elas( 6)+elas( 6)*vo(3,3))
&*vo(2,3))*w(3,3))*weight
s(ii1+2,jj1)=s(ii1+2,jj1)+((elas( 8)*vo(1,3)
&+(elas( 1)+elas( 1)*vo(1,1)
&)*vo(3,1)+(elas( 7)*vo(1,2))*vo(3,2)
&+(elas( 8)*vo(1,3))
&*vo(3,3))*w(1,1)
&+((elas( 2)*vo(1,2))*vo(3,1)+(elas( 7)
&+elas( 7)*vo(1,1))*vo(3,2)
&)*w(1,2)
&+(elas( 8)+elas( 8)*vo(1,1)
&+(elas( 4)*vo(1,3))*vo(3,1)
&+(elas( 8)+elas( 8)*vo(1,1))
&*vo(3,3))*w(1,3)
&+((elas( 7)*vo(1,2))*vo(3,1)+(elas( 2)
&+elas( 2)*vo(1,1))*vo(3,2)
&)*w(2,1)
&+(elas( 9)*vo(1,3)+(elas( 7)+elas( 7)*vo(1,1)
&)*vo(3,1)+(elas( 3)*vo(1,2))*vo(3,2)
&+(elas( 9)*vo(1,3))
&*vo(3,3))*w(2,2)
&+(elas( 9)*vo(1,2)+(elas( 5)*vo(1,3))*vo(3,2)
&+(elas( 9)*vo(1,2))
&*vo(3,3))*w(2,3)
&+(elas( 4)+elas( 4)*vo(1,1)
&+(elas( 8)*vo(1,3))*vo(3,1)
&+(elas( 4)+elas( 4)*vo(1,1))
&*vo(3,3))*w(3,1)
&+(elas( 5)*vo(1,2)+(elas( 9)*vo(1,3))*vo(3,2)
&+(elas( 5)*vo(1,2))
&*vo(3,3))*w(3,2)
&+(elas( 6)*vo(1,3)+(elas( 8)+elas( 8)*vo(1,1)
&)*vo(3,1)+(elas( 9)*vo(1,2))*vo(3,2)
&+(elas( 6)*vo(1,3))
&*vo(3,3))*w(3,3))*weight
s(ii1+2,jj1+1)=s(ii1+2,jj1+1)+((elas( 8)*vo(2,3)
&+(elas( 1)*vo(2,1)
&)*vo(3,1)+(elas( 7)
&+elas( 7)*vo(2,2))*vo(3,2)
&+(elas( 8)*vo(2,3))
&*vo(3,3))*w(1,1)
&+((elas( 2)
&+elas( 2)*vo(2,2))*vo(3,1)+(elas( 7)*vo(2,1))*vo(3,2)
&)*w(1,2)
&+(elas( 8)*vo(2,1)
&+(elas( 4)*vo(2,3))*vo(3,1)
&+(elas( 8)*vo(2,1))
&*vo(3,3))*w(1,3)
&+((elas( 7)
&+elas( 7)*vo(2,2))*vo(3,1)+(elas( 2)*vo(2,1))*vo(3,2)
&)*w(2,1)
&+(elas( 9)*vo(2,3)+(elas( 7)*vo(2,1)
&)*vo(3,1)+(elas( 3)
&+elas( 3)*vo(2,2))*vo(3,2)
&+(elas( 9)*vo(2,3))
&*vo(3,3))*w(2,2)
&+(elas( 9)
&+elas( 9)*vo(2,2)+(elas( 5)*vo(2,3))*vo(3,2)
&+(elas( 9)+elas( 9)*vo(2,2))
&*vo(3,3))*w(2,3)
&+(elas( 4)*vo(2,1)
&+(elas( 8)*vo(2,3))*vo(3,1)
&+(elas( 4)*vo(2,1))
&*vo(3,3))*w(3,1)
&+(elas( 5)
&+elas( 5)*vo(2,2)+(elas( 9)*vo(2,3))*vo(3,2)
&+(elas( 5)+elas( 5)*vo(2,2))
&*vo(3,3))*w(3,2)
&+(elas( 6)*vo(2,3)+(elas( 8)*vo(2,1)
&)*vo(3,1)+(elas( 9)
&+elas( 9)*vo(2,2))*vo(3,2)
&+(elas( 6)*vo(2,3))
&*vo(3,3))*w(3,3))*weight
s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+((elas( 8)
&+elas( 8)*vo(3,3)+(elas( 1)*vo(3,1)
&)*vo(3,1)+(elas( 7)*vo(3,2))*vo(3,2)
&+(elas( 8)+elas( 8)*vo(3,3))
&*vo(3,3))*w(1,1)
&+((elas( 2)*vo(3,2))*vo(3,1)+(elas( 7)*vo(3,1))*vo(3,2)
&)*w(1,2)
&+(elas( 8)*vo(3,1)
&+(elas( 4)
&+elas( 4)*vo(3,3))*vo(3,1)
&+(elas( 8)*vo(3,1))
&*vo(3,3))*w(1,3)
&+((elas( 7)*vo(3,2))*vo(3,1)+(elas( 2)*vo(3,1))*vo(3,2)
&)*w(2,1)
&+(elas( 9)
&+elas( 9)*vo(3,3)+(elas( 7)*vo(3,1)
&)*vo(3,1)+(elas( 3)*vo(3,2))*vo(3,2)
&+(elas( 9)+elas( 9)*vo(3,3))
&*vo(3,3))*w(2,2)
&+(elas( 9)*vo(3,2)+(elas( 5)
&+elas( 5)*vo(3,3))*vo(3,2)
&+(elas( 9)*vo(3,2))
&*vo(3,3))*w(2,3)
&+(elas( 4)*vo(3,1)
&+(elas( 8)
&+elas( 8)*vo(3,3))*vo(3,1)
&+(elas( 4)*vo(3,1))
&*vo(3,3))*w(3,1)
&+(elas( 5)*vo(3,2)+(elas( 9)
&+elas( 9)*vo(3,3))*vo(3,2)
&+(elas( 5)*vo(3,2))
&*vo(3,3))*w(3,2)
&+(elas( 6)
&+elas( 6)*vo(3,3)+(elas( 8)*vo(3,1)
&)*vo(3,1)+(elas( 9)*vo(3,2))*vo(3,2)
&+(elas( 6)+elas( 6)*vo(3,3))
&*vo(3,3))*w(3,3))*weight
!
else
!
do i1=1,3
iii1=ii1+i1-1
do j1=1,3
jjj1=jj1+j1-1
do k1=1,3
do l1=1,3
s(iii1,jjj1)=s(iii1,jjj1)
& +anisox(i1,k1,j1,l1)*w(k1,l1)*weight
do m1=1,3
s(iii1,jjj1)=s(iii1,jjj1)
& +anisox(i1,k1,m1,l1)*w(k1,l1)
& *vo(j1,m1)*weight
& +anisox(m1,k1,j1,l1)*w(k1,l1)
& *vo(i1,m1)*weight
do n1=1,3
s(iii1,jjj1)=s(iii1,jjj1)
& +anisox(m1,k1,n1,l1)
& *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
& *weight
enddo
enddo
enddo
enddo
enddo
enddo
!SPEC: The immediately preceding loop nest is also available in
!SPEC: program-generated (much longer) form from the author's
!SPEC: website (see 454.calculix/Docs) in file anisonl.f
!SPEC:
!SPEC: call anisonl(w,vo,elas,s,ii1,jj1,weight)
!SPEC:
endif
!
! stress stiffness
!
senergy=
& (s11*w(1,1)+s12*(w(1,2)+w(2,1))
& +s13*(w(1,3)+w(3,1))+s22*w(2,2)
& +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
s(ii1,jj1)=s(ii1,jj1)+senergy
s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
!
! mass matrix
!
if(mass) then
sm(ii1,jj1)=sm(ii1,jj1)
& +rho*shpj(4,ii)*shp(4,jj)*weight
sm(ii1+1,jj1+1)=sm(ii1,jj1)
sm(ii1+2,jj1+2)=sm(ii1,jj1)
endif
!
! stiffness contribution of centrifugal forces
!
if(mass.and.(om.gt.0.d0)) then
dmass=shpj(4,ii)*shp(4,jj)*weight*om
do m1=1,3
s(ii1+m1-1,jj1+m1-1)=s(ii1+m1-1,jj1+m1-1)-
& dmass
do m2=1,3
s(ii1+m1-1,jj1+m2-1)=s(ii1+m1-1,jj1+m2-1)+
& dmass*p2(m1)*p2(m2)
enddo
enddo
endif
!
ii1=ii1+3
enddo
jj1=jj1+3
enddo
endif
!
endif
!
! computation of the right hand side
!
if(rhs) then
!
! body forces
!
if(ibod.ne.0) then
if(om.gt.0.d0) then
do i1=1,3
!
! computation of the global coordinates of the gauss
! point
!
q(i1)=0.d0
if(iperturb.eq.0) then
do j1=1,nope
q(i1)=q(i1)+shp(4,j1)*xl(i1,j1)
enddo
else
do j1=1,nope
q(i1)=q(i1)+shp(4,j1)*
& (xl(i1,j1)+voldl(i1,j1))
enddo
endif
!
q(i1)=q(i1)-p1(i1)
enddo
const=q(1)*p2(1)+q(2)*p2(2)+q(3)*p2(3)
!
! inclusion of the centrifugal force into the body force
!
do i1=1,3
bf(i1)=bodyf(i1)+(q(i1)-const*p2(i1))*om
enddo
else
do i1=1,3
bf(i1)=bodyf(i1)
enddo
endif
jj1=1
do jj=1,nope
f(jj1)=f(jj1)+bf(1)*shpj(4,jj)*weight
f(jj1+1)=f(jj1+1)+bf(2)*shpj(4,jj)*weight
f(jj1+2)=f(jj1+2)+bf(3)*shpj(4,jj)*weight
ff(jj1)=ff(jj1)+bf(1)*shpj(4,jj)*weight
ff(jj1+1)=ff(jj1+1)+bf(2)*shpj(4,jj)*weight
ff(jj1+2)=ff(jj1+2)+bf(3)*shpj(4,jj)*weight
jj1=jj1+3
enddo
endif
!
! thermal stresses and/or residual stresses
!
if((ithermal.ne.0).or.(iprestr.ne.0)) then
do jj=1,6
beta(jj)=beta(jj)*xsj
enddo
jj1=1
do jj=1,nope
f(jj1)=f(jj1)+(shp(1,jj)*beta(1)+
& (shp(2,jj)*beta(4)+shp(3,jj)*beta(5))/2.d0)
& *weight
f(jj1+1)=f(jj1+1)+(shp(2,jj)*beta(2)+
& (shp(1,jj)*beta(4)+shp(3,jj)*beta(6))/2.d0)
& *weight
f(jj1+2)=f(jj1+2)+(shp(3,jj)*beta(3)+
& (shp(1,jj)*beta(5)+shp(2,jj)*beta(6))/2.d0)
& *weight
jj1=jj1+3
enddo
endif
!
endif
!
enddo
!
c write(*,'(6(1x,e11.4))') ((s(i1,j1),i1=1,j1),j1=1,60)
c write(*,*)
c
if(.not.buckling) then
!
! distributed loads
!
if(nload.eq.0) then
return
endif
call nident2(nelemload,nelem,nload,id)
do
if((id.eq.0).or.(nelemload(1,id).ne.nelem)) exit
read(sideload(id)(2:2),'(i1)') ig
!
! treatment of wedge faces
!
if(lakonl(4:4).eq.'6') then
mint2d=1
if(ig.le.2) then
nopes=3
else
nopes=4
endif
endif
if(lakonl(4:5).eq.'15') then
if(ig.le.2) then
mint2d=3
nopes=6
else
mint2d=4
nopes=8
endif
endif
!
if((nope.eq.20).or.(nope.eq.8)) then
if(iperturb.eq.0) then
do i=1,nopes
do j=1,3
xl2(j,i)=co(j,konl(ifaceq(i,ig)))
enddo
enddo
else
if(mass) then
do i=1,nopes
do j=1,3
xl1(j,i)=co(j,konl(ifaceq(i,ig)))
enddo
enddo
endif
do i=1,nopes
do j=1,3
xl2(j,i)=co(j,konl(ifaceq(i,ig)))+
& vold(j,konl(ifaceq(i,ig)))
enddo
enddo
endif
elseif((nope.eq.10).or.(nope.eq.4)) then
if(iperturb.eq.0) then
do i=1,nopes
do j=1,3
xl2(j,i)=co(j,konl(ifacet(i,ig)))
enddo
enddo
else
if(mass) then
do i=1,nopes
do j=1,3
xl1(j,i)=co(j,konl(ifacet(i,ig)))
enddo
enddo
endif
do i=1,nopes
do j=1,3
xl2(j,i)=co(j,konl(ifacet(i,ig)))+
& vold(j,konl(ifacet(i,ig)))
enddo
enddo
endif
else
if(iperturb.eq.0) then
do i=1,nopes
do j=1,3
xl2(j,i)=co(j,konl(ifacew(i,ig)))
enddo
enddo
else
if(mass) then
do i=1,nopes
do j=1,3
xl1(j,i)=co(j,konl(ifacew(i,ig)))
enddo
enddo
endif
do i=1,nopes
do j=1,3
xl2(j,i)=co(j,konl(ifacew(i,ig)))+
& vold(j,konl(ifacew(i,ig)))
enddo
enddo
endif
endif
!
do i=1,mint2d
if((lakonl(4:5).eq.'8R').or.
& ((lakonl(4:4).eq.'6').and.(nopes.eq.4))) then
xi=gauss2d1(1,i)
et=gauss2d1(2,i)
weight=weight2d1(i)
elseif((lakonl(4:4).eq.'8').or.
& (lakonl(4:6).eq.'20R').or.
& ((lakonl(4:5).eq.'15').and.(nopes.eq.8))) then
xi=gauss2d2(1,i)
et=gauss2d2(2,i)
weight=weight2d2(i)
elseif(lakonl(4:4).eq.'2') then
xi=gauss2d3(1,i)
et=gauss2d3(2,i)
weight=weight2d3(i)
elseif((lakonl(4:5).eq.'10').or.
& ((lakonl(4:5).eq.'15').and.(nopes.eq.6))) then
xi=gauss2d5(1,i)
et=gauss2d5(2,i)
weight=weight2d5(i)
elseif((lakonl(4:4).eq.'4').or.
& ((lakonl(4:4).eq.'6').and.(nopes.eq.3))) then
xi=gauss2d4(1,i)
et=gauss2d4(2,i)
weight=weight2d4(i)
endif
!
if(rhs) then
if(nopes.eq.8) then
call shape8q(xi,et,xl2,xsj2,shp2)
elseif(nopes.eq.4) then
call shape4q(xi,et,xl2,xsj2,shp2)
elseif(nopes.eq.6) then
call shape6tri(xi,et,xl2,xsj2,shp2)
else
call shape3tri(xi,et,xl2,xsj2,shp2)
endif
!
do k=1,nopes
if((nope.eq.20).or.(nope.eq.8)) then
ipointer=(ifaceq(k,ig)-1)*3
elseif((nope.eq.10).or.(nope.eq.4)) then
ipointer=(ifacet(k,ig)-1)*3
else
ipointer=(ifacew(k,ig)-1)*3
endif
f(ipointer+1)=f(ipointer+1)-shp2(4,k)*xload(1,id)
& *xsj2(1)*weight
f(ipointer+2)=f(ipointer+2)-shp2(4,k)*xload(1,id)
& *xsj2(2)*weight
f(ipointer+3)=f(ipointer+3)-shp2(4,k)*xload(1,id)
& *xsj2(3)*weight
ff(ipointer+1)=ff(ipointer+1)-shp2(4,k)*xload(1,id)
& *xsj2(1)*weight
ff(ipointer+2)=ff(ipointer+2)-shp2(4,k)*xload(1,id)
& *xsj2(2)*weight
ff(ipointer+3)=ff(ipointer+3)-shp2(4,k)*xload(1,id)
& *xsj2(3)*weight
enddo
!
! stiffness contribution of the distributed load
!
elseif(mass) then
if(nopes.eq.8) then
call shape8q(xi,et,xl1,xsj2,shp2)
elseif(nopes.eq.4) then
call shape4q(xi,et,xl1,xsj2,shp2)
elseif(nopes.eq.6) then
call shape6tri(xi,et,xl1,xsj2,shp2)
else
call shape3tri(xi,et,xl1,xsj2,shp2)
endif
!
! calculation of the deformation gradient
!
do k=1,3
do l=1,3
xkl(k,l)=0.d0
do ii=1,nopes
xkl(k,l)=xkl(k,l)+shp2(l,ii)*xl2(k,ii)
enddo
enddo
enddo
!
do ii=1,nopes
if((nope.eq.20).or.(nope.eq.8)) then
ipointeri=(ifaceq(ii,ig)-1)*3
elseif((nope.eq.10).or.(nope.eq.4)) then
ipointeri=(ifacet(ii,ig)-1)*3
else
ipointeri=(ifacew(ii,ig)-1)*3
endif
do jj=1,nopes
if((nope.eq.20).or.(nope.eq.8)) then
ipointerj=(ifaceq(jj,ig)-1)*3
elseif((nope.eq.10).or.(nope.eq.4)) then
ipointerj=(ifacet(jj,ig)-1)*3
else
ipointerj=(ifacew(jj,ig)-1)*3
endif
do k=1,3
do l=1,3
if(k.eq.l) cycle
if(k*l.eq.2) then
n=3
elseif(k*l.eq.3) then
n=2
else
n=1
endif
term=weight*xload(1,id)*shp2(4,jj)*
& (xsj2(1)*
& (xkl(n,2)*shp2(3,ii)-xkl(n,3)*shp2(2,ii))+
& xsj2(2)*
& (xkl(n,3)*shp2(1,ii)-xkl(n,1)*shp2(3,ii))+
& xsj2(3)*
& (xkl(n,1)*shp2(2,ii)-xkl(n,2)*shp2(1,ii)))
if(ipointeri+k.le.ipointerj+l) then
s(ipointeri+k,ipointerj+l)=
& s(ipointeri+k,ipointerj+l)+term/2.d0
else
s(ipointerj+l,ipointeri+k)=
& s(ipointerj+l,ipointeri+k)+term/2.d0
endif
enddo
enddo
enddo
enddo
!
endif
enddo
!
id=id-1
enddo
!
elseif(mass.and.(iexpl.eq.1)) then
!
! scaling the diagonal terms of the mass matrix such that the total mass
! is right (LUMPING; for explicit dynamic calculations)
!
sume=0.d0
summ=0.d0
do i=1,3*nopev,3
sume=sume+sm(i,i)
enddo
do i=3*nopev+1,3*nope,3
summ=summ+sm(i,i)
enddo
!
if(nope.eq.20) then
c alp=.2215d0
alp=.2917d0
! maybe alp=.2917d0 is better??
elseif(nope.eq.10) then
alp=0.1203d0
elseif(nope.eq.15) then
alp=0.2141d0
endif
!
if((nope.eq.20).or.(nope.eq.10).or.(nope.eq.15)) then
factore=summass*alp/(1.d0+alp)/sume
factorm=summass/(1.d0+alp)/summ
else
factore=summass/sume
endif
!
do i=1,3*nopev,3
sm(i,i)=sm(i,i)*factore
sm(i+1,i+1)=sm(i,i)
sm(i+2,i+2)=sm(i,i)
enddo
do i=3*nopev+1,3*nope,3
sm(i,i)=sm(i,i)*factorm
sm(i+1,i+1)=sm(i,i)
sm(i+2,i+2)=sm(i,i)
enddo
!
endif
!
return
end
[-- Attachment #3: e_c3d-orig.f --]
[-- Type: application/octet-stream, Size: 36912 bytes --]
!
! CalculiX - A 3-dimensional finite element program
! Copyright (C) 1998 Guido Dhondt
!
! This program is free software; you can redistribute it and/or
! modify it under the terms of the GNU General Public License as
! published by the Free Software Foundation(version 2);
!
!
! This program is distributed in the hope that it will be useful,
! but WITHOUT ANY WARRANTY; without even the implied warranty of
! MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
! GNU General Public License for more details.
!
! You should have received a copy of the GNU General Public License
! along with this program; if not, write to the Free Software
! Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
!
subroutine e_c3d(co,nk,konl,lakonl,p1,p2,omx,bodyfx,ibod,s,sm,f,
& ff,nelem,nmethod,elcon,nelcon,rhcon,nrhcon,alcon,nalcon,alzero,
& ielmat,ielorien,norien,orab,ntmat_,
& t0,t1,ithermal,prestr,iprestr,vold,iperturb,nelemload,
& sideload,xload,nload,idist,sti,stx,eei,iexpl,plicon,
& nplicon,plkcon,nplkcon,xstiff,npmat_,dtime,
& matname,mint_,ncmat_,mass,stiffness,buckling,rhs,intscheme)
!
! computation of the element matrix and rhs for the element with
! the topology in konl
!
! f: rhs with temperature and eigenstress contribution: for linear
! calculations only
! ff: rhs without temperature and eigenstress contribution
!
! nmethod=0: check for positive Jacobian
! nmethod=1: stiffness matrix + right hand side
! nmethod=2: stiffness matrix + mass matrix
! nmethod=3: static stiffness + buckling stiffness
! nmethod=4: right hand side (linear, iperturb=0)
!
implicit none
!
logical mass,stiffness,buckling,rhs
!
character*5 sideload(*)
character*8 lakonl
character*20 matname(*),amat
!
integer konl(20),ifaceq(8,6),nelemload(2,*),nk,ibod,nelem,nmethod,
& mattyp,ithermal,iprestr,iperturb,nload,idist,i,j,k,l,i1,i2,j1,
& k1,l1,ii,jj,ii1,jj1,id,ipointer,ig,m1,m2,m3,m4,kk,
& nelcon(2,*),nrhcon(*),nalcon(2,*),ielmat(*),ielorien(*),
& ntmat_,nope,nopes,norien,icmdl,ihyper,iexpl,kode,imat,mint2d,
& mint3d,mint_,ifacet(6,4),nopev,iorien,istiff,ncmat_,
& ifacew(8,5),intscheme,n,ipointeri,ipointerj,iii1,jjj1,n1
!
integer nplicon(0:ntmat_,*),nplkcon(0:ntmat_,*),npmat_
!
real*8 co(3,*),xl(3,20),shp(4,20),
& s(60,60),w(3,3),p1(3),p2(3),f(60),bodyf(3),bodyfx(3),ff(60),
& bf(3),q(3),shpj(4,20),elcon(0:ncmat_,ntmat_,*),
& rhcon(0:1,ntmat_,*),xkl(3,3),
& alcon(0:6,ntmat_,*),alzero(*),orab(7,*),t0(*),t1(*),
& anisox(3,3,3,3),beta(6),prestr(6,*),voldl(3,20),vo(3,3),
& xl2(3,8),xsj2(3),shp2(4,8),vold(0:3,*),xload(2,*),v(3,3,3,3),
& om,omx,e,un,al,um,xi,et,ze,tt,const,xsj,xsjj,sm(60,60),
& sti(6,mint_,*),stx(6,mint_,*),s11,s22,s33,s12,s13,s23,s11b,
& s22b,s33b,s12b,s13b,s23b,eei(6,mint_,*),t0l,t1l,stre(6),
& senergy,senergyb,rho,elas(21),alph(6),summass,summ,
& sume,factorm,factore,alp,elconloc(21),eth(6),exx,eyy,ezz,
& exy,exz,eyz,am1,weight,pgauss(3),dmass,xl1(3,8),term
!
real*8 gauss2d1(2,1),gauss2d2(2,4),gauss2d3(2,9),gauss2d4(2,1),
& gauss2d5(2,3),gauss3d1(3,1),gauss3d2(3,8),gauss3d3(3,27),
& gauss3d4(3,1),gauss3d5(3,4),gauss3d6(3,15),gauss3d7(3,2),
& gauss3d8(3,9),gauss3d9(3,18),weight2d1(1),weight2d2(4),
& weight2d3(9),weight2d4(1),weight2d5(3),weight3d1(1),
& weight3d2(8),weight3d3(27),weight3d4(1),weight3d5(4),
& weight3d6(15),weight3d7(2),weight3d8(9),weight3d9(18)
!
real*8 plicon(0:2*npmat_,ntmat_,*),plkcon(0:2*npmat_,ntmat_,*),
& xstiff(21,mint_,*),
& plconloc(82),dtime
!
include "gauss.f"
!
data ifaceq /4,3,2,1,11,10,9,12,
& 5,6,7,8,13,14,15,16,
& 1,2,6,5,9,18,13,17,
& 2,3,7,6,10,19,14,18,
& 3,4,8,7,11,20,15,19,
& 4,1,5,8,12,17,16,20/
data ifacet /1,3,2,7,6,5,
& 1,2,4,5,9,8,
& 2,3,4,6,10,9,
& 1,4,3,8,10,7/
data ifacew /1,3,2,9,8,7,0,0,
& 4,5,6,10,11,12,0,0,
& 1,2,5,4,7,14,10,13,
& 2,3,6,5,8,15,11,14,
& 4,6,3,1,12,15,9,13/
!
c if(nmethod.eq.5) then
c intscheme=1
c nmethod=2
c else
c intscheme=0
c endif
c!
c mass=.false.
c stiffness=.false.
c buckling=.false.
c rhs=.false.
c!
c if(nmethod.eq.1) then
c stiffness=.true.
c rhs=.true.
c elseif(nmethod.eq.2) then
c mass=.true.
c stiffness=.true.
c elseif(nmethod.eq.3) then
c stiffness=.true.
c buckling=.true.
c elseif(nmethod.eq.4) then
c rhs=.true.
c endif
!
summass=0.d0
!
imat=ielmat(nelem)
amat=matname(imat)
if(norien.gt.0) then
iorien=ielorien(nelem)
else
iorien=0
endif
!
if(lakonl(4:4).eq.'2') then
nope=20
nopev=8
nopes=8
elseif(lakonl(4:4).eq.'8') then
nope=8
nopev=8
nopes=4
elseif(lakonl(4:5).eq.'10') then
nope=10
nopev=4
nopes=6
elseif(lakonl(4:4).eq.'4') then
nope=4
nopev=4
nopes=3
elseif(lakonl(4:5).eq.'15') then
nope=15
nopev=6
else
nope=6
nopev=6
endif
!
if(intscheme.eq.0) then
if(lakonl(4:5).eq.'8R') then
mint2d=1
mint3d=1
elseif((lakonl(4:4).eq.'8').or.(lakonl(4:6).eq.'20R')) then
mint2d=4
mint3d=8
elseif(lakonl(4:4).eq.'2') then
mint2d=9
mint3d=27
elseif(lakonl(4:5).eq.'10') then
mint2d=3
mint3d=4
elseif(lakonl(4:4).eq.'4') then
mint2d=1
mint3d=1
elseif(lakonl(4:5).eq.'15') then
mint3d=9
else
mint3d=2
endif
else
if((lakonl(4:4).eq.'8').or.(lakonl(4:4).eq.'2')) then
mint3d=27
elseif((lakonl(4:5).eq.'10').or.(lakonl(4:4).eq.'4')) then
mint3d=15
else
mint3d=18
endif
endif
!
! computation of the coordinates of the local nodes
!
do i=1,nope
do j=1,3
xl(j,i)=co(j,konl(i))
enddo
enddo
!
if(nelcon(1,imat).lt.0) then
ihyper=1
else
ihyper=0
endif
!
! initialisation for distributed forces
!
if(rhs) then
if(idist.ne.0) then
do i=1,3*nope
f(i)=0.d0
ff(i)=0.d0
enddo
endif
endif
!
! displacements for 2nd order static and modal theory
!
if((iperturb.ne.0).and.stiffness.and.(.not.buckling)) then
do i1=1,nope
do i2=1,3
voldl(i2,i1)=vold(i2,konl(i1))
enddo
enddo
endif
!
! initialisation of sm
!
if(mass.or.buckling) then
do i=1,3*nope
do j=1,3*nope
sm(i,j)=0.d0
enddo
enddo
endif
!
! initialisation of s
!
do i=1,3*nope
do j=1,3*nope
s(i,j)=0.d0
enddo
enddo
!
! computation of the matrix: loop over the Gauss points
!
do kk=1,mint3d
if(intscheme.eq.0) then
if(lakonl(4:5).eq.'8R') then
xi=gauss3d1(1,kk)
et=gauss3d1(2,kk)
ze=gauss3d1(3,kk)
weight=weight3d1(kk)
elseif((lakonl(4:4).eq.'8').or.(lakonl(4:6).eq.'20R'))
& then
xi=gauss3d2(1,kk)
c if(nope.eq.20) xi=xi+1.d0
et=gauss3d2(2,kk)
ze=gauss3d2(3,kk)
weight=weight3d2(kk)
elseif(lakonl(4:4).eq.'2') then
c xi=gauss3d3(1,kk)+1.d0
xi=gauss3d3(1,kk)
et=gauss3d3(2,kk)
ze=gauss3d3(3,kk)
weight=weight3d3(kk)
elseif(lakonl(4:5).eq.'10') then
xi=gauss3d5(1,kk)
et=gauss3d5(2,kk)
ze=gauss3d5(3,kk)
weight=weight3d5(kk)
elseif(lakonl(4:4).eq.'4') then
xi=gauss3d4(1,kk)
et=gauss3d4(2,kk)
ze=gauss3d4(3,kk)
weight=weight3d4(kk)
elseif(lakonl(4:5).eq.'15') then
xi=gauss3d8(1,kk)
et=gauss3d8(2,kk)
ze=gauss3d8(3,kk)
weight=weight3d8(kk)
else
xi=gauss3d7(1,kk)
et=gauss3d7(2,kk)
ze=gauss3d7(3,kk)
weight=weight3d7(kk)
endif
else
if((lakonl(4:4).eq.'8').or.(lakonl(4:4).eq.'2')) then
c xi=gauss3d3(1,kk)+1.d0
xi=gauss3d3(1,kk)
et=gauss3d3(2,kk)
ze=gauss3d3(3,kk)
weight=weight3d3(kk)
elseif((lakonl(4:5).eq.'10').or.(lakonl(4:4).eq.'4')) then
xi=gauss3d6(1,kk)
et=gauss3d6(2,kk)
ze=gauss3d6(3,kk)
weight=weight3d6(kk)
else
xi=gauss3d9(1,kk)
et=gauss3d9(2,kk)
ze=gauss3d9(3,kk)
weight=weight3d9(kk)
endif
endif
!
! calculation of the shape functions and their derivatives
! in the gauss point
!
if(nope.eq.20) then
call shape20h(xi,et,ze,xl,xsj,shp)
elseif(nope.eq.8) then
call shape8h(xi,et,ze,xl,xsj,shp)
elseif(nope.eq.10) then
call shape10tet(xi,et,ze,xl,xsj,shp)
elseif(nope.eq.4) then
call shape4tet(xi,et,ze,xl,xsj,shp)
elseif(nope.eq.15) then
call shape15w(xi,et,ze,xl,xsj,shp)
else
call shape6w(xi,et,ze,xl,xsj,shp)
endif
!
! check the jacobian determinant
!
if(xsj.lt.1.d-20) then
write(*,*) '*WARNING in e_c3d: nonpositive jacobian'
write(*,*) ' determinant in element',nelem
write(*,*)
xsj=dabs(xsj)
nmethod=0
endif
!
if((iperturb.ne.0).and.stiffness.and.(.not.buckling))
& then
!
! stresses for 2nd order static and modal theory
!
s11=sti(1,kk,nelem)
s22=sti(2,kk,nelem)
s33=sti(3,kk,nelem)
s12=sti(4,kk,nelem)
s13=sti(5,kk,nelem)
s23=sti(6,kk,nelem)
endif
!
! calculating the temperature in the integration
! point
!
t0l=0.d0
t1l=0.d0
if(ithermal.eq.1) then
if(lakonl(4:5).eq.'8 ') then
do i1=1,nope
t0l=t0l+t0(konl(i1))/8.d0
t1l=t1l+t1(konl(i1))/8.d0
enddo
elseif(lakonl(4:6).eq.'20 ') then
call lintemp(t0,t1,konl,nope,kk,t0l,t1l)
else
do i1=1,nope
t0l=t0l+shp(4,i1)*t0(konl(i1))
t1l=t1l+shp(4,i1)*t1(konl(i1))
enddo
endif
elseif(ithermal.ge.2) then
if(lakonl(4:5).eq.'8 ') then
do i1=1,nope
t0l=t0l+t0(konl(i1))/8.d0
t1l=t1l+vold(0,konl(i1))/8.d0
enddo
elseif(lakonl(4:6).eq.'20 ') then
call lintemp_th(t0,vold,konl,nope,kk,t0l,t1l)
else
do i1=1,nope
t0l=t0l+shp(4,i1)*t0(konl(i1))
t1l=t1l+shp(4,i1)*vold(0,konl(i1))
enddo
endif
endif
tt=t1l-t0l
!
! calculating the coordinates of the integration point
! for material orientation purposes (for cylindrical
! coordinate systems)
!
if(iorien.gt.0) then
do j=1,3
pgauss(j)=0.d0
do i1=1,nope
pgauss(j)=pgauss(j)+shp(4,i1)*co(j,konl(i1))
enddo
enddo
endif
!
! for deformation plasticity: calculating the Jacobian
! and the inverse of the deformation gradient
! needed to convert the stiffness matrix in the spatial
! frame of reference to the material frame
!
kode=nelcon(1,imat)
!
! material data and local stiffness matrix
!
istiff=1
call materialdata(elcon,nelcon,rhcon,nrhcon,alcon,nalcon,
& imat,amat,iorien,pgauss,orab,ntmat_,elas,alph,rho,
& nelem,ithermal,alzero,mattyp,t0l,t1l,
& ihyper,istiff,elconloc,eth,kode,plicon,
& nplicon,plkcon,nplkcon,npmat_,
& plconloc,mint_,dtime,nelem,kk,
& xstiff,ncmat_)
!
if(mattyp.eq.1) then
e=elas(1)
un=elas(2)
um=e/(1.d0+un)
al=un*um/(1.d0-2.d0*un)
um=um/2.d0
elseif(mattyp.eq.2) then
call orthotropic(elas,anisox)
else
call anisotropic(elas,anisox)
endif
!
! initialisation for the body forces
!
om=omx*rho
if(rhs) then
if(ibod.ne.0) then
do ii=1,3
bodyf(ii)=bodyfx(ii)*rho
enddo
endif
endif
!
if(rhs) then
!
! information for the rhs
!
! residual stresses
!
if((iprestr.eq.1).or.(ithermal.eq.1)) then
if(iprestr.eq.0) then
do ii=1,6
beta(ii)=0.d0
enddo
else
do ii=1,6
beta(ii)=-prestr(ii,nelem)
enddo
endif
endif
!
! calculation of the thermal stresses in an undeformed body
! assumption; beta corresponds to initial stresses.
!
if(ithermal.eq.1) then
if(ihyper.eq.0) then
icmdl=2
call linel(ithermal,mattyp,beta,al,um,am1,alph,tt,
& elas,icmdl,exx,eyy,ezz,exy,exz,eyz,stre,
& anisox)
endif
endif
!
elseif(buckling) then
!
! buckling stresses
!
s11b=stx(1,kk,nelem)
s22b=stx(2,kk,nelem)
s33b=stx(3,kk,nelem)
s12b=stx(4,kk,nelem)
s13b=stx(5,kk,nelem)
s23b=stx(6,kk,nelem)
!
endif
!
! incorporating the jacobian determinant in the shape
! functions
!
xsjj=dsqrt(xsj)
do i1=1,nope
shpj(1,i1)=shp(1,i1)*xsjj
shpj(2,i1)=shp(2,i1)*xsjj
shpj(3,i1)=shp(3,i1)*xsjj
shpj(4,i1)=shp(4,i1)*xsj
enddo
!
! determination of the stiffness, and/or mass and/or
! buckling matrix
!
if(stiffness.or.mass.or.buckling) then
!
if((iperturb.eq.0).or.buckling)
& then
jj1=1
do jj=1,nope
!
ii1=1
do ii=1,jj
!
! all products of the shape functions for a given ii
! and jj
!
do i1=1,3
do j1=1,3
w(i1,j1)=shpj(i1,ii)*shpj(j1,jj)
enddo
enddo
!
! the following section calculates the static
! part of the stiffness matrix which, for buckling
! calculations, is done in a preliminary static
! call
!
if(.not.buckling) then
!
if(mattyp.eq.1) then
!
s(ii1,jj1)=s(ii1,jj1)+(al*w(1,1)+
& um*(2.d0*w(1,1)+w(2,2)+w(3,3)))*weight
s(ii1,jj1+1)=s(ii1,jj1+1)+(al*w(1,2)+
& um*w(2,1))*weight
s(ii1,jj1+2)=s(ii1,jj1+2)+(al*w(1,3)+
& um*w(3,1))*weight
s(ii1+1,jj1)=s(ii1+1,jj1)+(al*w(2,1)+
& um*w(1,2))*weight
s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+(al*w(2,2)+
& um*(2.d0*w(2,2)+w(1,1)+w(3,3)))*weight
s(ii1+1,jj1+2)=s(ii1+1,jj1+2)+(al*w(2,3)+
& um*w(3,2))*weight
s(ii1+2,jj1)=s(ii1+2,jj1)+(al*w(3,1)+
& um*w(1,3))*weight
s(ii1+2,jj1+1)=s(ii1+2,jj1+1)+(al*w(3,2)+
& um*w(2,3))*weight
s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+(al*w(3,3)+
& um*(2.d0*w(3,3)+w(2,2)+w(1,1)))*weight
!
elseif(mattyp.eq.2) then
!
s(ii1,jj1)=s(ii1,jj1)+(elas(1)*w(1,1)+
& elas(7)*w(2,2)+elas(8)*w(3,3))*weight
s(ii1,jj1+1)=s(ii1,jj1+1)+(elas(2)*w(1,2)+
& elas(7)*w(2,1))*weight
s(ii1,jj1+2)=s(ii1,jj1+2)+(elas(4)*w(1,3)+
& elas(8)*w(3,1))*weight
s(ii1+1,jj1)=s(ii1+1,jj1)+(elas(7)*w(1,2)+
& elas(2)*w(2,1))*weight
s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+
& (elas(7)*w(1,1)+
& elas(3)*w(2,2)+elas(9)*w(3,3))*weight
s(ii1+1,jj1+2)=s(ii1+1,jj1+2)+
& (elas(5)*w(2,3)+
& elas(9)*w(3,2))*weight
s(ii1+2,jj1)=s(ii1+2,jj1)+
& (elas(8)*w(1,3)+
& elas(4)*w(3,1))*weight
s(ii1+2,jj1+1)=s(ii1+2,jj1+1)+
& (elas(9)*w(2,3)+
& elas(5)*w(3,2))*weight
s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+
& (elas(8)*w(1,1)+
& elas(9)*w(2,2)+elas(6)*w(3,3))*weight
!
else
!
do i1=1,3
do j1=1,3
do k1=1,3
do l1=1,3
s(ii1+i1-1,jj1+j1-1)=
& s(ii1+i1-1,jj1+j1-1)
& +anisox(i1,k1,j1,l1)
& *w(k1,l1)*weight
enddo
enddo
enddo
enddo
!
endif
!
! mass matrix
!
if(mass) then
sm(ii1,jj1)=sm(ii1,jj1)
& +rho*shpj(4,ii)*shp(4,jj)*weight
sm(ii1+1,jj1+1)=sm(ii1,jj1)
sm(ii1+2,jj1+2)=sm(ii1,jj1)
endif
!
else
!
! buckling matrix
!
senergyb=
& (s11b*w(1,1)+s12b*(w(1,2)+w(2,1))
& +s13b*(w(1,3)+w(3,1))+s22b*w(2,2)
& +s23b*(w(2,3)+w(3,2))+s33b*w(3,3))*weight
sm(ii1,jj1)=sm(ii1,jj1)-senergyb
sm(ii1+1,jj1+1)=sm(ii1+1,jj1+1)-senergyb
sm(ii1+2,jj1+2)=sm(ii1+2,jj1+2)-senergyb
!
endif
!
ii1=ii1+3
enddo
jj1=jj1+3
enddo
else
!
! stiffness matrix for static and modal
! 2nd order calculations
!
! large displacement stiffness
!
do i1=1,3
do j1=1,3
vo(i1,j1)=0.d0
do k1=1,nope
vo(i1,j1)=vo(i1,j1)+shp(j1,k1)*voldl(i1,k1)
enddo
enddo
enddo
!
if(mattyp.eq.1) then
call wcoef(v,vo,al,um)
endif
!
! calculating the total mass of the element for
! lumping purposes: only for explicit nonlinear
! dynamic calculations
!
if(mass.and.(iexpl.eq.1)) then
summass=summass+rho*xsj
endif
!
jj1=1
do jj=1,nope
!
ii1=1
do ii=1,jj
!
! all products of the shape functions for a given ii
! and jj
!
do i1=1,3
do j1=1,3
w(i1,j1)=shpj(i1,ii)*shpj(j1,jj)
enddo
enddo
!
if(mattyp.eq.1) then
!
do m1=1,3
do m2=1,3
do m3=1,3
do m4=1,3
s(ii1+m2-1,jj1+m1-1)=
& s(ii1+m2-1,jj1+m1-1)
& +v(m4,m3,m2,m1)*w(m4,m3)*weight
enddo
enddo
enddo
enddo
!
elseif(mattyp.eq.2) then
!
call orthonl(w,vo,elas,s,ii1,jj1,weight)
!
else
!
do i1=1,3
iii1=ii1+i1-1
do j1=1,3
jjj1=jj1+j1-1
do k1=1,3
do l1=1,3
s(iii1,jjj1)=s(iii1,jjj1)
& +anisox(i1,k1,j1,l1)*w(k1,l1)*weight
do m1=1,3
s(iii1,jjj1)=s(iii1,jjj1)
& +anisox(i1,k1,m1,l1)*w(k1,l1)
& *vo(j1,m1)*weight
& +anisox(m1,k1,j1,l1)*w(k1,l1)
& *vo(i1,m1)*weight
do n1=1,3
s(iii1,jjj1)=s(iii1,jjj1)
& +anisox(m1,k1,n1,l1)
& *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
& *weight
enddo
enddo
enddo
enddo
enddo
enddo
!SPEC: The immediately preceding loop nest is also available in
!SPEC: program-generated (much longer) form from the author's
!SPEC: website (see 454.calculix/Docs) in file anisonl.f
!SPEC:
!SPEC: call anisonl(w,vo,elas,s,ii1,jj1,weight)
!SPEC:
endif
!
! stress stiffness
!
senergy=
& (s11*w(1,1)+s12*(w(1,2)+w(2,1))
& +s13*(w(1,3)+w(3,1))+s22*w(2,2)
& +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
s(ii1,jj1)=s(ii1,jj1)+senergy
s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
!
! mass matrix
!
if(mass) then
sm(ii1,jj1)=sm(ii1,jj1)
& +rho*shpj(4,ii)*shp(4,jj)*weight
sm(ii1+1,jj1+1)=sm(ii1,jj1)
sm(ii1+2,jj1+2)=sm(ii1,jj1)
endif
!
! stiffness contribution of centrifugal forces
!
if(mass.and.(om.gt.0.d0)) then
dmass=shpj(4,ii)*shp(4,jj)*weight*om
do m1=1,3
s(ii1+m1-1,jj1+m1-1)=s(ii1+m1-1,jj1+m1-1)-
& dmass
do m2=1,3
s(ii1+m1-1,jj1+m2-1)=s(ii1+m1-1,jj1+m2-1)+
& dmass*p2(m1)*p2(m2)
enddo
enddo
endif
!
ii1=ii1+3
enddo
jj1=jj1+3
enddo
endif
!
endif
!
! computation of the right hand side
!
if(rhs) then
!
! body forces
!
if(ibod.ne.0) then
if(om.gt.0.d0) then
do i1=1,3
!
! computation of the global coordinates of the gauss
! point
!
q(i1)=0.d0
if(iperturb.eq.0) then
do j1=1,nope
q(i1)=q(i1)+shp(4,j1)*xl(i1,j1)
enddo
else
do j1=1,nope
q(i1)=q(i1)+shp(4,j1)*
& (xl(i1,j1)+voldl(i1,j1))
enddo
endif
!
q(i1)=q(i1)-p1(i1)
enddo
const=q(1)*p2(1)+q(2)*p2(2)+q(3)*p2(3)
!
! inclusion of the centrifugal force into the body force
!
do i1=1,3
bf(i1)=bodyf(i1)+(q(i1)-const*p2(i1))*om
enddo
else
do i1=1,3
bf(i1)=bodyf(i1)
enddo
endif
jj1=1
do jj=1,nope
f(jj1)=f(jj1)+bf(1)*shpj(4,jj)*weight
f(jj1+1)=f(jj1+1)+bf(2)*shpj(4,jj)*weight
f(jj1+2)=f(jj1+2)+bf(3)*shpj(4,jj)*weight
ff(jj1)=ff(jj1)+bf(1)*shpj(4,jj)*weight
ff(jj1+1)=ff(jj1+1)+bf(2)*shpj(4,jj)*weight
ff(jj1+2)=ff(jj1+2)+bf(3)*shpj(4,jj)*weight
jj1=jj1+3
enddo
endif
!
! thermal stresses and/or residual stresses
!
if((ithermal.ne.0).or.(iprestr.ne.0)) then
do jj=1,6
beta(jj)=beta(jj)*xsj
enddo
jj1=1
do jj=1,nope
f(jj1)=f(jj1)+(shp(1,jj)*beta(1)+
& (shp(2,jj)*beta(4)+shp(3,jj)*beta(5))/2.d0)
& *weight
f(jj1+1)=f(jj1+1)+(shp(2,jj)*beta(2)+
& (shp(1,jj)*beta(4)+shp(3,jj)*beta(6))/2.d0)
& *weight
f(jj1+2)=f(jj1+2)+(shp(3,jj)*beta(3)+
& (shp(1,jj)*beta(5)+shp(2,jj)*beta(6))/2.d0)
& *weight
jj1=jj1+3
enddo
endif
!
endif
!
enddo
!
c write(*,'(6(1x,e11.4))') ((s(i1,j1),i1=1,j1),j1=1,60)
c write(*,*)
c
if(.not.buckling) then
!
! distributed loads
!
if(nload.eq.0) then
return
endif
call nident2(nelemload,nelem,nload,id)
do
if((id.eq.0).or.(nelemload(1,id).ne.nelem)) exit
read(sideload(id)(2:2),'(i1)') ig
!
! treatment of wedge faces
!
if(lakonl(4:4).eq.'6') then
mint2d=1
if(ig.le.2) then
nopes=3
else
nopes=4
endif
endif
if(lakonl(4:5).eq.'15') then
if(ig.le.2) then
mint2d=3
nopes=6
else
mint2d=4
nopes=8
endif
endif
!
if((nope.eq.20).or.(nope.eq.8)) then
if(iperturb.eq.0) then
do i=1,nopes
do j=1,3
xl2(j,i)=co(j,konl(ifaceq(i,ig)))
enddo
enddo
else
if(mass) then
do i=1,nopes
do j=1,3
xl1(j,i)=co(j,konl(ifaceq(i,ig)))
enddo
enddo
endif
do i=1,nopes
do j=1,3
xl2(j,i)=co(j,konl(ifaceq(i,ig)))+
& vold(j,konl(ifaceq(i,ig)))
enddo
enddo
endif
elseif((nope.eq.10).or.(nope.eq.4)) then
if(iperturb.eq.0) then
do i=1,nopes
do j=1,3
xl2(j,i)=co(j,konl(ifacet(i,ig)))
enddo
enddo
else
if(mass) then
do i=1,nopes
do j=1,3
xl1(j,i)=co(j,konl(ifacet(i,ig)))
enddo
enddo
endif
do i=1,nopes
do j=1,3
xl2(j,i)=co(j,konl(ifacet(i,ig)))+
& vold(j,konl(ifacet(i,ig)))
enddo
enddo
endif
else
if(iperturb.eq.0) then
do i=1,nopes
do j=1,3
xl2(j,i)=co(j,konl(ifacew(i,ig)))
enddo
enddo
else
if(mass) then
do i=1,nopes
do j=1,3
xl1(j,i)=co(j,konl(ifacew(i,ig)))
enddo
enddo
endif
do i=1,nopes
do j=1,3
xl2(j,i)=co(j,konl(ifacew(i,ig)))+
& vold(j,konl(ifacew(i,ig)))
enddo
enddo
endif
endif
!
do i=1,mint2d
if((lakonl(4:5).eq.'8R').or.
& ((lakonl(4:4).eq.'6').and.(nopes.eq.4))) then
xi=gauss2d1(1,i)
et=gauss2d1(2,i)
weight=weight2d1(i)
elseif((lakonl(4:4).eq.'8').or.
& (lakonl(4:6).eq.'20R').or.
& ((lakonl(4:5).eq.'15').and.(nopes.eq.8))) then
xi=gauss2d2(1,i)
et=gauss2d2(2,i)
weight=weight2d2(i)
elseif(lakonl(4:4).eq.'2') then
xi=gauss2d3(1,i)
et=gauss2d3(2,i)
weight=weight2d3(i)
elseif((lakonl(4:5).eq.'10').or.
& ((lakonl(4:5).eq.'15').and.(nopes.eq.6))) then
xi=gauss2d5(1,i)
et=gauss2d5(2,i)
weight=weight2d5(i)
elseif((lakonl(4:4).eq.'4').or.
& ((lakonl(4:4).eq.'6').and.(nopes.eq.3))) then
xi=gauss2d4(1,i)
et=gauss2d4(2,i)
weight=weight2d4(i)
endif
!
if(rhs) then
if(nopes.eq.8) then
call shape8q(xi,et,xl2,xsj2,shp2)
elseif(nopes.eq.4) then
call shape4q(xi,et,xl2,xsj2,shp2)
elseif(nopes.eq.6) then
call shape6tri(xi,et,xl2,xsj2,shp2)
else
call shape3tri(xi,et,xl2,xsj2,shp2)
endif
!
do k=1,nopes
if((nope.eq.20).or.(nope.eq.8)) then
ipointer=(ifaceq(k,ig)-1)*3
elseif((nope.eq.10).or.(nope.eq.4)) then
ipointer=(ifacet(k,ig)-1)*3
else
ipointer=(ifacew(k,ig)-1)*3
endif
f(ipointer+1)=f(ipointer+1)-shp2(4,k)*xload(1,id)
& *xsj2(1)*weight
f(ipointer+2)=f(ipointer+2)-shp2(4,k)*xload(1,id)
& *xsj2(2)*weight
f(ipointer+3)=f(ipointer+3)-shp2(4,k)*xload(1,id)
& *xsj2(3)*weight
ff(ipointer+1)=ff(ipointer+1)-shp2(4,k)*xload(1,id)
& *xsj2(1)*weight
ff(ipointer+2)=ff(ipointer+2)-shp2(4,k)*xload(1,id)
& *xsj2(2)*weight
ff(ipointer+3)=ff(ipointer+3)-shp2(4,k)*xload(1,id)
& *xsj2(3)*weight
enddo
!
! stiffness contribution of the distributed load
!
elseif(mass) then
if(nopes.eq.8) then
call shape8q(xi,et,xl1,xsj2,shp2)
elseif(nopes.eq.4) then
call shape4q(xi,et,xl1,xsj2,shp2)
elseif(nopes.eq.6) then
call shape6tri(xi,et,xl1,xsj2,shp2)
else
call shape3tri(xi,et,xl1,xsj2,shp2)
endif
!
! calculation of the deformation gradient
!
do k=1,3
do l=1,3
xkl(k,l)=0.d0
do ii=1,nopes
xkl(k,l)=xkl(k,l)+shp2(l,ii)*xl2(k,ii)
enddo
enddo
enddo
!
do ii=1,nopes
if((nope.eq.20).or.(nope.eq.8)) then
ipointeri=(ifaceq(ii,ig)-1)*3
elseif((nope.eq.10).or.(nope.eq.4)) then
ipointeri=(ifacet(ii,ig)-1)*3
else
ipointeri=(ifacew(ii,ig)-1)*3
endif
do jj=1,nopes
if((nope.eq.20).or.(nope.eq.8)) then
ipointerj=(ifaceq(jj,ig)-1)*3
elseif((nope.eq.10).or.(nope.eq.4)) then
ipointerj=(ifacet(jj,ig)-1)*3
else
ipointerj=(ifacew(jj,ig)-1)*3
endif
do k=1,3
do l=1,3
if(k.eq.l) cycle
if(k*l.eq.2) then
n=3
elseif(k*l.eq.3) then
n=2
else
n=1
endif
term=weight*xload(1,id)*shp2(4,jj)*
& (xsj2(1)*
& (xkl(n,2)*shp2(3,ii)-xkl(n,3)*shp2(2,ii))+
& xsj2(2)*
& (xkl(n,3)*shp2(1,ii)-xkl(n,1)*shp2(3,ii))+
& xsj2(3)*
& (xkl(n,1)*shp2(2,ii)-xkl(n,2)*shp2(1,ii)))
if(ipointeri+k.le.ipointerj+l) then
s(ipointeri+k,ipointerj+l)=
& s(ipointeri+k,ipointerj+l)+term/2.d0
else
s(ipointerj+l,ipointeri+k)=
& s(ipointerj+l,ipointeri+k)+term/2.d0
endif
enddo
enddo
enddo
enddo
!
endif
enddo
!
id=id-1
enddo
!
elseif(mass.and.(iexpl.eq.1)) then
!
! scaling the diagonal terms of the mass matrix such that the total mass
! is right (LUMPING; for explicit dynamic calculations)
!
sume=0.d0
summ=0.d0
do i=1,3*nopev,3
sume=sume+sm(i,i)
enddo
do i=3*nopev+1,3*nope,3
summ=summ+sm(i,i)
enddo
!
if(nope.eq.20) then
c alp=.2215d0
alp=.2917d0
! maybe alp=.2917d0 is better??
elseif(nope.eq.10) then
alp=0.1203d0
elseif(nope.eq.15) then
alp=0.2141d0
endif
!
if((nope.eq.20).or.(nope.eq.10).or.(nope.eq.15)) then
factore=summass*alp/(1.d0+alp)/sume
factorm=summass/(1.d0+alp)/summ
else
factore=summass/sume
endif
!
do i=1,3*nopev,3
sm(i,i)=sm(i,i)*factore
sm(i+1,i+1)=sm(i,i)
sm(i+2,i+2)=sm(i,i)
enddo
do i=3*nopev+1,3*nope,3
sm(i,i)=sm(i,i)*factorm
sm(i+1,i+1)=sm(i,i)
sm(i+2,i+2)=sm(i,i)
enddo
!
endif
!
return
end
[-- Attachment #4: gauss.f --]
[-- Type: application/octet-stream, Size: 9346 bytes --]
!
! contains Gauss point information
!
! gauss2d1: quad, 1-point integration (1 integration point)
! gauss2d2: quad, 2-point integration (4 integration points)
! gauss2d3: quad, 3-point integration (9 integration points)
! gauss2d4: tri, 1 integration point
! gauss2d5: tri, 3 integration points
! gauss3d1: hex, 1-point integration (1 integration point)
! gauss3d2: hex, 2-point integration (8 integration points)
! gauss3d3: hex, 3-point integration (27 integration points)
! gauss3d4: tet, 1 integration point
! gauss3d5: tet, 4 integration points
! gauss3d6: tet, 15 integration points
! gauss3d7: wedge, 2 integration points
! gauss3d8: wedge, 9 integration points
! gauss3d9: wedge, 18 integration points
!
! weight2d1,... contains the weights
!
data gauss2d1 /0.,0./
!
data gauss2d2 /
& -0.577350269189626d0,-0.577350269189626d0,
& -0.577350269189626d0,0.577350269189626d0,
& 0.577350269189626d0,-0.577350269189626d0,
& 0.577350269189626d0,0.577350269189626d0/
!
data gauss2d3 /
& -0.774596669241483d0,-0.774596669241483d0,
& -0.774596669241483d0,0.d0,
& -0.774596669241483d0,0.774596669241483d0,
& -0.d0,-0.774596669241483d0,
& -0.d0,0.d0,
& -0.d0,0.774596669241483d0,
& 0.774596669241483d0,-0.774596669241483d0,
& 0.774596669241483d0,0.d0,
& 0.774596669241483d0,0.774596669241483d0/
!
data gauss2d4 /0.333333333333333d0,0.333333333333333d0/
!
data gauss2d5 /.5d0,.5d0,0.d0,.5d0,.5d0,0.d0/
!
data gauss3d1 /0.,0.,0./
!
data gauss3d2 /
& -0.577350269189626d0,-0.577350269189626d0,-0.577350269189626d0,
& 0.577350269189626d0,-0.577350269189626d0,-0.577350269189626d0,
& -0.577350269189626d0,0.577350269189626d0,-0.577350269189626d0,
& 0.577350269189626d0,0.577350269189626d0,-0.577350269189626d0,
& -0.577350269189626d0,-0.577350269189626d0,0.577350269189626d0,
& 0.577350269189626d0,-0.577350269189626d0,0.577350269189626d0,
& -0.577350269189626d0,0.577350269189626d0,0.577350269189626d0,
& 0.577350269189626d0,0.577350269189626d0,0.577350269189626d0/
!
data gauss3d3 /
& -0.774596669241483d0,-0.774596669241483d0,-0.774596669241483d0,
& 0.d0,-0.774596669241483d0,-0.774596669241483d0,
& 0.774596669241483d0,-0.774596669241483d0,-0.774596669241483d0,
& -0.774596669241483d0,0.d0,-0.774596669241483d0,
& 0.d0,0.d0,-0.774596669241483d0,
& 0.774596669241483d0,0.d0,-0.774596669241483d0,
& -0.774596669241483d0,0.774596669241483d0,-0.774596669241483d0,
& 0.d0,0.774596669241483d0,-0.774596669241483d0,
& 0.774596669241483d0,0.774596669241483d0,-0.774596669241483d0,
& -0.774596669241483d0,-0.774596669241483d0,0.d0,
& 0.d0,-0.774596669241483d0,0.d0,
& 0.774596669241483d0,-0.774596669241483d0,0.d0,
& -0.774596669241483d0,0.d0,0.d0,
& 0.d0,0.d0,0.d0,
& 0.774596669241483d0,0.d0,0.d0,
& -0.774596669241483d0,0.774596669241483d0,0.d0,
& 0.d0,0.774596669241483d0,0.d0,
& 0.774596669241483d0,0.774596669241483d0,0.d0,
& -0.774596669241483d0,-0.774596669241483d0,0.774596669241483d0,
& 0.d0,-0.774596669241483d0,0.774596669241483d0,
& 0.774596669241483d0,-0.774596669241483d0,0.774596669241483d0,
& -0.774596669241483d0,0.d0,0.774596669241483d0,
& 0.d0,0.d0,0.774596669241483d0,
& 0.774596669241483d0,0.d0,0.774596669241483d0,
& -0.774596669241483d0,0.774596669241483d0,0.774596669241483d0,
& 0.d0,0.774596669241483d0,0.774596669241483d0,
& 0.774596669241483d0,0.774596669241483d0,0.774596669241483d0/
!
data gauss3d4 /0.25d0,0.25d0,0.25d0/
!
data gauss3d5 /
& 0.138196601125011d0,0.138196601125011d0,0.138196601125011d0,
& 0.585410196624968d0,0.138196601125011d0,0.138196601125011d0,
& 0.138196601125011d0,0.585410196624968d0,0.138196601125011d0,
& 0.138196601125011d0,0.138196601125011d0,0.585410196624968d0/
!
data gauss3d6 /
& 0.25,0.25,0.25d0,
& 0.091971078052723d0,0.091971078052723d0,0.091971078052723d0,
& 0.091971078052723d0,0.091971078052723d0,0.724086765841831d0,
& 0.091971078052723d0,0.724086765841831d0,0.091971078052723d0,
& 0.724086765841831d0,0.091971078052723d0,0.091971078052723d0,
& 0.319793627829630d0,0.319793627829630d0,0.319793627829630d0,
& 0.319793627829630d0,0.319793627829630d0,0.040619116511110d0,
& 0.319793627829630d0,0.040619116511110d0,0.319793627829630d0,
& 0.040619116511110d0,0.319793627829630d0,0.319793627829630d0,
& 0.056350832689629d0,0.056350832689629d0,0.443649167310371d0,
& 0.056350832689629d0,0.443649167310371d0,0.056350832689629d0,
& 0.443649167310371d0,0.056350832689629d0,0.056350832689629d0,
& 0.443649167310371d0,0.443649167310371d0,0.056350832689629d0,
& 0.443649167310371d0,0.056350832689629d0,0.443649167310371d0,
& 0.056350832689629d0,0.443649167310371d0,0.443649167310371d0/
!
data gauss3d7 /
& 0.333333333333333d0,0.333333333333333d0,-0.577350269189626d0,
& 0.333333333333333d0,0.333333333333333d0,0.577350269189626d0/
!
data gauss3d8 /
& 0.166666666666667d0,0.166666666666667d0,-0.774596669241483d0,
& 0.666666666666667d0,0.166666666666667d0,-0.774596669241483d0,
& 0.166666666666667d0,0.666666666666667d0,-0.774596669241483d0,
& 0.166666666666667d0,0.166666666666667d0,0.d0,
& 0.666666666666667d0,0.166666666666667d0,0.d0,
& 0.166666666666667d0,0.666666666666667d0,0.d0,
& 0.166666666666667d0,0.166666666666667d0,0.774596669241483d0,
& 0.666666666666667d0,0.166666666666667d0,0.774596669241483d0,
& 0.166666666666667d0,0.666666666666667d0,0.774596669241483d0/
!
data gauss3d9 /
& 0.166666666666667d0,0.166666666666667d0,-0.774596669241483d0,
& 0.166666666666667d0,0.666666666666667d0,-0.774596669241483d0,
& 0.666666666666667d0,0.166666666666667d0,-0.774596669241483d0,
& 0.000000000000000d0,0.500000000000000d0,-0.774596669241483d0,
& 0.500000000000000d0,0.000000000000000d0,-0.774596669241483d0,
& 0.500000000000000d0,0.500000000000000d0,-0.774596669241483d0,
& 0.166666666666667d0,0.166666666666667d0,0.d0,
& 0.166666666666667d0,0.666666666666667d0,0.d0,
& 0.666666666666667d0,0.166666666666667d0,0.d0,
& 0.000000000000000d0,0.500000000000000d0,0.d0,
& 0.500000000000000d0,0.000000000000000d0,0.d0,
& 0.500000000000000d0,0.500000000000000d0,0.d0,
& 0.166666666666667d0,0.166666666666667d0,0.774596669241483d0,
& 0.166666666666667d0,0.666666666666667d0,0.774596669241483d0,
& 0.666666666666667d0,0.166666666666667d0,0.774596669241483d0,
& 0.000000000000000d0,0.500000000000000d0,0.774596669241483d0,
& 0.500000000000000d0,0.000000000000000d0,0.774596669241483d0,
& 0.500000000000000d0,0.500000000000000d0,0.774596669241483d0/
!
data weight2d1 /4.d0/
!
data weight2d2 /1.d0,1.d0,1.d0,1.d0/
!
data weight2d3 /
& 0.308641975308642d0,0.493827160493827d0,0.308641975308642d0,
& 0.493827160493827d0,0.790123456790123d0,0.493827160493827d0,
& 0.308641975308642d0,0.493827160493827d0,0.308641975308642d0/
!
data weight2d4 /0.5d0/
!
data weight2d5 /
& 0.166666666666666d0,0.166666666666666d0,0.166666666666666d0/
!
data weight3d1 /8.d0/
!
data weight3d2 /1.d0,1.d0,1.d0,1.d0,1.d0,1.d0,1.d0,1.d0/
!
data weight3d3 /
& 0.171467764060357d0,0.274348422496571d0,0.171467764060357d0,
& 0.274348422496571d0,0.438957475994513d0,0.274348422496571d0,
& 0.171467764060357d0,0.274348422496571d0,0.171467764060357d0,
& 0.274348422496571d0,0.438957475994513d0,0.274348422496571d0,
& 0.438957475994513d0,0.702331961591221d0,0.438957475994513d0,
& 0.274348422496571d0,0.438957475994513d0,0.274348422496571d0,
& 0.171467764060357d0,0.274348422496571d0,0.171467764060357d0,
& 0.274348422496571d0,0.438957475994513d0,0.274348422496571d0,
& 0.171467764060357d0,0.274348422496571d0,0.171467764060357d0/
!
data weight3d4 /0.166666666666667d0/
!
data weight3d5 /
& 0.041666666666667d0,0.041666666666667d0,0.041666666666667d0,
& 0.041666666666667d0/
!
data weight3d6 /
& 0.019753086419753d0,0.011989513963170d0,0.011989513963170d0,
& 0.011989513963170d0,0.011989513963170d0,0.011511367871045d0,
& 0.011511367871045d0,0.011511367871045d0,0.011511367871045d0,
& 0.008818342151675d0,0.008818342151675d0,0.008818342151675d0,
& 0.008818342151675d0,0.008818342151675d0,0.008818342151675d0/
!
data weight3d7 /0.5d0,0.5d0/
!
data weight3d8 /
& 0.092592592592593d0,0.092592592592593d0,0.092592592592593d0,
& 0.148148148148148d0,0.148148148148148d0,0.148148148148148d0,
& 0.092592592592593d0,0.092592592592593d0,0.092592592592593d0/
!
data weight3d9 /
& 0.083333333333333d0,0.083333333333333d0,0.083333333333333d0,
& 0.009259259259259d0,0.009259259259259d0,0.009259259259259d0,
& 0.133333333333333d0,0.133333333333333d0,0.133333333333333d0,
& 0.014814814814815d0,0.014814814814815d0,0.014814814814815d0,
& 0.083333333333333d0,0.083333333333333d0,0.083333333333333d0,
& 0.009259259259259d0,0.009259259259259d0,0.009259259259259d0/
!
* Re: LTO slows down calculix by more than 10% on aarch64
2020-08-26 10:32 LTO slows down calculix by more than 10% on aarch64 Prathamesh Kulkarni
@ 2020-08-26 11:20 ` Richard Biener
2020-08-28 11:16 ` Prathamesh Kulkarni
0 siblings, 1 reply; 25+ messages in thread
From: Richard Biener @ 2020-08-26 11:20 UTC (permalink / raw)
To: Prathamesh Kulkarni; +Cc: GCC Development
On Wed, Aug 26, 2020 at 12:34 PM Prathamesh Kulkarni via Gcc
<gcc@gcc.gnu.org> wrote:
>
> Hi,
> We're seeing a consistent regression >10% on calculix with -O2 -flto vs -O2
> on aarch64 in our validation CI. I tried to investigate this issue a
> bit, and it seems the regression comes from inlining of orthonl into
> e_c3d. Disabling that brings back the performance. However, inlining
> orthonl into e_c3d, increases it's size from 3187 to 3837 by around
> 16.9% which isn't too large.
>
> I have attached two test-cases, e_c3d.f that has orthonl manually
> inlined into e_c3d to "simulate" LTO's inlining, and e_c3d-orig.f,
> which contains unmodified function.
> (gauss.f is included by e_c3d.f). For reproducing, just passing -O2 is
> sufficient.
>
> It seems that inlining orthonl, causes 20 hoistings into block 181,
> which are then hoisted to block 173, in particular hoistings of w(1,
> 1) ... w(3, 3), which wasn't
> possible without inlining. The hoistings happen because of basic block
> that computes orthonl in line 672 has w(1, 1) ... w(3, 3) and the
> following block in line 1035 in e_c3d.f:
>
> senergy=
> & (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> & +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> & +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
>
> Disabling hoisting into blocks 173 (and 181), brings back most of the
> performance. I am not able to understand why (if?) these hoistings of
> w(1, 1) ...
> w(3, 3) are causing slowdown however. Looking at assembly, the hot
> code-path from perf in e_c3d shows following code-gen diff:
> For inlined version:
> .L122:
> ldr d15, [x1, -248]
> add w0, w0, 1
> add x2, x2, 24
> add x1, x1, 72
> fmul d15, d17, d15
> fmul d15, d15, d18
> fmul d14, d15, d14
> fmadd d16, d14, d31, d16
> cmp w0, 4
> beq .L121
> ldr d14, [x2, -8]
> b .L122
>
> and for non-inlined version:
> .L118:
> ldr d0, [x1, -248]
> add w0, w0, 1
> ldr d2, [x2, -8]
> add x1, x1, 72
> add x2, x2, 24
> fmul d0, d3, d0
> fmul d0, d0, d5
> fmul d0, d0, d2
> fmadd d1, d4, d0, d1
> cmp w0, 4
> bne .L118
I wonder if you have profiles. The inlined version has a
non-empty latch block (looks like some PRE is happening
there?). Possibly your uarch does not like the closely spaced
branches (does your assembly show the layout as it is?)?
> which corresponds to the following loop in line 1014.
> do n1=1,3
> s(iii1,jjj1)=s(iii1,jjj1)
> & +anisox(m1,k1,n1,l1)
> & *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
> & *weight
>
> I am not sure why would hoisting have any direct effect on this loop
> except perhaps that hoisting allocated more reigsters, and led to
> increased register pressure. Perhaps that's why it's using highered
> number regs for code-gen in inlined version ? However disabling
> hoisting in blocks 173 and 181, also leads to overall 6 extra spills
> (by grepping for str to sp), so
> hoisting is also helping here ? I am not sure how to proceed further,
> and would be grateful for suggestions.
>
> Thanks,
> Prathamesh
* Re: LTO slows down calculix by more than 10% on aarch64
2020-08-26 11:20 ` Richard Biener
@ 2020-08-28 11:16 ` Prathamesh Kulkarni
2020-08-28 11:57 ` Richard Biener
2020-08-28 12:03 ` Alexander Monakov
0 siblings, 2 replies; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-08-28 11:16 UTC (permalink / raw)
To: Richard Biener; +Cc: GCC Development
On Wed, 26 Aug 2020 at 16:50, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Wed, Aug 26, 2020 at 12:34 PM Prathamesh Kulkarni via Gcc
> <gcc@gcc.gnu.org> wrote:
> >
> > Hi,
> > We're seeing a consistent regression >10% on calculix with -O2 -flto vs -O2
> > on aarch64 in our validation CI. I tried to investigate this issue a
> > bit, and it seems the regression comes from inlining of orthonl into
> > e_c3d. Disabling that brings back the performance. However, inlining
> > orthonl into e_c3d, increases it's size from 3187 to 3837 by around
> > 16.9% which isn't too large.
> >
> > I have attached two test-cases, e_c3d.f that has orthonl manually
> > inlined into e_c3d to "simulate" LTO's inlining, and e_c3d-orig.f,
> > which contains unmodified function.
> > (gauss.f is included by e_c3d.f). For reproducing, just passing -O2 is
> > sufficient.
> >
> > It seems that inlining orthonl, causes 20 hoistings into block 181,
> > which are then hoisted to block 173, in particular hoistings of w(1,
> > 1) ... w(3, 3), which wasn't
> > possible without inlining. The hoistings happen because of basic block
> > that computes orthonl in line 672 has w(1, 1) ... w(3, 3) and the
> > following block in line 1035 in e_c3d.f:
> >
> > senergy=
> > & (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > & +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > & +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> >
> > Disabling hoisting into blocks 173 (and 181), brings back most of the
> > performance. I am not able to understand why (if?) these hoistings of
> > w(1, 1) ...
> > w(3, 3) are causing slowdown however. Looking at assembly, the hot
> > code-path from perf in e_c3d shows following code-gen diff:
> > For inlined version:
> > .L122:
> > ldr d15, [x1, -248]
> > add w0, w0, 1
> > add x2, x2, 24
> > add x1, x1, 72
> > fmul d15, d17, d15
> > fmul d15, d15, d18
> > fmul d14, d15, d14
> > fmadd d16, d14, d31, d16
> > cmp w0, 4
> > beq .L121
> > ldr d14, [x2, -8]
> > b .L122
> >
> > and for non-inlined version:
> > .L118:
> > ldr d0, [x1, -248]
> > add w0, w0, 1
> > ldr d2, [x2, -8]
> > add x1, x1, 72
> > add x2, x2, 24
> > fmul d0, d3, d0
> > fmul d0, d0, d5
> > fmul d0, d0, d2
> > fmadd d1, d4, d0, d1
> > cmp w0, 4
> > bne .L118
>
> I wonder if you have profles. The inlined version has a
> non-empty latch block (looks like some PRE is happening
> there?). Eventually your uarch does not like the close
> (does your assembly show the layour as it is?) branches?
Hi Richard,
I have uploaded profiles obtained by perf here:
-O2: https://people.linaro.org/~prathamesh.kulkarni/o2_perf.data
-O2 -flto: https://people.linaro.org/~prathamesh.kulkarni/o2_lto_perf.data
For the above loop, it shows the following:
-O2:
0.01 │ f1c: ldur d0, [x1, #-248]
3.53 │ add w0, w0, #0x1
│ ldur d2, [x2, #-8]
3.54 │ add x1, x1, #0x48
│ add x2, x2, #0x18
5.89 │ fmul d0, d3, d0
14.12 │ fmul d0, d0, d5
14.14 │ fmul d0, d0, d2
14.13 │ fmadd d1, d4, d0, d1
0.00 │ cmp w0, #0x4
3.52 │ ↑ b.ne f1c
-O2 -flto:
5.47 |1124: ldur d15, [x1, #-248]
2.19 │ add w0, w0, #0x1
1.10 │ add x2, x2, #0x18
2.18 │ add x1, x1, #0x48
4.37 │ fmul d15, d17, d15
13.13 │ fmul d15, d15, d18
13.13 │ fmul d14, d15, d14
13.14 │ fmadd d16, d14, d31, d16
│ cmp w0, #0x4
3.28 │ ↓ b.eq 1154
0.00 │ ldur d14, [x2, #-8]
2.19 │ ↑ b 1124
IIUC, the biggest relative difference comes from the load from [x1, #-248],
which in the LTO case takes 5.47% of overall samples:
5.47 |1124: ldur d15, [x1, #-248]
while in the -O2 case it takes just 0.01%:
0.01 │ f1c: ldur d0, [x1, #-248]
I wonder if that's (one of) the main factor(s) behind the slowdown, or
whether it's not too relevant?
Thanks,
Prathamesh
>
> > which corresponds to the following loop in line 1014.
> > do n1=1,3
> > s(iii1,jjj1)=s(iii1,jjj1)
> > & +anisox(m1,k1,n1,l1)
> > & *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
> > & *weight
> >
> > I am not sure why would hoisting have any direct effect on this loop
> > except perhaps that hoisting allocated more reigsters, and led to
> > increased register pressure. Perhaps that's why it's using highered
> > number regs for code-gen in inlined version ? However disabling
> > hoisting in blocks 173 and 181, also leads to overall 6 extra spills
> > (by grepping for str to sp), so
> > hoisting is also helping here ? I am not sure how to proceed further,
> > and would be grateful for suggestions.
> >
> > Thanks,
> > Prathamesh
* Re: LTO slows down calculix by more than 10% on aarch64
2020-08-28 11:16 ` Prathamesh Kulkarni
@ 2020-08-28 11:57 ` Richard Biener
2020-08-31 11:21 ` Prathamesh Kulkarni
2020-08-28 12:03 ` Alexander Monakov
1 sibling, 1 reply; 25+ messages in thread
From: Richard Biener @ 2020-08-28 11:57 UTC (permalink / raw)
To: Prathamesh Kulkarni; +Cc: GCC Development
On Fri, Aug 28, 2020 at 1:17 PM Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Wed, 26 Aug 2020 at 16:50, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> > On Wed, Aug 26, 2020 at 12:34 PM Prathamesh Kulkarni via Gcc
> > <gcc@gcc.gnu.org> wrote:
> > >
> > > Hi,
> > > We're seeing a consistent regression >10% on calculix with -O2 -flto vs -O2
> > > on aarch64 in our validation CI. I tried to investigate this issue a
> > > bit, and it seems the regression comes from inlining of orthonl into
> > > e_c3d. Disabling that brings back the performance. However, inlining
> > > orthonl into e_c3d, increases it's size from 3187 to 3837 by around
> > > 16.9% which isn't too large.
> > >
> > > I have attached two test-cases, e_c3d.f that has orthonl manually
> > > inlined into e_c3d to "simulate" LTO's inlining, and e_c3d-orig.f,
> > > which contains unmodified function.
> > > (gauss.f is included by e_c3d.f). For reproducing, just passing -O2 is
> > > sufficient.
> > >
> > > It seems that inlining orthonl, causes 20 hoistings into block 181,
> > > which are then hoisted to block 173, in particular hoistings of w(1,
> > > 1) ... w(3, 3), which wasn't
> > > possible without inlining. The hoistings happen because of basic block
> > > that computes orthonl in line 672 has w(1, 1) ... w(3, 3) and the
> > > following block in line 1035 in e_c3d.f:
> > >
> > > senergy=
> > > & (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > & +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > & +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > >
> > > Disabling hoisting into blocks 173 (and 181), brings back most of the
> > > performance. I am not able to understand why (if?) these hoistings of
> > > w(1, 1) ...
> > > w(3, 3) are causing slowdown however. Looking at assembly, the hot
> > > code-path from perf in e_c3d shows following code-gen diff:
> > > For inlined version:
> > > .L122:
> > > ldr d15, [x1, -248]
> > > add w0, w0, 1
> > > add x2, x2, 24
> > > add x1, x1, 72
> > > fmul d15, d17, d15
> > > fmul d15, d15, d18
> > > fmul d14, d15, d14
> > > fmadd d16, d14, d31, d16
> > > cmp w0, 4
> > > beq .L121
> > > ldr d14, [x2, -8]
> > > b .L122
> > >
> > > and for non-inlined version:
> > > .L118:
> > > ldr d0, [x1, -248]
> > > add w0, w0, 1
> > > ldr d2, [x2, -8]
> > > add x1, x1, 72
> > > add x2, x2, 24
> > > fmul d0, d3, d0
> > > fmul d0, d0, d5
> > > fmul d0, d0, d2
> > > fmadd d1, d4, d0, d1
> > > cmp w0, 4
> > > bne .L118
> >
> > I wonder if you have profles. The inlined version has a
> > non-empty latch block (looks like some PRE is happening
> > there?). Eventually your uarch does not like the close
> > (does your assembly show the layour as it is?) branches?
> Hi Richard,
> I have uploaded profiles obtained by perf here:
> -O2: https://people.linaro.org/~prathamesh.kulkarni/o2_perf.data
> -O2 -flto: https://people.linaro.org/~prathamesh.kulkarni/o2_lto_perf.data
>
> For the above loop, it shows the following:
> -O2:
> 0.01 │ f1c: ldur d0, [x1, #-248]
> 3.53 │ add w0, w0, #0x1
> │ ldur d2, [x2, #-8]
> 3.54 │ add x1, x1, #0x48
> │ add x2, x2, #0x18
> 5.89 │ fmul d0, d3, d0
> 14.12 │ fmul d0, d0, d5
> 14.14 │ fmul d0, d0, d2
> 14.13 │ fmadd d1, d4, d0, d1
> 0.00 │ cmp w0, #0x4
> 3.52 │ ↑ b.ne f1c
>
> -O2 -flto:
> 5.47 |1124: ldur d15, [x1, #-248]
> 2.19 │ add w0, w0, #0x1
> 1.10 │ add x2, x2, #0x18
> 2.18 │ add x1, x1, #0x48
> 4.37 │ fmul d15, d17, d15
> 13.13 │ fmul d15, d15, d18
> 13.13 │ fmul d14, d15, d14
> 13.14 │ fmadd d16, d14, d31, d16
> │ cmp w0, #0x4
> 3.28 │ ↓ b.eq 1154
> 0.00 │ ldur d14, [x2, #-8]
> 2.19 │ ↑ b 1124
>
> IIUC, the biggest relative difference comes from load [x1, #-248]
> which in LTO's case takes 5.47% of overall samples:
> 5.47 |1124: ldur d15, [x1, #-248]
> while in case of -O2, it's just 0.01:
> 0.01 │ f1c: ldur d0, [x1, #-248]
>
> I wonder if that's (one of) the main factor(s) behind slowdown or it's
> not too relevant ?
This looks more like the branch, since branch costs are usually
attributed to the target rather than to the branch itself. You could
try re-ordering the code so the loop entry jumps around the
latch, which can then fall through, to see if that makes a difference.
Richard.
> Thanks,
> Prathamesh
> >
> > > which corresponds to the following loop in line 1014.
> > > do n1=1,3
> > > s(iii1,jjj1)=s(iii1,jjj1)
> > > & +anisox(m1,k1,n1,l1)
> > > & *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
> > > & *weight
> > >
> > > I am not sure why would hoisting have any direct effect on this loop
> > > except perhaps that hoisting allocated more reigsters, and led to
> > > increased register pressure. Perhaps that's why it's using highered
> > > number regs for code-gen in inlined version ? However disabling
> > > hoisting in blocks 173 and 181, also leads to overall 6 extra spills
> > > (by grepping for str to sp), so
> > > hoisting is also helping here ? I am not sure how to proceed further,
> > > and would be grateful for suggestions.
> > >
> > > Thanks,
> > > Prathamesh
* Re: LTO slows down calculix by more than 10% on aarch64
2020-08-28 11:16 ` Prathamesh Kulkarni
2020-08-28 11:57 ` Richard Biener
@ 2020-08-28 12:03 ` Alexander Monakov
2020-08-31 11:23 ` Prathamesh Kulkarni
1 sibling, 1 reply; 25+ messages in thread
From: Alexander Monakov @ 2020-08-28 12:03 UTC (permalink / raw)
To: Prathamesh Kulkarni; +Cc: Richard Biener, GCC Development
On Fri, 28 Aug 2020, Prathamesh Kulkarni via Gcc wrote:
> I wonder if that's (one of) the main factor(s) behind slowdown or it's
> not too relevant ?
Probably not. Some advice to make your search more directed:
Pass '-n' to 'perf report'. Relative sample ratios are hard to reason about
when they are computed against different bases, it's much easier to see that
a loop is slowing down if it went from 4000 to 4500 in absolute sample count
as opposed to 90% to 91% in relative sample ratio.
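A minimal sketch of why the absolute counts are easier to read, using the
4000 and 4500 sample counts from the paragraph above. The number of samples
outside the loop (444) is a hypothetical value, chosen only so that the
relative share comes out near 90%; it is not taken from any profile.

```python
def relative_share(loop_samples: int, other_samples: int) -> float:
    """Percentage of all samples that fall inside the loop."""
    return 100.0 * loop_samples / (loop_samples + other_samples)


# Hypothetical number of samples outside the loop, identical in both runs.
OTHER_SAMPLES = 444

before = relative_share(4000, OTHER_SAMPLES)
after = relative_share(4500, OTHER_SAMPLES)

# Growth of the loop's own sample count, independent of the rest of the run.
absolute_growth = 100.0 * (4500 - 4000) / 4000

print(f"relative share of the loop: {before:.1f}% -> {after:.1f}%")
print(f"absolute growth of loop samples: {absolute_growth:.1f}%")
```

The relative share barely moves, while the absolute count shows the loop
itself collecting noticeably more samples.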
Before diving down 'perf report', be sure to fully account for differences
in 'perf stat' output. Do the programs execute the same number of instructions,
so that the difference is only in scheduling? Do the programs suffer from the same
amount of branch mispredictions? Please show output of 'perf stat' on the
mailing list too, so everyone is on the same page about that.
I also suspect that the dramatic slowdown has to do with the extra branch.
Your CPU might have some specialized counters for branch prediction, see
'perf list'.
Alexander
* Re: LTO slows down calculix by more than 10% on aarch64
2020-08-28 11:57 ` Richard Biener
@ 2020-08-31 11:21 ` Prathamesh Kulkarni
2020-08-31 11:40 ` Jan Hubicka
0 siblings, 1 reply; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-08-31 11:21 UTC (permalink / raw)
To: Richard Biener; +Cc: GCC Development
On Fri, 28 Aug 2020 at 17:27, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Fri, Aug 28, 2020 at 1:17 PM Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Wed, 26 Aug 2020 at 16:50, Richard Biener <richard.guenther@gmail.com> wrote:
> > >
> > > On Wed, Aug 26, 2020 at 12:34 PM Prathamesh Kulkarni via Gcc
> > > <gcc@gcc.gnu.org> wrote:
> > > >
> > > > Hi,
> > > > We're seeing a consistent regression >10% on calculix with -O2 -flto vs -O2
> > > > on aarch64 in our validation CI. I tried to investigate this issue a
> > > > bit, and it seems the regression comes from inlining of orthonl into
> > > > e_c3d. Disabling that brings back the performance. However, inlining
> > > > orthonl into e_c3d, increases it's size from 3187 to 3837 by around
> > > > 16.9% which isn't too large.
> > > >
> > > > I have attached two test-cases, e_c3d.f that has orthonl manually
> > > > inlined into e_c3d to "simulate" LTO's inlining, and e_c3d-orig.f,
> > > > which contains unmodified function.
> > > > (gauss.f is included by e_c3d.f). For reproducing, just passing -O2 is
> > > > sufficient.
> > > >
> > > > It seems that inlining orthonl, causes 20 hoistings into block 181,
> > > > which are then hoisted to block 173, in particular hoistings of w(1,
> > > > 1) ... w(3, 3), which wasn't
> > > > possible without inlining. The hoistings happen because of basic block
> > > > that computes orthonl in line 672 has w(1, 1) ... w(3, 3) and the
> > > > following block in line 1035 in e_c3d.f:
> > > >
> > > > senergy=
> > > > & (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > & +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > & +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > >
> > > > Disabling hoisting into blocks 173 (and 181), brings back most of the
> > > > performance. I am not able to understand why (if?) these hoistings of
> > > > w(1, 1) ...
> > > > w(3, 3) are causing slowdown however. Looking at assembly, the hot
> > > > code-path from perf in e_c3d shows following code-gen diff:
> > > > For inlined version:
> > > > .L122:
> > > > ldr d15, [x1, -248]
> > > > add w0, w0, 1
> > > > add x2, x2, 24
> > > > add x1, x1, 72
> > > > fmul d15, d17, d15
> > > > fmul d15, d15, d18
> > > > fmul d14, d15, d14
> > > > fmadd d16, d14, d31, d16
> > > > cmp w0, 4
> > > > beq .L121
> > > > ldr d14, [x2, -8]
> > > > b .L122
> > > >
> > > > and for non-inlined version:
> > > > .L118:
> > > > ldr d0, [x1, -248]
> > > > add w0, w0, 1
> > > > ldr d2, [x2, -8]
> > > > add x1, x1, 72
> > > > add x2, x2, 24
> > > > fmul d0, d3, d0
> > > > fmul d0, d0, d5
> > > > fmul d0, d0, d2
> > > > fmadd d1, d4, d0, d1
> > > > cmp w0, 4
> > > > bne .L118
> > >
> > > I wonder if you have profles. The inlined version has a
> > > non-empty latch block (looks like some PRE is happening
> > > there?). Eventually your uarch does not like the close
> > > (does your assembly show the layour as it is?) branches?
> > Hi Richard,
> > I have uploaded profiles obtained by perf here:
> > -O2: https://people.linaro.org/~prathamesh.kulkarni/o2_perf.data
> > -O2 -flto: https://people.linaro.org/~prathamesh.kulkarni/o2_lto_perf.data
> >
> > For the above loop, it shows the following:
> > -O2:
> > 0.01 │ f1c: ldur d0, [x1, #-248]
> > 3.53 │ add w0, w0, #0x1
> > │ ldur d2, [x2, #-8]
> > 3.54 │ add x1, x1, #0x48
> > │ add x2, x2, #0x18
> > 5.89 │ fmul d0, d3, d0
> > 14.12 │ fmul d0, d0, d5
> > 14.14 │ fmul d0, d0, d2
> > 14.13 │ fmadd d1, d4, d0, d1
> > 0.00 │ cmp w0, #0x4
> > 3.52 │ ↑ b.ne f1c
> >
> > -O2 -flto:
> > 5.47 |1124: ldur d15, [x1, #-248]
> > 2.19 │ add w0, w0, #0x1
> > 1.10 │ add x2, x2, #0x18
> > 2.18 │ add x1, x1, #0x48
> > 4.37 │ fmul d15, d17, d15
> > 13.13 │ fmul d15, d15, d18
> > 13.13 │ fmul d14, d15, d14
> > 13.14 │ fmadd d16, d14, d31, d16
> > │ cmp w0, #0x4
> > 3.28 │ ↓ b.eq 1154
> > 0.00 │ ldur d14, [x2, #-8]
> > 2.19 │ ↑ b 1124
> >
> > IIUC, the biggest relative difference comes from load [x1, #-248]
> > which in LTO's case takes 5.47% of overall samples:
> > 5.47 |1124: ldur d15, [x1, #-248]
> > while in case of -O2, it's just 0.01:
> > 0.01 │ f1c: ldur d0, [x1, #-248]
> >
> > I wonder if that's (one of) the main factor(s) behind slowdown or it's
> > not too relevant ?
>
> This looks more like the branch since usually branch costs
> are attributed to the target rather than the branch itself. You could
> try re-ordering the code so the loop entry jumps around the
> latch which can then fall thru so see if that makes a difference.
Thanks for the suggestions.
Is it possible to modify the assembly files emitted after the ltrans phase?
IIUC, the linker invokes lto1 twice, for wpa and ltrans, and then links
the resulting object files, which doesn't make it possible to hand-edit
the assembly files post ltrans?
In particular, I wanted to modify calculix.ltrans16.ltrans.s, which
contains e_c3d, to avoid the extra branch.
(If that doesn't work out, I can proceed with manually inlining in the
source and then modifying the generated assembly).
Thanks,
Prathamesh
>
> Richard.
>
> > Thanks,
> > Prathamesh
> > >
> > > > which corresponds to the following loop in line 1014.
> > > > do n1=1,3
> > > > s(iii1,jjj1)=s(iii1,jjj1)
> > > > & +anisox(m1,k1,n1,l1)
> > > > & *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
> > > > & *weight
> > > >
> > > > I am not sure why would hoisting have any direct effect on this loop
> > > > except perhaps that hoisting allocated more reigsters, and led to
> > > > increased register pressure. Perhaps that's why it's using highered
> > > > number regs for code-gen in inlined version ? However disabling
> > > > hoisting in blocks 173 and 181, also leads to overall 6 extra spills
> > > > (by grepping for str to sp), so
> > > > hoisting is also helping here ? I am not sure how to proceed further,
> > > > and would be grateful for suggestions.
> > > >
> > > > Thanks,
> > > > Prathamesh
* Re: LTO slows down calculix by more than 10% on aarch64
2020-08-28 12:03 ` Alexander Monakov
@ 2020-08-31 11:23 ` Prathamesh Kulkarni
2020-09-04 9:52 ` Prathamesh Kulkarni
0 siblings, 1 reply; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-08-31 11:23 UTC (permalink / raw)
To: Alexander Monakov; +Cc: Richard Biener, GCC Development
On Fri, 28 Aug 2020 at 17:33, Alexander Monakov <amonakov@ispras.ru> wrote:
>
> On Fri, 28 Aug 2020, Prathamesh Kulkarni via Gcc wrote:
>
> > I wonder if that's (one of) the main factor(s) behind slowdown or it's
> > not too relevant ?
>
> Probably not. Some advice to make your search more directed:
>
> Pass '-n' to 'perf report'. Relative sample ratios are hard to reason about
> when they are computed against different bases, it's much easier to see that
> a loop is slowing down if it went from 4000 to 4500 in absolute sample count
> as opposed to 90% to 91% in relative sample ratio.
>
> Before diving down 'perf report', be sure to fully account for differences
> in 'perf stat' output. Do the programs execute the same number of instructions,
> so the difference only in scheduling? Do the programs suffer from the same
> amount of branch mispredictions? Please show output of 'perf stat' on the
> mailing list too, so everyone is on the same page about that.
>
> I also suspect that the dramatic slowdown has to do with the extra branch.
> Your CPU might have some specialized counters for branch prediction, see
> 'perf list'.
Hi Alexander,
Thanks for the suggestions! I am in the process of doing the
benchmarking experiments,
and will post the results soon.
Thanks,
Prathamesh
>
> Alexander
* Re: LTO slows down calculix by more than 10% on aarch64
2020-08-31 11:21 ` Prathamesh Kulkarni
@ 2020-08-31 11:40 ` Jan Hubicka
0 siblings, 0 replies; 25+ messages in thread
From: Jan Hubicka @ 2020-08-31 11:40 UTC (permalink / raw)
To: Prathamesh Kulkarni; +Cc: Richard Biener, GCC Development
> Thanks for the suggestions.
> Is it possible to modify assembly files emitted after ltrans phase ?
> IIUC, the linker invokes lto1 twice, for wpa and ltrans,and then links
> the obtained object files which doesn't make it possible to hand edit
> assembly files post ltrans ?
> In particular, I wanted to modify calculix.ltrans16.ltrans.s, which
> contains e_c3d to avoid the extra branch.
> (If that doesn't work out, I can proceed with manually inlining in the
> source and then modifying generated assembly).
It is not intended to work that way, but for a smaller benchmark you can
just keep the .s files, modify them, and then compile again with gfortran
*.s or so.
Honza
>
> Thanks,
> Prathamesh
> >
> > Richard.
> >
> > > Thanks,
> > > Prathamesh
> > > >
> > > > > which corresponds to the following loop in line 1014.
> > > > > do n1=1,3
> > > > > s(iii1,jjj1)=s(iii1,jjj1)
> > > > > & +anisox(m1,k1,n1,l1)
> > > > > & *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
> > > > > & *weight
> > > > >
> > > > > I am not sure why would hoisting have any direct effect on this loop
> > > > > except perhaps that hoisting allocated more reigsters, and led to
> > > > > increased register pressure. Perhaps that's why it's using highered
> > > > > number regs for code-gen in inlined version ? However disabling
> > > > > hoisting in blocks 173 and 181, also leads to overall 6 extra spills
> > > > > (by grepping for str to sp), so
> > > > > hoisting is also helping here ? I am not sure how to proceed further,
> > > > > and would be grateful for suggestions.
> > > > >
> > > > > Thanks,
> > > > > Prathamesh
* Re: LTO slows down calculix by more than 10% on aarch64
2020-08-31 11:23 ` Prathamesh Kulkarni
@ 2020-09-04 9:52 ` Prathamesh Kulkarni
2020-09-04 11:38 ` Alexander Monakov
0 siblings, 1 reply; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-09-04 9:52 UTC (permalink / raw)
To: Alexander Monakov; +Cc: Richard Biener, GCC Development
On Mon, 31 Aug 2020 at 16:53, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Fri, 28 Aug 2020 at 17:33, Alexander Monakov <amonakov@ispras.ru> wrote:
> >
> > On Fri, 28 Aug 2020, Prathamesh Kulkarni via Gcc wrote:
> >
> > > I wonder if that's (one of) the main factor(s) behind slowdown or it's
> > > not too relevant ?
> >
> > Probably not. Some advice to make your search more directed:
> >
> > Pass '-n' to 'perf report'. Relative sample ratios are hard to reason about
> > when they are computed against different bases, it's much easier to see that
> > a loop is slowing down if it went from 4000 to 4500 in absolute sample count
> > as opposed to 90% to 91% in relative sample ratio.
> >
> > Before diving down 'perf report', be sure to fully account for differences
> > in 'perf stat' output. Do the programs execute the same number of instructions,
> > so the difference only in scheduling? Do the programs suffer from the same
> > amount of branch mispredictions? Please show output of 'perf stat' on the
> > mailing list too, so everyone is on the same page about that.
> >
> > I also suspect that the dramatic slowdown has to do with the extra branch.
> > Your CPU might have some specialized counters for branch prediction, see
> > 'perf list'.
> Hi Alexander,
> Thanks for the suggestions! I am in the process of doing the
> benchmarking experiments,
> and will post the results soon.
Hi,
I obtained perf stat results for the following benchmark runs:
-O2:
7856832.692380 task-clock (msec) # 1.000 CPUs utilized
3758 context-switches # 0.000 K/sec
40 cpu-migrations # 0.000 K/sec
40847 page-faults # 0.005 K/sec
7856782413676 cycles # 1.000 GHz
6034510093417 instructions # 0.77 insn per cycle
363937274287 branches # 46.321 M/sec
48557110132 branch-misses # 13.34% of all branches
-O2 with orthonl inlined:
8319643.114380 task-clock (msec) # 1.000 CPUs utilized
4285 context-switches # 0.001 K/sec
28 cpu-migrations # 0.000 K/sec
40843 page-faults # 0.005 K/sec
8319591038295 cycles # 1.000 GHz
6276338800377 instructions # 0.75 insn per cycle
467400726106 branches # 56.180 M/sec
45986364011 branch-misses # 9.84% of all branches
-O2 with orthonl inlined and PRE disabled (this removes the extra branches):
8207331.088040 task-clock (msec) # 1.000 CPUs utilized
2266 context-switches # 0.000 K/sec
32 cpu-migrations # 0.000 K/sec
40846 page-faults # 0.005 K/sec
8207292032467 cycles # 1.000 GHz
6035724436440 instructions # 0.74 insn per cycle
364415440156 branches # 44.401 M/sec
53138327276 branch-misses # 14.58% of all branches
-O2 with orthonl inlined and hoisting disabled:
7797265.206850 task-clock (msec) # 1.000 CPUs utilized
3139 context-switches # 0.000 K/sec
20 cpu-migrations # 0.000 K/sec
40846 page-faults # 0.005 K/sec
7797221351467 cycles # 1.000 GHz
6187348757324 instructions # 0.79 insn per cycle
461840800061 branches # 59.231 M/sec
26920311761 branch-misses # 5.83% of all branches
Perf profiles for
-O2 -fno-code-hoisting and inlined orthonl:
https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
3196866 |1f04: ldur d1, [x1, #-248]
216348301800│ add w0, w0, #0x1
985098 | add x2, x2, #0x18
216215999206│ add x1, x1, #0x48
215630376504│ fmul d1, d5, d1
863829148015│ fmul d1, d1, d6
864228353526│ fmul d0, d1, d0
864568163014│ fmadd d2, d0, d16, d2
│ cmp w0, #0x4
216125427594│ ↓ b.eq 1f34
15010377│ ldur d0, [x2, #-8]
143753737468│ ↑ b 1f04
-O2 with inlined orthonl:
https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
359871503840│ 1ef8: ldur d15, [x1, #-248]
144055883055│ add w0, w0, #0x1
72262104254│ add x2, x2, #0x18
143991169721│ add x1, x1, #0x48
288648917780│ fmul d15, d17, d15
864665644756│ fmul d15, d15, d18
863868426387│ fmul d14, d15, d14
865228159813│ fmadd d16, d14, d31, d16
245967│ cmp w0, #0x4
215396760545│ ↓ b.eq 1f28
704732365│ ldur d14, [x2, #-8]
143775979620│ ↑ b 1ef8
AFAIU,
(a) Disabling PRE removes the extra branch around the loop, but that
gives only a slight performance increase (around 1.3%).
(b) Disabling hoisting brings performance back to (slightly better than)
-O2 without inlining orthonl. The generated code for the loop has a
similar layout to -O2 with inlined orthonl, but uses lower-numbered
registers. Again, I am not sure if it's relevant, but the load from
[x1, #-248] seems to take much less time with hoisting disabled. I
tried to check whether this was an alignment issue, but that seems not
to be the case: in both cases (with / without hoisting) the address
pointed to by x1 was properly aligned, and the two addresses differed
by only 32 bytes.
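A minimal sketch of the arithmetic behind the percentages in (a) and (b),
using only the "cycles" counters from the four perf stat outputs above. The
dictionary keys are labels made up for this sketch.

```python
# "cycles" counters copied from the four 'perf stat' runs above.
CYCLES = {
    "O2": 7856782413676,
    "O2_inlined": 8319591038295,
    "O2_inlined_no_pre": 8207292032467,
    "O2_inlined_no_hoist": 7797221351467,
}


def pct_change(new: int, old: int) -> float:
    """Relative change of 'new' against 'old', in percent."""
    return 100.0 * (new - old) / old


# Cost of inlining orthonl, relative to plain -O2.
inline_cost = pct_change(CYCLES["O2_inlined"], CYCLES["O2"])
# Effect of disabling PRE, relative to the inlined build.
pre_gain = pct_change(CYCLES["O2_inlined_no_pre"], CYCLES["O2_inlined"])
# Inlined build with hoisting disabled, relative to plain -O2.
no_hoist_vs_o2 = pct_change(CYCLES["O2_inlined_no_hoist"], CYCLES["O2"])

print(f"inlined vs -O2:            {inline_cost:+.2f}% cycles")
print(f"PRE disabled vs inlined:   {pre_gain:+.2f}% cycles")
print(f"hoisting disabled vs -O2:  {no_hoist_vs_o2:+.2f}% cycles")
```

A negative value means fewer cycles, i.e. a faster run.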
Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Alexander
* Re: LTO slows down calculix by more than 10% on aarch64
2020-09-04 9:52 ` Prathamesh Kulkarni
@ 2020-09-04 11:38 ` Alexander Monakov
2020-09-21 9:49 ` Prathamesh Kulkarni
0 siblings, 1 reply; 25+ messages in thread
From: Alexander Monakov @ 2020-09-04 11:38 UTC (permalink / raw)
To: Prathamesh Kulkarni; +Cc: Richard Biener, GCC Development
> I obtained perf stat results for following benchmark runs:
>
> -O2:
>
> 7856832.692380 task-clock (msec) # 1.000 CPUs utilized
> 3758 context-switches # 0.000 K/sec
> 40 cpu-migrations # 0.000 K/sec
> 40847 page-faults # 0.005 K/sec
> 7856782413676 cycles # 1.000 GHz
> 6034510093417 instructions # 0.77 insn per cycle
> 363937274287 branches # 46.321 M/sec
> 48557110132 branch-misses # 13.34% of all branches
(ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
enough for this kind of code)
> -O2 with orthonl inlined:
>
> 8319643.114380 task-clock (msec) # 1.000 CPUs utilized
> 4285 context-switches # 0.001 K/sec
> 28 cpu-migrations # 0.000 K/sec
> 40843 page-faults # 0.005 K/sec
> 8319591038295 cycles # 1.000 GHz
> 6276338800377 instructions # 0.75 insn per cycle
> 467400726106 branches # 56.180 M/sec
> 45986364011 branch-misses # 9.84% of all branches
So +100e9 branches, but +240e9 instructions and +460e9 cycles, probably implying
that extra instructions are appearing in this loop nest, but not in the innermost
loop. As a reminder for others, the innermost loop has only 3 iterations.
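The deltas above are rounded. A minimal sketch that recomputes them from the
raw counters of the "-O2" and "-O2 with orthonl inlined" runs (the variable
names are made up for this sketch):

```python
# Counters copied from the 'perf stat' output of the two runs.
O2 = {
    "cycles": 7856782413676,
    "instructions": 6034510093417,
    "branches": 363937274287,
}
O2_INLINED = {
    "cycles": 8319591038295,
    "instructions": 6276338800377,
    "branches": 467400726106,
}

# Absolute increase of each counter caused by inlining orthonl.
delta = {name: O2_INLINED[name] - O2[name] for name in O2}

for name, value in delta.items():
    print(f"{name:>12}: {value / 1e9:+.1f}e9")
```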
> -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
>
> 8207331.088040 task-clock (msec) # 1.000 CPUs utilized
> 2266 context-switches # 0.000 K/sec
> 32 cpu-migrations # 0.000 K/sec
> 40846 page-faults # 0.005 K/sec
> 8207292032467 cycles # 1.000 GHz
> 6035724436440 instructions # 0.74 insn per cycle
> 364415440156 branches # 44.401 M/sec
> 53138327276 branch-misses # 14.58% of all branches
This seems to match baseline in terms of instruction count, but without PRE
the loop nest may be carrying some dependencies over memory. I would simply
check the assembly for the entire 6-level loop nest in question, I hope it's
not very complicated (though Fortran array addressing...).
> -O2 with orthonl inlined and hoisting disabled:
>
> 7797265.206850 task-clock (msec) # 1.000 CPUs utilized
> 3139 context-switches # 0.000 K/sec
> 20 cpu-migrations # 0.000 K/sec
> 40846 page-faults # 0.005 K/sec
> 7797221351467 cycles # 1.000 GHz
> 6187348757324 instructions # 0.79 insn per cycle
> 461840800061 branches # 59.231 M/sec
> 26920311761 branch-misses # 5.83% of all branches
There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
I don't think the former fully covers the latter (there's also a 90e9 reduction
in insn count).
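One way to put numbers on that doubt is to ask how many cycles each avoided
mispredict would have to cost if mispredicts alone explained the cycle
reduction. The sketch below does only that division; the counter values come
from the quoted perf stat blocks, the variable names are illustrative, and
whether the result exceeds the real misprediction penalty depends on the core,
which the thread does not state.

```python
# "-O2 with orthonl inlined" versus the same with hoisting disabled.
inlined = {
    "cycles": 8319591038295,
    "instructions": 6276338800377,
    "branch_misses": 45986364011,
}
no_hoist = {
    "cycles": 7797221351467,
    "instructions": 6187348757324,
    "branch_misses": 26920311761,
}

saved_cycles = inlined["cycles"] - no_hoist["cycles"]
saved_misses = inlined["branch_misses"] - no_hoist["branch_misses"]
saved_insns = inlined["instructions"] - no_hoist["instructions"]

# Cost per avoided mispredict needed to account for the whole cycle delta.
cycles_per_avoided_miss = saved_cycles / saved_misses

print(f"saved cycles:       {saved_cycles / 1e9:.1f}e9")
print(f"saved branch misses: {saved_misses / 1e9:.1f}e9")
print(f"saved instructions:  {saved_insns / 1e9:.1f}e9")
print(f"cycles per avoided miss needed: {cycles_per_avoided_miss:.1f}")
```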
Given that the inner loop iterates only 3 times, my main suggestion is to
consider what the profile for the entire loop nest looks like (it's 6 loops deep,
each iterating only 3 times).
> Perf profiles for
> -O2 -fno-code-hoisting and inlined orthonl:
> https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
>
> 3196866 │1f04: ldur d1, [x1, #-248]
> 216348301800│ add w0, w0, #0x1
> 985098 │ add x2, x2, #0x18
> 216215999206│ add x1, x1, #0x48
> 215630376504│ fmul d1, d5, d1
> 863829148015│ fmul d1, d1, d6
> 864228353526│ fmul d0, d1, d0
> 864568163014│ fmadd d2, d0, d16, d2
> │ cmp w0, #0x4
> 216125427594│ ↓ b.eq 1f34
> 15010377│ ldur d0, [x2, #-8]
> 143753737468│ ↑ b 1f04
>
> -O2 with inlined orthonl:
> https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
>
> 359871503840│ 1ef8: ldur d15, [x1, #-248]
> 144055883055│ add w0, w0, #0x1
> 72262104254│ add x2, x2, #0x18
> 143991169721│ add x1, x1, #0x48
> 288648917780│ fmul d15, d17, d15
> 864665644756│ fmul d15, d15, d18
> 863868426387│ fmul d14, d15, d14
> 865228159813│ fmadd d16, d14, d31, d16
> 245967│ cmp w0, #0x4
> 215396760545│ ↓ b.eq 1f28
> 704732365│ ldur d14, [x2, #-8]
> 143775979620│ ↑ b 1ef8
This indicates that the loop only covers about 46-48% of overall time.
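The percentage can be reproduced by summing the per-instruction counts of the
innermost loop and dividing by the total cycle count from perf stat. The Python
sketch below assumes that the left-hand column of the quoted annotations is a
cycle count comparable to the perf stat "cycles" totals; the thread does not
say so explicitly, and the function name loop_share is illustrative.

```python
# Left-hand column of the two quoted annotations (a blank entry counts as 0).
loop_counts_no_hoist = [
    3196866, 216348301800, 985098, 216215999206, 215630376504,
    863829148015, 864228353526, 864568163014, 0, 216125427594,
    15010377, 143753737468,
]
loop_counts_inlined = [
    359871503840, 144055883055, 72262104254, 143991169721, 288648917780,
    864665644756, 863868426387, 865228159813, 245967, 215396760545,
    704732365, 143775979620,
]

# Total cycles from the corresponding perf stat runs.
total_cycles_no_hoist = 7797221351467
total_cycles_inlined = 8319591038295


def loop_share(counts, total_cycles):
    """Fraction of all cycles attributed to the listed instructions."""
    return sum(counts) / total_cycles


share_no_hoist = loop_share(loop_counts_no_hoist, total_cycles_no_hoist)
share_inlined = loop_share(loop_counts_inlined, total_cycles_inlined)
print(f"hoisting disabled: {share_no_hoist:.1%}")
print(f"orthonl inlined:   {share_inlined:.1%}")
```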
High count on the initial ldur instruction could be explained if the loop
is not entered by "fallthru" from the preceding block, or if its backedge
is mispredicted. Sampling mispredictions should be possible with perf record,
and you may be able to check if loop entry is fallthrough by inspecting
assembly.
It may also be possible to check if code alignment matters, by compiling with
-falign-loops=32.
Alexander
* Re: LTO slows down calculix by more than 10% on aarch64
2020-09-04 11:38 ` Alexander Monakov
@ 2020-09-21 9:49 ` Prathamesh Kulkarni
2020-09-21 12:44 ` Prathamesh Kulkarni
0 siblings, 1 reply; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-09-21 9:49 UTC (permalink / raw)
To: Alexander Monakov; +Cc: Richard Biener, GCC Development
[-- Attachment #1: Type: text/plain, Size: 8873 bytes --]
Hi,
Thanks a lot for the detailed feedback, and I am sorry for the late response.
The hoisting region is:
if(mattyp.eq.1) then
4 loops
elseif(mattyp.eq.2) then
{
orthonl inlined into basic block;
loads w[0] .. w[8]
}
else
6 loops // load anisox
followed by basic block:
senergy=
& (s11*w(1,1)+s12*(w(1,2)+w(2,1))
& +s13*(w(1,3)+w(3,1))+s22*w(2,2)
& +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
s(ii1,jj1)=s(ii1,jj1)+senergy
s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
Code hoisting moves the loads of w[0] .. w[8] from the orthonl block and the
senergy block into block 181, which is:
if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
and they are then hoisted further into block 173:
if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
From block 181, we have two paths towards senergy block (bb 194):
bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
AND
bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
which has a path length of around 18 blocks.
(bb 194 post-dominates bb 181 and bb 173).
Disabling only load hoisting into blocks 173 and 181
(simply not inserting a pre_expr when pre_expr->kind == REFERENCE)
avoids hoisting of the 'w' array and brings back most of the performance.
Does that verify that it is the hoisting of the
'w' array (w[0] ... w[8]) that is causing the slowdown?
I obtained perf profiles of the 6 loops with full hoisting and with hoisting
of the 'w' array disabled, and the most drastic difference was
for the ldur instruction:
With full hoisting:
359871503840│ 1ef8: ldur d15, [x1, #-248]
With hoisting of the 'w' array disabled:
3441224 │1edc: ldur d1, [x1, #-248]
(The loop entry seems to be fall thru in both cases. I have attached
profiles for both cases).
IIUC, the instruction loads the first element of the anisox array,
which makes me wonder whether the slower version suffers from data-cache misses.
I ran perf script on perf data for L1-dcache-load-misses with period = 1 million,
and it reported two cache misses on the ldur instruction in the full hoisting case,
and zero in the case with load hoisting disabled.
So I wonder if the slowdown happens because hoisting of the 'w' array
possibly evicts anisox, causing a cache miss inside the inner loop
and making the load slower?
Hoisting also seems to reduce the overall number of cache misses, though.
With hoisting of the 'w' array disabled there were a total of 463
cache misses, while with full hoisting there were 357 cache misses
(with period = 1 million).
Does that happen because hoisting reduces cache misses along
the orthonl path (bb 173 -> bb 181 -> bb 182 -> bb 194)?
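As a rough sanity check, the reported counts can be scaled by the sampling
period and compared with the extra cycles attributed to the ldur. The Python
sketch below assumes the reported numbers are sample counts taken with
period = 1 million, and that the left-hand column of the profiles is a cycle
count; both are readings of the mail rather than something it states outright.
With only two samples on the ldur the estimate is very noisy.

```python
SAMPLE_PERIOD = 1_000_000  # L1-dcache-load-misses per sample, as stated


def estimated_events(samples, period=SAMPLE_PERIOD):
    """Scale a sample count back to an approximate event count."""
    return samples * period


misses_full_hoist = estimated_events(357)
misses_no_load_hoist = estimated_events(463)
ldur_misses_full_hoist = estimated_events(2)

# Count on the loop-head ldur with full hoisting minus the count without.
extra_cycles_on_ldur = 359871503840 - 3441224

# Cycles each sampled miss would need to cost to explain that difference.
cycles_per_miss_needed = extra_cycles_on_ldur / ldur_misses_full_hoist

print(f"estimated misses, full hoisting:    {misses_full_hoist:.2e}")
print(f"estimated misses, no load hoisting: {misses_no_load_hoist:.2e}")
print(f"cycles per ldur miss needed:        {cycles_per_miss_needed:.0f}")
```

Taken at face value, the last figure suggests that the sampled misses on their
own may not account for the extra cycles on the ldur, but two samples are far
too few to be sure.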
Thanks,
Prathamesh
[-- Attachment #2: fullhoist_profile.txt --]
[-- Type: text/plain, Size: 12850 bytes --]
   884982389 │1e40: ldr x0, [sp, #448]
             │      fmov d19, d6
   871517886 │      ldr x1, [sp, #808]
             │      add x16, sp, #0x720
   904652642 │      ldr x13, [sp, #784]
             │      sub x15, x26, #0x1
   892180199 │      mov x24, x27
             │      add x28, x27, #0xf8
   881362543 │      add x22, x1, x0, lsl #3
             │      mov x12, #0x9 // #9
   906876972 │      mov x23, #0x1 // #1
  5342906864 │1e6c: fmov d17, d1
  2622786801 │      mov x14, #0x1778 // #6008
             │      mov x20, x28
  2680397945 │      add x19, sp, x14
             │      mov x18, x24
  2629152729 │      mov x21, x30
             │      ldr d16, [x22]
  4571598336 │      mov x17, #0x1e // #30
 15904018941 │1e8c: mov x11, x19
  8106237022 │      mov x10, x20
             │      mov x14, x21
  7958740225 │      mov x9, x18
             │      mov x8, #0x1b // #27
 41353477432 │1ea0: ldr d14, [x9]
  1220553185 │      fmov d18, d22
 22852558475 │      fmov d20, d19
  1199867833 │      mov x3, x11
 22706386191 │      mov x7, x16
  1177543527 │      mov x6, x10
 22767111709 │      fmul d14, d17, d14
  1195454897 │      mov x5, #0x1 // #1
 94868835951 │      fmadd d16, d14, d31, d16
 48021203056 │1ec4: ldur d15, [x6, #-248]
 30707657072 │      sub x4, x3, #0x140
 41301831015 │      fmov d14, d19
 32467499777 │      mov x2, x13
 39498561992 │      mov x1, x3
 32503985332 │      mov w0, #0x1 // #1
 39636367978 │      fmul d15, d17, d15
 56642417403 │      ldr d21, [x4, x12, lsl #3]
215900325343 │      fmul d21, d17, d21
 49939836468 │      fmul d15, d15, d20
238451679574 │      fmul d20, d21, d18
 49692127013 │      fmadd d15, d15, d31, d16
287649913912 │      fmadd d16, d20, d31, d15
359871503840 │1ef8: ldur d15, [x1, #-248]
144055883055 │      add w0, w0, #0x1
 72262104254 │      add x2, x2, #0x18
143991169721 │      add x1, x1, #0x48
288648917780 │      fmul d15, d17, d15
864665644756 │      fmul d15, d15, d18
863868426387 │      fmul d14, d15, d14
865228159813 │      fmadd d16, d14, d31, d16
      245967 │      cmp w0, #0x4
215396760545 │    ↓ b.eq 1f28
   704732365 │      ldur d14, [x2, #-8]
143775979620 │    ↑ b 1ef8
  2623253706 │1f28: add x5, x5, #0x1
 71700007726 │      add x6, x6, #0x48
   291326727 │      add x3, x3, #0x8
 41539387956 │      cmp x5, #0x4
   291327452 │    ↓ b.eq 1f4c
152721910227 │      ldr d18, [x7, x15, lsl #3]
  8561615599 │      add x7, x7, #0x18
 96142935717 │      ldur d20, [x7, #-24]
  8495464096 │    ↑ b 1ec4
201164546300 │1f4c: add x8, x8, #0x1b
 22086088222 │      add x9, x9, #0xd8
  1882100212 │      add x14, x14, #0x18
 22119311849 │      add x10, x10, #0xd8
  1892034271 │      add x11, x11, #0xd8
 13413581701 │      cmp x8, #0x6c
  1191551884 │    ↓ b.eq 1f70
 26310755425 │      ldur d17, [x14, #-8]
  1210506566 │    ↑ b 1ea0
 71960439728 │1f70: add x17, x17, #0x3
             │      add x18, x18, #0x18
  8069920125 │      add x20, x20, #0x18
             │      add x19, x19, #0x18
  4645045210 │      cmp x17, #0x27
             │    ↓ b.eq 1f90
 10962695888 │      ldr d17, [x21], #8
             │    ↑ b 1e8c
 23927242012 │1f90: add x23, x23, #0x1
             │      str d16, [x22]
  2672842806 │      add x16, x16, #0x8
             │      add x12, x12, #0x9
  2653094829 │      sub x15, x15, #0x1
             │      add x24, x24, #0x48
  2692030697 │      add x22, x22, #0x1e0
             │      cmp x23, #0x4
  1721216607 │    ↓ b.eq 1fbc
   448331273 │      ldr d19, [x13], #8
  1778236919 │    ↑ b 1e6c
  7971009272 │1fbc: ldr x0, [sp, #448]
   911313572 │      add x26, x26, #0x1
             │      add x27, x27, #0x8
   902215785 │      add x0, x0, #0x1
             │      str x0, [sp, #448]
   478032817 │      cmp x26, #0x4
             │    ↓ b.eq 1fe8
  1475545769 │      add x0, sp, #0x708
             │      add x0, x0, x26, lsl #3
  1806982272 │      ldur d22, [x0, #-8]
             │    ↑ b 1e40
[-- Attachment #3: noloadhoist_profile.txt --]
[-- Type: text/plain, Size: 10969 bytes --]
   589937229 │1e30: mov x15, #0x1760 // #5984
   904297989 │      add x0, sp, x15
   870649879 │      add x22, x0, x22
             │      fmov d7, d24
   891274869 │      ldr x0, [sp, #448]
             │      add x14, sp, #0x710
   909978719 │      ldr x12, [sp, #728]
             │      sub x13, x27, #0x1
   882715766 │      add x18, x28, #0x8
             │      sub x19, x0, x28
   885884552 │      mov x9, #0x9 // #9
             │      mov x20, #0x1 // #1
  6279074827 │1e60: mov x17, x22
             │      mov x16, x30
  2666213304 │      ldr d2, [x19]
             │      mov x15, #0x3 // #3
 18990367400 │1e70: mov x8, x17
             │      mov x11, x16
  8057495884 │      mov x10, #0x1b // #27
 14947123246 │1e7c: sub x0, x8, #0x140
 22985623052 │      ldur d5, [x11, #-8]
  1060364445 │      fmov d6, d8
 23956420799 │      fmov d3, d7
             │      add x3, x18, x8
 24065319873 │      mov x7, x14
             │      ldr d0, [x0, x9, lsl #3]
 24187025828 │      mov x6, x8
             │      mov x5, #0x1 // #1
 48132474841 │      fmul d0, d5, d0
 96001335773 │      fmadd d2, d0, d16, d2
 61067761742 │1ea8: ldur d4, [x6, #-248]
 14089308947 │      sub x4, x3, #0x140
 58091146403 │      fmov d0, d7
 14028168886 │      mov x2, x12
 57897209384 │      mov x1, x3
 13994185270 │      mov w0, #0x1 // #1
 67891460180 │      fmul d4, d5, d4
 28006688701 │      ldr d1, [x4, x9, lsl #3]
215655048826 │      fmul d1, d5, d1
 57701202743 │      fmul d3, d4, d3
230116393416 │      fmul d1, d1, d6
 57977229144 │      fmadd d2, d3, d16, d2
301775181164 │      fmadd d2, d1, d16, d2
     3441224 │1edc: ldur d1, [x1, #-248]
216111094536 │      add w0, w0, #0x1
     1473566 │      add x2, x2, #0x18
215873683406 │      add x1, x1, #0x48
216166335905 │      fmul d1, d5, d1
864007322335 │      fmul d1, d1, d6
863815029515 │      fmul d0, d1, d0
864900327399 │      fmadd d2, d0, d16, d2
             │      cmp w0, #0x4
216329679631 │    ↓ b.eq 1f0c
    22872044 │      ldur d0, [x2, #-8]
143941131893 │    ↑ b 1edc
   277804663 │1f0c: add x5, x5, #0x1
 72179847520 │      add x6, x6, #0x48
             │      add x3, x3, #0x8
 65738463940 │      cmp x5, #0x4
             │    ↓ b.eq 1f30
123097375558 │      ldr d6, [x7, x13, lsl #3]
             │      add x7, x7, #0x18
 96061189670 │      ldur d3, [x7, #-24]
             │    ↑ b 1ea8
 42647845407 │1f30: add x10, x10, #0x1b
             │      add x11, x11, #0x18
 24141022972 │      add x8, x8, #0xd8
             │      cmp x10, #0x6c
 14573046432 │    ↑ b.ne 1e7c
 72139544087 │      add x15, x15, #0x3
  8028370830 │      add x16, x16, #0x8
             │      add x17, x17, #0x18
  4860057143 │      cmp x15, #0xc
             │    ↑ b.ne 1e70
 23912996709 │      add x20, x20, #0x1
             │      str d2, [x19]
  2670529487 │      add x14, x14, #0x8
             │      add x9, x9, #0x9
  2659625346 │      sub x13, x13, #0x1
             │      add x19, x19, #0x1e0
  1606030574 │      cmp x20, #0x4
             │    ↓ b.eq 1f80
  3096553445 │      ldr d7, [x12], #8
             │    ↑ b 1e60
  7964390214 │1f80: add x27, x27, #0x1
             │      sub x28, x28, #0x8
   529029469 │      cmp x27, #0x4
             │    ↓ b.eq 2028
  1176126379 │      lsl x22, x27, #3
             │      add x0, sp, #0x6f8
   593893747 │      add x0, x0, x22
  1798781807 │      ldur d8, [x0, #-8]
   580872685 │    ↑ b 1e30
* Re: LTO slows down calculix by more than 10% on aarch64
2020-09-21 9:49 ` Prathamesh Kulkarni
@ 2020-09-21 12:44 ` Prathamesh Kulkarni
2020-09-22 5:08 ` Prathamesh Kulkarni
0 siblings, 1 reply; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-09-21 12:44 UTC (permalink / raw)
To: Alexander Monakov; +Cc: Richard Biener, GCC Development
Hi,
In general I feel that in this case and in PR80155 the issues come from
long-range hoisting inside a large CFG: we don't have an accurate
way to model target resources at tree level (register pressure in the
PR80155 case, or possibly cache pressure in this case?), so we end up
with register spills or cache misses inside loops, which may offset the
benefit of hoisting. As previously discussed, the right way to go is a
live range splitting pass at the GIMPLE -> RTL border, which can also help
with other code-movement optimizations (or when the source has variables
with long live ranges).
I was wondering, though: as a cheap workaround, would it make sense to
check whether we are hoisting across a "large" region of nested loops, and
avoid hoisting in that case, since it may exert resource pressure inside
the loop region? (Especially in cases where the hoisted expressions were
not originally AVAIL in any of the loop blocks, so the loop region
doesn't benefit from hoisting.)
For instance:
FOR_EACH_EDGE (e, ei, block->succs)
  {
    /* Avoid hoisting across a loop nest more than 3 levels deep.  */
    if (e->dest is a loop pre-header or loop header
        && depth of the loop nest rooted at that loop > 3)
      return false;
  }
I think this would resolve the calculix issue, because there the hoisting
goes across one region of 6 loops and another of 4 loops (I didn't test it
yet).
It's not bulletproof: it will fail to detect cases where the loop
header (or pre-header) isn't a successor of the candidate block (checking
for that might get expensive, though?). I will test it on the GCC testsuite
and SPEC for any regressions.
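To make the proposed check and its blind spot concrete, here is a toy Python
model. It is not GCC code: the Loop class, make_nest, may_hoist_into and the
block names are all hypothetical, and the small CFG is only loosely shaped like
the region described earlier in the thread.

```python
from dataclasses import dataclass, field


@dataclass
class Loop:
    """Toy loop-tree node: a loop and the loops nested directly inside it."""
    name: str
    children: list = field(default_factory=list)

    def nest_depth(self):
        # Depth of the deepest nest rooted at this loop (1 for a leaf loop).
        return 1 + max((c.nest_depth() for c in self.children), default=0)


def make_nest(levels, prefix):
    """Build a perfectly nested chain of `levels` loops; return the outermost."""
    loop = None
    for level in range(levels, 0, -1):
        loop = Loop(f"{prefix}{level}", [] if loop is None else [loop])
    return loop


def may_hoist_into(block, succs, loop_entry, max_depth=3):
    """Return False if a direct successor of `block` enters a deep loop nest."""
    for dest in succs.get(block, []):
        loop = loop_entry.get(dest)
        if loop is not None and loop.nest_depth() > max_depth:
            return False
    return True


# Hypothetical CFG: "pre4" and "pre6" stand for the pre-headers of a
# 4-deep and a 6-deep loop nest; "far" only reaches them indirectly.
succs = {
    "bb173": ["pre4", "bb181"],
    "bb181": ["bb182", "pre6"],
    "far": ["bb173"],
}
loop_entry = {"pre4": make_nest(4, "a"), "pre6": make_nest(6, "b")}

for block in ("bb173", "bb181", "far"):
    print(block, may_hoist_into(block, succs, loop_entry))
```

In this model hoisting into bb173 and bb181 is rejected, while hoisting into
"far" is still allowed because neither pre-header is a direct successor, which
is the limitation mentioned above.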
Does this sound like a reasonable heuristic?
Thanks,
Prathamesh
* Re: LTO slows down calculix by more than 10% on aarch64
2020-09-21 12:44 ` Prathamesh Kulkarni
@ 2020-09-22 5:08 ` Prathamesh Kulkarni
2020-09-22 7:25 ` Richard Biener
0 siblings, 1 reply; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-09-22 5:08 UTC (permalink / raw)
To: Alexander Monakov; +Cc: Richard Biener, GCC Development
[-- Attachment #1: Type: text/plain, Size: 12256 bytes --]
On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > >
> > > > I obtained perf stat results for following benchmark runs:
> > > >
> > > > -O2:
> > > >
> > > > 7856832.692380 task-clock (msec) # 1.000 CPUs utilized
> > > > 3758 context-switches # 0.000 K/sec
> > > > 40 cpu-migrations # 0.000 K/sec
> > > > 40847 page-faults # 0.005 K/sec
> > > > 7856782413676 cycles # 1.000 GHz
> > > > 6034510093417 instructions # 0.77 insn per cycle
> > > > 363937274287 branches # 46.321 M/sec
> > > > 48557110132 branch-misses # 13.34% of all branches
> > >
> > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > enough for this kind of code)
> > >
> > > > -O2 with orthonl inlined:
> > > >
> > > > 8319643.114380 task-clock (msec) # 1.000 CPUs utilized
> > > > 4285 context-switches # 0.001 K/sec
> > > > 28 cpu-migrations # 0.000 K/sec
> > > > 40843 page-faults # 0.005 K/sec
> > > > 8319591038295 cycles # 1.000 GHz
> > > > 6276338800377 instructions # 0.75 insn per cycle
> > > > 467400726106 branches # 56.180 M/sec
> > > > 45986364011 branch-misses # 9.84% of all branches
> > >
> > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > >
> > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > >
> > > > 8207331.088040 task-clock (msec) # 1.000 CPUs utilized
> > > > 2266 context-switches # 0.000 K/sec
> > > > 32 cpu-migrations # 0.000 K/sec
> > > > 40846 page-faults # 0.005 K/sec
> > > > 8207292032467 cycles # 1.000 GHz
> > > > 6035724436440 instructions # 0.74 insn per cycle
> > > > 364415440156 branches # 44.401 M/sec
> > > > 53138327276 branch-misses # 14.58% of all branches
> > >
> > > This seems to match baseline in terms of instruction count, but without PRE
> > > the loop nest may be carrying some dependencies over memory. I would simply
> > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > not very complicated (though Fortran array addressing...).
> > >
> > > > -O2 with orthonl inlined and hoisting disabled:
> > > >
> > > > 7797265.206850 task-clock (msec) # 1.000 CPUs utilized
> > > > 3139 context-switches # 0.000 K/sec
> > > > 20 cpu-migrations # 0.000 K/sec
> > > > 40846 page-faults # 0.005 K/sec
> > > > 7797221351467 cycles # 1.000 GHz
> > > > 6187348757324 instructions # 0.79 insn per cycle
> > > > 461840800061 branches # 59.231 M/sec
> > > > 26920311761 branch-misses # 5.83% of all branches
> > >
> > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > in insn count).
> > >
> > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> > > each iterating only 3 times).
> > >
> > > > Perf profiles for
> > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > >
> > > > 3196866 |1f04: ldur d1, [x1, #-248]
> > > > 216348301800│ add w0, w0, #0x1
> > > > 985098 | add x2, x2, #0x18
> > > > 216215999206│ add x1, x1, #0x48
> > > > 215630376504│ fmul d1, d5, d1
> > > > 863829148015│ fmul d1, d1, d6
> > > > 864228353526│ fmul d0, d1, d0
> > > > 864568163014│ fmadd d2, d0, d16, d2
> > > > │ cmp w0, #0x4
> > > > 216125427594│ ↓ b.eq 1f34
> > > > 15010377│ ldur d0, [x2, #-8]
> > > > 143753737468│ ↑ b 1f04
> > > >
> > > > -O2 with inlined orthonl:
> > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > >
> > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > 144055883055│ add w0, w0, #0x1
> > > > 72262104254│ add x2, x2, #0x18
> > > > 143991169721│ add x1, x1, #0x48
> > > > 288648917780│ fmul d15, d17, d15
> > > > 864665644756│ fmul d15, d15, d18
> > > > 863868426387│ fmul d14, d15, d14
> > > > 865228159813│ fmadd d16, d14, d31, d16
> > > > 245967│ cmp w0, #0x4
> > > > 215396760545│ ↓ b.eq 1f28
> > > > 704732365│ ldur d14, [x2, #-8]
> > > > 143775979620│ ↑ b 1ef8
> > >
> > > This indicates that the loop only covers about 46-48% of overall time.
> > >
> > > High count on the initial ldur instruction could be explained if the loop
> > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > and you may be able to check if loop entry is fallthrough by inspecting
> > > assembly.
> > >
> > > It may also be possible to check if code alignment matters, by compiling with
> > > -falign-loops=32.
> > Hi,
> > Thanks a lot for the detailed feedback, and I am sorry for late response.
> >
> > The hoisting region is:
> > if(mattyp.eq.1) then
> > 4 loops
> > elseif(mattyp.eq.2) then
> > {
> > orthonl inlined into basic block;
> > loads w[0] .. w[8]
> > }
> > else
> > 6 loops // load anisox
> >
> > followed by basic block:
> >
> > senergy=
> > & (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > & +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > & +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > s(ii1,jj1)=s(ii1,jj1)+senergy
> > s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> >
> > Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block,
> > right in block 181, which is:
> > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> >
> > which is then further hoisted to block 173:
> > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> >
> > From block 181, we have two paths towards senergy block (bb 194):
> > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > AND
> > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > which has a path length of around 18 blocks.
> > (bb 194 post-dominates bb 181 and bb 173).
> >
> > Disabling only load hoisting within blocks 173 and 181
> > (simply avoid inserting pre_expr if pre_expr->kind == REFERENCE),
> > avoid hoisting of 'w' array and brings back most of performance. Which
> > verifies that it is hoisting of the
> > 'w' array (w[0] ... w[8]), which is causing the slowdown ?
> >
> > I obtained perf profiles for full hoisting, and disabled hoisting of
> > 'w' array for the 6 loops, and the most drastic difference was
> > for ldur instruction:
> >
> > With full hoisting:
> > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> >
> > Without full hoisting:
> > 3441224 │1edc: ldur d1, [x1, #-248]
> >
> > (The loop entry seems to be fall thru in both cases. I have attached
> > profiles for both cases).
> >
> > IIUC, the instruction seems to be loading the first element from anisox array,
> > which makes me wonder if the issue was with data-cache miss for slower version.
> > I ran perf script on perf data for L1-dcache-load-misses with period = 1million,
> > and it reported two cache misses on the ldur instruction in full hoisting case,
> > while it reported zero for the disabled load hoisting case.
> > So I wonder if the slowdown happens because hoisting of 'w' array
> > possibly results
> > in eviction of anisox thus causing a cache miss inside the inner loop
> > and making load slower ?
> >
> > Hoisting also seems to improve the number of overall cache misses tho.
> > For disabled hoisting of 'w' array case, there were a total of 463
> > cache misses, while with full hoisting there were 357 cache misses
> > (with period = 1 million).
> > Does that happen because hoisting probably reduces cache misses along
> > the orthonl path (bb 173 - > bb 181 -> bb 182 -> bb 194) ?
> Hi,
> In general I feel for this or PR80155 case, the issues come with long
> range hoistings, inside a large CFG, since we don't have an accurate
> way to model target resources (register pressure in PR80155 case / or
> possibly cache pressure in this case?) at tree level and we end up
> with register spill or cache miss inside loops, which may offset the
> benefit of hoisting. As previously discussed the right way to go is a
> live range splitting pass, at GIMPLE -> RTL border which can also help
> with other code-movement optimizations (or if the source had variables
> with long live ranges).
>
> I was wondering tho as a cheap workaround, would it make sense to
> check if we are hoisting across a "large" region of nested loops, and
> avoid in that case since hoisting may exert resource pressure inside
> loop region ? (Especially, in the cases where hoisted expressions were
> not originally AVAIL in any of the loop blocks, and the loop region
> doesn't benefit from hoisting).
>
> For instance:
> FOR_EACH_EDGE (e, ei, block)
> {
> /* Avoid hoisting across more than 3 nested loops */
> if (e->dest is a loop pre-header or loop header
> && nesting depth of loop is > 3)
> return false;
> }
>
> I think this would work for resolving the calculix issue because it
> hoists across one region of 6 loops and another of 4 loops (didn't test
> yet).
> It's not bulletproof in that it will miss detecting cases where loop
> header (or pre-header) isn't a successor of candidate block (checking
> for
> that might get expensive tho?). I will test it on gcc suite and SPEC
> for any regressions.
> Does this sound like a reasonable heuristic ?
Hi,
The attached patch implements the above heuristic.
Bootstrapped + tested on x86_64-linux-gnu with no regressions.
It brings calculix performance back nearly on par with -O2
(without inlining orthonl).
I verified that with the patch there is no cache miss on the load
insn inside the loop
(with perf report -e L1-dcache-load-misses/period=1000000/).

I am in the process of benchmarking the patch on aarch64 for SPEC
speed and will report numbers in a couple of days. (If required, we
could parametrize the number of nested loops, hardcoded (arbitrarily)
to 3 in this patch, and set it in the backend to not affect other targets.)
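For illustration only, the check can be modeled outside GCC on a toy loop tree. This is a minimal sketch, not the attached patch: Loop, nested_levels, make_nest and hoisting_allowed are hypothetical names, and the level count assumes that an innermost loop counts as level 0.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical toy loop tree; this is not GCC's loop structure.
struct Loop {
  std::vector<Loop> inner;  // loops nested directly inside this one
};

// Number of loop levels nested below LOOP (0 for an innermost loop).
unsigned nested_levels(const Loop &loop) {
  unsigned levels = 0;
  for (const Loop &sub : loop.inner)
    levels = std::max(levels, nested_levels(sub) + 1);
  return levels;
}

// Build a perfect nest of DEPTH loops (DEPTH >= 1) and return the outermost.
Loop make_nest(unsigned depth) {
  Loop loop;
  if (depth > 1)
    loop.inner.push_back(make_nest(depth - 1));
  return loop;
}

// Each entry is the loop entered through one successor edge of the candidate
// block (via its header or pre-header), or nullptr if that successor does not
// enter a loop.  Hoisting is refused as soon as one entered nest is too deep.
bool hoisting_allowed(const std::vector<const Loop *> &entered,
                      unsigned max_levels = 3) {
  for (const Loop *loop : entered)
    if (loop && nested_levels(*loop) > max_levels)
      return false;
  return true;
}
```

In this toy model a candidate block with a successor entering make_nest(6) is refused, while one entering make_nest(4) is still allowed, because only 3 levels are nested below the outermost loop of a 4-deep nest.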
Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
>
>
>
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Prathamesh
> > >
> > > Alexander
[-- Attachment #2: gnu-659-loop-1.diff --]
[-- Type: application/octet-stream, Size: 1786 bytes --]
diff --git a/gcc/tree-ssa-pre.c b/gcc/tree-ssa-pre.c
index 0c1654f3580..9017ad9a4cb 100644
--- a/gcc/tree-ssa-pre.c
+++ b/gcc/tree-ssa-pre.c
@@ -3528,13 +3528,30 @@ do_hoist_insertion (basic_block block)
   bitmap_head availout_in_some;
   bitmap_initialize (&availout_in_some, &grand_bitmap_obstack);
   FOR_EACH_EDGE (e, ei, block->succs)
-    /* Do not consider expressions solely because their availability
-       on loop exits. They'd be ANTIC-IN throughout the whole loop
-       and thus effectively hoisted across loops by combination of
-       PRE and hoisting. */
-    if (! loop_exit_edge_p (block->loop_father, e))
-      bitmap_ior_and_into (&availout_in_some, &hoistable_set.values,
-                           &AVAIL_OUT (e->dest)->values);
+    {
+      /* Do not consider expressions solely because their availability
+         on loop exits. They'd be ANTIC-IN throughout the whole loop
+         and thus effectively hoisted across loops by combination of
+         PRE and hoisting. */
+      if (! loop_exit_edge_p (block->loop_father, e))
+        bitmap_ior_and_into (&availout_in_some, &hoistable_set.values,
+                             &AVAIL_OUT (e->dest)->values);
+
+      /* Avoid hoisting if a successor block is either loop header or pre-header,
+         and the loop region has more than 3 nested loops to not exert resource
+         pressure inside loop region. */
+
+      basic_block header_bb = NULL;
+      if (bb_loop_header_p (e->dest))
+        header_bb = e->dest;
+      else if (single_succ_p (e->dest)
+               && bb_loop_header_p (single_succ (e->dest)))
+        header_bb = single_succ (e->dest);
+
+      if (header_bb && header_bb->loop_father
+          && get_loop_level (header_bb->loop_father) > 3)
+        return false;
+    }
   bitmap_clear (&hoistable_set.values);
   /* Short-cut for a common case: availout_in_some is empty. */
* Re: LTO slows down calculix by more than 10% on aarch64
2020-09-22 5:08 ` Prathamesh Kulkarni
@ 2020-09-22 7:25 ` Richard Biener
2020-09-22 9:37 ` Prathamesh Kulkarni
0 siblings, 1 reply; 25+ messages in thread
From: Richard Biener @ 2020-09-22 7:25 UTC (permalink / raw)
To: Prathamesh Kulkarni; +Cc: Alexander Monakov, GCC Development
On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> > >
> > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > >
> > > > > I obtained perf stat results for following benchmark runs:
> > > > >
> > > > > -O2:
> > > > >
> > > > > 7856832.692380 task-clock (msec) # 1.000 CPUs utilized
> > > > > 3758 context-switches # 0.000 K/sec
> > > > > 40 cpu-migrations # 0.000 K/sec
> > > > > 40847 page-faults # 0.005 K/sec
> > > > > 7856782413676 cycles # 1.000 GHz
> > > > > 6034510093417 instructions # 0.77 insn per cycle
> > > > > 363937274287 branches # 46.321 M/sec
> > > > > 48557110132 branch-misses # 13.34% of all branches
> > > >
> > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > enough for this kind of code)
> > > >
> > > > > -O2 with orthonl inlined:
> > > > >
> > > > > 8319643.114380 task-clock (msec) # 1.000 CPUs utilized
> > > > > 4285 context-switches # 0.001 K/sec
> > > > > 28 cpu-migrations # 0.000 K/sec
> > > > > 40843 page-faults # 0.005 K/sec
> > > > > 8319591038295 cycles # 1.000 GHz
> > > > > 6276338800377 instructions # 0.75 insn per cycle
> > > > > 467400726106 branches # 56.180 M/sec
> > > > > 45986364011 branch-misses # 9.84% of all branches
> > > >
> > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > >
> > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > >
> > > > > 8207331.088040 task-clock (msec) # 1.000 CPUs utilized
> > > > > 2266 context-switches # 0.000 K/sec
> > > > > 32 cpu-migrations # 0.000 K/sec
> > > > > 40846 page-faults # 0.005 K/sec
> > > > > 8207292032467 cycles # 1.000 GHz
> > > > > 6035724436440 instructions # 0.74 insn per cycle
> > > > > 364415440156 branches # 44.401 M/sec
> > > > > 53138327276 branch-misses # 14.58% of all branches
> > > >
> > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > not very complicated (though Fortran array addressing...).
> > > >
> > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > >
> > > > > 7797265.206850 task-clock (msec) # 1.000 CPUs utilized
> > > > > 3139 context-switches # 0.000 K/sec
> > > > > 20 cpu-migrations # 0.000 K/sec
> > > > > 40846 page-faults # 0.005 K/sec
> > > > > 7797221351467 cycles # 1.000 GHz
> > > > > 6187348757324 instructions # 0.79 insn per cycle
> > > > > 461840800061 branches # 59.231 M/sec
> > > > > 26920311761 branch-misses # 5.83% of all branches
> > > >
> > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > in insn count).
> > > >
> > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > each iterating only 3 times).
> > > >
> > > > > Perf profiles for
> > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > >
> > > > > 3196866 |1f04: ldur d1, [x1, #-248]
> > > > > 216348301800│ add w0, w0, #0x1
> > > > > 985098 | add x2, x2, #0x18
> > > > > 216215999206│ add x1, x1, #0x48
> > > > > 215630376504│ fmul d1, d5, d1
> > > > > 863829148015│ fmul d1, d1, d6
> > > > > 864228353526│ fmul d0, d1, d0
> > > > > 864568163014│ fmadd d2, d0, d16, d2
> > > > > │ cmp w0, #0x4
> > > > > 216125427594│ ↓ b.eq 1f34
> > > > > 15010377│ ldur d0, [x2, #-8]
> > > > > 143753737468│ ↑ b 1f04
> > > > >
> > > > > -O2 with inlined orthonl:
> > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > >
> > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > > 144055883055│ add w0, w0, #0x1
> > > > > 72262104254│ add x2, x2, #0x18
> > > > > 143991169721│ add x1, x1, #0x48
> > > > > 288648917780│ fmul d15, d17, d15
> > > > > 864665644756│ fmul d15, d15, d18
> > > > > 863868426387│ fmul d14, d15, d14
> > > > > 865228159813│ fmadd d16, d14, d31, d16
> > > > > 245967│ cmp w0, #0x4
> > > > > 215396760545│ ↓ b.eq 1f28
> > > > > 704732365│ ldur d14, [x2, #-8]
> > > > > 143775979620│ ↑ b 1ef8
> > > >
> > > > This indicates that the loop only covers about 46-48% of overall time.
> > > >
> > > > High count on the initial ldur instruction could be explained if the loop
> > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > assembly.
> > > >
> > > > It may also be possible to check if code alignment matters, by compiling with
> > > > -falign-loops=32.
> > > Hi,
> > > Thanks a lot for the detailed feedback, and I am sorry for late response.
> > >
> > > The hoisting region is:
> > > if(mattyp.eq.1) then
> > > 4 loops
> > > elseif(mattyp.eq.2) then
> > > {
> > > orthonl inlined into basic block;
> > > loads w[0] .. w[8]
> > > }
> > > else
> > > 6 loops // load anisox
> > >
> > > followed by basic block:
> > >
> > > senergy=
> > > & (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > & +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > & +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > s(ii1,jj1)=s(ii1,jj1)+senergy
> > > s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > >
> > > Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block,
> > > right in block 181, which is:
> > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > >
> > > which is then further hoisted to block 173:
> > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > >
> > > From block 181, we have two paths towards senergy block (bb 194):
> > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > AND
> > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > which has a path length of around 18 blocks.
> > > (bb 194 post-dominates bb 181 and bb 173).
> > >
> > > Disabling only load hoisting within blocks 173 and 181
> > > (simply avoid inserting pre_expr if pre_expr->kind == REFERENCE),
> > > avoid hoisting of 'w' array and brings back most of performance. Which
> > > verifies that it is hoisting of the
> > > 'w' array (w[0] ... w[8]), which is causing the slowdown ?
> > >
> > > I obtained perf profiles for full hoisting, and disabled hoisting of
> > > 'w' array for the 6 loops, and the most drastic difference was
> > > for ldur instruction:
> > >
> > > With full hoisting:
> > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > >
> > > Without full hoisting:
> > > 3441224 │1edc: ldur d1, [x1, #-248]
> > >
> > > (The loop entry seems to be fall thru in both cases. I have attached
> > > profiles for both cases).
> > >
> > > IIUC, the instruction seems to be loading the first element from anisox array,
> > > which makes me wonder if the issue was with data-cache miss for slower version.
> > > I ran perf script on perf data for L1-dcache-load-misses with period = 1million,
> > > and it reported two cache misses on the ldur instruction in full hoisting case,
> > > while it reported zero for the disabled load hoisting case.
> > > So I wonder if the slowdown happens because hoisting of 'w' array
> > > possibly results
> > > in eviction of anisox thus causing a cache miss inside the inner loop
> > > and making load slower ?
> > >
> > > Hoisting also seems to improve the number of overall cache misses tho.
> > > For disabled hoisting of 'w' array case, there were a total of 463
> > > cache misses, while with full hoisting there were 357 cache misses
> > > (with period = 1 million).
> > > Does that happen because hoisting probably reduces cache misses along
> > > the orthonl path (bb 173 - > bb 181 -> bb 182 -> bb 194) ?
> > Hi,
> > In general I feel for this or PR80155 case, the issues come with long
> > range hoistings, inside a large CFG, since we don't have an accurate
> > way to model target resources (register pressure in PR80155 case / or
> > possibly cache pressure in this case?) at tree level and we end up
> > with register spill or cache miss inside loops, which may offset the
> > benefit of hoisting. As previously discussed the right way to go is a
> > live range splitting pass, at GIMPLE -> RTL border which can also help
> > with other code-movement optimizations (or if the source had variables
> > with long live ranges).
> >
> > I was wondering tho as a cheap workaround, would it make sense to
> > check if we are hoisting across a "large" region of nested loops, and
> > avoid in that case since hoisting may exert resource pressure inside
> > loop region ? (Especially, in the cases where hoisted expressions were
> > not originally AVAIL in any of the loop blocks, and the loop region
> > doesn't benefit from hoisting).
> >
> > For instance:
> > FOR_EACH_EDGE (e, ei, block)
> > {
> > /* Avoid hoisting across more than 3 nested loops */
> > if (e->dest is a loop pre-header or loop header
> > && nesting depth of loop is > 3)
> > return false;
> > }
> >
> > I think this would work for resolving the calculix issue because it
> > hoists across one region of 6 loops and another of 4 loops (didn't test
> > yet).
> > It's not bulletproof in that it will miss detecting cases where loop
> > header (or pre-header) isn't a successor of candidate block (checking
> > for
> > that might get expensive tho?). I will test it on gcc suite and SPEC
> > for any regressions.
> > Does this sound like a reasonable heuristic ?
> Hi,
> The attached patch implements the above heuristic.
> Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> And it brings back most of performance for calculix on par with O2
> (without inlining orthonl).
> I verified that with patch there is no cache miss happening on load
> insn inside loop
> (with perf report -e L1-dcache-load-misses/period=1000000/)
>
> I am in the process of benchmarking the patch on aarch64 for SPEC for
> speed and will report numbers
> in couple of days. (If required, we could parametrize number of nested
> loops, hardcoded (arbitrarily to) 3 in this patch,
> and set it in backend to not affect other targets).
I don't think this patch captures the case in a sensible way - it will simply
never hoist computations out of loop header blocks with depth > 3, which
is certainly not what you want. Also the pre-header check is odd - we're
looking for computations in successors of BLOCK but clearly a computation
in a pre-header is not at the same loop level as one in the header itself.

Note the difficulty in capturing "distance" is that the distance is simply not
available at this point - it is the anticipated values from the successors
that do _not_ compute the value itself that are the issue. To capture
"distance" you'd need to somehow "age" anticipated values when
propagating them upwards during compute_antic (where it then
would also apply to PRE in general).

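To make the "aging" idea concrete, here is a minimal sketch on a toy acyclic CFG. It is not GCC code and not a proposal for compute_antic itself: Block, AgedSet and compute_aged_antic are hypothetical names, kills are ignored, and taking the oldest age over all successors is just one possible choice.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Hypothetical toy CFG block; not GCC's basic_block.
struct Block {
  std::vector<int> succs;  // indices of successor blocks (all larger than own index)
  std::set<int> computes;  // ids of values computed in this block
};

// value id -> age, i.e. number of blocks the value was propagated through
using AgedSet = std::map<int, unsigned>;

// Backward pass over an acyclic CFG whose blocks are in topological order.
// A value is anticipated at the end of a block if all successors anticipate
// it; each propagation step ages it by one and values older than MAX_AGE are
// dropped.  Kills are ignored to keep the sketch short.
std::vector<AgedSet> compute_aged_antic(const std::vector<Block> &cfg,
                                        unsigned max_age) {
  std::vector<AgedSet> antic_in(cfg.size());
  for (std::size_t i = cfg.size(); i-- > 0;) {
    const std::vector<int> &succs = cfg[i].succs;
    AgedSet in;
    if (!succs.empty()) {
      for (const auto &entry : antic_in[succs[0]]) {
        unsigned oldest = entry.second;
        bool in_all = true;
        for (std::size_t s = 1; s < succs.size() && in_all; ++s) {
          auto it = antic_in[succs[s]].find(entry.first);
          if (it == antic_in[succs[s]].end())
            in_all = false;
          else
            oldest = std::max(oldest, it->second);
        }
        if (in_all && oldest + 1 <= max_age)
          in[entry.first] = oldest + 1;
      }
    }
    for (int value : cfg[i].computes)
      in[value] = 0;  // computed right here, so age 0
    antic_in[i] = in;
  }
  return antic_in;
}
```

On a straight chain of four blocks where only the last one computes a value, a limit of 2 keeps the value anticipated in the two preceding blocks but no longer in the first one, so nothing would be hoisted that far up.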
As with all other heuristics the only place one could do hackish attempts
with at least some reasoning is the elimination phase where
we make use of the (hoist) insertions - of course for hoisting we already
know we'll get the "close" use in one of the successors so I fear even
there it will be impossible to do something sensible.

Richard.
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Prathamesh
> >
> >
> >
> > Thanks,
> > Prathamesh
> > >
> > > Thanks,
> > > Prathamesh
> > > >
> > > > Alexander
* Re: LTO slows down calculix by more than 10% on aarch64
2020-09-22 7:25 ` Richard Biener
@ 2020-09-22 9:37 ` Prathamesh Kulkarni
2020-09-22 11:06 ` Richard Biener
0 siblings, 1 reply; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-09-22 9:37 UTC (permalink / raw)
To: Richard Biener; +Cc: Alexander Monakov, GCC Development
[-- Attachment #1: Type: text/plain, Size: 15825 bytes --]
On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> > >
> > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > <prathamesh.kulkarni@linaro.org> wrote:
> > > >
> > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > > >
> > > > > > I obtained perf stat results for following benchmark runs:
> > > > > >
> > > > > > -O2:
> > > > > >
> > > > > > 7856832.692380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > 3758 context-switches # 0.000 K/sec
> > > > > > 40 cpu-migrations # 0.000 K/sec
> > > > > > 40847 page-faults # 0.005 K/sec
> > > > > > 7856782413676 cycles # 1.000 GHz
> > > > > > 6034510093417 instructions # 0.77 insn per cycle
> > > > > > 363937274287 branches # 46.321 M/sec
> > > > > > 48557110132 branch-misses # 13.34% of all branches
> > > > >
> > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > > enough for this kind of code)
> > > > >
> > > > > > -O2 with orthonl inlined:
> > > > > >
> > > > > > 8319643.114380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > 4285 context-switches # 0.001 K/sec
> > > > > > 28 cpu-migrations # 0.000 K/sec
> > > > > > 40843 page-faults # 0.005 K/sec
> > > > > > 8319591038295 cycles # 1.000 GHz
> > > > > > 6276338800377 instructions # 0.75 insn per cycle
> > > > > > 467400726106 branches # 56.180 M/sec
> > > > > > 45986364011 branch-misses # 9.84% of all branches
> > > > >
> > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > >
> > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > >
> > > > > > 8207331.088040 task-clock (msec) # 1.000 CPUs utilized
> > > > > > 2266 context-switches # 0.000 K/sec
> > > > > > 32 cpu-migrations # 0.000 K/sec
> > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > 8207292032467 cycles # 1.000 GHz
> > > > > > 6035724436440 instructions # 0.74 insn per cycle
> > > > > > 364415440156 branches # 44.401 M/sec
> > > > > > 53138327276 branch-misses # 14.58% of all branches
> > > > >
> > > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > > not very complicated (though Fortran array addressing...).
> > > > >
> > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > >
> > > > > > 7797265.206850 task-clock (msec) # 1.000 CPUs utilized
> > > > > > 3139 context-switches # 0.000 K/sec
> > > > > > 20 cpu-migrations # 0.000 K/sec
> > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > 7797221351467 cycles # 1.000 GHz
> > > > > > 6187348757324 instructions # 0.79 insn per cycle
> > > > > > 461840800061 branches # 59.231 M/sec
> > > > > > 26920311761 branch-misses # 5.83% of all branches
> > > > >
> > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > > in insn count).
> > > > >
> > > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > > consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > > each iterating only 3 times).
> > > > >
> > > > > > Perf profiles for
> > > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > >
> > > > > > 3196866 |1f04: ldur d1, [x1, #-248]
> > > > > > 216348301800│ add w0, w0, #0x1
> > > > > > 985098 | add x2, x2, #0x18
> > > > > > 216215999206│ add x1, x1, #0x48
> > > > > > 215630376504│ fmul d1, d5, d1
> > > > > > 863829148015│ fmul d1, d1, d6
> > > > > > 864228353526│ fmul d0, d1, d0
> > > > > > 864568163014│ fmadd d2, d0, d16, d2
> > > > > > │ cmp w0, #0x4
> > > > > > 216125427594│ ↓ b.eq 1f34
> > > > > > 15010377│ ldur d0, [x2, #-8]
> > > > > > 143753737468│ ↑ b 1f04
> > > > > >
> > > > > > -O2 with inlined orthonl:
> > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > >
> > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > > > 144055883055│ add w0, w0, #0x1
> > > > > > 72262104254│ add x2, x2, #0x18
> > > > > > 143991169721│ add x1, x1, #0x48
> > > > > > 288648917780│ fmul d15, d17, d15
> > > > > > 864665644756│ fmul d15, d15, d18
> > > > > > 863868426387│ fmul d14, d15, d14
> > > > > > 865228159813│ fmadd d16, d14, d31, d16
> > > > > > 245967│ cmp w0, #0x4
> > > > > > 215396760545│ ↓ b.eq 1f28
> > > > > > 704732365│ ldur d14, [x2, #-8]
> > > > > > 143775979620│ ↑ b 1ef8
> > > > >
> > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > >
> > > > > High count on the initial ldur instruction could be explained if the loop
> > > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > > assembly.
> > > > >
> > > > > It may also be possible to check if code alignment matters, by compiling with
> > > > > -falign-loops=32.
> > > > Hi,
> > > > Thanks a lot for the detailed feedback, and I am sorry for late response.
> > > >
> > > > The hoisting region is:
> > > > if(mattyp.eq.1) then
> > > > 4 loops
> > > > elseif(mattyp.eq.2) then
> > > > {
> > > > orthonl inlined into basic block;
> > > > loads w[0] .. w[8]
> > > > }
> > > > else
> > > > 6 loops // load anisox
> > > >
> > > > followed by basic block:
> > > >
> > > > senergy=
> > > > & (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > & +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > & +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > >
> > > > Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block,
> > > > right in block 181, which is:
> > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > >
> > > > which is then further hoisted to block 173:
> > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > >
> > > > From block 181, we have two paths towards senergy block (bb 194):
> > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > AND
> > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > which has a path length of around 18 blocks.
> > > > (bb 194 post-dominates bb 181 and bb 173).
> > > >
> > > > Disabling only load hoisting within blocks 173 and 181
> > > > (simply avoid inserting pre_expr if pre_expr->kind == REFERENCE),
> > > > avoid hoisting of 'w' array and brings back most of performance. Which
> > > > verifies that it is hoisting of the
> > > > 'w' array (w[0] ... w[8]), which is causing the slowdown ?
> > > >
> > > > I obtained perf profiles for full hoisting, and disabled hoisting of
> > > > 'w' array for the 6 loops, and the most drastic difference was
> > > > for ldur instruction:
> > > >
> > > > With full hoisting:
> > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > >
> > > > Without full hoisting:
> > > > 3441224 │1edc: ldur d1, [x1, #-248]
> > > >
> > > > (The loop entry seems to be fall thru in both cases. I have attached
> > > > profiles for both cases).
> > > >
> > > > IIUC, the instruction seems to be loading the first element from anisox array,
> > > > which makes me wonder if the issue was with data-cache miss for slower version.
> > > > I ran perf script on perf data for L1-dcache-load-misses with period = 1million,
> > > > and it reported two cache misses on the ldur instruction in full hoisting case,
> > > > while it reported zero for the disabled load hoisting case.
> > > > So I wonder if the slowdown happens because hoisting of 'w' array
> > > > possibly results
> > > > in eviction of anisox thus causing a cache miss inside the inner loop
> > > > and making load slower ?
> > > >
> > > > Hoisting also seems to improve the number of overall cache misses tho.
> > > > For disabled hoisting of 'w' array case, there were a total of 463
> > > > cache misses, while with full hoisting there were 357 cache misses
> > > > (with period = 1 million).
> > > > Does that happen because hoisting probably reduces cache misses along
> > > > the orthonl path (bb 173 - > bb 181 -> bb 182 -> bb 194) ?
> > > Hi,
> > > In general I feel for this or PR80155 case, the issues come with long
> > > range hoistings, inside a large CFG, since we don't have an accurate
> > > way to model target resources (register pressure in PR80155 case / or
> > > possibly cache pressure in this case?) at tree level and we end up
> > > with register spill or cache miss inside loops, which may offset the
> > > benefit of hoisting. As previously discussed the right way to go is a
> > > live range splitting pass, at GIMPLE -> RTL border which can also help
> > > with other code-movement optimizations (or if the source had variables
> > > with long live ranges).
> > >
> > > I was wondering tho as a cheap workaround, would it make sense to
> > > check if we are hoisting across a "large" region of nested loops, and
> > > avoid in that case since hoisting may exert resource pressure inside
> > > loop region ? (Especially, in the cases where hoisted expressions were
> > > not originally AVAIL in any of the loop blocks, and the loop region
> > > doesn't benefit from hoisting).
> > >
> > > For instance:
> > > FOR_EACH_EDGE (e, ei, block)
> > > {
> > > /* Avoid hoisting across more than 3 nested loops */
> > > if (e->dest is a loop pre-header or loop header
> > > && nesting depth of loop is > 3)
> > > return false;
> > > }
> > >
> > > I think this would work for resolving the calculix issue because it
> > > hoists across one region of 6 loops and another of 4 loops (didn't test
> > > yet).
> > > It's not bulletproof in that it will miss detecting cases where loop
> > > header (or pre-header) isn't a successor of candidate block (checking
> > > for
> > > that might get expensive tho?). I will test it on gcc suite and SPEC
> > > for any regressions.
> > > Does this sound like a reasonable heuristic ?
> > Hi,
> > The attached patch implements the above heuristic.
> > Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> > And it brings back most of performance for calculix on par with O2
> > (without inlining orthonl).
> > I verified that with patch there is no cache miss happening on load
> > insn inside loop
> > (with perf report -e L1-dcache-load-misses/period=1000000/)
> >
> > I am in the process of benchmarking the patch on aarch64 for SPEC for
> > speed and will report numbers
> > in couple of days. (If required, we could parametrize number of nested
> > loops, hardcoded (arbitrarily to) 3 in this patch,
> > and set it in backend to not affect other targets).
>
> I don't think this patch captures the case in a sensible way - it will simply
> never hoist computations out of loop header blocks with depth > 3 which
> is certainly not what you want. Also the pre-header check is odd - we're
> looking for computations in successors of BLOCK but clearly a computation
> in a pre-header is not at the same loop level as one in the header itself.
Well, my intent was to check whether we are hoisting across a region
that has multiple nested loops and, in that case, to avoid hoisting
expressions that do not belong to any loop block, because that may
increase resource pressure inside the loops. For instance, in the
calculix issue we hoist the 'w' array from the post-dom block, and
neither loop region has any uses of 'w'. I agree that checking just
the loop level is too coarse.
The check with the pre-header had the same purpose: to see whether we
are hoisting across a loop region, not necessarily from within the loops.
>
> Note the difficulty to capture "distance" is that the distance is simply not
> available at this point - it is the anticipated values from the successors
> that do _not_ compute the value itself that are the issue. To capture
> "distance" you'd need to somehow "age" anticipated value when
> propagating them upwards during compute_antic (where it then
> would also apply to PRE in general).
Yes, indeed. As a hack, would it make sense to avoid inserting an
expression in the block if it is ANTIC in the post-dom block and the
"distance" between the block and its post-dom is "too far", as a
trade-off between extending the live range and hoisting? In general,
as you point out, we'd need to compute distance info for successor
blocks during compute_antic, but special-casing the post-dom should be
easy enough during do_hoist_insertion, and hoisting an expr that is
ANTIC in the post-dom could potentially be "long range" if the region
is large. It's still a coarse heuristic, though. I tried it in the
attached patch.
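As a rough illustration of what "too far" could mean, here is a minimal sketch that measures the longest path from a block to its post-dominator on a toy CFG. It is not the attached patch: Cfg, longest_path and pdom_too_far are hypothetical names, and the CFG is assumed to be acyclic (loops collapsed), which a real CFG is not.

```cpp
#include <algorithm>
#include <vector>

// Toy CFG: succs[b] lists the successors of block b.  Assumed acyclic
// (loops collapsed), which a real CFG is not.
using Cfg = std::vector<std::vector<int>>;

const int kUnknown = -2;      // memo entry not computed yet
const int kUnreachable = -1;  // target cannot be reached

// Longest path, counted in edges, from FROM to TO.
int longest_path(const Cfg &succs, int from, int to, std::vector<int> &memo) {
  if (from == to)
    return 0;
  if (memo[from] != kUnknown)
    return memo[from];
  int best = kUnreachable;
  for (int next : succs[from]) {
    int d = longest_path(succs, next, to, memo);
    if (d != kUnreachable)
      best = std::max(best, d + 1);
  }
  memo[from] = best;
  return best;
}

// True if some path from BLOCK to its post-dominator PDOM is longer than LIMIT.
bool pdom_too_far(const Cfg &succs, int block, int pdom, int limit) {
  std::vector<int> memo(succs.size(), kUnknown);
  return longest_path(succs, block, pdom, memo) > limit;
}
```

With one short arm (block -> A -> pdom) and one long arm through several blocks, the long arm decides the result, which matches the calculix shape where the orthonl path is short and the loop path is long.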
Thanks,
Prathamesh
>
> As with all other heuristics the only place one could do hackish attempts
> with at least some reasoning is the elimination phase where
> we make use of the (hoist) insertions - of course for hoisting we already
> know we'll get the "close" use in one of the successors so I fear even
> there it will be impossible to do something sensible.
>
> Richard.
>
> > Thanks,
> > Prathamesh
> > >
> > > Thanks,
> > > Prathamesh
> > >
> > >
> > >
> > > Thanks,
> > > Prathamesh
> > > >
> > > > Thanks,
> > > > Prathamesh
> > > > >
> > > > > Alexander
[-- Attachment #2: gnu-659-pdom-3.diff --]
[-- Type: application/octet-stream, Size: 2870 bytes --]
diff --git a/gcc/tree-ssa-pre.c b/gcc/tree-ssa-pre.c
index 0c1654f3580..64842e01fa5 100644
--- a/gcc/tree-ssa-pre.c
+++ b/gcc/tree-ssa-pre.c
@@ -3477,6 +3477,43 @@ do_pre_partial_partial_insertion (basic_block block, basic_block dom)
return new_stuff;
}
+/* Return true if PDOM_BB is within DIST_LIMIT of BLOCK,
+ where "distance" is measured in terms of number of basic blocks. */
+
+static bool
+pdom_within_dist_p_1 (basic_block block, basic_block pdom_bb,
+ bool *visited, unsigned dist_limit,
+ unsigned dist_from_block)
+{
+ if (dist_from_block >= dist_limit)
+ return false;
+
+ if (block == pdom_bb)
+ return true;
+
+ edge e;
+ edge_iterator ei;
+ visited[block->index] = true;
+
+ FOR_EACH_EDGE (e, ei, block->succs)
+ if (!visited[e->dest->index]
+ && !pdom_within_dist_p_1 (e->dest, pdom_bb, visited,
+ dist_limit, dist_from_block + 1))
+ return false;
+ return true;
+}
+
+static bool
+pdom_within_dist_p (basic_block bb, basic_block pdom_bb, unsigned dist)
+{
+ unsigned n_bbs = last_basic_block_for_fn (cfun);
+ bool *visited = new bool[n_bbs];
+ memset (visited, false, n_bbs);
+ bool ret = pdom_within_dist_p_1 (bb, pdom_bb, visited, dist, 0);
+ delete[] visited;
+ return ret;
+}
+
/* Insert expressions in BLOCK to compute hoistable values up.
Return TRUE if something was inserted, otherwise return FALSE.
The caller has to make sure that BLOCK has at least two successors. */
@@ -3541,6 +3578,19 @@ do_hoist_insertion (basic_block block)
if (bitmap_empty_p (&availout_in_some))
return false;
+ /* Check if any of the hoistable expressions are ANTIC in post-dom,
+ and avoid inserting those, if post-dom is beyond threshold distance
+ from the block. */
+
+ basic_block pdom_bb = get_immediate_dominator (CDI_POST_DOMINATORS, block);
+ bitmap_head hoist_from_pdom;
+ bitmap_initialize (&hoist_from_pdom, &grand_bitmap_obstack);
+ bitmap_and (&hoist_from_pdom, &availout_in_some, &ANTIC_IN (pdom_bb)->values);
+
+ if (!bitmap_empty_p (&hoist_from_pdom)
+ && !pdom_within_dist_p (block, pdom_bb, 17))
+ bitmap_and_compl_into (&availout_in_some, &ANTIC_IN (pdom_bb)->values);
+
/* Hack hoitable_set in-place so we can use sorted_array_from_bitmap_set. */
bitmap_move (&hoistable_set.values, &availout_in_some);
hoistable_set.expressions = ANTIC_IN (block)->expressions;
@@ -4099,6 +4149,7 @@ init_pre (void)
alloc_aux_for_blocks (sizeof (struct bb_bitmap_sets));
calculate_dominance_info (CDI_DOMINATORS);
+ calculate_dominance_info (CDI_POST_DOMINATORS);
bitmap_obstack_initialize (&grand_bitmap_obstack);
phi_translate_table = new hash_table<expr_pred_trans_d> (5110);
@@ -4131,6 +4182,7 @@ fini_pre ()
name_to_id.release ();
free_aux_for_blocks ();
+ free_dominance_info (CDI_POST_DOMINATORS);
}
namespace {
* Re: LTO slows down calculix by more than 10% on aarch64
2020-09-22 9:37 ` Prathamesh Kulkarni
@ 2020-09-22 11:06 ` Richard Biener
2020-09-22 16:24 ` Prathamesh Kulkarni
0 siblings, 1 reply; 25+ messages in thread
From: Richard Biener @ 2020-09-22 11:06 UTC (permalink / raw)
To: Prathamesh Kulkarni; +Cc: Alexander Monakov, GCC Development
On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> > >
> > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > <prathamesh.kulkarni@linaro.org> wrote:
> > > >
> > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > >
> > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > > > >
> > > > > > > I obtained perf stat results for following benchmark runs:
> > > > > > >
> > > > > > > -O2:
> > > > > > >
> > > > > > > 7856832.692380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > 3758 context-switches # 0.000 K/sec
> > > > > > > 40 cpu-migrations # 0.000 K/sec
> > > > > > > 40847 page-faults # 0.005 K/sec
> > > > > > > 7856782413676 cycles # 1.000 GHz
> > > > > > > 6034510093417 instructions # 0.77 insn per cycle
> > > > > > > 363937274287 branches # 46.321 M/sec
> > > > > > > 48557110132 branch-misses # 13.34% of all branches
> > > > > >
> > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > > > enough for this kind of code)
> > > > > >
> > > > > > > -O2 with orthonl inlined:
> > > > > > >
> > > > > > > 8319643.114380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > 4285 context-switches # 0.001 K/sec
> > > > > > > 28 cpu-migrations # 0.000 K/sec
> > > > > > > 40843 page-faults # 0.005 K/sec
> > > > > > > 8319591038295 cycles # 1.000 GHz
> > > > > > > 6276338800377 instructions # 0.75 insn per cycle
> > > > > > > 467400726106 branches # 56.180 M/sec
> > > > > > > 45986364011 branch-misses # 9.84% of all branches
> > > > > >
> > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > > >
> > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > >
> > > > > > > 8207331.088040 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > 2266 context-switches # 0.000 K/sec
> > > > > > > 32 cpu-migrations # 0.000 K/sec
> > > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > > 8207292032467 cycles # 1.000 GHz
> > > > > > > 6035724436440 instructions # 0.74 insn per cycle
> > > > > > > 364415440156 branches # 44.401 M/sec
> > > > > > > 53138327276 branch-misses # 14.58% of all branches
> > > > > >
> > > > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > > > not very complicated (though Fortran array addressing...).
> > > > > >
> > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > >
> > > > > > > 7797265.206850 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > 3139 context-switches # 0.000 K/sec
> > > > > > > 20 cpu-migrations # 0.000 K/sec
> > > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > > 7797221351467 cycles # 1.000 GHz
> > > > > > > 6187348757324 instructions # 0.79 insn per cycle
> > > > > > > 461840800061 branches # 59.231 M/sec
> > > > > > > 26920311761 branch-misses # 5.83% of all branches
> > > > > >
> > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > > > in insn count).
> > > > > >
> > > > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > > > consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > > > each iterating only 3 times).
> > > > > >
> > > > > > > Perf profiles for
> > > > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > >
> > > > > > > 3196866 |1f04: ldur d1, [x1, #-248]
> > > > > > > 216348301800│ add w0, w0, #0x1
> > > > > > > 985098 | add x2, x2, #0x18
> > > > > > > 216215999206│ add x1, x1, #0x48
> > > > > > > 215630376504│ fmul d1, d5, d1
> > > > > > > 863829148015│ fmul d1, d1, d6
> > > > > > > 864228353526│ fmul d0, d1, d0
> > > > > > > 864568163014│ fmadd d2, d0, d16, d2
> > > > > > > │ cmp w0, #0x4
> > > > > > > 216125427594│ ↓ b.eq 1f34
> > > > > > > 15010377│ ldur d0, [x2, #-8]
> > > > > > > 143753737468│ ↑ b 1f04
> > > > > > >
> > > > > > > -O2 with inlined orthonl:
> > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > >
> > > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > > > > 144055883055│ add w0, w0, #0x1
> > > > > > > 72262104254│ add x2, x2, #0x18
> > > > > > > 143991169721│ add x1, x1, #0x48
> > > > > > > 288648917780│ fmul d15, d17, d15
> > > > > > > 864665644756│ fmul d15, d15, d18
> > > > > > > 863868426387│ fmul d14, d15, d14
> > > > > > > 865228159813│ fmadd d16, d14, d31, d16
> > > > > > > 245967│ cmp w0, #0x4
> > > > > > > 215396760545│ ↓ b.eq 1f28
> > > > > > > 704732365│ ldur d14, [x2, #-8]
> > > > > > > 143775979620│ ↑ b 1ef8
> > > > > >
> > > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > > >
> > > > > > High count on the initial ldur instruction could be explained if the loop
> > > > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > > > assembly.
> > > > > >
> > > > > > It may also be possible to check if code alignment matters, by compiling with
> > > > > > -falign-loops=32.
> > > > > Hi,
> > > > > Thanks a lot for the detailed feedback, and I am sorry for late response.
> > > > >
> > > > > The hoisting region is:
> > > > > if(mattyp.eq.1) then
> > > > > 4 loops
> > > > > elseif(mattyp.eq.2) then
> > > > > {
> > > > > orthonl inlined into basic block;
> > > > > loads w[0] .. w[8]
> > > > > }
> > > > > else
> > > > > 6 loops // load anisox
> > > > >
> > > > > followed by basic block:
> > > > >
> > > > > senergy=
> > > > > & (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > > & +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > > & +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > > s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > > s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > > s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > >
> > > > > Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block,
> > > > > right in block 181, which is:
> > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > >
> > > > > which is then further hoisted to block 173:
> > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > >
> > > > > From block 181, we have two paths towards senergy block (bb 194):
> > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > AND
> > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > > which has a path length of around 18 blocks.
> > > > > (bb 194 post-dominates bb 181 and bb 173).
> > > > >
> > > > > Disabling only load hoisting within blocks 173 and 181
> > > > > (simply avoid inserting pre_expr if pre_expr->kind == REFERENCE),
> > > > > avoid hoisting of 'w' array and brings back most of performance. Which
> > > > > verifies that it is hoisting of the
> > > > > 'w' array (w[0] ... w[8]), which is causing the slowdown ?
> > > > >
> > > > > I obtained perf profiles for full hoisting, and disabled hoisting of
> > > > > 'w' array for the 6 loops, and the most drastic difference was
> > > > > for ldur instruction:
> > > > >
> > > > > With full hoisting:
> > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > >
> > > > > Without full hoisting:
> > > > > 3441224 │1edc: ldur d1, [x1, #-248]
> > > > >
> > > > > (The loop entry seems to be fall thru in both cases. I have attached
> > > > > profiles for both cases).
> > > > >
> > > > > IIUC, the instruction seems to be loading the first element from anisox array,
> > > > > which makes me wonder if the issue was with data-cache miss for slower version.
> > > > > I ran perf script on perf data for L1-dcache-load-misses with period = 1million,
> > > > > and it reported two cache misses on the ldur instruction in full hoisting case,
> > > > > while it reported zero for the disabled load hoisting case.
> > > > > So I wonder if the slowdown happens because hoisting of 'w' array
> > > > > possibly results
> > > > > in eviction of anisox thus causing a cache miss inside the inner loop
> > > > > and making load slower ?
> > > > >
> > > > > Hoisting also seems to improve the number of overall cache misses tho.
> > > > > For disabled hoisting of 'w' array case, there were a total of 463
> > > > > cache misses, while with full hoisting there were 357 cache misses
> > > > > (with period = 1 million).
> > > > > Does that happen because hoisting probably reduces cache misses along
> > > > > the orthonl path (bb 173 - > bb 181 -> bb 182 -> bb 194) ?
> > > > Hi,
> > > > In general I feel for this or PR80155 case, the issues come with long
> > > > range hoistings, inside a large CFG, since we don't have an accurate
> > > > way to model target resources (register pressure in PR80155 case / or
> > > > possibly cache pressure in this case?) at tree level and we end up
> > > > with register spill or cache miss inside loops, which may offset the
> > > > benefit of hoisting. As previously discussed the right way to go is a
> > > > live range splitting pass, at GIMPLE -> RTL border which can also help
> > > > with other code-movement optimizations (or if the source had variables
> > > > with long live ranges).
> > > >
> > > > I was wondering tho as a cheap workaround, would it make sense to
> > > > check if we are hoisting across a "large" region of nested loops, and
> > > > avoid in that case since hoisting may exert resource pressure inside
> > > > loop region ? (Especially, in the cases where hoisted expressions were
> > > > not originally AVAIL in any of the loop blocks, and the loop region
> > > > doesn't benefit from hoisting).
> > > >
> > > > For instance:
> > > > FOR_EACH_EDGE (e, ei, block)
> > > > {
> > > > /* Avoid hoisting across more than 3 nested loops */
> > > > if (e->dest is a loop pre-header or loop header
> > > > && nesting depth of loop is > 3)
> > > > return false;
> > > > }
> > > >
> > > > I think this would work for resolving the calculix issue because it
> > > > hoists across one region of 6 loops and another of 4 loops (didn't test
> > > > yet).
> > > > It's not bulletproof in that it will miss detecting cases where loop
> > > > header (or pre-header) isn't a successor of candidate block (checking
> > > > for
> > > > that might get expensive tho?). I will test it on gcc suite and SPEC
> > > > for any regressions.
> > > > Does this sound like a reasonable heuristic ?
> > > Hi,
> > > The attached patch implements the above heuristic.
> > > Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> > > And it brings back most of performance for calculix on par with O2
> > > (without inlining orthonl).
> > > I verified that with patch there is no cache miss happening on load
> > > insn inside loop
> > > (with perf report -e L1-dcache-load-misses/period=1000000/)
> > >
> > > I am in the process of benchmarking the patch on aarch64 for SPEC for
> > > speed and will report numbers
> > > in couple of days. (If required, we could parametrize number of nested
> > > loops, hardcoded (arbitrarily to) 3 in this patch,
> > > and set it in backend to not affect other targets).
> >
> > I don't think this patch captures the case in a sensible way - it will simply
> > never hoist computations out of loop header blocks with depth > 3 which
> > is certainly not what you want. Also the pre-header check is odd - we're
> > looking for computations in successors of BLOCK but clearly a computation
> > in a pre-header is not at the same loop level as one in the header itself.
> Well, my intent was to check if we are hoisting across a region,
> which has multiple nested loops, and in that case, avoid hoisting expressions
> that do not belong to any loop blocks, because that may increase
> resource pressure
> inside loops. For instance, in the calculix issue we hoist 'w' array
> from post-dom and neither
> loop region has any uses of 'w'. I agree checking just for loop level
> is too coarse.
> The check with pre-header was essentially the same to see if we are
> hoisting across a loop region,
> not necessarily from within the loops.
But it will fail to hoist *p in
if (x)
{
a = *p;
}
else
{
b = *p;
}
<make distance large>
pdom:
c = *p;
so it isn't what matters either. What happens at the immediate post-dominator
isn't directly relevant - what matters would be if the pdom is the one making
the value antic on one of the outgoing edges. In that case we're also going
to PRE *p into the arm not computing *p (albeit in a later position). But
that property is impossible to compute from the sets themselves (not even mentioning
the arbitrary CFG that can be in between the block and its pdom or the weird
pdoms we compute for regions not having a path to exit, like infinite loops).
You could eventually look at the pdom predecessors and if *p is not AVAIL_OUT
in each of them we _might_ have the situation you want to protect against.
But as said PRE insertion will likely have made sure it _is_ AVAIL_OUT in each
of them ...
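For readers less familiar with hoisting, the example above can be pictured at source level roughly as follows. This is only an illustrative sketch: the function names and the way the three values are combined are invented here, and GCC performs the transformation on GIMPLE rather than on source code.

```cpp
// Before hoisting: *p is loaded in both arms and again at the
// post-dominator, as in the example above.
int load_in_arms(bool x, const int *p)
{
  int a = 0, b = 0;
  if (x)
    a = *p;
  else
    b = *p;
  /* <make distance large> */
  int c = *p;              // "pdom" use
  return a + 2 * b + 3 * c;
}

// After hoisting: a single load ahead of the branch.  Both arms use the
// value immediately, so hoisting is worthwhile here even though the value
// also stays live until the distant post-dominator use.
int load_hoisted(bool x, const int *p)
{
  int t = *p;              // hoisted load
  int a = 0, b = 0;
  if (x)
    a = t;
  else
    b = t;
  /* <make distance large> */
  int c = t;               // "pdom" use now reuses t
  return a + 2 * b + 3 * c;
}
```

A check that only looks at how far away the post-dominator is would reject this hoist, although the nearby uses in both arms are what justify it.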
> >
> > Note the difficulty to capture "distance" is that the distance is simply not
> > available at this point - it is the anticipated values from the successors
> > that do _not_ compute the value itself that are the issue. To capture
> > "distance" you'd need to somehow "age" anticipated value when
> > propagating them upwards during compute_antic (where it then
> > would also apply to PRE in general).
> Yes, indeed. As a hack, would it make sense to avoid inserting an
> expression in the block,
> if it's ANTIC in post-dom block as a trade-off between extending live
> range and hoisting
> if the "distance" between block and post-dom is "too far" ? In
> general, as you point out, we'd need to compute,
> distance info for successors block during compute_antic, but special
> casing for post-dom should be easy enough
> during do_hoist_insertion, and hoisting an expr that is ANTIC in
> post-dom could be potentially "long range", if the region is large.
> It's still a coarse heuristic tho. I tried it in the attached patch.
>
> Thanks,
> Prathamesh
>
>
> >
> > As with all other heuristics the only place one could do hackish attempts
> > with at least some reasoning is the elimination phase where
> > we make use of the (hoist) insertions - of course for hoisting we already
> > know we'll get the "close" use in one of the successors so I fear even
> > there it will be impossible to do something sensible.
> >
> > Richard.
> >
> > > Thanks,
> > > Prathamesh
> > > >
> > > > Thanks,
> > > > Prathamesh
> > > >
> > > >
> > > >
> > > > Thanks,
> > > > Prathamesh
> > > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > > >
> > > > > > Alexander
* Re: LTO slows down calculix by more than 10% on aarch64
2020-09-22 11:06 ` Richard Biener
@ 2020-09-22 16:24 ` Prathamesh Kulkarni
2020-09-23 7:52 ` Richard Biener
0 siblings, 1 reply; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-09-22 16:24 UTC (permalink / raw)
To: Richard Biener; +Cc: Alexander Monakov, GCC Development
On Tue, 22 Sep 2020 at 16:36, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guenther@gmail.com> wrote:
> > >
> > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > <prathamesh.kulkarni@linaro.org> wrote:
> > > >
> > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > >
> > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > >
> > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > > > > >
> > > > > > > > I obtained perf stat results for following benchmark runs:
> > > > > > > >
> > > > > > > > -O2:
> > > > > > > >
> > > > > > > > 7856832.692380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > 3758 context-switches # 0.000 K/sec
> > > > > > > > 40 cpu-migrations # 0.000 K/sec
> > > > > > > > 40847 page-faults # 0.005 K/sec
> > > > > > > > 7856782413676 cycles # 1.000 GHz
> > > > > > > > 6034510093417 instructions # 0.77 insn per cycle
> > > > > > > > 363937274287 branches # 46.321 M/sec
> > > > > > > > 48557110132 branch-misses # 13.34% of all branches
> > > > > > >
> > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > > > > enough for this kind of code)
> > > > > > >
> > > > > > > > -O2 with orthonl inlined:
> > > > > > > >
> > > > > > > > 8319643.114380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > 4285 context-switches # 0.001 K/sec
> > > > > > > > 28 cpu-migrations # 0.000 K/sec
> > > > > > > > 40843 page-faults # 0.005 K/sec
> > > > > > > > 8319591038295 cycles # 1.000 GHz
> > > > > > > > 6276338800377 instructions # 0.75 insn per cycle
> > > > > > > > 467400726106 branches # 56.180 M/sec
> > > > > > > > 45986364011 branch-misses # 9.84% of all branches
> > > > > > >
> > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > > > >
> > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > >
> > > > > > > > 8207331.088040 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > 2266 context-switches # 0.000 K/sec
> > > > > > > > 32 cpu-migrations # 0.000 K/sec
> > > > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > > > 8207292032467 cycles # 1.000 GHz
> > > > > > > > 6035724436440 instructions # 0.74 insn per cycle
> > > > > > > > 364415440156 branches # 44.401 M/sec
> > > > > > > > 53138327276 branch-misses # 14.58% of all branches
> > > > > > >
> > > > > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > > > > not very complicated (though Fortran array addressing...).
> > > > > > >
> > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > >
> > > > > > > > 7797265.206850 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > 3139 context-switches # 0.000 K/sec
> > > > > > > > 20 cpu-migrations # 0.000 K/sec
> > > > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > > > 7797221351467 cycles # 1.000 GHz
> > > > > > > > 6187348757324 instructions # 0.79 insn per cycle
> > > > > > > > 461840800061 branches # 59.231 M/sec
> > > > > > > > 26920311761 branch-misses # 5.83% of all branches
> > > > > > >
> > > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > > > > in insn count).
> > > > > > >
> > > > > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > > > > consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > > > > each iterating only 3 times).
> > > > > > >
> > > > > > > > Perf profiles for
> > > > > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > >
> > > > > > > > 3196866 |1f04: ldur d1, [x1, #-248]
> > > > > > > > 216348301800│ add w0, w0, #0x1
> > > > > > > > 985098 | add x2, x2, #0x18
> > > > > > > > 216215999206│ add x1, x1, #0x48
> > > > > > > > 215630376504│ fmul d1, d5, d1
> > > > > > > > 863829148015│ fmul d1, d1, d6
> > > > > > > > 864228353526│ fmul d0, d1, d0
> > > > > > > > 864568163014│ fmadd d2, d0, d16, d2
> > > > > > > > │ cmp w0, #0x4
> > > > > > > > 216125427594│ ↓ b.eq 1f34
> > > > > > > > 15010377│ ldur d0, [x2, #-8]
> > > > > > > > 143753737468│ ↑ b 1f04
> > > > > > > >
> > > > > > > > -O2 with inlined orthonl:
> > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > >
> > > > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > > > > > 144055883055│ add w0, w0, #0x1
> > > > > > > > 72262104254│ add x2, x2, #0x18
> > > > > > > > 143991169721│ add x1, x1, #0x48
> > > > > > > > 288648917780│ fmul d15, d17, d15
> > > > > > > > 864665644756│ fmul d15, d15, d18
> > > > > > > > 863868426387│ fmul d14, d15, d14
> > > > > > > > 865228159813│ fmadd d16, d14, d31, d16
> > > > > > > > 245967│ cmp w0, #0x4
> > > > > > > > 215396760545│ ↓ b.eq 1f28
> > > > > > > > 704732365│ ldur d14, [x2, #-8]
> > > > > > > > 143775979620│ ↑ b 1ef8
> > > > > > >
> > > > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > > > >
> > > > > > > High count on the initial ldur instruction could be explained if the loop
> > > > > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > > > > assembly.
> > > > > > >
> > > > > > > It may also be possible to check if code alignment matters, by compiling with
> > > > > > > -falign-loops=32.
> > > > > > Hi,
> > > > > > Thanks a lot for the detailed feedback, and I am sorry for late response.
> > > > > >
> > > > > > The hoisting region is:
> > > > > > if(mattyp.eq.1) then
> > > > > > 4 loops
> > > > > > elseif(mattyp.eq.2) then
> > > > > > {
> > > > > > orthonl inlined into basic block;
> > > > > > loads w[0] .. w[8]
> > > > > > }
> > > > > > else
> > > > > > 6 loops // load anisox
> > > > > >
> > > > > > followed by basic block:
> > > > > >
> > > > > > senergy=
> > > > > > & (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > > > & +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > > > & +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > > > s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > > > s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > > > s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > > >
> > > > > > Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block,
> > > > > > right in block 181, which is:
> > > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > > >
> > > > > > which is then further hoisted to block 173:
> > > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > > >
> > > > > > From block 181, we have two paths towards senergy block (bb 194):
> > > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > > AND
> > > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > > > which has a path length of around 18 blocks.
> > > > > > (bb 194 post-dominates bb 181 and bb 173).
> > > > > >
> > > > > > Disabling only load hoisting within blocks 173 and 181
> > > > > > (simply avoid inserting pre_expr if pre_expr->kind == REFERENCE),
> > > > > > avoid hoisting of 'w' array and brings back most of performance. Which
> > > > > > verifies that it is hoisting of the
> > > > > > 'w' array (w[0] ... w[8]), which is causing the slowdown ?
> > > > > >
> > > > > > I obtained perf profiles for full hoisting, and disabled hoisting of
> > > > > > 'w' array for the 6 loops, and the most drastic difference was
> > > > > > for ldur instruction:
> > > > > >
> > > > > > With full hoisting:
> > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > > >
> > > > > > Without full hoisting:
> > > > > > 3441224 │1edc: ldur d1, [x1, #-248]
> > > > > >
> > > > > > (The loop entry seems to be fall thru in both cases. I have attached
> > > > > > profiles for both cases).
> > > > > >
> > > > > > IIUC, the instruction seems to be loading the first element from anisox array,
> > > > > > which makes me wonder if the issue was with data-cache miss for slower version.
> > > > > > I ran perf script on perf data for L1-dcache-load-misses with period = 1million,
> > > > > > and it reported two cache misses on the ldur instruction in full hoisting case,
> > > > > > while it reported zero for the disabled load hoisting case.
> > > > > > So I wonder if the slowdown happens because hoisting of 'w' array
> > > > > > possibly results
> > > > > > in eviction of anisox thus causing a cache miss inside the inner loop
> > > > > > and making load slower ?
> > > > > >
> > > > > > Hoisting also seems to improve the number of overall cache misses tho.
> > > > > > For disabled hoisting of 'w' array case, there were a total of 463
> > > > > > cache misses, while with full hoisting there were 357 cache misses
> > > > > > (with period = 1 million).
> > > > > > Does that happen because hoisting probably reduces cache misses along
> > > > > > the orthonl path (bb 173 - > bb 181 -> bb 182 -> bb 194) ?
> > > > > Hi,
> > > > > In general I feel for this or PR80155 case, the issues come with long
> > > > > range hoistings, inside a large CFG, since we don't have an accurate
> > > > > way to model target resources (register pressure in PR80155 case / or
> > > > > possibly cache pressure in this case?) at tree level and we end up
> > > > > with register spill or cache miss inside loops, which may offset the
> > > > > benefit of hoisting. As previously discussed the right way to go is a
> > > > > live range splitting pass, at GIMPLE -> RTL border which can also help
> > > > > with other code-movement optimizations (or if the source had variables
> > > > > with long live ranges).
> > > > >
> > > > > I was wondering tho as a cheap workaround, would it make sense to
> > > > > check if we are hoisting across a "large" region of nested loops, and
> > > > > avoid in that case since hoisting may exert resource pressure inside
> > > > > loop region ? (Especially, in the cases where hoisted expressions were
> > > > > not originally AVAIL in any of the loop blocks, and the loop region
> > > > > doesn't benefit from hoisting).
> > > > >
> > > > > For instance:
> > > > > FOR_EACH_EDGE (e, ei, block)
> > > > > {
> > > > > /* Avoid hoisting across more than 3 nested loops */
> > > > > if (e->dest is a loop pre-header or loop header
> > > > > && nesting depth of loop is > 3)
> > > > > return false;
> > > > > }
> > > > >
> > > > > I think this would work for resolving the calculix issue because it
> > > > > hoists across one region of 6 loops and another of 4 loops (didn't test
> > > > > yet).
> > > > > It's not bulletproof in that it will miss detecting cases where loop
> > > > > header (or pre-header) isn't a successor of candidate block (checking
> > > > > for
> > > > > that might get expensive tho?). I will test it on gcc suite and SPEC
> > > > > for any regressions.
> > > > > Does this sound like a reasonable heuristic ?
> > > > Hi,
> > > > The attached patch implements the above heuristic.
> > > > Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> > > > And it brings back most of performance for calculix on par with O2
> > > > (without inlining orthonl).
> > > > I verified that with patch there is no cache miss happening on load
> > > > insn inside loop
> > > > (with perf report -e L1-dcache-load-misses/period=1000000/)
> > > >
> > > > I am in the process of benchmarking the patch on aarch64 for SPEC for
> > > > speed and will report numbers
> > > > in couple of days. (If required, we could parametrize number of nested
> > > > loops, hardcoded (arbitrarily to) 3 in this patch,
> > > > and set it in backend to not affect other targets).
> > >
> > > I don't think this patch captures the case in a sensible way - it will simply
> > > never hoist computations out of loop header blocks with depth > 3 which
> > > is certainly not what you want. Also the pre-header check is odd - we're
> > > looking for computations in successors of BLOCK but clearly a computation
> > > in a pre-header is not at the same loop level as one in the header itself.
> > Well, my intent was to check if we are hoisting across a region,
> > which has multiple nested loops, and in that case, avoid hoisting expressions
> > that do not belong to any loop blocks, because that may increase
> > resource pressure
> > inside loops. For instance, in the calculix issue we hoist 'w' array
> > from post-dom and neither
> > loop region has any uses of 'w'. I agree checking just for loop level
> > is too coarse.
> > The check with pre-header was essentially the same to see if we are
> > hoisting across a loop region,
> > not necessarily from within the loops.
>
> But it will fail to hoist *p in
>
> if (x)
> {
> a = *p;
> }
> else
> {
> b = *p;
> }
>
> <make distance large>
> pdom:
> c = *p;
>
> so it isn't what matters either. What happens at the immediate post-dominator
> isn't directly relevant - what matters would be if the pdom is the one making
> the value antic on one of the outgoing edges. In that case we're also going
> to PRE *p into the arm not computing *p (albeit in a later position). But
> that property is impossible to compute from the sets itself (not even mentioning
> the arbitrary CFG that can be inbetween the block and its pdom or the weird
> pdoms we compute for regions not having a path to exit, like infinite loops).
>
> You could eventually look at the pdom predecessors and if *p is not AVAIL_OUT
> in each of them we _might_ have the situation you want to protect against.
> But as said PRE insertion will likely have made sure it _is_ AVAIL_OUT in each
> of them ...
Hi Richard,
Thanks for the suggestions. Right, the issue here seems to be that the
post-dom block is making the expressions ANTIC. Before doing insertion, could
we simply copy the AVAIL_OUT set of each block into another set, say ORIG_AVAIL_OUT,
as a guard against PRE eventually inserting expressions in the pred blocks
of pdom and making them available?
Then during hoisting, if the distance is "large", we could check whether each
expr that is ANTIC_IN (pdom) is ORIG_AVAIL_OUT in each pred of pdom.
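A minimal sketch of that check, as a toy model rather than GCC code: value
sets are plain 64-bit masks (one bit per value number) instead of PRE's real
value sets, and toy_block, values_blocked_from_hoisting, distance and
max_distance are all hypothetical names. In particular, nothing in PRE
provides a "distance" today, so it is just a parameter here.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PREDS 4

/* Toy stand-in for a basic block. Each set is a 64-bit mask with one
   bit per value number (an assumption made only for this sketch). */
struct toy_block {
    uint64_t antic_in;        /* values anticipated on entry to the block */
    uint64_t orig_avail_out;  /* AVAIL_OUT snapshot taken before insertion */
    int n_preds;
    const struct toy_block *preds[MAX_PREDS];
};

/* Returns the values the proposed check would refuse to hoist across the
   region ending at PDOM: values that are ANTIC_IN (pdom) but were not
   originally available out of every predecessor of pdom. The check only
   applies when the (hypothetical) distance exceeds max_distance. */
uint64_t values_blocked_from_hoisting(const struct toy_block *pdom,
                                      int distance, int max_distance)
{
    assert(pdom != 0 && pdom->n_preds <= MAX_PREDS);

    if (distance <= max_distance)
        return 0;

    /* Intersect the snapshots of all predecessors. */
    uint64_t avail_in_all_preds = ~(uint64_t)0;
    for (int i = 0; i < pdom->n_preds; i++)
        avail_in_all_preds &= pdom->preds[i]->orig_avail_out;

    return pdom->antic_in & ~avail_in_all_preds;
}
```

With one pred that computes the value (like the orthonl block) and one that
does not (like the exit of the loop nest), the value is blocked when the
distance is large. If both preds compute it, as in Richard's *p example,
nothing is blocked and hoisting still happens.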
Thanks,
Prathamesh
>
> > >
> > > Note the difficulty to capture "distance" is that the distance is simply not
> > > available at this point - it is the anticipated values from the successors
> > > that do _not_ compute the value itself that are the issue. To capture
> > > "distance" you'd need to somehow "age" anticipated value when
> > > propagating them upwards during compute_antic (where it then
> > > would also apply to PRE in general).
> > Yes, indeed. As a hack, would it make sense to avoid inserting an
> > expression in the block,
> > if it's ANTIC in post-dom block as a trade-off between extending live
> > range and hoisting
> > if the "distance" between block and post-dom is "too far" ? In
> > general, as you point out, we'd need to compute,
> > distance info for successors block during compute_antic, but special
> > casing for post-dom should be easy enough
> > during do_hoist_insertion, and hoisting an expr that is ANTIC in
> > post-dom could be potentially "long range", if the region is large.
> > It's still a coarse heuristic tho. I tried it in the attached patch.
> >
> > Thanks,
> > Prathamesh
> >
> >
> > >
> > > As with all other heuristics the only place one could do hackish attempts
> > > with at least some reasoning is the elimination phase where
> > > we make use of the (hoist) insertions - of course for hoisting we already
> > > know we'll get the "close" use in one of the successors so I fear even
> > > there it will be impossible to do something sensible.
> > >
> > > Richard.
> > >
> > > > Thanks,
> > > > Prathamesh
> > > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > >
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > > >
> > > > > > Thanks,
> > > > > > Prathamesh
> > > > > > >
> > > > > > > Alexander
* Re: LTO slows down calculix by more than 10% on aarch64
2020-09-22 16:24 ` Prathamesh Kulkarni
@ 2020-09-23 7:52 ` Richard Biener
2020-09-23 10:10 ` Prathamesh Kulkarni
0 siblings, 1 reply; 25+ messages in thread
From: Richard Biener @ 2020-09-23 7:52 UTC (permalink / raw)
To: Prathamesh Kulkarni; +Cc: Alexander Monakov, GCC Development
On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Tue, 22 Sep 2020 at 16:36, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> > >
> > > On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guenther@gmail.com> wrote:
> > > >
> > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > >
> > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > >
> > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > >
> > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > > > > > >
> > > > > > > > > I obtained perf stat results for following benchmark runs:
> > > > > > > > >
> > > > > > > > > -O2:
> > > > > > > > >
> > > > > > > > > 7856832.692380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > 3758 context-switches # 0.000 K/sec
> > > > > > > > > 40 cpu-migrations # 0.000 K/sec
> > > > > > > > > 40847 page-faults # 0.005 K/sec
> > > > > > > > > 7856782413676 cycles # 1.000 GHz
> > > > > > > > > 6034510093417 instructions # 0.77 insn per cycle
> > > > > > > > > 363937274287 branches # 46.321 M/sec
> > > > > > > > > 48557110132 branch-misses # 13.34% of all branches
> > > > > > > >
> > > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > > > > > enough for this kind of code)
> > > > > > > >
> > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > >
> > > > > > > > > 8319643.114380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > 4285 context-switches # 0.001 K/sec
> > > > > > > > > 28 cpu-migrations # 0.000 K/sec
> > > > > > > > > 40843 page-faults # 0.005 K/sec
> > > > > > > > > 8319591038295 cycles # 1.000 GHz
> > > > > > > > > 6276338800377 instructions # 0.75 insn per cycle
> > > > > > > > > 467400726106 branches # 56.180 M/sec
> > > > > > > > > 45986364011 branch-misses # 9.84% of all branches
> > > > > > > >
> > > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > > > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > > > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > > > > >
> > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > > >
> > > > > > > > > 8207331.088040 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > 2266 context-switches # 0.000 K/sec
> > > > > > > > > 32 cpu-migrations # 0.000 K/sec
> > > > > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > > > > 8207292032467 cycles # 1.000 GHz
> > > > > > > > > 6035724436440 instructions # 0.74 insn per cycle
> > > > > > > > > 364415440156 branches # 44.401 M/sec
> > > > > > > > > 53138327276 branch-misses # 14.58% of all branches
> > > > > > > >
> > > > > > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > > > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > > > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > > > > > not very complicated (though Fortran array addressing...).
> > > > > > > >
> > > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > > >
> > > > > > > > > 7797265.206850 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > 3139 context-switches # 0.000 K/sec
> > > > > > > > > 20 cpu-migrations # 0.000 K/sec
> > > > > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > > > > 7797221351467 cycles # 1.000 GHz
> > > > > > > > > 6187348757324 instructions # 0.79 insn per cycle
> > > > > > > > > 461840800061 branches # 59.231 M/sec
> > > > > > > > > 26920311761 branch-misses # 5.83% of all branches
> > > > > > > >
> > > > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > > > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > > > > > in insn count).
> > > > > > > >
> > > > > > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > > > > > consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > > > > > each iterating only 3 times).
> > > > > > > >
> > > > > > > > > Perf profiles for
> > > > > > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > >
> > > > > > > > > 3196866 |1f04: ldur d1, [x1, #-248]
> > > > > > > > > 216348301800│ add w0, w0, #0x1
> > > > > > > > > 985098 | add x2, x2, #0x18
> > > > > > > > > 216215999206│ add x1, x1, #0x48
> > > > > > > > > 215630376504│ fmul d1, d5, d1
> > > > > > > > > 863829148015│ fmul d1, d1, d6
> > > > > > > > > 864228353526│ fmul d0, d1, d0
> > > > > > > > > 864568163014│ fmadd d2, d0, d16, d2
> > > > > > > > > │ cmp w0, #0x4
> > > > > > > > > 216125427594│ ↓ b.eq 1f34
> > > > > > > > > 15010377│ ldur d0, [x2, #-8]
> > > > > > > > > 143753737468│ ↑ b 1f04
> > > > > > > > >
> > > > > > > > > -O2 with inlined orthonl:
> > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > >
> > > > > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > > > > > > 144055883055│ add w0, w0, #0x1
> > > > > > > > > 72262104254│ add x2, x2, #0x18
> > > > > > > > > 143991169721│ add x1, x1, #0x48
> > > > > > > > > 288648917780│ fmul d15, d17, d15
> > > > > > > > > 864665644756│ fmul d15, d15, d18
> > > > > > > > > 863868426387│ fmul d14, d15, d14
> > > > > > > > > 865228159813│ fmadd d16, d14, d31, d16
> > > > > > > > > 245967│ cmp w0, #0x4
> > > > > > > > > 215396760545│ ↓ b.eq 1f28
> > > > > > > > > 704732365│ ldur d14, [x2, #-8]
> > > > > > > > > 143775979620│ ↑ b 1ef8
> > > > > > > >
> > > > > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > > > > >
> > > > > > > > High count on the initial ldur instruction could be explained if the loop
> > > > > > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > > > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > > > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > > > > > assembly.
> > > > > > > >
> > > > > > > > It may also be possible to check if code alignment matters, by compiling with
> > > > > > > > -falign-loops=32.
> > > > > > > Hi,
> > > > > > > Thanks a lot for the detailed feedback, and I am sorry for late response.
> > > > > > >
> > > > > > > The hoisting region is:
> > > > > > > if(mattyp.eq.1) then
> > > > > > > 4 loops
> > > > > > > elseif(mattyp.eq.2) then
> > > > > > > {
> > > > > > > orthonl inlined into basic block;
> > > > > > > loads w[0] .. w[8]
> > > > > > > }
> > > > > > > else
> > > > > > > 6 loops // load anisox
> > > > > > >
> > > > > > > followed by basic block:
> > > > > > >
> > > > > > > senergy=
> > > > > > > & (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > > > > & +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > > > > & +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > > > > s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > > > > s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > > > > s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > > > >
> > > > > > > Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block,
> > > > > > > right in block 181, which is:
> > > > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > > > >
> > > > > > > which is then further hoisted to block 173:
> > > > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > > > >
> > > > > > > From block 181, we have two paths towards senergy block (bb 194):
> > > > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > > > AND
> > > > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > > > > which has a path length of around 18 blocks.
> > > > > > > (bb 194 post-dominates bb 181 and bb 173).
> > > > > > >
> > > > > > > Disabling only load hoisting within blocks 173 and 181
> > > > > > > (simply avoid inserting pre_expr if pre_expr->kind == REFERENCE),
> > > > > > > avoid hoisting of 'w' array and brings back most of performance. Which
> > > > > > > verifies that it is hoisting of the
> > > > > > > 'w' array (w[0] ... w[8]), which is causing the slowdown ?
> > > > > > >
> > > > > > > I obtained perf profiles for full hoisting, and disabled hoisting of
> > > > > > > 'w' array for the 6 loops, and the most drastic difference was
> > > > > > > for ldur instruction:
> > > > > > >
> > > > > > > With full hoisting:
> > > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > > > >
> > > > > > > Without full hoisting:
> > > > > > > 3441224 │1edc: ldur d1, [x1, #-248]
> > > > > > >
> > > > > > > (The loop entry seems to be fall thru in both cases. I have attached
> > > > > > > profiles for both cases).
> > > > > > >
> > > > > > > IIUC, the instruction seems to be loading the first element from anisox array,
> > > > > > > which makes me wonder if the issue was with data-cache miss for slower version.
> > > > > > > I ran perf script on perf data for L1-dcache-load-misses with period = 1million,
> > > > > > > and it reported two cache misses on the ldur instruction in full hoisting case,
> > > > > > > while it reported zero for the disabled load hoisting case.
> > > > > > > So I wonder if the slowdown happens because hoisting of 'w' array
> > > > > > > possibly results
> > > > > > > in eviction of anisox thus causing a cache miss inside the inner loop
> > > > > > > and making load slower ?
> > > > > > >
> > > > > > > Hoisting also seems to improve the number of overall cache misses tho.
> > > > > > > For disabled hoisting of 'w' array case, there were a total of 463
> > > > > > > cache misses, while with full hoisting there were 357 cache misses
> > > > > > > (with period = 1 million).
> > > > > > > Does that happen because hoisting probably reduces cache misses along
> > > > > > > the orthonl path (bb 173 -> bb 181 -> bb 182 -> bb 194) ?
> > > > > > Hi,
> > > > > > In general I feel for this or PR80155 case, the issues come with long
> > > > > > range hoistings, inside a large CFG, since we don't have an accurate
> > > > > > way to model target resources (register pressure in PR80155 case / or
> > > > > > possibly cache pressure in this case?) at tree level and we end up
> > > > > > with register spill or cache miss inside loops, which may offset the
> > > > > > benefit of hoisting. As previously discussed the right way to go is a
> > > > > > live range splitting pass, at GIMPLE -> RTL border which can also help
> > > > > > with other code-movement optimizations (or if the source had variables
> > > > > > with long live ranges).
> > > > > >
> > > > > > I was wondering tho as a cheap workaround, would it make sense to
> > > > > > check if we are hoisting across a "large" region of nested loops, and
> > > > > > avoid in that case since hoisting may exert resource pressure inside
> > > > > > loop region ? (Especially, in the cases where hoisted expressions were
> > > > > > not originally AVAIL in any of the loop blocks, and the loop region
> > > > > > doesn't benefit from hoisting).
> > > > > >
> > > > > > For instance:
> > > > > > FOR_EACH_EDGE (e, ei, block)
> > > > > > {
> > > > > > /* Avoid hoisting across more than 3 nested loops */
> > > > > > if (e->dest is a loop pre-header or loop header
> > > > > > && nesting depth of loop is > 3)
> > > > > > return false;
> > > > > > }
> > > > > >
> > > > > > I think this would work for resolving the calculix issue because it
> > > > > > hoists across one region of 6 loops and another of 4 loops (didn't test
> > > > > > yet).
> > > > > > It's not bulletproof in that it will miss detecting cases where loop
> > > > > > header (or pre-header) isn't a successor of candidate block (checking
> > > > > > for
> > > > > > that might get expensive tho?). I will test it on gcc suite and SPEC
> > > > > > for any regressions.
> > > > > > Does this sound like a reasonable heuristic ?
> > > > > Hi,
> > > > > The attached patch implements the above heuristic.
> > > > > Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> > > > > And it brings back most of performance for calculix on par with O2
> > > > > (without inlining orthonl).
> > > > > I verified that with patch there is no cache miss happening on load
> > > > > insn inside loop
> > > > > (with perf report -e L1-dcache-load-misses/period=1000000/)
> > > > >
> > > > > I am in the process of benchmarking the patch on aarch64 for SPEC for
> > > > > speed and will report numbers
> > > > > in couple of days. (If required, we could parametrize number of nested
> > > > > loops, hardcoded (arbitrarily to) 3 in this patch,
> > > > > and set it in backend to not affect other targets).
> > > >
> > > > I don't think this patch captures the case in a sensible way - it will simply
> > > > never hoist computations out of loop header blocks with depth > 3 which
> > > > is certainly not what you want. Also the pre-header check is odd - we're
> > > > looking for computations in successors of BLOCK but clearly a computation
> > > > in a pre-header is not at the same loop level as one in the header itself.
> > > Well, my intent was to check if we are hoisting across a region,
> > > which has multiple nested loops, and in that case, avoid hoisting expressions
> > > that do not belong to any loop blocks, because that may increase
> > > resource pressure
> > > inside loops. For instance, in the calculix issue we hoist 'w' array
> > > from post-dom and neither
> > > loop region has any uses of 'w'. I agree checking just for loop level
> > > is too coarse.
> > > The check with pre-header was essentially the same to see if we are
> > > hoisting across a loop region,
> > > not necessarily from within the loops.
> >
> > But it will fail to hoist *p in
> >
> > if (x)
> > {
> > a = *p;
> > }
> > else
> > {
> > b = *p;
> > }
> >
> > <make distance large>
> > pdom:
> > c = *p;
> >
> > so it isn't what matters either. What happens at the immediate post-dominator
> > isn't directly relevant - what matters would be if the pdom is the one making
> > the value antic on one of the outgoing edges. In that case we're also going
> > to PRE *p into the arm not computing *p (albeit in a later position). But
> > that property is impossible to compute from the sets itself (not even mentioning
> > the arbitrary CFG that can be inbetween the block and its pdom or the weird
> > pdoms we compute for regions not having a path to exit, like infinite loops).
> >
> > You could eventually look at the pdom predecessors and if *p is not AVAIL_OUT
> > in each of them we _might_ have the situation you want to protect against.
> > But as said PRE insertion will likely have made sure it _is_ AVAIL_OUT in each
> > of them ...
> Hi Richard,
> Thanks for the suggestions. Right, the issue seems to be here that
> post-dom block is making expressions ANTIC. Before doing insert, could
> we simply copy AVAIL_OUT sets of each block into another set say ORIG_AVAIL_OUT,
> as a guard against PRE eventually inserting expressions in pred blocks
> of pdom and making them available?
> And during hoisting, we could check if each expr that is ANTIC_IN
> (pdom) is ORIG_AVAIL_OUT in each pred of pdom,
> if the distance is "large".
Did you try if it works w/o copying AVAIL_OUT? AVAIL_OUT is very large
(it's actually quadratic in size of the CFG * # values), so copying it is
expensive. In particular, we're inserting in RPO and update AVAIL_OUT only up
to the current block (from dominators), so the PDOM block should still have the
original AVAIL_OUT (but from the last iteration - we do iterate insertion).
Note I'm still not happy with pulling off this kind of heuristics.
What the suggestion means is that for

  if (x)
    y = *p;
  z = *p;

we'll end up with

  if (x)
    y = *p;
  else
    z = *p;

instead of

  tem = *p;
  if (x)
    y = tem;
  else
    z = tem;

that is, we get the runtime benefit but not the code-size one
(hoisting mostly helps code size plus allows if-conversion as followup
in some cases). Now, if we iterate (like if we'd have a second hoisting pass)
then the above would still cause hoisting - so the heuristic isn't sustainable.
Again, sth like "distance" isn't really available.
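To make the runtime vs. code-size point concrete, here is a small sketch in
plain C, not compiler output. The function names and the load() counter are
made up for illustration, and the PRE-only variant spells out the shorthand
above by reusing y for z on the x path, which is my reading of it.

```c
#include <assert.h>

/* Counts dynamic loads so the three shapes can be compared. */
int load_count;

int load(const int *p)
{
    assert(p != 0);
    ++load_count;
    return *p;
}

/* Original shape: *p is loaded in the arm and again after the join. */
int original(int x, const int *p)
{
    int y = 0, z;
    if (x)
        y = load(p);
    z = load(p);
    return y + z;
}

/* PRE only: one load on every path, but two static load sites. */
int pre_only(int x, const int *p)
{
    int y = 0, z;
    if (x) {
        y = load(p);
        z = y;
    } else {
        z = load(p);
    }
    return y + z;
}

/* Hoisted: a single static load site ahead of the branch. */
int hoisted(int x, const int *p)
{
    int tem = load(p);
    int y = 0, z;
    if (x)
        y = tem;
    z = tem;
    return y + z;
}

/* Helper: number of dynamic loads one call of VARIANT performs. */
int loads_for(int (*variant)(int, const int *), int x, const int *p)
{
    load_count = 0;
    (void) variant(x, p);
    return load_count;
}
```

pre_only and hoisted execute the same number of loads on each path, so the
only thing hoisting adds here is the single load site (smaller code), at the
price of a value that is live across the branch.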
Richard.
> Thanks,
> Prathamesh
>
>
> >
> > > >
> > > > Note the difficulty to capture "distance" is that the distance is simply not
> > > > available at this point - it is the anticipated values from the successors
> > > > that do _not_ compute the value itself that are the issue. To capture
> > > > "distance" you'd need to somehow "age" anticipated value when
> > > > propagating them upwards during compute_antic (where it then
> > > > would also apply to PRE in general).
> > > Yes, indeed. As a hack, would it make sense to avoid inserting an
> > > expression in the block,
> > > if it's ANTIC in post-dom block as a trade-off between extending live
> > > range and hoisting
> > > if the "distance" between block and post-dom is "too far" ? In
> > > general, as you point out, we'd need to compute,
> > > distance info for successors block during compute_antic, but special
> > > casing for post-dom should be easy enough
> > > during do_hoist_insertion, and hoisting an expr that is ANTIC in
> > > post-dom could be potentially "long range", if the region is large.
> > > It's still a coarse heuristic tho. I tried it in the attached patch.
> > >
> > > Thanks,
> > > Prathamesh
> > >
> > >
> > > >
> > > > As with all other heuristics the only place one could do hackish attempts
> > > > with at least some reasoning is the elimination phase where
> > > > we make use of the (hoist) insertions - of course for hoisting we already
> > > > know we'll get the "close" use in one of the successors so I fear even
> > > > there it will be impossible to do something sensible.
> > > >
> > > > Richard.
> > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > > >
> > > > > > Thanks,
> > > > > > Prathamesh
> > > > > >
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Prathamesh
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Prathamesh
> > > > > > > >
> > > > > > > > Alexander
* Re: LTO slows down calculix by more than 10% on aarch64
2020-09-23 7:52 ` Richard Biener
@ 2020-09-23 10:10 ` Prathamesh Kulkarni
2020-09-23 11:10 ` Richard Biener
0 siblings, 1 reply; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-09-23 10:10 UTC (permalink / raw)
To: Richard Biener; +Cc: Alexander Monakov, GCC Development
[-- Attachment #1: Type: text/plain, Size: 22587 bytes --]
On Wed, 23 Sep 2020 at 13:22, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Tue, 22 Sep 2020 at 16:36, Richard Biener <richard.guenther@gmail.com> wrote:
> > >
> > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> > > <prathamesh.kulkarni@linaro.org> wrote:
> > > >
> > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > >
> > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > >
> > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > >
> > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > >
> > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > > > > > > >
> > > > > > > > > > I obtained perf stat results for following benchmark runs:
> > > > > > > > > >
> > > > > > > > > > -O2:
> > > > > > > > > >
> > > > > > > > > > 7856832.692380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > 3758 context-switches # 0.000 K/sec
> > > > > > > > > > 40 cpu-migrations # 0.000 K/sec
> > > > > > > > > > 40847 page-faults # 0.005 K/sec
> > > > > > > > > > 7856782413676 cycles # 1.000 GHz
> > > > > > > > > > 6034510093417 instructions # 0.77 insn per cycle
> > > > > > > > > > 363937274287 branches # 46.321 M/sec
> > > > > > > > > > 48557110132 branch-misses # 13.34% of all branches
> > > > > > > > >
> > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > > > > > > enough for this kind of code)
> > > > > > > > >
> > > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > > >
> > > > > > > > > > 8319643.114380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > 4285 context-switches # 0.001 K/sec
> > > > > > > > > > 28 cpu-migrations # 0.000 K/sec
> > > > > > > > > > 40843 page-faults # 0.005 K/sec
> > > > > > > > > > 8319591038295 cycles # 1.000 GHz
> > > > > > > > > > 6276338800377 instructions # 0.75 insn per cycle
> > > > > > > > > > 467400726106 branches # 56.180 M/sec
> > > > > > > > > > 45986364011 branch-misses # 9.84% of all branches
> > > > > > > > >
> > > > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > > > > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > > > > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > > > > > >
> > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > > > >
> > > > > > > > > > 8207331.088040 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > 2266 context-switches # 0.000 K/sec
> > > > > > > > > > 32 cpu-migrations # 0.000 K/sec
> > > > > > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > > > > > 8207292032467 cycles # 1.000 GHz
> > > > > > > > > > 6035724436440 instructions # 0.74 insn per cycle
> > > > > > > > > > 364415440156 branches # 44.401 M/sec
> > > > > > > > > > 53138327276 branch-misses # 14.58% of all branches
> > > > > > > > >
> > > > > > > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > > > > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > > > > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > > > > > > not very complicated (though Fortran array addressing...).
> > > > > > > > >
> > > > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > > > >
> > > > > > > > > > 7797265.206850 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > 3139 context-switches # 0.000 K/sec
> > > > > > > > > > 20 cpu-migrations # 0.000 K/sec
> > > > > > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > > > > > 7797221351467 cycles # 1.000 GHz
> > > > > > > > > > 6187348757324 instructions # 0.79 insn per cycle
> > > > > > > > > > 461840800061 branches # 59.231 M/sec
> > > > > > > > > > 26920311761 branch-misses # 5.83% of all branches
> > > > > > > > >
> > > > > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > > > > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > > > > > > in insn count).
> > > > > > > > >
> > > > > > > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > > > > > > consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > > > > > > each iterating only 3 times).
> > > > > > > > >
> > > > > > > > > > Perf profiles for
> > > > > > > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > >
> > > > > > > > > > 3196866 |1f04: ldur d1, [x1, #-248]
> > > > > > > > > > 216348301800│ add w0, w0, #0x1
> > > > > > > > > > 985098 | add x2, x2, #0x18
> > > > > > > > > > 216215999206│ add x1, x1, #0x48
> > > > > > > > > > 215630376504│ fmul d1, d5, d1
> > > > > > > > > > 863829148015│ fmul d1, d1, d6
> > > > > > > > > > 864228353526│ fmul d0, d1, d0
> > > > > > > > > > 864568163014│ fmadd d2, d0, d16, d2
> > > > > > > > > > │ cmp w0, #0x4
> > > > > > > > > > 216125427594│ ↓ b.eq 1f34
> > > > > > > > > > 15010377│ ldur d0, [x2, #-8]
> > > > > > > > > > 143753737468│ ↑ b 1f04
> > > > > > > > > >
> > > > > > > > > > -O2 with inlined orthonl:
> > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > >
> > > > > > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > > > > > > > 144055883055│ add w0, w0, #0x1
> > > > > > > > > > 72262104254│ add x2, x2, #0x18
> > > > > > > > > > 143991169721│ add x1, x1, #0x48
> > > > > > > > > > 288648917780│ fmul d15, d17, d15
> > > > > > > > > > 864665644756│ fmul d15, d15, d18
> > > > > > > > > > 863868426387│ fmul d14, d15, d14
> > > > > > > > > > 865228159813│ fmadd d16, d14, d31, d16
> > > > > > > > > > 245967│ cmp w0, #0x4
> > > > > > > > > > 215396760545│ ↓ b.eq 1f28
> > > > > > > > > > 704732365│ ldur d14, [x2, #-8]
> > > > > > > > > > 143775979620│ ↑ b 1ef8
> > > > > > > > >
> > > > > > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > > > > > >
> > > > > > > > > High count on the initial ldur instruction could be explained if the loop
> > > > > > > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > > > > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > > > > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > > > > > > assembly.
> > > > > > > > >
> > > > > > > > > It may also be possible to check if code alignment matters, by compiling with
> > > > > > > > > -falign-loops=32.
> > > > > > > > Hi,
> > > > > > > > Thanks a lot for the detailed feedback, and I am sorry for late response.
> > > > > > > >
> > > > > > > > The hoisting region is:
> > > > > > > > if(mattyp.eq.1) then
> > > > > > > > 4 loops
> > > > > > > > elseif(mattyp.eq.2) then
> > > > > > > > {
> > > > > > > > orthonl inlined into basic block;
> > > > > > > > loads w[0] .. w[8]
> > > > > > > > }
> > > > > > > > else
> > > > > > > > 6 loops // load anisox
> > > > > > > >
> > > > > > > > followed by basic block:
> > > > > > > >
> > > > > > > > senergy=
> > > > > > > > & (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > > > > > & +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > > > > > & +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > > > > > s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > > > > > s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > > > > > s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > > > > >
> > > > > > > > Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block,
> > > > > > > > right in block 181, which is:
> > > > > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > > > > >
> > > > > > > > which is then further hoisted to block 173:
> > > > > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > > > > >
> > > > > > > > From block 181, we have two paths towards senergy block (bb 194):
> > > > > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > > > > AND
> > > > > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > > > > > which has a path length of around 18 blocks.
> > > > > > > > (bb 194 post-dominates bb 181 and bb 173).
> > > > > > > >
> > > > > > > > Disabling only load hoisting within blocks 173 and 181
> > > > > > > > (simply avoid inserting pre_expr if pre_expr->kind == REFERENCE),
> > > > > > > > avoid hoisting of 'w' array and brings back most of performance. Which
> > > > > > > > verifies that it is hoisting of the
> > > > > > > > 'w' array (w[0] ... w[8]), which is causing the slowdown ?
> > > > > > > >
> > > > > > > > I obtained perf profiles for full hoisting, and disabled hoisting of
> > > > > > > > 'w' array for the 6 loops, and the most drastic difference was
> > > > > > > > for ldur instruction:
> > > > > > > >
> > > > > > > > With full hoisting:
> > > > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > > > > >
> > > > > > > > Without full hoisting:
> > > > > > > > 3441224 │1edc: ldur d1, [x1, #-248]
> > > > > > > >
> > > > > > > > (The loop entry seems to be fall thru in both cases. I have attached
> > > > > > > > profiles for both cases).
> > > > > > > >
> > > > > > > > IIUC, the instruction seems to be loading the first element from anisox array,
> > > > > > > > which makes me wonder if the issue was with data-cache miss for slower version.
> > > > > > > > I ran perf script on perf data for L1-dcache-load-misses with period = 1million,
> > > > > > > > and it reported two cache misses on the ldur instruction in full hoisting case,
> > > > > > > > while it reported zero for the disabled load hoisting case.
> > > > > > > > So I wonder if the slowdown happens because hoisting of 'w' array
> > > > > > > > possibly results
> > > > > > > > in eviction of anisox thus causing a cache miss inside the inner loop
> > > > > > > > and making load slower ?
> > > > > > > >
> > > > > > > > Hoisting also seems to improve the number of overall cache misses tho.
> > > > > > > > For disabled hoisting of 'w' array case, there were a total of 463
> > > > > > > > cache misses, while with full hoisting there were 357 cache misses
> > > > > > > > (with period = 1 million).
> > > > > > > > Does that happen because hoisting probably reduces cache misses along
> > > > > > > > the orthonl path (bb 173 - > bb 181 -> bb 182 -> bb 194) ?
> > > > > > > Hi,
> > > > > > > In general I feel for this or PR80155 case, the issues come with long
> > > > > > > range hoistings, inside a large CFG, since we don't have an accurate
> > > > > > > way to model target resources (register pressure in PR80155 case / or
> > > > > > > possibly cache pressure in this case?) at tree level and we end up
> > > > > > > with register spill or cache miss inside loops, which may offset the
> > > > > > > benefit of hoisting. As previously discussed the right way to go is a
> > > > > > > live range splitting pass, at GIMPLE -> RTL border which can also help
> > > > > > > with other code-movement optimizations (or if the source had variables
> > > > > > > with long live ranges).
> > > > > > >
> > > > > > > I was wondering tho as a cheap workaround, would it make sense to
> > > > > > > check if we are hoisting across a "large" region of nested loops, and
> > > > > > > avoid in that case since hoisting may exert resource pressure inside
> > > > > > > loop region ? (Especially, in the cases where hoisted expressions were
> > > > > > > not originally AVAIL in any of the loop blocks, and the loop region
> > > > > > > doesn't benefit from hoisting).
> > > > > > >
> > > > > > > For instance:
> > > > > > > FOR_EACH_EDGE (e, ei, block)
> > > > > > > {
> > > > > > > /* Avoid hoisting across more than 3 nested loops */
> > > > > > > if (e->dest is a loop pre-header or loop header
> > > > > > > && nesting depth of loop is > 3)
> > > > > > > return false;
> > > > > > > }
> > > > > > >
> > > > > > > I think this would work for resolving the calculix issue because it
> > > > > > > hoists across one region of 6 loops and another of 4 loops (didn' test
> > > > > > > yet).
> > > > > > > It's not bulletproof in that it will miss detecting cases where loop
> > > > > > > header (or pre-header) isn't a successor of candidate block (checking
> > > > > > > for
> > > > > > > that might get expensive tho?). I will test it on gcc suite and SPEC
> > > > > > > for any regressions.
> > > > > > > Does this sound like a reasonable heuristic ?
> > > > > > Hi,
> > > > > > The attached patch implements the above heuristic.
> > > > > > Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> > > > > > And it brings back most of performance for calculix on par with O2
> > > > > > (without inlining orthonl).
> > > > > > I verified that with patch there is no cache miss happening on load
> > > > > > insn inside loop
> > > > > > (with perf report -e L1-dcache-load-misses/period=1000000/)
> > > > > >
> > > > > > I am in the process of benchmarking the patch on aarch64 for SPEC for
> > > > > > speed and will report numbers
> > > > > > in couple of days. (If required, we could parametrize number of nested
> > > > > > loops, hardcoded (arbitrarily to) 3 in this patch,
> > > > > > and set it in backend to not affect other targets).
> > > > >
> > > > > I don't think this patch captures the case in a sensible way - it will simply
> > > > > never hoist computations out of loop header blocks with depth > 3 which
> > > > > is certainly not what you want. Also the pre-header check is odd - we're
> > > > > looking for computations in successors of BLOCK but clearly a computation
> > > > > in a pre-header is not at the same loop level as one in the header itself.
> > > > Well, my intent was to check if we are hoisting across a region,
> > > > which has multiple nested loops, and in that case, avoid hoisting expressions
> > > > that do not belong to any loop blocks, because that may increase
> > > > resource pressure
> > > > inside loops. For instance, in the calculix issue we hoist 'w' array
> > > > from post-dom and neither
> > > > loop region has any uses of 'w'. I agree checking just for loop level
> > > > is too coarse.
> > > > The check with pre-header was essentially the same to see if we are
> > > > hoisting across a loop region,
> > > > not necessarily from within the loops.
> > >
> > > But it will fail to hoist *p in
> > >
> > > if (x)
> > > {
> > > a = *p;
> > > }
> > > else
> > > {
> > > b = *p;
> > > }
> > >
> > > <make distance large>
> > > pdom:
> > > c = *p;
> > >
> > > so it isn't what matters either. What happens at the immediate post-dominator
> > > isn't directly relevant - what matters would be if the pdom is the one making
> > > the value antic on one of the outgoing edges. In that case we're also going
> > > to PRE *p into the arm not computing *p (albeit in a later position). But
> > > that property is impossible to compute from the sets itself (not even mentioning
> > > the arbitrary CFG that can be inbetween the block and its pdom or the weird
> > > pdoms we compute for regions not having a path to exit, like infinite loops).
> > >
> > > You could eventually look at the pdom predecessors and if *p is not AVAIL_OUT
> > > in each of them we _might_ have the situation you want to protect against.
> > > But as said PRE insertion will likely have made sure it _is_ AVAIL_OUT in each
> > > of them ...
> > Hi Richard,
> > Thanks for the suggestions. Right, the issue seems to be here that
> > post-dom block is making expressions ANTIC. Before doing insert, could
> > we simply copy AVAIL_OUT sets of each block into another set say ORIG_AVAIL_OUT,
> > as a guard against PRE eventually inserting expressions in pred blocks
> > of pdom and making them available?
> > And during hoisting, we could check if each expr that is ANTIC_IN
> > (pdom) is ORIG_AVAIL_OUT in each pred of pdom,
> > if the distance is "large".
>
> Did you try if it works w/o copying AVAIL_OUT? Because AVAIL_OUT is
> very large (it's actually quadratic in size of the CFG * # values), in
> particular
> we're inserting in RPO and update AVAIL_OUT only up to the current block
> (from dominators) so the PDOM block should have the original AVAIL_OUT
> (but from the last iteration - we do iterate insertion).
>
> Note I'm still not happy with pulling off this kind of heuristics.
> What the suggestion
> means is that for
>
> if (x)
> y = *p;
> z = *p;
>
> we'll end up with
>
> if (x)
> y = *p;
> else
> z = *p;
>
> instead of
>
> tem = *p;
> if (x)
> y = tem;
> else
> z = tem;
>
> that is, we get the runtime benefit but not the code-size one
> (hoisting mostly helps code size plus allows if-conversion as followup
> in some cases). Now, if we iterate (like if we'd have a second hoisting pass)
> then the above would still cause hoisting - so the heuristic isn't sustainable.
> Again, sth like "distance" isn't really available.
Hi Richard,
It doesn't work without copying AVAIL_OUT.
I tried a small example with the attached patch:
int foo(int cond, int x, int y)
{
  int t;
  void f(int);
  if (cond)
    t = x + y;
  else
    t = x - y;
  f (t);
  int t2 = (x + y) * 10;
  return t2;
}
By intersecting availout_in_some with AVAIL_OUT of preds of pdom,
it does not hoist in the first pass, but PRE then inserts x + y in the
"else" block, and we eventually hoist it before if (cond). The same
happens for the e_c3d hoistings in calculix.
IIUC, we want hoisting to be:
(ANTIC_IN (block) intersect AVAIL_OUT (preds of pdom)) - AVAIL_OUT (block)
to ensure that hoisted expressions are along all paths from block to post-dom ?
If copying AVAIL_OUT sets is too large, could we keep another set that
precomputes intersection of AVAIL_OUT of pdom preds
for each block and then use this info during hoisting ?
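As a minimal sketch of the set expression above (not GCC code), the
following models value sets as plain bitmasks and applies it to the foo
example. The names value_set, VAL_X_PLUS_Y, VAL_X_MINUS_Y,
avail_out_in_all and hoist_candidates are made up for illustration, and
"AVAIL_OUT (preds of pdom)" is assumed to mean the intersection over all
predecessors, as the bitmap_and_into loop in the patch computes it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical value numbers for the two expressions in foo().  */
enum { VAL_X_PLUS_Y = 1u << 0, VAL_X_MINUS_Y = 1u << 1 };

/* A value set modelled as a bitmask, one bit per value number.  */
typedef uint32_t value_set;

/* Intersection of AVAIL_OUT over all predecessors of the post-dominator.
   Returns the empty set when there are no predecessors.  */
static value_set
avail_out_in_all (const value_set *pred_avail_out, size_t n_preds)
{
  assert (pred_avail_out != NULL || n_preds == 0);
  if (n_preds == 0)
    return 0;
  value_set all = ~(value_set) 0;
  for (size_t i = 0; i < n_preds; i++)
    all &= pred_avail_out[i];
  return all;
}

/* (ANTIC_IN (block) intersect AVAIL_OUT (preds of pdom)) - AVAIL_OUT (block).  */
static value_set
hoist_candidates (value_set antic_in_block, value_set avail_out_block,
                  const value_set *pdom_pred_avail_out, size_t n_preds)
{
  value_set on_all_paths = avail_out_in_all (pdom_pred_avail_out, n_preds);
  return (antic_in_block & on_all_paths) & ~avail_out_block;
}
```

With predecessor sets {x + y} and {x - y} the result is empty, so nothing
is hoisted. Once x + y has also been added to the "else" predecessor, the
result becomes {x + y}, which is the second-iteration behaviour described
above.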
For computing "distance", I implemented a simple DFS walk from block
to post-dom, that gives up if the depth crosses a threshold before
reaching post-dom. I am not sure tho, how expensive that can get.
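For reference, the same kind of bounded walk can be written against a toy
CFG outside of GCC. This is only a sketch: struct toy_cfg, toy_cfg_init,
toy_cfg_add_edge and within_dist are hypothetical names, blocks are small
integers, and each block is assumed to have at most two successors. It
checks the distance limit before the post-dominator test, in the same
order as pdom_within_dist_p_1 in the attached patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_BLOCKS 16
#define MAX_SUCCS 2

/* Toy CFG: succs[b][i] is the i-th successor of block b, or -1.  */
struct toy_cfg
{
  int n_blocks;
  int succs[MAX_BLOCKS][MAX_SUCCS];
};

static void
toy_cfg_init (struct toy_cfg *cfg, int n_blocks)
{
  assert (n_blocks > 0 && n_blocks <= MAX_BLOCKS);
  cfg->n_blocks = n_blocks;
  for (int b = 0; b < MAX_BLOCKS; b++)
    for (int i = 0; i < MAX_SUCCS; i++)
      cfg->succs[b][i] = -1;
}

static void
toy_cfg_add_edge (struct toy_cfg *cfg, int src, int dest)
{
  assert (src >= 0 && src < cfg->n_blocks);
  assert (dest >= 0 && dest < cfg->n_blocks);
  for (int i = 0; i < MAX_SUCCS; i++)
    if (cfg->succs[src][i] < 0)
      {
        cfg->succs[src][i] = dest;
        return;
      }
  assert (!"too many successors");
}

/* Depth-first walk from BLOCK; fails as soon as some path reaches depth
   LIMIT without having arrived at PDOM.  */
static bool
within_dist_1 (const struct toy_cfg *cfg, int block, int pdom,
               bool *visited, unsigned limit, unsigned dist)
{
  if (dist >= limit)
    return false;
  if (block == pdom)
    return true;
  visited[block] = true;
  for (int i = 0; i < MAX_SUCCS; i++)
    {
      int dest = cfg->succs[block][i];
      if (dest >= 0 && !visited[dest]
          && !within_dist_1 (cfg, dest, pdom, visited, limit, dist + 1))
        return false;
    }
  return true;
}

static bool
within_dist (const struct toy_cfg *cfg, int block, int pdom, unsigned limit)
{
  bool visited[MAX_BLOCKS];
  memset (visited, 0, sizeof visited);
  return within_dist_1 (cfg, block, pdom, visited, limit, 0);
}
```

Each call marks a block at most once, so a single query is linear in the
size of the CFG; issuing one query per candidate block multiplies that by
the number of blocks.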
Thanks,
Prathamesh
>
> Richard.
>
> > Thanks,
> > Prathamesh
> >
> >
> > >
> > > > >
> > > > > Note the difficulty to capture "distance" is that the distance is simply not
> > > > > available at this point - it is the anticipated values from the successors
> > > > > that do _not_ compute the value itself that are the issue. To capture
> > > > > "distance" you'd need to somehow "age" anticipated value when
> > > > > propagating them upwards during compute_antic (where it then
> > > > > would also apply to PRE in general).
> > > > Yes, indeed. As a hack, would it make sense to avoid inserting an
> > > > expression in the block,
> > > > if it's ANTIC in post-dom block as a trade-off between extending live
> > > > range and hoisting
> > > > if the "distance" between block and post-dom is "too far" ? In
> > > > general, as you point out, we'd need to compute,
> > > > distance info for successors block during compute_antic, but special
> > > > casing for post-dom should be easy enough
> > > > during do_hoist_insertion, and hoisting an expr that is ANTIC in
> > > > post-dom could be potentially "long range", if the region is large.
> > > > It's still a coarse heuristic tho. I tried it in the attached patch.
> > > >
> > > > Thanks,
> > > > Prathamesh
> > > >
> > > >
> > > > >
> > > > > As with all other heuristics the only place one could do hackish attempts
> > > > > with at least some reasoning is the elimination phase where
> > > > > we make use of the (hoist) insertions - of course for hoisting we already
> > > > > know we'll get the "close" use in one of the successors so I fear even
> > > > > there it will be impossible to do something sensible.
> > > > >
> > > > > Richard.
> > > > >
> > > > > > Thanks,
> > > > > > Prathamesh
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Prathamesh
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Prathamesh
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Prathamesh
> > > > > > > > >
> > > > > > > > > Alexander
[-- Attachment #2: gnu-659-pdom-4.diff --]
[-- Type: application/octet-stream, Size: 2584 bytes --]
diff --git a/gcc/tree-ssa-pre.c b/gcc/tree-ssa-pre.c
index 0c1654f3580..8cee5707e7d 100644
--- a/gcc/tree-ssa-pre.c
+++ b/gcc/tree-ssa-pre.c
@@ -3477,6 +3477,43 @@ do_pre_partial_partial_insertion (basic_block block, basic_block dom)
return new_stuff;
}
+/* Return true if PDOM_BB is within DIST_LIMIT of BLOCK,
+ where "distance" is measured in terms of number of basic blocks. */
+
+static bool
+pdom_within_dist_p_1 (basic_block block, basic_block pdom_bb,
+ bool *visited, unsigned dist_limit,
+ unsigned dist_from_block)
+{
+ if (dist_from_block >= dist_limit)
+ return false;
+
+ if (block == pdom_bb)
+ return true;
+
+ edge e;
+ edge_iterator ei;
+ visited[block->index] = true;
+
+ FOR_EACH_EDGE (e, ei, block->succs)
+ if (!visited[e->dest->index]
+ && !pdom_within_dist_p_1 (e->dest, pdom_bb, visited,
+ dist_limit, dist_from_block + 1))
+ return false;
+ return true;
+}
+
+static bool
+pdom_within_dist_p (basic_block bb, basic_block pdom_bb, unsigned dist)
+{
+ unsigned n_bbs = last_basic_block_for_fn (cfun);
+ bool *visited = new bool[n_bbs];
+ memset (visited, false, n_bbs);
+ bool ret = pdom_within_dist_p_1 (bb, pdom_bb, visited, dist, 0);
+ delete[] visited;
+ return ret;
+}
+
/* Insert expressions in BLOCK to compute hoistable values up.
Return TRUE if something was inserted, otherwise return FALSE.
The caller has to make sure that BLOCK has at least two successors. */
@@ -3537,6 +3574,14 @@ do_hoist_insertion (basic_block block)
&AVAIL_OUT (e->dest)->values);
bitmap_clear (&hoistable_set.values);
+ /* Intersect with AVAIL_OUT of preds of post-dom, to check that
+ hoisted exprs are along all paths from block to pdom. */
+
+ basic_block pdom_bb = get_immediate_dominator (CDI_POST_DOMINATORS, block);
+ if (!pdom_within_dist_p (block, pdom_bb, 0))
+ FOR_EACH_EDGE (e, ei, pdom_bb->preds)
+ bitmap_and_into (&availout_in_some, &AVAIL_OUT (e->src)->values);
+
/* Short-cut for a common case: availout_in_some is empty. */
if (bitmap_empty_p (&availout_in_some))
return false;
@@ -4099,6 +4144,7 @@ init_pre (void)
alloc_aux_for_blocks (sizeof (struct bb_bitmap_sets));
calculate_dominance_info (CDI_DOMINATORS);
+ calculate_dominance_info (CDI_POST_DOMINATORS);
bitmap_obstack_initialize (&grand_bitmap_obstack);
phi_translate_table = new hash_table<expr_pred_trans_d> (5110);
@@ -4131,6 +4177,7 @@ fini_pre ()
name_to_id.release ();
free_aux_for_blocks ();
+ free_dominance_info (CDI_POST_DOMINATORS);
}
namespace {
* Re: LTO slows down calculix by more than 10% on aarch64
2020-09-23 10:10 ` Prathamesh Kulkarni
@ 2020-09-23 11:10 ` Richard Biener
2020-09-24 10:36 ` Prathamesh Kulkarni
0 siblings, 1 reply; 25+ messages in thread
From: Richard Biener @ 2020-09-23 11:10 UTC (permalink / raw)
To: Prathamesh Kulkarni; +Cc: Alexander Monakov, GCC Development
On Wed, Sep 23, 2020 at 12:11 PM Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Wed, 23 Sep 2020 at 13:22, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> > On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> > >
> > > On Tue, 22 Sep 2020 at 16:36, Richard Biener <richard.guenther@gmail.com> wrote:
> > > >
> > > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > >
> > > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > >
> > > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > >
> > > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > >
> > > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > >
> > > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > > > > > > > >
> > > > > > > > > > > I obtained perf stat results for following benchmark runs:
> > > > > > > > > > >
> > > > > > > > > > > -O2:
> > > > > > > > > > >
> > > > > > > > > > > 7856832.692380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > 3758 context-switches # 0.000 K/sec
> > > > > > > > > > > 40 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > 40847 page-faults # 0.005 K/sec
> > > > > > > > > > > 7856782413676 cycles # 1.000 GHz
> > > > > > > > > > > 6034510093417 instructions # 0.77 insn per cycle
> > > > > > > > > > > 363937274287 branches # 46.321 M/sec
> > > > > > > > > > > 48557110132 branch-misses # 13.34% of all branches
> > > > > > > > > >
> > > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > > > > > > > enough for this kind of code)
> > > > > > > > > >
> > > > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > > > >
> > > > > > > > > > > 8319643.114380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > 4285 context-switches # 0.001 K/sec
> > > > > > > > > > > 28 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > 40843 page-faults # 0.005 K/sec
> > > > > > > > > > > 8319591038295 cycles # 1.000 GHz
> > > > > > > > > > > 6276338800377 instructions # 0.75 insn per cycle
> > > > > > > > > > > 467400726106 branches # 56.180 M/sec
> > > > > > > > > > > 45986364011 branch-misses # 9.84% of all branches
> > > > > > > > > >
> > > > > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > > > > > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > > > > > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > > > > > > >
> > > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > > > > >
> > > > > > > > > > > 8207331.088040 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > 2266 context-switches # 0.000 K/sec
> > > > > > > > > > > 32 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > > > > > > 8207292032467 cycles # 1.000 GHz
> > > > > > > > > > > 6035724436440 instructions # 0.74 insn per cycle
> > > > > > > > > > > 364415440156 branches # 44.401 M/sec
> > > > > > > > > > > 53138327276 branch-misses # 14.58% of all branches
> > > > > > > > > >
> > > > > > > > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > > > > > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > > > > > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > > > > > > > not very complicated (though Fortran array addressing...).
> > > > > > > > > >
> > > > > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > > > > >
> > > > > > > > > > > 7797265.206850 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > 3139 context-switches # 0.000 K/sec
> > > > > > > > > > > 20 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > > > > > > 7797221351467 cycles # 1.000 GHz
> > > > > > > > > > > 6187348757324 instructions # 0.79 insn per cycle
> > > > > > > > > > > 461840800061 branches # 59.231 M/sec
> > > > > > > > > > > 26920311761 branch-misses # 5.83% of all branches
> > > > > > > > > >
> > > > > > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > > > > > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > > > > > > > in insn count).
> > > > > > > > > >
> > > > > > > > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > > > > > > > consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > > > > > > > each iterating only 3 times).
> > > > > > > > > >
> > > > > > > > > > > Perf profiles for
> > > > > > > > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > >
> > > > > > > > > > > 3196866 |1f04: ldur d1, [x1, #-248]
> > > > > > > > > > > 216348301800│ add w0, w0, #0x1
> > > > > > > > > > > 985098 | add x2, x2, #0x18
> > > > > > > > > > > 216215999206│ add x1, x1, #0x48
> > > > > > > > > > > 215630376504│ fmul d1, d5, d1
> > > > > > > > > > > 863829148015│ fmul d1, d1, d6
> > > > > > > > > > > 864228353526│ fmul d0, d1, d0
> > > > > > > > > > > 864568163014│ fmadd d2, d0, d16, d2
> > > > > > > > > > > │ cmp w0, #0x4
> > > > > > > > > > > 216125427594│ ↓ b.eq 1f34
> > > > > > > > > > > 15010377│ ldur d0, [x2, #-8]
> > > > > > > > > > > 143753737468│ ↑ b 1f04
> > > > > > > > > > >
> > > > > > > > > > > -O2 with inlined orthonl:
> > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > >
> > > > > > > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > > > > > > > > 144055883055│ add w0, w0, #0x1
> > > > > > > > > > > 72262104254│ add x2, x2, #0x18
> > > > > > > > > > > 143991169721│ add x1, x1, #0x48
> > > > > > > > > > > 288648917780│ fmul d15, d17, d15
> > > > > > > > > > > 864665644756│ fmul d15, d15, d18
> > > > > > > > > > > 863868426387│ fmul d14, d15, d14
> > > > > > > > > > > 865228159813│ fmadd d16, d14, d31, d16
> > > > > > > > > > > 245967│ cmp w0, #0x4
> > > > > > > > > > > 215396760545│ ↓ b.eq 1f28
> > > > > > > > > > > 704732365│ ldur d14, [x2, #-8]
> > > > > > > > > > > 143775979620│ ↑ b 1ef8
> > > > > > > > > >
> > > > > > > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > > > > > > >
> > > > > > > > > > High count on the initial ldur instruction could be explained if the loop
> > > > > > > > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > > > > > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > > > > > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > > > > > > > assembly.
> > > > > > > > > >
> > > > > > > > > > It may also be possible to check if code alignment matters, by compiling with
> > > > > > > > > > -falign-loops=32.
> > > > > > > > > Hi,
> > > > > > > > > Thanks a lot for the detailed feedback, and I am sorry for late response.
> > > > > > > > >
> > > > > > > > > The hoisting region is:
> > > > > > > > > if(mattyp.eq.1) then
> > > > > > > > > 4 loops
> > > > > > > > > elseif(mattyp.eq.2) then
> > > > > > > > > {
> > > > > > > > > orthonl inlined into basic block;
> > > > > > > > > loads w[0] .. w[8]
> > > > > > > > > }
> > > > > > > > > else
> > > > > > > > > 6 loops // load anisox
> > > > > > > > >
> > > > > > > > > followed by basic block:
> > > > > > > > >
> > > > > > > > > senergy=
> > > > > > > > > & (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > > > > > > & +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > > > > > > & +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > > > > > > s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > > > > > > s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > > > > > > s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > > > > > >
> > > > > > > > > Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block,
> > > > > > > > > right in block 181, which is:
> > > > > > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > > > > > >
> > > > > > > > > which is then further hoisted to block 173:
> > > > > > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > > > > > >
> > > > > > > > > From block 181, we have two paths towards senergy block (bb 194):
> > > > > > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > > > > > AND
> > > > > > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > > > > > > which has a path length of around 18 blocks.
> > > > > > > > > (bb 194 post-dominates bb 181 and bb 173).
> > > > > > > > >
> > > > > > > > > Disabling only load hoisting within blocks 173 and 181
> > > > > > > > > (simply avoid inserting pre_expr if pre_expr->kind == REFERENCE),
> > > > > > > > > avoid hoisting of 'w' array and brings back most of performance. Which
> > > > > > > > > verifies that it is hoisting of the
> > > > > > > > > 'w' array (w[0] ... w[8]), which is causing the slowdown ?
> > > > > > > > >
> > > > > > > > > I obtained perf profiles for full hoisting, and disabled hoisting of
> > > > > > > > > 'w' array for the 6 loops, and the most drastic difference was
> > > > > > > > > for ldur instruction:
> > > > > > > > >
> > > > > > > > > With full hoisting:
> > > > > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > > > > > >
> > > > > > > > > Without full hoisting:
> > > > > > > > > 3441224 │1edc: ldur d1, [x1, #-248]
> > > > > > > > >
> > > > > > > > > (The loop entry seems to be fall thru in both cases. I have attached
> > > > > > > > > profiles for both cases).
> > > > > > > > >
> > > > > > > > > IIUC, the instruction seems to be loading the first element from anisox array,
> > > > > > > > > which makes me wonder if the issue was with data-cache miss for slower version.
> > > > > > > > > I ran perf script on perf data for L1-dcache-load-misses with period = 1million,
> > > > > > > > > and it reported two cache misses on the ldur instruction in full hoisting case,
> > > > > > > > > while it reported zero for the disabled load hoisting case.
> > > > > > > > > So I wonder if the slowdown happens because hoisting of 'w' array
> > > > > > > > > possibly results
> > > > > > > > > in eviction of anisox thus causing a cache miss inside the inner loop
> > > > > > > > > and making load slower ?
> > > > > > > > >
> > > > > > > > > Hoisting also seems to improve the number of overall cache misses tho.
> > > > > > > > > For disabled hoisting of 'w' array case, there were a total of 463
> > > > > > > > > cache misses, while with full hoisting there were 357 cache misses
> > > > > > > > > (with period = 1 million).
> > > > > > > > > Does that happen because hoisting probably reduces cache misses along
> > > > > > > > > the orthonl path (bb 173 - > bb 181 -> bb 182 -> bb 194) ?
> > > > > > > > Hi,
> > > > > > > > In general I feel for this or PR80155 case, the issues come with long
> > > > > > > > range hoistings, inside a large CFG, since we don't have an accurate
> > > > > > > > way to model target resources (register pressure in PR80155 case / or
> > > > > > > > possibly cache pressure in this case?) at tree level and we end up
> > > > > > > > with register spill or cache miss inside loops, which may offset the
> > > > > > > > benefit of hoisting. As previously discussed the right way to go is a
> > > > > > > > live range splitting pass, at GIMPLE -> RTL border which can also help
> > > > > > > > with other code-movement optimizations (or if the source had variables
> > > > > > > > with long live ranges).
> > > > > > > >
> > > > > > > > I was wondering tho as a cheap workaround, would it make sense to
> > > > > > > > check if we are hoisting across a "large" region of nested loops, and
> > > > > > > > avoid in that case since hoisting may exert resource pressure inside
> > > > > > > > loop region ? (Especially, in the cases where hoisted expressions were
> > > > > > > > not originally AVAIL in any of the loop blocks, and the loop region
> > > > > > > > doesn't benefit from hoisting).
> > > > > > > >
> > > > > > > > For instance:
> > > > > > > > FOR_EACH_EDGE (e, ei, block)
> > > > > > > > {
> > > > > > > > /* Avoid hoisting across more than 3 nested loops */
> > > > > > > > if (e->dest is a loop pre-header or loop header
> > > > > > > > && nesting depth of loop is > 3)
> > > > > > > > return false;
> > > > > > > > }
> > > > > > > >
> > > > > > > > I think this would work for resolving the calculix issue because it
> > > > > > > > hoists across one region of 6 loops and another of 4 loops (didn' test
> > > > > > > > yet).
> > > > > > > > It's not bulletproof in that it will miss detecting cases where loop
> > > > > > > > header (or pre-header) isn't a successor of candidate block (checking
> > > > > > > > for
> > > > > > > > that might get expensive tho?). I will test it on gcc suite and SPEC
> > > > > > > > for any regressions.
> > > > > > > > Does this sound like a reasonable heuristic ?
> > > > > > > Hi,
> > > > > > > The attached patch implements the above heuristic.
> > > > > > > Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> > > > > > > And it brings back most of performance for calculix on par with O2
> > > > > > > (without inlining orthonl).
> > > > > > > I verified that with patch there is no cache miss happening on load
> > > > > > > insn inside loop
> > > > > > > (with perf report -e L1-dcache-load-misses/period=1000000/)
> > > > > > >
> > > > > > > I am in the process of benchmarking the patch on aarch64 for SPEC for
> > > > > > > speed and will report numbers
> > > > > > > in couple of days. (If required, we could parametrize number of nested
> > > > > > > loops, hardcoded (arbitrarily to) 3 in this patch,
> > > > > > > and set it in backend to not affect other targets).
> > > > > >
> > > > > > I don't think this patch captures the case in a sensible way - it will simply
> > > > > > never hoist computations out of loop header blocks with depth > 3 which
> > > > > > is certainly not what you want. Also the pre-header check is odd - we're
> > > > > > looking for computations in successors of BLOCK but clearly a computation
> > > > > > in a pre-header is not at the same loop level as one in the header itself.
> > > > > Well, my intent was to check if we are hoisting across a region,
> > > > > which has multiple nested loops, and in that case, avoid hoisting expressions
> > > > > that do not belong to any loop blocks, because that may increase
> > > > > resource pressure
> > > > > inside loops. For instance, in the calculix issue we hoist 'w' array
> > > > > from post-dom and neither
> > > > > loop region has any uses of 'w'. I agree checking just for loop level
> > > > > is too coarse.
> > > > > The check with pre-header was essentially the same to see if we are
> > > > > hoisting across a loop region,
> > > > > not necessarily from within the loops.
> > > >
> > > > But it will fail to hoist *p in
> > > >
> > > > if (x)
> > > > {
> > > > a = *p;
> > > > }
> > > > else
> > > > {
> > > > b = *p;
> > > > }
> > > >
> > > > <make distance large>
> > > > pdom:
> > > > c = *p;
> > > >
> > > > so it isn't what matters either. What happens at the immediate post-dominator
> > > > isn't directly relevant - what matters would be if the pdom is the one making
> > > > the value antic on one of the outgoing edges. In that case we're also going
> > > > to PRE *p into the arm not computing *p (albeit in a later position). But
> > > > that property is impossible to compute from the sets itself (not even mentioning
> > > > the arbitrary CFG that can be inbetween the block and its pdom or the weird
> > > > pdoms we compute for regions not having a path to exit, like infinite loops).
> > > >
> > > > You could eventually look at the pdom predecessors and if *p is not AVAIL_OUT
> > > > in each of them we _might_ have the situation you want to protect against.
> > > > But as said PRE insertion will likely have made sure it _is_ AVAIL_OUT in each
> > > > of them ...
> > > Hi Richard,
> > > Thanks for the suggestions. Right, the issue seems to be here that
> > > post-dom block is making expressions ANTIC. Before doing insert, could
> > > we simply copy AVAIL_OUT sets of each block into another set say ORIG_AVAIL_OUT,
> > > as a guard against PRE eventually inserting expressions in pred blocks
> > > of pdom and making them available?
> > > And during hoisting, we could check if each expr that is ANTIC_IN
> > > (pdom) is ORIG_AVAIL_OUT in each pred of pdom,
> > > if the distance is "large".
> >
> > Did you try if it works w/o copying AVAIL_OUT? Because AVAIL_OUT is
> > very large (it's actually quadratic in size of the CFG * # values), in
> > particular
> > we're inserting in RPO and update AVAIL_OUT only up to the current block
> > (from dominators) so the PDOM block should have the original AVAIL_OUT
> > (but from the last iteration - we do iterate insertion).
> >
> > Note I'm still not happy with pulling off this kind of heuristics.
> > What the suggestion
> > means is that for
> >
> > if (x)
> > y = *p;
> > z = *p;
> >
> > we'll end up with
> >
> > if (x)
> > y = *p;
> > else
> > z = *p;
> >
> > instead of
> >
> > tem = *p;
> > if (x)
> > y = tem;
> > else
> > z = tem;
> >
> > that is, we get the runtime benefit but not the code-size one
> > (hoisting mostly helps code size plus allows if-conversion as followup
> > in some cases). Now, if we iterate (like if we'd have a second hoisting pass)
> > then the above would still cause hoisting - so the heuristic isn't sustainable.
> > Again, sth like "distance" isn't really available.
> Hi Richard,
> It doesn't work without copying AVAIL_OUT.
> I tried for small example with attached patch:
>
> int foo(int cond, int x, int y)
> {
> int t;
> void f(int);
>
> if (cond)
> t = x + y;
> else
> t = x - y;
>
> f (t);
> int t2 = (x + y) * 10;
> return t2;
> }
>
> By intersecting availout_in_some with AVAIL_OUT of preds of pdom,
> it does not hoist in first pass, but then PRE inserts x + y in the "else block",
> and we eventually hoist before if (cond). Similarly for e_c3d
> hoistings in calculix.
>
> IIUC, we want hoisting to be:
> (ANTIC_IN (block) intersect AVAIL_OUT (preds of pdom)) - AVAIL_OUT (block)
> to ensure that hoisted expressions are along all paths from block to post-dom ?
> If copying AVAIL_OUT sets is too large, could we keep another set that
> precomputes intersection of AVAIL_OUT of pdom preds
> for each block and then use this info during hoisting ?
>
> For computing "distance", I implemented a simple DFS walk from block
> to post-dom, that gives up if depth crosses
> threshold before reaching post-dom. I am not sure tho, how expensive
> that can get.
As written it is quadratic in the CFG size.
You can optimize away the
+ FOR_EACH_EDGE (e, ei, pdom_bb->preds)
+ bitmap_and_into (&availout_in_some, &AVAIL_OUT (e->src)->values);
loop if the intersection of availout_in_some and ANTIC_IN (pdom) is empty.
As said, I don't think this is the way to go - trying to avoid code
hoisting isn't what we'd want to do - your quoted assembly instead points
to a loop with a non-empty latch which is usually caused by PRE and avoided
with -O3 because it also harms vectorization.
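To make "non-empty latch" concrete, here is a hand-written C analogue
(not compiler output, and the function names are made up for
illustration). The first function keeps both loads at the top of the
loop body. The second has the shape seen in the quoted assembly: one
load stays at the top, while the other is done once before the loop and
then again between the exit test and the back edge.

```c
#include <stddef.h>

/* Plain form: both operands are loaded at the top of the body, so the
   block holding the back edge (the latch) has no work of its own.  */
double dot_top_load(const double *a, const double *b, size_t n)
{
  double acc = 0.0;
  for (size_t i = 0; i < n; i++)
    acc += a[i] * b[i];
  return acc;
}

/* Rotated form, similar in shape to what PRE can produce: the first
   b[] load happens before the loop, and the load for the next
   iteration sits between the exit test and the back edge, i.e. in
   the latch.  */
double dot_latch_load(const double *a, const double *b, size_t n)
{
  double acc = 0.0;
  if (n == 0)
    return acc;

  double next_b = b[0];        /* initial load, outside the loop */
  size_t i = 0;
  for (;;)
    {
      acc += a[i] * next_b;
      i++;
      if (i == n)              /* exit test, like cmp + b.eq */
        break;
      next_b = b[i];           /* latch load, like "ldur d14, [x2, #-8]" */
    }
  return acc;
}
```

Both functions compute the same result; only the placement of the b[]
load differs. A compiler is of course free to turn either source form
into the other, so this only illustrates the CFG shape.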
Richard.
> Thanks,
> Prathamesh
> >
> > Richard.
> >
> > > Thanks,
> > > Prathamesh
> > >
> > >
> > > >
> > > > > >
> > > > > > Note the difficulty to capture "distance" is that the distance is simply not
> > > > > > available at this point - it is the anticipated values from the successors
> > > > > > that do _not_ compute the value itself that are the issue. To capture
> > > > > > "distance" you'd need to somehow "age" anticipated value when
> > > > > > propagating them upwards during compute_antic (where it then
> > > > > > would also apply to PRE in general).
> > > > > Yes, indeed. As a hack, would it make sense to avoid inserting an
> > > > > expression in the block,
> > > > > if it's ANTIC in post-dom block as a trade-off between extending live
> > > > > range and hoisting
> > > > > if the "distance" between block and post-dom is "too far" ? In
> > > > > general, as you point out, we'd need to compute,
> > > > > distance info for successors block during compute_antic, but special
> > > > > casing for post-dom should be easy enough
> > > > > during do_hoist_insertion, and hoisting an expr that is ANTIC in
> > > > > post-dom could be potentially "long range", if the region is large.
> > > > > It's still a coarse heuristic tho. I tried it in the attached patch.
> > > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > >
> > > > >
> > > > > >
> > > > > > As with all other heuristics the only place one could do hackish attempts
> > > > > > with at least some reasoning is the elimination phase where
> > > > > > we make use of the (hoist) insertions - of course for hoisting we already
> > > > > > know we'll get the "close" use in one of the successors so I fear even
> > > > > > there it will be impossible to do something sensible.
> > > > > >
> > > > > > Richard.
> > > > > >
> > > > > > > Thanks,
> > > > > > > Prathamesh
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Prathamesh
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Prathamesh
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Prathamesh
> > > > > > > > > >
> > > > > > > > > > Alexander
* Re: LTO slows down calculix by more than 10% on aarch64
2020-09-23 11:10 ` Richard Biener
@ 2020-09-24 10:36 ` Prathamesh Kulkarni
2020-09-24 11:14 ` Richard Biener
0 siblings, 1 reply; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-09-24 10:36 UTC (permalink / raw)
To: Richard Biener; +Cc: Alexander Monakov, GCC Development
[-- Attachment #1: Type: text/plain, Size: 26427 bytes --]
On Wed, 23 Sep 2020 at 16:40, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Wed, Sep 23, 2020 at 12:11 PM Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Wed, 23 Sep 2020 at 13:22, Richard Biener <richard.guenther@gmail.com> wrote:
> > >
> > > On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni
> > > <prathamesh.kulkarni@linaro.org> wrote:
> > > >
> > > > On Tue, 22 Sep 2020 at 16:36, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > >
> > > > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > >
> > > > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > > >
> > > > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > >
> > > > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > I obtained perf stat results for following benchmark runs:
> > > > > > > > > > > >
> > > > > > > > > > > > -O2:
> > > > > > > > > > > >
> > > > > > > > > > > > 7856832.692380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > > 3758 context-switches # 0.000 K/sec
> > > > > > > > > > > > 40 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > > 40847 page-faults # 0.005 K/sec
> > > > > > > > > > > > 7856782413676 cycles # 1.000 GHz
> > > > > > > > > > > > 6034510093417 instructions # 0.77 insn per cycle
> > > > > > > > > > > > 363937274287 branches # 46.321 M/sec
> > > > > > > > > > > > 48557110132 branch-misses # 13.34% of all branches
> > > > > > > > > > >
> > > > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > > > > > > > > enough for this kind of code)
> > > > > > > > > > >
> > > > > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > > > > >
> > > > > > > > > > > > 8319643.114380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > > 4285 context-switches # 0.001 K/sec
> > > > > > > > > > > > 28 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > > 40843 page-faults # 0.005 K/sec
> > > > > > > > > > > > 8319591038295 cycles # 1.000 GHz
> > > > > > > > > > > > 6276338800377 instructions # 0.75 insn per cycle
> > > > > > > > > > > > 467400726106 branches # 56.180 M/sec
> > > > > > > > > > > > 45986364011 branch-misses # 9.84% of all branches
> > > > > > > > > > >
> > > > > > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > > > > > > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > > > > > > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > > > > > > > >
> > > > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > > > > > >
> > > > > > > > > > > > 8207331.088040 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > > 2266 context-switches # 0.000 K/sec
> > > > > > > > > > > > 32 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > > > > > > > 8207292032467 cycles # 1.000 GHz
> > > > > > > > > > > > 6035724436440 instructions # 0.74 insn per cycle
> > > > > > > > > > > > 364415440156 branches # 44.401 M/sec
> > > > > > > > > > > > 53138327276 branch-misses # 14.58% of all branches
> > > > > > > > > > >
> > > > > > > > > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > > > > > > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > > > > > > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > > > > > > > > not very complicated (though Fortran array addressing...).
> > > > > > > > > > >
> > > > > > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > > > > > >
> > > > > > > > > > > > 7797265.206850 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > > 3139 context-switches # 0.000 K/sec
> > > > > > > > > > > > 20 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > > > > > > > 7797221351467 cycles # 1.000 GHz
> > > > > > > > > > > > 6187348757324 instructions # 0.79 insn per cycle
> > > > > > > > > > > > 461840800061 branches # 59.231 M/sec
> > > > > > > > > > > > 26920311761 branch-misses # 5.83% of all branches
> > > > > > > > > > >
> > > > > > > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > > > > > > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > > > > > > > > in insn count).
> > > > > > > > > > >
> > > > > > > > > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > > > > > > > > consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > > > > > > > > each iterating only 3 times).
> > > > > > > > > > >
> > > > > > > > > > > > Perf profiles for
> > > > > > > > > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > > >
> > > > > > > > > > > > 3196866 |1f04: ldur d1, [x1, #-248]
> > > > > > > > > > > > 216348301800│ add w0, w0, #0x1
> > > > > > > > > > > > 985098 | add x2, x2, #0x18
> > > > > > > > > > > > 216215999206│ add x1, x1, #0x48
> > > > > > > > > > > > 215630376504│ fmul d1, d5, d1
> > > > > > > > > > > > 863829148015│ fmul d1, d1, d6
> > > > > > > > > > > > 864228353526│ fmul d0, d1, d0
> > > > > > > > > > > > 864568163014│ fmadd d2, d0, d16, d2
> > > > > > > > > > > > │ cmp w0, #0x4
> > > > > > > > > > > > 216125427594│ ↓ b.eq 1f34
> > > > > > > > > > > > 15010377│ ldur d0, [x2, #-8]
> > > > > > > > > > > > 143753737468│ ↑ b 1f04
> > > > > > > > > > > >
> > > > > > > > > > > > -O2 with inlined orthonl:
> > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > > >
> > > > > > > > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > > > > > > > > > 144055883055│ add w0, w0, #0x1
> > > > > > > > > > > > 72262104254│ add x2, x2, #0x18
> > > > > > > > > > > > 143991169721│ add x1, x1, #0x48
> > > > > > > > > > > > 288648917780│ fmul d15, d17, d15
> > > > > > > > > > > > 864665644756│ fmul d15, d15, d18
> > > > > > > > > > > > 863868426387│ fmul d14, d15, d14
> > > > > > > > > > > > 865228159813│ fmadd d16, d14, d31, d16
> > > > > > > > > > > > 245967│ cmp w0, #0x4
> > > > > > > > > > > > 215396760545│ ↓ b.eq 1f28
> > > > > > > > > > > > 704732365│ ldur d14, [x2, #-8]
> > > > > > > > > > > > 143775979620│ ↑ b 1ef8
> > > > > > > > > > >
> > > > > > > > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > > > > > > > >
> > > > > > > > > > > High count on the initial ldur instruction could be explained if the loop
> > > > > > > > > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > > > > > > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > > > > > > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > > > > > > > > assembly.
> > > > > > > > > > >
> > > > > > > > > > > It may also be possible to check if code alignment matters, by compiling with
> > > > > > > > > > > -falign-loops=32.
> > > > > > > > > > Hi,
> > > > > > > > > > Thanks a lot for the detailed feedback, and I am sorry for late response.
> > > > > > > > > >
> > > > > > > > > > The hoisting region is:
> > > > > > > > > > if(mattyp.eq.1) then
> > > > > > > > > > 4 loops
> > > > > > > > > > elseif(mattyp.eq.2) then
> > > > > > > > > > {
> > > > > > > > > > orthonl inlined into basic block;
> > > > > > > > > > loads w[0] .. w[8]
> > > > > > > > > > }
> > > > > > > > > > else
> > > > > > > > > > 6 loops // load anisox
> > > > > > > > > >
> > > > > > > > > > followed by basic block:
> > > > > > > > > >
> > > > > > > > > > senergy=
> > > > > > > > > > & (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > > > > > > > & +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > > > > > > > & +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > > > > > > > s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > > > > > > > s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > > > > > > > s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > > > > > > >
> > > > > > > > > > Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block,
> > > > > > > > > > right in block 181, which is:
> > > > > > > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > > > > > > >
> > > > > > > > > > which is then further hoisted to block 173:
> > > > > > > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > > > > > > >
> > > > > > > > > > From block 181, we have two paths towards senergy block (bb 194):
> > > > > > > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > > > > > > AND
> > > > > > > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > > > > > > > which has a path length of around 18 blocks.
> > > > > > > > > > (bb 194 post-dominates bb 181 and bb 173).
> > > > > > > > > >
> > > > > > > > > > Disabling only load hoisting within blocks 173 and 181
> > > > > > > > > > (simply avoid inserting pre_expr if pre_expr->kind == REFERENCE),
> > > > > > > > > > avoid hoisting of 'w' array and brings back most of performance. Which
> > > > > > > > > > verifies that it is hoisting of the
> > > > > > > > > > 'w' array (w[0] ... w[8]), which is causing the slowdown ?
> > > > > > > > > >
> > > > > > > > > > I obtained perf profiles for full hoisting, and disabled hoisting of
> > > > > > > > > > 'w' array for the 6 loops, and the most drastic difference was
> > > > > > > > > > for ldur instruction:
> > > > > > > > > >
> > > > > > > > > > With full hoisting:
> > > > > > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > > > > > > >
> > > > > > > > > > Without full hoisting:
> > > > > > > > > > 3441224 │1edc: ldur d1, [x1, #-248]
> > > > > > > > > >
> > > > > > > > > > (The loop entry seems to be fall thru in both cases. I have attached
> > > > > > > > > > profiles for both cases).
> > > > > > > > > >
> > > > > > > > > > IIUC, the instruction seems to be loading the first element from anisox array,
> > > > > > > > > > which makes me wonder if the issue was with data-cache miss for slower version.
> > > > > > > > > > I ran perf script on perf data for L1-dcache-load-misses with period = 1million,
> > > > > > > > > > and it reported two cache misses on the ldur instruction in full hoisting case,
> > > > > > > > > > while it reported zero for the disabled load hoisting case.
> > > > > > > > > > So I wonder if the slowdown happens because hoisting of 'w' array
> > > > > > > > > > possibly results
> > > > > > > > > > in eviction of anisox thus causing a cache miss inside the inner loop
> > > > > > > > > > and making load slower ?
> > > > > > > > > >
> > > > > > > > > > Hoisting also seems to improve the number of overall cache misses tho.
> > > > > > > > > > For disabled hoisting of 'w' array case, there were a total of 463
> > > > > > > > > > cache misses, while with full hoisting there were 357 cache misses
> > > > > > > > > > (with period = 1 million).
> > > > > > > > > > Does that happen because hoisting probably reduces cache misses along
> > > > > > > > > > the orthonl path (bb 173 - > bb 181 -> bb 182 -> bb 194) ?
> > > > > > > > > Hi,
> > > > > > > > > In general I feel for this or PR80155 case, the issues come with long
> > > > > > > > > range hoistings, inside a large CFG, since we don't have an accurate
> > > > > > > > > way to model target resources (register pressure in PR80155 case / or
> > > > > > > > > possibly cache pressure in this case?) at tree level and we end up
> > > > > > > > > with register spill or cache miss inside loops, which may offset the
> > > > > > > > > benefit of hoisting. As previously discussed the right way to go is a
> > > > > > > > > live range splitting pass, at GIMPLE -> RTL border which can also help
> > > > > > > > > with other code-movement optimizations (or if the source had variables
> > > > > > > > > with long live ranges).
> > > > > > > > >
> > > > > > > > > I was wondering tho as a cheap workaround, would it make sense to
> > > > > > > > > check if we are hoisting across a "large" region of nested loops, and
> > > > > > > > > avoid in that case since hoisting may exert resource pressure inside
> > > > > > > > > loop region ? (Especially, in the cases where hoisted expressions were
> > > > > > > > > not originally AVAIL in any of the loop blocks, and the loop region
> > > > > > > > > doesn't benefit from hoisting).
> > > > > > > > >
> > > > > > > > > For instance:
> > > > > > > > > FOR_EACH_EDGE (e, ei, block)
> > > > > > > > > {
> > > > > > > > > /* Avoid hoisting across more than 3 nested loops */
> > > > > > > > > if (e->dest is a loop pre-header or loop header
> > > > > > > > > && nesting depth of loop is > 3)
> > > > > > > > > return false;
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > I think this would work for resolving the calculix issue because it
> > > > > > > > > hoists across one region of 6 loops and another of 4 loops (didn't test
> > > > > > > > > yet).
> > > > > > > > > It's not bulletproof in that it will miss detecting cases where loop
> > > > > > > > > header (or pre-header) isn't a successor of candidate block (checking
> > > > > > > > > for
> > > > > > > > > that might get expensive tho?). I will test it on gcc suite and SPEC
> > > > > > > > > for any regressions.
> > > > > > > > > Does this sound like a reasonable heuristic ?
> > > > > > > > Hi,
> > > > > > > > The attached patch implements the above heuristic.
> > > > > > > > Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> > > > > > > > And it brings back most of performance for calculix on par with O2
> > > > > > > > (without inlining orthonl).
> > > > > > > > I verified that with patch there is no cache miss happening on load
> > > > > > > > insn inside loop
> > > > > > > > (with perf report -e L1-dcache-load-misses/period=1000000/)
> > > > > > > >
> > > > > > > > I am in the process of benchmarking the patch on aarch64 for SPEC for
> > > > > > > > speed and will report numbers
> > > > > > > > in couple of days. (If required, we could parametrize number of nested
> > > > > > > > loops, hardcoded (arbitrarily to) 3 in this patch,
> > > > > > > > and set it in backend to not affect other targets).
> > > > > > >
> > > > > > > I don't think this patch captures the case in a sensible way - it will simply
> > > > > > > never hoist computations out of loop header blocks with depth > 3 which
> > > > > > > is certainly not what you want. Also the pre-header check is odd - we're
> > > > > > > looking for computations in successors of BLOCK but clearly a computation
> > > > > > > in a pre-header is not at the same loop level as one in the header itself.
> > > > > > Well, my intent was to check if we are hoisting across a region,
> > > > > > which has multiple nested loops, and in that case, avoid hoisting expressions
> > > > > > that do not belong to any loop blocks, because that may increase
> > > > > > resource pressure
> > > > > > inside loops. For instance, in the calculix issue we hoist 'w' array
> > > > > > from post-dom and neither
> > > > > > loop region has any uses of 'w'. I agree checking just for loop level
> > > > > > is too coarse.
> > > > > > The check with pre-header was essentially the same to see if we are
> > > > > > hoisting across a loop region,
> > > > > > not necessarily from within the loops.
> > > > >
> > > > > But it will fail to hoist *p in
> > > > >
> > > > > if (x)
> > > > > {
> > > > > a = *p;
> > > > > }
> > > > > else
> > > > > {
> > > > > b = *p;
> > > > > }
> > > > >
> > > > > <make distance large>
> > > > > pdom:
> > > > > c = *p;
> > > > >
> > > > > so it isn't what matters either. What happens at the immediate post-dominator
> > > > > isn't directly relevant - what matters would be if the pdom is the one making
> > > > > the value antic on one of the outgoing edges. In that case we're also going
> > > > > to PRE *p into the arm not computing *p (albeit in a later position). But
> > > > > that property is impossible to compute from the sets itself (not even mentioning
> > > > > the arbitrary CFG that can be inbetween the block and its pdom or the weird
> > > > > pdoms we compute for regions not having a path to exit, like infinite loops).
> > > > >
> > > > > You could eventually look at the pdom predecessors and if *p is not AVAIL_OUT
> > > > > in each of them we _might_ have the situation you want to protect against.
> > > > > But as said PRE insertion will likely have made sure it _is_ AVAIL_OUT in each
> > > > > of them ...
> > > > Hi Richard,
> > > > Thanks for the suggestions. Right, the issue seems to be here that
> > > > post-dom block is making expressions ANTIC. Before doing insert, could
> > > > we simply copy AVAIL_OUT sets of each block into another set say ORIG_AVAIL_OUT,
> > > > as a guard against PRE eventually inserting expressions in pred blocks
> > > > of pdom and making them available?
> > > > And during hoisting, we could check if each expr that is ANTIC_IN
> > > > (pdom) is ORIG_AVAIL_OUT in each pred of pdom,
> > > > if the distance is "large".
> > >
> > > Did you try if it works w/o copying AVAIL_OUT? Because AVAIL_OUT is
> > > very large (it's actually quadratic in size of the CFG * # values), in
> > > particular
> > > we're inserting in RPO and update AVAIL_OUT only up to the current block
> > > (from dominators) so the PDOM block should have the original AVAIL_OUT
> > > (but from the last iteration - we do iterate insertion).
> > >
> > > Note I'm still not happy with pulling off this kind of heuristics.
> > > What the suggestion
> > > means is that for
> > >
> > > if (x)
> > > y = *p;
> > > z = *p;
> > >
> > > we'll end up with
> > >
> > > if (x)
> > > y = *p;
> > > else
> > > z = *p;
> > >
> > > instead of
> > >
> > > tem = *p;
> > > if (x)
> > > y = tem;
> > > else
> > > z = tem;
> > >
> > > that is, we get the runtime benefit but not the code-size one
> > > (hoisting mostly helps code size plus allows if-conversion as followup
> > > in some cases). Now, if we iterate (like if we'd have a second hoisting pass)
> > > then the above would still cause hoisting - so the heuristic isn't sustainable.
> > > Again, sth like "distance" isn't really available.
> > Hi Richard,
> > It doesn't work without copying AVAIL_OUT.
> > I tried for small example with attached patch:
> >
> > int foo(int cond, int x, int y)
> > {
> > int t;
> > void f(int);
> >
> > if (cond)
> > t = x + y;
> > else
> > t = x - y;
> >
> > f (t);
> > int t2 = (x + y) * 10;
> > return t2;
> > }
> >
> > By intersecting availout_in_some with AVAIL_OUT of preds of pdom,
> > it does not hoist in first pass, but then PRE inserts x + y in the "else block",
> > and we eventually hoist before if (cond). Similarly for e_c3d
> > hoistings in calculix.
> >
> > IIUC, we want hoisting to be:
> > (ANTIC_IN (block) intersect AVAIL_OUT (preds of pdom)) - AVAIL_OUT (block)
> > to ensure that hoisted expressions are along all paths from block to post-dom ?
> > If copying AVAIL_OUT sets is too large, could we keep another set that
> > precomputes intersection of AVAIL_OUT of pdom preds
> > for each block and then use this info during hoisting ?
> >
> > For computing "distance", I implemented a simple DFS walk from block
> > to post-dom, that gives up if depth crosses
> > threshold before reaching post-dom. I am not sure tho, how expensive
> > that can get.
>
> As written it is quadratic in the CFG size.
>
> You can optimize away the
>
> + FOR_EACH_EDGE (e, ei, pdom_bb->preds)
> + bitmap_and_into (&availout_in_some, &AVAIL_OUT (e->src)->values);
>
> loop if the intersection of availout_in_some and ANTIC_IN (pdom) is empty.
>
> As said, I don't think this is the way to go - trying to avoid code
> hoisting isn't
> what we'd want to do - your quoted assembly instead points to a loop
> with a non-empty latch which is usually caused by PRE and avoided with -O3
> because it also harms vectorization.
But disabling PRE (which removes the non-empty latch) only results in a
marginal performance improvement, while disabling hoisting of the 'w'
array, with the non-empty latch still present, seems to gain back most
of the performance.
AFAIU, that is because after disabling hoisting of 'w' there was no
cache miss (as seen with perf -e L1-dcache-load-misses) on the load
instruction inside the inner loop.

For the pdom heuristic, I guess we cannot copy AVAIL_OUT sets per
node, since that's quadratic in terms of CFG size.
Would it make sense to break the interaction between PRE and hoisting
only for the case when PRE inserts into preds of pdom?
I tried doing that in the attached patch, where insert runs in two phases:
(a) PRE and hoisting, where hoisting marks the blocks for which PRE
should be skipped.
(b) A second phase, which only runs PRE, on all blocks.
This (expectedly) regresses ssa-hoist-3.c.
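
To show the intended effect on the small foo() example from earlier in
the thread, here is a toy model in plain C. It is only a sketch: the
bitmask sets, block names and toy_* functions are made up for
illustration and are not GCC code. Values are single bits, the CFG is
the cond/then/else/join diamond, and only the set entries the model
needs are filled in.

```c
#include <stdint.h>

enum { B_COND, B_THEN, B_ELSE, B_JOIN, NBLOCKS };
enum { V_ADD = 1u << 0, V_SUB = 1u << 1 };   /* x + y, x - y */

struct toy_sets {
  uint32_t antic_in[NBLOCKS];
  uint32_t avail_out[NBLOCKS];
};

static void toy_init(struct toy_sets *s)
{
  for (int b = 0; b < NBLOCKS; b++)
    s->antic_in[b] = s->avail_out[b] = 0;
  s->antic_in[B_COND] = V_ADD;    /* x + y is computed on every path */
  s->antic_in[B_JOIN] = V_ADD;    /* t2 = (x + y) * 10 in the pdom */
  s->avail_out[B_THEN] = V_ADD;
  s->avail_out[B_ELSE] = V_SUB;
}

/* Guarded hoisting at B_COND.  If a candidate is ANTIC_IN at the pdom
   (B_JOIN), intersect with AVAIL_OUT of the pdom's preds and ask the
   caller to defer PRE for the pdom.  */
static uint32_t toy_hoist(const struct toy_sets *s, int *defer_pre)
{
  uint32_t cand = (s->avail_out[B_THEN] | s->avail_out[B_ELSE])
                  & s->antic_in[B_COND];
  if (cand & s->antic_in[B_JOIN])
    {
      *defer_pre = 1;
      cand &= s->avail_out[B_THEN] & s->avail_out[B_ELSE];
    }
  return cand & ~s->avail_out[B_COND];
}

/* PRE at B_JOIN: a value anticipated there and available in some pred
   is inserted into the preds that lack it.  Returns 1 on insertion.  */
static int toy_pre(struct toy_sets *s)
{
  uint32_t partial = (s->avail_out[B_THEN] | s->avail_out[B_ELSE])
                     & s->antic_in[B_JOIN];
  uint32_t in_all = s->avail_out[B_THEN] & s->avail_out[B_ELSE];
  s->avail_out[B_THEN] |= partial;
  s->avail_out[B_ELSE] |= partial;
  return (partial & ~in_all) != 0;
}

/* One iterated insertion run.  late_pre == 0: hoisting plus PRE;
   late_pre == 1: PRE only.  use_deferral selects whether the deferral
   request from hoisting is honoured.  Returns the set hoisted.  */
static uint32_t toy_insert_1(struct toy_sets *s, int late_pre,
                             int use_deferral)
{
  uint32_t hoisted = 0;
  int defer_pre = 0;
  int changed = 1;
  while (changed)
    {
      changed = 0;
      if (!late_pre)
        {
          uint32_t h = toy_hoist(s, &defer_pre);
          if (h)
            {
              s->avail_out[B_COND] |= h;
              hoisted |= h;
              changed = 1;
            }
        }
      if (late_pre || !(use_deferral && defer_pre))
        changed |= toy_pre(s);
    }
  return hoisted;
}

/* Guard only, PRE and hoisting still interleaved.  */
uint32_t toy_run_interleaved(void)
{
  struct toy_sets s;
  toy_init(&s);
  return toy_insert_1(&s, 0, 0);
}

/* Guard plus the two-phase scheme (a) then (b).  */
uint32_t toy_run_two_phase(void)
{
  struct toy_sets s;
  toy_init(&s);
  uint32_t hoisted = toy_insert_1(&s, 0, 1);
  hoisted |= toy_insert_1(&s, 1, 1);
  return hoisted;
}
```

In this model toy_run_interleaved() ends up hoisting x + y on the second
iteration, after PRE has made it available in the else block, while
toy_run_two_phase() hoists nothing and only does the PRE insertion.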
If the heuristic isn't acceptable, I suppose encoding the distance of
each expr within the ANTIC sets during compute_antic would be the right
way to fix this?
So ANTIC_IN (block) would contain the anticipated expressions and, for
each antic expr, the "distance" from the furthest block it's computed in?
Could you please elaborate a bit on how we could go about encoding
distance during compute_antic?
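
Just to make the question concrete, this is one possible reading of
"aging", written as a toy in plain C for a single expression. It is an
assumption on my side, not how compute_antic works today: the CFG
tables, AGE_* names and the threshold are all made up. A block that
computes the expression gets distance 0; otherwise the expression is
anticipated only if all successors anticipate it, and its distance is
one more than the largest successor distance.

```c
#define AGE_NBLOCKS 6
#define AGE_NOT_ANTIC (-1)

/* Toy acyclic CFG, blocks numbered in topological order:
   0 -> 1 -> 5            (short arm, block 1 computes the expr)
   0 -> 2 -> 3 -> 4 -> 5  (long arm, nothing computes it)
   Block 5 is the post-dominator and computes the expr.  */
static const int age_succs[AGE_NBLOCKS][2] = {
  {1, 2}, {5, -1}, {3, -1}, {4, -1}, {5, -1}, {-1, -1}
};
static const int age_computes[AGE_NBLOCKS] = {0, 1, 0, 0, 0, 1};

/* Backward walk filling dist[b] with the "age" of the expression at
   the entry of block b, or AGE_NOT_ANTIC if it is not anticipated.  */
void age_compute_antic(int dist[AGE_NBLOCKS])
{
  for (int b = AGE_NBLOCKS - 1; b >= 0; b--)
    {
      if (age_computes[b])
        {
          dist[b] = 0;
          continue;
        }

      int nsucc = 0, all_antic = 1, worst = 0;
      for (int i = 0; i < 2; i++)
        {
          int s = age_succs[b][i];
          if (s < 0)
            continue;
          nsucc++;
          if (dist[s] == AGE_NOT_ANTIC)
            all_antic = 0;
          else if (dist[s] > worst)
            worst = dist[s];
        }
      dist[b] = (nsucc > 0 && all_antic) ? worst + 1 : AGE_NOT_ANTIC;
    }
}

/* Hypothetical cut-off: allow hoisting into BLOCK only if the furthest
   computation is at most MAX_DIST blocks away.  */
int age_hoist_allowed(const int dist[AGE_NBLOCKS], int block, int max_dist)
{
  return dist[block] != AGE_NOT_ANTIC && dist[block] <= max_dist;
}
```

For this CFG the distance at block 0 comes out as 4 (via the long arm),
so a threshold of 3 would reject the hoist into block 0 while still
treating the expression as anticipated in blocks 1 to 5. Loops would
need a proper fixed-point iteration, which this sketch ignores.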
Thanks,
Prathamesh
>
> Richard.
>
> > Thanks,
> > Prathamesh
> > >
> > > Richard.
> > >
> > > > Thanks,
> > > > Prathamesh
> > > >
> > > >
> > > > >
> > > > > > >
> > > > > > > Note the difficulty to capture "distance" is that the distance is simply not
> > > > > > > available at this point - it is the anticipated values from the successors
> > > > > > > that do _not_ compute the value itself that are the issue. To capture
> > > > > > > "distance" you'd need to somehow "age" anticipated value when
> > > > > > > propagating them upwards during compute_antic (where it then
> > > > > > > would also apply to PRE in general).
> > > > > > Yes, indeed. As a hack, would it make sense to avoid inserting an
> > > > > > expression in the block,
> > > > > > if it's ANTIC in post-dom block as a trade-off between extending live
> > > > > > range and hoisting
> > > > > > if the "distance" between block and post-dom is "too far" ? In
> > > > > > general, as you point out, we'd need to compute,
> > > > > > distance info for successors block during compute_antic, but special
> > > > > > casing for post-dom should be easy enough
> > > > > > during do_hoist_insertion, and hoisting an expr that is ANTIC in
> > > > > > post-dom could be potentially "long range", if the region is large.
> > > > > > It's still a coarse heuristic tho. I tried it in the attached patch.
> > > > > >
> > > > > > Thanks,
> > > > > > Prathamesh
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > As with all other heuristics the only place one could do hackish attempts
> > > > > > > with at least some reasoning is the elimination phase where
> > > > > > > we make use of the (hoist) insertions - of course for hoisting we already
> > > > > > > know we'll get the "close" use in one of the successors so I fear even
> > > > > > > there it will be impossible to do something sensible.
> > > > > > >
> > > > > > > Richard.
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Prathamesh
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Prathamesh
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Prathamesh
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Prathamesh
> > > > > > > > > > >
> > > > > > > > > > > Alexander
[-- Attachment #2: gnu-659-pdom-5.diff --]
[-- Type: application/octet-stream, Size: 3362 bytes --]
diff --git a/gcc/tree-ssa-pre.c b/gcc/tree-ssa-pre.c
index 0c1654f3580..5be4c3cc9d4 100644
--- a/gcc/tree-ssa-pre.c
+++ b/gcc/tree-ssa-pre.c
@@ -3482,7 +3482,7 @@ do_pre_partial_partial_insertion (basic_block block, basic_block dom)
The caller has to make sure that BLOCK has at least two successors. */
static bool
-do_hoist_insertion (basic_block block)
+do_hoist_insertion (basic_block block, hash_set<basic_block> *late_pre_bbs)
{
edge e;
edge_iterator ei;
@@ -3537,6 +3537,22 @@ do_hoist_insertion (basic_block block)
&AVAIL_OUT (e->dest)->values);
bitmap_clear (&hoistable_set.values);
+ /* Intersect with AVAIL_OUT of preds of post-dom, to check that
+ hoisted exprs are along all paths from block to pdom. */
+
+ basic_block pdom_bb = get_immediate_dominator (CDI_POST_DOMINATORS, block);
+ bitmap_head S;
+ bitmap_initialize (&S, &grand_bitmap_obstack);
+ bitmap_and (&S, &availout_in_some, &ANTIC_IN (pdom_bb)->values);
+ if (!bitmap_empty_p (&S))
+ {
+ /* Mark pdom_bb in late_pre_bbs, to avoid PRE for this block
+ during hoisting. */
+ late_pre_bbs->add (pdom_bb);
+ FOR_EACH_EDGE (e, ei, pdom_bb->preds)
+ bitmap_and_into (&availout_in_some, &AVAIL_OUT (e->src)->values);
+ }
+
/* Short-cut for a common case: availout_in_some is empty. */
if (bitmap_empty_p (&availout_in_some))
return false;
@@ -3615,9 +3631,10 @@ do_hoist_insertion (basic_block block)
/* Perform insertion of partially redundant and hoistable values. */
static void
-insert (void)
+insert_1 (bool late_pre)
{
basic_block bb;
+ hash_set<basic_block> *late_pre_bbs = new hash_set<basic_block> ();
FOR_ALL_BB_FN (bb, cfun)
NEW_SETS (bb) = bitmap_set_new ();
@@ -3664,14 +3681,21 @@ insert (void)
/* Insert expressions for partial redundancies. */
if (flag_tree_pre && !single_pred_p (block))
{
- changed |= do_pre_regular_insertion (block, dom);
- if (do_partial_partial)
- changed |= do_pre_partial_partial_insertion (block, dom);
+ /* If hoisting marked to not insert in preds of block,
+ skip for now, and insert during "late pre". */
+ if (!late_pre && late_pre_bbs->contains (block))
+ ;
+ else
+ {
+ changed |= do_pre_regular_insertion (block, dom);
+ if (do_partial_partial)
+ changed |= do_pre_partial_partial_insertion (block, dom);
+ }
}
/* Insert expressions for hoisting. */
- if (flag_code_hoisting && EDGE_COUNT (block->succs) >= 2)
- changed |= do_hoist_insertion (block);
+ if (!late_pre && flag_code_hoisting && EDGE_COUNT (block->succs) >= 2)
+ changed |= do_hoist_insertion (block, late_pre_bbs);
}
}
@@ -3688,6 +3712,12 @@ insert (void)
free (rpo);
}
+static void
+insert ()
+{
+ insert_1 (false);
+ insert_1 (true);
+}
/* Compute the AVAIL set for all basic blocks.
@@ -4099,6 +4129,7 @@ init_pre (void)
alloc_aux_for_blocks (sizeof (struct bb_bitmap_sets));
calculate_dominance_info (CDI_DOMINATORS);
+ calculate_dominance_info (CDI_POST_DOMINATORS);
bitmap_obstack_initialize (&grand_bitmap_obstack);
phi_translate_table = new hash_table<expr_pred_trans_d> (5110);
@@ -4131,6 +4162,7 @@ fini_pre ()
name_to_id.release ();
free_aux_for_blocks ();
+ free_dominance_info (CDI_POST_DOMINATORS);
}
namespace {
* Re: LTO slows down calculix by more than 10% on aarch64
2020-09-24 10:36 ` Prathamesh Kulkarni
@ 2020-09-24 11:14 ` Richard Biener
2020-10-21 10:03 ` Prathamesh Kulkarni
0 siblings, 1 reply; 25+ messages in thread
From: Richard Biener @ 2020-09-24 11:14 UTC (permalink / raw)
To: Prathamesh Kulkarni; +Cc: Alexander Monakov, GCC Development
On Thu, Sep 24, 2020 at 12:36 PM Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Wed, 23 Sep 2020 at 16:40, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> > On Wed, Sep 23, 2020 at 12:11 PM Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> > >
> > > On Wed, 23 Sep 2020 at 13:22, Richard Biener <richard.guenther@gmail.com> wrote:
> > > >
> > > > On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni
> > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > >
> > > > > On Tue, 22 Sep 2020 at 16:36, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > >
> > > > > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > >
> > > > > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > I obtained perf stat results for following benchmark runs:
> > > > > > > > > > > > >
> > > > > > > > > > > > > -O2:
> > > > > > > > > > > > >
> > > > > > > > > > > > > 7856832.692380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > > > 3758 context-switches # 0.000 K/sec
> > > > > > > > > > > > > 40 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > > > 40847 page-faults # 0.005 K/sec
> > > > > > > > > > > > > 7856782413676 cycles # 1.000 GHz
> > > > > > > > > > > > > 6034510093417 instructions # 0.77 insn per cycle
> > > > > > > > > > > > > 363937274287 branches # 46.321 M/sec
> > > > > > > > > > > > > 48557110132 branch-misses # 13.34% of all branches
> > > > > > > > > > > >
> > > > > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > > > > > > > > > enough for this kind of code)
> > > > > > > > > > > >
> > > > > > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > > > > > >
> > > > > > > > > > > > > 8319643.114380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > > > 4285 context-switches # 0.001 K/sec
> > > > > > > > > > > > > 28 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > > > 40843 page-faults # 0.005 K/sec
> > > > > > > > > > > > > 8319591038295 cycles # 1.000 GHz
> > > > > > > > > > > > > 6276338800377 instructions # 0.75 insn per cycle
> > > > > > > > > > > > > 467400726106 branches # 56.180 M/sec
> > > > > > > > > > > > > 45986364011 branch-misses # 9.84% of all branches
> > > > > > > > > > > >
> > > > > > > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > > > > > > > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > > > > > > > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > > > > > > > > >
> > > > > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > > > > > > >
> > > > > > > > > > > > > 8207331.088040 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > > > 2266 context-switches # 0.000 K/sec
> > > > > > > > > > > > > 32 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > > > > > > > > 8207292032467 cycles # 1.000 GHz
> > > > > > > > > > > > > 6035724436440 instructions # 0.74 insn per cycle
> > > > > > > > > > > > > 364415440156 branches # 44.401 M/sec
> > > > > > > > > > > > > 53138327276 branch-misses # 14.58% of all branches
> > > > > > > > > > > >
> > > > > > > > > > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > > > > > > > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > > > > > > > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > > > > > > > > > not very complicated (though Fortran array addressing...).
> > > > > > > > > > > >
> > > > > > > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > > > > > > >
> > > > > > > > > > > > > 7797265.206850 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > > > 3139 context-switches # 0.000 K/sec
> > > > > > > > > > > > > 20 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > > > > > > > > 7797221351467 cycles # 1.000 GHz
> > > > > > > > > > > > > 6187348757324 instructions # 0.79 insn per cycle
> > > > > > > > > > > > > 461840800061 branches # 59.231 M/sec
> > > > > > > > > > > > > 26920311761 branch-misses # 5.83% of all branches
> > > > > > > > > > > >
> > > > > > > > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > > > > > > > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > > > > > > > > > in insn count).
> > > > > > > > > > > >
> > > > > > > > > > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > > > > > > > > > consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > > > > > > > > > each iterating only 3 times).
> > > > > > > > > > > >
> > > > > > > > > > > > > Perf profiles for
> > > > > > > > > > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > > > >
> > > > > > > > > > > > > 3196866 |1f04: ldur d1, [x1, #-248]
> > > > > > > > > > > > > 216348301800│ add w0, w0, #0x1
> > > > > > > > > > > > > 985098 | add x2, x2, #0x18
> > > > > > > > > > > > > 216215999206│ add x1, x1, #0x48
> > > > > > > > > > > > > 215630376504│ fmul d1, d5, d1
> > > > > > > > > > > > > 863829148015│ fmul d1, d1, d6
> > > > > > > > > > > > > 864228353526│ fmul d0, d1, d0
> > > > > > > > > > > > > 864568163014│ fmadd d2, d0, d16, d2
> > > > > > > > > > > > > │ cmp w0, #0x4
> > > > > > > > > > > > > 216125427594│ ↓ b.eq 1f34
> > > > > > > > > > > > > 15010377│ ldur d0, [x2, #-8]
> > > > > > > > > > > > > 143753737468│ ↑ b 1f04
> > > > > > > > > > > > >
> > > > > > > > > > > > > -O2 with inlined orthonl:
> > > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > > > >
> > > > > > > > > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > > > > > > > > > > 144055883055│ add w0, w0, #0x1
> > > > > > > > > > > > > 72262104254│ add x2, x2, #0x18
> > > > > > > > > > > > > 143991169721│ add x1, x1, #0x48
> > > > > > > > > > > > > 288648917780│ fmul d15, d17, d15
> > > > > > > > > > > > > 864665644756│ fmul d15, d15, d18
> > > > > > > > > > > > > 863868426387│ fmul d14, d15, d14
> > > > > > > > > > > > > 865228159813│ fmadd d16, d14, d31, d16
> > > > > > > > > > > > > 245967│ cmp w0, #0x4
> > > > > > > > > > > > > 215396760545│ ↓ b.eq 1f28
> > > > > > > > > > > > > 704732365│ ldur d14, [x2, #-8]
> > > > > > > > > > > > > 143775979620│ ↑ b 1ef8
> > > > > > > > > > > >
> > > > > > > > > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > > > > > > > > >
> > > > > > > > > > > > High count on the initial ldur instruction could be explained if the loop
> > > > > > > > > > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > > > > > > > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > > > > > > > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > > > > > > > > > assembly.
> > > > > > > > > > > >
> > > > > > > > > > > > It may also be possible to check if code alignment matters, by compiling with
> > > > > > > > > > > > -falign-loops=32.
> > > > > > > > > > > Hi,
> > > > > > > > > > > Thanks a lot for the detailed feedback, and I am sorry for late response.
> > > > > > > > > > >
> > > > > > > > > > > The hoisting region is:
> > > > > > > > > > > if(mattyp.eq.1) then
> > > > > > > > > > > 4 loops
> > > > > > > > > > > elseif(mattyp.eq.2) then
> > > > > > > > > > > {
> > > > > > > > > > > orthonl inlined into basic block;
> > > > > > > > > > > loads w[0] .. w[8]
> > > > > > > > > > > }
> > > > > > > > > > > else
> > > > > > > > > > > 6 loops // load anisox
> > > > > > > > > > >
> > > > > > > > > > > followed by basic block:
> > > > > > > > > > >
> > > > > > > > > > > senergy=
> > > > > > > > > > > & (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > > > > > > > > & +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > > > > > > > > & +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > > > > > > > > s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > > > > > > > > s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > > > > > > > > s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > > > > > > > >
> > > > > > > > > > > Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block,
> > > > > > > > > > > right in block 181, which is:
> > > > > > > > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > > > > > > > >
> > > > > > > > > > > which is then further hoisted to block 173:
> > > > > > > > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > > > > > > > >
> > > > > > > > > > > From block 181, we have two paths towards senergy block (bb 194):
> > > > > > > > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > > > > > > > AND
> > > > > > > > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > > > > > > > > which has a path length of around 18 blocks.
> > > > > > > > > > > (bb 194 post-dominates bb 181 and bb 173).
> > > > > > > > > > >
> > > > > > > > > > > Disabling only load hoisting within blocks 173 and 181
> > > > > > > > > > > (simply avoid inserting pre_expr if pre_expr->kind == REFERENCE),
> > > > > > > > > > > avoid hoisting of 'w' array and brings back most of performance. Which
> > > > > > > > > > > verifies that it is hoisting of the
> > > > > > > > > > > 'w' array (w[0] ... w[8]), which is causing the slowdown ?
> > > > > > > > > > >
> > > > > > > > > > > I obtained perf profiles for full hoisting, and disabled hoisting of
> > > > > > > > > > > 'w' array for the 6 loops, and the most drastic difference was
> > > > > > > > > > > for ldur instruction:
> > > > > > > > > > >
> > > > > > > > > > > With full hoisting:
> > > > > > > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > > > > > > > >
> > > > > > > > > > > Without full hoisting:
> > > > > > > > > > > 3441224 │1edc: ldur d1, [x1, #-248]
> > > > > > > > > > >
> > > > > > > > > > > (The loop entry seems to be fall thru in both cases. I have attached
> > > > > > > > > > > profiles for both cases).
> > > > > > > > > > >
> > > > > > > > > > > IIUC, the instruction seems to be loading the first element from anisox array,
> > > > > > > > > > > which makes me wonder if the issue was with data-cache miss for slower version.
> > > > > > > > > > > I ran perf script on perf data for L1-dcache-load-misses with period = 1million,
> > > > > > > > > > > and it reported two cache misses on the ldur instruction in full hoisting case,
> > > > > > > > > > > while it reported zero for the disabled load hoisting case.
> > > > > > > > > > > So I wonder if the slowdown happens because hoisting of 'w' array
> > > > > > > > > > > possibly results
> > > > > > > > > > > in eviction of anisox thus causing a cache miss inside the inner loop
> > > > > > > > > > > and making load slower ?
> > > > > > > > > > >
> > > > > > > > > > > Hoisting also seems to improve the number of overall cache misses tho.
> > > > > > > > > > > For disabled hoisting of 'w' array case, there were a total of 463
> > > > > > > > > > > cache misses, while with full hoisting there were 357 cache misses
> > > > > > > > > > > (with period = 1 million).
> > > > > > > > > > > Does that happen because hoisting probably reduces cache misses along
> > > > > > > > > > > the orthonl path (bb 173 - > bb 181 -> bb 182 -> bb 194) ?
> > > > > > > > > > Hi,
> > > > > > > > > > In general I feel for this or PR80155 case, the issues come with long
> > > > > > > > > > range hoistings, inside a large CFG, since we don't have an accurate
> > > > > > > > > > way to model target resources (register pressure in PR80155 case / or
> > > > > > > > > > possibly cache pressure in this case?) at tree level and we end up
> > > > > > > > > > with register spill or cache miss inside loops, which may offset the
> > > > > > > > > > benefit of hoisting. As previously discussed the right way to go is a
> > > > > > > > > > live range splitting pass, at GIMPLE -> RTL border which can also help
> > > > > > > > > > with other code-movement optimizations (or if the source had variables
> > > > > > > > > > with long live ranges).
> > > > > > > > > >
> > > > > > > > > > I was wondering tho as a cheap workaround, would it make sense to
> > > > > > > > > > check if we are hoisting across a "large" region of nested loops, and
> > > > > > > > > > avoid in that case since hoisting may exert resource pressure inside
> > > > > > > > > > loop region ? (Especially, in the cases where hoisted expressions were
> > > > > > > > > > not originally AVAIL in any of the loop blocks, and the loop region
> > > > > > > > > > doesn't benefit from hoisting).
> > > > > > > > > >
> > > > > > > > > > For instance:
> > > > > > > > > > FOR_EACH_EDGE (e, ei, block)
> > > > > > > > > > {
> > > > > > > > > > /* Avoid hoisting across more than 3 nested loops */
> > > > > > > > > > if (e->dest is a loop pre-header or loop header
> > > > > > > > > > && nesting depth of loop is > 3)
> > > > > > > > > > return false;
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > I think this would work for resolving the calculix issue because it
> > > > > > > > > > hoists across one region of 6 loops and another of 4 loops (didn' test
> > > > > > > > > > yet).
> > > > > > > > > > It's not bulletproof in that it will miss detecting cases where loop
> > > > > > > > > > header (or pre-header) isn't a successor of candidate block (checking
> > > > > > > > > > for
> > > > > > > > > > that might get expensive tho?). I will test it on gcc suite and SPEC
> > > > > > > > > > for any regressions.
> > > > > > > > > > Does this sound like a reasonable heuristic ?
> > > > > > > > > Hi,
> > > > > > > > > The attached patch implements the above heuristic.
> > > > > > > > > Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> > > > > > > > > And it brings back most of performance for calculix on par with O2
> > > > > > > > > (without inlining orthonl).
> > > > > > > > > I verified that with patch there is no cache miss happening on load
> > > > > > > > > insn inside loop
> > > > > > > > > (with perf report -e L1-dcache-load-misses/period=1000000/)
> > > > > > > > >
> > > > > > > > > I am in the process of benchmarking the patch on aarch64 for SPEC for
> > > > > > > > > speed and will report numbers
> > > > > > > > > in couple of days. (If required, we could parametrize number of nested
> > > > > > > > > loops, hardcoded (arbitrarily to) 3 in this patch,
> > > > > > > > > and set it in backend to not affect other targets).
> > > > > > > >
> > > > > > > > I don't think this patch captures the case in a sensible way - it will simply
> > > > > > > > never hoist computations out of loop header blocks with depth > 3 which
> > > > > > > > is certainly not what you want. Also the pre-header check is odd - we're
> > > > > > > > looking for computations in successors of BLOCK but clearly a computation
> > > > > > > > in a pre-header is not at the same loop level as one in the header itself.
> > > > > > > Well, my intent was to check if we are hoisting across a region,
> > > > > > > which has multiple nested loops, and in that case, avoid hoisting expressions
> > > > > > > that do not belong to any loop blocks, because that may increase
> > > > > > > resource pressure
> > > > > > > inside loops. For instance, in the calculix issue we hoist 'w' array
> > > > > > > from post-dom and neither
> > > > > > > loop region has any uses of 'w'. I agree checking just for loop level
> > > > > > > is too coarse.
> > > > > > > The check with pre-header was essentially the same to see if we are
> > > > > > > hoisting across a loop region,
> > > > > > > not necessarily from within the loops.
> > > > > >
> > > > > > But it will fail to hoist *p in
> > > > > >
> > > > > > if (x)
> > > > > > {
> > > > > > a = *p;
> > > > > > }
> > > > > > else
> > > > > > {
> > > > > > b = *p;
> > > > > > }
> > > > > >
> > > > > > <make distance large>
> > > > > > pdom:
> > > > > > c = *p;
> > > > > >
> > > > > > so it isn't what matters either. What happens at the immediate post-dominator
> > > > > > isn't directly relevant - what matters would be if the pdom is the one making
> > > > > > the value antic on one of the outgoing edges. In that case we're also going
> > > > > > to PRE *p into the arm not computing *p (albeit in a later position). But
> > > > > > that property is impossible to compute from the sets itself (not even mentioning
> > > > > > the arbitrary CFG that can be inbetween the block and its pdom or the weird
> > > > > > pdoms we compute for regions not having a path to exit, like infinite loops).
> > > > > >
> > > > > > You could eventually look at the pdom predecessors and if *p is not AVAIL_OUT
> > > > > > in each of them we _might_ have the situation you want to protect against.
> > > > > > But as said PRE insertion will likely have made sure it _is_ AVAIL_OUT in each
> > > > > > of them ...
> > > > > Hi Richard,
> > > > > Thanks for the suggestions. Right, the issue seems to be here that
> > > > > post-dom block is making expressions ANTIC. Before doing insert, could
> > > > > we simply copy AVAIL_OUT sets of each block into another set say ORIG_AVAIL_OUT,
> > > > > as a guard against PRE eventually inserting expressions in pred blocks
> > > > > of pdom and making them available?
> > > > > And during hoisting, we could check if each expr that is ANTIC_IN
> > > > > (pdom) is ORIG_AVAIL_OUT in each pred of pdom,
> > > > > if the distance is "large".
> > > >
> > > > Did you try if it works w/o copying AVAIL_OUT? Because AVAIL_OUT is
> > > > very large (it's actually quadratic in size of the CFG * # values), in
> > > > particular
> > > > we're inserting in RPO and update AVAIL_OUT only up to the current block
> > > > (from dominators) so the PDOM block should have the original AVAIL_OUT
> > > > (but from the last iteration - we do iterate insertion).
> > > >
> > > > Note I'm still not happy with pulling off this kind of heuristics.
> > > > What the suggestion
> > > > means is that for
> > > >
> > > > if (x)
> > > > y = *p;
> > > > z = *p;
> > > >
> > > > we'll end up with
> > > >
> > > > if (x)
> > > > y = *p;
> > > > else
> > > > z = *p;
> > > >
> > > > instead of
> > > >
> > > > tem = *p;
> > > > if (x)
> > > > y = tem;
> > > > else
> > > > z = tem;
> > > >
> > > > that is, we get the runtime benefit but not the code-size one
> > > > (hoisting mostly helps code size plus allows if-conversion as followup
> > > > in some cases). Now, if we iterate (like if we'd have a second hoisting pass)
> > > > then the above would still cause hoisting - so the heuristic isn't sustainable.
> > > > Again, sth like "distance" isn't really available.
> > > Hi Richard,
> > > It doesn't work without copying AVAIL_OUT.
> > > I tried for small example with attached patch:
> > >
> > > int foo(int cond, int x, int y)
> > > {
> > > int t;
> > > void f(int);
> > >
> > > if (cond)
> > > t = x + y;
> > > else
> > > t = x - y;
> > >
> > > f (t);
> > > int t2 = (x + y) * 10;
> > > return t2;
> > > }
> > >
> > > By intersecting availout_in_some with AVAIL_OUT of preds of pdom,
> > > it does not hoist in first pass, but then PRE inserts x + y in the "else block",
> > > and we eventually hoist before if (cond). Similarly for e_c3d
> > > hoistings in calculix.
> > >
> > > IIUC, we want hoisting to be:
> > > (ANTIC_IN (block) intersect AVAIL_OUT (preds of pdom)) - AVAIL_OUT (block)
> > > to ensure that hoisted expressions are along all paths from block to post-dom ?
> > > If copying AVAIL_OUT sets is too large, could we keep another set that
> > > precomputes intersection of AVAIL_OUT of pdom preds
> > > for each block and then use this info during hoisting ?
> > >
> > > For computing "distance", I implemented a simple DFS walk from block
> > > to post-dom, that gives up if depth crosses
> > > threshold before reaching post-dom. I am not sure tho, how expensive
> > > that can get.
> >
> > As written it is quadratic in the CFG size.
> >
> > You can optimize away the
> >
> > + FOR_EACH_EDGE (e, ei, pdom_bb->preds)
> > + bitmap_and_into (&availout_in_some, &AVAIL_OUT (e->src)->values);
> >
> > loop if the intersection of availout_in_some and ANTIC_IN (pdom) is empty.
> >
> > As said, I don't think this is the way to go - trying to avoid code
> > hoisting isn't
> > what we'd want to do - your quoted assembly instead points to a loop
> > with a non-empty latch which is usually caused by PRE and avoided with -O3
> > because it also harms vectorization.
> But disabling PRE (which removes non empty latch), only results in
> marginal performance improvement,
> while disabling hoisting of 'w' array, with non-empty latch, seems to
> gain most of the performance.
> AFAIU, that was happening, because after disabling hoisting of 'w',
> there wasn't a cache miss (as seen with perf -e
> L1-dcache-load-misses),
> on the load instruction inside the inner loop.
But that doesn't make much sense then. If code generation isn't
an issue I don't see how the hoisted loads should cause an L1
dcache load miss for data that is accessed in the respective loop
as well (though not hoisted from it, since at -O2 the loop is not
sufficiently unrolled).
> For the pdom heuristic, I guess we cannot copy AVAIL_OUT sets per
> node, since that's quadratic in terms of CFG size.
> Would it make sense to break interaction between PRE and hoisting,
> only for the case when inserting into preds of pdom ?
> I tried doing that in attached patch, where insert runs in two phases:
> (a) PRE and hoisting, where hoisting marks block to not do PRE for.
> (b) Second phase, which only runs PRE on all blocks.
> This (expectedly) regresses ssa-hoist-3.c.
>
> If the heuristic isn't acceptable, I suppose encoding distance of expr
> within ANTIC sets
> during compute_antic would be the right way to fix this ?
> So ANTIC_IN (block) contains the anticipated expressions, and for each
> antic expr, the "distance" from the furthest block
> it's computed in ? Could you please elaborate a bit on how we could go
> about encoding distance during compute_antic ?
But the distance in this case is just one CFG node ... we have

  if (mattyp.eq.1)
    ... use of w but not with constant indices
  else if (mattyp.eq.2)
    ... inlined orthonl with constant index w() accesses, single BB
  else
    ... use of w but not with constant indices - the actual relevant
    loop of calculix
  endif
  ... constant index w() accesses, single BB

so the CFG distance is one node - unless you want to compute the
maximum distance? Btw, I only see 9 loads hoisted.
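
To make the minimum versus maximum distance distinction concrete, here is a rough sketch on a toy CFG shaped like the region above. The block names (`branch`, `orthonl`, `loop_0`, `pdom`) and the chain length of 6 are made up for illustration and are not GCC's block numbers; the loop nest is flattened into a straight chain so that a longest path is well defined. It is not how tree-ssa-pre.c measures anything.

```python
from collections import deque
from functools import lru_cache


def build_toy_cfg(loop_chain_len=6):
    """Hypothetical acyclic CFG: branch -> {orthonl | chain of loop blocks} -> pdom."""
    succs = {"branch": ["orthonl", "loop_0"], "orthonl": ["pdom"], "pdom": []}
    for i in range(loop_chain_len):
        nxt = f"loop_{i + 1}" if i + 1 < loop_chain_len else "pdom"
        succs[f"loop_{i}"] = [nxt]
    return succs


def min_intermediate_blocks(succs, src, dst):
    """Blocks strictly between src and dst on a shortest path (BFS)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        bb = queue.popleft()
        if bb == dst:
            return dist[bb] - 1
        for s in succs[bb]:
            if s not in dist:
                dist[s] = dist[bb] + 1
                queue.append(s)
    return None  # dst not reachable from src


def max_intermediate_blocks(succs, src, dst):
    """Blocks strictly between src and dst on a longest path (DAG only)."""
    @lru_cache(maxsize=None)
    def longest_edges(bb):
        if bb == dst:
            return 0
        best = None
        for s in succs[bb]:
            sub = longest_edges(s)
            if sub is not None and (best is None or sub + 1 > best):
                best = sub + 1
        return best

    edges = longest_edges(src)
    return None if edges is None else edges - 1


cfg = build_toy_cfg(6)
shortest = min_intermediate_blocks(cfg, "branch", "pdom")
longest = max_intermediate_blocks(cfg, "branch", "pdom")
```

In this toy graph the shortest route passes through the single orthonl block, while the longest route walks the whole loop chain, so a "distance" heuristic gives very different answers depending on which of the two it measures.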
I'm not sure how relevant -O2 -flto SPEC performance is for an FP benchmark.
And indeed this case is exactly one where hoisting is superior to
PRE, which would happily insert the 9 loads into the two variable-access
predecessors to get rid of the redundancy wrt the mattyp.eq.1 path.
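
A back-of-the-envelope model of that trade-off, counting only the nine constant-index loads of w. The block names and the per-block counts are assumptions taken from the CFG sketch above (orthonl being the single-BB arm and `senergy` the block after the endif), not from an actual GCC dump.

```python
# Three paths through the region; "arm1" and "arm3" stand for the two
# arms that access w with variable indices only (hypothetical names).
PATHS = {
    "mattyp==1": ["branch", "arm1", "senergy"],
    "mattyp==2": ["branch", "orthonl", "senergy"],
    "else": ["branch", "arm3", "senergy"],
}

# Number of constant-index w loads placed in each block under each scheme.
ORIGINAL = {"orthonl": 9, "senergy": 9}             # no PRE, no hoisting
AFTER_PRE = {"arm1": 9, "orthonl": 9, "arm3": 9}    # inserted into the other two arms
AFTER_HOIST = {"branch": 9}                         # one copy before the branch


def summarize(loads_per_block, paths):
    """Return (static load count, {path name: loads executed on that path})."""
    static = sum(loads_per_block.values())
    dynamic = {
        name: sum(loads_per_block.get(bb, 0) for bb in blocks)
        for name, blocks in paths.items()
    }
    return static, dynamic


original = summarize(ORIGINAL, PATHS)
pre = summarize(AFTER_PRE, PATHS)
hoist = summarize(AFTER_HOIST, PATHS)
```

Under these assumptions PRE and hoisting execute the same number of loads on every path, but PRE leaves three static copies of the nine loads where hoisting leaves one, which is the code-size argument for hoisting.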
In .optimized I see
pretmp_5573 = w[0];
pretmp_5574 = w[3];
pretmp_5575 = w[6];
pretmp_5576 = w[1];
pretmp_5577 = w[4];
pretmp_5578 = w[7];
pretmp_5579 = w[2];
pretmp_5580 = w[5];
pretmp_5581 = w[8];
if (mattyp.157_742 == 1)
I do remember talks/patches about ordering of such sequences of loads
to make them prefetch-happier. Are the loads actually emitted in order
for arm? That is, w[0]...w[8] rather than, as seen above, with some
random permutes in between? On x86 they are emitted in random order
(they are also spilled immediately).
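
As a side note on the order in the .optimized dump above: assuming w is a 3x3 array with Fortran's 1-based, column-major layout (an assumption, since the declaration is not quoted here), the linear offsets can be mapped back to w(i,j) with a small sketch like this one.

```python
def linear_to_fortran_index(k, rows=3):
    """Map a 0-based linear offset k to the 1-based (i, j) of a column-major array."""
    j, i = divmod(k, rows)   # column-major: offset = (i-1) + rows*(j-1)
    return (i + 1, j + 1)


# Order of the pretmp loads as they appear in the dump.
dump_order = [0, 3, 6, 1, 4, 7, 2, 5, 8]
fortran_order = [linear_to_fortran_index(k) for k in dump_order]
```

Under that assumption the dump order is w(1,1), w(1,2), w(1,3), w(2,1), ... w(3,3), that is, a row-by-row walk with a stride of three elements in memory, rather than ascending addresses. Whether the final assembly keeps this order is a separate question.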
Richard.
> Thanks,
> Prathamesh
>
>
> Thanks,
> Prathamesh
> >
> > Richard.
> >
> > > Thanks,
> > > Prathamesh
> > > >
> > > > Richard.
> > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > >
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > Note the difficulty to capture "distance" is that the distance is simply not
> > > > > > > > available at this point - it is the anticipated values from the successors
> > > > > > > > that do _not_ compute the value itself that are the issue. To capture
> > > > > > > > "distance" you'd need to somehow "age" anticipated value when
> > > > > > > > propagating them upwards during compute_antic (where it then
> > > > > > > > would also apply to PRE in general).
> > > > > > > Yes, indeed. As a hack, would it make sense to avoid inserting an
> > > > > > > expression in the block,
> > > > > > > if it's ANTIC in post-dom block as a trade-off between extending live
> > > > > > > range and hoisting
> > > > > > > if the "distance" between block and post-dom is "too far" ? In
> > > > > > > general, as you point out, we'd need to compute,
> > > > > > > distance info for successors block during compute_antic, but special
> > > > > > > casing for post-dom should be easy enough
> > > > > > > during do_hoist_insertion, and hoisting an expr that is ANTIC in
> > > > > > > post-dom could be potentially "long range", if the region is large.
> > > > > > > It's still a coarse heuristic tho. I tried it in the attached patch.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Prathamesh
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > As with all other heuristics the only place one could do hackish attempts
> > > > > > > > with at least some reasoning is the elimination phase where
> > > > > > > > we make use of the (hoist) insertions - of course for hoisting we already
> > > > > > > > know we'll get the "close" use in one of the successors so I fear even
> > > > > > > > there it will be impossible to do something sensible.
> > > > > > > >
> > > > > > > > Richard.
> > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Prathamesh
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Prathamesh
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Prathamesh
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Prathamesh
> > > > > > > > > > > >
> > > > > > > > > > > > Alexander
* Re: LTO slows down calculix by more than 10% on aarch64
2020-09-24 11:14 ` Richard Biener
@ 2020-10-21 10:03 ` Prathamesh Kulkarni
2020-10-21 10:39 ` Richard Biener
0 siblings, 1 reply; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-10-21 10:03 UTC (permalink / raw)
To: Richard Biener; +Cc: Alexander Monakov, GCC Development
On Thu, 24 Sep 2020 at 16:44, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Thu, Sep 24, 2020 at 12:36 PM Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Wed, 23 Sep 2020 at 16:40, Richard Biener <richard.guenther@gmail.com> wrote:
> > >
> > > On Wed, Sep 23, 2020 at 12:11 PM Prathamesh Kulkarni
> > > <prathamesh.kulkarni@linaro.org> wrote:
> > > >
> > > > On Wed, 23 Sep 2020 at 13:22, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > >
> > > > > On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni
> > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > >
> > > > > > On Tue, 22 Sep 2020 at 16:36, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > > >
> > > > > > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > >
> > > > > > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I obtained perf stat results for following benchmark runs:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -O2:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 7856832.692380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > > > > 3758 context-switches # 0.000 K/sec
> > > > > > > > > > > > > > 40 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > > > > 40847 page-faults # 0.005 K/sec
> > > > > > > > > > > > > > 7856782413676 cycles # 1.000 GHz
> > > > > > > > > > > > > > 6034510093417 instructions # 0.77 insn per cycle
> > > > > > > > > > > > > > 363937274287 branches # 46.321 M/sec
> > > > > > > > > > > > > > 48557110132 branch-misses # 13.34% of all branches
> > > > > > > > > > > > >
> > > > > > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > > > > > > > > > > enough for this kind of code)
> > > > > > > > > > > > >
> > > > > > > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 8319643.114380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > > > > 4285 context-switches # 0.001 K/sec
> > > > > > > > > > > > > > 28 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > > > > 40843 page-faults # 0.005 K/sec
> > > > > > > > > > > > > > 8319591038295 cycles # 1.000 GHz
> > > > > > > > > > > > > > 6276338800377 instructions # 0.75 insn per cycle
> > > > > > > > > > > > > > 467400726106 branches # 56.180 M/sec
> > > > > > > > > > > > > > 45986364011 branch-misses # 9.84% of all branches
> > > > > > > > > > > > >
> > > > > > > > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > > > > > > > > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > > > > > > > > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 8207331.088040 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > > > > 2266 context-switches # 0.000 K/sec
> > > > > > > > > > > > > > 32 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > > > > > > > > > 8207292032467 cycles # 1.000 GHz
> > > > > > > > > > > > > > 6035724436440 instructions # 0.74 insn per cycle
> > > > > > > > > > > > > > 364415440156 branches # 44.401 M/sec
> > > > > > > > > > > > > > 53138327276 branch-misses # 14.58% of all branches
> > > > > > > > > > > > >
> > > > > > > > > > > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > > > > > > > > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > > > > > > > > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > > > > > > > > > > not very complicated (though Fortran array addressing...).
> > > > > > > > > > > > >
> > > > > > > > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 7797265.206850 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > > > > 3139 context-switches # 0.000 K/sec
> > > > > > > > > > > > > > 20 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > > > > > > > > > 7797221351467 cycles # 1.000 GHz
> > > > > > > > > > > > > > 6187348757324 instructions # 0.79 insn per cycle
> > > > > > > > > > > > > > 461840800061 branches # 59.231 M/sec
> > > > > > > > > > > > > > 26920311761 branch-misses # 5.83% of all branches
> > > > > > > > > > > > >
> > > > > > > > > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > > > > > > > > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > > > > > > > > > > in insn count).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > > > > > > > > > > consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > > > > > > > > > > each iterating only 3 times).
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Perf profiles for
> > > > > > > > > > > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 3196866 │1f04: ldur d1, [x1, #-248]
> > > > > > > > > > > > > > 216348301800│ add w0, w0, #0x1
> > > > > > > > > > > > > > 985098 │ add x2, x2, #0x18
> > > > > > > > > > > > > > 216215999206│ add x1, x1, #0x48
> > > > > > > > > > > > > > 215630376504│ fmul d1, d5, d1
> > > > > > > > > > > > > > 863829148015│ fmul d1, d1, d6
> > > > > > > > > > > > > > 864228353526│ fmul d0, d1, d0
> > > > > > > > > > > > > > 864568163014│ fmadd d2, d0, d16, d2
> > > > > > > > > > > > > > │ cmp w0, #0x4
> > > > > > > > > > > > > > 216125427594│ ↓ b.eq 1f34
> > > > > > > > > > > > > > 15010377│ ldur d0, [x2, #-8]
> > > > > > > > > > > > > > 143753737468│ ↑ b 1f04
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -O2 with inlined orthonl:
> > > > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > > > > > > > > > > > 144055883055│ add w0, w0, #0x1
> > > > > > > > > > > > > > 72262104254│ add x2, x2, #0x18
> > > > > > > > > > > > > > 143991169721│ add x1, x1, #0x48
> > > > > > > > > > > > > > 288648917780│ fmul d15, d17, d15
> > > > > > > > > > > > > > 864665644756│ fmul d15, d15, d18
> > > > > > > > > > > > > > 863868426387│ fmul d14, d15, d14
> > > > > > > > > > > > > > 865228159813│ fmadd d16, d14, d31, d16
> > > > > > > > > > > > > > 245967│ cmp w0, #0x4
> > > > > > > > > > > > > > 215396760545│ ↓ b.eq 1f28
> > > > > > > > > > > > > > 704732365│ ldur d14, [x2, #-8]
> > > > > > > > > > > > > > 143775979620│ ↑ b 1ef8
> > > > > > > > > > > > >
> > > > > > > > > > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > > > > > > > > > >
> > > > > > > > > > > > > High count on the initial ldur instruction could be explained if the loop
> > > > > > > > > > > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > > > > > > > > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > > > > > > > > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > > > > > > > > > > assembly.
> > > > > > > > > > > > >
> > > > > > > > > > > > > It may also be possible to check if code alignment matters, by compiling with
> > > > > > > > > > > > > -falign-loops=32.
> > > > > > > > > > > > Hi,
> > > > > > > > > > > > Thanks a lot for the detailed feedback, and I am sorry for late response.
> > > > > > > > > > > >
> > > > > > > > > > > > The hoisting region is:
> > > > > > > > > > > > if(mattyp.eq.1) then
> > > > > > > > > > > > 4 loops
> > > > > > > > > > > > elseif(mattyp.eq.2) then
> > > > > > > > > > > > {
> > > > > > > > > > > > orthonl inlined into basic block;
> > > > > > > > > > > > loads w[0] .. w[8]
> > > > > > > > > > > > }
> > > > > > > > > > > > else
> > > > > > > > > > > > 6 loops // load anisox
> > > > > > > > > > > >
> > > > > > > > > > > > followed by basic block:
> > > > > > > > > > > >
> > > > > > > > > > > > senergy=
> > > > > > > > > > > > & (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > > > > > > > > > & +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > > > > > > > > > & +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > > > > > > > > > s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > > > > > > > > > s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > > > > > > > > > s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > > > > > > > > >
> > > > > > > > > > > > Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block,
> > > > > > > > > > > > right in block 181, which is:
> > > > > > > > > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > > > > > > > > >
> > > > > > > > > > > > which is then further hoisted to block 173:
> > > > > > > > > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > > > > > > > > >
> > > > > > > > > > > > From block 181, we have two paths towards senergy block (bb 194):
> > > > > > > > > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > > > > > > > > AND
> > > > > > > > > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > > > > > > > > > which has a path length of around 18 blocks.
> > > > > > > > > > > > (bb 194 post-dominates bb 181 and bb 173).
> > > > > > > > > > > >
> > > > > > > > > > > > > Disabling only load hoisting within blocks 173 and 181
> > > > > > > > > > > > > (simply avoiding insertion of pre_expr if pre_expr->kind == REFERENCE)
> > > > > > > > > > > > > avoids hoisting of the 'w' array and brings back most of the performance.
> > > > > > > > > > > > > This seems to verify that it is the hoisting of the
> > > > > > > > > > > > > 'w' array (w[0] ... w[8]) that is causing the slowdown?
> > > > > > > > > > > >
> > > > > > > > > > > > I obtained perf profiles for full hoisting, and disabled hoisting of
> > > > > > > > > > > > 'w' array for the 6 loops, and the most drastic difference was
> > > > > > > > > > > > for ldur instruction:
> > > > > > > > > > > >
> > > > > > > > > > > > With full hoisting:
> > > > > > > > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > > > > > > > > >
> > > > > > > > > > > > Without full hoisting:
> > > > > > > > > > > > 3441224 │1edc: ldur d1, [x1, #-248]
> > > > > > > > > > > >
> > > > > > > > > > > > (The loop entry seems to be fall thru in both cases. I have attached
> > > > > > > > > > > > profiles for both cases).
> > > > > > > > > > > >
> > > > > > > > > > > > IIUC, the instruction seems to be loading the first element from anisox array,
> > > > > > > > > > > > which makes me wonder if the issue was with data-cache miss for slower version.
> > > > > > > > > > > > I ran perf script on perf data for L1-dcache-load-misses with period = 1million,
> > > > > > > > > > > > and it reported two cache misses on the ldur instruction in full hoisting case,
> > > > > > > > > > > > while it reported zero for the disabled load hoisting case.
> > > > > > > > > > > > So I wonder if the slowdown happens because hoisting of 'w' array
> > > > > > > > > > > > possibly results
> > > > > > > > > > > > in eviction of anisox thus causing a cache miss inside the inner loop
> > > > > > > > > > > > and making load slower ?
> > > > > > > > > > > >
> > > > > > > > > > > > Hoisting also seems to improve the number of overall cache misses tho.
> > > > > > > > > > > > For disabled hoisting of 'w' array case, there were a total of 463
> > > > > > > > > > > > cache misses, while with full hoisting there were 357 cache misses
> > > > > > > > > > > > (with period = 1 million).
> > > > > > > > > > > > Does that happen because hoisting probably reduces cache misses along
> > > > > > > > > > > > the orthonl path (bb 173 - > bb 181 -> bb 182 -> bb 194) ?
> > > > > > > > > > > Hi,
> > > > > > > > > > > In general I feel for this or PR80155 case, the issues come with long
> > > > > > > > > > > range hoistings, inside a large CFG, since we don't have an accurate
> > > > > > > > > > > way to model target resources (register pressure in PR80155 case / or
> > > > > > > > > > > possibly cache pressure in this case?) at tree level and we end up
> > > > > > > > > > > with register spill or cache miss inside loops, which may offset the
> > > > > > > > > > > benefit of hoisting. As previously discussed the right way to go is a
> > > > > > > > > > > live range splitting pass, at GIMPLE -> RTL border which can also help
> > > > > > > > > > > with other code-movement optimizations (or if the source had variables
> > > > > > > > > > > with long live ranges).
> > > > > > > > > > >
> > > > > > > > > > > I was wondering tho as a cheap workaround, would it make sense to
> > > > > > > > > > > check if we are hoisting across a "large" region of nested loops, and
> > > > > > > > > > > avoid in that case since hoisting may exert resource pressure inside
> > > > > > > > > > > loop region ? (Especially, in the cases where hoisted expressions were
> > > > > > > > > > > not originally AVAIL in any of the loop blocks, and the loop region
> > > > > > > > > > > doesn't benefit from hoisting).
> > > > > > > > > > >
> > > > > > > > > > > For instance:
> > > > > > > > > > > FOR_EACH_EDGE (e, ei, block)
> > > > > > > > > > > {
> > > > > > > > > > > /* Avoid hoisting across more than 3 nested loops */
> > > > > > > > > > > if (e->dest is a loop pre-header or loop header
> > > > > > > > > > > && nesting depth of loop is > 3)
> > > > > > > > > > > return false;
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > I think this would work for resolving the calculix issue because it
> > > > > > > > > > > > hoists across one region of 6 loops and another of 4 loops (I didn't test
> > > > > > > > > > > > yet).
> > > > > > > > > > > > It's not bulletproof in that it will miss detecting cases where loop
> > > > > > > > > > > > header (or pre-header) isn't a successor of candidate block (checking
> > > > > > > > > > > > for that might get expensive tho?). I will test it on gcc suite and SPEC
> > > > > > > > > > > > for any regressions.
> > > > > > > > > > > Does this sound like a reasonable heuristic ?
> > > > > > > > > > Hi,
> > > > > > > > > > The attached patch implements the above heuristic.
> > > > > > > > > > Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> > > > > > > > > > And it brings back most of performance for calculix on par with O2
> > > > > > > > > > (without inlining orthonl).
> > > > > > > > > > I verified that with patch there is no cache miss happening on load
> > > > > > > > > > insn inside loop
> > > > > > > > > > (with perf report -e L1-dcache-load-misses/period=1000000/)
> > > > > > > > > >
> > > > > > > > > > I am in the process of benchmarking the patch on aarch64 for SPEC for
> > > > > > > > > > speed and will report numbers
> > > > > > > > > > in couple of days. (If required, we could parametrize number of nested
> > > > > > > > > > loops, hardcoded (arbitrarily to) 3 in this patch,
> > > > > > > > > > and set it in backend to not affect other targets).
> > > > > > > > >
> > > > > > > > > I don't think this patch captures the case in a sensible way - it will simply
> > > > > > > > > never hoist computations out of loop header blocks with depth > 3 which
> > > > > > > > > is certainly not what you want. Also the pre-header check is odd - we're
> > > > > > > > > looking for computations in successors of BLOCK but clearly a computation
> > > > > > > > > in a pre-header is not at the same loop level as one in the header itself.
> > > > > > > > Well, my intent was to check if we are hoisting across a region,
> > > > > > > > which has multiple nested loops, and in that case, avoid hoisting expressions
> > > > > > > > that do not belong to any loop blocks, because that may increase
> > > > > > > > resource pressure
> > > > > > > > inside loops. For instance, in the calculix issue we hoist 'w' array
> > > > > > > > from post-dom and neither
> > > > > > > > loop region has any uses of 'w'. I agree checking just for loop level
> > > > > > > > is too coarse.
> > > > > > > > The check with pre-header was essentially the same to see if we are
> > > > > > > > hoisting across a loop region,
> > > > > > > > not necessarily from within the loops.
> > > > > > >
> > > > > > > But it will fail to hoist *p in
> > > > > > >
> > > > > > > if (x)
> > > > > > > {
> > > > > > > a = *p;
> > > > > > > }
> > > > > > > else
> > > > > > > {
> > > > > > > b = *p;
> > > > > > > }
> > > > > > >
> > > > > > > <make distance large>
> > > > > > > pdom:
> > > > > > > c = *p;
> > > > > > >
> > > > > > > so it isn't what matters either. What happens at the immediate post-dominator
> > > > > > > isn't directly relevant - what matters would be if the pdom is the one making
> > > > > > > the value antic on one of the outgoing edges. In that case we're also going
> > > > > > > to PRE *p into the arm not computing *p (albeit in a later position). But
> > > > > > > that property is impossible to compute from the sets itself (not even mentioning
> > > > > > > the arbitrary CFG that can be inbetween the block and its pdom or the weird
> > > > > > > pdoms we compute for regions not having a path to exit, like infinite loops).
> > > > > > >
> > > > > > > You could eventually look at the pdom predecessors and if *p is not AVAIL_OUT
> > > > > > > in each of them we _might_ have the situation you want to protect against.
> > > > > > > But as said PRE insertion will likely have made sure it _is_ AVAIL_OUT in each
> > > > > > > of them ...
> > > > > > Hi Richard,
> > > > > > Thanks for the suggestions. Right, the issue seems to be here that
> > > > > > post-dom block is making expressions ANTIC. Before doing insert, could
> > > > > > we simply copy AVAIL_OUT sets of each block into another set say ORIG_AVAIL_OUT,
> > > > > > as a guard against PRE eventually inserting expressions in pred blocks
> > > > > > of pdom and making them available?
> > > > > > And during hoisting, we could check if each expr that is ANTIC_IN
> > > > > > (pdom) is ORIG_AVAIL_OUT in each pred of pdom,
> > > > > > if the distance is "large".
> > > > >
> > > > > Did you try if it works w/o copying AVAIL_OUT? Because AVAIL_OUT is
> > > > > very large (it's actually quadratic in size of the CFG * # values), in
> > > > > particular
> > > > > we're inserting in RPO and update AVAIL_OUT only up to the current block
> > > > > (from dominators) so the PDOM block should have the original AVAIL_OUT
> > > > > (but from the last iteration - we do iterate insertion).
> > > > >
> > > > > Note I'm still not happy with pulling off this kind of heuristics.
> > > > > What the suggestion
> > > > > means is that for
> > > > >
> > > > > if (x)
> > > > > y = *p;
> > > > > z = *p;
> > > > >
> > > > > we'll end up with
> > > > >
> > > > > if (x)
> > > > > y = *p;
> > > > > else
> > > > > z = *p;
> > > > >
> > > > > instead of
> > > > >
> > > > > tem = *p;
> > > > > if (x)
> > > > > y = tem;
> > > > > else
> > > > > z = tem;
> > > > >
> > > > > that is, we get the runtime benefit but not the code-size one
> > > > > (hoisting mostly helps code size plus allows if-conversion as followup
> > > > > in some cases). Now, if we iterate (like if we'd have a second hoisting pass)
> > > > > then the above would still cause hoisting - so the heuristic isn't sustainable.
> > > > > Again, sth like "distance" isn't really available.
> > > > Hi Richard,
> > > > It doesn't work without copying AVAIL_OUT.
> > > > I tried for small example with attached patch:
> > > >
> > > > int foo(int cond, int x, int y)
> > > > {
> > > > int t;
> > > > void f(int);
> > > >
> > > > if (cond)
> > > > t = x + y;
> > > > else
> > > > t = x - y;
> > > >
> > > > f (t);
> > > > int t2 = (x + y) * 10;
> > > > return t2;
> > > > }
> > > >
> > > > By intersecting availout_in_some with AVAIL_OUT of preds of pdom,
> > > > it does not hoist in first pass, but then PRE inserts x + y in the "else block",
> > > > and we eventually hoist before if (cond). Similarly for e_c3d
> > > > hoistings in calculix.
> > > >
> > > > IIUC, we want hoisting to be:
> > > > (ANTIC_IN (block) intersect AVAIL_OUT (preds of pdom)) - AVAIL_OUT (block)
> > > > to ensure that hoisted expressions are along all paths from block to post-dom ?
> > > > If copying AVAIL_OUT sets is too large, could we keep another set that
> > > > precomputes intersection of AVAIL_OUT of pdom preds
> > > > for each block and then use this info during hoisting ?
> > > >
> > > > For computing "distance", I implemented a simple DFS walk from block
> > > > to post-dom, that gives up if depth crosses
> > > > threshold before reaching post-dom. I am not sure tho, how expensive
> > > > that can get.
> > >
> > > As written it is quadratic in the CFG size.
> > >
> > > You can optimize away the
> > >
> > > + FOR_EACH_EDGE (e, ei, pdom_bb->preds)
> > > + bitmap_and_into (&availout_in_some, &AVAIL_OUT (e->src)->values);
> > >
> > > loop if the intersection of availout_in_some and ANTIC_IN (pdom) is empty.
> > >
> > > As said, I don't think this is the way to go - trying to avoid code
> > > hoisting isn't
> > > what we'd want to do - your quoted assembly instead points to a loop
> > > with a non-empty latch which is usually caused by PRE and avoided with -O3
> > > because it also harms vectorization.
> > But disabling PRE (which removes non empty latch), only results in
> > marginal performance improvement,
> > while disabling hoisting of 'w' array, with non-empty latch, seems to
> > gain most of the performance.
> > AFAIU, that was happening, because after disabling hoisting of 'w',
> > there wasn't a cache miss (as seen with perf -e
> > L1-dcache-load-misses),
> > on the load instruction inside the inner loop.
>
> But that doesn't make much sense then. If code generation isn't
> an issue I don't see how the hoisted loads should cause a L1
> dcache load miss for data that is accessed in the respective loop
> as well (though not hoisted from it since at -O2 not sufficiently
> unrolled)
Hi Richard,
I am very sorry to respond late, I was away from work for some
personal commitments, and couldn't respond earlier.
Yes, I agree this doesn't seem to make much sense, but I am
consistently seeing L1 dcache load misses, which go away
after disabling hoisting of 'w'. I am not sure tho why this happens.
Also, the load instruction is the only one that shows a
significant performance difference across several runs. Or maybe I am
interpreting the results incorrectly.
Do you have suggestions for any benchmarking experiment I could try ?
>
> > For the pdom heuristic, I guess we cannot copy AVAIL_OUT sets per
> > node, since that's quadratic in terms of CFG size.
> > Would it make sense to break interaction between PRE and hoisting,
> > only for the case when inserting into preds of pdom ?
> > I tried doing that in attached patch, where insert runs in two phases:
> > (a) PRE and hoisting, where hoisting marks block to not do PRE for.
> > (b) Second phase, which only runs PRE on all blocks.
> > This (expectedly) regresses ssa-hoist-3.c.
> >
> > If the heuristic isn't acceptable, I suppose encoding distance of expr
> > within ANTIC sets
> > during compute_antic would be the right way to fix this ?
> > So ANTIC_IN (block) contains the anticipated expressions, and for each
> > antic expr, the "distance" from the furthest block
> > it's computed in ? Could you please elaborate a bit on how we could go
> > about encoding distance during compute_antic ?
>
> But the distance in this case is just one CFG node ... we have
>
> if (mattyp.eq.1)
> ... use of w but not with constant indices
> else if (mattyp.eq.2)
> .. inlined orthonl with constant index w() accesses, single BB
> else
> ... use of w but not with constant indices - the actual relevant
> loop of calculix
> endif
> ... constant index w() accesses, single BB
>
> so the CFG distance is one node - unless you want to compute the
> maximum distance? Btw, I only see 9 loads hoisted.
>
> I'm not sure how relevant -O2 -flto SPEC performance is for a FP benchmark.
>
> And indeed this case is exactly one where hoisting is superior to
> PRE which would happily insert the 9 loads into the two variable-access
> predecessors to get rid of the redundancy wrt the mattyp.eq.1 path.
>
> In .optimized I see
>
> pretmp_5573 = w[0];
> pretmp_5574 = w[3];
> pretmp_5575 = w[6];
> pretmp_5576 = w[1];
> pretmp_5577 = w[4];
> pretmp_5578 = w[7];
> pretmp_5579 = w[2];
> pretmp_5580 = w[5];
> pretmp_5581 = w[8];
> if (mattyp.157_742 == 1)
>
> I do remember talks/patches about ordering of such sequences of loads
> to make them prefetch-happier. Are the loads actually emitted in-order
> for arm? Thus w[0]...w[8] rather than as seen above with some random
> permutes inbetween? On x86 they are emitted in random order
> (they are also spilled immediately).
On aarch64, they are emitted in random order as well.
Thanks,
Prathamesh
>
> Richard.
>
> > Thanks,
> > Prathamesh
> >
> >
> > Thanks,
> > Prathamesh
> > >
> > > Richard.
> > >
> > > > Thanks,
> > > > Prathamesh
> > > > >
> > > > > Richard.
> > > > >
> > > > > > Thanks,
> > > > > > Prathamesh
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > Note the difficulty to capture "distance" is that the distance is simply not
> > > > > > > > > available at this point - it is the anticipated values from the successors
> > > > > > > > > that do _not_ compute the value itself that are the issue. To capture
> > > > > > > > > "distance" you'd need to somehow "age" anticipated value when
> > > > > > > > > propagating them upwards during compute_antic (where it then
> > > > > > > > > would also apply to PRE in general).
> > > > > > > > Yes, indeed. As a hack, would it make sense to avoid inserting an
> > > > > > > > expression in the block,
> > > > > > > > if it's ANTIC in post-dom block as a trade-off between extending live
> > > > > > > > range and hoisting
> > > > > > > > if the "distance" between block and post-dom is "too far" ? In
> > > > > > > > general, as you point out, we'd need to compute,
> > > > > > > > distance info for successors block during compute_antic, but special
> > > > > > > > casing for post-dom should be easy enough
> > > > > > > > during do_hoist_insertion, and hoisting an expr that is ANTIC in
> > > > > > > > post-dom could be potentially "long range", if the region is large.
> > > > > > > > It's still a coarse heuristic tho. I tried it in the attached patch.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Prathamesh
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > As with all other heuristics the only place one could do hackish attempts
> > > > > > > > > with at least some reasoning is the elimination phase where
> > > > > > > > > we make use of the (hoist) insertions - of course for hoisting we already
> > > > > > > > > know we'll get the "close" use in one of the successors so I fear even
> > > > > > > > > there it will be impossible to do something sensible.
> > > > > > > > >
> > > > > > > > > Richard.
> > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Prathamesh
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Prathamesh
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Prathamesh
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Prathamesh
> > > > > > > > > > > > >
> > > > > > > > > > > > > Alexander
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: LTO slows down calculix by more than 10% on aarch64
2020-10-21 10:03 ` Prathamesh Kulkarni
@ 2020-10-21 10:39 ` Richard Biener
2020-10-28 6:55 ` Prathamesh Kulkarni
0 siblings, 1 reply; 25+ messages in thread
From: Richard Biener @ 2020-10-21 10:39 UTC (permalink / raw)
To: Prathamesh Kulkarni; +Cc: Alexander Monakov, GCC Development
On Wed, Oct 21, 2020 at 12:04 PM Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Thu, 24 Sep 2020 at 16:44, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> > On Thu, Sep 24, 2020 at 12:36 PM Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> > >
> > > On Wed, 23 Sep 2020 at 16:40, Richard Biener <richard.guenther@gmail.com> wrote:
> > > >
> > > > On Wed, Sep 23, 2020 at 12:11 PM Prathamesh Kulkarni
> > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > >
> > > > > On Wed, 23 Sep 2020 at 13:22, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > >
> > > > > > On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni
> > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > >
> > > > > > > On Tue, 22 Sep 2020 at 16:36, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I obtained perf stat results for following benchmark runs:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -O2:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 7856832.692380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > > > > > 3758 context-switches # 0.000 K/sec
> > > > > > > > > > > > > > > 40 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > > > > > 40847 page-faults # 0.005 K/sec
> > > > > > > > > > > > > > > 7856782413676 cycles # 1.000 GHz
> > > > > > > > > > > > > > > 6034510093417 instructions # 0.77 insn per cycle
> > > > > > > > > > > > > > > 363937274287 branches # 46.321 M/sec
> > > > > > > > > > > > > > > 48557110132 branch-misses # 13.34% of all branches
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > > > > > > > > > > > enough for this kind of code)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 8319643.114380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > > > > > 4285 context-switches # 0.001 K/sec
> > > > > > > > > > > > > > > 28 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > > > > > 40843 page-faults # 0.005 K/sec
> > > > > > > > > > > > > > > 8319591038295 cycles # 1.000 GHz
> > > > > > > > > > > > > > > 6276338800377 instructions # 0.75 insn per cycle
> > > > > > > > > > > > > > > 467400726106 branches # 56.180 M/sec
> > > > > > > > > > > > > > > 45986364011 branch-misses # 9.84% of all branches
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > > > > > > > > > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > > > > > > > > > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 8207331.088040 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > > > > > 2266 context-switches # 0.000 K/sec
> > > > > > > > > > > > > > > 32 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > > > > > > > > > > 8207292032467 cycles # 1.000 GHz
> > > > > > > > > > > > > > > 6035724436440 instructions # 0.74 insn per cycle
> > > > > > > > > > > > > > > 364415440156 branches # 44.401 M/sec
> > > > > > > > > > > > > > > 53138327276 branch-misses # 14.58% of all branches
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > > > > > > > > > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > > > > > > > > > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > > > > > > > > > > > not very complicated (though Fortran array addressing...).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 7797265.206850 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > > > > > 3139 context-switches # 0.000 K/sec
> > > > > > > > > > > > > > > 20 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > > > > > > > > > > 7797221351467 cycles # 1.000 GHz
> > > > > > > > > > > > > > > 6187348757324 instructions # 0.79 insn per cycle
> > > > > > > > > > > > > > > 461840800061 branches # 59.231 M/sec
> > > > > > > > > > > > > > > 26920311761 branch-misses # 5.83% of all branches
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > > > > > > > > > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > > > > > > > > > > > in insn count).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > > > > > > > > > > > consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > > > > > > > > > > > each iterating only 3 times).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Perf profiles for
> > > > > > > > > > > > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 3196866 │1f04: ldur d1, [x1, #-248]
> > > > > > > > > > > > > > > 216348301800│ add w0, w0, #0x1
> > > > > > > > > > > > > > > 985098 │ add x2, x2, #0x18
> > > > > > > > > > > > > > > 216215999206│ add x1, x1, #0x48
> > > > > > > > > > > > > > > 215630376504│ fmul d1, d5, d1
> > > > > > > > > > > > > > > 863829148015│ fmul d1, d1, d6
> > > > > > > > > > > > > > > 864228353526│ fmul d0, d1, d0
> > > > > > > > > > > > > > > 864568163014│ fmadd d2, d0, d16, d2
> > > > > > > > > > > > > > > │ cmp w0, #0x4
> > > > > > > > > > > > > > > 216125427594│ ↓ b.eq 1f34
> > > > > > > > > > > > > > > 15010377│ ldur d0, [x2, #-8]
> > > > > > > > > > > > > > > 143753737468│ ↑ b 1f04
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -O2 with inlined orthonl:
> > > > > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > > > > > > > > > > > > 144055883055│ add w0, w0, #0x1
> > > > > > > > > > > > > > > 72262104254│ add x2, x2, #0x18
> > > > > > > > > > > > > > > 143991169721│ add x1, x1, #0x48
> > > > > > > > > > > > > > > 288648917780│ fmul d15, d17, d15
> > > > > > > > > > > > > > > 864665644756│ fmul d15, d15, d18
> > > > > > > > > > > > > > > 863868426387│ fmul d14, d15, d14
> > > > > > > > > > > > > > > 865228159813│ fmadd d16, d14, d31, d16
> > > > > > > > > > > > > > > 245967│ cmp w0, #0x4
> > > > > > > > > > > > > > > 215396760545│ ↓ b.eq 1f28
> > > > > > > > > > > > > > > 704732365│ ldur d14, [x2, #-8]
> > > > > > > > > > > > > > > 143775979620│ ↑ b 1ef8
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This indicates that the loop only covers about 46-48% of overall time.
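As a rough check of the 46-48% figure, the per-instruction counts in the two annotated listings above can be summed and divided by the total cycle counts of the matching perf stat runs quoted earlier in the thread. This is only a sketch: it assumes the annotate column is in the same units as the `cycles` counter, treats the blank count on the `cmp` line as zero, and uses illustrative variable names.

```python
# Per-instruction counts copied from the two perf annotate listings.
# The cmp line in the first listing has no count; it is taken as 0 here.
no_hoist = [3196866, 216348301800, 985098, 216215999206, 215630376504,
            863829148015, 864228353526, 864568163014, 0, 216125427594,
            15010377, 143753737468]
inlined = [359871503840, 144055883055, 72262104254, 143991169721,
           288648917780, 864665644756, 863868426387, 865228159813,
           245967, 215396760545, 704732365, 143775979620]

# Total cycles from the matching perf stat runs (assumed to be the same units).
cycles_no_hoist = 7797221351467
cycles_inlined = 8319591038295

share_no_hoist = sum(no_hoist) / cycles_no_hoist
share_inlined = sum(inlined) / cycles_inlined
print(f"no hoisting: {share_no_hoist:.1%}, full hoisting: {share_inlined:.1%}")
```

Both shares land between 46% and 48%, so more than half of the run time is spent outside this innermost loop.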
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > High count on the initial ldur instruction could be explained if the loop
> > > > > > > > > > > > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > > > > > > > > > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > > > > > > > > > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > > > > > > > > > > > assembly.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > It may also be possible to check if code alignment matters, by compiling with
> > > > > > > > > > > > > > -falign-loops=32.
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > Thanks a lot for the detailed feedback, and I am sorry for the late response.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The hoisting region is:
> > > > > > > > > > > > > if(mattyp.eq.1) then
> > > > > > > > > > > > > 4 loops
> > > > > > > > > > > > > elseif(mattyp.eq.2) then
> > > > > > > > > > > > > {
> > > > > > > > > > > > > orthonl inlined into basic block;
> > > > > > > > > > > > > loads w[0] .. w[8]
> > > > > > > > > > > > > }
> > > > > > > > > > > > > else
> > > > > > > > > > > > > 6 loops // load anisox
> > > > > > > > > > > > >
> > > > > > > > > > > > > followed by basic block:
> > > > > > > > > > > > >
> > > > > > > > > > > > > senergy=
> > > > > > > > > > > > > & (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > > > > > > > > > > & +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > > > > > > > > > > & +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > > > > > > > > > > s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > > > > > > > > > > s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > > > > > > > > > > s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > > > > > > > > > >
> > > > > > > > > > > > > Code hoisting moves the loads of w[0] .. w[8] from the orthonl and senergy
> > > > > > > > > > > > > blocks into block 181, which is:
> > > > > > > > > > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > > > > > > > > > >
> > > > > > > > > > > > > and the loads are then hoisted further into block 173:
> > > > > > > > > > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > > > > > > > > > >
> > > > > > > > > > > > > From block 181, we have two paths towards senergy block (bb 194):
> > > > > > > > > > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > > > > > > > > > AND
> > > > > > > > > > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > > > > > > > > > > which has a path length of around 18 blocks.
> > > > > > > > > > > > > (bb 194 post-dominates bb 181 and bb 173).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Disabling only load hoisting within blocks 173 and 181
> > > > > > > > > > > > > (simply not inserting pre_expr if pre_expr->kind == REFERENCE)
> > > > > > > > > > > > > avoids hoisting the 'w' array and brings back most of the performance.
> > > > > > > > > > > > > That seems to confirm that it is the hoisting of the 'w' array
> > > > > > > > > > > > > (w[0] ... w[8]) that is causing the slowdown?
> > > > > > > > > > > > >
> > > > > > > > > > > > > I obtained perf profiles for the 6 loops with full hoisting and with
> > > > > > > > > > > > > hoisting of the 'w' array disabled, and the most drastic difference was
> > > > > > > > > > > > > on the ldur instruction:
> > > > > > > > > > > > >
> > > > > > > > > > > > > With full hoisting:
> > > > > > > > > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > > > > > > > > > >
> > > > > > > > > > > > > Without full hoisting:
> > > > > > > > > > > > > 3441224 │1edc: ldur d1, [x1, #-248]
> > > > > > > > > > > > >
> > > > > > > > > > > > > (The loop entry seems to be fall thru in both cases. I have attached
> > > > > > > > > > > > > profiles for both cases).
> > > > > > > > > > > > >
> > > > > > > > > > > > > IIUC, the instruction seems to be loading the first element of the anisox array,
> > > > > > > > > > > > > which makes me wonder if the issue is a data-cache miss in the slower version.
> > > > > > > > > > > > > I ran perf script on perf data for L1-dcache-load-misses with period = 1 million,
> > > > > > > > > > > > > and it reported two cache misses on the ldur instruction in the full hoisting case,
> > > > > > > > > > > > > while it reported zero for the disabled load hoisting case.
> > > > > > > > > > > > > So I wonder if the slowdown happens because hoisting of the 'w' array possibly
> > > > > > > > > > > > > results in eviction of anisox, thus causing a cache miss inside the inner loop
> > > > > > > > > > > > > and making the load slower?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hoisting also seems to reduce the overall number of cache misses tho.
> > > > > > > > > > > > > With hoisting of the 'w' array disabled, there were a total of 463
> > > > > > > > > > > > > cache misses, while with full hoisting there were 357 cache misses
> > > > > > > > > > > > > (with period = 1 million).
> > > > > > > > > > > > > Does that happen because hoisting probably reduces cache misses along
> > > > > > > > > > > > > the orthonl path (bb 173 -> bb 181 -> bb 182 -> bb 194)?
> > > > > > > > > > > > Hi,
> > > > > > > > > > > > In general I feel that for this case and for PR80155, the issues come from
> > > > > > > > > > > > long-range hoistings inside a large CFG: we don't have an accurate way to
> > > > > > > > > > > > model target resources (register pressure in the PR80155 case, or possibly
> > > > > > > > > > > > cache pressure in this case?) at tree level, so we end up with a register
> > > > > > > > > > > > spill or a cache miss inside loops, which may offset the benefit of hoisting.
> > > > > > > > > > > > As previously discussed, the right way to go is a live range splitting pass
> > > > > > > > > > > > at the GIMPLE -> RTL border, which can also help with other code-movement
> > > > > > > > > > > > optimizations (or if the source had variables with long live ranges).
> > > > > > > > > > > >
> > > > > > > > > > > > I was wondering tho as a cheap workaround, would it make sense to
> > > > > > > > > > > > check if we are hoisting across a "large" region of nested loops, and
> > > > > > > > > > > > avoid in that case since hoisting may exert resource pressure inside
> > > > > > > > > > > > loop region ? (Especially, in the cases where hoisted expressions were
> > > > > > > > > > > > not originally AVAIL in any of the loop blocks, and the loop region
> > > > > > > > > > > > doesn't benefit from hoisting).
> > > > > > > > > > > >
> > > > > > > > > > > > For instance:
> > > > > > > > > > > > FOR_EACH_EDGE (e, ei, block)
> > > > > > > > > > > >   {
> > > > > > > > > > > >     /* Avoid hoisting across more than 3 nested loops */
> > > > > > > > > > > >     if (e->dest is a loop pre-header or loop header
> > > > > > > > > > > >         && nesting depth of loop is > 3)
> > > > > > > > > > > >       return false;
> > > > > > > > > > > >   }
> > > > > > > > > > > >
> > > > > > > > > > > > I think this would resolve the calculix issue because it hoists across
> > > > > > > > > > > > one region of 6 loops and another of 4 loops (didn't test yet).
> > > > > > > > > > > > It's not bulletproof in that it will miss cases where the loop header
> > > > > > > > > > > > (or pre-header) isn't a successor of the candidate block (checking for
> > > > > > > > > > > > that might get expensive tho?). I will test it on gcc suite and SPEC
> > > > > > > > > > > > for any regressions.
> > > > > > > > > > > > Does this sound like a reasonable heuristic?
> > > > > > > > > > > Hi,
> > > > > > > > > > > The attached patch implements the above heuristic.
> > > > > > > > > > > Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> > > > > > > > > > > It brings back most of the performance for calculix, on par with O2
> > > > > > > > > > > (without inlining orthonl).
> > > > > > > > > > > I verified that with the patch there is no cache miss on the load
> > > > > > > > > > > insn inside the loop
> > > > > > > > > > > (with perf report -e L1-dcache-load-misses/period=1000000/)
> > > > > > > > > > >
> > > > > > > > > > > I am in the process of benchmarking the patch on aarch64 for SPEC for
> > > > > > > > > > > speed and will report numbers in a couple of days. (If required, we could
> > > > > > > > > > > parametrize the number of nested loops, hardcoded (arbitrarily) to 3 in this
> > > > > > > > > > > patch, and set it in the backend to not affect other targets).
> > > > > > > > > >
> > > > > > > > > > I don't think this patch captures the case in a sensible way - it will simply
> > > > > > > > > > never hoist computations out of loop header blocks with depth > 3 which
> > > > > > > > > > is certainly not what you want. Also the pre-header check is odd - we're
> > > > > > > > > > looking for computations in successors of BLOCK but clearly a computation
> > > > > > > > > > in a pre-header is not at the same loop level as one in the header itself.
> > > > > > > > > Well, my intent was to check if we are hoisting across a region
> > > > > > > > > which has multiple nested loops and, in that case, avoid hoisting expressions
> > > > > > > > > that do not belong to any loop blocks, because that may increase resource
> > > > > > > > > pressure inside loops. For instance, in the calculix issue we hoist the 'w'
> > > > > > > > > array from post-dom and neither loop region has any uses of 'w'. I agree
> > > > > > > > > checking just for loop level is too coarse.
> > > > > > > > > The check with pre-header was essentially the same: to see if we are
> > > > > > > > > hoisting across a loop region, not necessarily from within the loops.
> > > > > > > >
> > > > > > > > But it will fail to hoist *p in
> > > > > > > >
> > > > > > > > if (x)
> > > > > > > >   {
> > > > > > > >     a = *p;
> > > > > > > >   }
> > > > > > > > else
> > > > > > > >   {
> > > > > > > >     b = *p;
> > > > > > > >   }
> > > > > > > >
> > > > > > > > <make distance large>
> > > > > > > > pdom:
> > > > > > > >   c = *p;
> > > > > > > >
> > > > > > > > so it isn't what matters either. What happens at the immediate post-dominator
> > > > > > > > isn't directly relevant - what matters would be if the pdom is the one making
> > > > > > > > the value antic on one of the outgoing edges. In that case we're also going
> > > > > > > > to PRE *p into the arm not computing *p (albeit in a later position). But
> > > > > > > > that property is impossible to compute from the sets itself (not even mentioning
> > > > > > > > the arbitrary CFG that can be inbetween the block and its pdom or the weird
> > > > > > > > pdoms we compute for regions not having a path to exit, like infinite loops).
> > > > > > > >
> > > > > > > > You could eventually look at the pdom predecessors and if *p is not AVAIL_OUT
> > > > > > > > in each of them we _might_ have the situation you want to protect against.
> > > > > > > > But as said PRE insertion will likely have made sure it _is_ AVAIL_OUT in each
> > > > > > > > of them ...
> > > > > > > Hi Richard,
> > > > > > > Thanks for the suggestions. Right, the issue here seems to be that the
> > > > > > > post-dom block is making expressions ANTIC. Before doing insert, could
> > > > > > > we simply copy the AVAIL_OUT set of each block into another set, say ORIG_AVAIL_OUT,
> > > > > > > as a guard against PRE eventually inserting expressions in pred blocks
> > > > > > > of pdom and making them available?
> > > > > > > And during hoisting, we could check if each expr that is ANTIC_IN (pdom)
> > > > > > > is ORIG_AVAIL_OUT in each pred of pdom, if the distance is "large".
> > > > > >
> > > > > > Did you try if it works w/o copying AVAIL_OUT? Because AVAIL_OUT is
> > > > > > very large (it's actually quadratic in size of the CFG * # values), in
> > > > > > particular we're inserting in RPO and update AVAIL_OUT only up to the
> > > > > > current block (from dominators) so the PDOM block should have the original
> > > > > > AVAIL_OUT (but from the last iteration - we do iterate insertion).
> > > > > >
> > > > > > Note I'm still not happy with pulling off this kind of heuristics.
> > > > > > What the suggestion means is that for
> > > > > >
> > > > > > if (x)
> > > > > >   y = *p;
> > > > > > z = *p;
> > > > > >
> > > > > > we'll end up with
> > > > > >
> > > > > > if (x)
> > > > > >   y = *p;
> > > > > > else
> > > > > >   z = *p;
> > > > > >
> > > > > > instead of
> > > > > >
> > > > > > tem = *p;
> > > > > > if (x)
> > > > > >   y = tem;
> > > > > > else
> > > > > >   z = tem;
> > > > > >
> > > > > > that is, we get the runtime benefit but not the code-size one
> > > > > > (hoisting mostly helps code size plus allows if-conversion as followup
> > > > > > in some cases). Now, if we iterate (like if we'd have a second hoisting pass)
> > > > > > then the above would still cause hoisting - so the heuristic isn't sustainable.
> > > > > > Again, sth like "distance" isn't really available.
> > > > > Hi Richard,
> > > > > It doesn't work without copying AVAIL_OUT.
> > > > > I tried for small example with attached patch:
> > > > >
> > > > > int foo(int cond, int x, int y)
> > > > > {
> > > > >   int t;
> > > > >   void f(int);
> > > > >
> > > > >   if (cond)
> > > > >     t = x + y;
> > > > >   else
> > > > >     t = x - y;
> > > > >
> > > > >   f (t);
> > > > >   int t2 = (x + y) * 10;
> > > > >   return t2;
> > > > > }
> > > > >
> > > > > By intersecting availout_in_some with AVAIL_OUT of preds of pdom,
> > > > > it does not hoist in first pass, but then PRE inserts x + y in the "else block",
> > > > > and we eventually hoist before if (cond). Similarly for e_c3d
> > > > > hoistings in calculix.
> > > > >
> > > > > IIUC, we want hoisting to be:
> > > > > (ANTIC_IN (block) intersect AVAIL_OUT (preds of pdom)) - AVAIL_OUT (block)
> > > > > to ensure that hoisted expressions are along all paths from block to post-dom ?
> > > > > If copying AVAIL_OUT sets is too large, could we keep another set that
> > > > > precomputes intersection of AVAIL_OUT of pdom preds
> > > > > for each block and then use this info during hoisting ?
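The set expression proposed above can be written out with ordinary sets. The following is a toy sketch for the `foo` example only: the expression names and the ANTIC/AVAIL sets are written by hand for illustration and are not computed by GCC, and `hoist_candidates` is a hypothetical helper name.

```python
def hoist_candidates(antic_in_block, avail_out_block, avail_out_pdom_preds):
    """(ANTIC_IN(block) & AVAIL_OUT of every pdom pred) - AVAIL_OUT(block)."""
    on_all_paths = set.intersection(*avail_out_pdom_preds)
    return (antic_in_block & on_all_paths) - avail_out_block

# Hand-written sets for foo(): the block ending in `if (cond)`, its two arms,
# and the post-dominator that computes (x + y) * 10.
antic_in = {"x+y", "(x+y)*10"}
avail_block = {"x", "y"}
then_out = {"x", "y", "x+y"}
else_out = {"x", "y", "x-y"}

# First pass: x+y is not available on the else path, so nothing is hoisted.
first_pass = hoist_candidates(antic_in, avail_block, [then_out, else_out])

# Model PRE inserting x+y into the else arm before the next iteration.
second_pass = hoist_candidates(antic_in, avail_block,
                               [then_out, else_out | {"x+y"}])
print(first_pass, second_pass)
```

In this model the second call returns `x+y`, which mirrors the behaviour described above: once PRE has made the expression available in the else arm, the guard no longer blocks the hoist.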
> > > > >
> > > > > For computing "distance", I implemented a simple DFS walk from block
> > > > > to post-dom, that gives up if depth crosses
> > > > > threshold before reaching post-dom. I am not sure tho, how expensive
> > > > > that can get.
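A depth-limited walk of that kind could look roughly like the sketch below. It is not the code from the attached patch: the CFG is a plain successor dictionary with made-up block numbers, `is_far` is a hypothetical helper, and "depth" means depth in the DFS tree, so it is only an approximation of path length.

```python
def is_far(succs, block, pdom, limit):
    """Return True if a DFS from `block` goes `limit` edges deep
    without having reached `pdom` on that path."""
    seen = {block}
    stack = [(block, 0)]
    while stack:
        node, depth = stack.pop()
        if node == pdom:
            continue              # this path reached the post-dominator
        if depth >= limit:
            return True           # give up: threshold crossed
        for nxt in succs.get(node, ()):
            if nxt not in seen:   # visit each block once, so loops terminate
                seen.add(nxt)
                stack.append((nxt, depth + 1))
    return False

# Toy CFG: a short path 0 -> 1 -> 9 and a longer one through a loop (5 -> 3).
toy_cfg = {0: [1, 2], 1: [9], 2: [3], 3: [4], 4: [5], 5: [3, 9]}
print(is_far(toy_cfg, 0, 9, 3), is_far(toy_cfg, 0, 9, 10))
```

Each call visits every block at most once, so a single query is linear in the CFG size; running it for every candidate block is where the cost adds up.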
> > > >
> > > > As written it is quadratic in the CFG size.
> > > >
> > > > You can optimize away the
> > > >
> > > > +  FOR_EACH_EDGE (e, ei, pdom_bb->preds)
> > > > +    bitmap_and_into (&availout_in_some, &AVAIL_OUT (e->src)->values);
> > > >
> > > > loop if the intersection of availout_in_some and ANTIC_IN (pdom) is empty.
> > > >
> > > > As said, I don't think this is the way to go - trying to avoid code
> > > > hoisting isn't what we'd want to do - your quoted assembly instead points
> > > > to a loop with a non-empty latch which is usually caused by PRE and avoided
> > > > with -O3 because it also harms vectorization.
> > > But disabling PRE (which removes the non-empty latch) only results in a
> > > marginal performance improvement, while disabling hoisting of the 'w' array,
> > > with the non-empty latch still present, seems to regain most of the performance.
> > > AFAIU, that was happening because after disabling hoisting of 'w' there
> > > wasn't a cache miss (as seen with perf -e L1-dcache-load-misses)
> > > on the load instruction inside the inner loop.
> >
> > But that doesn't make much sense then. If code generation isn't
> > an issue I don't see how the hoisted loads should cause a L1
> > dcache load miss for data that is accessed in the respective loop
> > as well (though not hoisted from it since at -O2 not sufficiently
> > unrolled)
> Hi Richard,
> I am very sorry for the late response; I was away from work for some
> personal commitments and couldn't respond earlier.
> Yes, I agree this doesn't seem to make much sense, but I am
> consistently seeing L1 dcache load misses, which go away
> after disabling hoisting of 'w'. I am not sure tho why this happens.
> Also, the load instruction is the only one that shows a significant
> performance difference across several runs. Or maybe I am
> interpreting the results incorrectly.
> Do you have suggestions for any benchmarking experiment I could try?
I think you want to edit the bad assembly manually and try a few things,
like ordering the hoisted loads, and then re-measure. Unfortunately
modern CPU pipelines have no easy way to tell us why they're unhappy.
> >
> > > For the pdom heuristic, I guess we cannot copy AVAIL_OUT sets per
> > > node, since that's quadratic in terms of CFG size.
> > > Would it make sense to break interaction between PRE and hoisting,
> > > only for the case when inserting into preds of pdom ?
> > > I tried doing that in attached patch, where insert runs in two phases:
> > > (a) PRE and hoisting, where hoisting marks block to not do PRE for.
> > > (b) Second phase, which only runs PRE on all blocks.
> > > This (expectedly) regresses ssa-hoist-3.c.
> > >
> > > If the heuristic isn't acceptable, I suppose encoding the distance of an expr
> > > within ANTIC sets during compute_antic would be the right way to fix this?
> > > So ANTIC_IN (block) contains the anticipated expressions and, for each
> > > antic expr, the "distance" from the furthest block it's computed in?
> > > Could you please elaborate a bit on how we could go about encoding
> > > distance during compute_antic?
> >
> > But the distance in this case is just one CFG node ... we have
> >
> > if (mattyp.eq.1)
> >   ... use of w but not with constant indices
> > else if (mattyp.eq.2)
> >   ... inlined orthonl with constant index w() accesses, single BB
> > else
> >   ... use of w but not with constant indices - the actual relevant
> >       loop of calculix
> > endif
> > ... constant index w() accesses, single BB
> >
> > so the CFG distance is one node - unless you want to compute the
> > maximum distance? Btw, I only see 9 loads hoisted.
> >
> > I'm not sure how relevant -O2 -flto SPEC performance is for a FP benchmark.
> >
> > And indeed this case is exactly one where hoisting is superior to
> > PRE which would happily insert the 9 loads into the two variable-access
> > predecessors to get rid of the redundancy wrt the mattyp.eq.1 path.
> >
> > In .optimized I see
> >
> > pretmp_5573 = w[0];
> > pretmp_5574 = w[3];
> > pretmp_5575 = w[6];
> > pretmp_5576 = w[1];
> > pretmp_5577 = w[4];
> > pretmp_5578 = w[7];
> > pretmp_5579 = w[2];
> > pretmp_5580 = w[5];
> > pretmp_5581 = w[8];
> > if (mattyp.157_742 == 1)
> >
> > I do remember talks/patches about ordering of such sequences of loads
> > to make them prefetch-happier. Are the loads actually emitted in-order
> > for arm? Thus w[0]...w[8] rather than as seen above with some random
> > permutes inbetween? On x86 they are emitted in random order
> > (they are also spilled immediately).
> On aarch64, they are emitted in random order as well.
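On the index order in the `.optimized` dump quoted above: the sequence w[0], w[3], w[6], w[1], ... is what one gets by visiting w(i,j) with i in the outer loop and j in the inner loop under Fortran's column-major layout, where w(i,j) of a 3x3 array sits at zero-based offset (i-1) + 3*(j-1). A small check, assuming `w` is declared as w(3,3); whether the source actually walks the array in that order is not established here.

```python
def offset(i, j, rows=3):
    """Zero-based linear offset of w(i, j) in a column-major array."""
    return (i - 1) + rows * (j - 1)

# Index order of the pretmp loads in the .optimized dump.
dump_order = [0, 3, 6, 1, 4, 7, 2, 5, 8]

# Visit w(i, j) with i outermost and j innermost.
row_walk = [offset(i, j) for i in range(1, 4) for j in range(1, 4)]
print(row_walk == dump_order)
```

So the GIMPLE order follows a row-by-row traversal rather than increasing addresses, which is one thing to try changing when reordering the hoisted loads by hand.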
>
> Thanks,
> Prathamesh
>
>
> Thanks,
> Prathamesh
> >
> > Richard.
> >
> > > Thanks,
> > > Prathamesh
> > >
> > >
> > > Thanks,
> > > Prathamesh
> > > >
> > > > Richard.
> > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > > >
> > > > > > Richard.
> > > > > >
> > > > > > > Thanks,
> > > > > > > Prathamesh
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Note the difficulty to capture "distance" is that the distance is simply not
> > > > > > > > > > available at this point - it is the anticipated values from the successors
> > > > > > > > > > that do _not_ compute the value itself that are the issue. To capture
> > > > > > > > > > "distance" you'd need to somehow "age" anticipated value when
> > > > > > > > > > propagating them upwards during compute_antic (where it then
> > > > > > > > > > would also apply to PRE in general).
> > > > > > > > > Yes, indeed. As a hack, would it make sense to avoid inserting an
> > > > > > > > > expression in the block if it's ANTIC in the post-dom block, as a trade-off
> > > > > > > > > between extending live range and hoisting, if the "distance" between block
> > > > > > > > > and post-dom is "too far"? In general, as you point out, we'd need to compute
> > > > > > > > > distance info for successor blocks during compute_antic, but special casing
> > > > > > > > > for post-dom should be easy enough during do_hoist_insertion, and hoisting
> > > > > > > > > an expr that is ANTIC in post-dom could be potentially "long range" if the
> > > > > > > > > region is large.
> > > > > > > > > It's still a coarse heuristic tho. I tried it in the attached patch.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Prathamesh
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > As with all other heuristics the only place one could do hackish attempts
> > > > > > > > > > with at least some reasoning is the elimination phase where
> > > > > > > > > > we make use of the (hoist) insertions - of course for hoisting we already
> > > > > > > > > > know we'll get the "close" use in one of the successors so I fear even
> > > > > > > > > > there it will be impossible to do something sensible.
> > > > > > > > > >
> > > > > > > > > > Richard.
> > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Prathamesh
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Prathamesh
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Prathamesh
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Prathamesh
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Alexander
* Re: LTO slows down calculix by more than 10% on aarch64
2020-10-21 10:39 ` Richard Biener
@ 2020-10-28 6:55 ` Prathamesh Kulkarni
0 siblings, 0 replies; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-10-28 6:55 UTC (permalink / raw)
To: Richard Biener; +Cc: Alexander Monakov, GCC Development
On Wed, 21 Oct 2020 at 16:10, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Wed, Oct 21, 2020 at 12:04 PM Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Thu, 24 Sep 2020 at 16:44, Richard Biener <richard.guenther@gmail.com> wrote:
> > >
> > > On Thu, Sep 24, 2020 at 12:36 PM Prathamesh Kulkarni
> > > <prathamesh.kulkarni@linaro.org> wrote:
> > > >
> > > > On Wed, 23 Sep 2020 at 16:40, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > >
> > > > > On Wed, Sep 23, 2020 at 12:11 PM Prathamesh Kulkarni
> > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > >
> > > > > > On Wed, 23 Sep 2020 at 13:22, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > > >
> > > > > > > On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni
> > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > >
> > > > > > > > On Tue, 22 Sep 2020 at 16:36, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I obtained perf stat results for following benchmark runs:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -O2:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 7856832.692380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > > > > > > 3758 context-switches # 0.000 K/sec
> > > > > > > > > > > > > > > > 40 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > > > > > > 40847 page-faults # 0.005 K/sec
> > > > > > > > > > > > > > > > 7856782413676 cycles # 1.000 GHz
> > > > > > > > > > > > > > > > 6034510093417 instructions # 0.77 insn per cycle
> > > > > > > > > > > > > > > > 363937274287 branches # 46.321 M/sec
> > > > > > > > > > > > > > > > 48557110132 branch-misses # 13.34% of all branches
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > > > > > > > > > > > > enough for this kind of code)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 8319643.114380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > > > > > > 4285 context-switches # 0.001 K/sec
> > > > > > > > > > > > > > > > 28 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > > > > > > 40843 page-faults # 0.005 K/sec
> > > > > > > > > > > > > > > > 8319591038295 cycles # 1.000 GHz
> > > > > > > > > > > > > > > > 6276338800377 instructions # 0.75 insn per cycle
> > > > > > > > > > > > > > > > 467400726106 branches # 56.180 M/sec
> > > > > > > > > > > > > > > > 45986364011 branch-misses # 9.84% of all branches
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > > > > > > > > > > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > > > > > > > > > > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 8207331.088040 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > > > > > > 2266 context-switches # 0.000 K/sec
> > > > > > > > > > > > > > > > 32 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > > > > > > > > > > > 8207292032467 cycles # 1.000 GHz
> > > > > > > > > > > > > > > > 6035724436440 instructions # 0.74 insn per cycle
> > > > > > > > > > > > > > > > 364415440156 branches # 44.401 M/sec
> > > > > > > > > > > > > > > > 53138327276 branch-misses # 14.58% of all branches
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > > > > > > > > > > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > > > > > > > > > > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > > > > > > > > > > > > not very complicated (though Fortran array addressing...).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 7797265.206850 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > > > > > > > > > 3139 context-switches # 0.000 K/sec
> > > > > > > > > > > > > > > > 20 cpu-migrations # 0.000 K/sec
> > > > > > > > > > > > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > > > > > > > > > > > 7797221351467 cycles # 1.000 GHz
> > > > > > > > > > > > > > > > 6187348757324 instructions # 0.79 insn per cycle
> > > > > > > > > > > > > > > > 461840800061 branches # 59.231 M/sec
> > > > > > > > > > > > > > > > 26920311761 branch-misses # 5.83% of all branches
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > > > > > > > > > > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > > > > > > > > > > > > in insn count).
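The deltas discussed above can be recomputed directly from the perf stat counters quoted earlier in this message. A small sketch, with the run labels chosen here for illustration; the figures in the discussion are rounded, so the exact differences come out slightly different.

```python
# Counter values copied from the perf stat output quoted above.
runs = {
    "O2": dict(cycles=7856782413676, insns=6034510093417,
               branches=363937274287, branch_misses=48557110132),
    "O2+inline": dict(cycles=8319591038295, insns=6276338800377,
                      branches=467400726106, branch_misses=45986364011),
    "O2+inline, no hoisting": dict(cycles=7797221351467, insns=6187348757324,
                                   branches=461840800061,
                                   branch_misses=26920311761),
}

def delta(base, other):
    """Per-counter difference other - base."""
    return {k: runs[other][k] - runs[base][k] for k in runs[base]}

for base, other in [("O2", "O2+inline"),
                    ("O2+inline", "O2+inline, no hoisting")]:
    print(f"{base} -> {other}")
    for name, value in delta(base, other).items():
        print(f"  {name:14s} {value / 1e9:+9.1f}e9")
```

Disabling hoisting removes roughly 19e9 branch misses but about 522e9 cycles, which is the mismatch the paragraph above points at.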
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > > > > > > > > > > > > consider what the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > > > > > > > > > > > > each iterating only 3 times).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Perf profiles for
> > > > > > > > > > > > > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 3196866│ 1f04: ldur d1, [x1, #-248]
> > > > > > > > > > > > > > > > 216348301800│ add w0, w0, #0x1
> > > > > > > > > > > > > > > > 985098│ add x2, x2, #0x18
> > > > > > > > > > > > > > > > 216215999206│ add x1, x1, #0x48
> > > > > > > > > > > > > > > > 215630376504│ fmul d1, d5, d1
> > > > > > > > > > > > > > > > 863829148015│ fmul d1, d1, d6
> > > > > > > > > > > > > > > > 864228353526│ fmul d0, d1, d0
> > > > > > > > > > > > > > > > 864568163014│ fmadd d2, d0, d16, d2
> > > > > > > > > > > > > > > > │ cmp w0, #0x4
> > > > > > > > > > > > > > > > 216125427594│ ↓ b.eq 1f34
> > > > > > > > > > > > > > > > 15010377│ ldur d0, [x2, #-8]
> > > > > > > > > > > > > > > > 143753737468│ ↑ b 1f04
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -O2 with inlined orthonl:
> > > > > > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > > > > > > > > > > > > > 144055883055│ add w0, w0, #0x1
> > > > > > > > > > > > > > > > 72262104254│ add x2, x2, #0x18
> > > > > > > > > > > > > > > > 143991169721│ add x1, x1, #0x48
> > > > > > > > > > > > > > > > 288648917780│ fmul d15, d17, d15
> > > > > > > > > > > > > > > > 864665644756│ fmul d15, d15, d18
> > > > > > > > > > > > > > > > 863868426387│ fmul d14, d15, d14
> > > > > > > > > > > > > > > > 865228159813│ fmadd d16, d14, d31, d16
> > > > > > > > > > > > > > > > 245967│ cmp w0, #0x4
> > > > > > > > > > > > > > > > 215396760545│ ↓ b.eq 1f28
> > > > > > > > > > > > > > > > 704732365│ ldur d14, [x2, #-8]
> > > > > > > > > > > > > > > > 143775979620│ ↑ b 1ef8
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > High count on the initial ldur instruction could be explained if the loop
> > > > > > > > > > > > > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > > > > > > > > > > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > > > > > > > > > > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > > > > > > > > > > > > assembly.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > It may also be possible to check if code alignment matters, by compiling with
> > > > > > > > > > > > > > > -falign-loops=32.
> > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > Thanks a lot for the detailed feedback, and I am sorry for the late response.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The hoisting region is:
> > > > > > > > > > > > > > if(mattyp.eq.1) then
> > > > > > > > > > > > > > 4 loops
> > > > > > > > > > > > > > elseif(mattyp.eq.2) then
> > > > > > > > > > > > > > {
> > > > > > > > > > > > > > orthonl inlined into basic block;
> > > > > > > > > > > > > > loads w[0] .. w[8]
> > > > > > > > > > > > > > }
> > > > > > > > > > > > > > else
> > > > > > > > > > > > > > 6 loops // load anisox
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > followed by basic block:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > senergy=
> > > > > > > > > > > > > > & (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > > > > > > > > > > > & +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > > > > > > > > > > > & +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > > > > > > > > > > > s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > > > > > > > > > > > s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > > > > > > > > > > > s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Code hoisting moves the loads of w[0] .. w[8] from the orthonl and senergy
> > > > > > > > > > > > > > blocks up into block 181, which is:
> > > > > > > > > > > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > which is then further hoisted to block 173:
> > > > > > > > > > > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > From block 181, we have two paths towards senergy block (bb 194):
> > > > > > > > > > > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > > > > > > > > > > AND
> > > > > > > > > > > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > > > > > > > > > > > which has a path length of around 18 blocks.
> > > > > > > > > > > > > > (bb 194 post-dominates bb 181 and bb 173).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Disabling only load hoisting within blocks 173 and 181 (by simply not
> > > > > > > > > > > > > > inserting a pre_expr if pre_expr->kind == REFERENCE) avoids hoisting of
> > > > > > > > > > > > > > the 'w' array and brings back most of the performance. Does this verify
> > > > > > > > > > > > > > that it is the hoisting of the 'w' array (w[0] ... w[8]) that is
> > > > > > > > > > > > > > causing the slowdown?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I obtained perf profiles for full hoisting and for disabled hoisting of
> > > > > > > > > > > > > > the 'w' array for the 6 loops, and the most drastic difference was
> > > > > > > > > > > > > > for the ldur instruction:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > With full hoisting:
> > > > > > > > > > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Without full hoisting:
> > > > > > > > > > > > > > 3441224 │1edc: ldur d1, [x1, #-248]
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > (The loop entry seems to be fall thru in both cases. I have attached
> > > > > > > > > > > > > > profiles for both cases).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > IIUC, the instruction seems to be loading the first element of the anisox array,
> > > > > > > > > > > > > > which makes me wonder if the issue is a data-cache miss in the slower version.
> > > > > > > > > > > > > > I ran perf script on the perf data for L1-dcache-load-misses with period = 1 million,
> > > > > > > > > > > > > > and it reported two cache misses on the ldur instruction in the full hoisting case,
> > > > > > > > > > > > > > while it reported zero for the disabled load hoisting case.
> > > > > > > > > > > > > > So I wonder if the slowdown happens because hoisting of the 'w' array
> > > > > > > > > > > > > > possibly results in eviction of anisox, thus causing a cache miss inside
> > > > > > > > > > > > > > the inner loop and making the load slower?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hoisting also seems to reduce the overall number of cache misses, though.
> > > > > > > > > > > > > > For disabled hoisting of 'w' array case, there were a total of 463
> > > > > > > > > > > > > > cache misses, while with full hoisting there were 357 cache misses
> > > > > > > > > > > > > > (with period = 1 million).
> > > > > > > > > > > > > > Does that happen because hoisting probably reduces cache misses along
> > > > > > > > > > > > > > the orthonl path (bb 173 - > bb 181 -> bb 182 -> bb 194) ?
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > In general I feel for this or PR80155 case, the issues come with long
> > > > > > > > > > > > > range hoistings, inside a large CFG, since we don't have an accurate
> > > > > > > > > > > > > way to model target resources (register pressure in PR80155 case / or
> > > > > > > > > > > > > possibly cache pressure in this case?) at tree level and we end up
> > > > > > > > > > > > > with register spill or cache miss inside loops, which may offset the
> > > > > > > > > > > > > benefit of hoisting. As previously discussed the right way to go is a
> > > > > > > > > > > > > live range splitting pass, at GIMPLE -> RTL border which can also help
> > > > > > > > > > > > > with other code-movement optimizations (or if the source had variables
> > > > > > > > > > > > > with long live ranges).
> > > > > > > > > > > > >
> > > > > > > > > > > > > I was wondering tho as a cheap workaround, would it make sense to
> > > > > > > > > > > > > check if we are hoisting across a "large" region of nested loops, and
> > > > > > > > > > > > > avoid in that case since hoisting may exert resource pressure inside
> > > > > > > > > > > > > loop region ? (Especially, in the cases where hoisted expressions were
> > > > > > > > > > > > > not originally AVAIL in any of the loop blocks, and the loop region
> > > > > > > > > > > > > doesn't benefit from hoisting).
> > > > > > > > > > > > >
> > > > > > > > > > > > > For instance:
> > > > > > > > > > > > > FOR_EACH_EDGE (e, ei, block)
> > > > > > > > > > > > > {
> > > > > > > > > > > > > /* Avoid hoisting across more than 3 nested loops */
> > > > > > > > > > > > > if (e->dest is a loop pre-header or loop header
> > > > > > > > > > > > > && nesting depth of loop is > 3)
> > > > > > > > > > > > > return false;
> > > > > > > > > > > > > }
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think this would work for resolving the calculix issue because it
> > > > > > > > > > > > > hoists across one region of 6 loops and another of 4 loops (I didn't
> > > > > > > > > > > > > test this yet).
> > > > > > > > > > > > > It's not bulletproof in that it will miss cases where the loop header
> > > > > > > > > > > > > (or pre-header) isn't a successor of the candidate block (checking for
> > > > > > > > > > > > > that might get expensive tho?). I will test it on the gcc testsuite and
> > > > > > > > > > > > > SPEC for any regressions.
> > > > > > > > > > > > > Does this sound like a reasonable heuristic ?
> > > > > > > > > > > > Hi,
> > > > > > > > > > > > The attached patch implements the above heuristic.
> > > > > > > > > > > > Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> > > > > > > > > > > > And it brings back most of performance for calculix on par with O2
> > > > > > > > > > > > (without inlining orthonl).
> > > > > > > > > > > > I verified that with patch there is no cache miss happening on load
> > > > > > > > > > > > insn inside loop
> > > > > > > > > > > > (with perf report -e L1-dcache-load-misses/period=1000000/)
> > > > > > > > > > > >
> > > > > > > > > > > > I am in the process of benchmarking the patch on aarch64 for SPEC for
> > > > > > > > > > > > speed and will report numbers in a couple of days. (If required, we
> > > > > > > > > > > > could parametrize the number of nested loops, hardcoded (arbitrarily)
> > > > > > > > > > > > to 3 in this patch, and set it in the backend to not affect other targets).
> > > > > > > > > > >
> > > > > > > > > > > I don't think this patch captures the case in a sensible way - it will simply
> > > > > > > > > > > never hoist computations out of loop header blocks with depth > 3 which
> > > > > > > > > > > is certainly not what you want. Also the pre-header check is odd - we're
> > > > > > > > > > > looking for computations in successors of BLOCK but clearly a computation
> > > > > > > > > > > in a pre-header is not at the same loop level as one in the header itself.
> > > > > > > > > > Well, my intent was to check if we are hoisting across a region
> > > > > > > > > > which has multiple nested loops, and in that case, avoid hoisting
> > > > > > > > > > expressions that do not belong to any loop blocks, because that may
> > > > > > > > > > increase resource pressure inside loops. For instance, in the calculix
> > > > > > > > > > issue we hoist the 'w' array from the post-dom and neither loop region
> > > > > > > > > > has any uses of 'w'. I agree checking just for loop level is too coarse.
> > > > > > > > > > The check with the pre-header was essentially the same, to see if we are
> > > > > > > > > > hoisting across a loop region, not necessarily from within the loops.
> > > > > > > > >
> > > > > > > > > But it will fail to hoist *p in
> > > > > > > > >
> > > > > > > > > if (x)
> > > > > > > > > {
> > > > > > > > > a = *p;
> > > > > > > > > }
> > > > > > > > > else
> > > > > > > > > {
> > > > > > > > > b = *p;
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > <make distance large>
> > > > > > > > > pdom:
> > > > > > > > > c = *p;
> > > > > > > > >
> > > > > > > > > so it isn't what matters either. What happens at the immediate post-dominator
> > > > > > > > > isn't directly relevant - what matters would be if the pdom is the one making
> > > > > > > > > the value antic on one of the outgoing edges. In that case we're also going
> > > > > > > > > to PRE *p into the arm not computing *p (albeit in a later position). But
> > > > > > > > > that property is impossible to compute from the sets itself (not even mentioning
> > > > > > > > > the arbitrary CFG that can be inbetween the block and its pdom or the weird
> > > > > > > > > pdoms we compute for regions not having a path to exit, like infinite loops).
> > > > > > > > >
> > > > > > > > > You could eventually look at the pdom predecessors and if *p is not AVAIL_OUT
> > > > > > > > > in each of them we _might_ have the situation you want to protect against.
> > > > > > > > > But as said PRE insertion will likely have made sure it _is_ AVAIL_OUT in each
> > > > > > > > > of them ...
> > > > > > > > Hi Richard,
> > > > > > > > Thanks for the suggestions. Right, the issue here seems to be that the
> > > > > > > > post-dom block is making expressions ANTIC. Before doing insert, could
> > > > > > > > we simply copy the AVAIL_OUT set of each block into another set, say
> > > > > > > > ORIG_AVAIL_OUT, as a guard against PRE eventually inserting expressions
> > > > > > > > in pred blocks of pdom and making them available?
> > > > > > > > And during hoisting, if the distance is "large", we could check if each
> > > > > > > > expr that is ANTIC_IN (pdom) is ORIG_AVAIL_OUT in each pred of pdom.
> > > > > > >
> > > > > > > Did you try if it works w/o copying AVAIL_OUT? I ask because AVAIL_OUT
> > > > > > > is very large (it's actually quadratic in size of the CFG * # values).
> > > > > > > In particular, we're inserting in RPO and update AVAIL_OUT only up to
> > > > > > > the current block (from dominators), so the PDOM block should have the
> > > > > > > original AVAIL_OUT (but from the last iteration - we do iterate insertion).
> > > > > > >
> > > > > > > Note I'm still not happy with pulling off this kind of heuristics.
> > > > > > > What the suggestion means is that for
> > > > > > >
> > > > > > > if (x)
> > > > > > > y = *p;
> > > > > > > z = *p;
> > > > > > >
> > > > > > > we'll end up with
> > > > > > >
> > > > > > > if (x)
> > > > > > > y = *p;
> > > > > > > else
> > > > > > > z = *p;
> > > > > > >
> > > > > > > instead of
> > > > > > >
> > > > > > > tem = *p;
> > > > > > > if (x)
> > > > > > > y = tem;
> > > > > > > else
> > > > > > > z = tem;
> > > > > > >
> > > > > > > that is, we get the runtime benefit but not the code-size one
> > > > > > > (hoisting mostly helps code size plus allows if-conversion as followup
> > > > > > > in some cases). Now, if we iterate (like if we'd have a second hoisting pass)
> > > > > > > then the above would still cause hoisting - so the heuristic isn't sustainable.
> > > > > > > Again, sth like "distance" isn't really available.
> > > > > > Hi Richard,
> > > > > > It doesn't work without copying AVAIL_OUT.
> > > > > > I tried for small example with attached patch:
> > > > > >
> > > > > > int foo(int cond, int x, int y)
> > > > > > {
> > > > > > int t;
> > > > > > void f(int);
> > > > > >
> > > > > > if (cond)
> > > > > > t = x + y;
> > > > > > else
> > > > > > t = x - y;
> > > > > >
> > > > > > f (t);
> > > > > > int t2 = (x + y) * 10;
> > > > > > return t2;
> > > > > > }
> > > > > >
> > > > > > By intersecting availout_in_some with AVAIL_OUT of preds of pdom,
> > > > > > it does not hoist in first pass, but then PRE inserts x + y in the "else block",
> > > > > > and we eventually hoist before if (cond). Similarly for e_c3d
> > > > > > hoistings in calculix.
> > > > > >
> > > > > > IIUC, we want hoisting to be:
> > > > > > (ANTIC_IN (block) intersect AVAIL_OUT (preds of pdom)) - AVAIL_OUT (block)
> > > > > > to ensure that hoisted expressions are along all paths from block to post-dom ?
> > > > > > If copying AVAIL_OUT sets is too large, could we keep another set that
> > > > > > precomputes intersection of AVAIL_OUT of pdom preds
> > > > > > for each block and then use this info during hoisting ?
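The set expression above can be sketched in C as follows. This is only an illustration under stated assumptions: each value is one bit of a 64-bit mask, and `value_set`, `hoist_candidates` and `foo_example` are hypothetical names, not GCC's bitmap_set API. AVAIL_OUT of the pdom predecessors is intersected across all predecessors, which is one reading of "AVAIL_OUT (preds of pdom)".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy representation: bit i set means value i is in the set.  */
typedef uint64_t value_set;

/* (ANTIC_IN (block) intersect AVAIL_OUT (all preds of pdom))
   minus AVAIL_OUT (block).  */
value_set
hoist_candidates (value_set antic_in_block, value_set avail_out_block,
                  const value_set *avail_out_pdom_preds, size_t n_preds)
{
  value_set avail_in_all_preds = ~(value_set) 0;

  assert (n_preds > 0);
  for (size_t i = 0; i < n_preds; i++)
    avail_in_all_preds &= avail_out_pdom_preds[i];

  return (antic_in_block & avail_in_all_preds) & ~avail_out_block;
}

/* The foo () example from this message: bit 0 is x + y, bit 1 is x - y.
   PRE_INSERTED models PRE having already put x + y into the else block.  */
value_set
foo_example (int pre_inserted)
{
  const value_set x_plus_y = 1u << 0, x_minus_y = 1u << 1;
  value_set preds[2];

  preds[0] = x_plus_y;                                /* then block */
  preds[1] = x_minus_y | (pre_inserted ? x_plus_y : 0); /* else block */

  /* x + y is anticipated at the "if (cond)" block; nothing is available
     out of it yet.  */
  return hoist_candidates (x_plus_y, 0, preds, 2);
}
```

With these assumptions foo_example (0) yields the empty set (no hoisting in the first pass), while foo_example (1) yields x + y again, which matches the behaviour described above once PRE has inserted into the else block.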
> > > > > >
> > > > > > For computing "distance", I implemented a simple DFS walk from block
> > > > > > to post-dom, that gives up if depth crosses
> > > > > > threshold before reaching post-dom. I am not sure tho, how expensive
> > > > > > that can get.
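A minimal sketch of such a depth-bounded DFS walk, on a hypothetical toy CFG rather than GCC's basic_block and edge types. The names `toy_cfg`, `pdom_is_far` and `example_is_far` are made up for illustration, and this is not the code from the attached patch. Because blocks are marked visited, the measured depth is the DFS tree depth and depends on successor order, so it only approximates path length.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BLOCKS 16
#define MAX_SUCCS 4

/* Hypothetical toy CFG: blocks are indices, succs[b] lists successors.  */
struct toy_cfg
{
  int n_blocks;
  int n_succs[MAX_BLOCKS];
  int succs[MAX_BLOCKS][MAX_SUCCS];
};

/* Return true if the walk gives up, i.e. reaches DEPTH == THRESHOLD on
   some branch before arriving at PDOM.  */
static bool
walk (const struct toy_cfg *cfg, int bb, int pdom, int depth,
      int threshold, bool *visited)
{
  if (bb == pdom)
    return false;               /* Reached the post-dominator in time.  */
  if (depth >= threshold)
    return true;                /* Too deep: give up.  */

  visited[bb] = true;
  for (int i = 0; i < cfg->n_succs[bb]; i++)
    {
      int succ = cfg->succs[bb][i];
      if (!visited[succ]
          && walk (cfg, succ, pdom, depth + 1, threshold, visited))
        return true;
    }
  return false;
}

bool
pdom_is_far (const struct toy_cfg *cfg, int block, int pdom, int threshold)
{
  bool visited[MAX_BLOCKS] = { false };

  assert (block >= 0 && block < cfg->n_blocks);
  assert (pdom >= 0 && pdom < cfg->n_blocks);
  return walk (cfg, block, pdom, 0, threshold, visited);
}

/* Example: 0 -> 1 -> 5 is the short arm, 0 -> 2 -> 3 -> 4 -> 5 the long
   one; block 5 plays the role of the post-dominator.  */
bool
example_is_far (int threshold)
{
  struct toy_cfg cfg = { 6, { 2, 1, 1, 1, 1, 0 },
                         { { 1, 2 }, { 5 }, { 3 }, { 4 }, { 5 } } };
  return pdom_is_far (&cfg, 0, 5, threshold);
}
```

In this example the long arm needs 4 edges, so a threshold of 3 reports "far" and a threshold of 4 does not.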
> > > > >
> > > > > As written it is quadratic in the CFG size.
> > > > >
> > > > > You can optimize away the
> > > > >
> > > > > + FOR_EACH_EDGE (e, ei, pdom_bb->preds)
> > > > > + bitmap_and_into (&availout_in_some, &AVAIL_OUT (e->src)->values);
> > > > >
> > > > > loop if the intersection of availout_in_some and ANTIC_IN (pdom) is empty.
> > > > >
> > > > > As said, I don't think this is the way to go - trying to avoid code
> > > > > hoisting isn't what we'd want to do - your quoted assembly instead
> > > > > points to a loop with a non-empty latch which is usually caused by PRE
> > > > > and avoided with -O3 because it also harms vectorization.
> > > > But disabling PRE (which removes the non-empty latch) only results in
> > > > a marginal performance improvement, while disabling hoisting of the
> > > > 'w' array, with the non-empty latch still present, seems to regain
> > > > most of the performance.
> > > > AFAIU, that was happening because after disabling hoisting of 'w'
> > > > there wasn't a cache miss (as seen with perf -e
> > > > L1-dcache-load-misses) on the load instruction inside the inner loop.
> > >
> > > But that doesn't make much sense then. If code generation isn't
> > > an issue I don't see how the hoisted loads should cause a L1
> > > dcache load miss for data that is accessed in the respective loop
> > > as well (though not hoisted from it since at -O2 not sufficiently
> > > unrolled)
> > Hi Richard,
> > I am very sorry for the late response; I was away from work for some
> > personal commitments and couldn't respond earlier.
> > Yes, I agree this doesn't seem to make much sense, but I am
> > consistently seeing L1 dcache load misses, which go away
> > after disabling hoisting of 'w'. I am not sure tho why this happens.
> > Also, the load instruction is the only one that shows a significant
> > performance difference across several runs. Or maybe I am
> > interpreting the results incorrectly.
> > Do you have suggestions for any benchmarking experiment I could try ?
>
> I think you want to edit the bad assembly manually and try a few things,
> like ordering the hoisted loads and then re-measure. Unfortunately
> modern CPU pipelines have no easy way to tell us why they're unhappy.
Hi Richard,
Thanks for the suggestions. I reordered the hoisted loads to be
in-order, but that didn't seem to improve performance.
Thanks,
Prathamesh
>
> > >
> > > > For the pdom heuristic, I guess we cannot copy AVAIL_OUT sets per
> > > > node, since that's quadratic in terms of CFG size.
> > > > Would it make sense to break interaction between PRE and hoisting,
> > > > only for the case when inserting into preds of pdom ?
> > > > I tried doing that in attached patch, where insert runs in two phases:
> > > > (a) PRE and hoisting, where hoisting marks block to not do PRE for.
> > > > (b) Second phase, which only runs PRE on all blocks.
> > > > This (expectedly) regresses ssa-hoist-3.c.
> > > >
> > > > If the heuristic isn't acceptable, I suppose encoding distance of expr
> > > > within ANTIC sets
> > > > during compute_antic would be the right way to fix this ?
> > > > So ANTIC_IN (block) contains the anticipated expressions, and for each
> > > > antic expr, the "distance" from the furthest block
> > > > it's computed in ? Could you please elaborate a bit on how we could go
> > > > about encoding distance during compute_antic ?
> > >
> > > But the distance in this case is just one CFG node ... we have
> > >
> > > if (mattyp.eq.1)
> > > ... use of w but not with constant indices
> > > else if (mattyp.eq.2)
> > > .. inlined orthonl with constant index w() accesses, single BB
> > > else
> > > ... use of w but not with constant indices - the actual relevant
> > > loop of calculix
> > > endif
> > > ... constant index w() accesses, single BB
> > >
> > > so the CFG distance is one node - unless you want to compute the
> > > maximum distance? Btw, I only see 9 loads hoisted.
> > >
> > > I'm not sure how relevant -O2 -flto SPEC performance is for a FP benchmark.
> > >
> > > And indeed this case is exactly one where hoisting is superior to
> > > PRE which would happily insert the 9 loads into the two variable-access
> > > predecessors to get rid of the redundancy wrt the mattyp.eq.1 path.
> > >
> > > In .optimized I see
> > >
> > > pretmp_5573 = w[0];
> > > pretmp_5574 = w[3];
> > > pretmp_5575 = w[6];
> > > pretmp_5576 = w[1];
> > > pretmp_5577 = w[4];
> > > pretmp_5578 = w[7];
> > > pretmp_5579 = w[2];
> > > pretmp_5580 = w[5];
> > > pretmp_5581 = w[8];
> > > if (mattyp.157_742 == 1)
> > >
> > > I do remember talks/patches about ordering of such sequences of loads
> > > to make them prefetch-happier. Are the loads actually emitted in-order
> > > for arm? Thus w[0]...w[8] rather than as seen above with some random
> > > permutes inbetween? On x86 they are emitted in random order
> > > (they are also spilled immediately).
> > On aarch64, they are emitted in random order as well.
> >
> > Thanks,
> > Prathamesh
> > >
> > > Richard.
> > >
> > > > Thanks,
> > > > Prathamesh
> > > > >
> > > > > Richard.
> > > > >
> > > > > > Thanks,
> > > > > > Prathamesh
> > > > > > >
> > > > > > > Richard.
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Prathamesh
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Note the difficulty to capture "distance" is that the distance is simply not
> > > > > > > > > > > available at this point - it is the anticipated values from the successors
> > > > > > > > > > > that do _not_ compute the value itself that are the issue. To capture
> > > > > > > > > > > "distance" you'd need to somehow "age" anticipated value when
> > > > > > > > > > > propagating them upwards during compute_antic (where it then
> > > > > > > > > > > would also apply to PRE in general).
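One way to picture the "aging" idea is the sketch below. It is an assumption-laden toy, not how compute_antic works in GCC: it tracks a single value, ignores kills, requires an acyclic CFG whose blocks are numbered so that every successor has a larger index, and takes the maximum age over successors. The names `toy_cfg`, `compute_antic_age` and `example_age_at_branch` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BLOCKS 16
#define MAX_SUCCS 4
#define NOT_ANTIC (-1)

/* Hypothetical toy CFG: blocks are indices, succs[b] lists successors.  */
struct toy_cfg
{
  int n_blocks;
  int n_succs[MAX_BLOCKS];
  int succs[MAX_BLOCKS][MAX_SUCCS];
};

/* For one value: AGE[b] is 0 if block b computes it, NOT_ANTIC if it is
   not anticipated on every outgoing path, and otherwise one more than the
   largest age among the successors.  */
void
compute_antic_age (const struct toy_cfg *cfg, const bool *computes, int *age)
{
  for (int bb = cfg->n_blocks - 1; bb >= 0; bb--)
    {
      if (computes[bb])
        {
          age[bb] = 0;
          continue;
        }

      bool antic = cfg->n_succs[bb] > 0;
      int oldest = 0;
      for (int i = 0; i < cfg->n_succs[bb]; i++)
        {
          int succ = cfg->succs[bb][i];
          assert (succ > bb);   /* Successor ages are already final.  */
          if (age[succ] == NOT_ANTIC)
            antic = false;
          else if (age[succ] > oldest)
            oldest = age[succ];
        }
      age[bb] = antic ? oldest + 1 : NOT_ANTIC;
    }
}

/* Example: block 0 branches to block 1 (computes the value) and to the
   chain 2 -> 3 -> 4 (does not); both arms join in block 5 (computes).  */
int
example_age_at_branch (void)
{
  struct toy_cfg cfg = { 6, { 2, 1, 1, 1, 1, 0 },
                         { { 1, 2 }, { 5 }, { 3 }, { 4 }, { 5 } } };
  bool computes[MAX_BLOCKS] = { false, true, false, false, false, true };
  int age[MAX_BLOCKS];

  compute_antic_age (&cfg, computes, age);
  return age[0];
}
```

In the example the value reaches block 0 with age 4 because of the long arm, even though the other successor computes it directly; an insertion heuristic could compare such an age against a limit.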
> > > > > > > > > > Yes, indeed. As a hack, would it make sense to avoid inserting an
> > > > > > > > > > expression in the block if it's ANTIC in the post-dom block, as a
> > > > > > > > > > trade-off between extending live range and hoisting, if the
> > > > > > > > > > "distance" between block and post-dom is "too far"? In general, as
> > > > > > > > > > you point out, we'd need to compute distance info for successor
> > > > > > > > > > blocks during compute_antic, but special casing for post-dom should
> > > > > > > > > > be easy enough during do_hoist_insertion, and hoisting an expr that
> > > > > > > > > > is ANTIC in post-dom could be potentially "long range", if the
> > > > > > > > > > region is large.
> > > > > > > > > > It's still a coarse heuristic tho. I tried it in the attached patch.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Prathamesh
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > As with all other heuristics the only place one could do hackish attempts
> > > > > > > > > > > with at least some reasoning is the elimination phase where
> > > > > > > > > > > we make use of the (hoist) insertions - of course for hoisting we already
> > > > > > > > > > > know we'll get the "close" use in one of the successors so I fear even
> > > > > > > > > > > there it will be impossible to do something sensible.
> > > > > > > > > > >
> > > > > > > > > > > Richard.
> > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Prathamesh
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Prathamesh
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Prathamesh
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Alexander
Thread overview: 25+ messages
2020-08-26 10:32 LTO slows down calculix by more than 10% on aarch64 Prathamesh Kulkarni
2020-08-26 11:20 ` Richard Biener
2020-08-28 11:16 ` Prathamesh Kulkarni
2020-08-28 11:57 ` Richard Biener
2020-08-31 11:21 ` Prathamesh Kulkarni
2020-08-31 11:40 ` Jan Hubicka
2020-08-28 12:03 ` Alexander Monakov
2020-08-31 11:23 ` Prathamesh Kulkarni
2020-09-04 9:52 ` Prathamesh Kulkarni
2020-09-04 11:38 ` Alexander Monakov
2020-09-21 9:49 ` Prathamesh Kulkarni
2020-09-21 12:44 ` Prathamesh Kulkarni
2020-09-22 5:08 ` Prathamesh Kulkarni
2020-09-22 7:25 ` Richard Biener
2020-09-22 9:37 ` Prathamesh Kulkarni
2020-09-22 11:06 ` Richard Biener
2020-09-22 16:24 ` Prathamesh Kulkarni
2020-09-23 7:52 ` Richard Biener
2020-09-23 10:10 ` Prathamesh Kulkarni
2020-09-23 11:10 ` Richard Biener
2020-09-24 10:36 ` Prathamesh Kulkarni
2020-09-24 11:14 ` Richard Biener
2020-10-21 10:03 ` Prathamesh Kulkarni
2020-10-21 10:39 ` Richard Biener
2020-10-28 6:55 ` Prathamesh Kulkarni