* LTO slows down calculix by more than 10% on aarch64
@ 2020-08-26 10:32 Prathamesh Kulkarni
  2020-08-26 11:20 ` Richard Biener
From: Prathamesh Kulkarni @ 2020-08-26 10:32 UTC
  To: GCC Development

[-- Attachment #1: Type: text/plain, Size: 2982 bytes --]

Hi,
We're seeing a consistent regression of more than 10% on calculix with
-O2 -flto vs plain -O2 on aarch64 in our validation CI. I investigated
the issue a bit, and it seems the regression comes from inlining
orthonl into e_c3d; disabling that inlining brings back the
performance. However, inlining orthonl into e_c3d increases its size
from 3187 to 3837, i.e. by roughly 20%, which isn't too large.
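
For reference, the inlining decision itself can be inspected in the
IPA inline dump. A sketch, assuming an aarch64 gfortran and the full
benchmark sources; with -flto the dump is produced at link time:

$ gfortran -O2 -flto -fdump-ipa-inline-details <calculix sources> -o ccx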

I have attached two test cases: e_c3d.f, which has orthonl manually
inlined into e_c3d to "simulate" LTO's inlining, and e_c3d-orig.f,
which contains the unmodified function (gauss.f is included by both
files). Passing just -O2 is sufficient to reproduce.
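
Concretely, something like this should reproduce it (a sketch; on our
CI the files are built with an aarch64 gfortran):

$ gfortran -S -O2 e_c3d.f        # orthonl manually inlined ("simulates" LTO)
$ gfortran -S -O2 e_c3d-orig.f   # unmodified e_c3d calling orthonl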

It seems that inlining orthonl causes 20 hoistings into block 181,
which are then hoisted further into block 173, in particular hoistings
of w(1,1) ... w(3,3), which weren't possible without inlining. The
hoistings happen because the basic block that computes orthonl (line
672 of e_c3d.f) uses w(1,1) ... w(3,3), and so does the following
block at line 1035:

senergy=
     &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
     &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
     &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
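
The hoistings can be seen in the PRE dump, since code hoisting runs as
part of GIMPLE PRE. A sketch, using the attached file:

$ gfortran -c -O2 -fdump-tree-pre-details e_c3d.f
$ grep -i hoist e_c3d.f.*pre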

Disabling hoisting into blocks 173 (and 181) brings back most of the
performance. However, I am not able to understand why (or whether)
these hoistings of w(1,1) ... w(3,3) cause the slowdown. Looking at
the assembly, the hot code path in e_c3d (from perf) shows the
following code-gen diff.
For the inlined version:
.L122:
        ldr     d15, [x1, -248]
        add     w0, w0, 1
        add     x2, x2, 24
        add     x1, x1, 72
        fmul    d15, d17, d15
        fmul    d15, d15, d18
        fmul    d14, d15, d14
        fmadd   d16, d14, d31, d16
        cmp     w0, 4
        beq     .L121
        ldr     d14, [x2, -8]
        b       .L122

and for the non-inlined version:
.L118:
        ldr     d0, [x1, -248]
        add     w0, w0, 1
        ldr     d2, [x2, -8]
        add     x1, x1, 72
        add     x2, x2, 24
        fmul    d0, d3, d0
        fmul    d0, d0, d5
        fmul    d0, d0, d2
        fmadd   d1, d4, d0, d1
        cmp     w0, 4
        bne     .L118

which corresponds to the following loop at line 1014:
                                do n1=1,3
                                  s(iii1,jjj1)=s(iii1,jjj1)
     &                                  +anisox(m1,k1,n1,l1)
     &                                  *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
     &                                  *weight

I am not sure why hoisting would have any direct effect on this loop,
except perhaps that the hoisted values occupy more registers and so
increase register pressure; maybe that is why the inlined version uses
higher-numbered registers in the generated code. However, disabling
hoisting in blocks 173 and 181 also leads to 6 extra spills overall
(counted by grepping for stores to sp), so hoisting seems to be
helping here too. I am not sure how to proceed further, and would be
grateful for suggestions.
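
For reference, the spill count comes from a rough grep over the
generated assembly (a sketch, using the attached file with an aarch64
gfortran):

$ gfortran -S -O2 e_c3d.f
$ grep -c 'str.*\[sp' e_c3d.s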

Thanks,
Prathamesh

[-- Attachment #2: e_c3d.f --]
[-- Type: application/octet-stream, Size: 47165 bytes --]

!
!     CalculiX - A 3-dimensional finite element program
!              Copyright (C) 1998 Guido Dhondt
!
!     This program is free software; you can redistribute it and/or
!     modify it under the terms of the GNU General Public License as
!     published by the Free Software Foundation(version 2);
!     
!
!     This program is distributed in the hope that it will be useful,
!     but WITHOUT ANY WARRANTY; without even the implied warranty of 
!     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 
!     GNU General Public License for more details.
!
!     You should have received a copy of the GNU General Public License
!     along with this program; if not, write to the Free Software
!     Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
!
      subroutine e_c3d(co,nk,konl,lakonl,p1,p2,omx,bodyfx,ibod,s,sm,f,
     &  ff,nelem,nmethod,elcon,nelcon,rhcon,nrhcon,alcon,nalcon,alzero,
     &  ielmat,ielorien,norien,orab,ntmat_,
     &  t0,t1,ithermal,prestr,iprestr,vold,iperturb,nelemload,
     &  sideload,xload,nload,idist,sti,stx,eei,iexpl,plicon,
     &  nplicon,plkcon,nplkcon,xstiff,npmat_,dtime,
     &  matname,mint_,ncmat_,mass,stiffness,buckling,rhs,intscheme)
!
!     computation of the element matrix and rhs for the element with
!     the topology in konl
!
!     f: rhs with temperature and eigenstress contribution: for linear 
!        calculations only
!     ff: rhs without temperature and eigenstress contribution
!
!     nmethod=0: check for positive Jacobian
!     nmethod=1: stiffness matrix + right hand side
!     nmethod=2: stiffness matrix + mass matrix
!     nmethod=3: static stiffness + buckling stiffness
!     nmethod=4: right hand side (linear, iperturb=0)
!
      implicit none
!
      logical mass,stiffness,buckling,rhs
!
      character*5 sideload(*)
      character*8 lakonl
      character*20 matname(*),amat
!
      integer konl(20),ifaceq(8,6),nelemload(2,*),nk,ibod,nelem,nmethod,
     &  mattyp,ithermal,iprestr,iperturb,nload,idist,i,j,k,l,i1,i2,j1,
     &  k1,l1,ii,jj,ii1,jj1,id,ipointer,ig,m1,m2,m3,m4,kk,
     &  nelcon(2,*),nrhcon(*),nalcon(2,*),ielmat(*),ielorien(*),
     &  ntmat_,nope,nopes,norien,icmdl,ihyper,iexpl,kode,imat,mint2d,
     &  mint3d,mint_,ifacet(6,4),nopev,iorien,istiff,ncmat_,
     &  ifacew(8,5),intscheme,n,ipointeri,ipointerj,iii1,jjj1,n1
!
      integer nplicon(0:ntmat_,*),nplkcon(0:ntmat_,*),npmat_
!
      real*8 co(3,*),xl(3,20),shp(4,20),
     &  s(60,60),w(3,3),p1(3),p2(3),f(60),bodyf(3),bodyfx(3),ff(60),
     &  bf(3),q(3),shpj(4,20),elcon(0:ncmat_,ntmat_,*),
     &  rhcon(0:1,ntmat_,*),xkl(3,3),
     &  alcon(0:6,ntmat_,*),alzero(*),orab(7,*),t0(*),t1(*),
     &  anisox(3,3,3,3),beta(6),prestr(6,*),voldl(3,20),vo(3,3),
     &  xl2(3,8),xsj2(3),shp2(4,8),vold(0:3,*),xload(2,*),v(3,3,3,3),
     &  om,omx,e,un,al,um,xi,et,ze,tt,const,xsj,xsjj,sm(60,60),
     &  sti(6,mint_,*),stx(6,mint_,*),s11,s22,s33,s12,s13,s23,s11b,
     &  s22b,s33b,s12b,s13b,s23b,eei(6,mint_,*),t0l,t1l,stre(6),
     &  senergy,senergyb,rho,elas(21),alph(6),summass,summ,
     &  sume,factorm,factore,alp,elconloc(21),eth(6),exx,eyy,ezz,
     &  exy,exz,eyz,am1,weight,pgauss(3),dmass,xl1(3,8),term
!
      real*8 gauss2d1(2,1),gauss2d2(2,4),gauss2d3(2,9),gauss2d4(2,1),
     &  gauss2d5(2,3),gauss3d1(3,1),gauss3d2(3,8),gauss3d3(3,27),
     &  gauss3d4(3,1),gauss3d5(3,4),gauss3d6(3,15),gauss3d7(3,2),
     &  gauss3d8(3,9),gauss3d9(3,18),weight2d1(1),weight2d2(4),
     &  weight2d3(9),weight2d4(1),weight2d5(3),weight3d1(1),
     &  weight3d2(8),weight3d3(27),weight3d4(1),weight3d5(4),
     &  weight3d6(15),weight3d7(2),weight3d8(9),weight3d9(18)
!
      real*8 plicon(0:2*npmat_,ntmat_,*),plkcon(0:2*npmat_,ntmat_,*),
     &  xstiff(21,mint_,*),
     &  plconloc(82),dtime
!
      include "gauss.f"
!
      data ifaceq /4,3,2,1,11,10,9,12,
     &            5,6,7,8,13,14,15,16,
     &            1,2,6,5,9,18,13,17,
     &            2,3,7,6,10,19,14,18,
     &            3,4,8,7,11,20,15,19,
     &            4,1,5,8,12,17,16,20/
      data ifacet /1,3,2,7,6,5,
     &             1,2,4,5,9,8,
     &             2,3,4,6,10,9,
     &             1,4,3,8,10,7/
      data ifacew /1,3,2,9,8,7,0,0,
     &             4,5,6,10,11,12,0,0,
     &             1,2,5,4,7,14,10,13,
     &             2,3,6,5,8,15,11,14,
     &             4,6,3,1,12,15,9,13/
!
c      if(nmethod.eq.5) then
c         intscheme=1
c         nmethod=2
c      else
c         intscheme=0
c      endif
c!
c      mass=.false.
c      stiffness=.false.
c      buckling=.false.
c      rhs=.false.
c!
c      if(nmethod.eq.1) then
c         stiffness=.true.
c         rhs=.true.
c      elseif(nmethod.eq.2) then
c         mass=.true.
c         stiffness=.true.
c      elseif(nmethod.eq.3) then
c         stiffness=.true.
c         buckling=.true.
c      elseif(nmethod.eq.4) then
c         rhs=.true.
c      endif
!
      summass=0.d0
!
      imat=ielmat(nelem)
      amat=matname(imat)
      if(norien.gt.0) then
         iorien=ielorien(nelem)
      else
         iorien=0
      endif
!
      if(lakonl(4:4).eq.'2') then
         nope=20
         nopev=8
         nopes=8
      elseif(lakonl(4:4).eq.'8') then
         nope=8
         nopev=8
         nopes=4
      elseif(lakonl(4:5).eq.'10') then
         nope=10
         nopev=4
         nopes=6
      elseif(lakonl(4:4).eq.'4') then
         nope=4
         nopev=4
         nopes=3
      elseif(lakonl(4:5).eq.'15') then
         nope=15
         nopev=6
      else
         nope=6
         nopev=6
      endif
!
      if(intscheme.eq.0) then
         if(lakonl(4:5).eq.'8R') then
            mint2d=1
            mint3d=1
         elseif((lakonl(4:4).eq.'8').or.(lakonl(4:6).eq.'20R')) then
            mint2d=4
            mint3d=8
         elseif(lakonl(4:4).eq.'2') then
            mint2d=9
            mint3d=27
         elseif(lakonl(4:5).eq.'10') then
            mint2d=3
            mint3d=4
         elseif(lakonl(4:4).eq.'4') then
            mint2d=1
            mint3d=1
         elseif(lakonl(4:5).eq.'15') then
            mint3d=9
         else
            mint3d=2
         endif
      else
         if((lakonl(4:4).eq.'8').or.(lakonl(4:4).eq.'2')) then
            mint3d=27
         elseif((lakonl(4:5).eq.'10').or.(lakonl(4:4).eq.'4')) then
            mint3d=15
         else
            mint3d=18
         endif
      endif
!
!     computation of the coordinates of the local nodes
!
      do i=1,nope
        do j=1,3
          xl(j,i)=co(j,konl(i))
        enddo
      enddo
!
      if(nelcon(1,imat).lt.0) then
         ihyper=1
      else
         ihyper=0
      endif
!
!       initialisation for distributed forces
!
      if(rhs) then
        if(idist.ne.0) then
          do i=1,3*nope
            f(i)=0.d0
            ff(i)=0.d0
          enddo
        endif
      endif
!
!     displacements for 2nd order static and modal theory
!
      if((iperturb.ne.0).and.stiffness.and.(.not.buckling)) then
         do i1=1,nope
            do i2=1,3
               voldl(i2,i1)=vold(i2,konl(i1))
            enddo
         enddo
      endif
!
!     initialisation of sm
!
      if(mass.or.buckling) then
        do i=1,3*nope
          do j=1,3*nope
            sm(i,j)=0.d0
          enddo
        enddo
      endif
!
!     initialisation of s
!
      do i=1,3*nope
        do j=1,3*nope
          s(i,j)=0.d0
        enddo
      enddo
!
!     computation of the matrix: loop over the Gauss points
!
      do kk=1,mint3d
         if(intscheme.eq.0) then
            if(lakonl(4:5).eq.'8R') then
               xi=gauss3d1(1,kk)
               et=gauss3d1(2,kk)
               ze=gauss3d1(3,kk)
               weight=weight3d1(kk)
            elseif((lakonl(4:4).eq.'8').or.(lakonl(4:6).eq.'20R')) 
     &              then
               xi=gauss3d2(1,kk)
c               if(nope.eq.20) xi=xi+1.d0
               et=gauss3d2(2,kk)
               ze=gauss3d2(3,kk)
               weight=weight3d2(kk)
            elseif(lakonl(4:4).eq.'2') then
c               xi=gauss3d3(1,kk)+1.d0
               xi=gauss3d3(1,kk)
               et=gauss3d3(2,kk)
               ze=gauss3d3(3,kk)
               weight=weight3d3(kk)
            elseif(lakonl(4:5).eq.'10') then
               xi=gauss3d5(1,kk)
               et=gauss3d5(2,kk)
               ze=gauss3d5(3,kk)
               weight=weight3d5(kk)
            elseif(lakonl(4:4).eq.'4') then
               xi=gauss3d4(1,kk)
               et=gauss3d4(2,kk)
               ze=gauss3d4(3,kk)
               weight=weight3d4(kk)
            elseif(lakonl(4:5).eq.'15') then
               xi=gauss3d8(1,kk)
               et=gauss3d8(2,kk)
               ze=gauss3d8(3,kk)
               weight=weight3d8(kk)
            else
               xi=gauss3d7(1,kk)
               et=gauss3d7(2,kk)
               ze=gauss3d7(3,kk)
               weight=weight3d7(kk)
            endif
         else
            if((lakonl(4:4).eq.'8').or.(lakonl(4:4).eq.'2')) then
c               xi=gauss3d3(1,kk)+1.d0
               xi=gauss3d3(1,kk)
               et=gauss3d3(2,kk)
               ze=gauss3d3(3,kk)
               weight=weight3d3(kk)
            elseif((lakonl(4:5).eq.'10').or.(lakonl(4:4).eq.'4')) then
               xi=gauss3d6(1,kk)
               et=gauss3d6(2,kk)
               ze=gauss3d6(3,kk)
               weight=weight3d6(kk)
            else
               xi=gauss3d9(1,kk)
               et=gauss3d9(2,kk)
               ze=gauss3d9(3,kk)
               weight=weight3d9(kk)
            endif
         endif
!
!           calculation of the shape functions and their derivatives
!           in the gauss point
!
         if(nope.eq.20) then
            call shape20h(xi,et,ze,xl,xsj,shp)
         elseif(nope.eq.8) then
            call shape8h(xi,et,ze,xl,xsj,shp)
         elseif(nope.eq.10) then
            call shape10tet(xi,et,ze,xl,xsj,shp)
         elseif(nope.eq.4) then
            call shape4tet(xi,et,ze,xl,xsj,shp)
         elseif(nope.eq.15) then
            call shape15w(xi,et,ze,xl,xsj,shp)
         else
            call shape6w(xi,et,ze,xl,xsj,shp)
         endif
!
!           check the jacobian determinant
!
         if(xsj.lt.1.d-20) then
            write(*,*) '*WARNING in e_c3d: nonpositive jacobian'
            write(*,*) '         determinant in element',nelem
            write(*,*)
            xsj=dabs(xsj)
            nmethod=0
         endif
!
         if((iperturb.ne.0).and.stiffness.and.(.not.buckling))
     &        then
!
!              stresses for 2nd order static and modal theory
!
            s11=sti(1,kk,nelem)
            s22=sti(2,kk,nelem)
            s33=sti(3,kk,nelem)
            s12=sti(4,kk,nelem)
            s13=sti(5,kk,nelem)
            s23=sti(6,kk,nelem)
         endif
!
!           calculating the temperature in the integration
!           point
!
         t0l=0.d0
         t1l=0.d0
         if(ithermal.eq.1) then
            if(lakonl(4:5).eq.'8 ') then
               do i1=1,nope
                  t0l=t0l+t0(konl(i1))/8.d0
                  t1l=t1l+t1(konl(i1))/8.d0
               enddo
            elseif(lakonl(4:6).eq.'20 ') then
               call lintemp(t0,t1,konl,nope,kk,t0l,t1l)
            else
               do i1=1,nope
                  t0l=t0l+shp(4,i1)*t0(konl(i1))
                  t1l=t1l+shp(4,i1)*t1(konl(i1))
               enddo
            endif
         elseif(ithermal.ge.2) then
            if(lakonl(4:5).eq.'8 ') then
               do i1=1,nope
                  t0l=t0l+t0(konl(i1))/8.d0
                  t1l=t1l+vold(0,konl(i1))/8.d0
               enddo
            elseif(lakonl(4:6).eq.'20 ') then
               call lintemp_th(t0,vold,konl,nope,kk,t0l,t1l)
            else
               do i1=1,nope
                  t0l=t0l+shp(4,i1)*t0(konl(i1))
                  t1l=t1l+shp(4,i1)*vold(0,konl(i1))
               enddo
            endif
         endif
         tt=t1l-t0l
!
!           calculating the coordinates of the integration point
!           for material orientation purposes (for cylindrical
!           coordinate systems)
!
         if(iorien.gt.0) then
            do j=1,3
               pgauss(j)=0.d0
               do i1=1,nope
                  pgauss(j)=pgauss(j)+shp(4,i1)*co(j,konl(i1))
               enddo
            enddo
         endif
!
!           for deformation plasticity: calculating the Jacobian
!           and the inverse of the deformation gradient
!           needed to convert the stiffness matrix in the spatial
!           frame of reference to the material frame
!
         kode=nelcon(1,imat)
!
!           material data and local stiffness matrix
!
         istiff=1
         call materialdata(elcon,nelcon,rhcon,nrhcon,alcon,nalcon,
     &        imat,amat,iorien,pgauss,orab,ntmat_,elas,alph,rho,
     &        nelem,ithermal,alzero,mattyp,t0l,t1l,
     &        ihyper,istiff,elconloc,eth,kode,plicon,
     &        nplicon,plkcon,nplkcon,npmat_,
     &        plconloc,mint_,dtime,nelem,kk,
     &        xstiff,ncmat_)
!
         if(mattyp.eq.1) then
            e=elas(1)
            un=elas(2)
            um=e/(1.d0+un)
            al=un*um/(1.d0-2.d0*un)
            um=um/2.d0
         elseif(mattyp.eq.2) then
            call orthotropic(elas,anisox)
         else
            call anisotropic(elas,anisox)
         endif
!
!           initialisation for the body forces
!
         om=omx*rho
         if(rhs) then
            if(ibod.ne.0) then
               do ii=1,3
                  bodyf(ii)=bodyfx(ii)*rho
               enddo
            endif
         endif
!
         if(rhs) then
!
!             information for the rhs
!
!             residual stresses
!
            if((iprestr.eq.1).or.(ithermal.eq.1)) then
               if(iprestr.eq.0) then
                  do ii=1,6
                     beta(ii)=0.d0
                  enddo
               else
                  do ii=1,6
                     beta(ii)=-prestr(ii,nelem)
                  enddo
               endif
            endif
!
!             calculation of the thermal stresses in an undeformed body 
!             assumption; beta corresponds to initial stresses.
!
            if(ithermal.eq.1) then
               if(ihyper.eq.0) then
                  icmdl=2
                  call linel(ithermal,mattyp,beta,al,um,am1,alph,tt,
     &                 elas,icmdl,exx,eyy,ezz,exy,exz,eyz,stre,
     &                 anisox)
               endif
            endif
!
         elseif(buckling) then
!
!              buckling stresses
!
            s11b=stx(1,kk,nelem)
            s22b=stx(2,kk,nelem)
            s33b=stx(3,kk,nelem)
            s12b=stx(4,kk,nelem)
            s13b=stx(5,kk,nelem)
            s23b=stx(6,kk,nelem)
!
         endif
!
!           incorporating the jacobian determinant in the shape
!           functions
!
         xsjj=dsqrt(xsj)
         do i1=1,nope
            shpj(1,i1)=shp(1,i1)*xsjj
            shpj(2,i1)=shp(2,i1)*xsjj
            shpj(3,i1)=shp(3,i1)*xsjj
            shpj(4,i1)=shp(4,i1)*xsj
         enddo
!
!           determination of the stiffness, and/or mass and/or
!           buckling matrix
!
         if(stiffness.or.mass.or.buckling) then
!
            if((iperturb.eq.0).or.buckling)
     &           then
               jj1=1
               do jj=1,nope
!
                  ii1=1
                  do ii=1,jj
!
!                   all products of the shape functions for a given ii
!                   and jj
!
                     do i1=1,3
                        do j1=1,3
                           w(i1,j1)=shpj(i1,ii)*shpj(j1,jj)
                        enddo
                     enddo
!
!                   the following section calculates the static
!                   part of the stiffness matrix which, for buckling 
!                   calculations, is done in a preliminary static
!                   call
!
                     if(.not.buckling) then
!
                        if(mattyp.eq.1) then
!
                           s(ii1,jj1)=s(ii1,jj1)+(al*w(1,1)+
     &                          um*(2.d0*w(1,1)+w(2,2)+w(3,3)))*weight
                           s(ii1,jj1+1)=s(ii1,jj1+1)+(al*w(1,2)+
     &                          um*w(2,1))*weight
                           s(ii1,jj1+2)=s(ii1,jj1+2)+(al*w(1,3)+
     &                          um*w(3,1))*weight
                           s(ii1+1,jj1)=s(ii1+1,jj1)+(al*w(2,1)+
     &                          um*w(1,2))*weight
                           s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+(al*w(2,2)+
     &                          um*(2.d0*w(2,2)+w(1,1)+w(3,3)))*weight
                           s(ii1+1,jj1+2)=s(ii1+1,jj1+2)+(al*w(2,3)+
     &                          um*w(3,2))*weight
                           s(ii1+2,jj1)=s(ii1+2,jj1)+(al*w(3,1)+
     &                          um*w(1,3))*weight
                           s(ii1+2,jj1+1)=s(ii1+2,jj1+1)+(al*w(3,2)+
     &                          um*w(2,3))*weight
                           s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+(al*w(3,3)+
     &                          um*(2.d0*w(3,3)+w(2,2)+w(1,1)))*weight
!
                        elseif(mattyp.eq.2) then
!
                           s(ii1,jj1)=s(ii1,jj1)+(elas(1)*w(1,1)+
     &                          elas(7)*w(2,2)+elas(8)*w(3,3))*weight
                           s(ii1,jj1+1)=s(ii1,jj1+1)+(elas(2)*w(1,2)+
     &                          elas(7)*w(2,1))*weight
                           s(ii1,jj1+2)=s(ii1,jj1+2)+(elas(4)*w(1,3)+
     &                          elas(8)*w(3,1))*weight
                           s(ii1+1,jj1)=s(ii1+1,jj1)+(elas(7)*w(1,2)+
     &                          elas(2)*w(2,1))*weight
                           s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+
     &                          (elas(7)*w(1,1)+
     &                          elas(3)*w(2,2)+elas(9)*w(3,3))*weight
                           s(ii1+1,jj1+2)=s(ii1+1,jj1+2)+
     &                          (elas(5)*w(2,3)+
     &                          elas(9)*w(3,2))*weight
                           s(ii1+2,jj1)=s(ii1+2,jj1)+
     &                          (elas(8)*w(1,3)+
     &                          elas(4)*w(3,1))*weight
                           s(ii1+2,jj1+1)=s(ii1+2,jj1+1)+
     &                          (elas(9)*w(2,3)+
     &                          elas(5)*w(3,2))*weight
                           s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+
     &                          (elas(8)*w(1,1)+
     &                          elas(9)*w(2,2)+elas(6)*w(3,3))*weight
!
                        else
!
                           do i1=1,3
                              do j1=1,3
                                 do k1=1,3
                                    do l1=1,3
                                       s(ii1+i1-1,jj1+j1-1)=
     &                                      s(ii1+i1-1,jj1+j1-1)
     &                                      +anisox(i1,k1,j1,l1)
     &                                      *w(k1,l1)*weight
                                    enddo
                                 enddo
                              enddo
                           enddo
!
                        endif
!
!                     mass matrix
!
                        if(mass) then
                           sm(ii1,jj1)=sm(ii1,jj1)
     &                          +rho*shpj(4,ii)*shp(4,jj)*weight
                           sm(ii1+1,jj1+1)=sm(ii1,jj1)
                           sm(ii1+2,jj1+2)=sm(ii1,jj1)
                        endif
!
                     else
!
!                     buckling matrix  
!
                        senergyb=
     &                       (s11b*w(1,1)+s12b*(w(1,2)+w(2,1))
     &                       +s13b*(w(1,3)+w(3,1))+s22b*w(2,2)
     &                       +s23b*(w(2,3)+w(3,2))+s33b*w(3,3))*weight
                        sm(ii1,jj1)=sm(ii1,jj1)-senergyb
                        sm(ii1+1,jj1+1)=sm(ii1+1,jj1+1)-senergyb
                        sm(ii1+2,jj1+2)=sm(ii1+2,jj1+2)-senergyb
!
                     endif
!
                     ii1=ii1+3
                  enddo
                  jj1=jj1+3
               enddo
            else
!
!               stiffness matrix for static and modal
!               2nd order calculations
!
!               large displacement stiffness
!               
               do i1=1,3
                  do j1=1,3
                     vo(i1,j1)=0.d0
                     do k1=1,nope
                        vo(i1,j1)=vo(i1,j1)+shp(j1,k1)*voldl(i1,k1)
                     enddo
                  enddo
               enddo
!
               if(mattyp.eq.1) then
                  call wcoef(v,vo,al,um)
               endif
!
!               calculating the total mass of the element for
!               lumping purposes: only for explicit nonlinear
!               dynamic calculations
!
               if(mass.and.(iexpl.eq.1)) then
                  summass=summass+rho*xsj
               endif
!
               jj1=1
               do jj=1,nope
!
                  ii1=1
                  do ii=1,jj
!
!                   all products of the shape functions for a given ii
!                   and jj
!
                     do i1=1,3
                        do j1=1,3
                           w(i1,j1)=shpj(i1,ii)*shpj(j1,jj)
                        enddo
                     enddo
!
                     if(mattyp.eq.1) then
!
                        do m1=1,3
                           do m2=1,3
                              do m3=1,3
                                 do m4=1,3
                                    s(ii1+m2-1,jj1+m1-1)=
     &                                   s(ii1+m2-1,jj1+m1-1)
     &                                   +v(m4,m3,m2,m1)*w(m4,m3)*weight
                                 enddo
                              enddo
                           enddo
                        enddo
!                      
                     elseif(mattyp.eq.2) then
!
!                        call orthonl(w,vo,elas,s,ii1,jj1,weight)
      s(ii1,jj1)=s(ii1,jj1)+((elas( 1)+elas( 1)*vo(1,1)
     &+(elas( 1)+elas( 1)*vo(1,1)
     &)*vo(1,1)+(elas( 7)*vo(1,2))*vo(1,2)
     &+(elas( 8)*vo(1,3))
     &*vo(1,3))*w(1,1)
     &+(elas( 2)*vo(1,2)+(elas( 2)*vo(1,2))*vo(1,1)+(elas( 7)
     &+elas( 7)*vo(1,1))*vo(1,2)
     &)*w(1,2)
     &+(elas( 4)*vo(1,3)+(elas( 4)*vo(1,3))*vo(1,1)
     &+(elas( 8)+elas( 8)*vo(1,1))
     &*vo(1,3))*w(1,3)
     &+(elas( 7)*vo(1,2)+(elas( 7)*vo(1,2))*vo(1,1)+(elas( 2)
     &+elas( 2)*vo(1,1))*vo(1,2)
     &)*w(2,1)
     &+(elas( 7)+elas( 7)*vo(1,1)
     &+(elas( 7)+elas( 7)*vo(1,1)
     &)*vo(1,1)+(elas( 3)*vo(1,2))*vo(1,2)
     &+(elas( 9)*vo(1,3))
     &*vo(1,3))*w(2,2)
     &+((elas( 5)*vo(1,3))*vo(1,2)
     &+(elas( 9)*vo(1,2))
     &*vo(1,3))*w(2,3)
     &+(elas( 8)*vo(1,3)+(elas( 8)*vo(1,3))*vo(1,1)
     &+(elas( 4)+elas( 4)*vo(1,1))
     &*vo(1,3))*w(3,1)
     &+((elas( 9)*vo(1,3))*vo(1,2)
     &+(elas( 5)*vo(1,2))
     &*vo(1,3))*w(3,2)
     &+(elas( 8)+elas( 8)*vo(1,1)
     &+(elas( 8)+elas( 8)*vo(1,1)
     &)*vo(1,1)+(elas( 9)*vo(1,2))*vo(1,2)
     &+(elas( 6)*vo(1,3))
     &*vo(1,3))*w(3,3))*weight
      s(ii1,jj1+1)=s(ii1,jj1+1)+((elas( 1)*vo(2,1)
     &+(elas( 1)*vo(2,1)
     &)*vo(1,1)+(elas( 7)
     &+elas( 7)*vo(2,2))*vo(1,2)
     &+(elas( 8)*vo(2,3))
     &*vo(1,3))*w(1,1)
     &+(elas( 2)
     &+elas( 2)*vo(2,2)+(elas( 2)
     &+elas( 2)*vo(2,2))*vo(1,1)+(elas( 7)*vo(2,1))*vo(1,2)
     &)*w(1,2)
     &+(elas( 4)*vo(2,3)+(elas( 4)*vo(2,3))*vo(1,1)
     &+(elas( 8)*vo(2,1))
     &*vo(1,3))*w(1,3)
     &+(elas( 7)
     &+elas( 7)*vo(2,2)+(elas( 7)
     &+elas( 7)*vo(2,2))*vo(1,1)+(elas( 2)*vo(2,1))*vo(1,2)
     &)*w(2,1)
     &+(elas( 7)*vo(2,1)
     &+(elas( 7)*vo(2,1)
     &)*vo(1,1)+(elas( 3)
     &+elas( 3)*vo(2,2))*vo(1,2)
     &+(elas( 9)*vo(2,3))
     &*vo(1,3))*w(2,2)
     &+((elas( 5)*vo(2,3))*vo(1,2)
     &+(elas( 9)+elas( 9)*vo(2,2))
     &*vo(1,3))*w(2,3)
     &+(elas( 8)*vo(2,3)+(elas( 8)*vo(2,3))*vo(1,1)
     &+(elas( 4)*vo(2,1))
     &*vo(1,3))*w(3,1)
     &+((elas( 9)*vo(2,3))*vo(1,2)
     &+(elas( 5)+elas( 5)*vo(2,2))
     &*vo(1,3))*w(3,2)
     &+(elas( 8)*vo(2,1)
     &+(elas( 8)*vo(2,1)
     &)*vo(1,1)+(elas( 9)
     &+elas( 9)*vo(2,2))*vo(1,2)
     &+(elas( 6)*vo(2,3))
     &*vo(1,3))*w(3,3))*weight
      s(ii1,jj1+2)=s(ii1,jj1+2)+((elas( 1)*vo(3,1)
     &+(elas( 1)*vo(3,1)
     &)*vo(1,1)+(elas( 7)*vo(3,2))*vo(1,2)
     &+(elas( 8)+elas( 8)*vo(3,3))
     &*vo(1,3))*w(1,1)
     &+(elas( 2)*vo(3,2)
     &+(elas( 2)*vo(3,2))*vo(1,1)+(elas( 7)*vo(3,1))*vo(1,2)
     &)*w(1,2)
     &+(elas( 4)
     &+elas( 4)*vo(3,3)+(elas( 4)
     &+elas( 4)*vo(3,3))*vo(1,1)
     &+(elas( 8)*vo(3,1))
     &*vo(1,3))*w(1,3)
     &+(elas( 7)*vo(3,2)+(elas( 7)*vo(3,2))*vo(1,1)
     &+(elas( 2)*vo(3,1))*vo(1,2)
     &)*w(2,1)
     &+(elas( 7)*vo(3,1)
     &+(elas( 7)*vo(3,1)
     &)*vo(1,1)+(elas( 3)*vo(3,2))*vo(1,2)
     &+(elas( 9)+elas( 9)*vo(3,3))
     &*vo(1,3))*w(2,2)
     &+((elas( 5)
     &+elas( 5)*vo(3,3))*vo(1,2)
     &+(elas( 9)*vo(3,2))
     &*vo(1,3))*w(2,3)
     &+(elas( 8)
     &+elas( 8)*vo(3,3)+(elas( 8)
     &+elas( 8)*vo(3,3))*vo(1,1)
     &+(elas( 4)*vo(3,1))
     &*vo(1,3))*w(3,1)
     &+((elas( 9)
     &+elas( 9)*vo(3,3))*vo(1,2)
     &+(elas( 5)*vo(3,2))
     &*vo(1,3))*w(3,2)
     &+(elas( 8)*vo(3,1)
     &+(elas( 8)*vo(3,1)
     &)*vo(1,1)+(elas( 9)*vo(3,2))*vo(1,2)
     &+(elas( 6)+elas( 6)*vo(3,3))
     &*vo(1,3))*w(3,3))*weight
      s(ii1+1,jj1)=s(ii1+1,jj1)+((elas( 7)*vo(1,2)
     &+(elas( 1)+elas( 1)*vo(1,1)
     &)*vo(2,1)+(elas( 7)*vo(1,2))*vo(2,2)
     &+(elas( 8)*vo(1,3))
     &*vo(2,3))*w(1,1)
     &+(elas( 7)+elas( 7)*vo(1,1)
     &+(elas( 2)*vo(1,2))*vo(2,1)+(elas( 7)
     &+elas( 7)*vo(1,1))*vo(2,2)
     &)*w(1,2)
     &+((elas( 4)*vo(1,3))*vo(2,1)
     &+(elas( 8)+elas( 8)*vo(1,1))
     &*vo(2,3))*w(1,3)
     &+(elas( 2)+elas( 2)*vo(1,1)
     &+(elas( 7)*vo(1,2))*vo(2,1)+(elas( 2)
     &+elas( 2)*vo(1,1))*vo(2,2)
     &)*w(2,1)
     &+(elas( 3)*vo(1,2)+(elas( 7)+elas( 7)*vo(1,1)
     &)*vo(2,1)+(elas( 3)*vo(1,2))*vo(2,2)
     &+(elas( 9)*vo(1,3))
     &*vo(2,3))*w(2,2)
     &+(elas( 5)*vo(1,3)+(elas( 5)*vo(1,3))*vo(2,2)
     &+(elas( 9)*vo(1,2))
     &*vo(2,3))*w(2,3)
     &+((elas( 8)*vo(1,3))*vo(2,1)
     &+(elas( 4)+elas( 4)*vo(1,1))
     &*vo(2,3))*w(3,1)
     &+(elas( 9)*vo(1,3)+(elas( 9)*vo(1,3))*vo(2,2)
     &+(elas( 5)*vo(1,2))
     &*vo(2,3))*w(3,2)
     &+(elas( 9)*vo(1,2)+(elas( 8)+elas( 8)*vo(1,1)
     &)*vo(2,1)+(elas( 9)*vo(1,2))*vo(2,2)
     &+(elas( 6)*vo(1,3))
     &*vo(2,3))*w(3,3))*weight
      s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+((elas( 7)
     &+elas( 7)*vo(2,2)+(elas( 1)*vo(2,1)
     &)*vo(2,1)+(elas( 7)
     &+elas( 7)*vo(2,2))*vo(2,2)
     &+(elas( 8)*vo(2,3))
     &*vo(2,3))*w(1,1)
     &+(elas( 7)*vo(2,1)
     &+(elas( 2)
     &+elas( 2)*vo(2,2))*vo(2,1)+(elas( 7)*vo(2,1))*vo(2,2)
     &)*w(1,2)
     &+((elas( 4)*vo(2,3))*vo(2,1)
     &+(elas( 8)*vo(2,1))
     &*vo(2,3))*w(1,3)
     &+(elas( 2)*vo(2,1)
     &+(elas( 7)
     &+elas( 7)*vo(2,2))*vo(2,1)+(elas( 2)*vo(2,1))*vo(2,2)
     &)*w(2,1)
     &+(elas( 3)
     &+elas( 3)*vo(2,2)+(elas( 7)*vo(2,1)
     &)*vo(2,1)+(elas( 3)
     &+elas( 3)*vo(2,2))*vo(2,2)
     &+(elas( 9)*vo(2,3))
     &*vo(2,3))*w(2,2)
     &+(elas( 5)*vo(2,3)+(elas( 5)*vo(2,3))*vo(2,2)
     &+(elas( 9)+elas( 9)*vo(2,2))
     &*vo(2,3))*w(2,3)
     &+((elas( 8)*vo(2,3))*vo(2,1)
     &+(elas( 4)*vo(2,1))
     &*vo(2,3))*w(3,1)
     &+(elas( 9)*vo(2,3)+(elas( 9)*vo(2,3))*vo(2,2)
     &+(elas( 5)+elas( 5)*vo(2,2))
     &*vo(2,3))*w(3,2)
     &+(elas( 9)
     &+elas( 9)*vo(2,2)+(elas( 8)*vo(2,1)
     &)*vo(2,1)+(elas( 9)
     &+elas( 9)*vo(2,2))*vo(2,2)
     &+(elas( 6)*vo(2,3))
     &*vo(2,3))*w(3,3))*weight
      s(ii1+1,jj1+2)=s(ii1+1,jj1+2)+((elas( 7)*vo(3,2)+(elas( 1)*vo(3,1)
     &)*vo(2,1)+(elas( 7)*vo(3,2))*vo(2,2)
     &+(elas( 8)+elas( 8)*vo(3,3))
     &*vo(2,3))*w(1,1)
     &+(elas( 7)*vo(3,1)
     &+(elas( 2)*vo(3,2))*vo(2,1)+(elas( 7)*vo(3,1))*vo(2,2)
     &)*w(1,2)
     &+((elas( 4)
     &+elas( 4)*vo(3,3))*vo(2,1)
     &+(elas( 8)*vo(3,1))
     &*vo(2,3))*w(1,3)
     &+(elas( 2)*vo(3,1)
     &+(elas( 7)*vo(3,2))*vo(2,1)+(elas( 2)*vo(3,1))*vo(2,2)
     &)*w(2,1)
     &+(elas( 3)*vo(3,2)+(elas( 7)*vo(3,1)
     &)*vo(2,1)+(elas( 3)*vo(3,2))*vo(2,2)
     &+(elas( 9)+elas( 9)*vo(3,3))
     &*vo(2,3))*w(2,2)
     &+(elas( 5)
     &+elas( 5)*vo(3,3)+(elas( 5)
     &+elas( 5)*vo(3,3))*vo(2,2)
     &+(elas( 9)*vo(3,2))
     &*vo(2,3))*w(2,3)
     &+((elas( 8)
     &+elas( 8)*vo(3,3))*vo(2,1)
     &+(elas( 4)*vo(3,1))
     &*vo(2,3))*w(3,1)
     &+(elas( 9)
     &+elas( 9)*vo(3,3)+(elas( 9)
     &+elas( 9)*vo(3,3))*vo(2,2)
     &+(elas( 5)*vo(3,2))
     &*vo(2,3))*w(3,2)
     &+(elas( 9)*vo(3,2)+(elas( 8)*vo(3,1)
     &)*vo(2,1)+(elas( 9)*vo(3,2))*vo(2,2)
     &+(elas( 6)+elas( 6)*vo(3,3))
     &*vo(2,3))*w(3,3))*weight
      s(ii1+2,jj1)=s(ii1+2,jj1)+((elas( 8)*vo(1,3)
     &+(elas( 1)+elas( 1)*vo(1,1)
     &)*vo(3,1)+(elas( 7)*vo(1,2))*vo(3,2)
     &+(elas( 8)*vo(1,3))
     &*vo(3,3))*w(1,1)
     &+((elas( 2)*vo(1,2))*vo(3,1)+(elas( 7)
     &+elas( 7)*vo(1,1))*vo(3,2)
     &)*w(1,2)
     &+(elas( 8)+elas( 8)*vo(1,1)
     &+(elas( 4)*vo(1,3))*vo(3,1)
     &+(elas( 8)+elas( 8)*vo(1,1))
     &*vo(3,3))*w(1,3)
     &+((elas( 7)*vo(1,2))*vo(3,1)+(elas( 2)
     &+elas( 2)*vo(1,1))*vo(3,2)
     &)*w(2,1)
     &+(elas( 9)*vo(1,3)+(elas( 7)+elas( 7)*vo(1,1)
     &)*vo(3,1)+(elas( 3)*vo(1,2))*vo(3,2)
     &+(elas( 9)*vo(1,3))
     &*vo(3,3))*w(2,2)
     &+(elas( 9)*vo(1,2)+(elas( 5)*vo(1,3))*vo(3,2)
     &+(elas( 9)*vo(1,2))
     &*vo(3,3))*w(2,3)
     &+(elas( 4)+elas( 4)*vo(1,1)
     &+(elas( 8)*vo(1,3))*vo(3,1)
     &+(elas( 4)+elas( 4)*vo(1,1))
     &*vo(3,3))*w(3,1)
     &+(elas( 5)*vo(1,2)+(elas( 9)*vo(1,3))*vo(3,2)
     &+(elas( 5)*vo(1,2))
     &*vo(3,3))*w(3,2)
     &+(elas( 6)*vo(1,3)+(elas( 8)+elas( 8)*vo(1,1)
     &)*vo(3,1)+(elas( 9)*vo(1,2))*vo(3,2)
     &+(elas( 6)*vo(1,3))
     &*vo(3,3))*w(3,3))*weight
      s(ii1+2,jj1+1)=s(ii1+2,jj1+1)+((elas( 8)*vo(2,3)
     &+(elas( 1)*vo(2,1)
     &)*vo(3,1)+(elas( 7)
     &+elas( 7)*vo(2,2))*vo(3,2)
     &+(elas( 8)*vo(2,3))
     &*vo(3,3))*w(1,1)
     &+((elas( 2)
     &+elas( 2)*vo(2,2))*vo(3,1)+(elas( 7)*vo(2,1))*vo(3,2)
     &)*w(1,2)
     &+(elas( 8)*vo(2,1)
     &+(elas( 4)*vo(2,3))*vo(3,1)
     &+(elas( 8)*vo(2,1))
     &*vo(3,3))*w(1,3)
     &+((elas( 7)
     &+elas( 7)*vo(2,2))*vo(3,1)+(elas( 2)*vo(2,1))*vo(3,2)
     &)*w(2,1)
     &+(elas( 9)*vo(2,3)+(elas( 7)*vo(2,1)
     &)*vo(3,1)+(elas( 3)
     &+elas( 3)*vo(2,2))*vo(3,2)
     &+(elas( 9)*vo(2,3))
     &*vo(3,3))*w(2,2)
     &+(elas( 9)
     &+elas( 9)*vo(2,2)+(elas( 5)*vo(2,3))*vo(3,2)
     &+(elas( 9)+elas( 9)*vo(2,2))
     &*vo(3,3))*w(2,3)
     &+(elas( 4)*vo(2,1)
     &+(elas( 8)*vo(2,3))*vo(3,1)
     &+(elas( 4)*vo(2,1))
     &*vo(3,3))*w(3,1)
     &+(elas( 5)
     &+elas( 5)*vo(2,2)+(elas( 9)*vo(2,3))*vo(3,2)
     &+(elas( 5)+elas( 5)*vo(2,2))
     &*vo(3,3))*w(3,2)
     &+(elas( 6)*vo(2,3)+(elas( 8)*vo(2,1)
     &)*vo(3,1)+(elas( 9)
     &+elas( 9)*vo(2,2))*vo(3,2)
     &+(elas( 6)*vo(2,3))
     &*vo(3,3))*w(3,3))*weight
      s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+((elas( 8)
     &+elas( 8)*vo(3,3)+(elas( 1)*vo(3,1)
     &)*vo(3,1)+(elas( 7)*vo(3,2))*vo(3,2)
     &+(elas( 8)+elas( 8)*vo(3,3))
     &*vo(3,3))*w(1,1)
     &+((elas( 2)*vo(3,2))*vo(3,1)+(elas( 7)*vo(3,1))*vo(3,2)
     &)*w(1,2)
     &+(elas( 8)*vo(3,1)
     &+(elas( 4)
     &+elas( 4)*vo(3,3))*vo(3,1)
     &+(elas( 8)*vo(3,1))
     &*vo(3,3))*w(1,3)
     &+((elas( 7)*vo(3,2))*vo(3,1)+(elas( 2)*vo(3,1))*vo(3,2)
     &)*w(2,1)
     &+(elas( 9)
     &+elas( 9)*vo(3,3)+(elas( 7)*vo(3,1)
     &)*vo(3,1)+(elas( 3)*vo(3,2))*vo(3,2)
     &+(elas( 9)+elas( 9)*vo(3,3))
     &*vo(3,3))*w(2,2)
     &+(elas( 9)*vo(3,2)+(elas( 5)
     &+elas( 5)*vo(3,3))*vo(3,2)
     &+(elas( 9)*vo(3,2))
     &*vo(3,3))*w(2,3)
     &+(elas( 4)*vo(3,1)
     &+(elas( 8)
     &+elas( 8)*vo(3,3))*vo(3,1)
     &+(elas( 4)*vo(3,1))
     &*vo(3,3))*w(3,1)
     &+(elas( 5)*vo(3,2)+(elas( 9)
     &+elas( 9)*vo(3,3))*vo(3,2)
     &+(elas( 5)*vo(3,2))
     &*vo(3,3))*w(3,2)
     &+(elas( 6)
     &+elas( 6)*vo(3,3)+(elas( 8)*vo(3,1)
     &)*vo(3,1)+(elas( 9)*vo(3,2))*vo(3,2)
     &+(elas( 6)+elas( 6)*vo(3,3))
     &*vo(3,3))*w(3,3))*weight
!
                     else
!
                      do i1=1,3
                        iii1=ii1+i1-1
                        do j1=1,3
                          jjj1=jj1+j1-1
                          do k1=1,3
                            do l1=1,3
                              s(iii1,jjj1)=s(iii1,jjj1)
     &                         +anisox(i1,k1,j1,l1)*w(k1,l1)*weight
                              do m1=1,3
                                s(iii1,jjj1)=s(iii1,jjj1)
     &                              +anisox(i1,k1,m1,l1)*w(k1,l1)
     &                                 *vo(j1,m1)*weight
     &                              +anisox(m1,k1,j1,l1)*w(k1,l1)
     &                                 *vo(i1,m1)*weight
                                do n1=1,3
                                  s(iii1,jjj1)=s(iii1,jjj1)
     &                                  +anisox(m1,k1,n1,l1)
     &                                  *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
     &                                  *weight
                                enddo
                              enddo
                            enddo
                          enddo
                        enddo
                      enddo
!SPEC: The immediately preceding loop nest is also available in 
!SPEC: program-generated (much longer) form from the author's 
!SPEC: website (see 454.calculix/Docs) in file anisonl.f
!SPEC:
!SPEC:                   call anisonl(w,vo,elas,s,ii1,jj1,weight)
!SPEC:
                     endif
!
!                   stress stiffness
!
                     senergy=
     &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
     &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
     &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
                     s(ii1,jj1)=s(ii1,jj1)+senergy
                     s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
                     s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
!
!                   mass matrix
!
                     if(mass) then
                        sm(ii1,jj1)=sm(ii1,jj1)
     &                       +rho*shpj(4,ii)*shp(4,jj)*weight
                        sm(ii1+1,jj1+1)=sm(ii1,jj1)
                        sm(ii1+2,jj1+2)=sm(ii1,jj1)
                     endif
!
!                    stiffness contribution of centrifugal forces
!
                     if(mass.and.(om.gt.0.d0)) then
                        dmass=shpj(4,ii)*shp(4,jj)*weight*om
                        do m1=1,3
                           s(ii1+m1-1,jj1+m1-1)=s(ii1+m1-1,jj1+m1-1)-
     &                          dmass
                           do m2=1,3
                              s(ii1+m1-1,jj1+m2-1)=s(ii1+m1-1,jj1+m2-1)+
     &                             dmass*p2(m1)*p2(m2)
                           enddo
                        enddo
                     endif
!
                     ii1=ii1+3
                  enddo
                  jj1=jj1+3
               enddo
            endif
!
         endif
!
!           computation of the right hand side
!
         if(rhs) then
!
!             body forces
!
            if(ibod.ne.0) then
               if(om.gt.0.d0) then
                  do i1=1,3
!
!                   computation of the global coordinates of the gauss
!                   point
!
                     q(i1)=0.d0
                     if(iperturb.eq.0) then
                        do j1=1,nope
                           q(i1)=q(i1)+shp(4,j1)*xl(i1,j1)
                        enddo
                     else
                        do j1=1,nope
                           q(i1)=q(i1)+shp(4,j1)*
     &                          (xl(i1,j1)+voldl(i1,j1))
                        enddo
                     endif
!                       
                     q(i1)=q(i1)-p1(i1)
                  enddo
                  const=q(1)*p2(1)+q(2)*p2(2)+q(3)*p2(3)
!
!                 inclusion of the centrifugal force into the body force
!
                  do i1=1,3
                     bf(i1)=bodyf(i1)+(q(i1)-const*p2(i1))*om
                  enddo
               else
                  do i1=1,3
                     bf(i1)=bodyf(i1)
                  enddo
               endif
               jj1=1
               do jj=1,nope
                  f(jj1)=f(jj1)+bf(1)*shpj(4,jj)*weight
                  f(jj1+1)=f(jj1+1)+bf(2)*shpj(4,jj)*weight
                  f(jj1+2)=f(jj1+2)+bf(3)*shpj(4,jj)*weight
                  ff(jj1)=ff(jj1)+bf(1)*shpj(4,jj)*weight
                  ff(jj1+1)=ff(jj1+1)+bf(2)*shpj(4,jj)*weight
                  ff(jj1+2)=ff(jj1+2)+bf(3)*shpj(4,jj)*weight
                  jj1=jj1+3
               enddo
            endif
!
!             thermal stresses and/or residual stresses
!
            if((ithermal.ne.0).or.(iprestr.ne.0)) then
               do jj=1,6
                  beta(jj)=beta(jj)*xsj
               enddo
               jj1=1
               do jj=1,nope
                  f(jj1)=f(jj1)+(shp(1,jj)*beta(1)+
     &                 (shp(2,jj)*beta(4)+shp(3,jj)*beta(5))/2.d0)
     &                 *weight
                  f(jj1+1)=f(jj1+1)+(shp(2,jj)*beta(2)+
     &                 (shp(1,jj)*beta(4)+shp(3,jj)*beta(6))/2.d0)
     &                 *weight
                  f(jj1+2)=f(jj1+2)+(shp(3,jj)*beta(3)+
     &                 (shp(1,jj)*beta(5)+shp(2,jj)*beta(6))/2.d0)
     &                 *weight
                  jj1=jj1+3
               enddo
            endif
!
         endif
!
      enddo
!
c      write(*,'(6(1x,e11.4))') ((s(i1,j1),i1=1,j1),j1=1,60)
c      write(*,*)
c
      if(.not.buckling) then
!
!       distributed loads
!
         if(nload.eq.0) then
            return
         endif
         call nident2(nelemload,nelem,nload,id)
         do
            if((id.eq.0).or.(nelemload(1,id).ne.nelem)) exit
            read(sideload(id)(2:2),'(i1)') ig
!
!         treatment of wedge faces
!
            if(lakonl(4:4).eq.'6') then
               mint2d=1
               if(ig.le.2) then
                  nopes=3
               else
                  nopes=4
               endif
            endif
          if(lakonl(4:5).eq.'15') then
             if(ig.le.2) then
                mint2d=3
                nopes=6
             else
                mint2d=4
                nopes=8
             endif
          endif
!
          if((nope.eq.20).or.(nope.eq.8)) then
             if(iperturb.eq.0) then
                do i=1,nopes
                   do j=1,3
                      xl2(j,i)=co(j,konl(ifaceq(i,ig)))
                   enddo
                enddo
             else
                if(mass) then
                   do i=1,nopes
                      do j=1,3
                         xl1(j,i)=co(j,konl(ifaceq(i,ig)))
                      enddo
                   enddo
                endif
                do i=1,nopes
                   do j=1,3
                      xl2(j,i)=co(j,konl(ifaceq(i,ig)))+
     &                     vold(j,konl(ifaceq(i,ig)))
                   enddo
                enddo
             endif
          elseif((nope.eq.10).or.(nope.eq.4)) then
             if(iperturb.eq.0) then
                do i=1,nopes
                   do j=1,3
                      xl2(j,i)=co(j,konl(ifacet(i,ig)))
                   enddo
                enddo
             else
                if(mass) then
                   do i=1,nopes
                      do j=1,3
                         xl1(j,i)=co(j,konl(ifacet(i,ig)))
                      enddo
                   enddo
                endif
                do i=1,nopes
                   do j=1,3
                      xl2(j,i)=co(j,konl(ifacet(i,ig)))+
     &                     vold(j,konl(ifacet(i,ig)))
                   enddo
                enddo
             endif
          else
             if(iperturb.eq.0) then
                do i=1,nopes
                   do j=1,3
                      xl2(j,i)=co(j,konl(ifacew(i,ig)))
                   enddo
                enddo
             else
                if(mass) then
                   do i=1,nopes
                      do j=1,3
                         xl1(j,i)=co(j,konl(ifacew(i,ig)))
                      enddo
                   enddo
                endif
                do i=1,nopes
                   do j=1,3
                      xl2(j,i)=co(j,konl(ifacew(i,ig)))+
     &                     vold(j,konl(ifacew(i,ig)))
                   enddo
                enddo
             endif
          endif
!
          do i=1,mint2d
             if((lakonl(4:5).eq.'8R').or.
     &            ((lakonl(4:4).eq.'6').and.(nopes.eq.4))) then
                xi=gauss2d1(1,i)
                et=gauss2d1(2,i)
                weight=weight2d1(i)
             elseif((lakonl(4:4).eq.'8').or.
     &               (lakonl(4:6).eq.'20R').or.
     &               ((lakonl(4:5).eq.'15').and.(nopes.eq.8))) then
                xi=gauss2d2(1,i)
                et=gauss2d2(2,i)
                weight=weight2d2(i)
             elseif(lakonl(4:4).eq.'2') then
                xi=gauss2d3(1,i)
                et=gauss2d3(2,i)
                weight=weight2d3(i)
             elseif((lakonl(4:5).eq.'10').or.
     &               ((lakonl(4:5).eq.'15').and.(nopes.eq.6))) then
                xi=gauss2d5(1,i)
                et=gauss2d5(2,i)
                weight=weight2d5(i)
             elseif((lakonl(4:4).eq.'4').or.
     &               ((lakonl(4:4).eq.'6').and.(nopes.eq.3))) then
                xi=gauss2d4(1,i)
                et=gauss2d4(2,i)
                weight=weight2d4(i)
             endif
!
             if(rhs) then
                if(nopes.eq.8) then
                   call shape8q(xi,et,xl2,xsj2,shp2)
                elseif(nopes.eq.4) then
                   call shape4q(xi,et,xl2,xsj2,shp2)
                elseif(nopes.eq.6) then
                   call shape6tri(xi,et,xl2,xsj2,shp2)
                else
                   call shape3tri(xi,et,xl2,xsj2,shp2)
                endif
!
                do k=1,nopes
                   if((nope.eq.20).or.(nope.eq.8)) then
                      ipointer=(ifaceq(k,ig)-1)*3
                   elseif((nope.eq.10).or.(nope.eq.4)) then
                      ipointer=(ifacet(k,ig)-1)*3
                   else
                      ipointer=(ifacew(k,ig)-1)*3
                   endif
                   f(ipointer+1)=f(ipointer+1)-shp2(4,k)*xload(1,id)
     &                  *xsj2(1)*weight
                   f(ipointer+2)=f(ipointer+2)-shp2(4,k)*xload(1,id)
     &                  *xsj2(2)*weight
                   f(ipointer+3)=f(ipointer+3)-shp2(4,k)*xload(1,id)
     &                  *xsj2(3)*weight
                   ff(ipointer+1)=ff(ipointer+1)-shp2(4,k)*xload(1,id)
     &                  *xsj2(1)*weight
                   ff(ipointer+2)=ff(ipointer+2)-shp2(4,k)*xload(1,id)
     &                  *xsj2(2)*weight
                   ff(ipointer+3)=ff(ipointer+3)-shp2(4,k)*xload(1,id)
     &                  *xsj2(3)*weight
                enddo
!
!            stiffness contribution of the distributed load 
!
             elseif(mass) then
                if(nopes.eq.8) then
                   call shape8q(xi,et,xl1,xsj2,shp2)
                elseif(nopes.eq.4) then
                   call shape4q(xi,et,xl1,xsj2,shp2)
                elseif(nopes.eq.6) then
                   call shape6tri(xi,et,xl1,xsj2,shp2)
                else
                   call shape3tri(xi,et,xl1,xsj2,shp2)
                endif
!
!               calculation of the deformation gradient
!
                do k=1,3
                   do l=1,3
                      xkl(k,l)=0.d0
                      do ii=1,nopes
                         xkl(k,l)=xkl(k,l)+shp2(l,ii)*xl2(k,ii)
                      enddo
                   enddo
                enddo
!
                do ii=1,nopes
                   if((nope.eq.20).or.(nope.eq.8)) then
                      ipointeri=(ifaceq(ii,ig)-1)*3
                   elseif((nope.eq.10).or.(nope.eq.4)) then
                     ipointeri=(ifacet(ii,ig)-1)*3
                   else
                      ipointeri=(ifacew(ii,ig)-1)*3
                   endif
                   do jj=1,nopes
                      if((nope.eq.20).or.(nope.eq.8)) then
                         ipointerj=(ifaceq(jj,ig)-1)*3
                      elseif((nope.eq.10).or.(nope.eq.4)) then
                         ipointerj=(ifacet(jj,ig)-1)*3
                      else
                         ipointerj=(ifacew(jj,ig)-1)*3
                      endif
                      do k=1,3
                         do l=1,3
                            if(k.eq.l) cycle
                            if(k*l.eq.2) then
                               n=3
                            elseif(k*l.eq.3) then
                               n=2
                            else
                               n=1
                            endif
                            term=weight*xload(1,id)*shp2(4,jj)*
     &                       (xsj2(1)*
     &                        (xkl(n,2)*shp2(3,ii)-xkl(n,3)*shp2(2,ii))+
     &                        xsj2(2)*
     &                        (xkl(n,3)*shp2(1,ii)-xkl(n,1)*shp2(3,ii))+
     &                        xsj2(3)*
     &                        (xkl(n,1)*shp2(2,ii)-xkl(n,2)*shp2(1,ii)))
                            if(ipointeri+k.le.ipointerj+l) then
                               s(ipointeri+k,ipointerj+l)=
     &                              s(ipointeri+k,ipointerj+l)+term/2.d0
                            else
                               s(ipointerj+l,ipointeri+k)=
     &                              s(ipointerj+l,ipointeri+k)+term/2.d0
                            endif
                         enddo
                      enddo
                   enddo
                enddo
!
             endif
          enddo
!
          id=id-1
       enddo
!
      elseif(mass.and.(iexpl.eq.1)) then
!
!        scaling the diagonal terms of the mass matrix such that the total mass
!        is right (LUMPING; for explicit dynamic calculations)
!
         sume=0.d0
         summ=0.d0
         do i=1,3*nopev,3
            sume=sume+sm(i,i)
         enddo
         do i=3*nopev+1,3*nope,3
            summ=summ+sm(i,i)
         enddo
!
         if(nope.eq.20) then
c            alp=.2215d0
            alp=.2917d0
!              maybe alp=.2917d0 is better??
         elseif(nope.eq.10) then
            alp=0.1203d0
         elseif(nope.eq.15) then
            alp=0.2141d0
         endif
!
         if((nope.eq.20).or.(nope.eq.10).or.(nope.eq.15)) then
            factore=summass*alp/(1.d0+alp)/sume
            factorm=summass/(1.d0+alp)/summ
         else
            factore=summass/sume
         endif
!
         do i=1,3*nopev,3
            sm(i,i)=sm(i,i)*factore
            sm(i+1,i+1)=sm(i,i)
            sm(i+2,i+2)=sm(i,i)
         enddo
         do i=3*nopev+1,3*nope,3
            sm(i,i)=sm(i,i)*factorm
            sm(i+1,i+1)=sm(i,i)
            sm(i+2,i+2)=sm(i,i)
         enddo
!
      endif
!
      return
      end


[-- Attachment #3: e_c3d-orig.f --]
[-- Type: application/octet-stream, Size: 36912 bytes --]

!
!     CalculiX - A 3-dimensional finite element program
!              Copyright (C) 1998 Guido Dhondt
!
!     This program is free software; you can redistribute it and/or
!     modify it under the terms of the GNU General Public License as
!     published by the Free Software Foundation(version 2);
!     
!
!     This program is distributed in the hope that it will be useful,
!     but WITHOUT ANY WARRANTY; without even the implied warranty of 
!     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 
!     GNU General Public License for more details.
!
!     You should have received a copy of the GNU General Public License
!     along with this program; if not, write to the Free Software
!     Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
!
      subroutine e_c3d(co,nk,konl,lakonl,p1,p2,omx,bodyfx,ibod,s,sm,f,
     &  ff,nelem,nmethod,elcon,nelcon,rhcon,nrhcon,alcon,nalcon,alzero,
     &  ielmat,ielorien,norien,orab,ntmat_,
     &  t0,t1,ithermal,prestr,iprestr,vold,iperturb,nelemload,
     &  sideload,xload,nload,idist,sti,stx,eei,iexpl,plicon,
     &  nplicon,plkcon,nplkcon,xstiff,npmat_,dtime,
     &  matname,mint_,ncmat_,mass,stiffness,buckling,rhs,intscheme)
!
!     computation of the element matrix and rhs for the element with
!     the topology in konl
!
!     f: rhs with temperature and eigenstress contribution: for linear 
!        calculations only
!     ff: rhs without temperature and eigenstress contribution
!
!     nmethod=0: check for positive Jacobian
!     nmethod=1: stiffness matrix + right hand side
!     nmethod=2: stiffness matrix + mass matrix
!     nmethod=3: static stiffness + buckling stiffness
!     nmethod=4: right hand side (linear, iperturb=0)
!
      implicit none
!
      logical mass,stiffness,buckling,rhs
!
      character*5 sideload(*)
      character*8 lakonl
      character*20 matname(*),amat
!
      integer konl(20),ifaceq(8,6),nelemload(2,*),nk,ibod,nelem,nmethod,
     &  mattyp,ithermal,iprestr,iperturb,nload,idist,i,j,k,l,i1,i2,j1,
     &  k1,l1,ii,jj,ii1,jj1,id,ipointer,ig,m1,m2,m3,m4,kk,
     &  nelcon(2,*),nrhcon(*),nalcon(2,*),ielmat(*),ielorien(*),
     &  ntmat_,nope,nopes,norien,icmdl,ihyper,iexpl,kode,imat,mint2d,
     &  mint3d,mint_,ifacet(6,4),nopev,iorien,istiff,ncmat_,
     &  ifacew(8,5),intscheme,n,ipointeri,ipointerj,iii1,jjj1,n1
!
      integer nplicon(0:ntmat_,*),nplkcon(0:ntmat_,*),npmat_
!
      real*8 co(3,*),xl(3,20),shp(4,20),
     &  s(60,60),w(3,3),p1(3),p2(3),f(60),bodyf(3),bodyfx(3),ff(60),
     &  bf(3),q(3),shpj(4,20),elcon(0:ncmat_,ntmat_,*),
     &  rhcon(0:1,ntmat_,*),xkl(3,3),
     &  alcon(0:6,ntmat_,*),alzero(*),orab(7,*),t0(*),t1(*),
     &  anisox(3,3,3,3),beta(6),prestr(6,*),voldl(3,20),vo(3,3),
     &  xl2(3,8),xsj2(3),shp2(4,8),vold(0:3,*),xload(2,*),v(3,3,3,3),
     &  om,omx,e,un,al,um,xi,et,ze,tt,const,xsj,xsjj,sm(60,60),
     &  sti(6,mint_,*),stx(6,mint_,*),s11,s22,s33,s12,s13,s23,s11b,
     &  s22b,s33b,s12b,s13b,s23b,eei(6,mint_,*),t0l,t1l,stre(6),
     &  senergy,senergyb,rho,elas(21),alph(6),summass,summ,
     &  sume,factorm,factore,alp,elconloc(21),eth(6),exx,eyy,ezz,
     &  exy,exz,eyz,am1,weight,pgauss(3),dmass,xl1(3,8),term
!
      real*8 gauss2d1(2,1),gauss2d2(2,4),gauss2d3(2,9),gauss2d4(2,1),
     &  gauss2d5(2,3),gauss3d1(3,1),gauss3d2(3,8),gauss3d3(3,27),
     &  gauss3d4(3,1),gauss3d5(3,4),gauss3d6(3,15),gauss3d7(3,2),
     &  gauss3d8(3,9),gauss3d9(3,18),weight2d1(1),weight2d2(4),
     &  weight2d3(9),weight2d4(1),weight2d5(3),weight3d1(1),
     &  weight3d2(8),weight3d3(27),weight3d4(1),weight3d5(4),
     &  weight3d6(15),weight3d7(2),weight3d8(9),weight3d9(18)
!
      real*8 plicon(0:2*npmat_,ntmat_,*),plkcon(0:2*npmat_,ntmat_,*),
     &  xstiff(21,mint_,*),
     &  plconloc(82),dtime
!
      include "gauss.f"
!
      data ifaceq /4,3,2,1,11,10,9,12,
     &            5,6,7,8,13,14,15,16,
     &            1,2,6,5,9,18,13,17,
     &            2,3,7,6,10,19,14,18,
     &            3,4,8,7,11,20,15,19,
     &            4,1,5,8,12,17,16,20/
      data ifacet /1,3,2,7,6,5,
     &             1,2,4,5,9,8,
     &             2,3,4,6,10,9,
     &             1,4,3,8,10,7/
      data ifacew /1,3,2,9,8,7,0,0,
     &             4,5,6,10,11,12,0,0,
     &             1,2,5,4,7,14,10,13,
     &             2,3,6,5,8,15,11,14,
     &             4,6,3,1,12,15,9,13/
!
c      if(nmethod.eq.5) then
c         intscheme=1
c         nmethod=2
c      else
c         intscheme=0
c      endif
c!
c      mass=.false.
c      stiffness=.false.
c      buckling=.false.
c      rhs=.false.
c!
c      if(nmethod.eq.1) then
c         stiffness=.true.
c         rhs=.true.
c      elseif(nmethod.eq.2) then
c         mass=.true.
c         stiffness=.true.
c      elseif(nmethod.eq.3) then
c         stiffness=.true.
c         buckling=.true.
c      elseif(nmethod.eq.4) then
c         rhs=.true.
c      endif
!
      summass=0.d0
!
      imat=ielmat(nelem)
      amat=matname(imat)
      if(norien.gt.0) then
         iorien=ielorien(nelem)
      else
         iorien=0
      endif
!
      if(lakonl(4:4).eq.'2') then
         nope=20
         nopev=8
         nopes=8
      elseif(lakonl(4:4).eq.'8') then
         nope=8
         nopev=8
         nopes=4
      elseif(lakonl(4:5).eq.'10') then
         nope=10
         nopev=4
         nopes=6
      elseif(lakonl(4:4).eq.'4') then
         nope=4
         nopev=4
         nopes=3
      elseif(lakonl(4:5).eq.'15') then
         nope=15
         nopev=6
      else
         nope=6
         nopev=6
      endif
!
      if(intscheme.eq.0) then
         if(lakonl(4:5).eq.'8R') then
            mint2d=1
            mint3d=1
         elseif((lakonl(4:4).eq.'8').or.(lakonl(4:6).eq.'20R')) then
            mint2d=4
            mint3d=8
         elseif(lakonl(4:4).eq.'2') then
            mint2d=9
            mint3d=27
         elseif(lakonl(4:5).eq.'10') then
            mint2d=3
            mint3d=4
         elseif(lakonl(4:4).eq.'4') then
            mint2d=1
            mint3d=1
         elseif(lakonl(4:5).eq.'15') then
            mint3d=9
         else
            mint3d=2
         endif
      else
         if((lakonl(4:4).eq.'8').or.(lakonl(4:4).eq.'2')) then
            mint3d=27
         elseif((lakonl(4:5).eq.'10').or.(lakonl(4:4).eq.'4')) then
            mint3d=15
         else
            mint3d=18
         endif
      endif
!
!     computation of the coordinates of the local nodes
!
      do i=1,nope
        do j=1,3
          xl(j,i)=co(j,konl(i))
        enddo
      enddo
!
      if(nelcon(1,imat).lt.0) then
         ihyper=1
      else
         ihyper=0
      endif
!
!       initialisation for distributed forces
!
      if(rhs) then
        if(idist.ne.0) then
          do i=1,3*nope
            f(i)=0.d0
            ff(i)=0.d0
          enddo
        endif
      endif
!
!     displacements for 2nd order static and modal theory
!
      if((iperturb.ne.0).and.stiffness.and.(.not.buckling)) then
         do i1=1,nope
            do i2=1,3
               voldl(i2,i1)=vold(i2,konl(i1))
            enddo
         enddo
      endif
!
!     initialisation of sm
!
      if(mass.or.buckling) then
        do i=1,3*nope
          do j=1,3*nope
            sm(i,j)=0.d0
          enddo
        enddo
      endif
!
!     initialisation of s
!
      do i=1,3*nope
        do j=1,3*nope
          s(i,j)=0.d0
        enddo
      enddo
!
!     computation of the matrix: loop over the Gauss points
!
      do kk=1,mint3d
         if(intscheme.eq.0) then
            if(lakonl(4:5).eq.'8R') then
               xi=gauss3d1(1,kk)
               et=gauss3d1(2,kk)
               ze=gauss3d1(3,kk)
               weight=weight3d1(kk)
            elseif((lakonl(4:4).eq.'8').or.(lakonl(4:6).eq.'20R')) 
     &              then
               xi=gauss3d2(1,kk)
c               if(nope.eq.20) xi=xi+1.d0
               et=gauss3d2(2,kk)
               ze=gauss3d2(3,kk)
               weight=weight3d2(kk)
            elseif(lakonl(4:4).eq.'2') then
c               xi=gauss3d3(1,kk)+1.d0
               xi=gauss3d3(1,kk)
               et=gauss3d3(2,kk)
               ze=gauss3d3(3,kk)
               weight=weight3d3(kk)
            elseif(lakonl(4:5).eq.'10') then
               xi=gauss3d5(1,kk)
               et=gauss3d5(2,kk)
               ze=gauss3d5(3,kk)
               weight=weight3d5(kk)
            elseif(lakonl(4:4).eq.'4') then
               xi=gauss3d4(1,kk)
               et=gauss3d4(2,kk)
               ze=gauss3d4(3,kk)
               weight=weight3d4(kk)
            elseif(lakonl(4:5).eq.'15') then
               xi=gauss3d8(1,kk)
               et=gauss3d8(2,kk)
               ze=gauss3d8(3,kk)
               weight=weight3d8(kk)
            else
               xi=gauss3d7(1,kk)
               et=gauss3d7(2,kk)
               ze=gauss3d7(3,kk)
               weight=weight3d7(kk)
            endif
         else
            if((lakonl(4:4).eq.'8').or.(lakonl(4:4).eq.'2')) then
c               xi=gauss3d3(1,kk)+1.d0
               xi=gauss3d3(1,kk)
               et=gauss3d3(2,kk)
               ze=gauss3d3(3,kk)
               weight=weight3d3(kk)
            elseif((lakonl(4:5).eq.'10').or.(lakonl(4:4).eq.'4')) then
               xi=gauss3d6(1,kk)
               et=gauss3d6(2,kk)
               ze=gauss3d6(3,kk)
               weight=weight3d6(kk)
            else
               xi=gauss3d9(1,kk)
               et=gauss3d9(2,kk)
               ze=gauss3d9(3,kk)
               weight=weight3d9(kk)
            endif
         endif
!
!           calculation of the shape functions and their derivatives
!           in the gauss point
!
         if(nope.eq.20) then
            call shape20h(xi,et,ze,xl,xsj,shp)
         elseif(nope.eq.8) then
            call shape8h(xi,et,ze,xl,xsj,shp)
         elseif(nope.eq.10) then
            call shape10tet(xi,et,ze,xl,xsj,shp)
         elseif(nope.eq.4) then
            call shape4tet(xi,et,ze,xl,xsj,shp)
         elseif(nope.eq.15) then
            call shape15w(xi,et,ze,xl,xsj,shp)
         else
            call shape6w(xi,et,ze,xl,xsj,shp)
         endif
!
!           check the jacobian determinant
!
         if(xsj.lt.1.d-20) then
            write(*,*) '*WARNING in e_c3d: nonpositive jacobian'
            write(*,*) '         determinant in element',nelem
            write(*,*)
            xsj=dabs(xsj)
            nmethod=0
         endif
!
         if((iperturb.ne.0).and.stiffness.and.(.not.buckling))
     &        then
!
!              stresses for 2nd order static and modal theory
!
            s11=sti(1,kk,nelem)
            s22=sti(2,kk,nelem)
            s33=sti(3,kk,nelem)
            s12=sti(4,kk,nelem)
            s13=sti(5,kk,nelem)
            s23=sti(6,kk,nelem)
         endif
!
!           calculating the temperature in the integration
!           point
!
         t0l=0.d0
         t1l=0.d0
         if(ithermal.eq.1) then
            if(lakonl(4:5).eq.'8 ') then
               do i1=1,nope
                  t0l=t0l+t0(konl(i1))/8.d0
                  t1l=t1l+t1(konl(i1))/8.d0
               enddo
            elseif(lakonl(4:6).eq.'20 ') then
               call lintemp(t0,t1,konl,nope,kk,t0l,t1l)
            else
               do i1=1,nope
                  t0l=t0l+shp(4,i1)*t0(konl(i1))
                  t1l=t1l+shp(4,i1)*t1(konl(i1))
               enddo
            endif
         elseif(ithermal.ge.2) then
            if(lakonl(4:5).eq.'8 ') then
               do i1=1,nope
                  t0l=t0l+t0(konl(i1))/8.d0
                  t1l=t1l+vold(0,konl(i1))/8.d0
               enddo
            elseif(lakonl(4:6).eq.'20 ') then
               call lintemp_th(t0,vold,konl,nope,kk,t0l,t1l)
            else
               do i1=1,nope
                  t0l=t0l+shp(4,i1)*t0(konl(i1))
                  t1l=t1l+shp(4,i1)*vold(0,konl(i1))
               enddo
            endif
         endif
         tt=t1l-t0l
!
!           calculating the coordinates of the integration point
!           for material orientation purposes (for cylindrical
!           coordinate systems)
!
         if(iorien.gt.0) then
            do j=1,3
               pgauss(j)=0.d0
               do i1=1,nope
                  pgauss(j)=pgauss(j)+shp(4,i1)*co(j,konl(i1))
               enddo
            enddo
         endif
!
!           for deformation plasticity: calculating the Jacobian
!           and the inverse of the deformation gradient
!           needed to convert the stiffness matrix in the spatial
!           frame of reference to the material frame
!
         kode=nelcon(1,imat)
!
!           material data and local stiffness matrix
!
         istiff=1
         call materialdata(elcon,nelcon,rhcon,nrhcon,alcon,nalcon,
     &        imat,amat,iorien,pgauss,orab,ntmat_,elas,alph,rho,
     &        nelem,ithermal,alzero,mattyp,t0l,t1l,
     &        ihyper,istiff,elconloc,eth,kode,plicon,
     &        nplicon,plkcon,nplkcon,npmat_,
     &        plconloc,mint_,dtime,nelem,kk,
     &        xstiff,ncmat_)
!
         if(mattyp.eq.1) then
            e=elas(1)
            un=elas(2)
            um=e/(1.d0+un)
            al=un*um/(1.d0-2.d0*un)
            um=um/2.d0
         elseif(mattyp.eq.2) then
            call orthotropic(elas,anisox)
         else
            call anisotropic(elas,anisox)
         endif
!
!           initialisation for the body forces
!
         om=omx*rho
         if(rhs) then
            if(ibod.ne.0) then
               do ii=1,3
                  bodyf(ii)=bodyfx(ii)*rho
               enddo
            endif
         endif
!
         if(rhs) then
!
!             information for the rhs
!
!             residual stresses
!
            if((iprestr.eq.1).or.(ithermal.eq.1)) then
               if(iprestr.eq.0) then
                  do ii=1,6
                     beta(ii)=0.d0
                  enddo
               else
                  do ii=1,6
                     beta(ii)=-prestr(ii,nelem)
                  enddo
               endif
            endif
!
!             calculation of the thermal stresses in an undeformed body 
!             assumption; beta corresponds to initial stresses.
!
            if(ithermal.eq.1) then
               if(ihyper.eq.0) then
                  icmdl=2
                  call linel(ithermal,mattyp,beta,al,um,am1,alph,tt,
     &                 elas,icmdl,exx,eyy,ezz,exy,exz,eyz,stre,
     &                 anisox)
               endif
            endif
!
         elseif(buckling) then
!
!              buckling stresses
!
            s11b=stx(1,kk,nelem)
            s22b=stx(2,kk,nelem)
            s33b=stx(3,kk,nelem)
            s12b=stx(4,kk,nelem)
            s13b=stx(5,kk,nelem)
            s23b=stx(6,kk,nelem)
!
         endif
!
!           incorporating the jacobian determinant in the shape
!           functions
!
         xsjj=dsqrt(xsj)
         do i1=1,nope
            shpj(1,i1)=shp(1,i1)*xsjj
            shpj(2,i1)=shp(2,i1)*xsjj
            shpj(3,i1)=shp(3,i1)*xsjj
            shpj(4,i1)=shp(4,i1)*xsj
         enddo
!
!           determination of the stiffness, and/or mass and/or
!           buckling matrix
!
         if(stiffness.or.mass.or.buckling) then
!
            if((iperturb.eq.0).or.buckling)
     &           then
               jj1=1
               do jj=1,nope
!
                  ii1=1
                  do ii=1,jj
!
!                   all products of the shape functions for a given ii
!                   and jj
!
                     do i1=1,3
                        do j1=1,3
                           w(i1,j1)=shpj(i1,ii)*shpj(j1,jj)
                        enddo
                     enddo
!
!                   the following section calculates the static
!                   part of the stiffness matrix which, for buckling 
!                   calculations, is done in a preliminary static
!                   call
!
                     if(.not.buckling) then
!
                        if(mattyp.eq.1) then
!
                           s(ii1,jj1)=s(ii1,jj1)+(al*w(1,1)+
     &                          um*(2.d0*w(1,1)+w(2,2)+w(3,3)))*weight
                           s(ii1,jj1+1)=s(ii1,jj1+1)+(al*w(1,2)+
     &                          um*w(2,1))*weight
                           s(ii1,jj1+2)=s(ii1,jj1+2)+(al*w(1,3)+
     &                          um*w(3,1))*weight
                           s(ii1+1,jj1)=s(ii1+1,jj1)+(al*w(2,1)+
     &                          um*w(1,2))*weight
                           s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+(al*w(2,2)+
     &                          um*(2.d0*w(2,2)+w(1,1)+w(3,3)))*weight
                           s(ii1+1,jj1+2)=s(ii1+1,jj1+2)+(al*w(2,3)+
     &                          um*w(3,2))*weight
                           s(ii1+2,jj1)=s(ii1+2,jj1)+(al*w(3,1)+
     &                          um*w(1,3))*weight
                           s(ii1+2,jj1+1)=s(ii1+2,jj1+1)+(al*w(3,2)+
     &                          um*w(2,3))*weight
                           s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+(al*w(3,3)+
     &                          um*(2.d0*w(3,3)+w(2,2)+w(1,1)))*weight
!
                        elseif(mattyp.eq.2) then
!
                           s(ii1,jj1)=s(ii1,jj1)+(elas(1)*w(1,1)+
     &                          elas(7)*w(2,2)+elas(8)*w(3,3))*weight
                           s(ii1,jj1+1)=s(ii1,jj1+1)+(elas(2)*w(1,2)+
     &                          elas(7)*w(2,1))*weight
                           s(ii1,jj1+2)=s(ii1,jj1+2)+(elas(4)*w(1,3)+
     &                          elas(8)*w(3,1))*weight
                           s(ii1+1,jj1)=s(ii1+1,jj1)+(elas(7)*w(1,2)+
     &                          elas(2)*w(2,1))*weight
                           s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+
     &                          (elas(7)*w(1,1)+
     &                          elas(3)*w(2,2)+elas(9)*w(3,3))*weight
                           s(ii1+1,jj1+2)=s(ii1+1,jj1+2)+
     &                          (elas(5)*w(2,3)+
     &                          elas(9)*w(3,2))*weight
                           s(ii1+2,jj1)=s(ii1+2,jj1)+
     &                          (elas(8)*w(1,3)+
     &                          elas(4)*w(3,1))*weight
                           s(ii1+2,jj1+1)=s(ii1+2,jj1+1)+
     &                          (elas(9)*w(2,3)+
     &                          elas(5)*w(3,2))*weight
                           s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+
     &                          (elas(8)*w(1,1)+
     &                          elas(9)*w(2,2)+elas(6)*w(3,3))*weight
!
                        else
!
                           do i1=1,3
                              do j1=1,3
                                 do k1=1,3
                                    do l1=1,3
                                       s(ii1+i1-1,jj1+j1-1)=
     &                                      s(ii1+i1-1,jj1+j1-1)
     &                                      +anisox(i1,k1,j1,l1)
     &                                      *w(k1,l1)*weight
                                    enddo
                                 enddo
                              enddo
                           enddo
!
                        endif
!
!                     mass matrix
!
                        if(mass) then
                           sm(ii1,jj1)=sm(ii1,jj1)
     &                          +rho*shpj(4,ii)*shp(4,jj)*weight
                           sm(ii1+1,jj1+1)=sm(ii1,jj1)
                           sm(ii1+2,jj1+2)=sm(ii1,jj1)
                        endif
!
                     else
!
!                     buckling matrix  
!
                        senergyb=
     &                       (s11b*w(1,1)+s12b*(w(1,2)+w(2,1))
     &                       +s13b*(w(1,3)+w(3,1))+s22b*w(2,2)
     &                       +s23b*(w(2,3)+w(3,2))+s33b*w(3,3))*weight
                        sm(ii1,jj1)=sm(ii1,jj1)-senergyb
                        sm(ii1+1,jj1+1)=sm(ii1+1,jj1+1)-senergyb
                        sm(ii1+2,jj1+2)=sm(ii1+2,jj1+2)-senergyb
!
                     endif
!
                     ii1=ii1+3
                  enddo
                  jj1=jj1+3
               enddo
            else
!
!               stiffness matrix for static and modal
!               2nd order calculations
!
!               large displacement stiffness
!               
               do i1=1,3
                  do j1=1,3
                     vo(i1,j1)=0.d0
                     do k1=1,nope
                        vo(i1,j1)=vo(i1,j1)+shp(j1,k1)*voldl(i1,k1)
                     enddo
                  enddo
               enddo
!
               if(mattyp.eq.1) then
                  call wcoef(v,vo,al,um)
               endif
!
!               calculating the total mass of the element for
!               lumping purposes: only for explicit nonlinear
!               dynamic calculations
!
               if(mass.and.(iexpl.eq.1)) then
                  summass=summass+rho*xsj
               endif
!
               jj1=1
               do jj=1,nope
!
                  ii1=1
                  do ii=1,jj
!
!                   all products of the shape functions for a given ii
!                   and jj
!
                     do i1=1,3
                        do j1=1,3
                           w(i1,j1)=shpj(i1,ii)*shpj(j1,jj)
                        enddo
                     enddo
!
                     if(mattyp.eq.1) then
!
                        do m1=1,3
                           do m2=1,3
                              do m3=1,3
                                 do m4=1,3
                                    s(ii1+m2-1,jj1+m1-1)=
     &                                   s(ii1+m2-1,jj1+m1-1)
     &                                   +v(m4,m3,m2,m1)*w(m4,m3)*weight
                                 enddo
                              enddo
                           enddo
                        enddo
!                      
                     elseif(mattyp.eq.2) then
!
                        call orthonl(w,vo,elas,s,ii1,jj1,weight)
!
                     else
!
                      do i1=1,3
                        iii1=ii1+i1-1
                        do j1=1,3
                          jjj1=jj1+j1-1
                          do k1=1,3
                            do l1=1,3
                              s(iii1,jjj1)=s(iii1,jjj1)
     &                         +anisox(i1,k1,j1,l1)*w(k1,l1)*weight
                              do m1=1,3
                                s(iii1,jjj1)=s(iii1,jjj1)
     &                              +anisox(i1,k1,m1,l1)*w(k1,l1)
     &                                 *vo(j1,m1)*weight
     &                              +anisox(m1,k1,j1,l1)*w(k1,l1)
     &                                 *vo(i1,m1)*weight
                                do n1=1,3
                                  s(iii1,jjj1)=s(iii1,jjj1)
     &                                  +anisox(m1,k1,n1,l1)
     &                                  *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
     &                                  *weight
                                enddo
                              enddo
                            enddo
                          enddo
                        enddo
                      enddo
!SPEC: The immediately preceding loop nest is also available in 
!SPEC: program-generated (much longer) form from the author's 
!SPEC: website (see 454.calculix/Docs) in file anisonl.f
!SPEC:
!SPEC:                   call anisonl(w,vo,elas,s,ii1,jj1,weight)
!SPEC:
                     endif
!
!                   stress stiffness
!
                     senergy=
     &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
     &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
     &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
                     s(ii1,jj1)=s(ii1,jj1)+senergy
                     s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
                     s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
!
!                   mass matrix
!
                     if(mass) then
                        sm(ii1,jj1)=sm(ii1,jj1)
     &                       +rho*shpj(4,ii)*shp(4,jj)*weight
                        sm(ii1+1,jj1+1)=sm(ii1,jj1)
                        sm(ii1+2,jj1+2)=sm(ii1,jj1)
                     endif
!
!                    stiffness contribution of centrifugal forces
!
                     if(mass.and.(om.gt.0.d0)) then
                        dmass=shpj(4,ii)*shp(4,jj)*weight*om
                        do m1=1,3
                           s(ii1+m1-1,jj1+m1-1)=s(ii1+m1-1,jj1+m1-1)-
     &                          dmass
                           do m2=1,3
                              s(ii1+m1-1,jj1+m2-1)=s(ii1+m1-1,jj1+m2-1)+
     &                             dmass*p2(m1)*p2(m2)
                           enddo
                        enddo
                     endif
!
                     ii1=ii1+3
                  enddo
                  jj1=jj1+3
               enddo
            endif
!
         endif
!
!           computation of the right hand side
!
         if(rhs) then
!
!             body forces
!
            if(ibod.ne.0) then
               if(om.gt.0.d0) then
                  do i1=1,3
!
!                   computation of the global coordinates of the gauss
!                   point
!
                     q(i1)=0.d0
                     if(iperturb.eq.0) then
                        do j1=1,nope
                           q(i1)=q(i1)+shp(4,j1)*xl(i1,j1)
                        enddo
                     else
                        do j1=1,nope
                           q(i1)=q(i1)+shp(4,j1)*
     &                          (xl(i1,j1)+voldl(i1,j1))
                        enddo
                     endif
!                       
                     q(i1)=q(i1)-p1(i1)
                  enddo
                  const=q(1)*p2(1)+q(2)*p2(2)+q(3)*p2(3)
!
!                 inclusion of the centrifugal force into the body force
!
                  do i1=1,3
                     bf(i1)=bodyf(i1)+(q(i1)-const*p2(i1))*om
                  enddo
               else
                  do i1=1,3
                     bf(i1)=bodyf(i1)
                  enddo
               endif
               jj1=1
               do jj=1,nope
                  f(jj1)=f(jj1)+bf(1)*shpj(4,jj)*weight
                  f(jj1+1)=f(jj1+1)+bf(2)*shpj(4,jj)*weight
                  f(jj1+2)=f(jj1+2)+bf(3)*shpj(4,jj)*weight
                  ff(jj1)=ff(jj1)+bf(1)*shpj(4,jj)*weight
                  ff(jj1+1)=ff(jj1+1)+bf(2)*shpj(4,jj)*weight
                  ff(jj1+2)=ff(jj1+2)+bf(3)*shpj(4,jj)*weight
                  jj1=jj1+3
               enddo
            endif
!
!             thermal stresses and/or residual stresses
!
            if((ithermal.ne.0).or.(iprestr.ne.0)) then
               do jj=1,6
                  beta(jj)=beta(jj)*xsj
               enddo
               jj1=1
               do jj=1,nope
                  f(jj1)=f(jj1)+(shp(1,jj)*beta(1)+
     &                 (shp(2,jj)*beta(4)+shp(3,jj)*beta(5))/2.d0)
     &                 *weight
                  f(jj1+1)=f(jj1+1)+(shp(2,jj)*beta(2)+
     &                 (shp(1,jj)*beta(4)+shp(3,jj)*beta(6))/2.d0)
     &                 *weight
                  f(jj1+2)=f(jj1+2)+(shp(3,jj)*beta(3)+
     &                 (shp(1,jj)*beta(5)+shp(2,jj)*beta(6))/2.d0)
     &                 *weight
                  jj1=jj1+3
               enddo
            endif
!
         endif
!
      enddo
!
c      write(*,'(6(1x,e11.4))') ((s(i1,j1),i1=1,j1),j1=1,60)
c      write(*,*)
c
      if(.not.buckling) then
!
!       distributed loads
!
         if(nload.eq.0) then
            return
         endif
         call nident2(nelemload,nelem,nload,id)
         do
            if((id.eq.0).or.(nelemload(1,id).ne.nelem)) exit
            read(sideload(id)(2:2),'(i1)') ig
!
!         treatment of wedge faces
!
            if(lakonl(4:4).eq.'6') then
               mint2d=1
               if(ig.le.2) then
                  nopes=3
               else
                  nopes=4
               endif
            endif
          if(lakonl(4:5).eq.'15') then
             if(ig.le.2) then
                mint2d=3
                nopes=6
             else
                mint2d=4
                nopes=8
             endif
          endif
!
          if((nope.eq.20).or.(nope.eq.8)) then
             if(iperturb.eq.0) then
                do i=1,nopes
                   do j=1,3
                      xl2(j,i)=co(j,konl(ifaceq(i,ig)))
                   enddo
                enddo
             else
                if(mass) then
                   do i=1,nopes
                      do j=1,3
                         xl1(j,i)=co(j,konl(ifaceq(i,ig)))
                      enddo
                   enddo
                endif
                do i=1,nopes
                   do j=1,3
                      xl2(j,i)=co(j,konl(ifaceq(i,ig)))+
     &                     vold(j,konl(ifaceq(i,ig)))
                   enddo
                enddo
             endif
          elseif((nope.eq.10).or.(nope.eq.4)) then
             if(iperturb.eq.0) then
                do i=1,nopes
                   do j=1,3
                      xl2(j,i)=co(j,konl(ifacet(i,ig)))
                   enddo
                enddo
             else
                if(mass) then
                   do i=1,nopes
                      do j=1,3
                         xl1(j,i)=co(j,konl(ifacet(i,ig)))
                      enddo
                   enddo
                endif
                do i=1,nopes
                   do j=1,3
                      xl2(j,i)=co(j,konl(ifacet(i,ig)))+
     &                     vold(j,konl(ifacet(i,ig)))
                   enddo
                enddo
             endif
          else
             if(iperturb.eq.0) then
                do i=1,nopes
                   do j=1,3
                      xl2(j,i)=co(j,konl(ifacew(i,ig)))
                   enddo
                enddo
             else
                if(mass) then
                   do i=1,nopes
                      do j=1,3
                         xl1(j,i)=co(j,konl(ifacew(i,ig)))
                      enddo
                   enddo
                endif
                do i=1,nopes
                   do j=1,3
                      xl2(j,i)=co(j,konl(ifacew(i,ig)))+
     &                     vold(j,konl(ifacew(i,ig)))
                   enddo
                enddo
             endif
          endif
!
          do i=1,mint2d
             if((lakonl(4:5).eq.'8R').or.
     &            ((lakonl(4:4).eq.'6').and.(nopes.eq.4))) then
                xi=gauss2d1(1,i)
                et=gauss2d1(2,i)
                weight=weight2d1(i)
             elseif((lakonl(4:4).eq.'8').or.
     &               (lakonl(4:6).eq.'20R').or.
     &               ((lakonl(4:5).eq.'15').and.(nopes.eq.8))) then
                xi=gauss2d2(1,i)
                et=gauss2d2(2,i)
                weight=weight2d2(i)
             elseif(lakonl(4:4).eq.'2') then
                xi=gauss2d3(1,i)
                et=gauss2d3(2,i)
                weight=weight2d3(i)
             elseif((lakonl(4:5).eq.'10').or.
     &               ((lakonl(4:5).eq.'15').and.(nopes.eq.6))) then
                xi=gauss2d5(1,i)
                et=gauss2d5(2,i)
                weight=weight2d5(i)
             elseif((lakonl(4:4).eq.'4').or.
     &               ((lakonl(4:4).eq.'6').and.(nopes.eq.3))) then
                xi=gauss2d4(1,i)
                et=gauss2d4(2,i)
                weight=weight2d4(i)
             endif
!
             if(rhs) then
                if(nopes.eq.8) then
                   call shape8q(xi,et,xl2,xsj2,shp2)
                elseif(nopes.eq.4) then
                   call shape4q(xi,et,xl2,xsj2,shp2)
                elseif(nopes.eq.6) then
                   call shape6tri(xi,et,xl2,xsj2,shp2)
                else
                   call shape3tri(xi,et,xl2,xsj2,shp2)
                endif
!
                do k=1,nopes
                   if((nope.eq.20).or.(nope.eq.8)) then
                      ipointer=(ifaceq(k,ig)-1)*3
                   elseif((nope.eq.10).or.(nope.eq.4)) then
                      ipointer=(ifacet(k,ig)-1)*3
                   else
                      ipointer=(ifacew(k,ig)-1)*3
                   endif
                   f(ipointer+1)=f(ipointer+1)-shp2(4,k)*xload(1,id)
     &                  *xsj2(1)*weight
                   f(ipointer+2)=f(ipointer+2)-shp2(4,k)*xload(1,id)
     &                  *xsj2(2)*weight
                   f(ipointer+3)=f(ipointer+3)-shp2(4,k)*xload(1,id)
     &                  *xsj2(3)*weight
                   ff(ipointer+1)=ff(ipointer+1)-shp2(4,k)*xload(1,id)
     &                  *xsj2(1)*weight
                   ff(ipointer+2)=ff(ipointer+2)-shp2(4,k)*xload(1,id)
     &                  *xsj2(2)*weight
                   ff(ipointer+3)=ff(ipointer+3)-shp2(4,k)*xload(1,id)
     &                  *xsj2(3)*weight
                enddo
!
!            stiffness contribution of the distributed load 
!
             elseif(mass) then
                if(nopes.eq.8) then
                   call shape8q(xi,et,xl1,xsj2,shp2)
                elseif(nopes.eq.4) then
                   call shape4q(xi,et,xl1,xsj2,shp2)
                elseif(nopes.eq.6) then
                   call shape6tri(xi,et,xl1,xsj2,shp2)
                else
                   call shape3tri(xi,et,xl1,xsj2,shp2)
                endif
!
!               calculation of the deformation gradient
!
                do k=1,3
                   do l=1,3
                      xkl(k,l)=0.d0
                      do ii=1,nopes
                         xkl(k,l)=xkl(k,l)+shp2(l,ii)*xl2(k,ii)
                      enddo
                   enddo
                enddo
!
                do ii=1,nopes
                   if((nope.eq.20).or.(nope.eq.8)) then
                      ipointeri=(ifaceq(ii,ig)-1)*3
                   elseif((nope.eq.10).or.(nope.eq.4)) then
                     ipointeri=(ifacet(ii,ig)-1)*3
                   else
                      ipointeri=(ifacew(ii,ig)-1)*3
                   endif
                   do jj=1,nopes
                      if((nope.eq.20).or.(nope.eq.8)) then
                         ipointerj=(ifaceq(jj,ig)-1)*3
                      elseif((nope.eq.10).or.(nope.eq.4)) then
                         ipointerj=(ifacet(jj,ig)-1)*3
                      else
                         ipointerj=(ifacew(jj,ig)-1)*3
                      endif
                      do k=1,3
                         do l=1,3
                            if(k.eq.l) cycle
                            if(k*l.eq.2) then
                               n=3
                            elseif(k*l.eq.3) then
                               n=2
                            else
                               n=1
                            endif
                            term=weight*xload(1,id)*shp2(4,jj)*
     &                       (xsj2(1)*
     &                        (xkl(n,2)*shp2(3,ii)-xkl(n,3)*shp2(2,ii))+
     &                        xsj2(2)*
     &                        (xkl(n,3)*shp2(1,ii)-xkl(n,1)*shp2(3,ii))+
     &                        xsj2(3)*
     &                        (xkl(n,1)*shp2(2,ii)-xkl(n,2)*shp2(1,ii)))
                            if(ipointeri+k.le.ipointerj+l) then
                               s(ipointeri+k,ipointerj+l)=
     &                              s(ipointeri+k,ipointerj+l)+term/2.d0
                            else
                               s(ipointerj+l,ipointeri+k)=
     &                              s(ipointerj+l,ipointeri+k)+term/2.d0
                            endif
                         enddo
                      enddo
                   enddo
                enddo
!
             endif
          enddo
!
          id=id-1
       enddo
!
      elseif(mass.and.(iexpl.eq.1)) then
!
!        scaling the diagonal terms of the mass matrix such that the total mass
!        is right (LUMPING; for explicit dynamic calculations)
!
         sume=0.d0
         summ=0.d0
         do i=1,3*nopev,3
            sume=sume+sm(i,i)
         enddo
         do i=3*nopev+1,3*nope,3
            summ=summ+sm(i,i)
         enddo
!
         if(nope.eq.20) then
c            alp=.2215d0
            alp=.2917d0
!              maybe alp=.2917d0 is better??
         elseif(nope.eq.10) then
            alp=0.1203d0
         elseif(nope.eq.15) then
            alp=0.2141d0
         endif
!
         if((nope.eq.20).or.(nope.eq.10).or.(nope.eq.15)) then
            factore=summass*alp/(1.d0+alp)/sume
            factorm=summass/(1.d0+alp)/summ
         else
            factore=summass/sume
         endif
!
         do i=1,3*nopev,3
            sm(i,i)=sm(i,i)*factore
            sm(i+1,i+1)=sm(i,i)
            sm(i+2,i+2)=sm(i,i)
         enddo
         do i=3*nopev+1,3*nope,3
            sm(i,i)=sm(i,i)*factorm
            sm(i+1,i+1)=sm(i,i)
            sm(i+2,i+2)=sm(i,i)
         enddo
!
      endif
!
      return
      end


[-- Attachment #4: gauss.f --]
[-- Type: application/octet-stream, Size: 9346 bytes --]

!
!     contains Gauss point information
!
!     gauss2d1: quad, 1-point integration (1 integration point)
!     gauss2d2: quad, 2-point integration (4 integration points)
!     gauss2d3: quad, 3-point integration (9 integration points)
!     gauss2d4: tri, 1 integration point
!     gauss2d5: tri, 3 integration points
!     gauss3d1: hex, 1-point integration (1 integration point)
!     gauss3d2: hex, 2-point integration (8 integration points)
!     gauss3d3: hex, 3-point integration (27 integration points)
!     gauss3d4: tet, 1 integration point
!     gauss3d5: tet, 4 integration points
!     gauss3d6: tet, 15 integration points
!     gauss3d7: wedge, 2 integration points
!     gauss3d8: wedge, 9 integration points
!     gauss3d9: wedge, 18 integration points
!
!     weight2d1,... contains the weights
!
      data gauss2d1 /0.,0./
!
      data gauss2d2 /
     &  -0.577350269189626d0,-0.577350269189626d0,
     &  -0.577350269189626d0,0.577350269189626d0,
     &  0.577350269189626d0,-0.577350269189626d0,
     &  0.577350269189626d0,0.577350269189626d0/
!
      data gauss2d3 /
     & -0.774596669241483d0,-0.774596669241483d0,
     & -0.774596669241483d0,0.d0,
     & -0.774596669241483d0,0.774596669241483d0,
     & -0.d0,-0.774596669241483d0,
     & -0.d0,0.d0,
     & -0.d0,0.774596669241483d0,
     & 0.774596669241483d0,-0.774596669241483d0,
     & 0.774596669241483d0,0.d0,
     & 0.774596669241483d0,0.774596669241483d0/
!
      data gauss2d4 /0.333333333333333d0,0.333333333333333d0/
!
      data gauss2d5 /.5d0,.5d0,0.d0,.5d0,.5d0,0.d0/
!
      data gauss3d1 /0.,0.,0./
!
      data gauss3d2 /
     &  -0.577350269189626d0,-0.577350269189626d0,-0.577350269189626d0,
     &  0.577350269189626d0,-0.577350269189626d0,-0.577350269189626d0,
     &  -0.577350269189626d0,0.577350269189626d0,-0.577350269189626d0,
     &  0.577350269189626d0,0.577350269189626d0,-0.577350269189626d0,
     &  -0.577350269189626d0,-0.577350269189626d0,0.577350269189626d0,
     &  0.577350269189626d0,-0.577350269189626d0,0.577350269189626d0,
     &  -0.577350269189626d0,0.577350269189626d0,0.577350269189626d0,
     &  0.577350269189626d0,0.577350269189626d0,0.577350269189626d0/
!
      data gauss3d3 /
     & -0.774596669241483d0,-0.774596669241483d0,-0.774596669241483d0,
     & 0.d0,-0.774596669241483d0,-0.774596669241483d0,
     & 0.774596669241483d0,-0.774596669241483d0,-0.774596669241483d0,
     & -0.774596669241483d0,0.d0,-0.774596669241483d0,
     & 0.d0,0.d0,-0.774596669241483d0,
     & 0.774596669241483d0,0.d0,-0.774596669241483d0,
     & -0.774596669241483d0,0.774596669241483d0,-0.774596669241483d0,
     & 0.d0,0.774596669241483d0,-0.774596669241483d0,
     & 0.774596669241483d0,0.774596669241483d0,-0.774596669241483d0,
     & -0.774596669241483d0,-0.774596669241483d0,0.d0,
     & 0.d0,-0.774596669241483d0,0.d0,
     & 0.774596669241483d0,-0.774596669241483d0,0.d0,
     & -0.774596669241483d0,0.d0,0.d0,
     & 0.d0,0.d0,0.d0,
     & 0.774596669241483d0,0.d0,0.d0,
     & -0.774596669241483d0,0.774596669241483d0,0.d0,
     & 0.d0,0.774596669241483d0,0.d0,
     & 0.774596669241483d0,0.774596669241483d0,0.d0,
     & -0.774596669241483d0,-0.774596669241483d0,0.774596669241483d0,
     & 0.d0,-0.774596669241483d0,0.774596669241483d0,
     & 0.774596669241483d0,-0.774596669241483d0,0.774596669241483d0,
     & -0.774596669241483d0,0.d0,0.774596669241483d0,
     & 0.d0,0.d0,0.774596669241483d0,
     & 0.774596669241483d0,0.d0,0.774596669241483d0,
     & -0.774596669241483d0,0.774596669241483d0,0.774596669241483d0,
     & 0.d0,0.774596669241483d0,0.774596669241483d0,
     & 0.774596669241483d0,0.774596669241483d0,0.774596669241483d0/
!
      data gauss3d4 /0.25d0,0.25d0,0.25d0/
!
      data gauss3d5 /
     & 0.138196601125011d0,0.138196601125011d0,0.138196601125011d0,
     & 0.585410196624968d0,0.138196601125011d0,0.138196601125011d0,
     & 0.138196601125011d0,0.585410196624968d0,0.138196601125011d0,
     & 0.138196601125011d0,0.138196601125011d0,0.585410196624968d0/
!
      data gauss3d6 /
     & 0.25,0.25,0.25d0,
     & 0.091971078052723d0,0.091971078052723d0,0.091971078052723d0,
     & 0.091971078052723d0,0.091971078052723d0,0.724086765841831d0,
     & 0.091971078052723d0,0.724086765841831d0,0.091971078052723d0,
     & 0.724086765841831d0,0.091971078052723d0,0.091971078052723d0,
     & 0.319793627829630d0,0.319793627829630d0,0.319793627829630d0,
     & 0.319793627829630d0,0.319793627829630d0,0.040619116511110d0,
     & 0.319793627829630d0,0.040619116511110d0,0.319793627829630d0,
     & 0.040619116511110d0,0.319793627829630d0,0.319793627829630d0,
     & 0.056350832689629d0,0.056350832689629d0,0.443649167310371d0,
     & 0.056350832689629d0,0.443649167310371d0,0.056350832689629d0,
     & 0.443649167310371d0,0.056350832689629d0,0.056350832689629d0,
     & 0.443649167310371d0,0.443649167310371d0,0.056350832689629d0,
     & 0.443649167310371d0,0.056350832689629d0,0.443649167310371d0,
     & 0.056350832689629d0,0.443649167310371d0,0.443649167310371d0/
!
      data gauss3d7 /
     & 0.333333333333333d0,0.333333333333333d0,-0.577350269189626d0,
     & 0.333333333333333d0,0.333333333333333d0,0.577350269189626d0/
!
      data gauss3d8 /
     & 0.166666666666667d0,0.166666666666667d0,-0.774596669241483d0,
     & 0.666666666666667d0,0.166666666666667d0,-0.774596669241483d0,
     & 0.166666666666667d0,0.666666666666667d0,-0.774596669241483d0,
     & 0.166666666666667d0,0.166666666666667d0,0.d0,
     & 0.666666666666667d0,0.166666666666667d0,0.d0,
     & 0.166666666666667d0,0.666666666666667d0,0.d0,
     & 0.166666666666667d0,0.166666666666667d0,0.774596669241483d0,
     & 0.666666666666667d0,0.166666666666667d0,0.774596669241483d0,
     & 0.166666666666667d0,0.666666666666667d0,0.774596669241483d0/
!
      data gauss3d9 /
     & 0.166666666666667d0,0.166666666666667d0,-0.774596669241483d0,
     & 0.166666666666667d0,0.666666666666667d0,-0.774596669241483d0,
     & 0.666666666666667d0,0.166666666666667d0,-0.774596669241483d0,
     & 0.000000000000000d0,0.500000000000000d0,-0.774596669241483d0,
     & 0.500000000000000d0,0.000000000000000d0,-0.774596669241483d0,
     & 0.500000000000000d0,0.500000000000000d0,-0.774596669241483d0,
     & 0.166666666666667d0,0.166666666666667d0,0.d0,
     & 0.166666666666667d0,0.666666666666667d0,0.d0,
     & 0.666666666666667d0,0.166666666666667d0,0.d0,
     & 0.000000000000000d0,0.500000000000000d0,0.d0,
     & 0.500000000000000d0,0.000000000000000d0,0.d0,
     & 0.500000000000000d0,0.500000000000000d0,0.d0,
     & 0.166666666666667d0,0.166666666666667d0,0.774596669241483d0,
     & 0.166666666666667d0,0.666666666666667d0,0.774596669241483d0,
     & 0.666666666666667d0,0.166666666666667d0,0.774596669241483d0,
     & 0.000000000000000d0,0.500000000000000d0,0.774596669241483d0,
     & 0.500000000000000d0,0.000000000000000d0,0.774596669241483d0,
     & 0.500000000000000d0,0.500000000000000d0,0.774596669241483d0/
!
      data weight2d1 /4.d0/
!
      data weight2d2 /1.d0,1.d0,1.d0,1.d0/
!
      data weight2d3 /
     &  0.308641975308642d0,0.493827160493827d0,0.308641975308642d0,
     &  0.493827160493827d0,0.790123456790123d0,0.493827160493827d0,
     &  0.308641975308642d0,0.493827160493827d0,0.308641975308642d0/
!
      data weight2d4 /0.5d0/
!
      data weight2d5 /
     &  0.166666666666666d0,0.166666666666666d0,0.166666666666666d0/
!
      data weight3d1 /8.d0/
!
      data weight3d2 /1.d0,1.d0,1.d0,1.d0,1.d0,1.d0,1.d0,1.d0/
!
      data weight3d3 /
     &  0.171467764060357d0,0.274348422496571d0,0.171467764060357d0,
     &  0.274348422496571d0,0.438957475994513d0,0.274348422496571d0,
     &  0.171467764060357d0,0.274348422496571d0,0.171467764060357d0,
     &  0.274348422496571d0,0.438957475994513d0,0.274348422496571d0,
     &  0.438957475994513d0,0.702331961591221d0,0.438957475994513d0,
     &  0.274348422496571d0,0.438957475994513d0,0.274348422496571d0,
     &  0.171467764060357d0,0.274348422496571d0,0.171467764060357d0,
     &  0.274348422496571d0,0.438957475994513d0,0.274348422496571d0,
     &  0.171467764060357d0,0.274348422496571d0,0.171467764060357d0/
!
      data weight3d4 /0.166666666666667d0/
!
      data weight3d5 /
     &  0.041666666666667d0,0.041666666666667d0,0.041666666666667d0,
     &  0.041666666666667d0/
!
      data weight3d6 /
     &  0.019753086419753d0,0.011989513963170d0,0.011989513963170d0,
     &  0.011989513963170d0,0.011989513963170d0,0.011511367871045d0,
     &  0.011511367871045d0,0.011511367871045d0,0.011511367871045d0,
     &  0.008818342151675d0,0.008818342151675d0,0.008818342151675d0,
     &  0.008818342151675d0,0.008818342151675d0,0.008818342151675d0/
!
      data weight3d7 /0.5d0,0.5d0/
!
      data weight3d8 /
     &  0.092592592592593d0,0.092592592592593d0,0.092592592592593d0,
     &  0.148148148148148d0,0.148148148148148d0,0.148148148148148d0,
     &  0.092592592592593d0,0.092592592592593d0,0.092592592592593d0/
!
      data weight3d9 /
     &  0.083333333333333d0,0.083333333333333d0,0.083333333333333d0,
     &  0.009259259259259d0,0.009259259259259d0,0.009259259259259d0,
     &  0.133333333333333d0,0.133333333333333d0,0.133333333333333d0,
     &  0.014814814814815d0,0.014814814814815d0,0.014814814814815d0,
     &  0.083333333333333d0,0.083333333333333d0,0.083333333333333d0,
     &  0.009259259259259d0,0.009259259259259d0,0.009259259259259d0/
!


* Re: LTO slows down calculix by more than 10% on aarch64
  2020-08-26 10:32 LTO slows down calculix by more than 10% on aarch64 Prathamesh Kulkarni
@ 2020-08-26 11:20 ` Richard Biener
  2020-08-28 11:16   ` Prathamesh Kulkarni
  0 siblings, 1 reply; 25+ messages in thread
From: Richard Biener @ 2020-08-26 11:20 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: GCC Development

On Wed, Aug 26, 2020 at 12:34 PM Prathamesh Kulkarni via Gcc
<gcc@gcc.gnu.org> wrote:
>
> Hi,
> We're seeing a consistent regression >10% on calculix with -O2 -flto vs -O2
> on aarch64 in our validation CI. I tried to investigate this issue a
> bit, and it seems the regression comes from inlining of orthonl into
> e_c3d. Disabling that brings back the performance. However, inlining
> orthonl into e_c3d, increases it's size from 3187 to 3837 by around
> 16.9% which isn't too large.
>
> I have attached two test-cases, e_c3d.f that has orthonl manually
> inlined into e_c3d to "simulate" LTO's inlining, and e_c3d-orig.f,
> which contains unmodified function.
> (gauss.f is included by e_c3d.f). For reproducing, just passing -O2 is
> sufficient.
>
> It seems that inlining orthonl, causes 20 hoistings into block 181,
> which are then hoisted to block 173, in particular hoistings of w(1,
> 1) ... w(3, 3), which wasn't
> possible without inlining. The hoistings happen because of basic block
> that computes orthonl in line 672 has w(1, 1) ... w(3, 3) and the
> following block in line 1035 in e_c3d.f:
>
> senergy=
>      &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
>      &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
>      &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
>
> Disabling hoisting into blocks 173 (and 181), brings back most of the
> performance. I am not able to understand why (if?) these hoistings of
> w(1, 1) ...
> w(3, 3) are causing slowdown however. Looking at assembly, the hot
> code-path from perf in e_c3d shows following code-gen diff:
> For inlined version:
> .L122:
>         ldr     d15, [x1, -248]
>         add     w0, w0, 1
>         add     x2, x2, 24
>         add     x1, x1, 72
>         fmul    d15, d17, d15
>         fmul    d15, d15, d18
>         fmul    d14, d15, d14
>         fmadd   d16, d14, d31, d16
>         cmp     w0, 4
>         beq     .L121
>         ldr     d14, [x2, -8]
>         b       .L122
>
> and for non-inlined version:
> .L118:
>         ldr     d0, [x1, -248]
>         add     w0, w0, 1
>         ldr     d2, [x2, -8]
>         add     x1, x1, 72
>         add     x2, x2, 24
>         fmul    d0, d3, d0
>         fmul    d0, d0, d5
>         fmul    d0, d0, d2
>         fmadd   d1, d4, d0, d1
>         cmp     w0, 4
>         bne     .L118

I wonder if you have profiles.  The inlined version has a
non-empty latch block (looks like some PRE is happening
there?).  Perhaps your uarch does not like the closely spaced
branches (does your assembly show the layout as it is?).

> which corresponds to the following loop in line 1014.
>                                 do n1=1,3
>                                   s(iii1,jjj1)=s(iii1,jjj1)
>      &                                  +anisox(m1,k1,n1,l1)
>      &                                  *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
>      &                                  *weight
>
> I am not sure why would hoisting have any direct effect on this loop
> except perhaps that hoisting allocated more reigsters, and led to
> increased register pressure. Perhaps that's why it's using highered
> number regs for code-gen in inlined version ? However disabling
> hoisting in blocks 173 and 181, also leads to overall 6 extra spills
> (by grepping for str to sp), so
> hoisting is also helping here ? I am not sure how to proceed further,
> and would be grateful for suggestions.
>
> Thanks,
> Prathamesh


* Re: LTO slows down calculix by more than 10% on aarch64
  2020-08-26 11:20 ` Richard Biener
@ 2020-08-28 11:16   ` Prathamesh Kulkarni
  2020-08-28 11:57     ` Richard Biener
  2020-08-28 12:03     ` Alexander Monakov
  0 siblings, 2 replies; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-08-28 11:16 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Development

On Wed, 26 Aug 2020 at 16:50, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Wed, Aug 26, 2020 at 12:34 PM Prathamesh Kulkarni via Gcc
> <gcc@gcc.gnu.org> wrote:
> >
> > Hi,
> > We're seeing a consistent regression >10% on calculix with -O2 -flto vs -O2
> > on aarch64 in our validation CI. I tried to investigate this issue a
> > bit, and it seems the regression comes from inlining of orthonl into
> > e_c3d. Disabling that brings back the performance. However, inlining
> > orthonl into e_c3d, increases it's size from 3187 to 3837 by around
> > 16.9% which isn't too large.
> >
> > I have attached two test-cases, e_c3d.f that has orthonl manually
> > inlined into e_c3d to "simulate" LTO's inlining, and e_c3d-orig.f,
> > which contains unmodified function.
> > (gauss.f is included by e_c3d.f). For reproducing, just passing -O2 is
> > sufficient.
> >
> > It seems that inlining orthonl, causes 20 hoistings into block 181,
> > which are then hoisted to block 173, in particular hoistings of w(1,
> > 1) ... w(3, 3), which wasn't
> > possible without inlining. The hoistings happen because of basic block
> > that computes orthonl in line 672 has w(1, 1) ... w(3, 3) and the
> > following block in line 1035 in e_c3d.f:
> >
> > senergy=
> >      &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> >      &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> >      &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> >
> > Disabling hoisting into blocks 173 (and 181), brings back most of the
> > performance. I am not able to understand why (if?) these hoistings of
> > w(1, 1) ...
> > w(3, 3) are causing slowdown however. Looking at assembly, the hot
> > code-path from perf in e_c3d shows following code-gen diff:
> > For inlined version:
> > .L122:
> >         ldr     d15, [x1, -248]
> >         add     w0, w0, 1
> >         add     x2, x2, 24
> >         add     x1, x1, 72
> >         fmul    d15, d17, d15
> >         fmul    d15, d15, d18
> >         fmul    d14, d15, d14
> >         fmadd   d16, d14, d31, d16
> >         cmp     w0, 4
> >         beq     .L121
> >         ldr     d14, [x2, -8]
> >         b       .L122
> >
> > and for non-inlined version:
> > .L118:
> >         ldr     d0, [x1, -248]
> >         add     w0, w0, 1
> >         ldr     d2, [x2, -8]
> >         add     x1, x1, 72
> >         add     x2, x2, 24
> >         fmul    d0, d3, d0
> >         fmul    d0, d0, d5
> >         fmul    d0, d0, d2
> >         fmadd   d1, d4, d0, d1
> >         cmp     w0, 4
> >         bne     .L118
>
> I wonder if you have profiles.  The inlined version has a
> non-empty latch block (looks like some PRE is happening
> there?).  Perhaps your uarch does not like the closely spaced
> branches (does your assembly show the layout as it is?).
Hi Richard,
I have uploaded profiles obtained by perf here:
-O2: https://people.linaro.org/~prathamesh.kulkarni/o2_perf.data
-O2 -flto: https://people.linaro.org/~prathamesh.kulkarni/o2_lto_perf.data
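
(The per-instruction listings below can be reproduced from these files
with something like

perf annotate -i o2_perf.data --stdio e_c3d_
perf annotate -i o2_lto_perf.data --stdio e_c3d_

where the e_c3d_ symbol name just assumes gfortran's usual name
mangling.)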

For the above loop, it shows the following:
-O2:
  0.01 │ f1c:   ldur   d0, [x1, #-248]
  3.53 │        add    w0, w0, #0x1
       │        ldur   d2, [x2, #-8]
  3.54 │        add    x1, x1, #0x48
       │        add    x2, x2, #0x18
  5.89 │        fmul   d0, d3, d0
 14.12 │        fmul   d0, d0, d5
 14.14 │        fmul   d0, d0, d2
 14.13 │        fmadd  d1, d4, d0, d1
  0.00 │        cmp    w0, #0x4
  3.52 │      ↑ b.ne   f1c

-O2 -flto:
  5.47 │ 1124:  ldur   d15, [x1, #-248]
  2.19 │        add    w0, w0, #0x1
  1.10 │        add    x2, x2, #0x18
  2.18 │        add    x1, x1, #0x48
  4.37 │        fmul   d15, d17, d15
 13.13 │        fmul   d15, d15, d18
 13.13 │        fmul   d14, d15, d14
 13.14 │        fmadd  d16, d14, d31, d16
       │        cmp    w0, #0x4
  3.28 │      ↓ b.eq   1154
  0.00 │        ldur   d14, [x2, #-8]
  2.19 │      ↑ b      1124

IIUC, the biggest relative difference comes from the load [x1, #-248],
which in LTO's case takes 5.47% of overall samples:
  5.47 │ 1124:  ldur   d15, [x1, #-248]
while in the case of -O2, it's just 0.01%:
  0.01 │ f1c:   ldur   d0, [x1, #-248]

I wonder if that's (one of) the main factor(s) behind the slowdown, or
whether it's not too relevant?

Thanks,
Prathamesh
>
> > which corresponds to the following loop in line 1014.
> >                                 do n1=1,3
> >                                   s(iii1,jjj1)=s(iii1,jjj1)
> >      &                                  +anisox(m1,k1,n1,l1)
> >      &                                  *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
> >      &                                  *weight
> >
> > I am not sure why would hoisting have any direct effect on this loop
> > except perhaps that hoisting allocated more reigsters, and led to
> > increased register pressure. Perhaps that's why it's using highered
> > number regs for code-gen in inlined version ? However disabling
> > hoisting in blocks 173 and 181, also leads to overall 6 extra spills
> > (by grepping for str to sp), so
> > hoisting is also helping here ? I am not sure how to proceed further,
> > and would be grateful for suggestions.
> >
> > Thanks,
> > Prathamesh


* Re: LTO slows down calculix by more than 10% on aarch64
  2020-08-28 11:16   ` Prathamesh Kulkarni
@ 2020-08-28 11:57     ` Richard Biener
  2020-08-31 11:21       ` Prathamesh Kulkarni
  2020-08-28 12:03     ` Alexander Monakov
  1 sibling, 1 reply; 25+ messages in thread
From: Richard Biener @ 2020-08-28 11:57 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: GCC Development

On Fri, Aug 28, 2020 at 1:17 PM Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Wed, 26 Aug 2020 at 16:50, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> > On Wed, Aug 26, 2020 at 12:34 PM Prathamesh Kulkarni via Gcc
> > <gcc@gcc.gnu.org> wrote:
> > >
> > > Hi,
> > > We're seeing a consistent regression >10% on calculix with -O2 -flto vs -O2
> > > on aarch64 in our validation CI. I tried to investigate this issue a
> > > bit, and it seems the regression comes from inlining of orthonl into
> > > e_c3d. Disabling that brings back the performance. However, inlining
> > > orthonl into e_c3d, increases it's size from 3187 to 3837 by around
> > > 16.9% which isn't too large.
> > >
> > > I have attached two test-cases, e_c3d.f that has orthonl manually
> > > inlined into e_c3d to "simulate" LTO's inlining, and e_c3d-orig.f,
> > > which contains unmodified function.
> > > (gauss.f is included by e_c3d.f). For reproducing, just passing -O2 is
> > > sufficient.
> > >
> > > It seems that inlining orthonl, causes 20 hoistings into block 181,
> > > which are then hoisted to block 173, in particular hoistings of w(1,
> > > 1) ... w(3, 3), which wasn't
> > > possible without inlining. The hoistings happen because of basic block
> > > that computes orthonl in line 672 has w(1, 1) ... w(3, 3) and the
> > > following block in line 1035 in e_c3d.f:
> > >
> > > senergy=
> > >      &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > >      &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > >      &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > >
> > > Disabling hoisting into blocks 173 (and 181), brings back most of the
> > > performance. I am not able to understand why (if?) these hoistings of
> > > w(1, 1) ...
> > > w(3, 3) are causing slowdown however. Looking at assembly, the hot
> > > code-path from perf in e_c3d shows following code-gen diff:
> > > For inlined version:
> > > .L122:
> > >         ldr     d15, [x1, -248]
> > >         add     w0, w0, 1
> > >         add     x2, x2, 24
> > >         add     x1, x1, 72
> > >         fmul    d15, d17, d15
> > >         fmul    d15, d15, d18
> > >         fmul    d14, d15, d14
> > >         fmadd   d16, d14, d31, d16
> > >         cmp     w0, 4
> > >         beq     .L121
> > >         ldr     d14, [x2, -8]
> > >         b       .L122
> > >
> > > and for non-inlined version:
> > > .L118:
> > >         ldr     d0, [x1, -248]
> > >         add     w0, w0, 1
> > >         ldr     d2, [x2, -8]
> > >         add     x1, x1, 72
> > >         add     x2, x2, 24
> > >         fmul    d0, d3, d0
> > >         fmul    d0, d0, d5
> > >         fmul    d0, d0, d2
> > >         fmadd   d1, d4, d0, d1
> > >         cmp     w0, 4
> > >         bne     .L118
> >
> > I wonder if you have profiles.  The inlined version has a
> > non-empty latch block (looks like some PRE is happening
> > there?).  Perhaps your uarch does not like the closely spaced
> > branches (does your assembly show the layout as it is?).
> Hi Richard,
> I have uploaded profiles obtained by perf here:
> -O2: https://people.linaro.org/~prathamesh.kulkarni/o2_perf.data
> -O2 -flto: https://people.linaro.org/~prathamesh.kulkarni/o2_lto_perf.data
>
> For the above loop, it shows the following:
> -O2:
>   0.01 │ f1c:   ldur   d0, [x1, #-248]
>   3.53 │        add    w0, w0, #0x1
>        │        ldur   d2, [x2, #-8]
>   3.54 │        add    x1, x1, #0x48
>        │        add    x2, x2, #0x18
>   5.89 │        fmul   d0, d3, d0
>  14.12 │        fmul   d0, d0, d5
>  14.14 │        fmul   d0, d0, d2
>  14.13 │        fmadd  d1, d4, d0, d1
>   0.00 │        cmp    w0, #0x4
>   3.52 │      ↑ b.ne   f1c
>
> -O2 -flto:
>   5.47 │ 1124:  ldur   d15, [x1, #-248]
>   2.19 │        add    w0, w0, #0x1
>   1.10 │        add    x2, x2, #0x18
>   2.18 │        add    x1, x1, #0x48
>   4.37 │        fmul   d15, d17, d15
>  13.13 │        fmul   d15, d15, d18
>  13.13 │        fmul   d14, d15, d14
>  13.14 │        fmadd  d16, d14, d31, d16
>        │        cmp    w0, #0x4
>   3.28 │      ↓ b.eq   1154
>   0.00 │        ldur   d14, [x2, #-8]
>   2.19 │      ↑ b      1124
>
> IIUC, the biggest relative difference comes from the load [x1, #-248],
> which in LTO's case takes 5.47% of overall samples:
>   5.47 │ 1124:  ldur   d15, [x1, #-248]
> while in the case of -O2, it's just 0.01%:
>   0.01 │ f1c:   ldur   d0, [x1, #-248]
>
> I wonder if that's (one of) the main factor(s) behind the slowdown, or
> whether it's not too relevant?

This looks more like the branch, since branch costs are usually
attributed to the target rather than the branch itself.  You could
try re-ordering the code so the loop entry jumps around the
latch, which can then fall thru, to see if that makes a difference.

Richard.

> Thanks,
> Prathamesh
> >
> > > which corresponds to the following loop in line 1014.
> > >                                 do n1=1,3
> > >                                   s(iii1,jjj1)=s(iii1,jjj1)
> > >      &                                  +anisox(m1,k1,n1,l1)
> > >      &                                  *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
> > >      &                                  *weight
> > >
> > > I am not sure why would hoisting have any direct effect on this loop
> > > except perhaps that hoisting allocated more reigsters, and led to
> > > increased register pressure. Perhaps that's why it's using highered
> > > number regs for code-gen in inlined version ? However disabling
> > > hoisting in blocks 173 and 181, also leads to overall 6 extra spills
> > > (by grepping for str to sp), so
> > > hoisting is also helping here ? I am not sure how to proceed further,
> > > and would be grateful for suggestions.
> > >
> > > Thanks,
> > > Prathamesh


* Re: LTO slows down calculix by more than 10% on aarch64
  2020-08-28 11:16   ` Prathamesh Kulkarni
  2020-08-28 11:57     ` Richard Biener
@ 2020-08-28 12:03     ` Alexander Monakov
  2020-08-31 11:23       ` Prathamesh Kulkarni
  1 sibling, 1 reply; 25+ messages in thread
From: Alexander Monakov @ 2020-08-28 12:03 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: Richard Biener, GCC Development

On Fri, 28 Aug 2020, Prathamesh Kulkarni via Gcc wrote:

> I wonder if that's (one of) the main factor(s) behind the slowdown, or
> whether it's not too relevant?

Probably not. Some advice to make your search more directed:

Pass '-n' to 'perf report'. Relative sample ratios are hard to reason about
when they are computed against different bases; it's much easier to see that
a loop is slowing down if it went from 4000 to 4500 in absolute sample count
as opposed to from 90% to 91% in relative sample ratio.

Before diving into 'perf report', be sure to fully account for differences
in 'perf stat' output. Do the programs execute the same number of instructions,
so the difference is only in scheduling? Do the programs suffer from the same
number of branch mispredictions? Please show the output of 'perf stat' on the
mailing list too, so everyone is on the same page about that.

I also suspect that the dramatic slowdown has to do with the extra branch.
Your CPU might have some specialized counters for branch prediction, see
'perf list'.
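
Concretely, something along these lines (event availability and the exact
benchmark invocations are placeholders; adjust for your setup):

  perf report -n               # adds an absolute sample-count column
  perf stat -e instructions,cycles,branches,branch-misses ./calculix-o2
  perf stat -e instructions,cycles,branches,branch-misses ./calculix-lto
  perf list | grep -i branch   # branch-prediction events on this CPU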

Alexander


* Re: LTO slows down calculix by more than 10% on aarch64
  2020-08-28 11:57     ` Richard Biener
@ 2020-08-31 11:21       ` Prathamesh Kulkarni
  2020-08-31 11:40         ` Jan Hubicka
  0 siblings, 1 reply; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-08-31 11:21 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Development

On Fri, 28 Aug 2020 at 17:27, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Fri, Aug 28, 2020 at 1:17 PM Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Wed, 26 Aug 2020 at 16:50, Richard Biener <richard.guenther@gmail.com> wrote:
> > >
> > > On Wed, Aug 26, 2020 at 12:34 PM Prathamesh Kulkarni via Gcc
> > > <gcc@gcc.gnu.org> wrote:
> > > >
> > > > Hi,
> > > > We're seeing a consistent regression >10% on calculix with -O2 -flto vs -O2
> > > > on aarch64 in our validation CI. I tried to investigate this issue a
> > > > bit, and it seems the regression comes from inlining of orthonl into
> > > > e_c3d. Disabling that brings back the performance. However, inlining
> > > > orthonl into e_c3d increases its size from 3187 to 3837, by around
> > > > 16.9%, which isn't too large.
> > > >
> > > > I have attached two test-cases, e_c3d.f that has orthonl manually
> > > > inlined into e_c3d to "simulate" LTO's inlining, and e_c3d-orig.f,
> > > > which contains unmodified function.
> > > > (gauss.f is included by e_c3d.f). For reproducing, just passing -O2 is
> > > > sufficient.
> > > >
> > > > It seems that inlining orthonl causes 20 hoistings into block 181,
> > > > which are then hoisted to block 173, in particular hoistings of
> > > > w(1, 1) ... w(3, 3), which weren't possible without inlining. The
> > > > hoistings happen because the basic block that computes orthonl at
> > > > line 672 has w(1, 1) ... w(3, 3), and so does the following block at
> > > > line 1035 in e_c3d.f:
> > > >
> > > > senergy=
> > > >      &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > >      &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > >      &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > >
> > > > Disabling hoisting into blocks 173 (and 181) brings back most of the
> > > > performance. However, I am not able to understand why (or whether)
> > > > these hoistings of w(1, 1) ... w(3, 3) are causing the slowdown.
> > > > Looking at the assembly, the hot code-path from perf in e_c3d shows
> > > > the following code-gen diff:
> > > > For inlined version:
> > > > .L122:
> > > >         ldr     d15, [x1, -248]
> > > >         add     w0, w0, 1
> > > >         add     x2, x2, 24
> > > >         add     x1, x1, 72
> > > >         fmul    d15, d17, d15
> > > >         fmul    d15, d15, d18
> > > >         fmul    d14, d15, d14
> > > >         fmadd   d16, d14, d31, d16
> > > >         cmp     w0, 4
> > > >         beq     .L121
> > > >         ldr     d14, [x2, -8]
> > > >         b       .L122
> > > >
> > > > and for non-inlined version:
> > > > .L118:
> > > >         ldr     d0, [x1, -248]
> > > >         add     w0, w0, 1
> > > >         ldr     d2, [x2, -8]
> > > >         add     x1, x1, 72
> > > >         add     x2, x2, 24
> > > >         fmul    d0, d3, d0
> > > >         fmul    d0, d0, d5
> > > >         fmul    d0, d0, d2
> > > >         fmadd   d1, d4, d0, d1
> > > >         cmp     w0, 4
> > > >         bne     .L118
> > >
> > > I wonder if you have profiles.  The inlined version has a
> > > non-empty latch block (looks like some PRE is happening
> > > there?).  Perhaps your uarch does not like the close branches
> > > (does your assembly show the layout as it is?).
> > Hi Richard,
> > I have uploaded profiles obtained by perf here:
> > -O2: https://people.linaro.org/~prathamesh.kulkarni/o2_perf.data
> > -O2 -flto: https://people.linaro.org/~prathamesh.kulkarni/o2_lto_perf.data
> >
> > For the above loop, it shows the following:
> > -O2:
> >   0.01 │ f1c:  ldur   d0, [x1, #-248]
> >   3.53 │        add    w0, w0, #0x1
> >           │        ldur   d2, [x2, #-8]
> >   3.54 │        add    x1, x1, #0x48
> >           │        add    x2, x2, #0x18
> >   5.89 │        fmul   d0, d3, d0
> > 14.12 │        fmul   d0, d0, d5
> > 14.14 │        fmul   d0, d0, d2
> > 14.13 │        fmadd  d1, d4, d0, d1
> >   0.00 │        cmp    w0, #0x4
> >   3.52 │      ↑ b.ne   f1c
> >
> > -O2 -flto:
> >   5.47  |1124:    ldur   d15, [x1, #-248]
> >   2.19  │            add    w0, w0, #0x1
> >   1.10  │            add    x2, x2, #0x18
> >   2.18  │            add    x1, x1, #0x48
> >   4.37  │            fmul   d15, d17, d15
> >  13.13 │            fmul   d15, d15, d18
> >  13.13 │            fmul   d14, d15, d14
> >  13.14 │            fmadd  d16, d14, d31, d16
> >            │            cmp    w0, #0x4
> >   3.28  │            ↓ b.eq   1154
> >   0.00  │            ldur   d14, [x2, #-8]
> >   2.19  │            ↑ b      1124
> >
> > IIUC, the biggest relative difference comes from load [x1, #-248]
> > which in LTO's case takes 5.47% of overall samples:
> > 5.47  |1124:   ldur   d15, [x1, #-248]
> > while in case of -O2, it's just 0.01:
> >  0.01 │ f1c:   ldur   d0, [x1, #-248]
> >
> > I wonder if that's (one of) the main factor(s) behind the slowdown, or
> > if it's not too relevant?
>
> This looks more like the branch, since branch costs are usually
> attributed to the target rather than the branch itself.  You could
> try re-ordering the code so that the loop entry jumps around the
> latch, which can then fall thru, to see if that makes a difference.
Thanks for the suggestions.
Is it possible to modify the assembly files emitted after the ltrans
phase? IIUC, the linker invokes lto1 twice, for wpa and ltrans, and then
links the resulting object files, which makes it impossible to hand-edit
the assembly files post-ltrans?
In particular, I wanted to modify calculix.ltrans16.ltrans.s, which
contains e_c3d, to avoid the extra branch.
(If that doesn't work out, I can proceed with manually inlining in the
source and then modifying the generated assembly.)

Thanks,
Prathamesh
>
> Richard.
>
> > Thanks,
> > Prathamesh
> > >
> > > > which corresponds to the following loop in line 1014.
> > > >                                 do n1=1,3
> > > >                                   s(iii1,jjj1)=s(iii1,jjj1)
> > > >      &                                  +anisox(m1,k1,n1,l1)
> > > >      &                                  *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
> > > >      &                                  *weight
> > > >
> > > > I am not sure why hoisting would have any direct effect on this
> > > > loop, except perhaps that hoisting allocated more registers and led
> > > > to increased register pressure. Perhaps that's why it's using
> > > > higher-numbered regs for code-gen in the inlined version? However,
> > > > disabling hoisting in blocks 173 and 181 also leads to 6 extra
> > > > spills overall (found by grepping for str to sp), so hoisting is
> > > > also helping here? I am not sure how to proceed further, and would
> > > > be grateful for suggestions.
> > > >
> > > > Thanks,
> > > > Prathamesh


* Re: LTO slows down calculix by more than 10% on aarch64
  2020-08-28 12:03     ` Alexander Monakov
@ 2020-08-31 11:23       ` Prathamesh Kulkarni
  2020-09-04  9:52         ` Prathamesh Kulkarni
  0 siblings, 1 reply; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-08-31 11:23 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Richard Biener, GCC Development

On Fri, 28 Aug 2020 at 17:33, Alexander Monakov <amonakov@ispras.ru> wrote:
>
> On Fri, 28 Aug 2020, Prathamesh Kulkarni via Gcc wrote:
>
> > I wonder if that's (one of) the main factor(s) behind the slowdown, or
> > if it's not too relevant?
>
> Probably not. Some advice to make your search more directed:
>
> Pass '-n' to 'perf report'. Relative sample ratios are hard to reason about
> when they are computed against different bases; it's much easier to see that
> a loop is slowing down if it went from 4000 to 4500 in absolute sample count
> as opposed to 90% to 91% in relative sample ratio.
>
> Before diving into 'perf report', be sure to fully account for differences
> in 'perf stat' output. Do the programs execute the same number of instructions,
> so that the difference is only in scheduling? Do the programs suffer from the
> same number of branch mispredictions? Please show the output of 'perf stat' on
> the mailing list too, so everyone is on the same page about that.
>
> I also suspect that the dramatic slowdown has to do with the extra branch.
> Your CPU might have some specialized counters for branch prediction, see
> 'perf list'.
Hi Alexander,
Thanks for the suggestions! I am in the process of doing the
benchmarking experiments,
and will post the results soon.

Thanks,
Prathamesh
>
> Alexander


* Re: LTO slows down calculix by more than 10% on aarch64
  2020-08-31 11:21       ` Prathamesh Kulkarni
@ 2020-08-31 11:40         ` Jan Hubicka
  0 siblings, 0 replies; 25+ messages in thread
From: Jan Hubicka @ 2020-08-31 11:40 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: Richard Biener, GCC Development

> Thanks for the suggestions.
> Is it possible to modify the assembly files emitted after the ltrans
> phase? IIUC, the linker invokes lto1 twice, for wpa and ltrans, and then
> links the resulting object files, which makes it impossible to hand-edit
> the assembly files post-ltrans?
> In particular, I wanted to modify calculix.ltrans16.ltrans.s, which
> contains e_c3d, to avoid the extra branch.
> (If that doesn't work out, I can proceed with manually inlining in the
> source and then modifying the generated assembly.)

It is not intended to work that way, but for a smaller benchmark you can
just keep the .s files, modify them, and then compile again with
'gfortran *.s' or so.
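
For example (file names here are just placeholders for the actual sources):

  gfortran -O2 -S e_c3d.f                  # emit e_c3d.s instead of an object
  $EDITOR e_c3d.s                          # hand-edit the loop layout
  gfortran -O2 e_c3d.s rest.f -o calculix  # reassemble and link the edited .s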

Honza
> 
> Thanks,
> Prathamesh
> >
> > Richard.
> >
> > > Thanks,
> > > Prathamesh
> > > >
> > > > > which corresponds to the following loop in line 1014.
> > > > >                                 do n1=1,3
> > > > >                                   s(iii1,jjj1)=s(iii1,jjj1)
> > > > >      &                                  +anisox(m1,k1,n1,l1)
> > > > >      &                                  *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
> > > > >      &                                  *weight
> > > > >
> > > > > I am not sure why would hoisting have any direct effect on this loop
> > > > > except perhaps that hoisting allocated more reigsters, and led to
> > > > > increased register pressure. Perhaps that's why it's using highered
> > > > > number regs for code-gen in inlined version ? However disabling
> > > > > hoisting in blocks 173 and 181, also leads to overall 6 extra spills
> > > > > (by grepping for str to sp), so
> > > > > hoisting is also helping here ? I am not sure how to proceed further,
> > > > > and would be grateful for suggestions.
> > > > >
> > > > > Thanks,
> > > > > Prathamesh


* Re: LTO slows down calculix by more than 10% on aarch64
  2020-08-31 11:23       ` Prathamesh Kulkarni
@ 2020-09-04  9:52         ` Prathamesh Kulkarni
  2020-09-04 11:38           ` Alexander Monakov
  0 siblings, 1 reply; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-09-04  9:52 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Richard Biener, GCC Development

On Mon, 31 Aug 2020 at 16:53, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Fri, 28 Aug 2020 at 17:33, Alexander Monakov <amonakov@ispras.ru> wrote:
> >
> > On Fri, 28 Aug 2020, Prathamesh Kulkarni via Gcc wrote:
> >
> > > I wonder if that's (one of) the main factor(s) behind the slowdown, or
> > > if it's not too relevant?
> >
> > Probably not. Some advice to make your search more directed:
> >
> > Pass '-n' to 'perf report'. Relative sample ratios are hard to reason about
> > when they are computed against different bases; it's much easier to see that
> > a loop is slowing down if it went from 4000 to 4500 in absolute sample count
> > as opposed to 90% to 91% in relative sample ratio.
> >
> > Before diving into 'perf report', be sure to fully account for differences
> > in 'perf stat' output. Do the programs execute the same number of instructions,
> > so that the difference is only in scheduling? Do the programs suffer from the
> > same number of branch mispredictions? Please show the output of 'perf stat' on
> > the mailing list too, so everyone is on the same page about that.
> >
> > I also suspect that the dramatic slowdown has to do with the extra branch.
> > Your CPU might have some specialized counters for branch prediction, see
> > 'perf list'.
> Hi Alexander,
> Thanks for the suggestions! I am in the process of doing the
> benchmarking experiments,
> and will post the results soon.
Hi,
I obtained perf stat results for the following benchmark runs:

-O2:

    7856832.692380      task-clock (msec)         #    1.000 CPUs utilized
              3758      context-switches          #    0.000 K/sec
                40      cpu-migrations            #    0.000 K/sec
             40847      page-faults               #    0.005 K/sec
     7856782413676      cycles                    #    1.000 GHz
     6034510093417      instructions              #    0.77  insn per cycle
      363937274287      branches                  #   46.321 M/sec
       48557110132      branch-misses             #   13.34% of all branches

-O2 with orthonl inlined:

    8319643.114380      task-clock (msec)         #    1.000 CPUs utilized
              4285      context-switches          #    0.001 K/sec
                28      cpu-migrations            #    0.000 K/sec
             40843      page-faults               #    0.005 K/sec
     8319591038295      cycles                    #    1.000 GHz
     6276338800377      instructions              #    0.75  insn per cycle
      467400726106      branches                  #   56.180 M/sec
       45986364011      branch-misses             #    9.84% of all branches

-O2 with orthonl inlined and PRE disabled (this removes the extra branches):

    8207331.088040      task-clock (msec)         #    1.000 CPUs utilized
              2266      context-switches          #    0.000 K/sec
                32      cpu-migrations            #    0.000 K/sec
             40846      page-faults               #    0.005 K/sec
     8207292032467      cycles                    #    1.000 GHz
     6035724436440      instructions              #    0.74  insn per cycle
      364415440156      branches                  #   44.401 M/sec
       53138327276      branch-misses             #   14.58% of all branches

-O2 with orthonl inlined and hoisting disabled:

    7797265.206850      task-clock (msec)         #    1.000 CPUs utilized
              3139      context-switches          #    0.000 K/sec
                20      cpu-migrations            #    0.000 K/sec
             40846      page-faults               #    0.005 K/sec
     7797221351467      cycles                    #    1.000 GHz
     6187348757324      instructions              #    0.79  insn per cycle
      461840800061      branches                  #   59.231 M/sec
       26920311761      branch-misses             #    5.83% of all branches

Perf profiles for
-O2 -fno-code-hoisting and inlined orthonl:
https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data

          3196866 |1f04:    ldur   d1, [x1, #-248]
216348301800│            add    w0, w0, #0x1
            985098 |            add    x2, x2, #0x18
216215999206│            add    x1, x1, #0x48
215630376504│            fmul   d1, d5, d1
863829148015│            fmul   d1, d1, d6
864228353526│            fmul   d0, d1, d0
864568163014│            fmadd  d2, d0, d16, d2
                        │             cmp    w0, #0x4
216125427594│          ↓ b.eq   1f34
        15010377│             ldur   d0, [x2, #-8]
143753737468│          ↑ b      1f04

-O2 with inlined orthonl:
https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data

359871503840│ 1ef8:   ldur   d15, [x1, #-248]
144055883055│            add    w0, w0, #0x1
  72262104254│            add    x2, x2, #0x18
143991169721│            add    x1, x1, #0x48
288648917780│            fmul   d15, d17, d15
864665644756│            fmul   d15, d15, d18
863868426387│            fmul   d14, d15, d14
865228159813│            fmadd  d16, d14, d31, d16
            245967│            cmp    w0, #0x4
215396760545│         ↓ b.eq   1f28
      704732365│            ldur   d14, [x2, #-8]
143775979620│         ↑ b      1ef8

AFAIU,
(a) Disabling PRE results in removal of the extra branch around the loop,
but that gives only a slight performance increase (around 1.3%).

(b) Disabling hoisting brings performance back to (slightly more than)
-O2 without inlining orthonl. The generated code for the loop has a
similar layout to -O2 with inlined orthonl, but uses low-numbered
regs. Again, not sure if it's relevant, but the load from [x1, #-248]
seems to take much less time with hoisting disabled. I tried to
check if this was possibly an alignment issue, but that seems not to be
the case, because in both cases (with / without hoisting) the address
pointed to by x1 was aligned properly, with only a difference of
32 bytes between the two cases.

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Alexander


* Re: LTO slows down calculix by more than 10% on aarch64
  2020-09-04  9:52         ` Prathamesh Kulkarni
@ 2020-09-04 11:38           ` Alexander Monakov
  2020-09-21  9:49             ` Prathamesh Kulkarni
  0 siblings, 1 reply; 25+ messages in thread
From: Alexander Monakov @ 2020-09-04 11:38 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: Richard Biener, GCC Development

> I obtained perf stat results for the following benchmark runs:
> 
> -O2:
> 
>     7856832.692380      task-clock (msec)         #    1.000 CPUs utilized
>               3758               context-switches          #    0.000 K/sec
>                 40                 cpu-migrations             #    0.000 K/sec
>              40847              page-faults                   #    0.005 K/sec
>      7856782413676      cycles                           #    1.000 GHz
>      6034510093417      instructions                   #    0.77  insn per cycle
>       363937274287       branches                       #   46.321 M/sec
>        48557110132       branch-misses                #   13.34% of all branches

(ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
enough for this kind of code)

> -O2 with orthonl inlined:
> 
>     8319643.114380      task-clock (msec)       #    1.000 CPUs utilized
>               4285               context-switches         #    0.001 K/sec
>                 28                 cpu-migrations            #    0.000 K/sec
>              40843              page-faults                  #    0.005 K/sec
>      8319591038295      cycles                          #    1.000 GHz
>      6276338800377      instructions                  #    0.75  insn per cycle
>       467400726106       branches                      #   56.180 M/sec
>        45986364011        branch-misses              #    9.84% of all branches

So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
that extra instructions are appearing in this loop nest, but not in the innermost
loop. As a reminder for others, the innermost loop has only 3 iterations.

> -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> 
>    8207331.088040      task-clock (msec)   #    1.000 CPUs utilized
>               2266               context-switches    #    0.000 K/sec
>                 32                 cpu-migrations       #    0.000 K/sec
>              40846              page-faults             #    0.005 K/sec
>      8207292032467      cycles                     #   1.000 GHz
>      6035724436440      instructions             #    0.74  insn per cycle
>       364415440156       branches                 #   44.401 M/sec
>        53138327276        branch-misses         #   14.58% of all branches

This seems to match baseline in terms of instruction count, but without PRE
the loop nest may be carrying some dependencies over memory. I would simply
check the assembly for the entire 6-level loop nest in question; I hope it's
not very complicated (though Fortran array addressing...).
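
For example, assuming the usual gfortran symbol mangling (e_c3d_):

  objdump -d calculix | sed -n '/<e_c3d_>:/,/^$/p' > e_c3d.dump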

> -O2 with orthonl inlined and hoisting disabled:
> 
>    7797265.206850      task-clock (msec)         #    1.000 CPUs utilized
>               3139              context-switches          #    0.000 K/sec
>                 20                cpu-migrations             #    0.000 K/sec
>              40846              page-faults                  #    0.005 K/sec
>      7797221351467      cycles                          #    1.000 GHz
>      6187348757324      instructions                  #    0.79  insn per cycle
>       461840800061       branches                      #   59.231 M/sec
>        26920311761        branch-misses             #    5.83% of all branches

There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
I don't think the former fully covers the latter (there's also a 90e9 reduction
in insn count).

Given that the inner loop iterates only 3 times, my main suggestion is to
consider what the profile for the entire loop nest looks like (it's 6 loops
deep, each iterating only 3 times).
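
For example (the symbol name assumes gfortran's mangling):

  perf annotate e_c3d_    # inspect the entire nest, not just the inner loop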

> Perf profiles for
> -O2 -fno-code-hoisting and inlined orthonl:
> https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> 
>           3196866 |1f04:    ldur   d1, [x1, #-248]
> 216348301800│            add    w0, w0, #0x1
>             985098 |            add    x2, x2, #0x18
> 216215999206│            add    x1, x1, #0x48
> 215630376504│            fmul   d1, d5, d1
> 863829148015│            fmul   d1, d1, d6
> 864228353526│            fmul   d0, d1, d0
> 864568163014│            fmadd  d2, d0, d16, d2
>                         │             cmp    w0, #0x4
> 216125427594│          ↓ b.eq   1f34
>         15010377│             ldur   d0, [x2, #-8]
> 143753737468│          ↑ b      1f04
> 
> -O2 with inlined orthonl:
> https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> 
> 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> 144055883055│            add    w0, w0, #0x1
>   72262104254│            add    x2, x2, #0x18
> 143991169721│            add    x1, x1, #0x48
> 288648917780│            fmul   d15, d17, d15
> 864665644756│            fmul   d15, d15, d18
> 863868426387│            fmul   d14, d15, d14
> 865228159813│            fmadd  d16, d14, d31, d16
>             245967│            cmp    w0, #0x4
> 215396760545│         ↓ b.eq   1f28
>       704732365│            ldur   d14, [x2, #-8]
> 143775979620│         ↑ b      1ef8

This indicates that the loop only covers about 46-48% of overall time.

High count on the initial ldur instruction could be explained if the loop
is not entered by "fallthru" from the preceding block, or if its backedge
is mispredicted. Sampling mispredictions should be possible with perf record,
and you may be able to check if loop entry is fallthrough by inspecting
assembly.

It may also be possible to check if code alignment matters, by compiling with
-falign-loops=32.
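
For example (the generic branch-misses event may be named differently on
your core; check 'perf list'):

  perf record -e branch-misses ./calculix && perf report -n
  gfortran -O2 -flto -falign-loops=32 ...  # separately, re-check alignment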

Alexander


* Re: LTO slows down calculix by more than 10% on aarch64
  2020-09-04 11:38           ` Alexander Monakov
@ 2020-09-21  9:49             ` Prathamesh Kulkarni
  2020-09-21 12:44               ` Prathamesh Kulkarni
  0 siblings, 1 reply; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-09-21  9:49 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Richard Biener, GCC Development

[-- Attachment #1: Type: text/plain, Size: 8873 bytes --]

On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
>
> > I obtained perf stat results for the following benchmark runs:
> >
> > -O2:
> >
> >     7856832.692380      task-clock (msec)         #    1.000 CPUs utilized
> >               3758               context-switches          #    0.000 K/sec
> >                 40                 cpu-migrations             #    0.000 K/sec
> >              40847              page-faults                   #    0.005 K/sec
> >      7856782413676      cycles                           #    1.000 GHz
> >      6034510093417      instructions                   #    0.77  insn per cycle
> >       363937274287       branches                       #   46.321 M/sec
> >        48557110132       branch-misses                #   13.34% of all branches
>
> (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> enough for this kind of code)
>
> > -O2 with orthonl inlined:
> >
> >     8319643.114380      task-clock (msec)       #    1.000 CPUs utilized
> >               4285               context-switches         #    0.001 K/sec
> >                 28                 cpu-migrations            #    0.000 K/sec
> >              40843              page-faults                  #    0.005 K/sec
> >      8319591038295      cycles                          #    1.000 GHz
> >      6276338800377      instructions                  #    0.75  insn per cycle
> >       467400726106       branches                      #   56.180 M/sec
> >        45986364011        branch-misses              #    9.84% of all branches
>
> So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> that extra instructions are appearing in this loop nest, but not in the innermost
> loop. As a reminder for others, the innermost loop has only 3 iterations.
>
> > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> >
> >    8207331.088040      task-clock (msec)   #    1.000 CPUs utilized
> >               2266               context-switches    #    0.000 K/sec
> >                 32                 cpu-migrations       #    0.000 K/sec
> >              40846              page-faults             #    0.005 K/sec
> >      8207292032467      cycles                     #   1.000 GHz
> >      6035724436440      instructions             #    0.74  insn per cycle
> >       364415440156       branches                 #   44.401 M/sec
> >        53138327276        branch-misses         #   14.58% of all branches
>
> This seems to match baseline in terms of instruction count, but without PRE
> the loop nest may be carrying some dependencies over memory. I would simply
> check the assembly for the entire 6-level loop nest in question; I hope it's
> not very complicated (though Fortran array addressing...).
>
> > -O2 with orthonl inlined and hoisting disabled:
> >
> >    7797265.206850      task-clock (msec)         #    1.000 CPUs utilized
> >               3139              context-switches          #    0.000 K/sec
> >                 20                cpu-migrations             #    0.000 K/sec
> >              40846              page-faults                  #    0.005 K/sec
> >      7797221351467      cycles                          #    1.000 GHz
> >      6187348757324      instructions                  #    0.79  insn per cycle
> >       461840800061       branches                      #   59.231 M/sec
> >        26920311761        branch-misses             #    5.83% of all branches
>
> There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> I don't think the former fully covers the latter (there's also a 90e9 reduction
> in insn count).
>
> Given that the inner loop iterates only 3 times, my main suggestion is to
> consider what the profile for the entire loop nest looks like (it's 6 loops deep,
> each iterating only 3 times).
>
> > Perf profiles for
> > -O2 -fno-code-hoisting and inlined orthonl:
> > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> >
> >           3196866 |1f04:    ldur   d1, [x1, #-248]
> > 216348301800│            add    w0, w0, #0x1
> >             985098 |            add    x2, x2, #0x18
> > 216215999206│            add    x1, x1, #0x48
> > 215630376504│            fmul   d1, d5, d1
> > 863829148015│            fmul   d1, d1, d6
> > 864228353526│            fmul   d0, d1, d0
> > 864568163014│            fmadd  d2, d0, d16, d2
> >                         │             cmp    w0, #0x4
> > 216125427594│          ↓ b.eq   1f34
> >         15010377│             ldur   d0, [x2, #-8]
> > 143753737468│          ↑ b      1f04
> >
> > -O2 with inlined orthonl:
> > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> >
> > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > 144055883055│            add    w0, w0, #0x1
> >   72262104254│            add    x2, x2, #0x18
> > 143991169721│            add    x1, x1, #0x48
> > 288648917780│            fmul   d15, d17, d15
> > 864665644756│            fmul   d15, d15, d18
> > 863868426387│            fmul   d14, d15, d14
> > 865228159813│            fmadd  d16, d14, d31, d16
> >             245967│            cmp    w0, #0x4
> > 215396760545│         ↓ b.eq   1f28
> >       704732365│            ldur   d14, [x2, #-8]
> > 143775979620│         ↑ b      1ef8
>
> This indicates that the loop only covers about 46-48% of overall time.
>
> High count on the initial ldur instruction could be explained if the loop
> is not entered by "fallthru" from the preceding block, or if its backedge
> is mispredicted. Sampling mispredictions should be possible with perf record,
> and you may be able to check if loop entry is fallthrough by inspecting
> assembly.
>
> It may also be possible to check if code alignment matters, by compiling with
> -falign-loops=32.
Hi,
Thanks a lot for the detailed feedback, and I am sorry for the late response.

The hoisting region is:
if(mattyp.eq.1) then
  4 loops
elseif(mattyp.eq.2) then
  {
     orthonl inlined into basic block;
     loads w[0] .. w[8]
  }
else
   6 loops  // load anisox

followed by basic block:

 senergy=
     &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
     &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
     &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
                     s(ii1,jj1)=s(ii1,jj1)+senergy
                     s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
                     s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy

Hoisting hoists the loads w[0] .. w[8] from the orthonl and senergy blocks
into block 181, which is:
if (mattyp.eq.2) goto <bb 182> else goto <bb 193>

which is then further hoisted to block 173:
if (mattyp.eq.1) goto <bb 392> else goto <bb 181>

From block 181, we have two paths towards the senergy block (bb 194):
bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
AND
bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
which has a path length of around 18 blocks.
(bb 194 post-dominates bb 181 and bb 173).

Disabling only load hoisting within blocks 173 and 181
(by simply not inserting a pre_expr if pre_expr->kind == REFERENCE)
avoids hoisting of the 'w' array and brings back most of the performance,
which verifies that it is the hoisting of the 'w' array (w[0] ... w[8])
that is causing the slowdown?
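
For reference, the hack is roughly the following early-out while scanning
the hoistable expressions in do_hoist_insertion (gcc/tree-ssa-pre.c). This
is a sketch from my reading of the sources; the exact loop structure and
names may differ across GCC versions, and the hard-coded block numbers are
just for this experiment:

  /* In do_hoist_insertion (basic_block block), EXPRS holds the
     expressions hoistable into BLOCK.  */
  FOR_EACH_VEC_ELT (exprs, i, expr)
    {
      /* Experiment: never hoist memory references (loads such as
         w(1,1) ... w(3,3)) into the blocks under investigation.  */
      if (expr->kind == REFERENCE
          && (block->index == 173 || block->index == 181))
        continue;
      ...
    }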

I obtained perf profiles with full hoisting, and with hoisting of the 'w'
array disabled for the 6 loops, and the most drastic difference was for
the ldur instruction:

With full hoisting:
359871503840│ 1ef8:   ldur   d15, [x1, #-248]

Without full hoisting:
3441224 │1edc:   ldur   d1, [x1, #-248]

(The loop entry seems to be fall-thru in both cases. I have attached
profiles for both cases.)

IIUC, the instruction seems to be loading the first element from the
anisox array, which makes me wonder if the issue is a data-cache miss in
the slower version. I ran perf script on perf data for
L1-dcache-load-misses with period = 1 million, and it reported two cache
misses on the ldur instruction in the full-hoisting case, while it
reported zero for the disabled-load-hoisting case.
So I wonder if the slowdown happens because hoisting of the 'w' array
possibly results in eviction of anisox, thus causing a cache miss inside
the inner loop and making the load slower?
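
For reference, the measurement was along these lines (the event name may
differ per CPU):

  perf record -e L1-dcache-load-misses -c 1000000 ./calculix
  perf script    # one line per sampled miss; check which insns they hit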

Hoisting also seems to reduce the overall number of cache misses, though.
In the case with hoisting of the 'w' array disabled, there were a total
of 463 cache misses, while with full hoisting there were 357 cache misses
(with period = 1 million).
Does that happen because hoisting reduces cache misses along the
orthonl path (bb 173 -> bb 181 -> bb 182 -> bb 194)?

Thanks,
Prathamesh
>
> Alexander

[-- Attachment #2: fullhoist_profile.txt --]
[-- Type: text/plain, Size: 12850 bytes --]

   884982389 │1e40:   ldr    x0, [sp, #448]
             │        fmov   d19, d6
   871517886 │        ldr    x1, [sp, #808]
             │        add    x16, sp, #0x720
   904652642 │        ldr    x13, [sp, #784]
             │        sub    x15, x26, #0x1
   892180199 │        mov    x24, x27
             │        add    x28, x27, #0xf8
   881362543 │        add    x22, x1, x0, lsl #3
             │        mov    x12, #0x9                       // #9
   906876972 │        mov    x23, #0x1                       // #1
  5342906864 │1e6c:   fmov   d17, d1
  2622786801 │        mov    x14, #0x1778                    // #6008
             │        mov    x20, x28
  2680397945 │        add    x19, sp, x14
             │        mov    x18, x24
  2629152729 │        mov    x21, x30
             │        ldr    d16, [x22]
  4571598336 │        mov    x17, #0x1e                      // #30
 15904018941 │1e8c:   mov    x11, x19
  8106237022 │        mov    x10, x20
             │        mov    x14, x21
  7958740225 │        mov    x9, x18
             │        mov    x8, #0x1b                       // #27
 41353477432 │1ea0:   ldr    d14, [x9]
  1220553185 │        fmov   d18, d22
 22852558475 │        fmov   d20, d19
  1199867833 │        mov    x3, x11
 22706386191 │        mov    x7, x16
  1177543527 │        mov    x6, x10
 22767111709 │        fmul   d14, d17, d14
  1195454897 │        mov    x5, #0x1                        // #1
 94868835951 │        fmadd  d16, d14, d31, d16
 48021203056 │1ec4:   ldur   d15, [x6, #-248]
 30707657072 │        sub    x4, x3, #0x140
 41301831015 │        fmov   d14, d19
 32467499777 │        mov    x2, x13
 39498561992 │        mov    x1, x3
 32503985332 │        mov    w0, #0x1                        // #1
 39636367978 │        fmul   d15, d17, d15
 56642417403 │        ldr    d21, [x4, x12, lsl #3]
215900325343 │        fmul   d21, d17, d21
 49939836468 │        fmul   d15, d15, d20
238451679574 │        fmul   d20, d21, d18
 49692127013 │        fmadd  d15, d15, d31, d16
287649913912 │        fmadd  d16, d20, d31, d15
359871503840 │1ef8:   ldur   d15, [x1, #-248]
144055883055 │        add    w0, w0, #0x1
 72262104254 │        add    x2, x2, #0x18
143991169721 │        add    x1, x1, #0x48
288648917780 │        fmul   d15, d17, d15
864665644756 │        fmul   d15, d15, d18
863868426387 │        fmul   d14, d15, d14
865228159813 │        fmadd  d16, d14, d31, d16
      245967 │        cmp    w0, #0x4
215396760545 │      ↓ b.eq   1f28
   704732365 │        ldur   d14, [x2, #-8]
143775979620 │      ↑ b      1ef8
  2623253706 │1f28:   add    x5, x5, #0x1
 71700007726 │        add    x6, x6, #0x48
   291326727 │        add    x3, x3, #0x8
 41539387956 │        cmp    x5, #0x4
   291327452 │      ↓ b.eq   1f4c
152721910227 │        ldr    d18, [x7, x15, lsl #3]
  8561615599 │        add    x7, x7, #0x18
 96142935717 │        ldur   d20, [x7, #-24]
  8495464096 │      ↑ b      1ec4
201164546300 │1f4c:   add    x8, x8, #0x1b
 22086088222 │        add    x9, x9, #0xd8
  1882100212 │        add    x14, x14, #0x18
 22119311849 │        add    x10, x10, #0xd8
  1892034271 │        add    x11, x11, #0xd8
 13413581701 │        cmp    x8, #0x6c
  1191551884 │      ↓ b.eq   1f70
 26310755425 │        ldur   d17, [x14, #-8]
  1210506566 │      ↑ b      1ea0
 71960439728 │1f70:   add    x17, x17, #0x3
             │        add    x18, x18, #0x18
  8069920125 │        add    x20, x20, #0x18
             │        add    x19, x19, #0x18
  4645045210 │        cmp    x17, #0x27
             │      ↓ b.eq   1f90
 10962695888 │        ldr    d17, [x21], #8
             │      ↑ b      1e8c
 23927242012 │1f90:   add    x23, x23, #0x1
             │        str    d16, [x22]
  2672842806 │        add    x16, x16, #0x8
             │        add    x12, x12, #0x9
  2653094829 │        sub    x15, x15, #0x1
             │        add    x24, x24, #0x48
  2692030697 │        add    x22, x22, #0x1e0
             │        cmp    x23, #0x4
  1721216607 │      ↓ b.eq   1fbc
   448331273 │        ldr    d19, [x13], #8
  1778236919 │      ↑ b      1e6c
  7971009272 │1fbc:   ldr    x0, [sp, #448]
   911313572 │        add    x26, x26, #0x1
             │        add    x27, x27, #0x8
   902215785 │        add    x0, x0, #0x1
             │        str    x0, [sp, #448]
   478032817 │        cmp    x26, #0x4
             │      ↓ b.eq   1fe8
  1475545769 │        add    x0, sp, #0x708
             │        add    x0, x0, x26, lsl #3
  1806982272 │        ldur   d22, [x0, #-8]
             │      ↑ b      1e40

[-- Attachment #3: noloadhoist_profile.txt --]
[-- Type: text/plain, Size: 10969 bytes --]

   589937229 │1e30:   mov    x15, #0x1760                    // #5984
   904297989 │        add    x0, sp, x15
   870649879 │        add    x22, x0, x22
             │        fmov   d7, d24
   891274869 │        ldr    x0, [sp, #448]
             │        add    x14, sp, #0x710
   909978719 │        ldr    x12, [sp, #728]
             │        sub    x13, x27, #0x1
   882715766 │        add    x18, x28, #0x8
             │        sub    x19, x0, x28
   885884552 │        mov    x9, #0x9                        // #9
             │        mov    x20, #0x1                       // #1
  6279074827 │1e60:   mov    x17, x22
             │        mov    x16, x30
  2666213304 │        ldr    d2, [x19]
             │        mov    x15, #0x3                       // #3
 18990367400 │1e70:   mov    x8, x17
             │        mov    x11, x16
  8057495884 │        mov    x10, #0x1b                      // #27
 14947123246 │1e7c:   sub    x0, x8, #0x140
 22985623052 │        ldur   d5, [x11, #-8]
  1060364445 │        fmov   d6, d8
 23956420799 │        fmov   d3, d7
             │        add    x3, x18, x8
 24065319873 │        mov    x7, x14
             │        ldr    d0, [x0, x9, lsl #3]
 24187025828 │        mov    x6, x8
             │        mov    x5, #0x1                        // #1
 48132474841 │        fmul   d0, d5, d0
 96001335773 │        fmadd  d2, d0, d16, d2
 61067761742 │1ea8:   ldur   d4, [x6, #-248]
 14089308947 │        sub    x4, x3, #0x140
 58091146403 │        fmov   d0, d7
 14028168886 │        mov    x2, x12
 57897209384 │        mov    x1, x3
 13994185270 │        mov    w0, #0x1                        // #1
 67891460180 │        fmul   d4, d5, d4
 28006688701 │        ldr    d1, [x4, x9, lsl #3]
215655048826 │        fmul   d1, d5, d1
 57701202743 │        fmul   d3, d4, d3
230116393416 │        fmul   d1, d1, d6
 57977229144 │        fmadd  d2, d3, d16, d2
301775181164 │        fmadd  d2, d1, d16, d2
     3441224 │1edc:   ldur   d1, [x1, #-248]
216111094536 │        add    w0, w0, #0x1
     1473566 │        add    x2, x2, #0x18
215873683406 │        add    x1, x1, #0x48
216166335905 │        fmul   d1, d5, d1
864007322335 │        fmul   d1, d1, d6
863815029515 │        fmul   d0, d1, d0
864900327399 │        fmadd  d2, d0, d16, d2
             │        cmp    w0, #0x4
216329679631 │      ↓ b.eq   1f0c
    22872044 │        ldur   d0, [x2, #-8]
143941131893 │      ↑ b      1edc
   277804663 │1f0c:   add    x5, x5, #0x1
 72179847520 │        add    x6, x6, #0x48
             │        add    x3, x3, #0x8
 65738463940 │        cmp    x5, #0x4
             │      ↓ b.eq   1f30
123097375558 │        ldr    d6, [x7, x13, lsl #3]
             │        add    x7, x7, #0x18
 96061189670 │        ldur   d3, [x7, #-24]
             │      ↑ b      1ea8
 42647845407 │1f30:   add    x10, x10, #0x1b
             │        add    x11, x11, #0x18
 24141022972 │        add    x8, x8, #0xd8
             │        cmp    x10, #0x6c
 14573046432 │      ↑ b.ne   1e7c
 72139544087 │        add    x15, x15, #0x3
  8028370830 │        add    x16, x16, #0x8
             │        add    x17, x17, #0x18
  4860057143 │        cmp    x15, #0xc
             │      ↑ b.ne   1e70
 23912996709 │        add    x20, x20, #0x1
             │        str    d2, [x19]
  2670529487 │        add    x14, x14, #0x8
             │        add    x9, x9, #0x9
  2659625346 │        sub    x13, x13, #0x1
             │        add    x19, x19, #0x1e0
  1606030574 │        cmp    x20, #0x4
             │      ↓ b.eq   1f80
  3096553445 │        ldr    d7, [x12], #8
             │      ↑ b      1e60
  7964390214 │1f80:   add    x27, x27, #0x1
             │        sub    x28, x28, #0x8
   529029469 │        cmp    x27, #0x4
             │      ↓ b.eq   2028
  1176126379 │        lsl    x22, x27, #3
             │        add    x0, sp, #0x6f8
   593893747 │        add    x0, x0, x22
  1798781807 │        ldur   d8, [x0, #-8]
   580872685 │      ↑ b      1e30


* Re: LTO slows down calculix by more than 10% on aarch64
  2020-09-21  9:49             ` Prathamesh Kulkarni
@ 2020-09-21 12:44               ` Prathamesh Kulkarni
  2020-09-22  5:08                 ` Prathamesh Kulkarni
  0 siblings, 1 reply; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-09-21 12:44 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Richard Biener, GCC Development

On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> >
> > > I obtained perf stat results for following benchmark runs:
> > >
> > > -O2:
> > >
> > >     7856832.692380      task-clock (msec)         #    1.000 CPUs utilized
> > >               3758               context-switches          #    0.000 K/sec
> > >                 40                 cpu-migrations             #    0.000 K/sec
> > >              40847              page-faults                   #    0.005 K/sec
> > >      7856782413676      cycles                           #    1.000 GHz
> > >      6034510093417      instructions                   #    0.77  insn per cycle
> > >       363937274287       branches                       #   46.321 M/sec
> > >        48557110132       branch-misses                #   13.34% of all branches
> >
> > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > enough for this kind of code)
> >
> > > -O2 with orthonl inlined:
> > >
> > >     8319643.114380      task-clock (msec)       #    1.000 CPUs utilized
> > >               4285               context-switches         #    0.001 K/sec
> > >                 28                 cpu-migrations            #    0.000 K/sec
> > >              40843              page-faults                  #    0.005 K/sec
> > >      8319591038295      cycles                          #    1.000 GHz
> > >      6276338800377      instructions                  #    0.75  insn per cycle
> > >       467400726106       branches                      #   56.180 M/sec
> > >        45986364011        branch-misses              #    9.84% of all branches
> >
> > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > that extra instructions are appearing in this loop nest, but not in the innermost
> > loop. As a reminder for others, the innermost loop has only 3 iterations.
> >
> > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > >
> > >    8207331.088040      task-clock (msec)   #    1.000 CPUs utilized
> > >               2266               context-switches    #    0.000 K/sec
> > >                 32                 cpu-migrations       #    0.000 K/sec
> > >              40846              page-faults             #    0.005 K/sec
> > >      8207292032467      cycles                     #   1.000 GHz
> > >      6035724436440      instructions             #    0.74  insn per cycle
> > >       364415440156       branches                 #   44.401 M/sec
> > >        53138327276        branch-misses         #   14.58% of all branches
> >
> > This seems to match baseline in terms of instruction count, but without PRE
> > the loop nest may be carrying some dependencies over memory. I would simply
> > check the assembly for the entire 6-level loop nest in question, I hope it's
> > not very complicated (though Fortran array addressing...).
> >
> > > -O2 with orthonl inlined and hoisting disabled:
> > >
> > >    7797265.206850      task-clock (msec)         #    1.000 CPUs utilized
> > >               3139              context-switches          #    0.000 K/sec
> > >                 20                cpu-migrations             #    0.000 K/sec
> > >              40846              page-faults                  #    0.005 K/sec
> > >      7797221351467      cycles                          #    1.000 GHz
> > >      6187348757324      instructions                  #    0.79  insn per cycle
> > >       461840800061       branches                      #   59.231 M/sec
> > >        26920311761        branch-misses             #    5.83% of all branches
> >
> > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > in insn count).
> >
> > Given that the inner loop iterates only 3 times, my main suggestion is to
> > consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> > each iterating only 3 times).
> >
> > > Perf profiles for
> > > -O2 -fno-code-hoisting and inlined orthonl:
> > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > >
> > >           3196866 |1f04:    ldur   d1, [x1, #-248]
> > > 216348301800│            add    w0, w0, #0x1
> > >             985098 |            add    x2, x2, #0x18
> > > 216215999206│            add    x1, x1, #0x48
> > > 215630376504│            fmul   d1, d5, d1
> > > 863829148015│            fmul   d1, d1, d6
> > > 864228353526│            fmul   d0, d1, d0
> > > 864568163014│            fmadd  d2, d0, d16, d2
> > >                         │             cmp    w0, #0x4
> > > 216125427594│          ↓ b.eq   1f34
> > >         15010377│             ldur   d0, [x2, #-8]
> > > 143753737468│          ↑ b      1f04
> > >
> > > -O2 with inlined orthonl:
> > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > >
> > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > 144055883055│            add    w0, w0, #0x1
> > >   72262104254│            add    x2, x2, #0x18
> > > 143991169721│            add    x1, x1, #0x48
> > > 288648917780│            fmul   d15, d17, d15
> > > 864665644756│            fmul   d15, d15, d18
> > > 863868426387│            fmul   d14, d15, d14
> > > 865228159813│            fmadd  d16, d14, d31, d16
> > >             245967│            cmp    w0, #0x4
> > > 215396760545│         ↓ b.eq   1f28
> > >       704732365│            ldur   d14, [x2, #-8]
> > > 143775979620│         ↑ b      1ef8
> >
> > This indicates that the loop only covers about 46-48% of overall time.
> >
> > High count on the initial ldur instruction could be explained if the loop
> > is not entered by "fallthru" from the preceding block, or if its backedge
> > is mispredicted. Sampling mispredictions should be possible with perf record,
> > and you may be able to check if loop entry is fallthrough by inspecting
> > assembly.
> >
> > It may also be possible to check if code alignment matters, by compiling with
> > -falign-loops=32.
> Hi,
> Thanks a lot for the detailed feedback, and I am sorry for late response.
>
> The hoisting region is:
> if(mattyp.eq.1) then
>   4 loops
> elseif(mattyp.eq.2) then
>   {
>      orthonl inlined into basic block;
>      loads w[0] .. w[8]
>   }
> else
>    6 loops  // load anisox
>
> followed by basic block:
>
>  senergy=
>      &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
>      &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
>      &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
>                      s(ii1,jj1)=s(ii1,jj1)+senergy
>                      s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
>                      s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
>
> Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block,
> right in block 181, which is:
> if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
>
> which is then further hoisted to block 173:
> if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
>
> From block 181, we have two paths towards senergy block (bb 194):
> bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> AND
> bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> which has a path length of around 18 blocks.
> (bb 194 post-dominates bb 181 and bb 173).
>
> Disabling only load hoisting within blocks 173 and 181
> (simply avoid inserting pre_expr if pre_expr->kind == REFERENCE),
> avoid hoisting of 'w' array and brings back most of performance. Which
> verifies that it is hoisting of the
> 'w' array (w[0] ... w[8]), which is causing the slowdown ?
>
> I obtained perf profiles for full hoisting, and disabled hoisting of
> 'w' array for the 6 loops, and the most drastic difference was
> for ldur instruction:
>
> With full hoisting:
> 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
>
> Without full hoisting:
> 3441224 │1edc:   ldur   d1, [x1, #-248]
>
> (The loop entry seems to be fall thru in both cases. I have attached
> profiles for both cases).
>
> IIUC, the instruction seems to be loading the first element from anisox array,
> which makes me wonder if the issue was with data-cache miss for slower version.
> I ran perf script on perf data for L1-dcache-load-misses with period = 1million,
> and it reported two cache misses on the ldur instruction in full hoisting case,
> while it reported zero for the disabled load hoisting case.
> So I wonder if the slowdown happens because hoisting of 'w' array
> possibly results
> in eviction of anisox thus causing a cache miss inside the inner loop
> and making load slower ?
>
> Hoisting also seems to improve the number of overall cache misses tho.
> For disabled hoisting  of 'w' array case, there were a total of 463
> cache misses, while with full hoisting there were 357 cache misses
> (with period = 1 million).
> Does that happen because hoisting probably reduces cache misses along
> the orthonl path (bb 173 - > bb 181 -> bb 182 -> bb 194) ?
Hi,
In general I feel that for this case and for PR80155, the issues come
from long-range hoistings inside a large CFG: we don't have an accurate
way to model target resources (register pressure in the PR80155 case,
or possibly cache pressure in this case?) at the tree level, so we end
up with register spills or cache misses inside loops, which may offset
the benefit of hoisting. As previously discussed, the right way to go
is a live-range splitting pass at the GIMPLE -> RTL border, which can
also help with other code-movement optimizations (or with source code
that has variables with long live ranges).

As a cheap workaround though, would it make sense to check whether we
are hoisting across a "large" region of nested loops, and avoid it in
that case, since hoisting may exert resource pressure inside the loop
region? (Especially in cases where the hoisted expressions were not
originally AVAIL in any of the loop blocks, and the loop region doesn't
benefit from the hoisting.)

For instance:
FOR_EACH_EDGE (e, ei, block->succs)
  {
    /* Avoid hoisting across more than 3 nested loops.  */
    if (e->dest is a loop pre-header or loop header
        && nesting depth of loop is > 3)
     return false;
  }
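
In terms of GCC's actual CFG/loop API, a minimal version of that check
might look as below (just a sketch, assuming it sits at the top of
do_hoist_insertion; the depth limit of 3 and the single-successor
pre-header test are arbitrary choices of the sketch):

  edge e;
  edge_iterator ei;
  FOR_EACH_EDGE (e, ei, block->succs)
    {
      basic_block dest = e->dest;
      /* Treat a single-successor block that falls into a loop header
         as a pre-header.  */
      if (!bb_loop_header_p (dest)
          && single_succ_p (dest)
          && bb_loop_header_p (single_succ (dest)))
        dest = single_succ (dest);
      /* Give up if hoisting would cross a loop nested deeper than
         3 levels.  */
      if (bb_loop_header_p (dest)
          && loop_depth (dest->loop_father) > 3)
        return false;
    }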

I think this would work for resolving the calculix issue because it
hoists across one region of 6 loops and another of 4 loops (didn't
test yet).
It's not bulletproof in that it will miss cases where the loop header
(or pre-header) isn't a successor of the candidate block (checking for
that might get expensive though?). I will test it on the gcc testsuite
and SPEC for any regressions.
Does this sound like a reasonable heuristic?

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Alexander

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: LTO slows down calculix by more than 10% on aarch64
  2020-09-21 12:44               ` Prathamesh Kulkarni
@ 2020-09-22  5:08                 ` Prathamesh Kulkarni
  2020-09-22  7:25                   ` Richard Biener
  0 siblings, 1 reply; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-09-22  5:08 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Richard Biener, GCC Development

[-- Attachment #1: Type: text/plain, Size: 12256 bytes --]

On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > >
> > > > I obtained perf stat results for following benchmark runs:
> > > >
> > > > -O2:
> > > >
> > > >     7856832.692380      task-clock (msec)         #    1.000 CPUs utilized
> > > >               3758               context-switches          #    0.000 K/sec
> > > >                 40                 cpu-migrations             #    0.000 K/sec
> > > >              40847              page-faults                   #    0.005 K/sec
> > > >      7856782413676      cycles                           #    1.000 GHz
> > > >      6034510093417      instructions                   #    0.77  insn per cycle
> > > >       363937274287       branches                       #   46.321 M/sec
> > > >        48557110132       branch-misses                #   13.34% of all branches
> > >
> > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > enough for this kind of code)
> > >
> > > > -O2 with orthonl inlined:
> > > >
> > > >     8319643.114380      task-clock (msec)       #    1.000 CPUs utilized
> > > >               4285               context-switches         #    0.001 K/sec
> > > >                 28                 cpu-migrations            #    0.000 K/sec
> > > >              40843              page-faults                  #    0.005 K/sec
> > > >      8319591038295      cycles                          #    1.000 GHz
> > > >      6276338800377      instructions                  #    0.75  insn per cycle
> > > >       467400726106       branches                      #   56.180 M/sec
> > > >        45986364011        branch-misses              #    9.84% of all branches
> > >
> > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > >
> > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > >
> > > >    8207331.088040      task-clock (msec)   #    1.000 CPUs utilized
> > > >               2266               context-switches    #    0.000 K/sec
> > > >                 32                 cpu-migrations       #    0.000 K/sec
> > > >              40846              page-faults             #    0.005 K/sec
> > > >      8207292032467      cycles                     #   1.000 GHz
> > > >      6035724436440      instructions             #    0.74  insn per cycle
> > > >       364415440156       branches                 #   44.401 M/sec
> > > >        53138327276        branch-misses         #   14.58% of all branches
> > >
> > > This seems to match baseline in terms of instruction count, but without PRE
> > > the loop nest may be carrying some dependencies over memory. I would simply
> > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > not very complicated (though Fortran array addressing...).
> > >
> > > > -O2 with orthonl inlined and hoisting disabled:
> > > >
> > > >    7797265.206850      task-clock (msec)         #    1.000 CPUs utilized
> > > >               3139              context-switches          #    0.000 K/sec
> > > >                 20                cpu-migrations             #    0.000 K/sec
> > > >              40846              page-faults                  #    0.005 K/sec
> > > >      7797221351467      cycles                          #    1.000 GHz
> > > >      6187348757324      instructions                  #    0.79  insn per cycle
> > > >       461840800061       branches                      #   59.231 M/sec
> > > >        26920311761        branch-misses             #    5.83% of all branches
> > >
> > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > in insn count).
> > >
> > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> > > each iterating only 3 times).
> > >
> > > > Perf profiles for
> > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > >
> > > >           3196866 |1f04:    ldur   d1, [x1, #-248]
> > > > 216348301800│            add    w0, w0, #0x1
> > > >             985098 |            add    x2, x2, #0x18
> > > > 216215999206│            add    x1, x1, #0x48
> > > > 215630376504│            fmul   d1, d5, d1
> > > > 863829148015│            fmul   d1, d1, d6
> > > > 864228353526│            fmul   d0, d1, d0
> > > > 864568163014│            fmadd  d2, d0, d16, d2
> > > >                         │             cmp    w0, #0x4
> > > > 216125427594│          ↓ b.eq   1f34
> > > >         15010377│             ldur   d0, [x2, #-8]
> > > > 143753737468│          ↑ b      1f04
> > > >
> > > > -O2 with inlined orthonl:
> > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > >
> > > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > > 144055883055│            add    w0, w0, #0x1
> > > >   72262104254│            add    x2, x2, #0x18
> > > > 143991169721│            add    x1, x1, #0x48
> > > > 288648917780│            fmul   d15, d17, d15
> > > > 864665644756│            fmul   d15, d15, d18
> > > > 863868426387│            fmul   d14, d15, d14
> > > > 865228159813│            fmadd  d16, d14, d31, d16
> > > >             245967│            cmp    w0, #0x4
> > > > 215396760545│         ↓ b.eq   1f28
> > > >       704732365│            ldur   d14, [x2, #-8]
> > > > 143775979620│         ↑ b      1ef8
> > >
> > > This indicates that the loop only covers about 46-48% of overall time.
> > >
> > > High count on the initial ldur instruction could be explained if the loop
> > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > and you may be able to check if loop entry is fallthrough by inspecting
> > > assembly.
> > >
> > > It may also be possible to check if code alignment matters, by compiling with
> > > -falign-loops=32.
> > Hi,
> > Thanks a lot for the detailed feedback, and I am sorry for late response.
> >
> > The hoisting region is:
> > if(mattyp.eq.1) then
> >   4 loops
> > elseif(mattyp.eq.2) then
> >   {
> >      orthonl inlined into basic block;
> >      loads w[0] .. w[8]
> >   }
> > else
> >    6 loops  // load anisox
> >
> > followed by basic block:
> >
> >  senergy=
> >      &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> >      &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> >      &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> >                      s(ii1,jj1)=s(ii1,jj1)+senergy
> >                      s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> >                      s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> >
> > Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block,
> > right in block 181, which is:
> > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> >
> > which is then further hoisted to block 173:
> > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> >
> > From block 181, we have two paths towards senergy block (bb 194):
> > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > AND
> > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > which has a path length of around 18 blocks.
> > (bb 194 post-dominates bb 181 and bb 173).
> >
> > Disabling only load hoisting within blocks 173 and 181
> > (simply avoid inserting pre_expr if pre_expr->kind == REFERENCE),
> > avoid hoisting of 'w' array and brings back most of performance. Which
> > verifies that it is hoisting of the
> > 'w' array (w[0] ... w[8]), which is causing the slowdown ?
> >
> > I obtained perf profiles for full hoisting, and disabled hoisting of
> > 'w' array for the 6 loops, and the most drastic difference was
> > for ldur instruction:
> >
> > With full hoisting:
> > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> >
> > Without full hoisting:
> > 3441224 │1edc:   ldur   d1, [x1, #-248]
> >
> > (The loop entry seems to be fall thru in both cases. I have attached
> > profiles for both cases).
> >
> > IIUC, the instruction seems to be loading the first element from anisox array,
> > which makes me wonder if the issue was with data-cache miss for slower version.
> > I ran perf script on perf data for L1-dcache-load-misses with period = 1million,
> > and it reported two cache misses on the ldur instruction in full hoisting case,
> > while it reported zero for the disabled load hoisting case.
> > So I wonder if the slowdown happens because hoisting of 'w' array
> > possibly results
> > in eviction of anisox thus causing a cache miss inside the inner loop
> > and making load slower ?
> >
> > Hoisting also seems to improve the number of overall cache misses tho.
> > For disabled hoisting  of 'w' array case, there were a total of 463
> > cache misses, while with full hoisting there were 357 cache misses
> > (with period = 1 million).
> > Does that happen because hoisting probably reduces cache misses along
> > the orthonl path (bb 173 - > bb 181 -> bb 182 -> bb 194) ?
> Hi,
> In general I feel for this or PR80155 case, the issues come with long
> range hoistings, inside a large CFG, since we don't have an accurate
> way to model target resources (register pressure in PR80155 case / or
> possibly cache pressure in this case?) at tree level and we end up
> with register spill or cache miss inside loops, which may offset the
> benefit of hoisting. As previously discussed the right way to go is a
> live range splitting pass, at GIMPLE -> RTL border which can also help
> with other code-movement optimizations (or if the source had variables
> with long live ranges).
>
> I was wondering tho as a cheap workaround, would it make sense to
> check if we are hoisting across a "large" region of nested loops, and
> avoid in that case since hoisting may exert resource pressure inside
> loop region ? (Especially, in the cases where hoisted expressions were
> not originally AVAIL in any of the loop blocks, and the loop region
> doesn't benefit from hoisting).
>
> For instance:
> FOR_EACH_EDGE (e, ei, block)
>   {
>     /* Avoid hoisting across more than 3 nested loops */
>     if (e->dest is a loop pre-header or loop header
>         && nesting depth of loop is > 3)
>      return false;
>   }
>
> I think this would work for resolving the calculix issue because it
> hoists across one region of 6 loops and another of 4 loops (didn' test
> yet).
> It's not bulletproof in that it will miss detecting cases where loop
> header (or pre-header) isn't a successor of candidate block (checking
> for
> that might get expensive tho?). I will test it on gcc suite and SPEC
> for any regressions.
> Does this sound like a reasonable heuristic ?
Hi,
The attached patch implements the above heuristic.
Bootstrapped + tested on x86_64-linux-gnu with no regressions.
It brings back most of the performance for calculix, on par with -O2
(without inlining orthonl).
I verified that with the patch there is no cache miss on the load insn
inside the loop
(with perf report -e L1-dcache-load-misses/period=1000000/).
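
For reference, the profile was collected along these lines (the binary
and input names are assumed from the usual SPEC calculix setup, so
adjust to yours):

  perf record -e L1-dcache-load-misses/period=1000000/ -- ./calculix -i hyperviscoplastic
  perf report --stdio

where period=1000000 takes one sample per million miss events.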

I am in the process of benchmarking the patch on aarch64 for SPEC
speed and will report numbers in a couple of days. (If required, we
could parametrize the number of nested loops, hardcoded (arbitrarily)
to 3 in this patch, and set it in the backend so that other targets
are not affected.)
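
A sketch of what such a knob could look like as an entry in
gcc/params.opt (the param name is hypothetical, nothing like it exists
today):

  -param=hoist-max-loop-depth=
  Common Joined UInteger Var(param_hoist_max_loop_depth) Init(3) Param Optimization
  Maximum loop nesting depth across which expressions may be hoisted.

A backend could then adjust the default in its option-override hook.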

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
>
>
>
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Prathamesh
> > >
> > > Alexander

[-- Attachment #2: gnu-659-loop-1.diff --]
[-- Type: application/octet-stream, Size: 1786 bytes --]

diff --git a/gcc/tree-ssa-pre.c b/gcc/tree-ssa-pre.c
index 0c1654f3580..9017ad9a4cb 100644
--- a/gcc/tree-ssa-pre.c
+++ b/gcc/tree-ssa-pre.c
@@ -3528,13 +3528,30 @@ do_hoist_insertion (basic_block block)
   bitmap_head availout_in_some;
   bitmap_initialize (&availout_in_some, &grand_bitmap_obstack);
   FOR_EACH_EDGE (e, ei, block->succs)
-    /* Do not consider expressions solely because their availability
-       on loop exits.  They'd be ANTIC-IN throughout the whole loop
-       and thus effectively hoisted across loops by combination of
-       PRE and hoisting.  */
-    if (! loop_exit_edge_p (block->loop_father, e))
-      bitmap_ior_and_into (&availout_in_some, &hoistable_set.values,
-			   &AVAIL_OUT (e->dest)->values);
+    {
+      /* Do not consider expressions solely because their availability
+	 on loop exits.  They'd be ANTIC-IN throughout the whole loop
+	 and thus effectively hoisted across loops by combination of
+	 PRE and hoisting.  */
+      if (! loop_exit_edge_p (block->loop_father, e))
+	bitmap_ior_and_into (&availout_in_some, &hoistable_set.values,
+			     &AVAIL_OUT (e->dest)->values);
+
+      /* Avoid hoisting if a successor block is either a loop header or
+	 pre-header, and the loop region has more than 3 nested loops, so
+	 as not to exert resource pressure inside the loop region.  */
+
+      basic_block header_bb = NULL;
+      if (bb_loop_header_p (e->dest))
+	header_bb = e->dest;
+      else if (single_succ_p (e->dest)
+	       && bb_loop_header_p (single_succ (e->dest)))
+	header_bb = single_succ (e->dest);
+
+      if (header_bb && header_bb->loop_father
+	  && get_loop_level (header_bb->loop_father) > 3)
+	return false;
+    }
   bitmap_clear (&hoistable_set.values);
 
   /* Short-cut for a common case: availout_in_some is empty.  */

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: LTO slows down calculix by more than 10% on aarch64
  2020-09-22  5:08                 ` Prathamesh Kulkarni
@ 2020-09-22  7:25                   ` Richard Biener
  2020-09-22  9:37                     ` Prathamesh Kulkarni
  0 siblings, 1 reply; 25+ messages in thread
From: Richard Biener @ 2020-09-22  7:25 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: Alexander Monakov, GCC Development

On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> > >
> > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > >
> > > > > I obtained perf stat results for following benchmark runs:
> > > > >
> > > > > -O2:
> > > > >
> > > > >     7856832.692380      task-clock (msec)         #    1.000 CPUs utilized
> > > > >               3758               context-switches          #    0.000 K/sec
> > > > >                 40                 cpu-migrations             #    0.000 K/sec
> > > > >              40847              page-faults                   #    0.005 K/sec
> > > > >      7856782413676      cycles                           #    1.000 GHz
> > > > >      6034510093417      instructions                   #    0.77  insn per cycle
> > > > >       363937274287       branches                       #   46.321 M/sec
> > > > >        48557110132       branch-misses                #   13.34% of all branches
> > > >
> > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > enough for this kind of code)
> > > >
> > > > > -O2 with orthonl inlined:
> > > > >
> > > > >     8319643.114380      task-clock (msec)       #    1.000 CPUs utilized
> > > > >               4285               context-switches         #    0.001 K/sec
> > > > >                 28                 cpu-migrations            #    0.000 K/sec
> > > > >              40843              page-faults                  #    0.005 K/sec
> > > > >      8319591038295      cycles                          #    1.000 GHz
> > > > >      6276338800377      instructions                  #    0.75  insn per cycle
> > > > >       467400726106       branches                      #   56.180 M/sec
> > > > >        45986364011        branch-misses              #    9.84% of all branches
> > > >
> > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > >
> > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > >
> > > > >    8207331.088040      task-clock (msec)   #    1.000 CPUs utilized
> > > > >               2266               context-switches    #    0.000 K/sec
> > > > >                 32                 cpu-migrations       #    0.000 K/sec
> > > > >              40846              page-faults             #    0.005 K/sec
> > > > >      8207292032467      cycles                     #   1.000 GHz
> > > > >      6035724436440      instructions             #    0.74  insn per cycle
> > > > >       364415440156       branches                 #   44.401 M/sec
> > > > >        53138327276        branch-misses         #   14.58% of all branches
> > > >
> > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > not very complicated (though Fortran array addressing...).
> > > >
> > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > >
> > > > >    7797265.206850      task-clock (msec)         #    1.000 CPUs utilized
> > > > >               3139              context-switches          #    0.000 K/sec
> > > > >                 20                cpu-migrations             #    0.000 K/sec
> > > > >              40846              page-faults                  #    0.005 K/sec
> > > > >      7797221351467      cycles                          #    1.000 GHz
> > > > >      6187348757324      instructions                  #    0.79  insn per cycle
> > > > >       461840800061       branches                      #   59.231 M/sec
> > > > >        26920311761        branch-misses             #    5.83% of all branches
> > > >
> > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > in insn count).
> > > >
> > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > each iterating only 3 times).
> > > >
> > > > > Perf profiles for
> > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > >
> > > > >           3196866 |1f04:    ldur   d1, [x1, #-248]
> > > > > 216348301800│            add    w0, w0, #0x1
> > > > >             985098 |            add    x2, x2, #0x18
> > > > > 216215999206│            add    x1, x1, #0x48
> > > > > 215630376504│            fmul   d1, d5, d1
> > > > > 863829148015│            fmul   d1, d1, d6
> > > > > 864228353526│            fmul   d0, d1, d0
> > > > > 864568163014│            fmadd  d2, d0, d16, d2
> > > > >                         │             cmp    w0, #0x4
> > > > > 216125427594│          ↓ b.eq   1f34
> > > > >         15010377│             ldur   d0, [x2, #-8]
> > > > > 143753737468│          ↑ b      1f04
> > > > >
> > > > > -O2 with inlined orthonl:
> > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > >
> > > > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > > > 144055883055│            add    w0, w0, #0x1
> > > > >   72262104254│            add    x2, x2, #0x18
> > > > > 143991169721│            add    x1, x1, #0x48
> > > > > 288648917780│            fmul   d15, d17, d15
> > > > > 864665644756│            fmul   d15, d15, d18
> > > > > 863868426387│            fmul   d14, d15, d14
> > > > > 865228159813│            fmadd  d16, d14, d31, d16
> > > > >             245967│            cmp    w0, #0x4
> > > > > 215396760545│         ↓ b.eq   1f28
> > > > >       704732365│            ldur   d14, [x2, #-8]
> > > > > 143775979620│         ↑ b      1ef8
> > > >
> > > > This indicates that the loop only covers about 46-48% of overall time.
> > > >
> > > > High count on the initial ldur instruction could be explained if the loop
> > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > assembly.
> > > >
> > > > It may also be possible to check if code alignment matters, by compiling with
> > > > -falign-loops=32.
> > > Hi,
> > > Thanks a lot for the detailed feedback, and I am sorry for late response.
> > >
> > > The hoisting region is:
> > > if(mattyp.eq.1) then
> > >   4 loops
> > > elseif(mattyp.eq.2) then
> > >   {
> > >      orthonl inlined into basic block;
> > >      loads w[0] .. w[8]
> > >   }
> > > else
> > >    6 loops  // load anisox
> > >
> > > followed by basic block:
> > >
> > >  senergy=
> > >      &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > >      &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > >      &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > >                      s(ii1,jj1)=s(ii1,jj1)+senergy
> > >                      s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > >                      s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > >
> > > Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block,
> > > right in block 181, which is:
> > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > >
> > > which is then further hoisted to block 173:
> > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > >
> > > From block 181, we have two paths towards senergy block (bb 194):
> > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > AND
> > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > which has a path length of around 18 blocks.
> > > (bb 194 post-dominates bb 181 and bb 173).
> > >
> > > Disabling only load hoisting within blocks 173 and 181
> > > (simply avoid inserting pre_expr if pre_expr->kind == REFERENCE),
> > > avoid hoisting of 'w' array and brings back most of performance. Which
> > > verifies that it is hoisting of the
> > > 'w' array (w[0] ... w[8]), which is causing the slowdown ?
> > >
> > > I obtained perf profiles for full hoisting, and disabled hoisting of
> > > 'w' array for the 6 loops, and the most drastic difference was
> > > for ldur instruction:
> > >
> > > With full hoisting:
> > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > >
> > > Without full hoisting:
> > > 3441224 │1edc:   ldur   d1, [x1, #-248]
> > >
> > > (The loop entry seems to be fall thru in both cases. I have attached
> > > profiles for both cases).
> > >
> > > IIUC, the instruction seems to be loading the first element from anisox array,
> > > which makes me wonder if the issue was with data-cache miss for slower version.
> > > I ran perf script on perf data for L1-dcache-load-misses with period = 1million,
> > > and it reported two cache misses on the ldur instruction in full hoisting case,
> > > while it reported zero for the disabled load hoisting case.
> > > So I wonder if the slowdown happens because hoisting of 'w' array
> > > possibly results
> > > in eviction of anisox thus causing a cache miss inside the inner loop
> > > and making load slower ?
> > >
> > > Hoisting also seems to improve the number of overall cache misses tho.
> > > For disabled hoisting  of 'w' array case, there were a total of 463
> > > cache misses, while with full hoisting there were 357 cache misses
> > > (with period = 1 million).
> > > Does that happen because hoisting probably reduces cache misses along
> > > the orthonl path (bb 173 - > bb 181 -> bb 182 -> bb 194) ?
> > Hi,
> > In general I feel for this or PR80155 case, the issues come with long
> > range hoistings, inside a large CFG, since we don't have an accurate
> > way to model target resources (register pressure in PR80155 case / or
> > possibly cache pressure in this case?) at tree level and we end up
> > with register spill or cache miss inside loops, which may offset the
> > benefit of hoisting. As previously discussed the right way to go is a
> > live range splitting pass, at GIMPLE -> RTL border which can also help
> > with other code-movement optimizations (or if the source had variables
> > with long live ranges).
> >
> > I was wondering tho as a cheap workaround, would it make sense to
> > check if we are hoisting across a "large" region of nested loops, and
> > avoid in that case since hoisting may exert resource pressure inside
> > loop region ? (Especially, in the cases where hoisted expressions were
> > not originally AVAIL in any of the loop blocks, and the loop region
> > doesn't benefit from hoisting).
> >
> > For instance:
> > FOR_EACH_EDGE (e, ei, block)
> >   {
> >     /* Avoid hoisting across more than 3 nested loops */
> >     if (e->dest is a loop pre-header or loop header
> >         && nesting depth of loop is > 3)
> >      return false;
> >   }
> >
> > I think this would work for resolving the calculix issue because it
> > hoists across one region of 6 loops and another of 4 loops (didn' test
> > yet).
> > It's not bulletproof in that it will miss detecting cases where loop
> > header (or pre-header) isn't a successor of candidate block (checking
> > for
> > that might get expensive tho?). I will test it on gcc suite and SPEC
> > for any regressions.
> > Does this sound like a reasonable heuristic ?
> Hi,
> The attached patch implements the above heuristic.
> Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> And it brings back most of performance for calculix on par with O2
> (without inlining orthonl).
> I verified that with patch there is no cache miss happening on load
> insn inside loop
> (with perf report -e L1-dcache-load-misses/period=1000000/)
>
> I am in the process of benchmarking the patch on aarch64 for SPEC for
> speed and will report numbers
> in couple of days. (If required, we could parametrize number of nested
> loops, hardcoded (arbitrarily to) 3 in this patch,
> and set it in backend to not affect other targets).

I don't think this patch captures the case in a sensible way - it will simply
never hoist computations out of loop header blocks with depth > 3, which
is certainly not what you want.  Also the pre-header check is odd - we're
looking for computations in successors of BLOCK, but clearly a computation
in a pre-header is not at the same loop level as one in the header itself.

Note the difficulty in capturing "distance" is that the distance is simply not
available at this point - it is the anticipated values from the successors
that do _not_ compute the value itself that are the issue.  To capture
"distance" you'd need to somehow "age" anticipated values when
propagating them upwards during compute_antic (where it would then
also apply to PRE in general).
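
To illustrate the idea (purely a hypothetical sketch - none of the
names below exist in tree-ssa-pre.c):

  /* Pair each anticipated expression with a distance, counted in
     basic blocks, to the nearest point that actually computes it.  */
  struct aged_pre_expr
  {
    pre_expr expr;
    unsigned distance;
  };

When merging the successors' ANTIC_IN sets in compute_antic, a
successor that computes the value would contribute distance 0, one
that merely anticipates it would contribute its recorded distance
plus one, and insertion would then refuse expressions whose distance
exceeds some limit.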

As with all other heuristics, the only place one could do hackish attempts
with at least some reasoning is the elimination phase, where
we make use of the (hoist) insertions - of course, for hoisting we already
know we'll get the "close" use in one of the successors, so I fear even
there it will be impossible to do something sensible.

Richard.

> Thanks,
> Prathamesh
> >
> > Thanks,
> > Prathamesh
> >
> >
> >
> > Thanks,
> > Prathamesh
> > >
> > > Thanks,
> > > Prathamesh
> > > >
> > > > Alexander

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: LTO slows down calculix by more than 10% on aarch64
  2020-09-22  7:25                   ` Richard Biener
@ 2020-09-22  9:37                     ` Prathamesh Kulkarni
  2020-09-22 11:06                       ` Richard Biener
  0 siblings, 1 reply; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-09-22  9:37 UTC (permalink / raw)
  To: Richard Biener; +Cc: Alexander Monakov, GCC Development

[-- Attachment #1: Type: text/plain, Size: 15825 bytes --]

On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> > >
> > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > <prathamesh.kulkarni@linaro.org> wrote:
> > > >
> > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > > >
> > > > > > I obtained perf stat results for following benchmark runs:
> > > > > >
> > > > > > -O2:
> > > > > >
> > > > > >     7856832.692380      task-clock (msec)         #    1.000 CPUs utilized
> > > > > >               3758               context-switches          #    0.000 K/sec
> > > > > >                 40                 cpu-migrations             #    0.000 K/sec
> > > > > >              40847              page-faults                   #    0.005 K/sec
> > > > > >      7856782413676      cycles                           #    1.000 GHz
> > > > > >      6034510093417      instructions                   #    0.77  insn per cycle
> > > > > >       363937274287       branches                       #   46.321 M/sec
> > > > > >        48557110132       branch-misses                #   13.34% of all branches
> > > > >
> > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > > enough for this kind of code)
> > > > >
> > > > > > -O2 with orthonl inlined:
> > > > > >
> > > > > >     8319643.114380      task-clock (msec)       #    1.000 CPUs utilized
> > > > > >               4285               context-switches         #    0.001 K/sec
> > > > > >                 28                 cpu-migrations            #    0.000 K/sec
> > > > > >              40843              page-faults                  #    0.005 K/sec
> > > > > >      8319591038295      cycles                          #    1.000 GHz
> > > > > >      6276338800377      instructions                  #    0.75  insn per cycle
> > > > > >       467400726106       branches                      #   56.180 M/sec
> > > > > >        45986364011        branch-misses              #    9.84% of all branches
> > > > >
> > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > >
> > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > >
> > > > > >    8207331.088040      task-clock (msec)   #    1.000 CPUs utilized
> > > > > >               2266               context-switches    #    0.000 K/sec
> > > > > >                 32                 cpu-migrations       #    0.000 K/sec
> > > > > >              40846              page-faults             #    0.005 K/sec
> > > > > >      8207292032467      cycles                     #   1.000 GHz
> > > > > >      6035724436440      instructions             #    0.74  insn per cycle
> > > > > >       364415440156       branches                 #   44.401 M/sec
> > > > > >        53138327276        branch-misses         #   14.58% of all branches
> > > > >
> > > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > > not very complicated (though Fortran array addressing...).
> > > > >
> > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > >
> > > > > >    7797265.206850      task-clock (msec)         #    1.000 CPUs utilized
> > > > > >               3139              context-switches          #    0.000 K/sec
> > > > > >                 20                cpu-migrations             #    0.000 K/sec
> > > > > >              40846              page-faults                  #    0.005 K/sec
> > > > > >      7797221351467      cycles                          #    1.000 GHz
> > > > > >      6187348757324      instructions                  #    0.79  insn per cycle
> > > > > >       461840800061       branches                      #   59.231 M/sec
> > > > > >        26920311761        branch-misses             #    5.83% of all branches
> > > > >
> > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > > in insn count).
> > > > >
> > > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > > consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > > each iterating only 3 times).
> > > > >
> > > > > > Perf profiles for
> > > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > >
> > > > > >           3196866 |1f04:    ldur   d1, [x1, #-248]
> > > > > > 216348301800│            add    w0, w0, #0x1
> > > > > >             985098 |            add    x2, x2, #0x18
> > > > > > 216215999206│            add    x1, x1, #0x48
> > > > > > 215630376504│            fmul   d1, d5, d1
> > > > > > 863829148015│            fmul   d1, d1, d6
> > > > > > 864228353526│            fmul   d0, d1, d0
> > > > > > 864568163014│            fmadd  d2, d0, d16, d2
> > > > > >                         │             cmp    w0, #0x4
> > > > > > 216125427594│          ↓ b.eq   1f34
> > > > > >         15010377│             ldur   d0, [x2, #-8]
> > > > > > 143753737468│          ↑ b      1f04
> > > > > >
> > > > > > -O2 with inlined orthonl:
> > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > >
> > > > > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > > > > 144055883055│            add    w0, w0, #0x1
> > > > > >   72262104254│            add    x2, x2, #0x18
> > > > > > 143991169721│            add    x1, x1, #0x48
> > > > > > 288648917780│            fmul   d15, d17, d15
> > > > > > 864665644756│            fmul   d15, d15, d18
> > > > > > 863868426387│            fmul   d14, d15, d14
> > > > > > 865228159813│            fmadd  d16, d14, d31, d16
> > > > > >             245967│            cmp    w0, #0x4
> > > > > > 215396760545│         ↓ b.eq   1f28
> > > > > >       704732365│            ldur   d14, [x2, #-8]
> > > > > > 143775979620│         ↑ b      1ef8
> > > > >
> > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > >
> > > > > High count on the initial ldur instruction could be explained if the loop
> > > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > > assembly.
> > > > >
> > > > > It may also be possible to check if code alignment matters, by compiling with
> > > > > -falign-loops=32.
> > > > Hi,
> > > > Thanks a lot for the detailed feedback, and I am sorry for late response.
> > > >
> > > > The hoisting region is:
> > > > if(mattyp.eq.1) then
> > > >   4 loops
> > > > elseif(mattyp.eq.2) then
> > > >   {
> > > >      orthonl inlined into basic block;
> > > >      loads w[0] .. w[8]
> > > >   }
> > > > else
> > > >    6 loops  // load anisox
> > > >
> > > > followed by basic block:
> > > >
> > > >  senergy=
> > > >      &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > >      &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > >      &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > >                      s(ii1,jj1)=s(ii1,jj1)+senergy
> > > >                      s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > >                      s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > >
> > > > Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block,
> > > > right in block 181, which is:
> > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > >
> > > > which is then further hoisted to block 173:
> > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > >
> > > > From block 181, we have two paths towards senergy block (bb 194):
> > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > AND
> > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > which has a path length of around 18 blocks.
> > > > (bb 194 post-dominates bb 181 and bb 173).
> > > >
> > > > Disabling only load hoisting within blocks 173 and 181
> > > > (simply avoid inserting pre_expr if pre_expr->kind == REFERENCE),
> > > > avoid hoisting of 'w' array and brings back most of performance. Which
> > > > verifies that it is hoisting of the
> > > > 'w' array (w[0] ... w[8]), which is causing the slowdown ?
> > > >
> > > > I obtained perf profiles for full hoisting, and disabled hoisting of
> > > > 'w' array for the 6 loops, and the most drastic difference was
> > > > for ldur instruction:
> > > >
> > > > With full hoisting:
> > > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > >
> > > > Without full hoisting:
> > > > 3441224 │1edc:   ldur   d1, [x1, #-248]
> > > >
> > > > (The loop entry seems to be fall thru in both cases. I have attached
> > > > profiles for both cases).
> > > >
> > > > IIUC, the instruction seems to be loading the first element from anisox array,
> > > > which makes me wonder if the issue was with data-cache miss for slower version.
> > > > I ran perf script on perf data for L1-dcache-load-misses with period = 1million,
> > > > and it reported two cache misses on the ldur instruction in full hoisting case,
> > > > while it reported zero for the disabled load hoisting case.
> > > > So I wonder if the slowdown happens because hoisting of 'w' array
> > > > possibly results
> > > > in eviction of anisox thus causing a cache miss inside the inner loop
> > > > and making load slower ?
> > > >
> > > > Hoisting also seems to improve the number of overall cache misses tho.
> > > > For disabled hoisting  of 'w' array case, there were a total of 463
> > > > cache misses, while with full hoisting there were 357 cache misses
> > > > (with period = 1 million).
> > > > Does that happen because hoisting probably reduces cache misses along
> > > > the orthonl path (bb 173 - > bb 181 -> bb 182 -> bb 194) ?
> > > Hi,
> > > In general I feel for this or PR80155 case, the issues come with long
> > > range hoistings, inside a large CFG, since we don't have an accurate
> > > way to model target resources (register pressure in PR80155 case / or
> > > possibly cache pressure in this case?) at tree level and we end up
> > > with register spill or cache miss inside loops, which may offset the
> > > benefit of hoisting. As previously discussed the right way to go is a
> > > live range splitting pass, at GIMPLE -> RTL border which can also help
> > > with other code-movement optimizations (or if the source had variables
> > > with long live ranges).
> > >
> > > I was wondering tho as a cheap workaround, would it make sense to
> > > check if we are hoisting across a "large" region of nested loops, and
> > > avoid in that case since hoisting may exert resource pressure inside
> > > loop region ? (Especially, in the cases where hoisted expressions were
> > > not originally AVAIL in any of the loop blocks, and the loop region
> > > doesn't benefit from hoisting).
> > >
> > > For instance:
> > > FOR_EACH_EDGE (e, ei, block)
> > >   {
> > >     /* Avoid hoisting across more than 3 nested loops */
> > >     if (e->dest is a loop pre-header or loop header
> > >         && nesting depth of loop is > 3)
> > >      return false;
> > >   }
> > >
> > > I think this would work for resolving the calculix issue because it
> > > hoists across one region of 6 loops and another of 4 loops (didn' test
> > > yet).
> > > It's not bulletproof in that it will miss detecting cases where loop
> > > header (or pre-header) isn't a successor of candidate block (checking
> > > for
> > > that might get expensive tho?). I will test it on gcc suite and SPEC
> > > for any regressions.
> > > Does this sound like a reasonable heuristic ?
> > Hi,
> > The attached patch implements the above heuristic.
> > Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> > And it brings back most of performance for calculix on par with O2
> > (without inlining orthonl).
> > I verified that with patch there is no cache miss happening on load
> > insn inside loop
> > (with perf report -e L1-dcache-load-misses/period=1000000/)
> >
> > I am in the process of benchmarking the patch on aarch64 for SPEC for
> > speed and will report numbers
> > in couple of days. (If required, we could parametrize number of nested
> > loops, hardcoded (arbitrarily to) 3 in this patch,
> > and set it in backend to not affect other targets).
>
> I don't think this patch captures the case in a sensible way - it will simply
> never hoist computations out of loop header blocks with depth > 3 which
> is certainly not what you want.  Also the pre-header check is odd - we're
> looking for computations in successors of BLOCK but clearly a computation
> in a pre-header is not at the same loop level as one in the header itself.
Well, my intent was to check whether we are hoisting across a region
that has multiple nested loops, and in that case avoid hoisting
expressions that do not belong to any of the loop blocks, because that
may increase resource pressure inside the loops. For instance, in the
calculix issue we hoist the 'w' array from the post-dominator, and
neither loop region has any uses of 'w'.  I agree that checking just
the loop level is too coarse.
The pre-header check served the same purpose: to see whether we are
hoisting across a loop region, not necessarily from within the loops.
>
> Note the difficulty to capture "distance" is that the distance is simply not
> available at this point - it is the anticipated values from the successors
> that do _not_ compute the value itself that are the issue.  To capture
> "distance" you'd need to somehow "age" anticipated value when
> propagating them upwards during compute_antic (where it then
> would also apply to PRE in general).
Yes, indeed.  As a hack, would it make sense to avoid inserting an
expression in the block if it's ANTIC in the post-dominator block and
the "distance" between the block and its post-dominator is "too far",
as a trade-off between extending the live range and hoisting? In
general, as you point out, we'd need to compute distance info for
successor blocks during compute_antic, but special-casing the
post-dominator should be easy enough during do_hoist_insertion, and an
expr that is ANTIC in the post-dominator is potentially a "long range"
hoist if the region is large.
It's still a coarse heuristic though. I tried it in the attached patch.

Thanks,
Prathamesh


>
> As with all other heuristics the only place one could do hackish attempts
> with at least some reasoning is the elimination phase where
> we make use of the (hoist) insertions - of course for hoisting we already
> know we'll get the "close" use in one of the successors so I fear even
> there it will be impossible to do something sensible.
>
> Richard.
>
> > Thanks,
> > Prathamesh
> > >
> > > Thanks,
> > > Prathamesh
> > >
> > >
> > >
> > > Thanks,
> > > Prathamesh
> > > >
> > > > Thanks,
> > > > Prathamesh
> > > > >
> > > > > Alexander

[-- Attachment #2: gnu-659-pdom-3.diff --]
[-- Type: application/octet-stream, Size: 2870 bytes --]

diff --git a/gcc/tree-ssa-pre.c b/gcc/tree-ssa-pre.c
index 0c1654f3580..64842e01fa5 100644
--- a/gcc/tree-ssa-pre.c
+++ b/gcc/tree-ssa-pre.c
@@ -3477,6 +3477,43 @@ do_pre_partial_partial_insertion (basic_block block, basic_block dom)
   return new_stuff;
 }
 
+/* Return true if PDOM_BB is within DIST_LIMIT of BLOCK,
+   where "distance" is measured in terms of number of basic blocks.  */
+
+static bool
+pdom_within_dist_p_1 (basic_block block, basic_block pdom_bb,
+		      bool *visited, unsigned dist_limit,
+		      unsigned dist_from_block)
+{
+  if (dist_from_block >= dist_limit)
+    return false;
+
+  if (block == pdom_bb)
+    return true;
+
+  edge e;
+  edge_iterator ei;
+  visited[block->index] = true;
+
+  FOR_EACH_EDGE (e, ei, block->succs)
+    if (!visited[e->dest->index]
+	&& !pdom_within_dist_p_1 (e->dest, pdom_bb, visited,
+				  dist_limit, dist_from_block + 1))
+      return false;
+  return true; 
+}
+
+static bool
+pdom_within_dist_p (basic_block bb, basic_block pdom_bb, unsigned dist)
+{
+  unsigned n_bbs = n_basic_blocks_for_fn (cfun); 
+  bool *visited = new bool[n_bbs]; 
+  memset (visited, false, n_bbs); 
+  bool ret = pdom_within_dist_p_1 (bb, pdom_bb, visited, dist, 0);
+  delete[] visited;
+  return ret;
+}
+
 /* Insert expressions in BLOCK to compute hoistable values up.
    Return TRUE if something was inserted, otherwise return FALSE.
    The caller has to make sure that BLOCK has at least two successors.  */
@@ -3541,6 +3578,19 @@ do_hoist_insertion (basic_block block)
   if (bitmap_empty_p (&availout_in_some))
     return false;
 
+  /* Check if any of the hoistable expressions are ANTIC in the post-dom,
+     and avoid inserting those if the post-dom is beyond the threshold
+     distance from the block.  */
+
+  basic_block pdom_bb = get_immediate_dominator (CDI_POST_DOMINATORS, block);
+  bitmap_head hoist_from_pdom;
+  bitmap_initialize (&hoist_from_pdom, &grand_bitmap_obstack);
+  bitmap_and (&hoist_from_pdom, &availout_in_some, &ANTIC_IN (pdom_bb)->values);
+ 
+  if (!bitmap_empty_p (&hoist_from_pdom)
+      && !pdom_within_dist_p (block, pdom_bb, 17))
+    bitmap_and_compl_into (&availout_in_some, &ANTIC_IN (pdom_bb)->values);
+
   /* Hack hoistable_set in-place so we can use sorted_array_from_bitmap_set.  */
   bitmap_move (&hoistable_set.values, &availout_in_some);
   hoistable_set.expressions = ANTIC_IN (block)->expressions;
@@ -4099,6 +4149,7 @@ init_pre (void)
   alloc_aux_for_blocks (sizeof (struct bb_bitmap_sets));
 
   calculate_dominance_info (CDI_DOMINATORS);
+  calculate_dominance_info (CDI_POST_DOMINATORS);
 
   bitmap_obstack_initialize (&grand_bitmap_obstack);
   phi_translate_table = new hash_table<expr_pred_trans_d> (5110);
@@ -4131,6 +4182,7 @@ fini_pre ()
   name_to_id.release ();
 
   free_aux_for_blocks ();
+  free_dominance_info (CDI_POST_DOMINATORS);
 }
 
 namespace {

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: LTO slows down calculix by more than 10% on aarch64
  2020-09-22  9:37                     ` Prathamesh Kulkarni
@ 2020-09-22 11:06                       ` Richard Biener
  2020-09-22 16:24                         ` Prathamesh Kulkarni
  0 siblings, 1 reply; 25+ messages in thread
From: Richard Biener @ 2020-09-22 11:06 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: Alexander Monakov, GCC Development

On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> > >
> > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > <prathamesh.kulkarni@linaro.org> wrote:
> > > >
> > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > >
> > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > > > >
> > > > > > > I obtained perf stat results for following benchmark runs:
> > > > > > >
> > > > > > > -O2:
> > > > > > >
> > > > > > >     7856832.692380      task-clock (msec)         #    1.000 CPUs utilized
> > > > > > >               3758               context-switches          #    0.000 K/sec
> > > > > > >                 40                 cpu-migrations             #    0.000 K/sec
> > > > > > >              40847              page-faults                   #    0.005 K/sec
> > > > > > >      7856782413676      cycles                           #    1.000 GHz
> > > > > > >      6034510093417      instructions                   #    0.77  insn per cycle
> > > > > > >       363937274287       branches                       #   46.321 M/sec
> > > > > > >        48557110132       branch-misses                #   13.34% of all branches
> > > > > >
> > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > > > enough for this kind of code)
> > > > > >
> > > > > > > -O2 with orthonl inlined:
> > > > > > >
> > > > > > >     8319643.114380      task-clock (msec)       #    1.000 CPUs utilized
> > > > > > >               4285               context-switches         #    0.001 K/sec
> > > > > > >                 28                 cpu-migrations            #    0.000 K/sec
> > > > > > >              40843              page-faults                  #    0.005 K/sec
> > > > > > >      8319591038295      cycles                          #    1.000 GHz
> > > > > > >      6276338800377      instructions                  #    0.75  insn per cycle
> > > > > > >       467400726106       branches                      #   56.180 M/sec
> > > > > > >        45986364011        branch-misses              #    9.84% of all branches
> > > > > >
> > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > > >
> > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > >
> > > > > > >    8207331.088040      task-clock (msec)   #    1.000 CPUs utilized
> > > > > > >               2266               context-switches    #    0.000 K/sec
> > > > > > >                 32                 cpu-migrations       #    0.000 K/sec
> > > > > > >              40846              page-faults             #    0.005 K/sec
> > > > > > >      8207292032467      cycles                     #   1.000 GHz
> > > > > > >      6035724436440      instructions             #    0.74  insn per cycle
> > > > > > >       364415440156       branches                 #   44.401 M/sec
> > > > > > >        53138327276        branch-misses         #   14.58% of all branches
> > > > > >
> > > > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > > > not very complicated (though Fortran array addressing...).
> > > > > >
> > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > >
> > > > > > >    7797265.206850      task-clock (msec)         #    1.000 CPUs utilized
> > > > > > >               3139              context-switches          #    0.000 K/sec
> > > > > > >                 20                cpu-migrations             #    0.000 K/sec
> > > > > > >              40846              page-faults                  #    0.005 K/sec
> > > > > > >      7797221351467      cycles                          #    1.000 GHz
> > > > > > >      6187348757324      instructions                  #    0.79  insn per cycle
> > > > > > >       461840800061       branches                      #   59.231 M/sec
> > > > > > >        26920311761        branch-misses             #    5.83% of all branches
> > > > > >
> > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > > > in insn count).
> > > > > >
> > > > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > > > consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > > > each iterating only 3 times).
> > > > > >
> > > > > > > Perf profiles for
> > > > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > >
> > > > > > >           3196866 |1f04:    ldur   d1, [x1, #-248]
> > > > > > > 216348301800│            add    w0, w0, #0x1
> > > > > > >             985098 |            add    x2, x2, #0x18
> > > > > > > 216215999206│            add    x1, x1, #0x48
> > > > > > > 215630376504│            fmul   d1, d5, d1
> > > > > > > 863829148015│            fmul   d1, d1, d6
> > > > > > > 864228353526│            fmul   d0, d1, d0
> > > > > > > 864568163014│            fmadd  d2, d0, d16, d2
> > > > > > >                         │             cmp    w0, #0x4
> > > > > > > 216125427594│          ↓ b.eq   1f34
> > > > > > >         15010377│             ldur   d0, [x2, #-8]
> > > > > > > 143753737468│          ↑ b      1f04
> > > > > > >
> > > > > > > -O2 with inlined orthonl:
> > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > >
> > > > > > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > > > > > 144055883055│            add    w0, w0, #0x1
> > > > > > >   72262104254│            add    x2, x2, #0x18
> > > > > > > 143991169721│            add    x1, x1, #0x48
> > > > > > > 288648917780│            fmul   d15, d17, d15
> > > > > > > 864665644756│            fmul   d15, d15, d18
> > > > > > > 863868426387│            fmul   d14, d15, d14
> > > > > > > 865228159813│            fmadd  d16, d14, d31, d16
> > > > > > >             245967│            cmp    w0, #0x4
> > > > > > > 215396760545│         ↓ b.eq   1f28
> > > > > > >       704732365│            ldur   d14, [x2, #-8]
> > > > > > > 143775979620│         ↑ b      1ef8
> > > > > >
> > > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > > >
> > > > > > High count on the initial ldur instruction could be explained if the loop
> > > > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > > > assembly.
> > > > > >
> > > > > > It may also be possible to check if code alignment matters, by compiling with
> > > > > > -falign-loops=32.
> > > > > Hi,
> > > > > Thanks a lot for the detailed feedback, and I am sorry for late response.
> > > > >
> > > > > The hoisting region is:
> > > > > if(mattyp.eq.1) then
> > > > >   4 loops
> > > > > elseif(mattyp.eq.2) then
> > > > >   {
> > > > >      orthonl inlined into basic block;
> > > > >      loads w[0] .. w[8]
> > > > >   }
> > > > > else
> > > > >    6 loops  // load anisox
> > > > >
> > > > > followed by basic block:
> > > > >
> > > > >  senergy=
> > > > >      &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > >      &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > >      &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > >                      s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > >                      s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > >                      s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > >
> > > > > Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block,
> > > > > right in block 181, which is:
> > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > >
> > > > > which is then further hoisted to block 173:
> > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > >
> > > > > From block 181, we have two paths towards senergy block (bb 194):
> > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > AND
> > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > > which has a path length of around 18 blocks.
> > > > > (bb 194 post-dominates bb 181 and bb 173).
> > > > >
> > > > > Disabling only load hoisting within blocks 173 and 181
> > > > > (simply avoid inserting pre_expr if pre_expr->kind == REFERENCE),
> > > > > avoid hoisting of 'w' array and brings back most of performance. Which
> > > > > verifies that it is hoisting of the
> > > > > 'w' array (w[0] ... w[8]), which is causing the slowdown ?
> > > > >
> > > > > I obtained perf profiles for full hoisting, and disabled hoisting of
> > > > > 'w' array for the 6 loops, and the most drastic difference was
> > > > > for ldur instruction:
> > > > >
> > > > > With full hoisting:
> > > > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > > >
> > > > > Without full hoisting:
> > > > > 3441224 │1edc:   ldur   d1, [x1, #-248]
> > > > >
> > > > > (The loop entry seems to be fall thru in both cases. I have attached
> > > > > profiles for both cases).
> > > > >
> > > > > IIUC, the instruction seems to be loading the first element from anisox array,
> > > > > which makes me wonder if the issue was with data-cache miss for slower version.
> > > > > I ran perf script on perf data for L1-dcache-load-misses with period = 1million,
> > > > > and it reported two cache misses on the ldur instruction in full hoisting case,
> > > > > while it reported zero for the disabled load hoisting case.
> > > > > So I wonder if the slowdown happens because hoisting of 'w' array
> > > > > possibly results
> > > > > in eviction of anisox thus causing a cache miss inside the inner loop
> > > > > and making load slower ?
> > > > >
> > > > > Hoisting also seems to improve the number of overall cache misses tho.
> > > > > For disabled hoisting  of 'w' array case, there were a total of 463
> > > > > cache misses, while with full hoisting there were 357 cache misses
> > > > > (with period = 1 million).
> > > > > Does that happen because hoisting probably reduces cache misses along
> > > > > the orthonl path (bb 173 - > bb 181 -> bb 182 -> bb 194) ?
> > > > Hi,
> > > > In general I feel for this or PR80155 case, the issues come with long
> > > > range hoistings, inside a large CFG, since we don't have an accurate
> > > > way to model target resources (register pressure in PR80155 case / or
> > > > possibly cache pressure in this case?) at tree level and we end up
> > > > with register spill or cache miss inside loops, which may offset the
> > > > benefit of hoisting. As previously discussed the right way to go is a
> > > > live range splitting pass, at GIMPLE -> RTL border which can also help
> > > > with other code-movement optimizations (or if the source had variables
> > > > with long live ranges).
> > > >
> > > > I was wondering tho as a cheap workaround, would it make sense to
> > > > check if we are hoisting across a "large" region of nested loops, and
> > > > avoid in that case since hoisting may exert resource pressure inside
> > > > loop region ? (Especially, in the cases where hoisted expressions were
> > > > not originally AVAIL in any of the loop blocks, and the loop region
> > > > doesn't benefit from hoisting).
> > > >
> > > > For instance:
> > > > FOR_EACH_EDGE (e, ei, block)
> > > >   {
> > > >     /* Avoid hoisting across more than 3 nested loops */
> > > >     if (e->dest is a loop pre-header or loop header
> > > >         && nesting depth of loop is > 3)
> > > >      return false;
> > > >   }
> > > >
> > > > I think this would work for resolving the calculix issue because it
> > > > hoists across one region of 6 loops and another of 4 loops (didn' test
> > > > yet).
> > > > It's not bulletproof in that it will miss detecting cases where loop
> > > > header (or pre-header) isn't a successor of candidate block (checking
> > > > for
> > > > that might get expensive tho?). I will test it on gcc suite and SPEC
> > > > for any regressions.
> > > > Does this sound like a reasonable heuristic ?
> > > Hi,
> > > The attached patch implements the above heuristic.
> > > Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> > > And it brings back most of performance for calculix on par with O2
> > > (without inlining orthonl).
> > > I verified that with patch there is no cache miss happening on load
> > > insn inside loop
> > > (with perf report -e L1-dcache-load-misses/period=1000000/)
> > >
> > > I am in the process of benchmarking the patch on aarch64 for SPEC for
> > > speed and will report numbers
> > > in couple of days. (If required, we could parametrize number of nested
> > > loops, hardcoded (arbitrarily to) 3 in this patch,
> > > and set it in backend to not affect other targets).
> >
> > I don't think this patch captures the case in a sensible way - it will simply
> > never hoist computations out of loop header blocks with depth > 3 which
> > is certainly not what you want.  Also the pre-header check is odd - we're
> > looking for computations in successors of BLOCK but clearly a computation
> > in a pre-header is not at the same loop level as one in the header itself.
> Well, my intent was to check if we are hoisting across a region,
> which has multiple nested loops, and in that case, avoid hoisting expressions
> that do not belong to any loop blocks, because that may increase
> resource pressure
> inside loops. For instance, in the calculix issue we hoist 'w' array
> from post-dom and neither
> loop region has any uses of 'w'.  I agree checking just for loop level
> is too coarse.
> The check with pre-header was essentially the same to see if we are
> hoisting across a loop region,
> not necessarily from within the loops.

But it will fail to hoist *p in

   if (x)
     {
        a = *p;
     }
  else
    {
       b = *p;
    }

<make distance large>
pdom:
  c = *p;

so it isn't what matters either.  What happens at the immediate
post-dominator isn't directly relevant - what matters is whether the
pdom is the one making the value antic on one of the outgoing edges.
In that case we're also going to PRE *p into the arm not computing *p
(albeit in a later position).  But that property is impossible to
compute from the sets themselves (not even mentioning the arbitrary
CFG that can be in between the block and its pdom, or the weird pdoms
we compute for regions not having a path to exit, like infinite
loops).

You could eventually look at the pdom predecessors and if *p is not AVAIL_OUT
in each of them we _might_ have the situation you want to protect against.
But as said PRE insertion will likely have made sure it _is_ AVAIL_OUT in each
of them ...

> >
> > Note the difficulty to capture "distance" is that the distance is simply not
> > available at this point - it is the anticipated values from the successors
> > that do _not_ compute the value itself that are the issue.  To capture
> > "distance" you'd need to somehow "age" anticipated value when
> > propagating them upwards during compute_antic (where it then
> > would also apply to PRE in general).
> Yes, indeed.  As a hack, would it make sense to avoid inserting an
> expression in the block,
> if it's ANTIC in post-dom block as a trade-off between extending live
> range and hoisting
> if the "distance" between block and post-dom is "too far" ? In
> general, as you point out, we'd need to compute,
> distance info for successors block during compute_antic, but special
> casing for post-dom should be easy enough
> during do_hoist_insertion, and hoisting an expr that is ANTIC in
> post-dom could be potentially "long range", if the region is large.
> It's still a coarse heuristic tho. I tried it in the attached patch.
>
> Thanks,
> Prathamesh
>
>
> >
> > As with all other heuristics the only place one could do hackish attempts
> > with at least some reasoning is the elimination phase where
> > we make use of the (hoist) insertions - of course for hoisting we already
> > know we'll get the "close" use in one of the successors so I fear even
> > there it will be impossible to do something sensible.
> >
> > Richard.
> >
> > > Thanks,
> > > Prathamesh
> > > >
> > > > Thanks,
> > > > Prathamesh
> > > >
> > > >
> > > >
> > > > Thanks,
> > > > Prathamesh
> > > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > > >
> > > > > > Alexander

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: LTO slows down calculix by more than 10% on aarch64
  2020-09-22 11:06                       ` Richard Biener
@ 2020-09-22 16:24                         ` Prathamesh Kulkarni
  2020-09-23  7:52                           ` Richard Biener
  0 siblings, 1 reply; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-09-22 16:24 UTC (permalink / raw)
  To: Richard Biener; +Cc: Alexander Monakov, GCC Development

On Tue, 22 Sep 2020 at 16:36, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guenther@gmail.com> wrote:
> > >
> > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > <prathamesh.kulkarni@linaro.org> wrote:
> > > >
> > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > >
> > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > >
> > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > > > > >
> > > > > > > > I obtained perf stat results for following benchmark runs:
> > > > > > > >
> > > > > > > > -O2:
> > > > > > > >
> > > > > > > >     7856832.692380      task-clock (msec)         #    1.000 CPUs utilized
> > > > > > > >               3758               context-switches          #    0.000 K/sec
> > > > > > > >                 40                 cpu-migrations             #    0.000 K/sec
> > > > > > > >              40847              page-faults                   #    0.005 K/sec
> > > > > > > >      7856782413676      cycles                           #    1.000 GHz
> > > > > > > >      6034510093417      instructions                   #    0.77  insn per cycle
> > > > > > > >       363937274287       branches                       #   46.321 M/sec
> > > > > > > >        48557110132       branch-misses                #   13.34% of all branches
> > > > > > >
> > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > > > > enough for this kind of code)
> > > > > > >
> > > > > > > > -O2 with orthonl inlined:
> > > > > > > >
> > > > > > > >     8319643.114380      task-clock (msec)       #    1.000 CPUs utilized
> > > > > > > >               4285               context-switches         #    0.001 K/sec
> > > > > > > >                 28                 cpu-migrations            #    0.000 K/sec
> > > > > > > >              40843              page-faults                  #    0.005 K/sec
> > > > > > > >      8319591038295      cycles                          #    1.000 GHz
> > > > > > > >      6276338800377      instructions                  #    0.75  insn per cycle
> > > > > > > >       467400726106       branches                      #   56.180 M/sec
> > > > > > > >        45986364011        branch-misses              #    9.84% of all branches
> > > > > > >
> > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > > > >
> > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > >
> > > > > > > >    8207331.088040      task-clock (msec)   #    1.000 CPUs utilized
> > > > > > > >               2266               context-switches    #    0.000 K/sec
> > > > > > > >                 32                 cpu-migrations       #    0.000 K/sec
> > > > > > > >              40846              page-faults             #    0.005 K/sec
> > > > > > > >      8207292032467      cycles                     #   1.000 GHz
> > > > > > > >      6035724436440      instructions             #    0.74  insn per cycle
> > > > > > > >       364415440156       branches                 #   44.401 M/sec
> > > > > > > >        53138327276        branch-misses         #   14.58% of all branches
> > > > > > >
> > > > > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > > > > not very complicated (though Fortran array addressing...).
> > > > > > >
> > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > >
> > > > > > > >    7797265.206850      task-clock (msec)         #    1.000 CPUs utilized
> > > > > > > >               3139              context-switches          #    0.000 K/sec
> > > > > > > >                 20                cpu-migrations             #    0.000 K/sec
> > > > > > > >              40846              page-faults                  #    0.005 K/sec
> > > > > > > >      7797221351467      cycles                          #    1.000 GHz
> > > > > > > >      6187348757324      instructions                  #    0.79  insn per cycle
> > > > > > > >       461840800061       branches                      #   59.231 M/sec
> > > > > > > >        26920311761        branch-misses             #    5.83% of all branches
> > > > > > >
> > > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > > > > in insn count).
> > > > > > >
> > > > > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > > > > consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > > > > each iterating only 3 times).
> > > > > > >
> > > > > > > > Perf profiles for
> > > > > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > >
> > > > > > > >           3196866 |1f04:    ldur   d1, [x1, #-248]
> > > > > > > > 216348301800│            add    w0, w0, #0x1
> > > > > > > >             985098 |            add    x2, x2, #0x18
> > > > > > > > 216215999206│            add    x1, x1, #0x48
> > > > > > > > 215630376504│            fmul   d1, d5, d1
> > > > > > > > 863829148015│            fmul   d1, d1, d6
> > > > > > > > 864228353526│            fmul   d0, d1, d0
> > > > > > > > 864568163014│            fmadd  d2, d0, d16, d2
> > > > > > > >                         │             cmp    w0, #0x4
> > > > > > > > 216125427594│          ↓ b.eq   1f34
> > > > > > > >         15010377│             ldur   d0, [x2, #-8]
> > > > > > > > 143753737468│          ↑ b      1f04
> > > > > > > >
> > > > > > > > -O2 with inlined orthonl:
> > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > >
> > > > > > > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > > > > > > 144055883055│            add    w0, w0, #0x1
> > > > > > > >   72262104254│            add    x2, x2, #0x18
> > > > > > > > 143991169721│            add    x1, x1, #0x48
> > > > > > > > 288648917780│            fmul   d15, d17, d15
> > > > > > > > 864665644756│            fmul   d15, d15, d18
> > > > > > > > 863868426387│            fmul   d14, d15, d14
> > > > > > > > 865228159813│            fmadd  d16, d14, d31, d16
> > > > > > > >             245967│            cmp    w0, #0x4
> > > > > > > > 215396760545│         ↓ b.eq   1f28
> > > > > > > >       704732365│            ldur   d14, [x2, #-8]
> > > > > > > > 143775979620│         ↑ b      1ef8
> > > > > > >
> > > > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > > > >
> > > > > > > High count on the initial ldur instruction could be explained if the loop
> > > > > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > > > > assembly.
> > > > > > >
> > > > > > > It may also be possible to check if code alignment matters, by compiling with
> > > > > > > -falign-loops=32.
> > > > > > Hi,
> > > > > > Thanks a lot for the detailed feedback, and I am sorry for late response.
> > > > > >
> > > > > > The hoisting region is:
> > > > > > if(mattyp.eq.1) then
> > > > > >   4 loops
> > > > > > elseif(mattyp.eq.2) then
> > > > > >   {
> > > > > >      orthonl inlined into basic block;
> > > > > >      loads w[0] .. w[8]
> > > > > >   }
> > > > > > else
> > > > > >    6 loops  // load anisox
> > > > > >
> > > > > > followed by basic block:
> > > > > >
> > > > > >  senergy=
> > > > > >      &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > > >      &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > > >      &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > > >                      s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > > >                      s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > > >                      s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > > >
> > > > > > Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block,
> > > > > > right in block 181, which is:
> > > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > > >
> > > > > > which is then further hoisted to block 173:
> > > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > > >
> > > > > > From block 181, we have two paths towards senergy block (bb 194):
> > > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > > AND
> > > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > > > which has a path length of around 18 blocks.
> > > > > > (bb 194 post-dominates bb 181 and bb 173).
> > > > > >
> > > > > > Disabling only load hoisting within blocks 173 and 181
> > > > > > (simply avoid inserting pre_expr if pre_expr->kind == REFERENCE),
> > > > > > avoid hoisting of 'w' array and brings back most of performance. Which
> > > > > > verifies that it is hoisting of the
> > > > > > 'w' array (w[0] ... w[8]), which is causing the slowdown ?
> > > > > >
> > > > > > I obtained perf profiles for full hoisting, and disabled hoisting of
> > > > > > 'w' array for the 6 loops, and the most drastic difference was
> > > > > > for ldur instruction:
> > > > > >
> > > > > > With full hoisting:
> > > > > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > > > >
> > > > > > Without full hoisting:
> > > > > > 3441224 │1edc:   ldur   d1, [x1, #-248]
> > > > > >
> > > > > > (The loop entry seems to be fall thru in both cases. I have attached
> > > > > > profiles for both cases).
> > > > > >
> > > > > > IIUC, the instruction seems to be loading the first element from anisox array,
> > > > > > which makes me wonder if the issue was with data-cache miss for slower version.
> > > > > > I ran perf script on perf data for L1-dcache-load-misses with period = 1million,
> > > > > > and it reported two cache misses on the ldur instruction in full hoisting case,
> > > > > > while it reported zero for the disabled load hoisting case.
> > > > > > So I wonder if the slowdown happens because hoisting of 'w' array
> > > > > > possibly results
> > > > > > in eviction of anisox thus causing a cache miss inside the inner loop
> > > > > > and making load slower ?
> > > > > >
> > > > > > Hoisting also seems to improve the number of overall cache misses tho.
> > > > > > For disabled hoisting  of 'w' array case, there were a total of 463
> > > > > > cache misses, while with full hoisting there were 357 cache misses
> > > > > > (with period = 1 million).
> > > > > > Does that happen because hoisting probably reduces cache misses along
> > > > > > the orthonl path (bb 173 - > bb 181 -> bb 182 -> bb 194) ?
> > > > > Hi,
> > > > > In general I feel for this or PR80155 case, the issues come with long
> > > > > range hoistings, inside a large CFG, since we don't have an accurate
> > > > > way to model target resources (register pressure in PR80155 case / or
> > > > > possibly cache pressure in this case?) at tree level and we end up
> > > > > with register spill or cache miss inside loops, which may offset the
> > > > > benefit of hoisting. As previously discussed the right way to go is a
> > > > > live range splitting pass, at GIMPLE -> RTL border which can also help
> > > > > with other code-movement optimizations (or if the source had variables
> > > > > with long live ranges).
> > > > >
> > > > > I was wondering tho as a cheap workaround, would it make sense to
> > > > > check if we are hoisting across a "large" region of nested loops, and
> > > > > avoid in that case since hoisting may exert resource pressure inside
> > > > > loop region ? (Especially, in the cases where hoisted expressions were
> > > > > not originally AVAIL in any of the loop blocks, and the loop region
> > > > > doesn't benefit from hoisting).
> > > > >
> > > > > For instance:
> > > > > FOR_EACH_EDGE (e, ei, block)
> > > > >   {
> > > > >     /* Avoid hoisting across more than 3 nested loops */
> > > > >     if (e->dest is a loop pre-header or loop header
> > > > >         && nesting depth of loop is > 3)
> > > > >      return false;
> > > > >   }
> > > > >
> > > > > I think this would work for resolving the calculix issue because it
> > > > > hoists across one region of 6 loops and another of 4 loops (didn' test
> > > > > yet).
> > > > > It's not bulletproof in that it will miss detecting cases where loop
> > > > > header (or pre-header) isn't a successor of candidate block (checking
> > > > > for
> > > > > that might get expensive tho?). I will test it on gcc suite and SPEC
> > > > > for any regressions.
> > > > > Does this sound like a reasonable heuristic ?
> > > > Hi,
> > > > The attached patch implements the above heuristic.
> > > > Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> > > > And it brings back most of performance for calculix on par with O2
> > > > (without inlining orthonl).
> > > > I verified that with patch there is no cache miss happening on load
> > > > insn inside loop
> > > > (with perf report -e L1-dcache-load-misses/period=1000000/)
> > > >
> > > > I am in the process of benchmarking the patch on aarch64 for SPEC for
> > > > speed and will report numbers
> > > > in couple of days. (If required, we could parametrize number of nested
> > > > loops, hardcoded (arbitrarily to) 3 in this patch,
> > > > and set it in backend to not affect other targets).
> > >
> > > I don't think this patch captures the case in a sensible way - it will simply
> > > never hoist computations out of loop header blocks with depth > 3 which
> > > is certainly not what you want.  Also the pre-header check is odd - we're
> > > looking for computations in successors of BLOCK but clearly a computation
> > > in a pre-header is not at the same loop level as one in the header itself.
> > Well, my intent was to check if we are hoisting across a region,
> > which has multiple nested loops, and in that case, avoid hoisting expressions
> > that do not belong to any loop blocks, because that may increase
> > resource pressure
> > inside loops. For instance, in the calculix issue we hoist 'w' array
> > from post-dom and neither
> > loop region has any uses of 'w'.  I agree checking just for loop level
> > is too coarse.
> > The check with pre-header was essentially the same to see if we are
> > hoisting across a loop region,
> > not necessarily from within the loops.
>
> But it will fail to hoist *p in
>
>    if (x)
>      {
>         a = *p;
>      }
>   else
>     {
>        b = *p;
>     }
>
> <make distance large>
> pdom:
>   c = *p;
>
> so it isn't what matters either.  What happens at the immediate post-dominator
> isn't directly relevant - what matters would be if the pdom is the one making
> the value antic on one of the outgoing edges.  In that case we're also going
> to PRE *p into the arm not computing *p (albeit in a later position).  But
> that property is impossible to compute from the sets itself (not even mentioning
> the arbitrary CFG that can be inbetween the block and its pdom or the weird
> pdoms we compute for regions not having a path to exit, like infinite loops).
>
> You could eventually look at the pdom predecessors and if *p is not AVAIL_OUT
> in each of them we _might_ have the situation you want to protect against.
> But as said PRE insertion will likely have made sure it _is_ AVAIL_OUT in each
> of them ...
Hi Richard,
Thanks for the suggestions. Right, the issue here seems to be that the
post-dom block is making expressions ANTIC. Before doing insertion,
could we simply copy the AVAIL_OUT set of each block into another set,
say ORIG_AVAIL_OUT, as a guard against PRE eventually inserting
expressions in the pred blocks of the pdom and making them available?
And during hoisting, if the distance is "large", we could check that
each expr that is ANTIC_IN (pdom) is also ORIG_AVAIL_OUT in each pred
of the pdom.
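
Roughly, something along these lines (a sketch only - ORIG_AVAIL_OUT
and dist_threshold are made-up names to make the proposal concrete,
and pdom_within_dist_p is from the earlier patch):

  if (!pdom_within_dist_p (block, pdom_bb, dist_threshold))
    {
      /* Values hoistable here that are also ANTIC in the pdom.  */
      bitmap_head not_avail;
      bitmap_initialize (&not_avail, &grand_bitmap_obstack);
      bitmap_and (&not_avail, &availout_in_some,
                  &ANTIC_IN (pdom_bb)->values);

      /* A value that was already available in some pdom predecessor
         before this pass inserted anything is not the "pdom makes it
         antic" case; keep it hoistable.  */
      edge e;
      edge_iterator ei;
      FOR_EACH_EDGE (e, ei, pdom_bb->preds)
        bitmap_and_compl_into (&not_avail,
                               &ORIG_AVAIL_OUT (e->src)->values);

      /* What remains is ANTIC only because of the distant pdom;
         drop it from the hoisting candidates.  */
      bitmap_and_compl_into (&availout_in_some, &not_avail);
    }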

Thanks,
Prathamesh


>
> > >
> > > Note the difficulty to capture "distance" is that the distance is simply not
> > > available at this point - it is the anticipated values from the successors
> > > that do _not_ compute the value itself that are the issue.  To capture
> > > "distance" you'd need to somehow "age" anticipated value when
> > > propagating them upwards during compute_antic (where it then
> > > would also apply to PRE in general).
> > Yes, indeed.  As a hack, would it make sense to avoid inserting an
> > expression in the block,
> > if it's ANTIC in post-dom block as a trade-off between extending live
> > range and hoisting
> > if the "distance" between block and post-dom is "too far" ? In
> > general, as you point out, we'd need to compute,
> > distance info for successors block during compute_antic, but special
> > casing for post-dom should be easy enough
> > during do_hoist_insertion, and hoisting an expr that is ANTIC in
> > post-dom could be potentially "long range", if the region is large.
> > It's still a coarse heuristic tho. I tried it in the attached patch.
> >
> > Thanks,
> > Prathamesh
> >
> >
> > >
> > > As with all other heuristics the only place one could do hackish attempts
> > > with at least some reasoning is the elimination phase where
> > > we make use of the (hoist) insertions - of course for hoisting we already
> > > know we'll get the "close" use in one of the successors so I fear even
> > > there it will be impossible to do something sensible.
> > >
> > > Richard.
> > >
> > > > Thanks,
> > > > Prathamesh
> > > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > >
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > > >
> > > > > > Thanks,
> > > > > > Prathamesh
> > > > > > >
> > > > > > > Alexander

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: LTO slows down calculix by more than 10% on aarch64
  2020-09-22 16:24                         ` Prathamesh Kulkarni
@ 2020-09-23  7:52                           ` Richard Biener
  2020-09-23 10:10                             ` Prathamesh Kulkarni
  0 siblings, 1 reply; 25+ messages in thread
From: Richard Biener @ 2020-09-23  7:52 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: Alexander Monakov, GCC Development

On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Tue, 22 Sep 2020 at 16:36, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> > >
> > > On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guenther@gmail.com> wrote:
> > > >
> > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > >
> > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > >
> > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > >
> > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > > > > > >
> > > > > > > > > I obtained perf stat results for following benchmark runs:
> > > > > > > > >
> > > > > > > > > -O2:
> > > > > > > > >
> > > > > > > > >     7856832.692380      task-clock (msec)         #    1.000 CPUs utilized
> > > > > > > > >               3758               context-switches          #    0.000 K/sec
> > > > > > > > >                 40                 cpu-migrations             #    0.000 K/sec
> > > > > > > > >              40847              page-faults                   #    0.005 K/sec
> > > > > > > > >      7856782413676      cycles                           #    1.000 GHz
> > > > > > > > >      6034510093417      instructions                   #    0.77  insn per cycle
> > > > > > > > >       363937274287       branches                       #   46.321 M/sec
> > > > > > > > >        48557110132       branch-misses                #   13.34% of all branches
> > > > > > > >
> > > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > > > > > enough for this kind of code)
> > > > > > > >
> > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > >
> > > > > > > > >     8319643.114380      task-clock (msec)       #    1.000 CPUs utilized
> > > > > > > > >               4285               context-switches         #    0.001 K/sec
> > > > > > > > >                 28                 cpu-migrations            #    0.000 K/sec
> > > > > > > > >              40843              page-faults                  #    0.005 K/sec
> > > > > > > > >      8319591038295      cycles                          #    1.000 GHz
> > > > > > > > >      6276338800377      instructions                  #    0.75  insn per cycle
> > > > > > > > >       467400726106       branches                      #   56.180 M/sec
> > > > > > > > >        45986364011        branch-misses              #    9.84% of all branches
> > > > > > > >
> > > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > > > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > > > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > > > > >
> > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > > >
> > > > > > > > >    8207331.088040      task-clock (msec)   #    1.000 CPUs utilized
> > > > > > > > >               2266               context-switches    #    0.000 K/sec
> > > > > > > > >                 32                 cpu-migrations       #    0.000 K/sec
> > > > > > > > >              40846              page-faults             #    0.005 K/sec
> > > > > > > > >      8207292032467      cycles                     #   1.000 GHz
> > > > > > > > >      6035724436440      instructions             #    0.74  insn per cycle
> > > > > > > > >       364415440156       branches                 #   44.401 M/sec
> > > > > > > > >        53138327276        branch-misses         #   14.58% of all branches
> > > > > > > >
> > > > > > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > > > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > > > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > > > > > not very complicated (though Fortran array addressing...).
> > > > > > > >
> > > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > > >
> > > > > > > > >    7797265.206850      task-clock (msec)         #    1.000 CPUs utilized
> > > > > > > > >               3139              context-switches          #    0.000 K/sec
> > > > > > > > >                 20                cpu-migrations             #    0.000 K/sec
> > > > > > > > >              40846              page-faults                  #    0.005 K/sec
> > > > > > > > >      7797221351467      cycles                          #    1.000 GHz
> > > > > > > > >      6187348757324      instructions                  #    0.79  insn per cycle
> > > > > > > > >       461840800061       branches                      #   59.231 M/sec
> > > > > > > > >        26920311761        branch-misses             #    5.83% of all branches
> > > > > > > >
> > > > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > > > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > > > > > in insn count).
> > > > > > > >
> > > > > > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > > > > > consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > > > > > each iterating only 3 times).
> > > > > > > >
> > > > > > > > > Perf profiles for
> > > > > > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > >
> > > > > > > > >           3196866 |1f04:    ldur   d1, [x1, #-248]
> > > > > > > > > 216348301800│            add    w0, w0, #0x1
> > > > > > > > >             985098 |            add    x2, x2, #0x18
> > > > > > > > > 216215999206│            add    x1, x1, #0x48
> > > > > > > > > 215630376504│            fmul   d1, d5, d1
> > > > > > > > > 863829148015│            fmul   d1, d1, d6
> > > > > > > > > 864228353526│            fmul   d0, d1, d0
> > > > > > > > > 864568163014│            fmadd  d2, d0, d16, d2
> > > > > > > > >                         │             cmp    w0, #0x4
> > > > > > > > > 216125427594│          ↓ b.eq   1f34
> > > > > > > > >         15010377│             ldur   d0, [x2, #-8]
> > > > > > > > > 143753737468│          ↑ b      1f04
> > > > > > > > >
> > > > > > > > > -O2 with inlined orthonl:
> > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > >
> > > > > > > > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > > > > > > > 144055883055│            add    w0, w0, #0x1
> > > > > > > > >   72262104254│            add    x2, x2, #0x18
> > > > > > > > > 143991169721│            add    x1, x1, #0x48
> > > > > > > > > 288648917780│            fmul   d15, d17, d15
> > > > > > > > > 864665644756│            fmul   d15, d15, d18
> > > > > > > > > 863868426387│            fmul   d14, d15, d14
> > > > > > > > > 865228159813│            fmadd  d16, d14, d31, d16
> > > > > > > > >             245967│            cmp    w0, #0x4
> > > > > > > > > 215396760545│         ↓ b.eq   1f28
> > > > > > > > >       704732365│            ldur   d14, [x2, #-8]
> > > > > > > > > 143775979620│         ↑ b      1ef8
> > > > > > > >
> > > > > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > > > > >
> > > > > > > > High count on the initial ldur instruction could be explained if the loop
> > > > > > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > > > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > > > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > > > > > assembly.
> > > > > > > >
> > > > > > > > It may also be possible to check if code alignment matters, by compiling with
> > > > > > > > -falign-loops=32.
> > > > > > > Hi,
> > > > > > > Thanks a lot for the detailed feedback, and I am sorry for late response.
> > > > > > >
> > > > > > > The hoisting region is:
> > > > > > > if(mattyp.eq.1) then
> > > > > > >   4 loops
> > > > > > > elseif(mattyp.eq.2) then
> > > > > > >   {
> > > > > > >      orthonl inlined into basic block;
> > > > > > >      loads w[0] .. w[8]
> > > > > > >   }
> > > > > > > else
> > > > > > >    6 loops  // load anisox
> > > > > > >
> > > > > > > followed by basic block:
> > > > > > >
> > > > > > >  senergy=
> > > > > > >      &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > > > >      &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > > > >      &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > > > >                      s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > > > >                      s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > > > >                      s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > > > >
> > > > > > > Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block,
> > > > > > > right in block 181, which is:
> > > > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > > > >
> > > > > > > which is then further hoisted to block 173:
> > > > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > > > >
> > > > > > > From block 181, we have two paths towards senergy block (bb 194):
> > > > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > > > AND
> > > > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > > > > which has a path length of around 18 blocks.
> > > > > > > (bb 194 post-dominates bb 181 and bb 173).
> > > > > > >
> > > > > > > Disabling only load hoisting within blocks 173 and 181
> > > > > > > (simply avoid inserting pre_expr if pre_expr->kind == REFERENCE),
> > > > > > > avoid hoisting of 'w' array and brings back most of performance. Which
> > > > > > > verifies that it is hoisting of the
> > > > > > > 'w' array (w[0] ... w[8]), which is causing the slowdown ?
> > > > > > >
> > > > > > > I obtained perf profiles for full hoisting, and disabled hoisting of
> > > > > > > 'w' array for the 6 loops, and the most drastic difference was
> > > > > > > for ldur instruction:
> > > > > > >
> > > > > > > With full hoisting:
> > > > > > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > > > > >
> > > > > > > Without full hoisting:
> > > > > > > 3441224 │1edc:   ldur   d1, [x1, #-248]
> > > > > > >
> > > > > > > (The loop entry seems to be fall thru in both cases. I have attached
> > > > > > > profiles for both cases).
> > > > > > >
> > > > > > > IIUC, the instruction seems to be loading the first element from anisox array,
> > > > > > > which makes me wonder if the issue was with data-cache miss for slower version.
> > > > > > > I ran perf script on perf data for L1-dcache-load-misses with period = 1million,
> > > > > > > and it reported two cache misses on the ldur instruction in full hoisting case,
> > > > > > > while it reported zero for the disabled load hoisting case.
> > > > > > > So I wonder if the slowdown happens because hoisting of 'w' array
> > > > > > > possibly results
> > > > > > > in eviction of anisox thus causing a cache miss inside the inner loop
> > > > > > > and making load slower ?
> > > > > > >
> > > > > > > Hoisting also seems to improve the number of overall cache misses tho.
> > > > > > > For disabled hoisting  of 'w' array case, there were a total of 463
> > > > > > > cache misses, while with full hoisting there were 357 cache misses
> > > > > > > (with period = 1 million).
> > > > > > > Does that happen because hoisting probably reduces cache misses along
> > > > > > > the orthonl path (bb 173 - > bb 181 -> bb 182 -> bb 194) ?
> > > > > > Hi,
> > > > > > In general I feel for this or PR80155 case, the issues come with long
> > > > > > range hoistings, inside a large CFG, since we don't have an accurate
> > > > > > way to model target resources (register pressure in PR80155 case / or
> > > > > > possibly cache pressure in this case?) at tree level and we end up
> > > > > > with register spill or cache miss inside loops, which may offset the
> > > > > > benefit of hoisting. As previously discussed the right way to go is a
> > > > > > live range splitting pass, at GIMPLE -> RTL border which can also help
> > > > > > with other code-movement optimizations (or if the source had variables
> > > > > > with long live ranges).
> > > > > >
> > > > > > I was wondering tho as a cheap workaround, would it make sense to
> > > > > > check if we are hoisting across a "large" region of nested loops, and
> > > > > > avoid in that case since hoisting may exert resource pressure inside
> > > > > > loop region ? (Especially, in the cases where hoisted expressions were
> > > > > > not originally AVAIL in any of the loop blocks, and the loop region
> > > > > > doesn't benefit from hoisting).
> > > > > >
> > > > > > For instance:
> > > > > > FOR_EACH_EDGE (e, ei, block)
> > > > > >   {
> > > > > >     /* Avoid hoisting across more than 3 nested loops */
> > > > > >     if (e->dest is a loop pre-header or loop header
> > > > > >         && nesting depth of loop is > 3)
> > > > > >      return false;
> > > > > >   }
> > > > > >
> > > > > > I think this would work for resolving the calculix issue because it
> > > > > > hoists across one region of 6 loops and another of 4 loops (didn' test
> > > > > > yet).
> > > > > > It's not bulletproof in that it will miss detecting cases where loop
> > > > > > header (or pre-header) isn't a successor of candidate block (checking
> > > > > > for
> > > > > > that might get expensive tho?). I will test it on gcc suite and SPEC
> > > > > > for any regressions.
> > > > > > Does this sound like a reasonable heuristic ?
> > > > > Hi,
> > > > > The attached patch implements the above heuristic.
> > > > > Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> > > > > And it brings back most of performance for calculix on par with O2
> > > > > (without inlining orthonl).
> > > > > I verified that with patch there is no cache miss happening on load
> > > > > insn inside loop
> > > > > (with perf report -e L1-dcache-load-misses/period=1000000/)
> > > > >
> > > > > I am in the process of benchmarking the patch on aarch64 for SPEC for
> > > > > speed and will report numbers
> > > > > in couple of days. (If required, we could parametrize number of nested
> > > > > loops, hardcoded (arbitrarily to) 3 in this patch,
> > > > > and set it in backend to not affect other targets).
> > > >
> > > > I don't think this patch captures the case in a sensible way - it will simply
> > > > never hoist computations out of loop header blocks with depth > 3 which
> > > > is certainly not what you want.  Also the pre-header check is odd - we're
> > > > looking for computations in successors of BLOCK but clearly a computation
> > > > in a pre-header is not at the same loop level as one in the header itself.
> > > Well, my intent was to check if we are hoisting across a region,
> > > which has multiple nested loops, and in that case, avoid hoisting expressions
> > > that do not belong to any loop blocks, because that may increase
> > > resource pressure
> > > inside loops. For instance, in the calculix issue we hoist 'w' array
> > > from post-dom and neither
> > > loop region has any uses of 'w'.  I agree checking just for loop level
> > > is too coarse.
> > > The check with pre-header was essentially the same to see if we are
> > > hoisting across a loop region,
> > > not necessarily from within the loops.
> >
> > But it will fail to hoist *p in
> >
> >    if (x)
> >      {
> >         a = *p;
> >      }
> >   else
> >     {
> >        b = *p;
> >     }
> >
> > <make distance large>
> > pdom:
> >   c = *p;
> >
> > so it isn't what matters either.  What happens at the immediate post-dominator
> > isn't directly relevant - what matters would be if the pdom is the one making
> > the value antic on one of the outgoing edges.  In that case we're also going
> > to PRE *p into the arm not computing *p (albeit in a later position).  But
> > that property is impossible to compute from the sets itself (not even mentioning
> > the arbitrary CFG that can be inbetween the block and its pdom or the weird
> > pdoms we compute for regions not having a path to exit, like infinite loops).
> >
> > You could eventually look at the pdom predecessors and if *p is not AVAIL_OUT
> > in each of them we _might_ have the situation you want to protect against.
> > But as said PRE insertion will likely have made sure it _is_ AVAIL_OUT in each
> > of them ...
> Hi Richard,
> Thanks for the suggestions. Right, the issue seems to be here that
> post-dom block is making expressions ANTIC. Before doing insert, could
> we simply copy AVAIL_OUT sets of each block into another set say ORIG_AVAIL_OUT,
> as a guard against PRE eventually inserting expressions in pred blocks
> of pdom and making them available?
> And during hoisting, we could check if each expr that is ANTIC_IN
> (pdom) is ORIG_AVAIL_OUT in each pred of pdom,
> if the distance is "large".

Did you try whether it works w/o copying AVAIL_OUT?  Because AVAIL_OUT
is very large (it's actually quadratic: size of the CFG * # values).
In particular, we insert in RPO and update AVAIL_OUT only up to the
current block (from dominators), so the PDOM block should still have
the original AVAIL_OUT (though from the last iteration - we do iterate
insertion).

Note I'm still not happy with pulling off this kind of heuristic.
What the suggestion means is that for

   if (x)
     y = *p;
   z = *p;

we'll end up with

  if (x)
    y = *p;
  else
    z = *p;

instead of

   tem = *p;
   if (x)
    y = tem;
  else
    z = tem;

that is, we get the runtime benefit but not the code-size one
(hoisting mostly helps code size, plus it allows if-conversion as a
follow-up in some cases).  Now, if we iterate (e.g. if we had a second
hoisting pass) then the above would still cause hoisting - so the
heuristic isn't sustainable.  Again, something like "distance" simply
isn't available here.
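
(For reference, a minimal compilable variant of the above - a
hypothetical function, nothing from calculix:

  double y, z;

  void
  f (int x, double *p)
  {
    if (x)
      y = *p;
    z = *p;
  }

compiling with -O2 -fdump-tree-pre-details should show the partially
redundant *p load being turned into a single load placed before the
branch, i.e. the "tem = *p" form.)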

Richard.

> Thanks,
> Prathamesh
>
>
> >
> > > >
> > > > Note the difficulty to capture "distance" is that the distance is simply not
> > > > available at this point - it is the anticipated values from the successors
> > > > that do _not_ compute the value itself that are the issue.  To capture
> > > > "distance" you'd need to somehow "age" anticipated value when
> > > > propagating them upwards during compute_antic (where it then
> > > > would also apply to PRE in general).
> > > Yes, indeed.  As a hack, would it make sense to avoid inserting an
> > > expression in the block if it's ANTIC in the post-dom block - as a
> > > trade-off between extending live ranges and hoisting - when the
> > > "distance" between the block and the post-dom is "too far"?  In
> > > general, as you point out, we'd need to compute distance info for
> > > successor blocks during compute_antic, but special casing the post-dom
> > > should be easy enough during do_hoist_insertion, and hoisting an expr
> > > that is ANTIC in the post-dom could potentially be "long range" if the
> > > region is large.
> > > It's still a coarse heuristic though. I tried it in the attached patch.
> > >
> > > Thanks,
> > > Prathamesh
> > >
> > >
> > > >
> > > > As with all other heuristics the only place one could do hackish attempts
> > > > with at least some reasoning is the elimination phase where
> > > > we make use of the (hoist) insertions - of course for hoisting we already
> > > > know we'll get the "close" use in one of the successors so I fear even
> > > > there it will be impossible to do something sensible.
> > > >
> > > > Richard.
> > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > > >
> > > > > > Thanks,
> > > > > > Prathamesh
> > > > > >
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Prathamesh
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Prathamesh
> > > > > > > >
> > > > > > > > Alexander

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: LTO slows down calculix by more than 10% on aarch64
  2020-09-23  7:52                           ` Richard Biener
@ 2020-09-23 10:10                             ` Prathamesh Kulkarni
  2020-09-23 11:10                               ` Richard Biener
  0 siblings, 1 reply; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-09-23 10:10 UTC (permalink / raw)
  To: Richard Biener; +Cc: Alexander Monakov, GCC Development

[-- Attachment #1: Type: text/plain, Size: 22587 bytes --]

On Wed, 23 Sep 2020 at 13:22, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Tue, 22 Sep 2020 at 16:36, Richard Biener <richard.guenther@gmail.com> wrote:
> > >
> > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> > > <prathamesh.kulkarni@linaro.org> wrote:
> > > >
> > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > >
> > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > >
> > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > >
> > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > >
> > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > > > > > > >
> > > > > > > > > > I obtained perf stat results for the following benchmark runs:
> > > > > > > > > >
> > > > > > > > > > -O2:
> > > > > > > > > >
> > > > > > > > > >     7856832.692380      task-clock (msec)         #    1.000 CPUs utilized
> > > > > > > > > >               3758               context-switches          #    0.000 K/sec
> > > > > > > > > >                 40                 cpu-migrations             #    0.000 K/sec
> > > > > > > > > >              40847              page-faults                   #    0.005 K/sec
> > > > > > > > > >      7856782413676      cycles                           #    1.000 GHz
> > > > > > > > > >      6034510093417      instructions                   #    0.77  insn per cycle
> > > > > > > > > >       363937274287       branches                       #   46.321 M/sec
> > > > > > > > > >        48557110132       branch-misses                #   13.34% of all branches
> > > > > > > > >
> > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > > > > > > enough for this kind of code)
> > > > > > > > >
> > > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > > >
> > > > > > > > > >     8319643.114380      task-clock (msec)       #    1.000 CPUs utilized
> > > > > > > > > >               4285               context-switches         #    0.001 K/sec
> > > > > > > > > >                 28                 cpu-migrations            #    0.000 K/sec
> > > > > > > > > >              40843              page-faults                  #    0.005 K/sec
> > > > > > > > > >      8319591038295      cycles                          #    1.000 GHz
> > > > > > > > > >      6276338800377      instructions                  #    0.75  insn per cycle
> > > > > > > > > >       467400726106       branches                      #   56.180 M/sec
> > > > > > > > > >        45986364011        branch-misses              #    9.84% of all branches
> > > > > > > > >
> > > > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > > > > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > > > > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > > > > > >
> > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > > > >
> > > > > > > > > >    8207331.088040      task-clock (msec)   #    1.000 CPUs utilized
> > > > > > > > > >               2266               context-switches    #    0.000 K/sec
> > > > > > > > > >                 32                 cpu-migrations       #    0.000 K/sec
> > > > > > > > > >              40846              page-faults             #    0.005 K/sec
> > > > > > > > > >      8207292032467      cycles                     #   1.000 GHz
> > > > > > > > > >      6035724436440      instructions             #    0.74  insn per cycle
> > > > > > > > > >       364415440156       branches                 #   44.401 M/sec
> > > > > > > > > >        53138327276        branch-misses         #   14.58% of all branches
> > > > > > > > >
> > > > > > > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > > > > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > > > > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > > > > > > not very complicated (though Fortran array addressing...).
> > > > > > > > >
> > > > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > > > >
> > > > > > > > > >    7797265.206850      task-clock (msec)         #    1.000 CPUs utilized
> > > > > > > > > >               3139              context-switches          #    0.000 K/sec
> > > > > > > > > >                 20                cpu-migrations             #    0.000 K/sec
> > > > > > > > > >              40846              page-faults                  #    0.005 K/sec
> > > > > > > > > >      7797221351467      cycles                          #    1.000 GHz
> > > > > > > > > >      6187348757324      instructions                  #    0.79  insn per cycle
> > > > > > > > > >       461840800061       branches                      #   59.231 M/sec
> > > > > > > > > >        26920311761        branch-misses             #    5.83% of all branches
> > > > > > > > >
> > > > > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > > > > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > > > > > > in insn count).
> > > > > > > > >
> > > > > > > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > > > > > > consider what the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > > > > > > each iterating only 3 times).
> > > > > > > > >
> > > > > > > > > > Perf profiles for
> > > > > > > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > >
> > > > > > > > > >      3196866│ 1f04:   ldur   d1, [x1, #-248]
> > > > > > > > > > 216348301800│            add    w0, w0, #0x1
> > > > > > > > > >       985098│            add    x2, x2, #0x18
> > > > > > > > > > 216215999206│            add    x1, x1, #0x48
> > > > > > > > > > 215630376504│            fmul   d1, d5, d1
> > > > > > > > > > 863829148015│            fmul   d1, d1, d6
> > > > > > > > > > 864228353526│            fmul   d0, d1, d0
> > > > > > > > > > 864568163014│            fmadd  d2, d0, d16, d2
> > > > > > > > > >                         │             cmp    w0, #0x4
> > > > > > > > > > 216125427594│          ↓ b.eq   1f34
> > > > > > > > > >         15010377│             ldur   d0, [x2, #-8]
> > > > > > > > > > 143753737468│          ↑ b      1f04
> > > > > > > > > >
> > > > > > > > > > -O2 with inlined orthonl:
> > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > >
> > > > > > > > > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > > > > > > > > 144055883055│            add    w0, w0, #0x1
> > > > > > > > > >   72262104254│            add    x2, x2, #0x18
> > > > > > > > > > 143991169721│            add    x1, x1, #0x48
> > > > > > > > > > 288648917780│            fmul   d15, d17, d15
> > > > > > > > > > 864665644756│            fmul   d15, d15, d18
> > > > > > > > > > 863868426387│            fmul   d14, d15, d14
> > > > > > > > > > 865228159813│            fmadd  d16, d14, d31, d16
> > > > > > > > > >             245967│            cmp    w0, #0x4
> > > > > > > > > > 215396760545│         ↓ b.eq   1f28
> > > > > > > > > >       704732365│            ldur   d14, [x2, #-8]
> > > > > > > > > > 143775979620│         ↑ b      1ef8
> > > > > > > > >
> > > > > > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > > > > > >
> > > > > > > > > High count on the initial ldur instruction could be explained if the loop
> > > > > > > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > > > > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > > > > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > > > > > > assembly.
> > > > > > > > >
> > > > > > > > > It may also be possible to check if code alignment matters, by compiling with
> > > > > > > > > -falign-loops=32.
> > > > > > > > Hi,
> > > > > > > > Thanks a lot for the detailed feedback, and I am sorry for the late response.
> > > > > > > >
> > > > > > > > The hoisting region is:
> > > > > > > > if(mattyp.eq.1) then
> > > > > > > >   4 loops
> > > > > > > > elseif(mattyp.eq.2) then
> > > > > > > >   {
> > > > > > > >      orthonl inlined into basic block;
> > > > > > > >      loads w[0] .. w[8]
> > > > > > > >   }
> > > > > > > > else
> > > > > > > >    6 loops  // load anisox
> > > > > > > >
> > > > > > > > followed by basic block:
> > > > > > > >
> > > > > > > >  senergy=
> > > > > > > >      &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > > > > >      &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > > > > >      &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > > > > >                      s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > > > > >                      s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > > > > >                      s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > > > > >
> > > > > > > > Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block,
> > > > > > > > right in block 181, which is:
> > > > > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > > > > >
> > > > > > > > which is then further hoisted to block 173:
> > > > > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > > > > >
> > > > > > > > From block 181, we have two paths towards senergy block (bb 194):
> > > > > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > > > > AND
> > > > > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > > > > > which has a path length of around 18 blocks.
> > > > > > > > (bb 194 post-dominates bb 181 and bb 173).
> > > > > > > >
> > > > > > > > Disabling only load hoisting within blocks 173 and 181
> > > > > > > > (simply avoid inserting pre_expr if pre_expr->kind == REFERENCE),
> > > > > > > > avoids hoisting of the 'w' array and brings back most of the
> > > > > > > > performance, which verifies that it is the hoisting of the 'w' array
> > > > > > > > (w[0] ... w[8]) that is causing the slowdown?
> > > > > > > >
> > > > > > > > I obtained perf profiles for full hoisting, and disabled hoisting of
> > > > > > > > 'w' array for the 6 loops, and the most drastic difference was
> > > > > > > > for ldur instruction:
> > > > > > > >
> > > > > > > > With full hoisting:
> > > > > > > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > > > > > >
> > > > > > > > Without full hoisting:
> > > > > > > > 3441224 │1edc:   ldur   d1, [x1, #-248]
> > > > > > > >
> > > > > > > > (The loop entry seems to be fall thru in both cases. I have attached
> > > > > > > > profiles for both cases).
> > > > > > > >
> > > > > > > > IIUC, the instruction seems to be loading the first element from the
> > > > > > > > anisox array, which makes me wonder if the issue was a data-cache miss
> > > > > > > > in the slower version.
> > > > > > > > I ran perf script on perf data for L1-dcache-load-misses with period =
> > > > > > > > 1 million, and it reported two cache misses on the ldur instruction in
> > > > > > > > the full hoisting case, while it reported zero for the disabled load
> > > > > > > > hoisting case. So I wonder if the slowdown happens because hoisting of
> > > > > > > > the 'w' array possibly results in eviction of anisox, thus causing a
> > > > > > > > cache miss inside the inner loop and making the load slower?
> > > > > > > >
> > > > > > > > Hoisting also seems to reduce the overall number of cache misses,
> > > > > > > > though. For the case with hoisting of the 'w' array disabled, there
> > > > > > > > were a total of 463 cache misses, while with full hoisting there were
> > > > > > > > 357 cache misses (with period = 1 million).
> > > > > > > > Does that happen because hoisting probably reduces cache misses along
> > > > > > > > the orthonl path (bb 173 - > bb 181 -> bb 182 -> bb 194) ?
> > > > > > > Hi,
> > > > > > > In general I feel that for this or the PR80155 case, the issues come
> > > > > > > with long-range hoistings inside a large CFG, since we don't have an
> > > > > > > accurate way to model target resources (register pressure in the PR80155
> > > > > > > case / or possibly cache pressure in this case?) at the tree level, and
> > > > > > > we end up with register spills or cache misses inside loops, which may
> > > > > > > offset the benefit of hoisting. As previously discussed, the right way
> > > > > > > to go is a live range splitting pass at the GIMPLE -> RTL border, which
> > > > > > > can also help with other code-movement optimizations (or if the source
> > > > > > > had variables with long live ranges).
> > > > > > >
> > > > > > > I was wondering though, as a cheap workaround, would it make sense to
> > > > > > > check if we are hoisting across a "large" region of nested loops, and
> > > > > > > avoid it in that case, since hoisting may exert resource pressure
> > > > > > > inside the loop region? (Especially in the cases where the hoisted
> > > > > > > expressions were not originally AVAIL in any of the loop blocks, and
> > > > > > > the loop region doesn't benefit from hoisting.)
> > > > > > >
> > > > > > > For instance:
> > > > > > > FOR_EACH_EDGE (e, ei, block)
> > > > > > >   {
> > > > > > >     /* Avoid hoisting across more than 3 nested loops */
> > > > > > >     if (e->dest is a loop pre-header or loop header
> > > > > > >         && nesting depth of loop is > 3)
> > > > > > >      return false;
> > > > > > >   }
> > > > > > >
> > > > > > > I think this would work for resolving the calculix issue because it
> > > > > > > hoists across one region of 6 loops and another of 4 loops (didn't
> > > > > > > test yet).
> > > > > > > It's not bulletproof in that it will miss detecting cases where the
> > > > > > > loop header (or pre-header) isn't a successor of the candidate block
> > > > > > > (checking for that might get expensive though?). I will test it on
> > > > > > > the gcc suite and SPEC for any regressions.
> > > > > > > Does this sound like a reasonable heuristic?
> > > > > > Hi,
> > > > > > The attached patch implements the above heuristic.
> > > > > > Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> > > > > > And it brings back most of the performance for calculix, on par with
> > > > > > -O2 (without inlining orthonl).
> > > > > > I verified that with the patch there is no cache miss happening on the
> > > > > > load insn inside the loop
> > > > > > (with perf report -e L1-dcache-load-misses/period=1000000/).
> > > > > >
> > > > > > I am in the process of benchmarking the patch on aarch64 for SPEC speed
> > > > > > and will report numbers in a couple of days. (If required, we could
> > > > > > parametrize the number of nested loops, hardcoded (arbitrarily) to 3 in
> > > > > > this patch, and set it in the backend to not affect other targets.)
> > > > >
> > > > > I don't think this patch captures the case in a sensible way - it will simply
> > > > > never hoist computations out of loop header blocks with depth > 3, which
> > > > > is certainly not what you want.  Also the pre-header check is odd - we're
> > > > > looking for computations in successors of BLOCK but clearly a computation
> > > > > in a pre-header is not at the same loop level as one in the header itself.
> > > > Well, my intent was to check if we are hoisting across a region,
> > > > which has multiple nested loops, and in that case, avoid hoisting expressions
> > > > that do not belong to any loop blocks, because that may increase
> > > > resource pressure
> > > > inside loops. For instance, in the calculix issue we hoist the 'w' array
> > > > from the post-dom, and neither loop region has any uses of 'w'.  I agree
> > > > that checking just for the loop level is too coarse.
> > > > The check with the pre-header was essentially the same: to see whether we
> > > > are hoisting across a loop region, not necessarily from within the loops.
> > >
> > > But it will fail to hoist *p in
> > >
> > >    if (x)
> > >      {
> > >         a = *p;
> > >      }
> > >   else
> > >     {
> > >        b = *p;
> > >     }
> > >
> > > <make distance large>
> > > pdom:
> > >   c = *p;
> > >
> > > so it isn't what matters either.  What happens at the immediate post-dominator
> > > isn't directly relevant - what matters would be if the pdom is the one making
> > > the value antic on one of the outgoing edges.  In that case we're also going
> > > to PRE *p into the arm not computing *p (albeit in a later position).  But
> > > that property is impossible to compute from the sets themselves (not even mentioning
> > > the arbitrary CFG that can be in between the block and its pdom or the weird
> > > pdoms we compute for regions not having a path to exit, like infinite loops).
> > >
> > > You could eventually look at the pdom predecessors and if *p is not AVAIL_OUT
> > > in each of them we _might_ have the situation you want to protect against.
> > > But as said PRE insertion will likely have made sure it _is_ AVAIL_OUT in each
> > > of them ...
> > Hi Richard,
> > Thanks for the suggestions. Right, the issue here seems to be that the
> > post-dom block is making expressions ANTIC. Before doing insert, could we
> > simply copy the AVAIL_OUT set of each block into another set, say
> > ORIG_AVAIL_OUT, as a guard against PRE eventually inserting expressions
> > in the pred blocks of the pdom and making them available?
> > And during hoisting, if the distance is "large", we could check whether
> > each expr that is ANTIC_IN (pdom) is ORIG_AVAIL_OUT in each pred of the pdom.
>
> Did you try whether it works w/o copying AVAIL_OUT?  Because AVAIL_OUT is
> very large (it's actually quadratic in size of the CFG * # values), in
> particular
> we're inserting in RPO and update AVAIL_OUT only up to the current block
> (from dominators) so the PDOM block should have the original AVAIL_OUT
> (but from the last iteration - we do iterate insertion).
>
> Note I'm still not happy with pulling off this kind of heuristic.
> What the suggestion means is that for
>
>    if (x)
>      y = *p;
>    z = *p;
>
> we'll end up with
>
>   if (x)
>     y = *p;
>   else
>     z = *p;
>
> instead of
>
>    tem = *p;
>    if (x)
>      y = tem;
>    else
>      z = tem;
>
> that is, we get the runtime benefit but not the code-size one
> (hoisting mostly helps code size plus allows if-conversion as followup
> in some cases).  Now, if we iterate (like if we'd have a second hoisting pass)
> then the above would still cause hoisting - so the heuristic isn't sustainable.
> Again, something like "distance" isn't really available.
Hi Richard,
It doesn't work without copying AVAIL_OUT.
I tried it on a small example with the attached patch:

int foo(int cond, int x, int y)
{
  int t;
  void f(int);

  if (cond)
    t = x + y;
  else
    t = x - y;

  f (t);
  int t2 = (x + y) * 10;
  return t2;
}

By intersecting availout_in_some with AVAIL_OUT of the preds of the pdom,
it does not hoist in the first pass, but then PRE inserts x + y into the
"else" block, and we eventually hoist before if (cond). The same happens
for the e_c3d hoistings in calculix.
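
(A sketch of the effective result on foo above - illustrative C only, not
an actual GIMPLE dump; the temporary name pretmp is made up:)

  if (cond)
    t = x + y;
  else
    {
      t = x - y;
      pretmp = x + y;   /* PRE insertion for the later (x + y) * 10,
                           making x + y available on both arms.  */
    }
  f (t);
  int t2 = pretmp * 10; /* on the next insert iteration the guard sees
                           x + y available in all preds of the pdom, so
                           it gets hoisted above the branch anyway.  */
  return t2;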

IIUC, we want the hoistable set to be:
(ANTIC_IN (block) intersect AVAIL_OUT (preds of pdom)) - AVAIL_OUT (block)
to ensure that hoisted expressions are available along all paths from block
to the post-dom?
If copying the AVAIL_OUT sets is too large, could we keep another set that
precomputes the intersection of AVAIL_OUT of the pdom preds for each block,
and then use this info during hoisting?
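
(A minimal sketch of that precomputation, assuming one extra bitmap per
block; PDOM_PREDS_AVAIL is a hypothetical accessor in the style of
AVAIL_OUT, not an existing GCC macro:)

  basic_block bb;
  FOR_EACH_BB_FN (bb, cfun)
    {
      basic_block pdom = get_immediate_dominator (CDI_POST_DOMINATORS, bb);
      edge e;
      edge_iterator ei;
      bool first = true;
      FOR_EACH_EDGE (e, ei, pdom->preds)
        {
          /* Seed with the first pred, then intersect the rest.  */
          if (first)
            bitmap_copy (&PDOM_PREDS_AVAIL (bb), &AVAIL_OUT (e->src)->values);
          else
            bitmap_and_into (&PDOM_PREDS_AVAIL (bb),
                             &AVAIL_OUT (e->src)->values);
          first = false;
        }
    }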

For computing "distance", I implemented a simple DFS walk from block
to post-dom, that gives up if depth crosses
threshold before reaching post-dom. I am not sure tho, how expensive
that can get.
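
(For reference, a sketch of an iterative variant of that walk, assuming
GCC's auto_bitmap and auto_vec helpers; like the recursive version in the
attached patch it uses first-visit depths, so the answer is approximate,
but it avoids the per-query bool array:)

  static bool
  pdom_within_dist_p (basic_block bb, basic_block pdom_bb, unsigned dist)
  {
    auto_bitmap visited;
    auto_vec<std::pair<basic_block, unsigned> > worklist;
    worklist.safe_push (std::make_pair (bb, 0U));
    bitmap_set_bit (visited, bb->index);
    while (!worklist.is_empty ())
      {
        std::pair<basic_block, unsigned> p = worklist.pop ();
        if (p.first == pdom_bb)
          continue;          /* This path reached the post-dom in time.  */
        if (p.second >= dist)
          return false;      /* Gave up before reaching the post-dom.  */
        edge e;
        edge_iterator ei;
        FOR_EACH_EDGE (e, ei, p.first->succs)
          if (bitmap_set_bit (visited, e->dest->index))
            worklist.safe_push (std::make_pair (e->dest, p.second + 1));
      }
    return true;
  }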

Thanks,
Prathamesh
>
> Richard.
>
> > Thanks,
> > Prathamesh
> >
> >
> > >
> > > > >
> > > > > Note the difficulty to capture "distance" is that the distance is simply not
> > > > > available at this point - it is the anticipated values from the successors
> > > > > that do _not_ compute the value itself that are the issue.  To capture
> > > > > "distance" you'd need to somehow "age" anticipated value when
> > > > > propagating them upwards during compute_antic (where it then
> > > > > would also apply to PRE in general).
> > > > Yes, indeed.  As a hack, would it make sense to avoid inserting an
> > > > expression in the block if it's ANTIC in the post-dom block - as a
> > > > trade-off between extending live ranges and hoisting - when the
> > > > "distance" between the block and the post-dom is "too far"?  In
> > > > general, as you point out, we'd need to compute distance info for
> > > > successor blocks during compute_antic, but special casing the post-dom
> > > > should be easy enough during do_hoist_insertion, and hoisting an expr
> > > > that is ANTIC in the post-dom could potentially be "long range" if the
> > > > region is large.
> > > > It's still a coarse heuristic though. I tried it in the attached patch.
> > > >
> > > > Thanks,
> > > > Prathamesh
> > > >
> > > >
> > > > >
> > > > > As with all other heuristics the only place one could do hackish attempts
> > > > > with at least some reasoning is the elimination phase where
> > > > > we make use of the (hoist) insertions - of course for hoisting we already
> > > > > know we'll get the "close" use in one of the successors so I fear even
> > > > > there it will be impossible to do something sensible.
> > > > >
> > > > > Richard.
> > > > >
> > > > > > Thanks,
> > > > > > Prathamesh
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Prathamesh
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Prathamesh
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Prathamesh
> > > > > > > > >
> > > > > > > > > Alexander

[-- Attachment #2: gnu-659-pdom-4.diff --]
[-- Type: application/octet-stream, Size: 2584 bytes --]

diff --git a/gcc/tree-ssa-pre.c b/gcc/tree-ssa-pre.c
index 0c1654f3580..8cee5707e7d 100644
--- a/gcc/tree-ssa-pre.c
+++ b/gcc/tree-ssa-pre.c
@@ -3477,6 +3477,43 @@ do_pre_partial_partial_insertion (basic_block block, basic_block dom)
   return new_stuff;
 }
 
+/* Return true if PDOM_BB is within DIST_LIMIT of BLOCK,
+   where "distance" is measured in terms of number of basic blocks.  */
+
+static bool
+pdom_within_dist_p_1 (basic_block block, basic_block pdom_bb,
+		      bool *visited, unsigned dist_limit,
+		      unsigned dist_from_block)
+{
+  if (dist_from_block >= dist_limit)
+    return false;
+
+  if (block == pdom_bb)
+    return true;
+
+  edge e;
+  edge_iterator ei;
+  visited[block->index] = true;
+
+  FOR_EACH_EDGE (e, ei, block->succs)
+    if (!visited[e->dest->index]
+	&& !pdom_within_dist_p_1 (e->dest, pdom_bb, visited,
+				  dist_limit, dist_from_block + 1))
+      return false;
+  return true; 
+}
+
+static bool
+pdom_within_dist_p (basic_block bb, basic_block pdom_bb, unsigned dist)
+{
+  unsigned n_bbs = n_basic_blocks_for_fn (cfun); 
+  bool *visited = new bool[n_bbs]; 
+  memset (visited, false, n_bbs); 
+  bool ret = pdom_within_dist_p_1 (bb, pdom_bb, visited, dist, 0);
+  delete[] visited;
+  return ret;
+}
+
 /* Insert expressions in BLOCK to compute hoistable values up.
    Return TRUE if something was inserted, otherwise return FALSE.
    The caller has to make sure that BLOCK has at least two successors.  */
@@ -3537,6 +3574,14 @@ do_hoist_insertion (basic_block block)
 			   &AVAIL_OUT (e->dest)->values);
   bitmap_clear (&hoistable_set.values);
 
+  /* Intersect with AVAIL_OUT of preds of post-dom, to check that
+     hoisted exprs are along all paths from block to pdom.  */
+
+  basic_block pdom_bb = get_immediate_dominator (CDI_POST_DOMINATORS, block);
+  if (!pdom_within_dist_p (block, pdom_bb, 0))
+    FOR_EACH_EDGE (e, ei, pdom_bb->preds)
+      bitmap_and_into (&availout_in_some, &AVAIL_OUT (e->src)->values);
+
   /* Short-cut for a common case: availout_in_some is empty.  */
   if (bitmap_empty_p (&availout_in_some))
     return false;
@@ -4099,6 +4144,7 @@ init_pre (void)
   alloc_aux_for_blocks (sizeof (struct bb_bitmap_sets));
 
   calculate_dominance_info (CDI_DOMINATORS);
+  calculate_dominance_info (CDI_POST_DOMINATORS);
 
   bitmap_obstack_initialize (&grand_bitmap_obstack);
   phi_translate_table = new hash_table<expr_pred_trans_d> (5110);
@@ -4131,6 +4177,7 @@ fini_pre ()
   name_to_id.release ();
 
   free_aux_for_blocks ();
+  free_dominance_info (CDI_POST_DOMINATORS);
 }
 
 namespace {

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: LTO slows down calculix by more than 10% on aarch64
  2020-09-23 10:10                             ` Prathamesh Kulkarni
@ 2020-09-23 11:10                               ` Richard Biener
  2020-09-24 10:36                                 ` Prathamesh Kulkarni
  0 siblings, 1 reply; 25+ messages in thread
From: Richard Biener @ 2020-09-23 11:10 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: Alexander Monakov, GCC Development

On Wed, Sep 23, 2020 at 12:11 PM Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Wed, 23 Sep 2020 at 13:22, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> > On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> > >
> > > On Tue, 22 Sep 2020 at 16:36, Richard Biener <richard.guenther@gmail.com> wrote:
> > > >
> > > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > >
> > > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > >
> > > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > >
> > > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > >
> > > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > >
> > > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > > > > > > > >
> > > > > > > > > > > I obtained perf stat results for the following benchmark runs:
> > > > > > > > > > >
> > > > > > > > > > > -O2:
> > > > > > > > > > >
> > > > > > > > > > >     7856832.692380      task-clock (msec)         #    1.000 CPUs utilized
> > > > > > > > > > >               3758               context-switches          #    0.000 K/sec
> > > > > > > > > > >                 40                 cpu-migrations             #    0.000 K/sec
> > > > > > > > > > >              40847              page-faults                   #    0.005 K/sec
> > > > > > > > > > >      7856782413676      cycles                           #    1.000 GHz
> > > > > > > > > > >      6034510093417      instructions                   #    0.77  insn per cycle
> > > > > > > > > > >       363937274287       branches                       #   46.321 M/sec
> > > > > > > > > > >        48557110132       branch-misses                #   13.34% of all branches
> > > > > > > > > >
> > > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > > > > > > > enough for this kind of code)
> > > > > > > > > >
> > > > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > > > >
> > > > > > > > > > >     8319643.114380      task-clock (msec)       #    1.000 CPUs utilized
> > > > > > > > > > >               4285               context-switches         #    0.001 K/sec
> > > > > > > > > > >                 28                 cpu-migrations            #    0.000 K/sec
> > > > > > > > > > >              40843              page-faults                  #    0.005 K/sec
> > > > > > > > > > >      8319591038295      cycles                          #    1.000 GHz
> > > > > > > > > > >      6276338800377      instructions                  #    0.75  insn per cycle
> > > > > > > > > > >       467400726106       branches                      #   56.180 M/sec
> > > > > > > > > > >        45986364011        branch-misses              #    9.84% of all branches
> > > > > > > > > >
> > > > > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > > > > > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > > > > > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > > > > > > >
> > > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > > > > >
> > > > > > > > > > >    8207331.088040      task-clock (msec)   #    1.000 CPUs utilized
> > > > > > > > > > >               2266               context-switches    #    0.000 K/sec
> > > > > > > > > > >                 32                 cpu-migrations       #    0.000 K/sec
> > > > > > > > > > >              40846              page-faults             #    0.005 K/sec
> > > > > > > > > > >      8207292032467      cycles                     #   1.000 GHz
> > > > > > > > > > >      6035724436440      instructions             #    0.74  insn per cycle
> > > > > > > > > > >       364415440156       branches                 #   44.401 M/sec
> > > > > > > > > > >        53138327276        branch-misses         #   14.58% of all branches
> > > > > > > > > >
> > > > > > > > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > > > > > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > > > > > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > > > > > > > not very complicated (though Fortran array addressing...).
> > > > > > > > > >
> > > > > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > > > > >
> > > > > > > > > > >    7797265.206850      task-clock (msec)         #    1.000 CPUs utilized
> > > > > > > > > > >               3139              context-switches          #    0.000 K/sec
> > > > > > > > > > >                 20                cpu-migrations             #    0.000 K/sec
> > > > > > > > > > >              40846              page-faults                  #    0.005 K/sec
> > > > > > > > > > >      7797221351467      cycles                          #    1.000 GHz
> > > > > > > > > > >      6187348757324      instructions                  #    0.79  insn per cycle
> > > > > > > > > > >       461840800061       branches                      #   59.231 M/sec
> > > > > > > > > > >        26920311761        branch-misses             #    5.83% of all branches
> > > > > > > > > >
> > > > > > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > > > > > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > > > > > > > in insn count).
> > > > > > > > > >
> > > > > > > > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > > > > > > > consider what the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > > > > > > > each iterating only 3 times).
> > > > > > > > > >
> > > > > > > > > > > Perf profiles for
> > > > > > > > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > >
> > > > > > > > > > >      3196866│ 1f04:   ldur   d1, [x1, #-248]
> > > > > > > > > > > 216348301800│            add    w0, w0, #0x1
> > > > > > > > > > >       985098│            add    x2, x2, #0x18
> > > > > > > > > > > 216215999206│            add    x1, x1, #0x48
> > > > > > > > > > > 215630376504│            fmul   d1, d5, d1
> > > > > > > > > > > 863829148015│            fmul   d1, d1, d6
> > > > > > > > > > > 864228353526│            fmul   d0, d1, d0
> > > > > > > > > > > 864568163014│            fmadd  d2, d0, d16, d2
> > > > > > > > > > >                         │             cmp    w0, #0x4
> > > > > > > > > > > 216125427594│          ↓ b.eq   1f34
> > > > > > > > > > >         15010377│             ldur   d0, [x2, #-8]
> > > > > > > > > > > 143753737468│          ↑ b      1f04
> > > > > > > > > > >
> > > > > > > > > > > -O2 with inlined orthonl:
> > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > >
> > > > > > > > > > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > > > > > > > > > 144055883055│            add    w0, w0, #0x1
> > > > > > > > > > >   72262104254│            add    x2, x2, #0x18
> > > > > > > > > > > 143991169721│            add    x1, x1, #0x48
> > > > > > > > > > > 288648917780│            fmul   d15, d17, d15
> > > > > > > > > > > 864665644756│            fmul   d15, d15, d18
> > > > > > > > > > > 863868426387│            fmul   d14, d15, d14
> > > > > > > > > > > 865228159813│            fmadd  d16, d14, d31, d16
> > > > > > > > > > >             245967│            cmp    w0, #0x4
> > > > > > > > > > > 215396760545│         ↓ b.eq   1f28
> > > > > > > > > > >       704732365│            ldur   d14, [x2, #-8]
> > > > > > > > > > > 143775979620│         ↑ b      1ef8
> > > > > > > > > >
> > > > > > > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > > > > > > >
> > > > > > > > > > High count on the initial ldur instruction could be explained if the loop
> > > > > > > > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > > > > > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > > > > > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > > > > > > > assembly.
> > > > > > > > > >
> > > > > > > > > > It may also be possible to check if code alignment matters, by compiling with
> > > > > > > > > > -falign-loops=32.
> > > > > > > > > Hi,
> > > > > > > > > Thanks a lot for the detailed feedback, and I am sorry for the late response.
> > > > > > > > >
> > > > > > > > > The hoisting region is:
> > > > > > > > > if(mattyp.eq.1) then
> > > > > > > > >   4 loops
> > > > > > > > > elseif(mattyp.eq.2) then
> > > > > > > > >   {
> > > > > > > > >      orthonl inlined into basic block;
> > > > > > > > >      loads w[0] .. w[8]
> > > > > > > > >   }
> > > > > > > > > else
> > > > > > > > >    6 loops  // load anisox
> > > > > > > > >
> > > > > > > > > followed by basic block:
> > > > > > > > >
> > > > > > > > >  senergy=
> > > > > > > > >      &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > > > > > >      &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > > > > > >      &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > > > > > >                      s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > > > > > >                      s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > > > > > >                      s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > > > > > >
> > > > > > > > > Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block,
> > > > > > > > > right in block 181, which is:
> > > > > > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > > > > > >
> > > > > > > > > which is then further hoisted to block 173:
> > > > > > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > > > > > >
> > > > > > > > > From block 181, we have two paths towards senergy block (bb 194):
> > > > > > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > > > > > AND
> > > > > > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > > > > > > which has a path length of around 18 blocks.
> > > > > > > > > (bb 194 post-dominates bb 181 and bb 173).
> > > > > > > > >
> > > > > > > > > Disabling only load hoisting within blocks 173 and 181
> > > > > > > > > (simply avoid inserting pre_expr if pre_expr->kind == REFERENCE),
> > > > > > > > > avoids hoisting of the 'w' array and brings back most of the
> > > > > > > > > performance, which verifies that it is the hoisting of the 'w' array
> > > > > > > > > (w[0] ... w[8]) that is causing the slowdown?
> > > > > > > > >
> > > > > > > > > I obtained perf profiles for full hoisting, and disabled hoisting of
> > > > > > > > > 'w' array for the 6 loops, and the most drastic difference was
> > > > > > > > > for ldur instruction:
> > > > > > > > >
> > > > > > > > > With full hoisting:
> > > > > > > > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > > > > > > >
> > > > > > > > > Without full hoisting:
> > > > > > > > > 3441224 │1edc:   ldur   d1, [x1, #-248]
> > > > > > > > >
> > > > > > > > > (The loop entry seems to be fall thru in both cases. I have attached
> > > > > > > > > profiles for both cases).
> > > > > > > > >
> > > > > > > > > IIUC, the instruction seems to be loading the first element from the
> > > > > > > > > anisox array, which makes me wonder if the issue was a data-cache miss
> > > > > > > > > in the slower version.
> > > > > > > > > I ran perf script on perf data for L1-dcache-load-misses with period =
> > > > > > > > > 1 million, and it reported two cache misses on the ldur instruction in
> > > > > > > > > the full hoisting case, while it reported zero for the disabled load
> > > > > > > > > hoisting case. So I wonder if the slowdown happens because hoisting of
> > > > > > > > > the 'w' array possibly results in eviction of anisox, thus causing a
> > > > > > > > > cache miss inside the inner loop and making the load slower?
> > > > > > > > >
> > > > > > > > > Hoisting also seems to reduce the overall number of cache misses,
> > > > > > > > > though. For the case with hoisting of the 'w' array disabled, there
> > > > > > > > > were a total of 463 cache misses, while with full hoisting there were
> > > > > > > > > 357 cache misses (with period = 1 million).
> > > > > > > > > Does that happen because hoisting probably reduces cache misses along
> > > > > > > > > the orthonl path (bb 173 - > bb 181 -> bb 182 -> bb 194) ?
> > > > > > > > Hi,
> > > > > > > > In general I feel that for this or the PR80155 case, the issues come
> > > > > > > > with long-range hoistings inside a large CFG, since we don't have an
> > > > > > > > accurate way to model target resources (register pressure in the PR80155
> > > > > > > > case / or possibly cache pressure in this case?) at the tree level, and
> > > > > > > > we end up with register spills or cache misses inside loops, which may
> > > > > > > > offset the benefit of hoisting. As previously discussed, the right way
> > > > > > > > to go is a live range splitting pass at the GIMPLE -> RTL border, which
> > > > > > > > can also help with other code-movement optimizations (or if the source
> > > > > > > > had variables with long live ranges).
> > > > > > > >
> > > > > > > > I was wondering though, as a cheap workaround, would it make sense to
> > > > > > > > check if we are hoisting across a "large" region of nested loops, and
> > > > > > > > avoid it in that case, since hoisting may exert resource pressure
> > > > > > > > inside the loop region? (Especially in the cases where the hoisted
> > > > > > > > expressions were not originally AVAIL in any of the loop blocks, and
> > > > > > > > the loop region doesn't benefit from hoisting.)
> > > > > > > >
> > > > > > > > For instance:
> > > > > > > > FOR_EACH_EDGE (e, ei, block)
> > > > > > > >   {
> > > > > > > >     /* Avoid hoisting across more than 3 nested loops */
> > > > > > > >     if (e->dest is a loop pre-header or loop header
> > > > > > > >         && nesting depth of loop is > 3)
> > > > > > > >      return false;
> > > > > > > >   }
> > > > > > > >
> > > > > > > > I think this would work for resolving the calculix issue because it
> > > > > > > > hoists across one region of 6 loops and another of 4 loops (didn't
> > > > > > > > test yet).
> > > > > > > > It's not bulletproof in that it will miss detecting cases where the
> > > > > > > > loop header (or pre-header) isn't a successor of the candidate block
> > > > > > > > (checking for that might get expensive though?). I will test it on
> > > > > > > > the gcc suite and SPEC for any regressions.
> > > > > > > > Does this sound like a reasonable heuristic?
> > > > > > > Hi,
> > > > > > > The attached patch implements the above heuristic.
> > > > > > > Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> > > > > > > And it brings back most of the performance for calculix, on par with
> > > > > > > -O2 (without inlining orthonl).
> > > > > > > I verified that with the patch there is no cache miss happening on the
> > > > > > > load insn inside the loop
> > > > > > > (with perf report -e L1-dcache-load-misses/period=1000000/).
> > > > > > >
> > > > > > > I am in the process of benchmarking the patch on aarch64 for SPEC speed
> > > > > > > and will report numbers in a couple of days. (If required, we could
> > > > > > > parametrize the number of nested loops, hardcoded (arbitrarily) to 3 in
> > > > > > > this patch, and set it in the backend to not affect other targets.)
> > > > > >
> > > > > > I don't think this patch captures the case in a sensible way - it will simply
> > > > > > never hoist computations out of loop header blocks with depth > 3, which
> > > > > > is certainly not what you want.  Also the pre-header check is odd - we're
> > > > > > looking for computations in successors of BLOCK but clearly a computation
> > > > > > in a pre-header is not at the same loop level as one in the header itself.
> > > > > Well, my intent was to check if we are hoisting across a region,
> > > > > which has multiple nested loops, and in that case, avoid hoisting expressions
> > > > > that do not belong to any loop blocks, because that may increase
> > > > > resource pressure
> > > > > inside loops. For instance, in the calculix issue we hoist the 'w' array
> > > > > from the post-dom, and neither loop region has any uses of 'w'.  I agree
> > > > > that checking just for the loop level is too coarse.
> > > > > The check with the pre-header was essentially the same: to see whether we
> > > > > are hoisting across a loop region, not necessarily from within the loops.
> > > >
> > > > But it will fail to hoist *p in
> > > >
> > > >    if (x)
> > > >      {
> > > >         a = *p;
> > > >      }
> > > >   else
> > > >     {
> > > >        b = *p;
> > > >     }
> > > >
> > > > <make distance large>
> > > > pdom:
> > > >   c = *p;
> > > >
> > > > so it isn't what matters either.  What happens at the immediate post-dominator
> > > > isn't directly relevant - what matters would be if the pdom is the one making
> > > > the value antic on one of the outgoing edges.  In that case we're also going
> > > > to PRE *p into the arm not computing *p (albeit in a later position).  But
> > > > that property is impossible to compute from the sets themselves (not even mentioning
> > > > the arbitrary CFG that can be in between the block and its pdom or the weird
> > > > pdoms we compute for regions not having a path to exit, like infinite loops).
> > > >
> > > > You could eventually look at the pdom predecessors and if *p is not AVAIL_OUT
> > > > in each of them we _might_ have the situation you want to protect against.
> > > > But as said PRE insertion will likely have made sure it _is_ AVAIL_OUT in each
> > > > of them ...
> > > Hi Richard,
> > > Thanks for the suggestions. Right, the issue here seems to be that the
> > > post-dom block is making expressions ANTIC. Before doing insert, could we
> > > simply copy the AVAIL_OUT set of each block into another set, say
> > > ORIG_AVAIL_OUT, as a guard against PRE eventually inserting expressions
> > > in the pred blocks of the pdom and making them available?
> > > And during hoisting, if the distance is "large", we could check whether
> > > each expr that is ANTIC_IN (pdom) is ORIG_AVAIL_OUT in each pred of the pdom.
> >
> > Did you try whether it works w/o copying AVAIL_OUT?  Because AVAIL_OUT is
> > very large (it's actually quadratic in size of the CFG * # values), in
> > particular
> > we're inserting in RPO and update AVAIL_OUT only up to the current block
> > (from dominators) so the PDOM block should have the original AVAIL_OUT
> > (but from the last iteration - we do iterate insertion).
> >
> > Note I'm still not happy with pulling off this kind of heuristic.
> > What the suggestion means is that for
> >
> >    if (x)
> >      y = *p;
> >    z = *p;
> >
> > we'll end up with
> >
> >   if (x)
> >     y = *p;
> >   else
> >     z = *p;
> >
> > instead of
> >
> >    tem = *p;
> >    if (x)
> >      y = tem;
> >    else
> >      z = tem;
> >
> > that is, we get the runtime benefit but not the code-size one
> > (hoisting mostly helps code size plus allows if-conversion as followup
> > in some cases).  Now, if we iterate (like if we'd have a second hoisting pass)
> > then the above would still cause hoisting - so the heuristic isn't sustainable.
> > Again, something like "distance" isn't really available.
> Hi Richard,
> It doesn't work without copying AVAIL_OUT.
> I tried it on a small example with the attached patch:
>
> int foo(int cond, int x, int y)
> {
>   int t;
>   void f(int);
>
>   if (cond)
>     t = x + y;
>   else
>     t = x - y;
>
>   f (t);
>   int t2 = (x + y) * 10;
>   return t2;
> }
>
> By intersecting availout_in_some with AVAIL_OUT of the preds of the pdom,
> it does not hoist in the first pass, but then PRE inserts x + y into the
> "else" block, and we eventually hoist before if (cond). The same happens
> for the e_c3d hoistings in calculix.
>
> IIUC, we want the hoistable set to be:
> (ANTIC_IN (block) intersect AVAIL_OUT (preds of pdom)) - AVAIL_OUT (block)
> to ensure that hoisted expressions are available along all paths from block
> to the post-dom?
> If copying the AVAIL_OUT sets is too large, could we keep another set that
> precomputes the intersection of AVAIL_OUT of the pdom preds for each block,
> and then use this info during hoisting?
>
> For computing "distance", I implemented a simple DFS walk from block
> to post-dom, that gives up if depth crosses
> threshold before reaching post-dom. I am not sure tho, how expensive
> that can get.

As written it is quadratic in the CFG size.

You can optimize away the

+    FOR_EACH_EDGE (e, ei, pdom_bb->preds)
+      bitmap_and_into (&availout_in_some, &AVAIL_OUT (e->src)->values);

loop if the intersection of availout_in_some and ANTIC_IN (pdom) is empty.
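
(A minimal sketch of that shortcut on top of the posted patch, reusing its
names; bitmap_intersect_p is the gcc/bitmap.h predicate, and checking it
first also skips the pdom walk when nothing is anticipated there:)

  basic_block pdom_bb = get_immediate_dominator (CDI_POST_DOMINATORS, block);
  if (bitmap_intersect_p (&availout_in_some, &ANTIC_IN (pdom_bb)->values)
      && !pdom_within_dist_p (block, pdom_bb, 0))
    FOR_EACH_EDGE (e, ei, pdom_bb->preds)
      bitmap_and_into (&availout_in_some, &AVAIL_OUT (e->src)->values);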

As said, I don't think this is the way to go - trying to avoid code
hoisting isn't what we'd want to do.  Your quoted assembly instead points
to a loop with a non-empty latch, which is usually caused by PRE and
avoided with -O3 because it also harms vectorization.
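
(An illustration of the "non-empty latch" shape - not the calculix code,
just the structure of the quoted assembly, where the next iteration's load
sits after the exit test, on the back edge:)

  double f (const double *a, double k)
  {
    double sum = 0, d = a[0];
    for (int n = 1;;)
      {
        sum += k * d;       /* body uses the previously loaded value */
        if (n == 4)
          break;            /* exit test (the b.eq) */
        d = a[n++];         /* non-empty latch: the trailing ldur */
      }
    return sum;
  }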

Richard.

> Thanks,
> Prathamesh
> >
> > Richard.
> >
> > > Thanks,
> > > Prathamesh
> > >
> > >
> > > >
> > > > > >
> > > > > > Note the difficulty to capture "distance" is that the distance is simply not
> > > > > > available at this point - it is the anticipated values from the successors
> > > > > > that do _not_ compute the value itself that are the issue.  To capture
> > > > > > "distance" you'd need to somehow "age" anticipated value when
> > > > > > propagating them upwards during compute_antic (where it then
> > > > > > would also apply to PRE in general).
> > > > > Yes, indeed.  As a hack, would it make sense to avoid inserting an
> > > > > expression in the block if it's ANTIC in the post-dom block - as a
> > > > > trade-off between extending live ranges and hoisting - when the
> > > > > "distance" between the block and the post-dom is "too far"?  In
> > > > > general, as you point out, we'd need to compute distance info for
> > > > > successor blocks during compute_antic, but special casing the post-dom
> > > > > should be easy enough during do_hoist_insertion, and hoisting an expr
> > > > > that is ANTIC in the post-dom could potentially be "long range" if the
> > > > > region is large.
> > > > > It's still a coarse heuristic though. I tried it in the attached patch.
> > > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > >
> > > > >
> > > > > >
> > > > > > As with all other heuristics the only place one could do hackish attempts
> > > > > > with at least some reasoning is the elimination phase where
> > > > > > we make use of the (hoist) insertions - of course for hoisting we already
> > > > > > know we'll get the "close" use in one of the successors so I fear even
> > > > > > there it will be impossible to do something sensible.
> > > > > >
> > > > > > Richard.
> > > > > >
> > > > > > > Thanks,
> > > > > > > Prathamesh
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Prathamesh
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Prathamesh
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Prathamesh
> > > > > > > > > >
> > > > > > > > > > Alexander

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: LTO slows down calculix by more than 10% on aarch64
  2020-09-23 11:10                               ` Richard Biener
@ 2020-09-24 10:36                                 ` Prathamesh Kulkarni
  2020-09-24 11:14                                   ` Richard Biener
  0 siblings, 1 reply; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-09-24 10:36 UTC (permalink / raw)
  To: Richard Biener; +Cc: Alexander Monakov, GCC Development

[-- Attachment #1: Type: text/plain, Size: 26427 bytes --]

On Wed, 23 Sep 2020 at 16:40, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Wed, Sep 23, 2020 at 12:11 PM Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Wed, 23 Sep 2020 at 13:22, Richard Biener <richard.guenther@gmail.com> wrote:
> > >
> > > On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni
> > > <prathamesh.kulkarni@linaro.org> wrote:
> > > >
> > > > On Tue, 22 Sep 2020 at 16:36, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > >
> > > > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > >
> > > > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > > >
> > > > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > >
> > > > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > I obtained perf stat results for the following benchmark runs:
> > > > > > > > > > > >
> > > > > > > > > > > > -O2:
> > > > > > > > > > > >
> > > > > > > > > > > >     7856832.692380      task-clock (msec)         #    1.000 CPUs utilized
> > > > > > > > > > > >               3758               context-switches          #    0.000 K/sec
> > > > > > > > > > > >                 40                 cpu-migrations             #    0.000 K/sec
> > > > > > > > > > > >              40847              page-faults                   #    0.005 K/sec
> > > > > > > > > > > >      7856782413676      cycles                           #    1.000 GHz
> > > > > > > > > > > >      6034510093417      instructions                   #    0.77  insn per cycle
> > > > > > > > > > > >       363937274287       branches                       #   46.321 M/sec
> > > > > > > > > > > >        48557110132       branch-misses                #   13.34% of all branches
> > > > > > > > > > >
> > > > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > > > > > > > > enough for this kind of code)
> > > > > > > > > > >
> > > > > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > > > > >
> > > > > > > > > > > >     8319643.114380      task-clock (msec)       #    1.000 CPUs utilized
> > > > > > > > > > > >               4285               context-switches         #    0.001 K/sec
> > > > > > > > > > > >                 28                 cpu-migrations            #    0.000 K/sec
> > > > > > > > > > > >              40843              page-faults                  #    0.005 K/sec
> > > > > > > > > > > >      8319591038295      cycles                          #    1.000 GHz
> > > > > > > > > > > >      6276338800377      instructions                  #    0.75  insn per cycle
> > > > > > > > > > > >       467400726106       branches                      #   56.180 M/sec
> > > > > > > > > > > >        45986364011        branch-misses              #    9.84% of all branches
> > > > > > > > > > >
> > > > > > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > > > > > > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > > > > > > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > > > > > > > >
> > > > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > > > > > >
> > > > > > > > > > > >    8207331.088040      task-clock (msec)   #    1.000 CPUs utilized
> > > > > > > > > > > >               2266               context-switches    #    0.000 K/sec
> > > > > > > > > > > >                 32                 cpu-migrations       #    0.000 K/sec
> > > > > > > > > > > >              40846              page-faults             #    0.005 K/sec
> > > > > > > > > > > >      8207292032467      cycles                     #   1.000 GHz
> > > > > > > > > > > >      6035724436440      instructions             #    0.74  insn per cycle
> > > > > > > > > > > >       364415440156       branches                 #   44.401 M/sec
> > > > > > > > > > > >        53138327276        branch-misses         #   14.58% of all branches
> > > > > > > > > > >
> > > > > > > > > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > > > > > > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > > > > > > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > > > > > > > > not very complicated (though Fortran array addressing...).
> > > > > > > > > > >
> > > > > > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > > > > > >
> > > > > > > > > > > >    7797265.206850      task-clock (msec)         #    1.000 CPUs utilized
> > > > > > > > > > > >               3139              context-switches          #    0.000 K/sec
> > > > > > > > > > > >                 20                cpu-migrations             #    0.000 K/sec
> > > > > > > > > > > >              40846              page-faults                  #    0.005 K/sec
> > > > > > > > > > > >      7797221351467      cycles                          #    1.000 GHz
> > > > > > > > > > > >      6187348757324      instructions                  #    0.79  insn per cycle
> > > > > > > > > > > >       461840800061       branches                      #   59.231 M/sec
> > > > > > > > > > > >        26920311761        branch-misses             #    5.83% of all branches
> > > > > > > > > > >
> > > > > > > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > > > > > > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > > > > > > > > in insn count).
> > > > > > > > > > >
> > > > > > > > > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > > > > > > > > consider what the profile for the entire loop nest looks like (it's 6 loops
> > > > > > > > > > > deep, each iterating only 3 times).
> > > > > > > > > > >
> > > > > > > > > > > > Perf profiles for
> > > > > > > > > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > > >
> > > > > > > > > > > >           3196866 |1f04:    ldur   d1, [x1, #-248]
> > > > > > > > > > > > 216348301800│            add    w0, w0, #0x1
> > > > > > > > > > > >             985098 |            add    x2, x2, #0x18
> > > > > > > > > > > > 216215999206│            add    x1, x1, #0x48
> > > > > > > > > > > > 215630376504│            fmul   d1, d5, d1
> > > > > > > > > > > > 863829148015│            fmul   d1, d1, d6
> > > > > > > > > > > > 864228353526│            fmul   d0, d1, d0
> > > > > > > > > > > > 864568163014│            fmadd  d2, d0, d16, d2
> > > > > > > > > > > >                         │             cmp    w0, #0x4
> > > > > > > > > > > > 216125427594│          ↓ b.eq   1f34
> > > > > > > > > > > >         15010377│             ldur   d0, [x2, #-8]
> > > > > > > > > > > > 143753737468│          ↑ b      1f04
> > > > > > > > > > > >
> > > > > > > > > > > > -O2 with inlined orthonl:
> > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > > >
> > > > > > > > > > > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > > > > > > > > > > 144055883055│            add    w0, w0, #0x1
> > > > > > > > > > > >   72262104254│            add    x2, x2, #0x18
> > > > > > > > > > > > 143991169721│            add    x1, x1, #0x48
> > > > > > > > > > > > 288648917780│            fmul   d15, d17, d15
> > > > > > > > > > > > 864665644756│            fmul   d15, d15, d18
> > > > > > > > > > > > 863868426387│            fmul   d14, d15, d14
> > > > > > > > > > > > 865228159813│            fmadd  d16, d14, d31, d16
> > > > > > > > > > > >             245967│            cmp    w0, #0x4
> > > > > > > > > > > > 215396760545│         ↓ b.eq   1f28
> > > > > > > > > > > >       704732365│            ldur   d14, [x2, #-8]
> > > > > > > > > > > > 143775979620│         ↑ b      1ef8
> > > > > > > > > > >
> > > > > > > > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > > > > > > > >
> > > > > > > > > > > High count on the initial ldur instruction could be explained if the loop
> > > > > > > > > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > > > > > > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > > > > > > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > > > > > > > > assembly.
> > > > > > > > > > >
> > > > > > > > > > > It may also be possible to check if code alignment matters, by compiling with
> > > > > > > > > > > -falign-loops=32.
> > > > > > > > > > Hi,
> > > > > > > > > > Thanks a lot for the detailed feedback, and I am sorry for the late response.
> > > > > > > > > >
> > > > > > > > > > The hoisting region is:
> > > > > > > > > > if(mattyp.eq.1) then
> > > > > > > > > >   4 loops
> > > > > > > > > > elseif(mattyp.eq.2) then
> > > > > > > > > >   {
> > > > > > > > > >      orthonl inlined into basic block;
> > > > > > > > > >      loads w[0] .. w[8]
> > > > > > > > > >   }
> > > > > > > > > > else
> > > > > > > > > >    6 loops  // load anisox
> > > > > > > > > >
> > > > > > > > > > followed by basic block:
> > > > > > > > > >
> > > > > > > > > >  senergy=
> > > > > > > > > >      &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > > > > > > >      &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > > > > > > >      &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > > > > > > >                      s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > > > > > > >                      s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > > > > > > >                      s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > > > > > > >
> > > > > > > > > > Hoisting hoists the loads w[0] .. w[8] from the orthonl and senergy
> > > > > > > > > > blocks into block 181, which is:
> > > > > > > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > > > > > > >
> > > > > > > > > > and they are then further hoisted to block 173:
> > > > > > > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > > > > > > >
> > > > > > > > > > From block 181, we have two paths towards the senergy block (bb 194):
> > > > > > > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > > > > > > AND
> > > > > > > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > > > > > > > which has a path length of around 18 blocks.
> > > > > > > > > > (bb 194 post-dominates bb 181 and bb 173).
> > > > > > > > > >
> > > > > > > > > > Disabling only load hoisting within blocks 173 and 181
> > > > > > > > > > (simply avoiding insertion of a pre_expr if pre_expr->kind == REFERENCE)
> > > > > > > > > > avoids hoisting of the 'w' array and brings back most of the
> > > > > > > > > > performance. Does that verify that it is the hoisting of the 'w'
> > > > > > > > > > array (w[0] ... w[8]) that is causing the slowdown?
> > > > > > > > > >
> > > > > > > > > > I obtained perf profiles with full hoisting, and with hoisting of the
> > > > > > > > > > 'w' array disabled for the 6 loops; the most drastic difference was
> > > > > > > > > > for the ldur instruction:
> > > > > > > > > >
> > > > > > > > > > With full hoisting:
> > > > > > > > > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > > > > > > > >
> > > > > > > > > > Without full hoisting:
> > > > > > > > > > 3441224 │1edc:   ldur   d1, [x1, #-248]
> > > > > > > > > >
> > > > > > > > > > (The loop entry seems to be fall thru in both cases. I have attached
> > > > > > > > > > profiles for both cases).
> > > > > > > > > >
> > > > > > > > > > IIUC, the instruction seems to be loading the first element from the
> > > > > > > > > > anisox array, which makes me wonder if the issue is a data-cache miss
> > > > > > > > > > in the slower version.
> > > > > > > > > > I ran perf script on the perf data for L1-dcache-load-misses with
> > > > > > > > > > period = 1 million, and it reported two cache misses on the ldur
> > > > > > > > > > instruction in the full-hoisting case, while it reported zero for the
> > > > > > > > > > disabled-load-hoisting case.
> > > > > > > > > > So I wonder if the slowdown happens because hoisting of the 'w' array
> > > > > > > > > > possibly results in eviction of anisox, causing a cache miss inside
> > > > > > > > > > the inner loop and making the load slower?
> > > > > > > > > >
> > > > > > > > > > Hoisting also seems to improve the overall number of cache misses, tho.
> > > > > > > > > > For the case with hoisting of the 'w' array disabled, there were a
> > > > > > > > > > total of 463 cache misses, while with full hoisting there were 357
> > > > > > > > > > cache misses (with period = 1 million).
> > > > > > > > > > Does that happen because hoisting reduces cache misses along the
> > > > > > > > > > orthonl path (bb 173 -> bb 181 -> bb 182 -> bb 194)?
> > > > > > > > > Hi,
> > > > > > > > > In general, I feel that for this case and for PR80155, the issues come
> > > > > > > > > with long-range hoistings inside a large CFG: since we don't have an
> > > > > > > > > accurate way to model target resources (register pressure in the
> > > > > > > > > PR80155 case, or possibly cache pressure in this case?) at the tree
> > > > > > > > > level, we end up with register spills or cache misses inside loops,
> > > > > > > > > which may offset the benefit of hoisting. As previously discussed, the
> > > > > > > > > right way to go is a live-range splitting pass at the GIMPLE -> RTL
> > > > > > > > > border, which could also help with other code-movement optimizations
> > > > > > > > > (or cases where the source had variables with long live ranges).
> > > > > > > > >
> > > > > > > > > I was wondering, though, as a cheap workaround: would it make sense
> > > > > > > > > to check whether we are hoisting across a "large" region of nested
> > > > > > > > > loops, and avoid hoisting in that case, since it may exert resource
> > > > > > > > > pressure inside the loop region?  (Especially in the cases where the
> > > > > > > > > hoisted expressions were not originally AVAIL in any of the loop
> > > > > > > > > blocks, so the loop region doesn't benefit from hoisting.)
> > > > > > > > >
> > > > > > > > > For instance:
> > > > > > > > > FOR_EACH_EDGE (e, ei, block)
> > > > > > > > >   {
> > > > > > > > >     /* Avoid hoisting across more than 3 nested loops */
> > > > > > > > >     if (e->dest is a loop pre-header or loop header
> > > > > > > > >         && nesting depth of loop is > 3)
> > > > > > > > >      return false;
> > > > > > > > >   }
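> > > > > > > > >
> > > > > > > > > Fleshed out, that check might look roughly like this (an untested
> > > > > > > > > sketch; the function name is made up, bb_loop_depth from cfgloop.h
> > > > > > > > > only approximates the pre-header test above, and the threshold of 3
> > > > > > > > > stays arbitrary):
> > > > > > > > >
> > > > > > > > > static bool
> > > > > > > > > hoisting_crosses_loop_region_p (basic_block block)
> > > > > > > > > {
> > > > > > > > >   edge e;
> > > > > > > > >   edge_iterator ei;
> > > > > > > > >   /* Give up if any successor enters a loop nest deeper than the
> > > > > > > > >      threshold, i.e. we would hoist across a large loop region.  */
> > > > > > > > >   FOR_EACH_EDGE (e, ei, block->succs)
> > > > > > > > >     if (bb_loop_depth (e->dest) > 3)
> > > > > > > > >       return true;
> > > > > > > > >   return false;
> > > > > > > > > }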
> > > > > > > > >
> > > > > > > > > I think this would work for resolving the calculix issue because it
> > > > > > > > > hoists across one region of 6 loops and another of 4 loops (didn't
> > > > > > > > > test yet).
> > > > > > > > > It's not bulletproof in that it will miss detecting cases where the
> > > > > > > > > loop header (or pre-header) isn't a successor of the candidate block
> > > > > > > > > (checking for that might get expensive, though?). I will test it on
> > > > > > > > > the gcc testsuite and on SPEC for any regressions.
> > > > > > > > > Does this sound like a reasonable heuristic?
> > > > > > > > Hi,
> > > > > > > > The attached patch implements the above heuristic.
> > > > > > > > Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> > > > > > > > It brings back most of the performance for calculix, on par with -O2
> > > > > > > > (without inlining orthonl).
> > > > > > > > I verified that with the patch there is no cache miss happening on the
> > > > > > > > load insn inside the loop
> > > > > > > > (with perf report -e L1-dcache-load-misses/period=1000000/).
> > > > > > > >
> > > > > > > > I am in the process of benchmarking the patch on aarch64 for SPEC
> > > > > > > > speed and will report numbers in a couple of days. (If required, we
> > > > > > > > could parametrize the number of nested loops, hardcoded (arbitrarily)
> > > > > > > > to 3 in this patch, and set it in the backend so as not to affect
> > > > > > > > other targets.)
> > > > > > >
> > > > > > > I don't think this patch captures the case in a sensible way - it will simply
> > > > > > > never hoist computations out of loop header blocks with depth > 3, which
> > > > > > > is certainly not what you want.  Also the pre-header check is odd - we're
> > > > > > > looking for computations in successors of BLOCK but clearly a computation
> > > > > > > in a pre-header is not at the same loop level as one in the header itself.
> > > > > > Well, my intent was to check whether we are hoisting across a region
> > > > > > that has multiple nested loops, and in that case avoid hoisting
> > > > > > expressions that do not belong to any loop blocks, because that may
> > > > > > increase resource pressure inside the loops. For instance, in the
> > > > > > calculix issue we hoist the 'w' array from the post-dom, and neither
> > > > > > loop region has any uses of 'w'.  I agree checking just the loop level
> > > > > > is too coarse.
> > > > > > The pre-header check was essentially for the same purpose: to see if we
> > > > > > are hoisting across a loop region, not necessarily from within the
> > > > > > loops.
> > > > >
> > > > > But it will fail to hoist *p in
> > > > >
> > > > >    if (x)
> > > > >      {
> > > > >         a = *p;
> > > > >      }
> > > > >   else
> > > > >     {
> > > > >        b = *p;
> > > > >     }
> > > > >
> > > > > <make distance large>
> > > > > pdom:
> > > > >   c = *p;
> > > > >
> > > > > so it isn't what matters either.  What happens at the immediate post-dominator
> > > > > isn't directly relevant - what matters would be if the pdom is the one making
> > > > > the value antic on one of the outgoing edges.  In that case we're also going
> > > > > to PRE *p into the arm not computing *p (albeit in a later position).  But
> > > > > that property is impossible to compute from the sets themselves (not even
> > > > > mentioning the arbitrary CFG that can be in between the block and its pdom,
> > > > > or the weird pdoms we compute for regions not having a path to exit, like
> > > > > infinite loops).
> > > > >
> > > > > You could eventually look at the pdom predecessors and if *p is not AVAIL_OUT
> > > > > in each of them we _might_ have the situation you want to protect against.
> > > > > But as said PRE insertion will likely have made sure it _is_ AVAIL_OUT in each
> > > > > of them ...
> > > > Hi Richard,
> > > > Thanks for the suggestions. Right, the issue here seems to be that the
> > > > post-dom block is making expressions ANTIC. Before doing insert, could
> > > > we simply copy the AVAIL_OUT set of each block into another set, say
> > > > ORIG_AVAIL_OUT, as a guard against PRE eventually inserting expressions
> > > > in the pred blocks of the pdom and making them available?
> > > > And during hoisting, if the distance is "large", we could check that
> > > > each expr that is ANTIC_IN (pdom) is ORIG_AVAIL_OUT in each pred of
> > > > the pdom.
> > >
> > > Did you try whether it works w/o copying AVAIL_OUT?  Because AVAIL_OUT is
> > > very large (it's actually quadratic in size of the CFG * # values); in
> > > particular, we're inserting in RPO and update AVAIL_OUT only up to the
> > > current block (from dominators), so the PDOM block should have the
> > > original AVAIL_OUT (but from the last iteration - we do iterate
> > > insertion).
> > >
> > > Note I'm still not happy with pulling off this kind of heuristic.
> > > What the suggestion means is that for
> > >
> > >    if (x)
> > >      y = *p;
> > >    z = *p;
> > >
> > > we'll end up with
> > >
> > >   if (x)
> > >     y = *p;
> > >   else
> > >     z = *p;
> > >
> > > instead of
> > >
> > >    tem = *p;
> > >    if (x)
> > >     y = tem;
> > >   else
> > >     z = tem;
> > >
> > > that is, we get the runtime benefit but not the code-size one
> > > (hoisting mostly helps code size, plus allows if-conversion as a
> > > followup in some cases).  Now, if we iterate (as we would with a second
> > > hoisting pass), then the above would still cause hoisting - so the
> > > heuristic isn't sustainable.  Again, sth like "distance" isn't really
> > > available.
> > Hi Richard,
> > It doesn't work without copying AVAIL_OUT.
> > I tried it on a small example with the attached patch:
> >
> > int foo(int cond, int x, int y)
> > {
> >   int t;
> >   void f(int);
> >
> >   if (cond)
> >     t = x + y;
> >   else
> >     t = x - y;
> >
> >   f (t);
> >   int t2 = (x + y) * 10;
> >   return t2;
> > }
> >
> > By intersecting availout_in_some with the AVAIL_OUT of the preds of the
> > pdom, it does not hoist in the first pass, but then PRE inserts x + y in
> > the "else" block, and we eventually hoist before if (cond). Similarly
> > for the e_c3d hoistings in calculix.
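> >
> > That is, the then-arm already computes x + y (as t), and the first
> > insert pass rewrites the else arm into roughly the following (_tem is
> > an illustrative name, not the actual temporary GCC creates):
> >
> >   else
> >     {
> >       t = x - y;
> >       _tem = x + y;   /* PRE-inserted copy of the expression */
> >     }
> >
> > making x + y available on both predecessors of the pdom, so the next
> > insert iteration hoists it above if (cond) anyway.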
> >
> > IIUC, we want the hoisting candidate set to be:
> > (ANTIC_IN (block) intersect AVAIL_OUT (preds of pdom)) - AVAIL_OUT (block)
> > to ensure that hoisted expressions are available along all paths from
> > the block to its post-dom?
> > If copying AVAIL_OUT sets is too expensive, could we keep another set
> > that precomputes the intersection of the AVAIL_OUT of the pdom preds
> > for each block, and then use this info during hoisting?
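> >
> > In bitmap terms, the candidate set from the formula above would be
> > computed roughly like this (an untested sketch reusing the set names
> > and bitmap helpers already used in tree-ssa-pre.c):
> >
> >   bitmap_head cand;
> >   bitmap_initialize (&cand, &grand_bitmap_obstack);
> >   bitmap_copy (&cand, &ANTIC_IN (block)->values);
> >   /* Require availability in every predecessor of the post-dominator.  */
> >   FOR_EACH_EDGE (e, ei, pdom_bb->preds)
> >     bitmap_and_into (&cand, &AVAIL_OUT (e->src)->values);
> >   /* Drop what is already available out of BLOCK itself.  */
> >   bitmap_and_compl_into (&cand, &AVAIL_OUT (block)->values);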
> >
> > For computing "distance", I implemented a simple DFS walk from the
> > block to its post-dom that gives up if the depth crosses a threshold
> > before reaching the post-dom. I am not sure, though, how expensive
> > that can get.
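> >
> > The walk is essentially the following (a sketch of what I tried; the
> > function name is made up, and there is no memoization or visited set):
> >
> >   static bool
> >   pdom_within_distance_p (basic_block bb, basic_block pdom, int budget)
> >   {
> >     if (bb == pdom)
> >       return true;
> >     if (budget == 0)
> >       return false;  /* Give up: treat the pdom as "too far".  */
> >     edge e;
> >     edge_iterator ei;
> >     FOR_EACH_EDGE (e, ei, bb->succs)
> >       if (!pdom_within_distance_p (e->dest, pdom, budget - 1))
> >         return false;
> >     return true;
> >   }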
>
> As written it is quadratic in the CFG size.
>
> You can optimize away the
>
> +    FOR_EACH_EDGE (e, ei, pdom_bb->preds)
> +      bitmap_and_into (&availout_in_some, &AVAIL_OUT (e->src)->values);
>
> loop if the intersection of availout_in_some and ANTIC_IN (pdom) is empty.
>
> As said, I don't think this is the way to go - trying to avoid code
> hoisting isn't what we'd want to do. Your quoted assembly instead points
> to a loop with a non-empty latch, which is usually caused by PRE and
> avoided at -O3 because it also harms vectorization.
But disabling PRE (which removes the non-empty latch) only results in a
marginal performance improvement, while disabling hoisting of the 'w'
array, even with the non-empty latch, gains back most of the performance.
AFAIU, that was happening because after disabling hoisting of 'w' there
was no cache miss (as seen with perf -e L1-dcache-load-misses) on the
load instruction inside the inner loop.

For the pdom heuristic, I guess we cannot copy AVAIL_OUT sets per node,
since that's quadratic in the CFG size.
Would it make sense to break the interaction between PRE and hoisting
only for the case of inserting into the preds of the pdom?
I tried doing that in the attached patch, where insert runs in two phases:
(a) PRE and hoisting, where hoisting marks blocks to skip PRE for.
(b) A second phase, which runs only PRE, on all blocks.
This (expectedly) regresses ssa-hoist-3.c.

If the heuristic isn't acceptable, I suppose encoding the distance of
each expr within the ANTIC sets during compute_antic would be the right
way to fix this? So ANTIC_IN (block) would contain the anticipated
expressions and, for each antic expr, the "distance" from the furthest
block it is computed in? Could you please elaborate a bit on how we
could go about encoding distance during compute_antic?

Thanks,
Prathamesh

[-- Attachment #2: gnu-659-pdom-5.diff --]
[-- Type: application/octet-stream, Size: 3362 bytes --]

diff --git a/gcc/tree-ssa-pre.c b/gcc/tree-ssa-pre.c
index 0c1654f3580..5be4c3cc9d4 100644
--- a/gcc/tree-ssa-pre.c
+++ b/gcc/tree-ssa-pre.c
@@ -3482,7 +3482,7 @@ do_pre_partial_partial_insertion (basic_block block, basic_block dom)
    The caller has to make sure that BLOCK has at least two successors.  */
 
 static bool
-do_hoist_insertion (basic_block block)
+do_hoist_insertion (basic_block block, hash_set<basic_block> *late_pre_bbs)
 {
   edge e;
   edge_iterator ei;
@@ -3537,6 +3537,22 @@ do_hoist_insertion (basic_block block)
 			   &AVAIL_OUT (e->dest)->values);
   bitmap_clear (&hoistable_set.values);
 
+  /* Intersect with AVAIL_OUT of preds of post-dom, to check that
+     hoisted exprs are along all paths from block to pdom.  */
+
+  basic_block pdom_bb = get_immediate_dominator (CDI_POST_DOMINATORS, block);
+  bitmap_head S;
+  bitmap_initialize (&S, &grand_bitmap_obstack);
+  bitmap_and (&S, &availout_in_some, &ANTIC_IN (pdom_bb)->values);
+  if (!bitmap_empty_p (&S))
+    {
+      /* Mark pdom_bb in late_pre_bbs, to avoid PRE for this block
+	 during hoisting.  */
+      late_pre_bbs->add (pdom_bb);
+      FOR_EACH_EDGE (e, ei, pdom_bb->preds)
+	bitmap_and_into (&availout_in_some, &AVAIL_OUT (e->src)->values);
+    }
+
   /* Short-cut for a common case: availout_in_some is empty.  */
   if (bitmap_empty_p (&availout_in_some))
     return false;
@@ -3615,9 +3631,10 @@ do_hoist_insertion (basic_block block)
 /* Perform insertion of partially redundant and hoistable values.  */
 
 static void
-insert (void)
+insert_1 (bool late_pre)
 {
   basic_block bb;
+  hash_set<basic_block> *late_pre_bbs = new hash_set<basic_block> ();
 
   FOR_ALL_BB_FN (bb, cfun)
     NEW_SETS (bb) = bitmap_set_new ();
@@ -3664,14 +3681,21 @@ insert (void)
 	      /* Insert expressions for partial redundancies.  */
 	      if (flag_tree_pre && !single_pred_p (block))
 		{
-		  changed |= do_pre_regular_insertion (block, dom);
-		  if (do_partial_partial)
-		    changed |= do_pre_partial_partial_insertion (block, dom);
+		  /* If hoisting marked to not insert in preds of block,
+		     skip for now, and insert during "late pre".  */
+		  if (!late_pre && late_pre_bbs->contains (block))
+		    ;
+		  else
+		    {
+		      changed |= do_pre_regular_insertion (block, dom);
+		      if (do_partial_partial)
+			changed |= do_pre_partial_partial_insertion (block, dom);
+		    }
 		}
 
 	      /* Insert expressions for hoisting.  */
-	      if (flag_code_hoisting && EDGE_COUNT (block->succs) >= 2)
-		changed |= do_hoist_insertion (block);
+	      if (!late_pre && flag_code_hoisting && EDGE_COUNT (block->succs) >= 2)
+		changed |= do_hoist_insertion (block, late_pre_bbs);
 	    }
 	}
 
@@ -3688,6 +3712,12 @@ insert (void)
   free (rpo);
 }
 
+static void
+insert ()
+{
+  insert_1 (false);
+  insert_1 (true);
+}
 
 /* Compute the AVAIL set for all basic blocks.
 
@@ -4099,6 +4129,7 @@ init_pre (void)
   alloc_aux_for_blocks (sizeof (struct bb_bitmap_sets));
 
   calculate_dominance_info (CDI_DOMINATORS);
+  calculate_dominance_info (CDI_POST_DOMINATORS);
 
   bitmap_obstack_initialize (&grand_bitmap_obstack);
   phi_translate_table = new hash_table<expr_pred_trans_d> (5110);
@@ -4131,6 +4162,7 @@ fini_pre ()
   name_to_id.release ();
 
   free_aux_for_blocks ();
+  free_dominance_info (CDI_POST_DOMINATORS);
 }
 
 namespace {

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: LTO slows down calculix by more than 10% on aarch64
  2020-09-24 10:36                                 ` Prathamesh Kulkarni
@ 2020-09-24 11:14                                   ` Richard Biener
  2020-10-21 10:03                                     ` Prathamesh Kulkarni
  0 siblings, 1 reply; 25+ messages in thread
From: Richard Biener @ 2020-09-24 11:14 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: Alexander Monakov, GCC Development

On Thu, Sep 24, 2020 at 12:36 PM Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> > [...]
> But disabling PRE (which removes the non-empty latch) only results in a
> marginal performance improvement, while disabling hoisting of the 'w'
> array, even with the non-empty latch, gains back most of the performance.
> AFAIU, that was happening because after disabling hoisting of 'w' there
> was no cache miss (as seen with perf -e L1-dcache-load-misses) on the
> load instruction inside the inner loop.

But that doesn't make much sense then.  If code generation isn't
an issue, I don't see how the hoisted loads should cause an L1
dcache load miss for data that is accessed in the respective loop
as well (though not hoisted from it, since at -O2 the loop is not
sufficiently unrolled).

> For the pdom heuristic, I guess we cannot copy AVAIL_OUT sets per node,
> since that's quadratic in the CFG size.
> Would it make sense to break the interaction between PRE and hoisting
> only for the case of inserting into the preds of the pdom?
> I tried doing that in the attached patch, where insert runs in two phases:
> (a) PRE and hoisting, where hoisting marks blocks to skip PRE for.
> (b) A second phase, which runs only PRE, on all blocks.
> This (expectedly) regresses ssa-hoist-3.c.
>
> If the heuristic isn't acceptable, I suppose encoding the distance of
> each expr within the ANTIC sets during compute_antic would be the right
> way to fix this? So ANTIC_IN (block) would contain the anticipated
> expressions and, for each antic expr, the "distance" from the furthest
> block it is computed in? Could you please elaborate a bit on how we
> could go about encoding distance during compute_antic?

But the distance in this case is just one CFG node ... we have

 if (mattyp.eq.1)
   ... use of w but not with constant indices
 else if (mattyp.eq.2)
   .. inlined orthonl with constant index w() accesses, single BB
 else
   ... use of w but not with constant indices - the actual relevant
       loop of calculix
 endif
 ... constant index w() accesses, single BB

so the CFG distance is one node - unless you want to compute the
maximum distance?  Btw, I only see 9 loads hoisted.

I'm not sure how relevant -O2 -flto SPEC performance is for an FP benchmark.

And indeed this case is exactly one where hoisting is superior to
PRE, which would happily insert the 9 loads into the two variable-access
predecessors to get rid of the redundancy wrt the mattyp.eq.1 path.

In .optimized I see

  pretmp_5573 = w[0];
  pretmp_5574 = w[3];
  pretmp_5575 = w[6];
  pretmp_5576 = w[1];
  pretmp_5577 = w[4];
  pretmp_5578 = w[7];
  pretmp_5579 = w[2];
  pretmp_5580 = w[5];
  pretmp_5581 = w[8];
  if (mattyp.157_742 == 1)

I do remember talks/patches about ordering such sequences of loads
to make them prefetch-happier.  Are the loads actually emitted in order
for arm?  That is, w[0]...w[8] rather than, as seen above, with some
random permutes in between?  On x86 they are emitted in random order
(and they are also spilled immediately).

Richard.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: LTO slows down calculix by more than 10% on aarch64
  2020-09-24 11:14                                   ` Richard Biener
@ 2020-10-21 10:03                                     ` Prathamesh Kulkarni
  2020-10-21 10:39                                       ` Richard Biener
  0 siblings, 1 reply; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-10-21 10:03 UTC (permalink / raw)
  To: Richard Biener; +Cc: Alexander Monakov, GCC Development

On Thu, 24 Sep 2020 at 16:44, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Thu, Sep 24, 2020 at 12:36 PM Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Wed, 23 Sep 2020 at 16:40, Richard Biener <richard.guenther@gmail.com> wrote:
> > >
> > > On Wed, Sep 23, 2020 at 12:11 PM Prathamesh Kulkarni
> > > <prathamesh.kulkarni@linaro.org> wrote:
> > > >
> > > > On Wed, 23 Sep 2020 at 13:22, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > >
> > > > > On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni
> > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > >
> > > > > > On Tue, 22 Sep 2020 at 16:36, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > > >
> > > > > > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > >
> > > > > > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I obtained perf stat results for following benchmark runs:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -O2:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >     7856832.692380      task-clock (msec)         #    1.000 CPUs utilized
> > > > > > > > > > > > > >               3758               context-switches          #    0.000 K/sec
> > > > > > > > > > > > > >                 40                 cpu-migrations             #    0.000 K/sec
> > > > > > > > > > > > > >              40847              page-faults                   #    0.005 K/sec
> > > > > > > > > > > > > >      7856782413676      cycles                           #    1.000 GHz
> > > > > > > > > > > > > >      6034510093417      instructions                   #    0.77  insn per cycle
> > > > > > > > > > > > > >       363937274287       branches                       #   46.321 M/sec
> > > > > > > > > > > > > >        48557110132       branch-misses                #   13.34% of all branches
> > > > > > > > > > > > >
> > > > > > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > > > > > > > > > > enough for this kind of code)
> > > > > > > > > > > > >
> > > > > > > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >     8319643.114380      task-clock (msec)       #    1.000 CPUs utilized
> > > > > > > > > > > > > >               4285               context-switches         #    0.001 K/sec
> > > > > > > > > > > > > >                 28                 cpu-migrations            #    0.000 K/sec
> > > > > > > > > > > > > >              40843              page-faults                  #    0.005 K/sec
> > > > > > > > > > > > > >      8319591038295      cycles                          #    1.000 GHz
> > > > > > > > > > > > > >      6276338800377      instructions                  #    0.75  insn per cycle
> > > > > > > > > > > > > >       467400726106       branches                      #   56.180 M/sec
> > > > > > > > > > > > > >        45986364011        branch-misses              #    9.84% of all branches
> > > > > > > > > > > > >
> > > > > > > > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > > > > > > > > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > > > > > > > > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >    8207331.088040      task-clock (msec)   #    1.000 CPUs utilized
> > > > > > > > > > > > > >               2266               context-switches    #    0.000 K/sec
> > > > > > > > > > > > > >                 32                 cpu-migrations       #    0.000 K/sec
> > > > > > > > > > > > > >              40846              page-faults             #    0.005 K/sec
> > > > > > > > > > > > > >      8207292032467      cycles                     #   1.000 GHz
> > > > > > > > > > > > > >      6035724436440      instructions             #    0.74  insn per cycle
> > > > > > > > > > > > > >       364415440156       branches                 #   44.401 M/sec
> > > > > > > > > > > > > >        53138327276        branch-misses         #   14.58% of all branches
> > > > > > > > > > > > >
> > > > > > > > > > > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > > > > > > > > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > > > > > > > > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > > > > > > > > > > not very complicated (though Fortran array addressing...).
> > > > > > > > > > > > >
> > > > > > > > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >    7797265.206850      task-clock (msec)         #    1.000 CPUs utilized
> > > > > > > > > > > > > >               3139              context-switches          #    0.000 K/sec
> > > > > > > > > > > > > >                 20                cpu-migrations             #    0.000 K/sec
> > > > > > > > > > > > > >              40846              page-faults                  #    0.005 K/sec
> > > > > > > > > > > > > >      7797221351467      cycles                          #    1.000 GHz
> > > > > > > > > > > > > >      6187348757324      instructions                  #    0.79  insn per cycle
> > > > > > > > > > > > > >       461840800061       branches                      #   59.231 M/sec
> > > > > > > > > > > > > >        26920311761        branch-misses             #    5.83% of all branches
> > > > > > > > > > > > >
> > > > > > > > > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > > > > > > > > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > > > > > > > > > > in insn count).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > > > > > > > > > > consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > > > > > > > > > > each iterating only 3 times).
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Perf profiles for
> > > > > > > > > > > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >           3196866 |1f04:    ldur   d1, [x1, #-248]
> > > > > > > > > > > > > > 216348301800│            add    w0, w0, #0x1
> > > > > > > > > > > > > >             985098 |            add    x2, x2, #0x18
> > > > > > > > > > > > > > 216215999206│            add    x1, x1, #0x48
> > > > > > > > > > > > > > 215630376504│            fmul   d1, d5, d1
> > > > > > > > > > > > > > 863829148015│            fmul   d1, d1, d6
> > > > > > > > > > > > > > 864228353526│            fmul   d0, d1, d0
> > > > > > > > > > > > > > 864568163014│            fmadd  d2, d0, d16, d2
> > > > > > > > > > > > > >                         │             cmp    w0, #0x4
> > > > > > > > > > > > > > 216125427594│          ↓ b.eq   1f34
> > > > > > > > > > > > > >         15010377│             ldur   d0, [x2, #-8]
> > > > > > > > > > > > > > 143753737468│          ↑ b      1f04
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -O2 with inlined orthonl:
> > > > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > > > > > > > > > > > > 144055883055│            add    w0, w0, #0x1
> > > > > > > > > > > > > >   72262104254│            add    x2, x2, #0x18
> > > > > > > > > > > > > > 143991169721│            add    x1, x1, #0x48
> > > > > > > > > > > > > > 288648917780│            fmul   d15, d17, d15
> > > > > > > > > > > > > > 864665644756│            fmul   d15, d15, d18
> > > > > > > > > > > > > > 863868426387│            fmul   d14, d15, d14
> > > > > > > > > > > > > > 865228159813│            fmadd  d16, d14, d31, d16
> > > > > > > > > > > > > >             245967│            cmp    w0, #0x4
> > > > > > > > > > > > > > 215396760545│         ↓ b.eq   1f28
> > > > > > > > > > > > > >       704732365│            ldur   d14, [x2, #-8]
> > > > > > > > > > > > > > 143775979620│         ↑ b      1ef8
> > > > > > > > > > > > >
> > > > > > > > > > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > > > > > > > > > >
> > > > > > > > > > > > > High count on the initial ldur instruction could be explained if the loop
> > > > > > > > > > > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > > > > > > > > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > > > > > > > > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > > > > > > > > > > assembly.
> > > > > > > > > > > > >
> > > > > > > > > > > > > It may also be possible to check if code alignment matters, by compiling with
> > > > > > > > > > > > > -falign-loops=32.
> > > > > > > > > > > > Hi,
> > > > > > > > > > > > Thanks a lot for the detailed feedback, and I am sorry for late response.
> > > > > > > > > > > >
> > > > > > > > > > > > The hoisting region is:
> > > > > > > > > > > > if(mattyp.eq.1) then
> > > > > > > > > > > >   4 loops
> > > > > > > > > > > > elseif(mattyp.eq.2) then
> > > > > > > > > > > >   {
> > > > > > > > > > > >      orthonl inlined into basic block;
> > > > > > > > > > > >      loads w[0] .. w[8]
> > > > > > > > > > > >   }
> > > > > > > > > > > > else
> > > > > > > > > > > >    6 loops  // load anisox
> > > > > > > > > > > >
> > > > > > > > > > > > followed by basic block:
> > > > > > > > > > > >
> > > > > > > > > > > >  senergy=
> > > > > > > > > > > >      &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > > > > > > > > >      &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > > > > > > > > >      &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > > > > > > > > >                      s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > > > > > > > > >                      s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > > > > > > > > >                      s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > > > > > > > > >
> > > > > > > > > > > > Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block,
> > > > > > > > > > > > right in block 181, which is:
> > > > > > > > > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > > > > > > > > >
> > > > > > > > > > > > which is then further hoisted to block 173:
> > > > > > > > > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > > > > > > > > >
> > > > > > > > > > > > From block 181, we have two paths towards senergy block (bb 194):
> > > > > > > > > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > > > > > > > > AND
> > > > > > > > > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > > > > > > > > > which has a path length of around 18 blocks.
> > > > > > > > > > > > (bb 194 post-dominates bb 181 and bb 173).
> > > > > > > > > > > >
> > > > > > > > > > > > Disabling only load hoisting within blocks 173 and 181
> > > > > > > > > > > > (simply avoiding inserting pre_expr if pre_expr->kind == REFERENCE)
> > > > > > > > > > > > avoids hoisting of the 'w' array and brings back most of the
> > > > > > > > > > > > performance, which suggests that it is the hoisting of the
> > > > > > > > > > > > 'w' array (w[0] ... w[8]) that is causing the slowdown.
> > > > > > > > > > > >
> > > > > > > > > > > > I obtained perf profiles for full hoisting, and disabled hoisting of
> > > > > > > > > > > > 'w' array for the 6 loops, and the most drastic difference was
> > > > > > > > > > > > for ldur instruction:
> > > > > > > > > > > >
> > > > > > > > > > > > With full hoisting:
> > > > > > > > > > > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > > > > > > > > > >
> > > > > > > > > > > > Without full hoisting:
> > > > > > > > > > > > 3441224 │1edc:   ldur   d1, [x1, #-248]
> > > > > > > > > > > >
> > > > > > > > > > > > (The loop entry seems to be fall thru in both cases. I have attached
> > > > > > > > > > > > profiles for both cases).
> > > > > > > > > > > >
> > > > > > > > > > > > IIUC, the instruction seems to be loading the first element from the
> > > > > > > > > > > > anisox array, which makes me wonder if the issue is a data-cache miss
> > > > > > > > > > > > in the slower version.
> > > > > > > > > > > > I ran perf script on perf data for L1-dcache-load-misses with period =
> > > > > > > > > > > > 1 million, and it reported two cache misses on the ldur instruction in
> > > > > > > > > > > > the full hoisting case, while it reported zero for the disabled load
> > > > > > > > > > > > hoisting case.
> > > > > > > > > > > > So I wonder if the slowdown happens because hoisting of the 'w' array
> > > > > > > > > > > > possibly results in eviction of anisox, thus causing a cache miss
> > > > > > > > > > > > inside the inner loop and making the load slower?
> > > > > > > > > > > >
> > > > > > > > > > > > Hoisting also seems to reduce the overall number of cache misses,
> > > > > > > > > > > > though. For the disabled hoisting of 'w' case, there were a total of
> > > > > > > > > > > > 463 cache misses, while with full hoisting there were 357 cache misses
> > > > > > > > > > > > (with period = 1 million).
> > > > > > > > > > > > Does that happen because hoisting probably reduces cache misses along
> > > > > > > > > > > > the orthonl path (bb 173 -> bb 181 -> bb 182 -> bb 194)?
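
For reference, a sketch of how numbers like the above can be collected
(the exact event name varies by kernel and CPU, and ./calculix <args>
stands in for the actual benchmark invocation):

  # sample L1D misses, one sample per million events
  perf record -e L1-dcache-load-misses/period=1000000/ -- ./calculix <args>
  # attribute the samples back to instructions in the hot function
  perf annotate e_c3d
  # or dump the raw samples
  perf script
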
> > > > > > > > > > > Hi,
> > > > > > > > > > > In general I feel for this or PR80155 case, the issues come with long
> > > > > > > > > > > range hoistings, inside a large CFG, since we don't have an accurate
> > > > > > > > > > > way to model target resources (register pressure in PR80155 case / or
> > > > > > > > > > > possibly cache pressure in this case?) at tree level and we end up
> > > > > > > > > > > with register spill or cache miss inside loops, which may offset the
> > > > > > > > > > > benefit of hoisting. As previously discussed the right way to go is a
> > > > > > > > > > > live range splitting pass, at GIMPLE -> RTL border which can also help
> > > > > > > > > > > with other code-movement optimizations (or if the source had variables
> > > > > > > > > > > with long live ranges).
> > > > > > > > > > >
> > > > > > > > > > > I was wondering, though, as a cheap workaround: would it make sense to
> > > > > > > > > > > check if we are hoisting across a "large" region of nested loops, and
> > > > > > > > > > > avoid it in that case, since hoisting may exert resource pressure
> > > > > > > > > > > inside the loop region? (Especially in the cases where hoisted
> > > > > > > > > > > expressions were not originally AVAIL in any of the loop blocks, and
> > > > > > > > > > > the loop region doesn't benefit from hoisting.)
> > > > > > > > > > >
> > > > > > > > > > > For instance:
> > > > > > > > > > > FOR_EACH_EDGE (e, ei, block)
> > > > > > > > > > >   {
> > > > > > > > > > >     /* Avoid hoisting across more than 3 nested loops */
> > > > > > > > > > >     if (e->dest is a loop pre-header or loop header
> > > > > > > > > > >         && nesting depth of loop is > 3)
> > > > > > > > > > >      return false;
> > > > > > > > > > >   }
> > > > > > > > > > >
> > > > > > > > > > > I think this would work for resolving the calculix issue because it
> > > > > > > > > > > hoists across one region of 6 loops and another of 4 loops (didn't
> > > > > > > > > > > test yet).
> > > > > > > > > > > It's not bulletproof in that it will miss detecting cases where the
> > > > > > > > > > > loop header (or pre-header) isn't a successor of the candidate block
> > > > > > > > > > > (checking for that might get expensive, though?). I will test it on
> > > > > > > > > > > the gcc suite and SPEC for any regressions.
> > > > > > > > > > > Does this sound like a reasonable heuristic ?
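
For concreteness, in GCC terms the check could look roughly like this
(a sketch against the cfgloop API, not the actual patch; the cutoff 3
is the arbitrary constant from the pseudocode above):

  edge e;
  edge_iterator ei;
  FOR_EACH_EDGE (e, ei, block->succs)
    {
      basic_block dest = e->dest;
      /* Step from a pre-header to the loop header it feeds.  */
      if (!bb_loop_header_p (dest) && single_succ_p (dest))
        dest = single_succ (dest);
      /* Avoid hoisting across a deeply nested loop region.  */
      if (bb_loop_header_p (dest) && bb_loop_depth (dest) > 3)
        return false;
    }
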
> > > > > > > > > > Hi,
> > > > > > > > > > The attached patch implements the above heuristic.
> > > > > > > > > > Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> > > > > > > > > > And it brings back most of the performance for calculix, on par with O2
> > > > > > > > > > (without inlining orthonl).
> > > > > > > > > > I verified that with the patch there is no cache miss happening on the
> > > > > > > > > > load insn inside the loop
> > > > > > > > > > (with perf report -e L1-dcache-load-misses/period=1000000/).
> > > > > > > > > >
> > > > > > > > > > I am in the process of benchmarking the patch on aarch64 for SPEC for
> > > > > > > > > > speed and will report numbers
> > > > > > > > > > in a couple of days. (If required, we could parametrize the number of
> > > > > > > > > > nested loops, hardcoded (arbitrarily) to 3 in this patch,
> > > > > > > > > > and set it in the backend so as to not affect other targets.)
> > > > > > > > >
> > > > > > > > > I don't think this patch captures the case in a sensible way - it will simply
> > > > > > > > > never hoist computations out of loop header blocks with depth > 3 which
> > > > > > > > > is certainly not what you want.  Also the pre-header check is odd - we're
> > > > > > > > > looking for computations in successors of BLOCK but clearly a computation
> > > > > > > > > in a pre-header is not at the same loop level as one in the header itself.
> > > > > > > > Well, my intent was to check if we are hoisting across a region
> > > > > > > > that has multiple nested loops, and in that case avoid hoisting
> > > > > > > > expressions that do not belong to any loop blocks, because that may
> > > > > > > > increase resource pressure inside the loops. For instance, in the
> > > > > > > > calculix issue we hoist the 'w' array from the post-dom, and neither
> > > > > > > > loop region has any uses of 'w'.  I agree checking just for loop level
> > > > > > > > is too coarse.
> > > > > > > > The check with the pre-header was essentially for the same purpose, to
> > > > > > > > see if we are hoisting across a loop region,
> > > > > > > > not necessarily from within the loops.
> > > > > > >
> > > > > > > But it will fail to hoist *p in
> > > > > > >
> > > > > > >    if (x)
> > > > > > >      {
> > > > > > >         a = *p;
> > > > > > >      }
> > > > > > >   else
> > > > > > >     {
> > > > > > >        b = *p;
> > > > > > >     }
> > > > > > >
> > > > > > > <make distance large>
> > > > > > > pdom:
> > > > > > >   c = *p;
> > > > > > >
> > > > > > > so it isn't what matters either.  What happens at the immediate post-dominator
> > > > > > > isn't directly relevant - what matters would be if the pdom is the one making
> > > > > > > the value antic on one of the outgoing edges.  In that case we're also going
> > > > > > > to PRE *p into the arm not computing *p (albeit in a later position).  But
> > > > > > > that property is impossible to compute from the sets itself (not even mentioning
> > > > > > > the arbitrary CFG that can be inbetween the block and its pdom or the weird
> > > > > > > pdoms we compute for regions not having a path to exit, like infinite loops).
> > > > > > >
> > > > > > > You could eventually look at the pdom predecessors and if *p is not AVAIL_OUT
> > > > > > > in each of them we _might_ have the situation you want to protect against.
> > > > > > > But as said PRE insertion will likely have made sure it _is_ AVAIL_OUT in each
> > > > > > > of them ...
> > > > > > Hi Richard,
> > > > > > Thanks for the suggestions. Right, the issue here seems to be that the
> > > > > > post-dom block is making expressions ANTIC. Before doing insert, could
> > > > > > we simply copy the AVAIL_OUT set of each block into another set, say
> > > > > > ORIG_AVAIL_OUT, as a guard against PRE eventually inserting expressions
> > > > > > in pred blocks of pdom and making them available?
> > > > > > And during hoisting, we could check if each expr that is ANTIC_IN
> > > > > > (pdom) is ORIG_AVAIL_OUT in each pred of pdom,
> > > > > > if the distance is "large".
> > > > >
> > > > > Did you try if it works w/o copying AVAIL_OUT?  Because AVAIL_OUT is
> > > > > very large (it's actually quadratic in size of the CFG * # values), in
> > > > > particular
> > > > > we're inserting in RPO and update AVAIL_OUT only up to the current block
> > > > > (from dominators) so the PDOM block should have the original AVAIL_OUT
> > > > > (but from the last iteration - we do iterate insertion).
> > > > >
> > > > > Note I'm still not happy with pulling off this kind of heuristics.
> > > > > What the suggestion
> > > > > means is that for
> > > > >
> > > > >    if (x)
> > > > >      y = *p;
> > > > >    z = *p;
> > > > >
> > > > > we'll end up with
> > > > >
> > > > >   if (x)
> > > > >     y = *p;
> > > > >   else
> > > > >     z = *p;
> > > > >
> > > > > instead of
> > > > >
> > > > >    tem = *p;
> > > > >    if (x)
> > > > >     y = tem;
> > > > >   else
> > > > >     z = tem;
> > > > >
> > > > > that is, we get the runtime benefit but not the code-size one
> > > > > (hoisting mostly helps code size plus allows if-conversion as followup
> > > > > in some cases).  Now, if we iterate (like if we'd have a second hoisting pass)
> > > > > then the above would still cause hoisting - so the heuristic isn't sustainable.
> > > > > Again, sth like "distance" isn't really available.
> > > > Hi Richard,
> > > > It doesn't work without copying AVAIL_OUT.
> > > > I tried it on a small example with the attached patch:
> > > >
> > > > int foo(int cond, int x, int y)
> > > > {
> > > >   int t;
> > > >   void f(int);
> > > >
> > > >   if (cond)
> > > >     t = x + y;
> > > >   else
> > > >     t = x - y;
> > > >
> > > >   f (t);
> > > >   int t2 = (x + y) * 10;
> > > >   return t2;
> > > > }
> > > >
> > > > By intersecting availout_in_some with AVAIL_OUT of the preds of pdom,
> > > > it does not hoist in the first pass, but then PRE inserts x + y in the
> > > > "else" block, and we eventually hoist before if (cond). Similarly for
> > > > the e_c3d hoistings in calculix.
> > > >
> > > > IIUC, we want hoisting to be:
> > > > (ANTIC_IN (block) intersect AVAIL_OUT (preds of pdom)) - AVAIL_OUT (block)
> > > > to ensure that hoisted expressions are available along all paths from
> > > > block to post-dom? If copying AVAIL_OUT sets is too expensive, could we
> > > > keep another set that precomputes the intersection of AVAIL_OUT of the
> > > > pdom preds for each block and then use this info during hoisting?
> > > >
> > > > For computing "distance", I implemented a simple DFS walk from block
> > > > to post-dom that gives up if the depth crosses a
> > > > threshold before reaching the post-dom. I am not sure, though, how
> > > > expensive that can get.
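
The walk is of this shape (a sketch, not the actual patch; note there
is no visited set, so shared subpaths get re-walked):

  /* Return true if PDOM is reachable from BB within BUDGET edges,
     giving up once the budget is exhausted.  */
  static bool
  reaches_pdom_within_p (basic_block bb, basic_block pdom, int budget)
  {
    if (bb == pdom)
      return true;
    if (budget == 0)
      return false;
    edge e;
    edge_iterator ei;
    FOR_EACH_EDGE (e, ei, bb->succs)
      if (reaches_pdom_within_p (e->dest, pdom, budget - 1))
        return true;
    return false;
  }
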
> > >
> > > As written it is quadratic in the CFG size.
> > >
> > > You can optimize away the
> > >
> > > +    FOR_EACH_EDGE (e, ei, pdom_bb->preds)
> > > +      bitmap_and_into (&availout_in_some, &AVAIL_OUT (e->src)->values);
> > >
> > > loop if the intersection of availout_in_some and ANTIC_IN (pdom) is empty.
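
IIUC, that would be a guard of this shape (a sketch, reusing the
bitmap heads from the quoted fragment):

  /* Pruning against the pdom preds' AVAIL_OUT can only matter for
     candidates that are actually anticipated at the pdom.  */
  if (bitmap_intersect_p (&availout_in_some, &ANTIC_IN (pdom_bb)->values))
    FOR_EACH_EDGE (e, ei, pdom_bb->preds)
      bitmap_and_into (&availout_in_some, &AVAIL_OUT (e->src)->values);
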
> > >
> > > As said, I don't think this is the way to go - trying to avoid code
> > > hoisting isn't
> > > what we'd want to do - your quoted assembly instead points to a loop
> > > with a non-empty latch which is usually caused by PRE and avoided with -O3
> > > because it also harms vectorization.
> > But disabling PRE (which removes the non-empty latch) only results in a
> > marginal performance improvement,
> > while disabling hoisting of the 'w' array, with the non-empty latch
> > kept, seems to regain most of the performance.
> > AFAIU, that was happening because after disabling hoisting of 'w',
> > there was no cache miss (as seen with perf -e
> > L1-dcache-load-misses)
> > on the load instruction inside the inner loop.
>
> But that doesn't make much sense then.  If code generation isn't
> an issue I don't see how the hoisted loads should cause a L1
> dcache load miss for data that is accessed in the respective loop
> as well (though not hoisted from it since at -O2 not sufficiently
> unrolled)
Hi Richard,
I am very sorry for the late response; I was away from work for some
personal commitments.
Yes, I agree this doesn't seem to make much sense, but I am
consistently seeing L1 dcache load misses, which go away
after disabling hoisting of 'w'. I am not sure, though, why this happens.
Also, the load instruction is the only one that shows the most
significant performance difference across several runs. Or maybe I am
interpreting the results incorrectly.
Do you have suggestions for any benchmarking experiment I could try?
>
> > For the pdom heuristic, I guess we cannot copy AVAIL_OUT sets per
> > node, since that's quadratic in terms of CFG size.
> > Would it make sense to break the interaction between PRE and hoisting,
> > only for the case of inserting into the preds of pdom?
> > I tried doing that in the attached patch, where insert runs in two
> > phases:
> > (a) PRE and hoisting, where hoisting marks blocks to not do PRE for.
> > (b) A second phase, which only runs PRE on all blocks.
> > This (expectedly) regresses ssa-hoist-3.c.
> >
> > If the heuristic isn't acceptable, I suppose encoding the distance of
> > each expr within the ANTIC sets
> > during compute_antic would be the right way to fix this?
> > So ANTIC_IN (block) would contain the anticipated expressions, and for
> > each antic expr, the "distance" from the furthest block
> > it's computed in? Could you please elaborate a bit on how we could go
> > about encoding distance during compute_antic?
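
Roughly what I am imagining (purely hypothetical, no such structure
exists in tree-ssa-pre.c today):

  /* Each anticipated expression would carry an age, bumped each time
     compute_antic propagates the value into a predecessor and maxed
     over at control-flow merges.  */
  struct aged_pre_expr
  {
    pre_expr expr;
    unsigned distance;  /* Blocks since the nearest computation.  */
  };
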
>
> But the distance in this case is just one CFG node ... we have
>
>  if (mattyp.eq.1)
>    ... use of w but not with constant indices
>  else if (mattyp.eq.2)
>    .. inlined orthonl with constant index w() accesses, single BB
>  else
>    ... use of w but not with constant indices - the actual relevant
> loop of calculix
>  endif
>  ... constant index w() accesses, single BB
>
> so the CFG distance is one node - unless you want to compute the
> maximum distance?  Btw, I only see 9 loads hoisted.
>
> I'm not sure how relevant -O2 -flto SPEC performance is for a FP benchmark.
>
> And indeed this case is exactly one where hoisting is superior to
> PRE which would happily insert the 9 loads into the two variable-access
> predecessors to get rid of the redundancy wrt the mattyp.eq.1 path.
>
> In .optimized I see
>
>   pretmp_5573 = w[0];
>   pretmp_5574 = w[3];
>   pretmp_5575 = w[6];
>   pretmp_5576 = w[1];
>   pretmp_5577 = w[4];
>   pretmp_5578 = w[7];
>   pretmp_5579 = w[2];
>   pretmp_5580 = w[5];
>   pretmp_5581 = w[8];
>   if (mattyp.157_742 == 1)
>
> I do remember talks/patches about ordering of such sequences of loads
> to make them prefetch-happier.  Are the loads actually emitted in-order
> for arm?  Thus w[0]...w[8] rather than as seen above with some random
> permutes inbetween?  On x86 they are emitted in random order
> (they are also spilled immediately).
On aarch64, they are emitted in random order as well.
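
For illustration, an address-ordered emission would look like this
(made-up registers, assuming w's nine elements are contiguous at [x0]):

        ldr     d0, [x0]       // w[0]
        ldr     d1, [x0, 8]    // w[1]
        ldr     d2, [x0, 16]   // w[2]
        ...
        ldr     d8, [x0, 64]   // w[8]

whereas the order in .optimized above (w[0], w[3], w[6], w[1], ...)
jumps by 24 bytes between consecutive loads.
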

Thanks,
Prathamesh


>
> Richard.
>
> > Thanks,
> > Prathamesh
> >
> >
> > >
> > > Richard.
> > >
> > > > Thanks,
> > > > Prathamesh
> > > > >
> > > > > Richard.
> > > > >
> > > > > > Thanks,
> > > > > > Prathamesh
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > Note the difficulty to capture "distance" is that the distance is simply not
> > > > > > > > > available at this point - it is the anticipated values from the successors
> > > > > > > > > that do _not_ compute the value itself that are the issue.  To capture
> > > > > > > > > "distance" you'd need to somehow "age" anticipated value when
> > > > > > > > > propagating them upwards during compute_antic (where it then
> > > > > > > > > would also apply to PRE in general).
> > > > > > > > Yes, indeed.  As a hack, would it make sense to avoid inserting an
> > > > > > > > expression in the block,
> > > > > > > > if it's ANTIC in post-dom block as a trade-off between extending live
> > > > > > > > range and hoisting
> > > > > > > > if the "distance" between block and post-dom is "too far" ? In
> > > > > > > > general, as you point out, we'd need to compute,
> > > > > > > > distance info for successors block during compute_antic, but special
> > > > > > > > casing for post-dom should be easy enough
> > > > > > > > during do_hoist_insertion, and hoisting an expr that is ANTIC in
> > > > > > > > post-dom could be potentially "long range", if the region is large.
> > > > > > > > It's still a coarse heuristic tho. I tried it in the attached patch.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Prathamesh
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > As with all other heuristics the only place one could do hackish attempts
> > > > > > > > > with at least some reasoning is the elimination phase where
> > > > > > > > > we make use of the (hoist) insertions - of course for hoisting we already
> > > > > > > > > know we'll get the "close" use in one of the successors so I fear even
> > > > > > > > > there it will be impossible to do something sensible.
> > > > > > > > >
> > > > > > > > > Richard.
> > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Prathamesh
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Prathamesh
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Prathamesh
> > > > > > > > > > > > >
> > > > > > > > > > > > > Alexander

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: LTO slows down calculix by more than 10% on aarch64
  2020-10-21 10:03                                     ` Prathamesh Kulkarni
@ 2020-10-21 10:39                                       ` Richard Biener
  2020-10-28  6:55                                         ` Prathamesh Kulkarni
  0 siblings, 1 reply; 25+ messages in thread
From: Richard Biener @ 2020-10-21 10:39 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: Alexander Monakov, GCC Development

On Wed, Oct 21, 2020 at 12:04 PM Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Thu, 24 Sep 2020 at 16:44, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> > On Thu, Sep 24, 2020 at 12:36 PM Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> > >
> > > On Wed, 23 Sep 2020 at 16:40, Richard Biener <richard.guenther@gmail.com> wrote:
> > > >
> > > > On Wed, Sep 23, 2020 at 12:11 PM Prathamesh Kulkarni
> > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > >
> > > > > On Wed, 23 Sep 2020 at 13:22, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > >
> > > > > > On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni
> > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > >
> > > > > > > On Tue, 22 Sep 2020 at 16:36, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I obtained perf stat results for following benchmark runs:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -O2:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >     7856832.692380      task-clock (msec)         #    1.000 CPUs utilized
> > > > > > > > > > > > > > >               3758               context-switches          #    0.000 K/sec
> > > > > > > > > > > > > > >                 40                 cpu-migrations             #    0.000 K/sec
> > > > > > > > > > > > > > >              40847              page-faults                   #    0.005 K/sec
> > > > > > > > > > > > > > >      7856782413676      cycles                           #    1.000 GHz
> > > > > > > > > > > > > > >      6034510093417      instructions                   #    0.77  insn per cycle
> > > > > > > > > > > > > > >       363937274287       branches                       #   46.321 M/sec
> > > > > > > > > > > > > > >        48557110132       branch-misses                #   13.34% of all branches
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > > > > > > > > > > > enough for this kind of code)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >     8319643.114380      task-clock (msec)       #    1.000 CPUs utilized
> > > > > > > > > > > > > > >               4285               context-switches         #    0.001 K/sec
> > > > > > > > > > > > > > >                 28                 cpu-migrations            #    0.000 K/sec
> > > > > > > > > > > > > > >              40843              page-faults                  #    0.005 K/sec
> > > > > > > > > > > > > > >      8319591038295      cycles                          #    1.000 GHz
> > > > > > > > > > > > > > >      6276338800377      instructions                  #    0.75  insn per cycle
> > > > > > > > > > > > > > >       467400726106       branches                      #   56.180 M/sec
> > > > > > > > > > > > > > >        45986364011        branch-misses              #    9.84% of all branches
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > > > > > > > > > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > > > > > > > > > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >    8207331.088040      task-clock (msec)   #    1.000 CPUs utilized
> > > > > > > > > > > > > > >               2266               context-switches    #    0.000 K/sec
> > > > > > > > > > > > > > >                 32                 cpu-migrations       #    0.000 K/sec
> > > > > > > > > > > > > > >              40846              page-faults             #    0.005 K/sec
> > > > > > > > > > > > > > >      8207292032467      cycles                     #   1.000 GHz
> > > > > > > > > > > > > > >      6035724436440      instructions             #    0.74  insn per cycle
> > > > > > > > > > > > > > >       364415440156       branches                 #   44.401 M/sec
> > > > > > > > > > > > > > >        53138327276        branch-misses         #   14.58% of all branches
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > > > > > > > > > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > > > > > > > > > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > > > > > > > > > > > not very complicated (though Fortran array addressing...).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >    7797265.206850      task-clock (msec)         #    1.000 CPUs utilized
> > > > > > > > > > > > > > >               3139              context-switches          #    0.000 K/sec
> > > > > > > > > > > > > > >                 20                cpu-migrations             #    0.000 K/sec
> > > > > > > > > > > > > > >              40846              page-faults                  #    0.005 K/sec
> > > > > > > > > > > > > > >      7797221351467      cycles                          #    1.000 GHz
> > > > > > > > > > > > > > >      6187348757324      instructions                  #    0.79  insn per cycle
> > > > > > > > > > > > > > >       461840800061       branches                      #   59.231 M/sec
> > > > > > > > > > > > > > >        26920311761        branch-misses             #    5.83% of all branches
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > > > > > > > > > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > > > > > > > > > > > in insn count).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > > > > > > > > > > > consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > > > > > > > > > > > each iterating only 3 times).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Perf profiles for
> > > > > > > > > > > > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >           3196866 |1f04:    ldur   d1, [x1, #-248]
> > > > > > > > > > > > > > > 216348301800│            add    w0, w0, #0x1
> > > > > > > > > > > > > > >             985098 |            add    x2, x2, #0x18
> > > > > > > > > > > > > > > 216215999206│            add    x1, x1, #0x48
> > > > > > > > > > > > > > > 215630376504│            fmul   d1, d5, d1
> > > > > > > > > > > > > > > 863829148015│            fmul   d1, d1, d6
> > > > > > > > > > > > > > > 864228353526│            fmul   d0, d1, d0
> > > > > > > > > > > > > > > 864568163014│            fmadd  d2, d0, d16, d2
> > > > > > > > > > > > > > >                         │             cmp    w0, #0x4
> > > > > > > > > > > > > > > 216125427594│          ↓ b.eq   1f34
> > > > > > > > > > > > > > >         15010377│             ldur   d0, [x2, #-8]
> > > > > > > > > > > > > > > 143753737468│          ↑ b      1f04
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -O2 with inlined orthonl:
> > > > > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > > > > > > > > > > > > > 144055883055│            add    w0, w0, #0x1
> > > > > > > > > > > > > > >   72262104254│            add    x2, x2, #0x18
> > > > > > > > > > > > > > > 143991169721│            add    x1, x1, #0x48
> > > > > > > > > > > > > > > 288648917780│            fmul   d15, d17, d15
> > > > > > > > > > > > > > > 864665644756│            fmul   d15, d15, d18
> > > > > > > > > > > > > > > 863868426387│            fmul   d14, d15, d14
> > > > > > > > > > > > > > > 865228159813│            fmadd  d16, d14, d31, d16
> > > > > > > > > > > > > > >             245967│            cmp    w0, #0x4
> > > > > > > > > > > > > > > 215396760545│         ↓ b.eq   1f28
> > > > > > > > > > > > > > >       704732365│            ldur   d14, [x2, #-8]
> > > > > > > > > > > > > > > 143775979620│         ↑ b      1ef8
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > High count on the initial ldur instruction could be explained if the loop
> > > > > > > > > > > > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > > > > > > > > > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > > > > > > > > > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > > > > > > > > > > > assembly.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > It may also be possible to check if code alignment matters, by compiling with
> > > > > > > > > > > > > > -falign-loops=32.
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > Thanks a lot for the detailed feedback, and I am sorry for late response.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The hoisting region is:
> > > > > > > > > > > > > if(mattyp.eq.1) then
> > > > > > > > > > > > >   4 loops
> > > > > > > > > > > > > elseif(mattyp.eq.2) then
> > > > > > > > > > > > >   {
> > > > > > > > > > > > >      orthonl inlined into basic block;
> > > > > > > > > > > > >      loads w[0] .. w[8]
> > > > > > > > > > > > >   }
> > > > > > > > > > > > > else
> > > > > > > > > > > > >    6 loops  // load anisox
> > > > > > > > > > > > >
> > > > > > > > > > > > > followed by basic block:
> > > > > > > > > > > > >
> > > > > > > > > > > > >  senergy=
> > > > > > > > > > > > >      &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > > > > > > > > > >      &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > > > > > > > > > >      &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > > > > > > > > > >                      s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > > > > > > > > > >                      s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > > > > > > > > > >                      s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block,
> > > > > > > > > > > > > right in block 181, which is:
> > > > > > > > > > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > > > > > > > > > >
> > > > > > > > > > > > > which is then further hoisted to block 173:
> > > > > > > > > > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > > > > > > > > > >
> > > > > > > > > > > > > From block 181, we have two paths towards senergy block (bb 194):
> > > > > > > > > > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > > > > > > > > > AND
> > > > > > > > > > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > > > > > > > > > > which has a path length of around 18 blocks.
> > > > > > > > > > > > > (bb 194 post-dominates bb 181 and bb 173).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Disabling only load hoisting within blocks 173 and 181
> > > > > > > > > > > > > (simply avoiding inserting pre_expr if pre_expr->kind == REFERENCE)
> > > > > > > > > > > > > avoids hoisting of the 'w' array and brings back most of the
> > > > > > > > > > > > > performance, which suggests that it is the hoisting of the
> > > > > > > > > > > > > 'w' array (w[0] ... w[8]) that is causing the slowdown.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I obtained perf profiles for full hoisting, and disabled hoisting of
> > > > > > > > > > > > > 'w' array for the 6 loops, and the most drastic difference was
> > > > > > > > > > > > > for ldur instruction:
> > > > > > > > > > > > >
> > > > > > > > > > > > > With full hoisting:
> > > > > > > > > > > > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > > > > > > > > > > >
> > > > > > > > > > > > > Without full hoisting:
> > > > > > > > > > > > > 3441224 │1edc:   ldur   d1, [x1, #-248]
> > > > > > > > > > > > >
> > > > > > > > > > > > > (The loop entry seems to be fall thru in both cases. I have attached
> > > > > > > > > > > > > profiles for both cases).
> > > > > > > > > > > > >
> > > > > > > > > > > > > IIUC, the instruction seems to be loading the first element from the
> > > > > > > > > > > > > anisox array, which makes me wonder if the issue is a data-cache miss
> > > > > > > > > > > > > in the slower version.
> > > > > > > > > > > > > I ran perf script on perf data for L1-dcache-load-misses with period =
> > > > > > > > > > > > > 1 million, and it reported two cache misses on the ldur instruction in
> > > > > > > > > > > > > the full hoisting case, while it reported zero for the disabled load
> > > > > > > > > > > > > hoisting case.
> > > > > > > > > > > > > So I wonder if the slowdown happens because hoisting of the 'w' array
> > > > > > > > > > > > > possibly results in eviction of anisox, thus causing a cache miss
> > > > > > > > > > > > > inside the inner loop and making the load slower?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hoisting also seems to reduce the overall number of cache misses,
> > > > > > > > > > > > > though. For the disabled hoisting of 'w' case, there were a total of
> > > > > > > > > > > > > 463 cache misses, while with full hoisting there were 357 cache misses
> > > > > > > > > > > > > (with period = 1 million).
> > > > > > > > > > > > > Does that happen because hoisting probably reduces cache misses along
> > > > > > > > > > > > > the orthonl path (bb 173 -> bb 181 -> bb 182 -> bb 194)?
> > > > > > > > > > > > Hi,
> > > > > > > > > > > > In general I feel for this or PR80155 case, the issues come with long
> > > > > > > > > > > > range hoistings, inside a large CFG, since we don't have an accurate
> > > > > > > > > > > > way to model target resources (register pressure in PR80155 case / or
> > > > > > > > > > > > possibly cache pressure in this case?) at tree level and we end up
> > > > > > > > > > > > with register spill or cache miss inside loops, which may offset the
> > > > > > > > > > > > benefit of hoisting. As previously discussed the right way to go is a
> > > > > > > > > > > > live range splitting pass, at GIMPLE -> RTL border which can also help
> > > > > > > > > > > > with other code-movement optimizations (or if the source had variables
> > > > > > > > > > > > with long live ranges).
> > > > > > > > > > > >
> > > > > > > > > > > > I was wondering, though, as a cheap workaround: would it make sense to
> > > > > > > > > > > > check if we are hoisting across a "large" region of nested loops, and
> > > > > > > > > > > > avoid it in that case, since hoisting may exert resource pressure
> > > > > > > > > > > > inside the loop region? (Especially in the cases where hoisted
> > > > > > > > > > > > expressions were not originally AVAIL in any of the loop blocks, and
> > > > > > > > > > > > the loop region doesn't benefit from hoisting.)
> > > > > > > > > > > >
> > > > > > > > > > > > For instance:
> > > > > > > > > > > > FOR_EACH_EDGE (e, ei, block)
> > > > > > > > > > > >   {
> > > > > > > > > > > >     /* Avoid hoisting across more than 3 nested loops */
> > > > > > > > > > > >     if (e->dest is a loop pre-header or loop header
> > > > > > > > > > > >         && nesting depth of loop is > 3)
> > > > > > > > > > > >      return false;
> > > > > > > > > > > >   }
> > > > > > > > > > > >
> > > > > > > > > > > > I think this would work for resolving the calculix issue because it
> > > > > > > > > > > > hoists across one region of 6 loops and another of 4 loops (didn't
> > > > > > > > > > > > test yet).
> > > > > > > > > > > > It's not bulletproof in that it will miss detecting cases where the
> > > > > > > > > > > > loop header (or pre-header) isn't a successor of the candidate block
> > > > > > > > > > > > (checking for that might get expensive, though?). I will test it on
> > > > > > > > > > > > the gcc suite and SPEC for any regressions.
> > > > > > > > > > > > Does this sound like a reasonable heuristic ?
> > > > > > > > > > > Hi,
> > > > > > > > > > > The attached patch implements the above heuristic.
> > > > > > > > > > > Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> > > > > > > > > > > And it brings back most of the performance for calculix, on par with O2
> > > > > > > > > > > (without inlining orthonl).
> > > > > > > > > > > I verified that with the patch there is no cache miss happening on the
> > > > > > > > > > > load insn inside the loop
> > > > > > > > > > > (with perf report -e L1-dcache-load-misses/period=1000000/).
> > > > > > > > > > >
> > > > > > > > > > > I am in the process of benchmarking the patch on aarch64 for SPEC for
> > > > > > > > > > > speed and will report numbers
> > > > > > > > > > > in a couple of days. (If required, we could parametrize the number of
> > > > > > > > > > > nested loops, hardcoded (arbitrarily) to 3 in this patch,
> > > > > > > > > > > and set it in the backend so as to not affect other targets.)
> > > > > > > > > >
> > > > > > > > > > I don't think this patch captures the case in a sensible way - it will simply
> > > > > > > > > > never hoist computations out of loop header blocks with depth > 3 which
> > > > > > > > > > is certainly not what you want.  Also the pre-header check is odd - we're
> > > > > > > > > > looking for computations in successors of BLOCK but clearly a computation
> > > > > > > > > > in a pre-header is not at the same loop level as one in the header itself.
> > > > > > > > > Well, my intent was to check if we are hoisting across a region
> > > > > > > > > that has multiple nested loops, and in that case avoid hoisting
> > > > > > > > > expressions that do not belong to any loop blocks, because that may
> > > > > > > > > increase resource pressure inside the loops. For instance, in the
> > > > > > > > > calculix issue we hoist the 'w' array from the post-dom, and neither
> > > > > > > > > loop region has any uses of 'w'.  I agree checking just for loop level
> > > > > > > > > is too coarse.
> > > > > > > > > The check with the pre-header was essentially for the same purpose, to
> > > > > > > > > see if we are hoisting across a loop region,
> > > > > > > > > not necessarily from within the loops.
> > > > > > > >
> > > > > > > > But it will fail to hoist *p in
> > > > > > > >
> > > > > > > >    if (x)
> > > > > > > >      {
> > > > > > > >         a = *p;
> > > > > > > >      }
> > > > > > > >   else
> > > > > > > >     {
> > > > > > > >        b = *p;
> > > > > > > >     }
> > > > > > > >
> > > > > > > > <make distance large>
> > > > > > > > pdom:
> > > > > > > >   c = *p;
> > > > > > > >
> > > > > > > > so it isn't what matters either.  What happens at the immediate post-dominator
> > > > > > > > isn't directly relevant - what matters would be if the pdom is the one making
> > > > > > > > the value antic on one of the outgoing edges.  In that case we're also going
> > > > > > > > to PRE *p into the arm not computing *p (albeit in a later position).  But
> > > > > > > > that property is impossible to compute from the sets itself (not even mentioning
> > > > > > > > the arbitrary CFG that can be inbetween the block and its pdom or the weird
> > > > > > > > pdoms we compute for regions not having a path to exit, like infinite loops).
> > > > > > > >
> > > > > > > > You could eventually look at the pdom predecessors and if *p is not AVAIL_OUT
> > > > > > > > in each of them we _might_ have the situation you want to protect against.
> > > > > > > > But as said PRE insertion will likely have made sure it _is_ AVAIL_OUT in each
> > > > > > > > of them ...
> > > > > > > Hi Richard,
> > > > > > > Thanks for the suggestions. Right, the issue here seems to be that the
> > > > > > > post-dom block is making expressions ANTIC. Before doing insert, could
> > > > > > > we simply copy the AVAIL_OUT set of each block into another set, say
> > > > > > > ORIG_AVAIL_OUT, as a guard against PRE eventually inserting expressions
> > > > > > > in pred blocks of pdom and making them available?
> > > > > > > And during hoisting, we could check if each expr that is ANTIC_IN
> > > > > > > (pdom) is ORIG_AVAIL_OUT in each pred of pdom,
> > > > > > > if the distance is "large".
> > > > > >
> > > > > > Did you try if it works w/o copying AVAIL_OUT?  Because AVAIL_OUT is
> > > > > > very large (it's actually quadratic in size of the CFG * # values), in
> > > > > > particular
> > > > > > we're inserting in RPO and update AVAIL_OUT only up to the current block
> > > > > > (from dominators) so the PDOM block should have the original AVAIL_OUT
> > > > > > (but from the last iteration - we do iterate insertion).
> > > > > >
> > > > > > Note I'm still not happy with pulling off this kind of heuristics.
> > > > > > What the suggestion
> > > > > > means is that for
> > > > > >
> > > > > >    if (x)
> > > > > >      y = *p;
> > > > > >    z = *p;
> > > > > >
> > > > > > we'll end up with
> > > > > >
> > > > > >   if (x)
> > > > > >     y = *p;
> > > > > >   else
> > > > > >     z = *p;
> > > > > >
> > > > > > instead of
> > > > > >
> > > > > >    tem = *p;
> > > > > >    if (x)
> > > > > >     y = tem;
> > > > > >   else
> > > > > >     z = tem;
> > > > > >
> > > > > > that is, we get the runtime benefit but not the code-size one
> > > > > > (hoisting mostly helps code size plus allows if-conversion as followup
> > > > > > in some cases).  Now, if we iterate (like if we'd have a second hoisting pass)
> > > > > > then the above would still cause hoisting - so the heuristic isn't sustainable.
> > > > > > Again, sth like "distance" isn't really available.
> > > > > Hi Richard,
> > > > > It doesn't work without copying AVAIL_OUT.
> > > > > I tried it on a small example with the attached patch:
> > > > >
> > > > > int foo(int cond, int x, int y)
> > > > > {
> > > > >   int t;
> > > > >   void f(int);
> > > > >
> > > > >   if (cond)
> > > > >     t = x + y;
> > > > >   else
> > > > >     t = x - y;
> > > > >
> > > > >   f (t);
> > > > >   int t2 = (x + y) * 10;
> > > > >   return t2;
> > > > > }
> > > > >
> > > > > By intersecting availout_in_some with AVAIL_OUT of the preds of pdom,
> > > > > it does not hoist in the first pass, but then PRE inserts x + y in the
> > > > > "else" block, and we eventually hoist before if (cond). Similarly for
> > > > > the e_c3d hoistings in calculix.
> > > > >
> > > > > IIUC, we want hoisting to be:
> > > > > (ANTIC_IN (block) intersect AVAIL_OUT (preds of pdom)) - AVAIL_OUT (block)
> > > > > to ensure that hoisted expressions are available along all paths from
> > > > > block to post-dom? If copying AVAIL_OUT sets is too expensive, could we
> > > > > keep another set that precomputes the intersection of AVAIL_OUT of the
> > > > > pdom preds for each block and then use this info during hoisting?
> > > > >
> > > > > For computing "distance", I implemented a simple DFS walk from block
> > > > > to post-dom that gives up if the depth crosses a
> > > > > threshold before reaching the post-dom. I am not sure, though, how
> > > > > expensive that can get.
> > > >
> > > > As written it is quadratic in the CFG size.
> > > >
> > > > You can optimize away the
> > > >
> > > > +    FOR_EACH_EDGE (e, ei, pdom_bb->preds)
> > > > +      bitmap_and_into (&availout_in_some, &AVAIL_OUT (e->src)->values);
> > > >
> > > > loop if the intersection of availout_in_some and ANTIC_IN (pdom) is empty.
> > > >
> > > > As said, I don't think this is the way to go - trying to avoid code
> > > > hoisting isn't
> > > > what we'd want to do - your quoted assembly instead points to a loop
> > > > with a non-empty latch which is usually caused by PRE and avoided with -O3
> > > > because it also harms vectorization.
> > > But disabling PRE (which removes the non-empty latch) only results in a
> > > marginal performance improvement,
> > > while disabling hoisting of the 'w' array, with the non-empty latch
> > > kept, seems to regain most of the performance.
> > > AFAIU, that was happening because after disabling hoisting of 'w',
> > > there was no cache miss (as seen with perf -e
> > > L1-dcache-load-misses)
> > > on the load instruction inside the inner loop.
> >
> > But that doesn't make much sense then.  If code generation isn't
> > an issue I don't see how the hoisted loads should cause a L1
> > dcache load miss for data that is accessed in the respective loop
> > as well (though not hoisted from it since at -O2 not sufficiently
> > unrolled)
> Hi Richard,
> I am very sorry for the late response; I was away from work for some
> personal commitments.
> Yes, I agree this doesn't seem to make much sense, but I am
> consistently seeing L1 dcache load misses, which go away
> after disabling hoisting of 'w'. I am not sure, though, why this happens.
> Also, the load instruction is the only one that shows the most
> significant performance difference across several runs. Or maybe I am
> interpreting the results incorrectly.
> Do you have suggestions for any benchmarking experiment I could try?

I think you want to edit the bad assembly manually and try a few things,
like ordering the hoisted loads and then re-measure.  Unfortunately
modern CPU pipelines have no easy way to tell us why they're unhappy.
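
Concretely, something along these lines (a sketch; file and binary
names are placeholders for the actual SPEC build):

  # get assembly for the hot file
  gfortran -O2 -S e_c3d.f -o e_c3d.s
  # hand-edit e_c3d.s: reorder the nine hoisted w() loads into
  # ascending address order, keeping the register assignment intact
  gfortran -c e_c3d.s -o e_c3d.o
  # relink and re-measure
  gfortran *.o -o calculix
  perf stat -e cycles,instructions,L1-dcache-load-misses ./calculix <args>
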

> >
> > > For the pdom heuristic, I guess we cannot copy AVAIL_OUT sets per
> > > node, since that's quadratic in terms of CFG size.
> > > Would it make sense to break the interaction between PRE and hoisting,
> > > only for the case of inserting into the preds of pdom?
> > > I tried doing that in the attached patch, where insert runs in two
> > > phases:
> > > (a) PRE and hoisting, where hoisting marks blocks to not do PRE for.
> > > (b) A second phase, which only runs PRE on all blocks.
> > > This (expectedly) regresses ssa-hoist-3.c.
> > >
> > > If the heuristic isn't acceptable, I suppose encoding the distance of
> > > each expr within the ANTIC sets
> > > during compute_antic would be the right way to fix this?
> > > So ANTIC_IN (block) would contain the anticipated expressions, and for
> > > each antic expr, the "distance" from the furthest block
> > > it's computed in? Could you please elaborate a bit on how we could go
> > > about encoding distance during compute_antic?
> >
> > But the distance in this case is just one CFG node ... we have
> >
> >  if (mattyp.eq.1)
> >    ... use of w but not with constant indices
> >  else if (mattyp.eq.2)
> >    .. inlined orthonl with constant index w() accesses, single BB
> >  else
> >    ... use of w but not with constant indices - the actual relevant
> > loop of calculix
> >  endif
> >  ... constant index w() accesses, single BB
> >
> > so the CFG distance is one node - unless you want to compute the
> > maximum distance?  Btw, I only see 9 loads hoisted.
> >
> > I'm not sure how relevant -O2 -flto SPEC performance is for a FP benchmark.
> >
> > And indeed this case is exactly one where hoisting is superior to
> > PRE which would happily insert the 9 loads into the two variable-access
> > predecessors to get rid of the redundancy wrt the mattyp.eq.1 path.
> >
> > In .optimized I see
> >
> >   pretmp_5573 = w[0];
> >   pretmp_5574 = w[3];
> >   pretmp_5575 = w[6];
> >   pretmp_5576 = w[1];
> >   pretmp_5577 = w[4];
> >   pretmp_5578 = w[7];
> >   pretmp_5579 = w[2];
> >   pretmp_5580 = w[5];
> >   pretmp_5581 = w[8];
> >   if (mattyp.157_742 == 1)
> >
> > I do remember talks/patches about ordering of such sequences of loads
> > to make them prefetch-happier.  Are the loads actually emitted in-order
> > for arm?  Thus w[0]...w[8] rather than as seen above with some random
> > permutes inbetween?  On x86 they are emitted in random order
> > (they are also spilled immediately).
> On aarch64, they are emitted in random order as well.
>
> Thanks,
> Prathamesh
>
>
> >
> > Richard.
> >
> > > Thanks,
> > > Prathamesh
> > >
> > >
> > > >
> > > > Richard.
> > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > > >
> > > > > > Richard.
> > > > > >
> > > > > > > Thanks,
> > > > > > > Prathamesh
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Note the difficulty to capture "distance" is that the distance is simply not
> > > > > > > > > > available at this point - it is the anticipated values from the successors
> > > > > > > > > > that do _not_ compute the value itself that are the issue.  To capture
> > > > > > > > > > "distance" you'd need to somehow "age" anticipated value when
> > > > > > > > > > propagating them upwards during compute_antic (where it then
> > > > > > > > > > would also apply to PRE in general).
> > > > > > > > > Yes, indeed.  As a hack, would it make sense to avoid inserting an
> > > > > > > > > expression in the block if it's ANTIC in the post-dom block and the
> > > > > > > > > "distance" between the block and the post-dom is "too far", as a
> > > > > > > > > trade-off between extending the live range and hoisting ? In general,
> > > > > > > > > as you point out, we'd need to compute distance info for successor
> > > > > > > > > blocks during compute_antic, but special-casing the post-dom should be
> > > > > > > > > easy enough during do_hoist_insertion, and hoisting an expr that is
> > > > > > > > > ANTIC in the post-dom could potentially be "long range" if the region
> > > > > > > > > is large.
> > > > > > > > > It's still a coarse heuristic though. I tried it in the attached patch.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Prathamesh
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > As with all other heuristics the only place one could do hackish attempts
> > > > > > > > > > with at least some reasoning is the elimination phase where
> > > > > > > > > > we make use of the (hoist) insertions - of course for hoisting we already
> > > > > > > > > > know we'll get the "close" use in one of the successors so I fear even
> > > > > > > > > > there it will be impossible to do something sensible.
> > > > > > > > > >
> > > > > > > > > > Richard.
> > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Prathamesh
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Prathamesh
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Prathamesh
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Prathamesh
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Alexander


* Re: LTO slows down calculix by more than 10% on aarch64
  2020-10-21 10:39                                       ` Richard Biener
@ 2020-10-28  6:55                                         ` Prathamesh Kulkarni
  0 siblings, 0 replies; 25+ messages in thread
From: Prathamesh Kulkarni @ 2020-10-28  6:55 UTC (permalink / raw)
  To: Richard Biener; +Cc: Alexander Monakov, GCC Development

On Wed, 21 Oct 2020 at 16:10, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Wed, Oct 21, 2020 at 12:04 PM Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Thu, 24 Sep 2020 at 16:44, Richard Biener <richard.guenther@gmail.com> wrote:
> > >
> > > On Thu, Sep 24, 2020 at 12:36 PM Prathamesh Kulkarni
> > > <prathamesh.kulkarni@linaro.org> wrote:
> > > >
> > > > On Wed, 23 Sep 2020 at 16:40, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > >
> > > > > On Wed, Sep 23, 2020 at 12:11 PM Prathamesh Kulkarni
> > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > >
> > > > > > On Wed, 23 Sep 2020 at 13:22, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > > >
> > > > > > > On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni
> > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > >
> > > > > > > > On Tue, 22 Sep 2020 at 16:36, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I obtained perf stat results for following benchmark runs:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -O2:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >     7856832.692380      task-clock (msec)         #    1.000 CPUs utilized
> > > > > > > > > > > > > > > >               3758               context-switches          #    0.000 K/sec
> > > > > > > > > > > > > > > >                 40                 cpu-migrations             #    0.000 K/sec
> > > > > > > > > > > > > > > >              40847              page-faults                   #    0.005 K/sec
> > > > > > > > > > > > > > > >      7856782413676      cycles                           #    1.000 GHz
> > > > > > > > > > > > > > > >      6034510093417      instructions                   #    0.77  insn per cycle
> > > > > > > > > > > > > > > >       363937274287       branches                       #   46.321 M/sec
> > > > > > > > > > > > > > > >        48557110132       branch-misses                #   13.34% of all branches
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > > > > > > > > > > > > enough for this kind of code)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >     8319643.114380      task-clock (msec)       #    1.000 CPUs utilized
> > > > > > > > > > > > > > > >               4285               context-switches         #    0.001 K/sec
> > > > > > > > > > > > > > > >                 28                 cpu-migrations            #    0.000 K/sec
> > > > > > > > > > > > > > > >              40843              page-faults                  #    0.005 K/sec
> > > > > > > > > > > > > > > >      8319591038295      cycles                          #    1.000 GHz
> > > > > > > > > > > > > > > >      6276338800377      instructions                  #    0.75  insn per cycle
> > > > > > > > > > > > > > > >       467400726106       branches                      #   56.180 M/sec
> > > > > > > > > > > > > > > >        45986364011        branch-misses              #    9.84% of all branches
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > > > > > > > > > > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > > > > > > > > > > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >    8207331.088040      task-clock (msec)   #    1.000 CPUs utilized
> > > > > > > > > > > > > > > >               2266               context-switches    #    0.000 K/sec
> > > > > > > > > > > > > > > >                 32                 cpu-migrations       #    0.000 K/sec
> > > > > > > > > > > > > > > >              40846              page-faults             #    0.005 K/sec
> > > > > > > > > > > > > > > >      8207292032467      cycles                     #   1.000 GHz
> > > > > > > > > > > > > > > >      6035724436440      instructions             #    0.74  insn per cycle
> > > > > > > > > > > > > > > >       364415440156       branches                 #   44.401 M/sec
> > > > > > > > > > > > > > > >        53138327276        branch-misses         #   14.58% of all branches
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > > > > > > > > > > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > > > > > > > > > > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > > > > > > > > > > > > not very complicated (though Fortran array addressing...).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >    7797265.206850      task-clock (msec)         #    1.000 CPUs utilized
> > > > > > > > > > > > > > > >               3139              context-switches          #    0.000 K/sec
> > > > > > > > > > > > > > > >                 20                cpu-migrations             #    0.000 K/sec
> > > > > > > > > > > > > > > >              40846              page-faults                  #    0.005 K/sec
> > > > > > > > > > > > > > > >      7797221351467      cycles                          #    1.000 GHz
> > > > > > > > > > > > > > > >      6187348757324      instructions                  #    0.79  insn per cycle
> > > > > > > > > > > > > > > >       461840800061       branches                      #   59.231 M/sec
> > > > > > > > > > > > > > > >        26920311761        branch-misses             #    5.83% of all branches
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > > > > > > > > > > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > > > > > > > > > > > > in insn count).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > > > > > > > > > > > > consider how the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > > > > > > > > > > > > each iterating only 3 times).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Perf profiles for
> > > > > > > > > > > > > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >           3196866 |1f04:    ldur   d1, [x1, #-248]
> > > > > > > > > > > > > > > > 216348301800│            add    w0, w0, #0x1
> > > > > > > > > > > > > > > >             985098 |            add    x2, x2, #0x18
> > > > > > > > > > > > > > > > 216215999206│            add    x1, x1, #0x48
> > > > > > > > > > > > > > > > 215630376504│            fmul   d1, d5, d1
> > > > > > > > > > > > > > > > 863829148015│            fmul   d1, d1, d6
> > > > > > > > > > > > > > > > 864228353526│            fmul   d0, d1, d0
> > > > > > > > > > > > > > > > 864568163014│            fmadd  d2, d0, d16, d2
> > > > > > > > > > > > > > > >                         │             cmp    w0, #0x4
> > > > > > > > > > > > > > > > 216125427594│          ↓ b.eq   1f34
> > > > > > > > > > > > > > > >         15010377│             ldur   d0, [x2, #-8]
> > > > > > > > > > > > > > > > 143753737468│          ↑ b      1f04
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -O2 with inlined orthonl:
> > > > > > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > > > > > > > > > > > > > > 144055883055│            add    w0, w0, #0x1
> > > > > > > > > > > > > > > >   72262104254│            add    x2, x2, #0x18
> > > > > > > > > > > > > > > > 143991169721│            add    x1, x1, #0x48
> > > > > > > > > > > > > > > > 288648917780│            fmul   d15, d17, d15
> > > > > > > > > > > > > > > > 864665644756│            fmul   d15, d15, d18
> > > > > > > > > > > > > > > > 863868426387│            fmul   d14, d15, d14
> > > > > > > > > > > > > > > > 865228159813│            fmadd  d16, d14, d31, d16
> > > > > > > > > > > > > > > >             245967│            cmp    w0, #0x4
> > > > > > > > > > > > > > > > 215396760545│         ↓ b.eq   1f28
> > > > > > > > > > > > > > > >       704732365│            ldur   d14, [x2, #-8]
> > > > > > > > > > > > > > > > 143775979620│         ↑ b      1ef8
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > High count on the initial ldur instruction could be explained if the loop
> > > > > > > > > > > > > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > > > > > > > > > > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > > > > > > > > > > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > > > > > > > > > > > > assembly.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > It may also be possible to check if code alignment matters, by compiling with
> > > > > > > > > > > > > > > -falign-loops=32.
> > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > Thanks a lot for the detailed feedback, and I am sorry for late response.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The hoisting region is:
> > > > > > > > > > > > > > if(mattyp.eq.1) then
> > > > > > > > > > > > > >   4 loops
> > > > > > > > > > > > > > elseif(mattyp.eq.2) then
> > > > > > > > > > > > > >   {
> > > > > > > > > > > > > >      orthonl inlined into basic block;
> > > > > > > > > > > > > >      loads w[0] .. w[8]
> > > > > > > > > > > > > >   }
> > > > > > > > > > > > > > else
> > > > > > > > > > > > > >    6 loops  // load anisox
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > followed by basic block:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >  senergy=
> > > > > > > > > > > > > >      &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > > > > > > > > > > >      &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > > > > > > > > > > >      &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > > > > > > > > > > >                      s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > > > > > > > > > > >                      s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > > > > > > > > > > >                      s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hoisting hoists the loads w[0] .. w[8] from the orthonl and senergy
> > > > > > > > > > > > > > blocks into block 181, which is:
> > > > > > > > > > > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > which is then further hoisted to block 173:
> > > > > > > > > > > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > From block 181, we have two paths towards senergy block (bb 194):
> > > > > > > > > > > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > > > > > > > > > > AND
> > > > > > > > > > > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > > > > > > > > > > > which has a path length of around 18 blocks.
> > > > > > > > > > > > > > (bb 194 post-dominates bb 181 and bb 173).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Disabling only load hoisting within blocks 173 and 181
> > > > > > > > > > > > > > (simply avoiding insertion of pre_expr if pre_expr->kind == REFERENCE)
> > > > > > > > > > > > > > avoids hoisting of the 'w' array and brings back most of the
> > > > > > > > > > > > > > performance, which verifies that it is the hoisting of the
> > > > > > > > > > > > > > 'w' array (w[0] ... w[8]) that is causing the slowdown.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I obtained perf profiles for full hoisting, and disabled hoisting of
> > > > > > > > > > > > > > 'w' array for the 6 loops, and the most drastic difference was
> > > > > > > > > > > > > > for ldur instruction:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > With full hoisting:
> > > > > > > > > > > > > > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Without full hoisting:
> > > > > > > > > > > > > > 3441224 │1edc:   ldur   d1, [x1, #-248]
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > (The loop entry seems to be fall thru in both cases. I have attached
> > > > > > > > > > > > > > profiles for both cases).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > IIUC, the instruction seems to be loading the first element from the
> > > > > > > > > > > > > > anisox array, which makes me wonder if the issue was a data-cache miss
> > > > > > > > > > > > > > in the slower version. I ran perf script on perf data for
> > > > > > > > > > > > > > L1-dcache-load-misses with period = 1 million, and it reported two
> > > > > > > > > > > > > > cache misses on the ldur instruction in the full-hoisting case, while
> > > > > > > > > > > > > > it reported zero for the disabled load hoisting case.
> > > > > > > > > > > > > > So I wonder if the slowdown happens because hoisting of the 'w' array
> > > > > > > > > > > > > > possibly results in eviction of anisox, causing a cache miss inside
> > > > > > > > > > > > > > the inner loop and making the load slower?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hoisting also seems to reduce the overall number of cache misses,
> > > > > > > > > > > > > > though. With hoisting of the 'w' array disabled, there were a total of
> > > > > > > > > > > > > > 463 cache misses, while with full hoisting there were 357 cache misses
> > > > > > > > > > > > > > (with period = 1 million).
> > > > > > > > > > > > > > Does that happen because hoisting probably reduces cache misses along
> > > > > > > > > > > > > > the orthonl path (bb 173 -> bb 181 -> bb 182 -> bb 194)?
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > In general I feel that for this case, or the PR80155 case, the issues
> > > > > > > > > > > > > come with long-range hoistings inside a large CFG, since we don't have
> > > > > > > > > > > > > an accurate way to model target resources (register pressure in the
> > > > > > > > > > > > > PR80155 case, or possibly cache pressure in this case?) at the tree
> > > > > > > > > > > > > level, and we end up with register spills or cache misses inside
> > > > > > > > > > > > > loops, which may offset the benefit of hoisting. As previously
> > > > > > > > > > > > > discussed, the right way to go is a live-range splitting pass at the
> > > > > > > > > > > > > GIMPLE -> RTL border, which can also help with other code-movement
> > > > > > > > > > > > > optimizations (or when the source has variables with long live ranges).
> > > > > > > > > > > > >
> > > > > > > > > > > > > I was wondering though, as a cheap workaround, would it make sense to
> > > > > > > > > > > > > check if we are hoisting across a "large" region of nested loops, and
> > > > > > > > > > > > > avoid hoisting in that case, since it may exert resource pressure
> > > > > > > > > > > > > inside the loop region? (Especially in cases where the hoisted
> > > > > > > > > > > > > expressions were not originally AVAIL in any of the loop blocks, and
> > > > > > > > > > > > > the loop region doesn't benefit from hoisting.)
> > > > > > > > > > > > >
> > > > > > > > > > > > > For instance:
> > > > > > > > > > > > > FOR_EACH_EDGE (e, ei, block)
> > > > > > > > > > > > >   {
> > > > > > > > > > > > >     /* Avoid hoisting across more than 3 nested loops */
> > > > > > > > > > > > >     if (e->dest is a loop pre-header or loop header
> > > > > > > > > > > > >         && nesting depth of loop is > 3)
> > > > > > > > > > > > >      return false;
> > > > > > > > > > > > >   }
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think this would work for resolving the calculix issue because it
> > > > > > > > > > > > > hoists across one region of 6 loops and another of 4 loops (didn't
> > > > > > > > > > > > > test yet).
> > > > > > > > > > > > > It's not bulletproof in that it will miss cases where the loop header
> > > > > > > > > > > > > (or pre-header) isn't a successor of the candidate block (checking for
> > > > > > > > > > > > > that might get expensive though?). I will test it on the gcc testsuite
> > > > > > > > > > > > > and SPEC for any regressions.
> > > > > > > > > > > > > Does this sound like a reasonable heuristic ?
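For concreteness, a reading of the pseudocode above against GCC's CFG/loop
API might look as follows.  This is only an illustrative sketch, not the
patch under discussion: FOR_EACH_EDGE, bb_loop_header_p, single_succ_p and
loop_depth are existing helpers, but the exact pre-header test and the
threshold of 3 are assumptions.

  edge e;
  edge_iterator ei;
  FOR_EACH_EDGE (e, ei, block->succs)
    {
      basic_block dest = e->dest;
      /* Treat DEST as entering a loop if it is a loop header, or a
         pre-header whose single successor is a loop header.  */
      bool enters_loop
        = bb_loop_header_p (dest)
          || (single_succ_p (dest)
              && bb_loop_header_p (single_succ (dest)));
      /* Avoid hoisting across more than 3 nested loops.  For a
         pre-header, loop_father is the enclosing loop, so the depth is
         off by one; good enough for a sketch.  */
      if (enters_loop && loop_depth (dest->loop_father) > 3)
        return false;
    }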
> > > > > > > > > > > > Hi,
> > > > > > > > > > > > The attached patch implements the above heuristic.
> > > > > > > > > > > > Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> > > > > > > > > > > > And it brings back most of the performance for calculix, on par with
> > > > > > > > > > > > -O2 (without inlining orthonl).
> > > > > > > > > > > > I verified that with the patch there is no cache miss happening on the
> > > > > > > > > > > > load insn inside the loop
> > > > > > > > > > > > (with perf report -e L1-dcache-load-misses/period=1000000/).
> > > > > > > > > > > >
> > > > > > > > > > > > I am in the process of benchmarking the patch on aarch64 for SPEC
> > > > > > > > > > > > speed and will report numbers in a couple of days. (If required, we
> > > > > > > > > > > > could parametrize the number of nested loops, hardcoded (arbitrarily)
> > > > > > > > > > > > to 3 in this patch, and set it in the backend so as not to affect
> > > > > > > > > > > > other targets.)
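As a side note on the parametrization idea: in current GCC such a knob
would presumably be a --param, i.e. a few lines in params.opt roughly as
below, with a target able to override the default via SET_OPTION_IF_UNSET
in its option-override hook.  The param name, default and wording here are
invented for illustration only.

  -param=max-hoist-loop-depth=
  Common Joined UInteger Var(param_max_hoist_loop_depth) Init(3) Param Optimization
  Maximum loop nesting depth across which expressions may be hoisted.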
> > > > > > > > > > >
> > > > > > > > > > > I don't think this patch captures the case in a sensible way - it will simply
> > > > > > > > > > > never hoist computations out of loop header blocks with depth > 3 which
> > > > > > > > > > > is certainly not what you want.  Also the pre-header check is odd - we're
> > > > > > > > > > > looking for computations in successors of BLOCK but clearly a computation
> > > > > > > > > > > in a pre-header is not at the same loop level as one in the header itself.
> > > > > > > > > > Well, my intent was to check if we are hoisting across a region that
> > > > > > > > > > has multiple nested loops, and in that case avoid hoisting expressions
> > > > > > > > > > that do not belong to any loop blocks, because that may increase
> > > > > > > > > > resource pressure inside the loops. For instance, in the calculix
> > > > > > > > > > issue we hoist the 'w' array from the post-dom, and neither loop
> > > > > > > > > > region has any uses of 'w'.  I agree checking just the loop level is
> > > > > > > > > > too coarse.
> > > > > > > > > > The pre-header check was essentially for the same purpose, to see
> > > > > > > > > > whether we are hoisting across a loop region, not necessarily from
> > > > > > > > > > within the loops.
> > > > > > > > >
> > > > > > > > > But it will fail to hoist *p in
> > > > > > > > >
> > > > > > > > >    if (x)
> > > > > > > > >      {
> > > > > > > > >         a = *p;
> > > > > > > > >      }
> > > > > > > > >   else
> > > > > > > > >     {
> > > > > > > > >        b = *p;
> > > > > > > > >     }
> > > > > > > > >
> > > > > > > > > <make distance large>
> > > > > > > > > pdom:
> > > > > > > > >   c = *p;
> > > > > > > > >
> > > > > > > > > so it isn't what matters either.  What happens at the immediate post-dominator
> > > > > > > > > isn't directly relevant - what matters would be if the pdom is the one making
> > > > > > > > > the value antic on one of the outgoing edges.  In that case we're also going
> > > > > > > > > to PRE *p into the arm not computing *p (albeit in a later position).  But
> > > > > > > > > that property is impossible to compute from the sets itself (not even mentioning
> > > > > > > > > the arbitrary CFG that can be inbetween the block and its pdom or the weird
> > > > > > > > > pdoms we compute for regions not having a path to exit, like infinite loops).
> > > > > > > > >
> > > > > > > > > You could eventually look at the pdom predecessors and if *p is not AVAIL_OUT
> > > > > > > > > in each of them we _might_ have the situation you want to protect against.
> > > > > > > > > But as said PRE insertion will likely have made sure it _is_ AVAIL_OUT in each
> > > > > > > > > of them ...
> > > > > > > > Hi Richard,
> > > > > > > > Thanks for the suggestions. Right, the issue here seems to be that the
> > > > > > > > post-dom block is making expressions ANTIC. Before doing insert, could
> > > > > > > > we simply copy the AVAIL_OUT set of each block into another set, say
> > > > > > > > ORIG_AVAIL_OUT, as a guard against PRE eventually inserting expressions
> > > > > > > > in the pred blocks of the pdom and making them available?
> > > > > > > > And during hoisting, if the distance is "large", we could check that
> > > > > > > > each expr that is ANTIC_IN (pdom) is ORIG_AVAIL_OUT in each pred of
> > > > > > > > the pdom.
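The snapshot part of that idea might look roughly like the sketch below,
where ORIG_AVAIL_OUT is an invented accessor mirroring AVAIL_OUT and
bitmap_copy is the existing helper; this is exactly the per-block copy
whose cost is questioned in the reply.

  basic_block bb;
  /* Before the insert iteration, snapshot each block's AVAIL_OUT value
     set so that later PRE insertions cannot change the answer.  */
  FOR_ALL_BB_FN (bb, cfun)
    bitmap_copy (&ORIG_AVAIL_OUT (bb)->values, &AVAIL_OUT (bb)->values);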
> > > > > > >
> > > > > > > Did you try if it works w/o copying AVAIL_OUT?  Because AVAIL_OUT is
> > > > > > > very large (it's actually quadratic in size of the CFG * # values), in
> > > > > > > particular
> > > > > > > we're inserting in RPO and update AVAIL_OUT only up to the current block
> > > > > > > (from dominators) so the PDOM block should have the original AVAIL_OUT
> > > > > > > (but from the last iteration - we do iterate insertion).
> > > > > > >
> > > > > > > Note I'm still not happy with pulling off this kind of heuristic.
> > > > > > > What the suggestion
> > > > > > > means is that for
> > > > > > >
> > > > > > >    if (x)
> > > > > > >      y = *p;
> > > > > > >    z = *p;
> > > > > > >
> > > > > > > we'll end up with
> > > > > > >
> > > > > > >   if (x)
> > > > > > >     y = *p;
> > > > > > >   else
> > > > > > >     z = *p;
> > > > > > >
> > > > > > > instead of
> > > > > > >
> > > > > > >    tem = *p;
> > > > > > >    if (x)
> > > > > > >     y = tem;
> > > > > > >   else
> > > > > > >     z = tem;
> > > > > > >
> > > > > > > that is, we get the runtime benefit but not the code-size one
> > > > > > > (hoisting mostly helps code size plus allows if-conversion as followup
> > > > > > > in some cases).  Now, if we iterate (like if we'd have a second hoisting pass)
> > > > > > > then the above would still cause hoisting - so the heuristic isn't sustainable.
> > > > > > > Again, sth like "distance" isn't really available.
> > > > > > Hi Richard,
> > > > > > It doesn't work without copying AVAIL_OUT.
> > > > > > I tried it on a small example with the attached patch:
> > > > > >
> > > > > > int foo(int cond, int x, int y)
> > > > > > {
> > > > > >   int t;
> > > > > >   void f(int);
> > > > > >
> > > > > >   if (cond)
> > > > > >     t = x + y;
> > > > > >   else
> > > > > >     t = x - y;
> > > > > >
> > > > > >   f (t);
> > > > > >   int t2 = (x + y) * 10;
> > > > > >   return t2;
> > > > > > }
> > > > > >
> > > > > > By intersecting availout_in_some with AVAIL_OUT of the preds of the
> > > > > > pdom, it does not hoist in the first pass, but then PRE inserts x + y
> > > > > > in the "else block", and we eventually hoist before if (cond).
> > > > > > Similarly for the e_c3d hoistings in calculix.
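That sequence looks roughly like this (illustrative C, glossing over the
PHI that merges the two copies):

  /* After PRE's insertion in the else arm ...  */
  if (cond)
    t = x + y;
  else
    {
      t = x - y;
      tem = x + y;   /* inserted by PRE for the partial redundancy */
    }
  f (t);
  t2 = tem * 10;     /* tem is really a PHI of the two copies */

  /* ... x + y is now computed in both arms, so the next insert
     iteration hoists it above the branch, defeating the guard.  */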
> > > > > >
> > > > > > IIUC, we want hoisting to be:
> > > > > > (ANTIC_IN (block) intersect AVAIL_OUT (preds of pdom)) - AVAIL_OUT (block)
> > > > > > to ensure that hoisted expressions lie along all paths from the block
> > > > > > to the post-dom ?
> > > > > > If copying AVAIL_OUT sets is too expensive, could we keep another set
> > > > > > that precomputes the intersection of AVAIL_OUT over the pdom's preds
> > > > > > for each block, and then use this info during hoisting ?
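Expressed with the existing bitmap helpers of tree-ssa-pre.c, that filter
would be roughly the sketch below; candidates is an invented local, and
pdom_bb follows the naming in the patch fragment quoted further down.

  edge e;
  edge_iterator ei;
  bitmap_head candidates;
  bitmap_initialize (&candidates, &grand_bitmap_obstack);
  /* ANTIC_IN (block) restricted to values available out of every
     predecessor of the post-dominator ...  */
  bitmap_copy (&candidates, &ANTIC_IN (block)->values);
  FOR_EACH_EDGE (e, ei, pdom_bb->preds)
    bitmap_and_into (&candidates, &AVAIL_OUT (e->src)->values);
  /* ... minus what is already available out of BLOCK itself.  */
  bitmap_and_compl_into (&candidates, &AVAIL_OUT (block)->values);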
> > > > > >
> > > > > > For computing "distance", I implemented a simple DFS walk from the
> > > > > > block to the post-dom that gives up if the depth crosses a threshold
> > > > > > before reaching the post-dom. I am not sure, though, how expensive
> > > > > > that can get.
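Such a bounded walk might look like the following sketch (illustrative
only, not the attached patch; the budget strictly decreases on every
visit, so the recursion is bounded even across CFG cycles):

  /* Return true if POST_DOM is reached from BLOCK within *BUDGET visited
     blocks; give up once the budget is exhausted.  */
  static bool
  reaches_pdom_within_budget_p (basic_block block, basic_block post_dom,
                                int *budget)
  {
    if (block == post_dom)
      return true;
    if (--*budget < 0)
      return false;
    edge e;
    edge_iterator ei;
    FOR_EACH_EDGE (e, ei, block->succs)
      if (reaches_pdom_within_budget_p (e->dest, post_dom, budget))
        return true;
    return false;
  }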
> > > > >
> > > > > As written it is quadratic in the CFG size.
> > > > >
> > > > > You can optimize away the
> > > > >
> > > > > +    FOR_EACH_EDGE (e, ei, pdom_bb->preds)
> > > > > +      bitmap_and_into (&availout_in_some, &AVAIL_OUT (e->src)->values);
> > > > >
> > > > > loop if the intersection of availout_in_some and ANTIC_IN (pdom) is empty.
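That is, guard the loop with the existing intersection predicate,
something like the sketch below (the exact placement depends on the rest
of the patch):

  if (bitmap_intersect_p (&availout_in_some, &ANTIC_IN (pdom_bb)->values))
    FOR_EACH_EDGE (e, ei, pdom_bb->preds)
      bitmap_and_into (&availout_in_some, &AVAIL_OUT (e->src)->values);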
> > > > >
> > > > > As said, I don't think this is the way to go - trying to avoid code
> > > > > hoisting isn't
> > > > > what we'd want to do - your quoted assembly instead points to a loop
> > > > > with a non-empty latch which is usually caused by PRE and avoided with -O3
> > > > > because it also harms vectorization.
> > > > But disabling PRE (which removes the non-empty latch) only results in a
> > > > marginal performance improvement,
> > > > while disabling hoisting of the 'w' array, even with the non-empty
> > > > latch, seems to regain most of the performance.
> > > > AFAIU, that was happening because after disabling hoisting of 'w'
> > > > there was no cache miss (as seen with perf -e L1-dcache-load-misses)
> > > > on the load instruction inside the inner loop.
> > >
> > > But that doesn't make much sense then.  If code generation isn't
> > > an issue, I don't see how the hoisted loads should cause an L1
> > > dcache load miss for data that is accessed in the respective loop
> > > as well (though not hoisted from it, since at -O2 it is not
> > > sufficiently unrolled).
> > Hi Richard,
> > I am very sorry for the late response; I was away from work for some
> > personal commitments.
> > Yes, I agree this doesn't seem to make much sense but I am
> > consistently seeing L1 dcache load misses, which go away
> > after disabling hoisting of 'w'. I am not sure though why this happens.
> > Also the load instruction is the only one that has the most
> > significant performance difference across several runs. Or maybe I am
> > interpreting the results incorrectly.
> > Do you have suggestions for any benchmarking experiment I could try ?
>
> I think you want to edit the bad assembly manually and try a few things,
> like ordering the hoisted loads and then re-measure.  Unfortunately
> modern CPU pipelines have no easy way to tell us why they're unhappy.
Hi Richard,
Thanks for the suggestions. I reordered the hoisted loads so that they
are emitted in order, but that didn't seem to improve performance.

Thanks,
Prathamesh
>
> > >
> > > > For the pdom heuristic, I guess we cannot copy AVAIL_OUT sets per
> > > > node, since that's quadratic in terms of CFG size.
> > > > Would it make sense to break interaction between PRE and hoisting,
> > > > only for the case when inserting into preds of pdom ?
> > > > I tried doing that in attached patch, where insert runs in two phases:
> > > > (a) PRE and hoisting, where hoisting marks block to not do PRE for.
> > > > (b) Second phase, which only runs PRE on all blocks.
> > > > This (expectedly) regresses ssa-hoist-3.c.
> > > >
> > > > If the heuristic isn't acceptable, I suppose encoding the distance of
> > > > an expr within the ANTIC sets during compute_antic would be the right
> > > > way to fix this ?
> > > > So ANTIC_IN (block) would contain the anticipated expressions and, for
> > > > each antic expr, the "distance" from the furthest block it's computed
> > > > in ? Could you please elaborate a bit on how we could go about encoding
> > > > distance during compute_antic ?
> > >
> > > But the distance in this case is just one CFG node ... we have
> > >
> > >  if (mattyp.eq.1)
> > >    ... use of w but not with constant indices
> > >  else if (mattyp.eq.2)
> > >    .. inlined orthonl with constant index w() accesses, single BB
> > >  else
> > >    ... use of w but not with constant indices - the actual relevant
> > > loop of calculix
> > >  endif
> > >  ... constant index w() accesses, single BB
> > >
> > > so the CFG distance is one node - unless you want to compute the
> > > maximum distance?  Btw, I only see 9 loads hoisted.
> > >
> > > I'm not sure how relevant -O2 -flto SPEC performance is for a FP benchmark.
> > >
> > > And indeed this case is exactly one where hoisting is superior to
> > > PRE which would happily insert the 9 loads into the two variable-access
> > > predecessors to get rid of the redundancy wrt the mattyp.eq.1 path.
> > >
> > > In .optimized I see
> > >
> > >   pretmp_5573 = w[0];
> > >   pretmp_5574 = w[3];
> > >   pretmp_5575 = w[6];
> > >   pretmp_5576 = w[1];
> > >   pretmp_5577 = w[4];
> > >   pretmp_5578 = w[7];
> > >   pretmp_5579 = w[2];
> > >   pretmp_5580 = w[5];
> > >   pretmp_5581 = w[8];
> > >   if (mattyp.157_742 == 1)
> > >
> > > I do remember talks/patches about ordering of such sequences of loads
> > > to make them prefetch-happier.  Are the loads actually emitted in-order
> > > for arm?  Thus w[0]...w[8] rather than as seen above with some random
> > > permutes inbetween?  On x86 they are emitted in random order
> > > (they are also spilled immediately).
> > On aarch64, they are emitted in random order as well.
> >
> > Thanks,
> > Prathamesh
> >
> >
> > Thanks,
> > Prathamesh
> > >
> > > Richard.
> > >
> > > > Thanks,
> > > > Prathamesh
> > > >
> > > >
> > > > Thanks,
> > > > Prathamesh
> > > > >
> > > > > Richard.
> > > > >
> > > > > > Thanks,
> > > > > > Prathamesh
> > > > > > >
> > > > > > > Richard.
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Prathamesh
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Note the difficulty to capture "distance" is that the distance is simply not
> > > > > > > > > > > available at this point - it is the anticipated values from the successors
> > > > > > > > > > > that do _not_ compute the value itself that are the issue.  To capture
> > > > > > > > > > > "distance" you'd need to somehow "age" anticipated value when
> > > > > > > > > > > propagating them upwards during compute_antic (where it then
> > > > > > > > > > > would also apply to PRE in general).
> > > > > > > > > > Yes, indeed.  As a hack, would it make sense to avoid inserting an
> > > > > > > > > > expression in the block if it's ANTIC in the post-dom block and the
> > > > > > > > > > "distance" between the block and the post-dom is "too far", as a
> > > > > > > > > > trade-off between extending the live range and hoisting ? In general,
> > > > > > > > > > as you point out, we'd need to compute distance info for successor
> > > > > > > > > > blocks during compute_antic, but special-casing the post-dom should be
> > > > > > > > > > easy enough during do_hoist_insertion, and hoisting an expr that is
> > > > > > > > > > ANTIC in the post-dom could potentially be "long range" if the region
> > > > > > > > > > is large.
> > > > > > > > > > It's still a coarse heuristic though. I tried it in the attached patch.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Prathamesh
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > As with all other heuristics the only place one could do hackish attempts
> > > > > > > > > > > with at least some reasoning is the elimination phase where
> > > > > > > > > > > we make use of the (hoist) insertions - of course for hoisting we already
> > > > > > > > > > > know we'll get the "close" use in one of the successors so I fear even
> > > > > > > > > > > there it will be impossible to do something sensible.
> > > > > > > > > > >
> > > > > > > > > > > Richard.
> > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Prathamesh
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Prathamesh
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Prathamesh
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Prathamesh
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Alexander


end of thread, other threads:[~2020-10-28  6:56 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-26 10:32 LTO slows down calculix by more than 10% on aarch64 Prathamesh Kulkarni
2020-08-26 11:20 ` Richard Biener
2020-08-28 11:16   ` Prathamesh Kulkarni
2020-08-28 11:57     ` Richard Biener
2020-08-31 11:21       ` Prathamesh Kulkarni
2020-08-31 11:40         ` Jan Hubicka
2020-08-28 12:03     ` Alexander Monakov
2020-08-31 11:23       ` Prathamesh Kulkarni
2020-09-04  9:52         ` Prathamesh Kulkarni
2020-09-04 11:38           ` Alexander Monakov
2020-09-21  9:49             ` Prathamesh Kulkarni
2020-09-21 12:44               ` Prathamesh Kulkarni
2020-09-22  5:08                 ` Prathamesh Kulkarni
2020-09-22  7:25                   ` Richard Biener
2020-09-22  9:37                     ` Prathamesh Kulkarni
2020-09-22 11:06                       ` Richard Biener
2020-09-22 16:24                         ` Prathamesh Kulkarni
2020-09-23  7:52                           ` Richard Biener
2020-09-23 10:10                             ` Prathamesh Kulkarni
2020-09-23 11:10                               ` Richard Biener
2020-09-24 10:36                                 ` Prathamesh Kulkarni
2020-09-24 11:14                                   ` Richard Biener
2020-10-21 10:03                                     ` Prathamesh Kulkarni
2020-10-21 10:39                                       ` Richard Biener
2020-10-28  6:55                                         ` Prathamesh Kulkarni
