public inbox for gcc-bugs@sourceware.org
From: "dorit at gcc dot gnu dot org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/33243] Missed opportunities for vectorization due to unhandled real_type
Date: Thu, 30 Aug 2007 10:12:00 -0000
Message-ID: <20070830101226.1983.qmail@sourceware.org>
In-Reply-To: <bug-33243-7780@http.gcc.gnu.org/bugzilla/>

------- Comment #2 from dorit at gcc dot gnu dot org  2007-08-30 10:12 -------
> There are two time consuming routines in air.f90 of the Polyhedron
> benchmark that are not vectorized: lines 1328 and 1354. These appear
> in the top counting of execution time with oprofile:
>
>       SUBROUTINE DERIVY(D,U,Uy,Al,Np,Nd,M)
>       IMPLICIT REAL*8(A-H,O-Z)
>       PARAMETER (NX=150,NY=150)
>       DIMENSION D(NY,33) , U(NX,NY) , Uy(NX,NY) , Al(30) , Np(30)
>       DO jm = 1 , M
>          jmax = 0
>          jmin = 1
>          DO i = 1 , Nd
>             jmax = jmax + Np(i) + 1
>             DO j = jmin , jmax
>                uyt = 0.
>                DO k = 0 , Np(i)
>                   uyt = uyt + D(j,k+1)*U(jm,jmin+k)
>                ENDDO
>                Uy(jm,j) = uyt*Al(i)
>             ENDDO
>             jmin = jmin + Np(i) + 1
>          ENDDO
>       ENDDO
>       CONTINUE
>       END
>
> ./poly_air_1354.f90:12: note: def_stmt: uyt_1 = PHI <0.0(9), uyt_42(11)>
> ./poly_air_1354.f90:12: note: Unsupported pattern.
> ./poly_air_1354.f90:12: note: not vectorized: unsupported use in stmt.
> ./poly_air_1354.f90:12: note: unexpected pattern.
> ./poly_air_1354.f90:1: note: vectorized 0 loops in function.
>
> This is due to an unsupported type, real_type, for the reduction variable uyt:
> (this is on an i686-linux machine)

There is no "unhandled real_type" problem, you just need to use
-ffast-math to allow vectorization of summation of fp types (or the new
reassociation flag):

pr33243b.f90:12: note: Analyze phi: uyt_1 = PHI <0.0(9), uyt_42(11)>
pr33243b.f90:12: note: reduction: unsafe fp math optimization: D.1386_41 + uyt_1
pr33243b.f90:12: note: Unknown def-use cycle pattern.
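To illustrate why summation of fp types is only vectorized under
-ffast-math: a vectorized reduction keeps one partial sum per vector
lane and combines them after the loop, which reassociates the
additions. A minimal sketch of that effect follows. The program name,
the input values and the 2-lane width are assumptions chosen for
illustration and are not taken from air.f90.

```fortran
! Sketch only: names and values are illustrative assumptions.
PROGRAM reassoc_demo
  IMPLICIT NONE
  DOUBLE PRECISION :: x(4), s_seq, lane(2), s_vec
  INTEGER :: k

  x = (/ 1.0D30, 1.0D0, -1.0D30, 1.0D0 /)

  ! Source order: the only order allowed without -ffast-math.
  s_seq = 0.0D0
  DO k = 1, 4
     s_seq = s_seq + x(k)
  ENDDO

  ! Order a 2-lane vectorized reduction would use: one partial sum
  ! per lane, combined once after the loop.
  lane = 0.0D0
  DO k = 1, 4, 2
     lane(1) = lane(1) + x(k)
     lane(2) = lane(2) + x(k+1)
  ENDDO
  s_vec = lane(1) + lane(2)

  PRINT *, 'sequential sum :', s_seq
  PRINT *, 'two-lane sum   :', s_vec
END PROGRAM reassoc_demo
```

Compiled without -ffast-math, the sequential sum should come out as 1.0
(the first 1.0 is absorbed by 1.0D30) and the two-lane sum as 2.0, so
the compiler may not switch between the two orders unless it is told
that reassociation is acceptable.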
If you use -ffast-math the reduction is detected:

pr33243b.f90:12: note: Analyze phi: uyt_1 = PHI <0.0(9), uyt_42(11)>
pr33243b.f90:12: note: detected reduction:D.1386_41 + uyt_1
pr33243b.f90:12: note: Detected reduction.

However, the loop will still not get vectorized because there is a
non-consecutive access in the loop:

pr33243b.f90:12: note: === vect_analyze_data_ref_accesses ===
pr33243b.f90:12: note: not consecutive access
pr33243b.f90:12: note: not vectorized: complicated access pattern.

This is because the stride of the accesses to D(j,k+1) and U(jm,jmin+k)
in the inner-loop (k-loop) between inner-loop iterations is 1200B:

            DO j = jmin , jmax
               uyt = 0.
               DO k = 0 , NP(i)
                  uyt = uyt + D(j,k+1)*U(jm,jmin+k)
               ENDDO
               Uy(jm,j) = uyt*Al(i)
            ENDDO

In the outer-loop (j-loop) these accesses are consecutive, and you also
don't need to use the -ffast-math flag there. However there are other
problems:

1) the compiler creates a guard to control whether to enter the
inner-loop or not (because it may execute 0 times). This creates a more
involved control-flow than the outer-loop vectorizer is willing to work
with. A solution would be to create this guard outside the outer-loop
(in case it is invariant, as is the case here), which is like versioning
the loop (or unswitching the loop).

2) if you change the loop count to something constant (just to bypass
the above problem), then indeed no guard code is generated, but there is
a computation (advancing an iv) in the latch block of the outer-loop (so
it is not empty, and we are not willing to work with such loops). We
need to clean that away.

3) After these problems are solved, we still need to deal with a
non-consecutive access in the outer-loop - the store to Uy(jm,j).
AFAICS, this requires either transposing the Uy array in advance, or
teaching the vectorizer to "scatter" the results to the non-adjacent
locations (which would be quite expensive, but we could give it a try).
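The 1200B stride, the consecutive outer-loop accesses to D, and the
non-consecutive store in 3) all follow from Fortran's column-major
layout with REAL*8 elements. A small sketch of the arithmetic follows.
The program and variable names are illustrative. STORAGE_SIZE needs a
Fortran 2008 compiler, and 8-bit bytes are assumed.

```fortran
! Sketch only: prints the byte distance between successive accesses.
PROGRAM stride_demo
  IMPLICIT NONE
  INTEGER, PARAMETER :: NX = 150, NY = 150
  DOUBLE PRECISION :: D(NY,33)
  INTEGER :: elem_bytes

  ! STORAGE_SIZE returns the element size in bits.
  elem_bytes = STORAGE_SIZE(D) / 8

  ! Column-major: a step in the second index skips a whole column,
  ! i.e. as many elements as the first dimension holds.
  PRINT *, 'D(j,k+1)      k -> k+1 :', NY*elem_bytes
  PRINT *, 'U(jm,jmin+k)  k -> k+1 :', NX*elem_bytes
  ! A step in the first index moves to the adjacent element.
  PRINT *, 'D(j,k+1)      j -> j+1 :', elem_bytes
  ! The store uses j as the second index, so it is strided again.
  PRINT *, 'Uy(jm,j)      j -> j+1 :', NX*elem_bytes
END PROGRAM stride_demo
```

With 8-byte elements this prints 1200, 1200, 8 and 1200: the k-loop
walks D and U with a 1200B stride, the j-loop walks D consecutively,
and the store to Uy(jm,j) is again 1200B apart between j iterations.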
Alternatively, vectorizing the inner-loop would require transposing the
D and U matrices.

Another option is to interchange the jm loop with the j loop - I think
this way all accesses would be consecutive, and we could vectorize the
jm loop (which would now be a doubly-nested loop that the outer-loop
vectorizer could handle).

So, the PR for this testcase would be better classified under one of the
above problems/missed-optimizations rather than "unhandled real_type".

>
> Another similar routine that also appears in the top ranked and not
> vectorized due to the same unsupported real_type reasons is in air.f90:1181
>
>
>       SUBROUTINE FVSPLTX2
>       IMPLICIT REAL*8(A-H,O-Z)
>       PARAMETER (NX=150,NY=150)
>       DIMENSION DX(NX,33) , ALX(30) , NPX(30)
>       DIMENSION FP1(NX,NY) , FM1(NX,NY) , FP1x(30,NX) , FM1x(30,NX)
>       DIMENSION FP2(NX,NY) , FM2(NX,NY) , FP2x(30,NX) , FM2x(30,NX)
>       DIMENSION FP3(NX,NY) , FM3(NX,NY) , FP3x(30,NX) , FM3x(30,NX)
>       DIMENSION FP4(NX,NY) , FM4(NX,NY) , FP4x(30,NX) , FM4x(30,NX)
>       DIMENSION FV2(NX,NY) , DXP2(30,NX) , DXM2(30,NX)
>       DIMENSION FV3(NX,NY) , DXP3(30,NX) , DXM3(30,NX)
>       DIMENSION FV4(NX,NY) , DXP4(30,NX) , DXM4(30,NX)
>       COMMON /XD1 / FP1 , FM1 , FP2 , FM2 , FP3 , FM3 , FP4 , FM4 , &
>      & FP1x , FM1x , FP2x , FM2x , FP3x , FM3x , FP4x , &
>      & FM4x , FV2 , FV3 , FV4 , DXP2 , DXM2 , DXP3 , &
>      & DXM3 , DXP4 , DXM4 , DX , NPX , ALX , NDX , MXPy
>
>
>       DO ik = 1 , MXPy
>          jmax = 0
>          jmin = 1
>          DO i = 1 , NDX
>             jmax = jmax + NPX(i) + 1
> !
> ! INITIALIZE
> !
>             FP1x(i,ik) = 0.
>             FM1x(i,ik) = 0.
>             FP2x(i,ik) = 0.
>             FM2x(i,ik) = 0.
>             FP3x(i,ik) = 0.
>             FM3x(i,ik) = 0.
>             FP4x(i,ik) = 0.
>             FM4x(i,ik) = 0.
>             DXP2(i,ik) = 0.
>             DXM2(i,ik) = 0.
>             DXP3(i,ik) = 0.
>             DXM3(i,ik) = 0.
>             DXP4(i,ik) = 0.
>             DXM4(i,ik) = 0.
>             DO k = 0 , NPX(i)
>                jk = jmin + k
>                FP1x(i,ik) = FP1x(i,ik) + DX(jmax,k+1)*FP1(jk,ik)
>                FM1x(i,ik) = FM1x(i,ik) + DX(jmin,k+1)*FM1(jk,ik)
>                FP2x(i,ik) = FP2x(i,ik) + DX(jmax,k+1)*FP2(jk,ik)
>                FM2x(i,ik) = FM2x(i,ik) + DX(jmin,k+1)*FM2(jk,ik)
>                FP3x(i,ik) = FP3x(i,ik) + DX(jmax,k+1)*FP3(jk,ik)
>                FM3x(i,ik) = FM3x(i,ik) + DX(jmin,k+1)*FM3(jk,ik)
>                FP4x(i,ik) = FP4x(i,ik) + DX(jmax,k+1)*FP4(jk,ik)
>                FM4x(i,ik) = FM4x(i,ik) + DX(jmin,k+1)*FM4(jk,ik)
>                DXP2(i,ik) = DXP2(i,ik) + DX(jmax,k+1)*FV2(jk,ik)
>                DXM2(i,ik) = DXM2(i,ik) + DX(jmin,k+1)*FV2(jk,ik)
>                DXP3(i,ik) = DXP3(i,ik) + DX(jmax,k+1)*FV3(jk,ik)
>                DXM3(i,ik) = DXM3(i,ik) + DX(jmin,k+1)*FV3(jk,ik)
>                DXP4(i,ik) = DXP4(i,ik) + DX(jmax,k+1)*FV4(jk,ik)
>                DXM4(i,ik) = DXM4(i,ik) + DX(jmin,k+1)*FV4(jk,ik)
>             ENDDO
>             FP1x(i,ik) = FP1x(i,ik)*ALX(i)
>             FM1x(i,ik) = FM1x(i,ik)*ALX(i)
>             FP2x(i,ik) = FP2x(i,ik)*ALX(i)
>             FM2x(i,ik) = FM2x(i,ik)*ALX(i)
>             FP3x(i,ik) = FP3x(i,ik)*ALX(i)
>             FM3x(i,ik) = FM3x(i,ik)*ALX(i)
>             FP4x(i,ik) = FP4x(i,ik)*ALX(i)
>             FM4x(i,ik) = FM4x(i,ik)*ALX(i)
>             DXP2(i,ik) = DXP2(i,ik)*ALX(i)
>             DXM2(i,ik) = DXM2(i,ik)*ALX(i)
>             DXP3(i,ik) = DXP3(i,ik)*ALX(i)
>             DXM3(i,ik) = DXM3(i,ik)*ALX(i)
>             DXP4(i,ik) = DXP4(i,ik)*ALX(i)
>             DXM4(i,ik) = DXM4(i,ik)*ALX(i)
>             jmin = jmin + NPX(i) + 1
>          ENDDO
>       ENDDO
>       CONTINUE
>       END
>

Here again, it's not an issue with unhandled real_types, but a problem
detecting the reduction because of extra copy stmts that lim leaves
behind:

  # xd1__dxm4xd1_I_I_lsm.42_89 = PHI <xd1__dxm4xd1_I_I_lsm.42_429(7), xd1__dxm4xd1_I_I_lsm.42_320(9)>
  D.1487_302 = xd1__dxm4xd1_I_I_lsm.42_89;
  D.1489_313 = D.1488_312 + D.1487_302;
  xd1__dxm4xd1_I_I_lsm.42_320 = D.1489_313;

This problem with detecting only a restricted form of a reduction (that
consists of a single stmt + phi node) is already reported under PR32824.
It would be solved by a more general reduction detection utility that
Razya is planning to implement, and/or by cleaning away the stuff that
lim leaves behind.
(By the way, you'd still need to use -ffast-math).

However, even when this is fixed, we have non-consecutive accesses to
array X, which would need to be transposed.

If you look at the outer-loop (i-loop), I think everything is
consecutive there. The problem is that the inner-loop bound is not
invariant in the outer-loop, so that would prevent outer-loop
vectorization:

pr33243c.f90:21: note: Symbolic number of iterations is (<unnamed-unsigned:32>) D.1420_13 + 1
pr33243c.f90:21: note: not vectorized: inner-loop count not invariant.
pr33243c.f90:21: note: bad loop form.

>
> Here are some kernels from test_fpu.f90 that could be vectorized,
> but are not, due to the exact same problem with the real_type not
> supported. The places where the vectorization fails are marked
> with a comment at the end of the line: !seb.
>

I couldn't compile the rest of the kernels because of the "USE kinds" -
I get:

Fatal Error: Can't open module file 'kinds.mod' for reading

How do I get around that?

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33243