From: "meissner at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/50031] New: Sphinx3 has a 10% regression going from GCC 4.5 to GCC 4.6 on powerpc
Date: Tue, 09 Aug 2011 17:34:00 -0000

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50031

           Summary: Sphinx3 has a 10% regression going from GCC 4.5 to
                    GCC 4.6 on powerpc
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: meissner@gcc.gnu.org
              Host: powerpc64-linux
            Target: powerpc64-linux
             Build: powerpc64-linux

Created attachment 24964
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24964
Cut-down example of the function with the regression.

The sphinx3 benchmark from SPEC 2006 exhibits a 10% regression when comparing a
GCC 4.5 build targeting power7 with vectorization to the same build done with
GCC 4.6. The GCC 4.7 trunk shows the same slowdown. In various profiling runs,
I have traced the slowdown to the vector_gautbl_eval_logs3 function in
vector.c. The main part of the function is the inner loop:

{
    int32 i, r;
    float64 f;
    int32 end, veclen;
    float32 *m1, *m2, *v1, *v2;
    float64 dval1, dval2, diff1, diff2;

    f = log_to_logs3_factor();
    end = offset + count;
    veclen = gautbl->veclen;

    for (r = offset; r < end-1; r += 2) {
        m1 = gautbl->mean[r];
        m2 = gautbl->mean[r+1];
        v1 = gautbl->var[r];
        v2 = gautbl->var[r+1];
        dval1 = gautbl->lrd[r];
        dval2 = gautbl->lrd[r+1];

        /* start of the critical loop */
        for (i = 0; i < veclen; i++) {
            diff1 = x[i] - m1[i];
            dval1 -= diff1 * diff1 * v1[i];
            diff2 = x[i] - m2[i];
            dval2 -= diff2 * diff2 * v2[i];
        }
        /* end of the critical loop */

        if (dval1 < gautbl->distfloor)
            dval1 = gautbl->distfloor;
        if (dval2 < gautbl->distfloor)
            dval2 = gautbl->distfloor;

        score[r] = (int32)(f * dval1);
        score[r+1] = (int32)(f * dval2);
    }

    if (r < end) {
        m1 = gautbl->mean[r];
        v1 = gautbl->var[r];
        dval1 = gautbl->lrd[r];

        for (i = 0; i < veclen; i++) {
            diff1 = x[i] - m1[i];
            dval1 -= diff1 * diff1 * v1[i];
        }

        if (dval1 < gautbl->distfloor)
            dval1 = gautbl->distfloor;

        score[r] = (int32)(f * dval1);
    }
}

Note that the critical loop reads single-precision floating point values, does
a single-precision subtraction, and then converts the result to double
precision for the multiply-accumulate. The code tries to 'help' the compiler
by manually unrolling the outer loop by a factor of two.
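For experimenting outside of sphinx3, here is a minimal, self-contained C
sketch of just the critical loop. The typedefs are assumptions standing in for
sphinx3's own headers, and diff_sq_norm is a made-up name, not a function from
vector.c:

#include <stdint.h>

typedef int32_t int32;    /* assumed equivalents of sphinx3's typedefs */
typedef float   float32;
typedef double  float64;

float64
diff_sq_norm (const float32 *x, const float32 *m1, const float32 *v1,
              float64 dval1, int32 veclen)
{
    int32 i;
    float64 diff1;

    for (i = 0; i < veclen; i++) {
        /* single-precision read and subtract; the result is then
           widened to double precision for the multiply-accumulate */
        diff1 = x[i] - m1[i];
        dval1 -= diff1 * diff1 * v1[i];
    }
    return dval1;
}

Compiling this with -O3 -ffast-math -mcpu=power7 under GCC 4.5 and GCC 4.6
should show the difference in code generation discussed below.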
The code produced by GCC 4.5 with -O3 -ffast-math -mcpu=power7 is a fairly
straightforward scalar evaluation of the inner loop:

.L4:
        lfsx 0,28,9
        lfsx 10,10,9
        lfsx 13,11,9
        lfsx 9,8,9
        fsubs 13,0,13
        fsubs 0,0,10
        lfsx 10,6,9
        addi 9,9,4
        cmpd 7,9,0
        xsmuldp 13,13,13
        xsmuldp 0,0,0
        xsnmsubadp 11,13,9
        xsnmsubadp 12,0,10
        bne 7,.L4

With GCC 4.6, the compiler decides it can vectorize the loop:

.L5:
        add 25,24,27
        add 26,23,27
        rldicr 20,27,0,59
        rldicr 25,25,0,59
        rldicr 26,26,0,59
        lxvw4x 44,0,25
        lxvw4x 45,0,20
        add 25,22,27
        lxvw4x 42,0,26
        add 26,21,27
        rldicr 25,25,0,59
        rldicr 26,26,0,59
        addi 27,27,16
        vperm 11,11,13,2
        vperm 9,9,12,19
        vperm 7,7,10,17
        xvsubsp 9,43,41
        lxvw4x 41,0,26
        xvsubsp 10,43,39
        lxvw4x 43,0,25
        vperm 0,0,9,16
        vperm 1,1,11,18
        xxmrglw 36,32,32
        xxmrglw 37,9,9
        xxmrglw 38,10,10
        xxmrghw 9,9,9
        xvcvspdp 37,37
        xxmrglw 35,33,33
        xvcvspdp 38,38
        xxmrghw 10,10,10
        xvcvspdp 39,9
        xxmrghw 32,32,32
        xvcvspdp 35,35
        xxmrghw 33,33,33
        xvcvspdp 40,10
        xvcvspdp 36,36
        xvmuldp 37,37,37
        xvmuldp 38,38,38
        xvmuldp 9,39,39
        xvcvspdp 33,33
        xvcvspdp 0,32
        xvmuldp 10,40,40
        xvmuldp 6,37,35
        xvmuldp 7,38,36
        xxlor 32,41,41
        xxlor 39,42,42
        xxlor 41,44,44
        xvmaddadp 6,9,33
        xvmaddadp 7,10,0
        xxlor 33,43,43
        xxlor 43,45,45
        xvsubdp 4,4,6
        xvsubdp 5,5,7
        bdnz .L5

This loop demonstrates several problems; some are specific to the powerpc
backend, and some are in the tree optimizers (and would need hooks from the
backend):

1) When tree-vect-data-refs.c doesn't know the alignment of memory in a loop
that is vectorized, and the machine has a vec_realign_load_<mode> pattern, the
loop that is generated always uses the unaligned load, even though it might be
slow. On the power7, the realign code uses a vector load and the lvsr
instruction to create a permute mask, and then in the inner loop, after each
load, uses the permute mask to complete the unaligned load. Thus, in the loop,
before doing the conversions, we do 4 vector loads and 4 permutes. Each vector
conversion from 32-bit to 64-bit involves two more permutes to split the V4SF
values into the appropriate registers before doing the float->double
conversion. Thus in the loop we have 4 permutes for the 4 loads and 8 permutes
for the conversions. The power7 has only one permute functional unit, and
multiple permutes can slow things down. The code has one segment with 3
back-to-back permutes and another with 6 back-to-back permutes. It would help
if the vectorizer could clone the loop, testing at run time whether the
pointers are aligned, and doing aligned loads on one side and unaligned loads
on the other (see the sketch after this list). I experimented with an option
to disable the vec_realign_load_<mode> pattern, and it helped this particular
benchmark but hurt other benchmarks, because the generated code would run the
vectorized loop only if the pointers were aligned, falling back to the scalar
loop if they were unaligned. I would think falling back to using
vec_realign_load_<mode> would be a win.

2) In looking at the documentation, I discovered that vec_realign_load_<mode>
is not documented in md.texi.

3) The powerpc backend doesn't realize it could use the Altivec load
instruction here, since the Altivec load implicitly ignores the bottom bits of
the address; note the rldicr ...,0,59 instructions in the loop above that
clear those bits by hand.

4) The code in tree-vect-stmts.c, tree-vect-slp.c, and tree-vect-loop.c that
calls the vectorization cost target hook never passes the actual type in the
vectype argument, nor sets the misalign argument to non-zero. I would imagine
that vector systems might have different costs depending on the type. Maybe
the two arguments should be eliminated if we aren't going to pass useful
information.
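To make the loop versioning suggested in item 1 concrete, here is a
hypothetical C sketch (hand-written, not compiler output; the function name is
made up). Only the shape of the transformation matters: the vectorizer would
clone the loop internally, using aligned vector loads in one arm and the
realign sequence in the other, instead of committing unconditionally to
realign loads:

#include <stdint.h>

static double
diff_sq_accum (const float *x, const float *m, const float *v,
               double dval, int veclen)
{
    int i;
    double diff;

    if ((((uintptr_t) x | (uintptr_t) m | (uintptr_t) v) & 15) == 0) {
        /* All pointers are 16-byte aligned: in this version the
           vectorizer could emit plain aligned vector loads with no
           per-iteration permutes.  (At the source level, GCC 4.7's
           __builtin_assume_aligned would be the way to convey the
           guarantee that the run-time test just established.)  */
        for (i = 0; i < veclen; i++) {
            diff = x[i] - m[i];
            dval -= diff * diff * v[i];
        }
    } else {
        /* Unaligned: only this version needs the vec_realign_load_<mode>
           sequence (lvsr permute mask plus a vperm after each load).  */
        for (i = 0; i < veclen; i++) {
            diff = x[i] - m[i];
            dval -= diff * diff * v[i];
        }
    }
    return dval;
}

This way the unaligned case still runs vectorized realign code rather than
falling all the way back to the scalar loop.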
In addition to the vectype/misalign issue, there doesn't seem to be any cost
for doing the realignment itself: there is a cost for unaligned loads (via
movmisalign), but none for the vec_realign_load_<mode> sequence.

I have patches that fix the immediate problem by disabling the float/int to
double vector conversion under switch control, which I will submit shortly. In
general they restore the performance of sphinx3 on both GCC 4.6 and 4.7 and
improve the performance of a few other benchmarks, though on 4.6 there is one
regression (tonto is a few percent slower on 32-bit). However, I suspect the
problem really needs to be attacked at a higher level.
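To illustrate point 4, here is a hedged sketch of how a target's cost hook
could use the vectype and misalign arguments if the callers actually passed
them. This is hypothetical and simplified, not the actual rs6000
implementation; it assumes the signature of GCC 4.6's
targetm.vectorize.builtin_vectorization_cost hook, and the cost numbers are
made up:

/* Would live in a backend file such as config/rs6000/rs6000.c, with
   GCC's usual includes providing 'tree' and enum vect_cost_for_stmt.  */

static int
example_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
                                    tree vectype, int misalign)
{
  switch (type_of_cost)
    {
    case unaligned_load:
      /* If misalign were ever set, the backend could charge extra for a
         load that must go through the lvsr/vperm realign sequence,
         which adds a permute per load on power7.  */
      if (misalign != 0)
        return 3;        /* made-up: load plus realignment overhead */
      return 2;

    case vector_load:
      /* vectype could distinguish, e.g., V4SF from V2DF if the costs
         differ; today it is never passed, so it goes unused.  */
      return 1;

    default:
      return 1;
    }
}

A separate cost entry (or a non-zero misalign) for the realign sequence would
also address the missing vec_realign cost noted above.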