From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugs-return-388246-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 9752 invoked by alias); 8 Apr 2012 23:22:38 -0000
Received: (qmail 9739 invoked by uid 22791); 8 Apr 2012 23:22:35 -0000
X-SWARE-Spam-Status: No, hits=-3.2 required=5.0	tests=ALL_TRUSTED,AWL,BAYES_00,TW_DD,TW_DQ,TW_PX,TW_SD,TW_SR,TW_TR,TW_VD,TW_VP
X-Spam-Check-By: sourceware.org
Received: from localhost (HELO gcc.gnu.org) (127.0.0.1)    by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Sun, 08 Apr 2012 23:22:22 +0000
From: "matz at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/52910] New: xop-mul-1:f13 miscompiled on bulldozer (-mxop)
Date: Sun, 08 Apr 2012 23:22:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: matz at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Changed-Fields:
Message-ID: <bug-52910-4@http.gcc.gnu.org/bugzilla/>
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
Content-Type: text/plain; charset="UTF-8"
MIME-Version: 1.0
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2012-04/txt/msg00528.txt.bz2

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52910

             Bug #: 52910
           Summary: xop-mul-1:f13 miscompiled on bulldozer (-mxop)
    Classification: Unclassified
           Product: gcc
           Version: 4.8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: matz@gcc.gnu.org


Once PR52908 (the miscompile of f9 in the same testcase) is fixed by
deactivating the sse4.1 pattern under TARGET_XOP, this bug will be
exposed: f13 is miscompiled too.  The function is a straightforward
multiply-add reduction:

__attribute__((noinline, noclone)) long long
f13 (void)
{
  int i;
  long long r = 0;
  for (i = 0; i < 512; ++i)
    r += (long long) c2[i] * (long long) c3[i];
  return r;
}

This is compiled into (with -O3 -mxop):

f13:
        movl    $0, %eax
        vpxor   %xmm2, %xmm2, %xmm2
.L41:
        vmovdqa c2(%rax), %xmm1
        vmovdqa c3(%rax), %xmm0
        vpmacsdqh       %xmm2, %xmm0, %xmm1, %xmm2
        vpsrldq $4, %xmm1, %xmm1
        vpsrldq $4, %xmm0, %xmm0
        vpmacsdqh       %xmm2, %xmm0, %xmm1, %xmm2
        addq    $16, %rax
        cmpq    $2048, %rax
        jne     .L41
        vpsrldq $8, %xmm2, %xmm0
        vpaddq  %xmm2, %xmm0, %xmm2
        vpextrq $0, %xmm2, %rax
        ret

I think the problem is confusion about how the vector components are numbered
and a mismatch between the sse/avx patterns and the xop patterns.  Judging by
the patterns, the above loop clearly tries to mult-acc components X and X+2
in the first vpmacsdqh and then components X+1 and X+3 in the second one.
For that it shifts the whole register right by 32 bits (the input component
size).  For reference, here are the rough patterns from the last RTL dump for
the two mult-accs and the shifts:

(insn 13 12 14 3 (set (reg:V2DI 23 xmm2 [75])
        (plus:V2DI (mult:V2DI (sign_extend:V2DI (vec_select:V2SI (reg:V4SI 22 xmm1 [73])
                        (parallel [
                                (const_int 0 [0])
                                (const_int 2 [0x2])
                            ])))
                (sign_extend:V2DI (vec_select:V2SI (reg:V4SI 21 xmm0 [74])
                        (parallel [
                                (const_int 0 [0])
                                (const_int 2 [0x2])
                            ]))))
            (reg:V2DI 23 xmm2 [orig:65 vect_var_.370 ] [65]))) 1794 {xop_pmacsdqh}
     (nil))

(insn 14 13 15 3 (set (reg:V1TI 22 xmm1 [76])
        (lshiftrt:V1TI (reg:V1TI 22 xmm1 [73])
            (const_int 32 [0x20]))) 1511 {sse2_lshrv1ti3}
     (nil))

(insn 15 14 16 3 (set (reg:V1TI 21 xmm0 [77])
        (lshiftrt:V1TI (reg:V1TI 21 xmm0 [74])
            (const_int 32 [0x20]))) 1511 {sse2_lshrv1ti3}
     (nil))

(insn 17 16 18 3 (set (reg:V2DI 23 xmm2 [orig:65 vect_var_.370 ] [65])
        (plus:V2DI (mult:V2DI (sign_extend:V2DI (vec_select:V2SI (reg:V4SI 22 xmm1 [76])
                        (parallel [
                                (const_int 0 [0])
                                (const_int 2 [0x2])
                            ])))
                (sign_extend:V2DI (vec_select:V2SI (reg:V4SI 21 xmm0 [77])
                        (parallel [
                                (const_int 0 [0])
                                (const_int 2 [0x2])
                            ]))))
            (reg:V2DI 23 xmm2 [75]))) 1794 {xop_pmacsdqh}
     (expr_list:REG_DEAD (reg:V4SI 22 xmm1 [76])
        (expr_list:REG_DEAD (reg:V4SI 21 xmm0 [77])
            (nil))))

So, for this to do the right thing, the mult-acc pattern's idea of components
0 and 2 must be such that components 1 and 3 are moved into positions 0 and 2
by a right shift of 32 bits.
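
As a rough illustration of that requirement, here is a scalar C model of the
128-bit right shift by 32 bits (vpsrldq $4), assuming component 0 is the
least significant one.  The function name shift_right_32 is made up for this
sketch and is not part of GCC or the testcase:

```c
#include <stdint.h>
#include <string.h>

/* Model of a logical right shift of a 128-bit register by 32 bits,
   viewed as four 32-bit components with component 0 in the least
   significant position: every component moves down by one slot and
   a zero is shifted into component 3.  */
void
shift_right_32 (const int32_t in[4], int32_t out[4])
{
  int32_t tmp[4];

  tmp[0] = in[1];
  tmp[1] = in[2];
  tmp[2] = in[3];
  tmp[3] = 0;
  memcpy (out, tmp, sizeof tmp);
}
```

With this numbering the old components 1 and 3 end up in positions 0 and 2,
so the shift only helps a mult-acc that really reads positions 0 and 2.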

Tracing the whole thing in gdb reveals some miscommunication in that idea.
The two input values at start of loop:

% $xmm0.v4_int32 {1215505350, 1491885311, -676627251, -515498245}
% $xmm1.v4_int32 {-1737931417, -1807033263, 1488592681, -1444724238}
% $xmm2.v2_int64 {0, 0}

The inputs from the arrays:

% c2[0]@4 {-1737931417, -1807033263, 1488592681, -1444724238}
% c3[0]@4 {1215505350, 1491885311, -676627251, -515498245}

Now after the first mult-acc:

% $xmm2.v2_int64 {-2695886381558099793, 744752809197962310}

That is, it multiplied components 1 and 3, not 0 and 2 (i.e. it computed
c2[1]*c3[1] and c2[3]*c3[3]).  Now the two right shifts happen and we have:

% $xmm0.v4_int32 {1491885311, -676627251, -515498245, 0}
% $xmm1.v4_int32 {-1807033263, 1488592681, -1444724238, 0}

Doing the mult-acc again (using components 1 and 3) now leads to wrong results.
The original component 0 has already been shifted out, and component 3 is the
zero that was shifted in.
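
The trace can be replayed with a small scalar C model.  This is only a
sketch of one loop iteration (four elements) plus the final reduction; it
assumes vpmacsdqh reads components 1 and 3, as observed in gdb, and all
function names are invented for the illustration:

```c
#include <stdint.h>

/* vpmacsdqh as observed above: multiply the sign-extended components
   1 and 3 and add the products to the two 64-bit accumulator lanes.  */
void
macsdqh (const int32_t a[4], const int32_t b[4], int64_t acc[2])
{
  acc[0] += (int64_t) a[1] * b[1];
  acc[1] += (int64_t) a[3] * b[3];
}

/* vpsrldq $4, done in place: shift right by one 32-bit component.  */
void
shift_right_32 (int32_t v[4])
{
  v[0] = v[1];
  v[1] = v[2];
  v[2] = v[3];
  v[3] = 0;
}

/* One iteration of the generated loop followed by the reduction
   (vpsrldq $8 / vpaddq / vpextrq).  */
int64_t
generated_code (const int32_t c2[4], const int32_t c3[4])
{
  int32_t x1[4], x0[4];
  int64_t acc[2] = { 0, 0 };
  int i;

  for (i = 0; i < 4; ++i)
    {
      x1[i] = c2[i];
      x0[i] = c3[i];
    }
  macsdqh (x1, x0, acc);   /* adds c2[1]*c3[1] and c2[3]*c3[3] */
  shift_right_32 (x1);
  shift_right_32 (x0);
  macsdqh (x1, x0, acc);   /* adds c2[2]*c3[2] and 0*0 */
  return acc[0] + acc[1];
}

/* What f13 should compute for the same four elements.  */
int64_t
expected (const int32_t c2[4], const int32_t c3[4])
{
  int64_t r = 0;
  int i;

  for (i = 0; i < 4; ++i)
    r += (int64_t) c2[i] * c3[i];
  return r;
}
```

In this model the two results differ by exactly c2[0]*c3[0], the product
whose inputs were shifted out before they were ever used.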

From the description of the xop instruction (which talks about the second and
fourth components for vpmacsdqh) it seems that the vpmacsdqh pattern should
refer to components 1 and 3 (and hence not be used in this case as long as
the shifts stay).

Actually, with xop the shifts aren't necessary: the {0,2} and {1,3}
mult-accs could be done with vpmacsdql and vpmacsdqh, respectively.
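
A scalar C sketch of that shift-free variant, for one group of four
elements.  It assumes vpmacsdql reads components 0 and 2 and vpmacsdqh reads
components 1 and 3; the function names are hypothetical and this is not a
proposed patch to the patterns:

```c
#include <stdint.h>

/* Assumed vpmacsdql semantics: mult-acc of components 0 and 2.  */
void
macsdql (const int32_t a[4], const int32_t b[4], int64_t acc[2])
{
  acc[0] += (int64_t) a[0] * b[0];
  acc[1] += (int64_t) a[2] * b[2];
}

/* Assumed vpmacsdqh semantics: mult-acc of components 1 and 3.  */
void
macsdqh (const int32_t a[4], const int32_t b[4], int64_t acc[2])
{
  acc[0] += (int64_t) a[1] * b[1];
  acc[1] += (int64_t) a[3] * b[3];
}

/* Multiply-add reduction over four elements without any shifts:
   the low and high variants together cover all four components.  */
int64_t
mul_add_reduce_4 (const int32_t c2[4], const int32_t c3[4])
{
  int64_t acc[2] = { 0, 0 };

  macsdql (c2, c3, acc);
  macsdqh (c2, c3, acc);
  return acc[0] + acc[1];
}
```

Under these assumptions every product c2[i]*c3[i] is accumulated exactly
once, so the result matches the scalar loop in f13 for those four elements.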