From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 9752 invoked by alias); 8 Apr 2012 23:22:38 -0000 Received: (qmail 9739 invoked by uid 22791); 8 Apr 2012 23:22:35 -0000 X-SWARE-Spam-Status: No, hits=-3.2 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00,TW_DD,TW_DQ,TW_PX,TW_SD,TW_SR,TW_TR,TW_VD,TW_VP X-Spam-Check-By: sourceware.org Received: from localhost (HELO gcc.gnu.org) (127.0.0.1) by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Sun, 08 Apr 2012 23:22:22 +0000 From: "matz at gcc dot gnu.org" To: gcc-bugs@gcc.gnu.org Subject: [Bug target/52910] New: xop-mul-1:f13 miscompiled on bulldozer (-mxop) Date: Sun, 08 Apr 2012 23:22:00 -0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: new X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: gcc X-Bugzilla-Component: target X-Bugzilla-Keywords: X-Bugzilla-Severity: normal X-Bugzilla-Who: matz at gcc dot gnu.org X-Bugzilla-Status: UNCONFIRMED X-Bugzilla-Priority: P3 X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org X-Bugzilla-Target-Milestone: --- X-Bugzilla-Changed-Fields: Message-ID: X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/ Auto-Submitted: auto-generated Content-Type: text/plain; charset="UTF-8" MIME-Version: 1.0 Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-bugs-owner@gcc.gnu.org X-SW-Source: 2012-04/txt/msg00528.txt.bz2 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52910 Bug #: 52910 Summary: xop-mul-1:f13 miscompiled on bulldozer (-mxop) Classification: Unclassified Product: gcc Version: 4.8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target AssignedTo: unassigned@gcc.gnu.org ReportedBy: matz@gcc.gnu.org Once PR52908 (miscompile of f9 in the same testcase) would be fixed by the method of deactivating the sse4.1 pattern with TARGET_XOP this one will be exposed; f13 is miscompiled too. The function is a straight mul-add-reduce: __attribute__((noinline, noclone)) long long f13 (void) { int i; long long r = 0; for (i = 0; i < 512; ++i) r += (long long) c2[i] * (long long) c3[i]; return r; } This is compiled into (with -O3 -mxop): f13: movl $0, %eax vpxor %xmm2, %xmm2, %xmm2 .L41: vmovdqa c2(%rax), %xmm1 vmovdqa c3(%rax), %xmm0 vpmacsdqh %xmm2, %xmm0, %xmm1, %xmm2 vpsrldq $4, %xmm1, %xmm1 vpsrldq $4, %xmm0, %xmm0 vpmacsdqh %xmm2, %xmm0, %xmm1, %xmm2 addq $16, %rax cmpq $2048, %rax jne .L41 vpsrldq $8, %xmm2, %xmm0 vpaddq %xmm2, %xmm0, %xmm2 vpextrq $0, %xmm2, %rax ret I think the problem is confusion between how the vector components are numbered and a mismatch between the sse/avx patterns and the xop patterns. What the above loop clearly tries to do when looking at the patterns is to mult-acc components X and X+2 in the first vpmacsdqh and then components X+1 and X+3 in the second one. For that it uses a right bit shift by 32 bits (the input component size). For reference here the rough patterns in the last RTL dump for the two mult-accs and the shifts: (insn 13 12 14 3 (set (reg:V2DI 23 xmm2 [75]) (plus:V2DI (mult:V2DI (sign_extend:V2DI (vec_select:V2SI (reg:V4SI 22 xm m1 [73]) (parallel [ (const_int 0 [0]) (const_int 2 [0x2]) ]))) (sign_extend:V2DI (vec_select:V2SI (reg:V4SI 21 xmm0 [74]) (parallel [ (const_int 0 [0]) (const_int 2 [0x2]) ])))) (reg:V2DI 23 xmm2 [orig:65 vect_var_.370 ] [65]))) 1794 {xop_pmacsdq h} (nil)) (insn 14 13 15 3 (set (reg:V1TI 22 xmm1 [76]) (lshiftrt:V1TI (reg:V1TI 22 xmm1 [73]) (const_int 32 [0x20]))) 1511 {sse2_lshrv1ti3} (nil)) (insn 15 14 16 3 (set (reg:V1TI 21 xmm0 [77]) (lshiftrt:V1TI (reg:V1TI 21 xmm0 [74]) (const_int 32 [0x20]))) 1511 {sse2_lshrv1ti3} (nil)) (insn 17 16 18 3 (set (reg:V2DI 23 xmm2 [orig:65 vect_var_.370 ] [65]) (plus:V2DI (mult:V2DI (sign_extend:V2DI (vec_select:V2SI (reg:V4SI 22 xm m1 [76]) (parallel [ (const_int 0 [0]) (const_int 2 [0x2]) ]))) (sign_extend:V2DI (vec_select:V2SI (reg:V4SI 21 xmm0 [77]) (parallel [ (const_int 0 [0]) (const_int 2 [0x2]) ])))) (reg:V2DI 23 xmm2 [75]))) 1794 {xop_pmacsdqh} (expr_list:REG_DEAD (reg:V4SI 22 xmm1 [76]) (expr_list:REG_DEAD (reg:V4SI 21 xmm0 [77]) (nil)))) So, for this to do the right thing, the mult-acc patterns idea of component 0 and 2 must be so that components 1 and 3 are transformed into 0 and 2 by a right shift of 32 bits. Tracing the whole thing in gdb reveals some miscommunication in that idea. The two input values at start of loop: % $xmm0.v4_int32 {1215505350, 1491885311, -676627251, -515498245} % $xmm1.v4_int32 {-1737931417, -1807033263, 1488592681, -1444724238} % $xmm2.v2_int64 {0, 0} The inputs from the arrays: % c2[0]@4 {-1737931417, -1807033263, 1488592681, -1444724238} % c3[0]@4 {1215505350, 1491885311, -676627251, -515498245} Now after the first mult-acc: % $xmm2.v2_int64 {-2695886381558099793, 744752809197962310} That is, it multiplied components 1 and 3, not 0 and 2 (that is, c2[1]*c3[1] and c2[3]*c3[3]). Now the two right shifts come and we have: % $xmm0.v4_int32 {1491885311, -676627251, -515498245, 0} % $xmm1.v4_int32 {-1807033263, 1488592681, -1444724238, 0} Doing again the mult-acc (using component 1 and 3) now leads to wrong results. The original component 0 is shifted out already, and component 3 is the zero that was shifted in. >>From the description of the xop instruction (that talk about second and fourth component for vpmacsdqh) it seems that the vpmacsdqh pattern should refer to components 1 and 3 (and hence not be used in this case when the shift stay). Actually with xop the shifts aren't necessary and both {0,2} and {1,3} mult-accs could be done with vpmacsdql and vpmacsdqh.