From: "carll at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/115466] rs6000 vec_ld built-in works on BE but not LE
Date: Thu, 13 Jun 2024 15:27:07 +0000
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: wrong-code
X-Bugzilla-Severity: normal
X-Bugzilla-Status: WAITING
X-Bugzilla-Priority: P3

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115466

--- Comment #4 from Carl Love ---
comment 1

Yes, I can confirm that if I add the alignment statement to the declarations, it works fine.

I originally tried to use the built-in as part of a rewrite of a function in the Milvus AI source code. The function fvec_L2sqr_batch_4_ref() accounts for about 90% of the workload run time.
The code is:

  void fvec_L2sqr_batch_4_ref_v1(const float* x, const float* y0, const float* y1,
          const float* y2, const float* y3, const size_t d, float& dis0, float& dis1,
          float& dis2, float& dis3) {
      float d0 = 0;
      float d1 = 0;
      float d2 = 0;
      float d3 = 0;
      for (size_t i = 0; i < d; ++i) {
          const float q0 = x[i] - y0[i];
          const float q1 = x[i] - y1[i];
          const float q2 = x[i] - y2[i];
          const float q3 = x[i] - y3[i];
          d0 += q0 * q0;
          d1 += q1 * q1;
          d2 += q2 * q2;
          d3 += q3 * q3;
      }
      dis0 = d0;
      dis1 = d1;
      dis2 = d2;
      dis3 = d3;
  }

When compiled with -O3, it does generate VSX instructions, but I noticed that it was not using the VSX multiply-add instructions. So I tried rewriting it to explicitly load a vector with the vec_ld built-in, followed by vec_sub and vec_madd. By the way, that gives a 45% reduction in the execution time for my standalone test.

At this point, I am not sure where the arguments get defined, so adding the alignment to the declarations is not so easy.

That said, Peter mentioned the vec_xl built-in, which does seem to work. From what I see in the PVIPR, vec_xl does not require the data to be aligned.

comment 3, Segher

I was looking at the PVIPR document when I chose the built-in for my rewrite. The documentation does say "Load a 16-byte vector from the memory address specified by the displacement and the pointer, ignoring the low-order bits of the calculated address." In retrospect, I should have picked up that ignoring the low-order bits implies the addresses need to be aligned. It would really be good if the documentation explicitly said the data must be 16-byte aligned. That said, my bad for not reading/understanding the documentation well enough.

The vec_xl documentation does not say anything about ignoring the lower bits of the address.
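To illustrate what "ignoring the low-order bits of the calculated address" means in practice, here is a small portable C++ sketch. It only models the address arithmetic and does not call the built-ins. The helper names truncated_load_address and is_16_byte_aligned are hypothetical, and the mask of 15 assumes the 16-byte vector size quoted from the PVIPR.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of any library): models the address a
// vec_ld-style load would read from, i.e. pointer + displacement with the
// low four bits cleared, so the result is always a 16-byte boundary.
inline std::uintptr_t truncated_load_address(const void* p,
                                             std::ptrdiff_t displacement = 0) {
    const std::uintptr_t ea = reinterpret_cast<std::uintptr_t>(p) +
                              static_cast<std::uintptr_t>(displacement);
    return ea & ~static_cast<std::uintptr_t>(15);
}

// Hypothetical helper: true when p already sits on a 16-byte boundary, in
// which case the truncation above changes nothing.
inline bool is_16_byte_aligned(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15) == 0;
}
```

With `alignas(16) float buf[8]`, truncated_load_address(buf + 1) is the address of buf itself, so a load that ignores the low-order bits would return buf[0..3] rather than buf[1..4]. A load that uses the full address, as the vec_xl documentation suggests, would start at buf + 1.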
So, in my case vec_xl is the better load built-in to use, since I don't have to try to find all the declarations for the arrays that could be passed to the function.

It would be great to update the PVIPR to be more explicit about the alignment needs.

Sorry everyone for the noise. I think we can close the issue as "User error".
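For reference, a rewrite along the lines described in this comment (vec_xl loads followed by vec_sub and vec_madd) might look roughly like the sketch below. It is not the actual rewrite from this report. The name l2sqr_batch_4_vsx_sketch is hypothetical, the vector path is only compiled when __VSX__ is defined, and a scalar loop handles the tail as well as builds without VSX. Because the vector path sums in a different order than the scalar loop, results may differ in the last bits for general inputs.

```cpp
#include <cstddef>

#if defined(__VSX__)
#include <altivec.h>
// altivec.h may define these as macros; undefine them for C++ compatibility.
#undef vector
#undef pixel
#undef bool
#endif

// Hypothetical sketch: squared L2 distance from x to each of y0..y3.
void l2sqr_batch_4_vsx_sketch(const float* x, const float* y0, const float* y1,
                              const float* y2, const float* y3, std::size_t d,
                              float& dis0, float& dis1, float& dis2, float& dis3) {
    float d0 = 0, d1 = 0, d2 = 0, d3 = 0;
    std::size_t i = 0;
#if defined(__VSX__)
    __vector float a0 = vec_splats(0.0f);
    __vector float a1 = vec_splats(0.0f);
    __vector float a2 = vec_splats(0.0f);
    __vector float a3 = vec_splats(0.0f);
    for (; i + 4 <= d; i += 4) {
        // vec_xl uses the full address, so the inputs need not be 16-byte aligned.
        const __vector float vx = vec_xl(0, x + i);
        const __vector float q0 = vec_sub(vx, vec_xl(0, y0 + i));
        const __vector float q1 = vec_sub(vx, vec_xl(0, y1 + i));
        const __vector float q2 = vec_sub(vx, vec_xl(0, y2 + i));
        const __vector float q3 = vec_sub(vx, vec_xl(0, y3 + i));
        a0 = vec_madd(q0, q0, a0);  // a0 += q0 * q0, element-wise
        a1 = vec_madd(q1, q1, a1);
        a2 = vec_madd(q2, q2, a2);
        a3 = vec_madd(q3, q3, a3);
    }
    // Horizontal sums of the four lanes of each accumulator.
    d0 = a0[0] + a0[1] + a0[2] + a0[3];
    d1 = a1[0] + a1[1] + a1[2] + a1[3];
    d2 = a2[0] + a2[1] + a2[2] + a2[3];
    d3 = a3[0] + a3[1] + a3[2] + a3[3];
#endif
    // Scalar tail (and the whole loop when VSX is not available).
    for (; i < d; ++i) {
        const float q0 = x[i] - y0[i];
        const float q1 = x[i] - y1[i];
        const float q2 = x[i] - y2[i];
        const float q3 = x[i] - y3[i];
        d0 += q0 * q0;
        d1 += q1 * q1;
        d2 += q2 * q2;
        d3 += q3 * q3;
    }
    dis0 = d0;
    dis1 = d1;
    dis2 = d2;
    dis3 = d3;
}
```

The signature mirrors fvec_L2sqr_batch_4_ref_v1, so the sketch could be compared against the reference loop on the same inputs, including pointers that are deliberately offset from a 16-byte boundary.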