From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 027A23882053; Thu, 13 Jun 2024 15:27:07 +0000 (GMT)
From: "carll at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/115466] rs6000 vec_ld built-in works on BE but not LE
Date: Thu, 13 Jun 2024 15:27:07 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: wrong-code
X-Bugzilla-Severity: normal
X-Bugzilla-Who: carll at gcc dot gnu.org
X-Bugzilla-Status: WAITING
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-115466-4-mdJ9w3ID6s@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-115466-4@http.gcc.gnu.org/bugzilla/>
References: <bug-115466-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115466
--- Comment #4 from Carl Love <carll at gcc dot gnu.org> ---
comment 1

Yes, I can confirm that if I add the alignment statement to the
declarations, it works fine.

I originally tried to use the built-in as part of a rewrite of a function in
the Milvus AI source code.  The function fvec_L2sqr_batch_4_ref() accounts for
about 90% of the workload run time.  The code is:

void fvec_L2sqr_batch_4_ref_v1(const float* x, const float* y0, const float* y1,
   const float* y2, const float* y3, const size_t d, float& dis0, float& dis1,
   float& dis2, float& dis3) {

    float d0 = 0;
    float d1 = 0;
    float d2 = 0;
    float d3 = 0;
    for (size_t i = 0; i < d; ++i) {
        const float q0 = x[i] - y0[i];
        const float q1 = x[i] - y1[i];
        const float q2 = x[i] - y2[i];
        const float q3 = x[i] - y3[i];

        d0 += q0 * q0;
        d1 += q1 * q1;
        d2 += q2 * q2;
        d3 += q3 * q3;
    }
    dis0 = d0;
    dis1 = d1;
    dis2 = d2;
    dis3 = d3;
}

When compiled with -O3, it does generate VSX instructions, but I noticed that
it was not using the VSX multiply-add instructions.  So I tried rewriting it
to explicitly load a vector with the vec_ld built-in, followed by vec_sub and
vec_madd.  That rewrite, by the way, gives a 45% reduction in the execution
time for my standalone test.

At this point, I am not sure where the arguments get defined, so adding the
alignment to the declarations is not so easy.
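
For illustration, here is a minimal sketch of what "adding the alignment to
the declaration" can look like, together with a run-time check that a pointer
really sits on a 16-byte boundary.  The array name `query` and the helper
`is_16_byte_aligned` are hypothetical and are not taken from the Milvus
sources.

```cpp
#include <cstdint>

// Hypothetical helper: true when p lies on a 16-byte boundary, which is
// what a vec_ld-style aligned load expects.
inline bool is_16_byte_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical array.  alignas(16) is the standard C++11 spelling; GCC also
// accepts __attribute__((aligned(16))) on the declaration.
alignas(16) static float query[8] = {0, 1, 2, 3, 4, 5, 6, 7};
```

A check such as `is_16_byte_aligned(query)` is cheap enough to place in a
debug assertion at the top of a function that receives the pointers, which
helps when the declarations themselves are hard to track down.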

That said, Peter mentioned the vec_xl built-in, which does seem to work.  From
what I see in the PVIPR, vec_xl does not require the data to be aligned.
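
The following is only a sketch of how such a rewrite might look with vec_xl,
vec_sub and vec_madd; it is not the code from this report or from Milvus.
The function name `l2sqr_batch_4_sketch` is hypothetical, the vector path is
compiled only when the compiler defines `__VSX__`, and a scalar loop handles
the tail (and everything on other targets).

```cpp
#include <cstddef>

#if defined(__VSX__)
#include <altivec.h>
// altivec.h may define these as macros; undefine them for C++ compatibility.
#undef vector
#undef pixel
#undef bool
#endif

// Hypothetical sketch: squared L2 distance from x to four vectors y0..y3.
void l2sqr_batch_4_sketch(const float* x, const float* y0, const float* y1,
                          const float* y2, const float* y3,
                          const std::size_t d, float& dis0, float& dis1,
                          float& dis2, float& dis3) {
    float d0 = 0, d1 = 0, d2 = 0, d3 = 0;
    std::size_t i = 0;
#if defined(__VSX__)
    __vector float acc0 = vec_splats(0.0f);
    __vector float acc1 = vec_splats(0.0f);
    __vector float acc2 = vec_splats(0.0f);
    __vector float acc3 = vec_splats(0.0f);
    // Four floats per iteration; vec_xl does not need 16-byte alignment.
    for (; i + 4 <= d; i += 4) {
        const __vector float vx = vec_xl(0, x + i);
        const __vector float q0 = vec_sub(vx, vec_xl(0, y0 + i));
        const __vector float q1 = vec_sub(vx, vec_xl(0, y1 + i));
        const __vector float q2 = vec_sub(vx, vec_xl(0, y2 + i));
        const __vector float q3 = vec_sub(vx, vec_xl(0, y3 + i));
        acc0 = vec_madd(q0, q0, acc0);  // acc0 += q0 * q0, element-wise
        acc1 = vec_madd(q1, q1, acc1);
        acc2 = vec_madd(q2, q2, acc2);
        acc3 = vec_madd(q3, q3, acc3);
    }
    // Horizontal sums via the GCC/Clang vector subscript extension.
    d0 = acc0[0] + acc0[1] + acc0[2] + acc0[3];
    d1 = acc1[0] + acc1[1] + acc1[2] + acc1[3];
    d2 = acc2[0] + acc2[1] + acc2[2] + acc2[3];
    d3 = acc3[0] + acc3[1] + acc3[2] + acc3[3];
#endif
    // Scalar tail: the remaining d % 4 elements, or all of them without VSX.
    for (; i < d; ++i) {
        const float q0 = x[i] - y0[i];
        const float q1 = x[i] - y1[i];
        const float q2 = x[i] - y2[i];
        const float q3 = x[i] - y3[i];
        d0 += q0 * q0;
        d1 += q1 * q1;
        d2 += q2 * q2;
        d3 += q3 * q3;
    }
    dis0 = d0;
    dis1 = d1;
    dis2 = d2;
    dis3 = d3;
}
```

Note that the vector path adds the products in a different order than the
scalar loop, so the results can differ in the last bits for general
floating-point inputs.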


comment 3, Segher

I was looking at the PVIPR document when I chose the built-in for my
rewrite.  Looking at the documentation, it does say "Load a 16-byte vector
from the memory address specified by the displacement and the pointer,
ignoring the low-order bits of the calculated address."  In retrospect, I
should have picked up on the ignoring of the low-order bits as implying that
the addresses need to be aligned.  It would really be good if the
documentation explicitly said the data must be 16-byte aligned.  That said,
my bad for not reading/understanding the documentation well enough.
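
To make "ignoring the low-order bits" concrete, here is a small portable
model of the address calculation only.  It is an assumption-laden sketch:
`Vec4`, `model_vec_ld` and `model_vec_xl` are hypothetical names, the model
assumes the ignored bits are the low four (a 16-byte boundary), and it says
nothing about element ordering within the register on BE versus LE.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a 16-byte vector register.
struct Vec4 {
    float v[4];
};

// Model of an aligned load: the effective address is rounded down to a
// multiple of 16 before the 16 bytes are read.  The caller must make sure
// that the rounded-down block lies inside a valid object.
inline Vec4 model_vec_ld(const float* p) {
    const std::uintptr_t ea =
        reinterpret_cast<std::uintptr_t>(p) & ~static_cast<std::uintptr_t>(15);
    Vec4 out;
    std::memcpy(out.v, reinterpret_cast<const void*>(ea), sizeof out.v);
    return out;
}

// Model of an unaligned load: the 16 bytes are read starting exactly at p.
inline Vec4 model_vec_xl(const float* p) {
    Vec4 out;
    std::memcpy(out.v, p, sizeof out.v);
    return out;
}
```

With a 16-byte-aligned buffer `buf` holding 0, 1, 2, ..., the call
`model_vec_ld(buf + 1)` silently returns the four floats starting at `buf`,
while `model_vec_xl(buf + 1)` returns the four floats starting at `buf + 1`.
That silent rounding is why an unaligned pointer gives wrong results rather
than a fault.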

The vec_xl documentation does not say anything about ignoring the low-order
bits of the address.  So, in my case, that is a better load built-in to use,
since I don't have to try to find all the declarations for the arrays that
could be passed to the function.

It would be great to update the PVIPR to be more explicit about the
alignment requirements.


Sorry, everyone, for the noise.  I think we can close the issue as "User
error".