[Bug target/115466] New: rs6000 vec_ld built-in works on BE but not LE

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug target/115466] New: rs6000 vec_ld built-in works on BE but not LE
@ 2024-06-12 21:13 carll at gcc dot gnu.org
  2024-06-12 21:25 ` [Bug target/115466] " pinskia at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: carll at gcc dot gnu.org @ 2024-06-12 21:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115466

            Bug ID: 115466
           Summary: rs6000 vec_ld built-in works on BE but not LE
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: carll at gcc dot gnu.org
  Target Milestone: ---

Created attachment 58416
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58416&action=edit
vec_ld test program

The attached test program for the IBM rs6000 platform tests the vec_ld built-in
for vector ints and vector floats.  The output is correct on a BE system and
incorrect on an LE system.

gcc vec_ld_test.c -o vec_ld_test

./vec_ld_test

BE output of test program:
 ./vec_ld_test
ia[0-3] = 1, 2, 3, 4
via (should match ia[0-3]) = 1, 2, 3, 4
ia[4-7] = 5, 6, 7, 8
via (should match ia[4-7]) = 5, 6, 7, 8

fa = 10.000000, 20.000000, 30.000000, 40.000000
vfa (should match fa[0-3]) = 10.000000, 20.000000, 30.000000, 40.000000
vfa (should match fa[4-7]) = 50.000000, 60.000000, 70.000000, 80.000000

vec_ld (0, ia) = 1, 2, 3, 4
vec_ld (1, ia) = 1, 2, 3, 4
vec_ld (2, ia) = 1, 2, 3, 4
vec_ld (3, ia) = 1, 2, 3, 4
vec_ld (4, ia) = 1, 2, 3, 4
vec_ld (5, ia) = 1, 2, 3, 4
vec_ld (6, ia) = 1, 2, 3, 4
vec_ld (7, ia) = 1, 2, 3, 4
vec_ld (8, ia) = 1, 2, 3, 4
vec_ld (9, ia) = 1, 2, 3, 4
vec_ld (10, ia) = 1, 2, 3, 4
vec_ld (11, ia) = 1, 2, 3, 4
vec_ld (12, ia) = 1, 2, 3, 4
vec_ld (13, ia) = 1, 2, 3, 4
vec_ld (14, ia) = 1, 2, 3, 4
vec_ld (15, ia) = 1, 2, 3, 4
vec_ld (16, ia) = 5, 6, 7, 8
vec_ld (17, ia) = 5, 6, 7, 8
vec_ld (18, ia) = 5, 6, 7, 8
vec_ld (19, ia) = 5, 6, 7, 8
vec_ld (20, ia) = 5, 6, 7, 8
vec_ld (21, ia) = 5, 6, 7, 8
vec_ld (22, ia) = 5, 6, 7, 8
vec_ld (23, ia) = 5, 6, 7, 8
vec_ld (24, ia) = 5, 6, 7, 8
vec_ld (25, ia) = 5, 6, 7, 8
vec_ld (26, ia) = 5, 6, 7, 8
vec_ld (27, ia) = 5, 6, 7, 8
vec_ld (28, ia) = 5, 6, 7, 8
vec_ld (29, ia) = 5, 6, 7, 8
vec_ld (30, ia) = 5, 6, 7, 8
vec_ld (31, ia) = 5, 6, 7, 8

The same test program compliled and run on an LE machine:

gcc vec_ld_test.c -o vec_ld_test

./vec_ld_test

ia[0-3] = 1, 2, 3, 4
via (should match ia[0-3]) = -552597992, 32767, 1, 2
ia[4-7] = 5, 6, 7, 8
via (should match ia[4-7]) = 3, 4, 5, 6

fa = 10.000000, 20.000000, 30.000000, 40.000000
vfa (should match fa[0-3]) = 0.000000, 0.000000, 10.000000, 20.000000
vfa (should match fa[4-7]) = 30.000000, 40.000000, 50.000000, 60.000000

vec_ld (0, ia) = -552597992, 32767, 1, 2
vec_ld (1, ia) = -552597992, 32767, 1, 2
vec_ld (2, ia) = -552597992, 32767, 1, 2
vec_ld (3, ia) = -552597992, 32767, 1, 2
vec_ld (4, ia) = -552597992, 32767, 1, 2
vec_ld (5, ia) = -552597992, 32767, 1, 2
vec_ld (6, ia) = -552597992, 32767, 1, 2
vec_ld (7, ia) = -552597992, 32767, 1, 2
vec_ld (8, ia) = 3, 4, 5, 6
vec_ld (9, ia) = 3, 4, 5, 6
vec_ld (10, ia) = 3, 4, 5, 6
vec_ld (11, ia) = 3, 4, 5, 6
vec_ld (12, ia) = 3, 4, 5, 6
vec_ld (13, ia) = 3, 4, 5, 6
vec_ld (14, ia) = 3, 4, 5, 6
vec_ld (15, ia) = 3, 4, 5, 6
vec_ld (16, ia) = 3, 4, 5, 6
vec_ld (17, ia) = 3, 4, 5, 6
vec_ld (18, ia) = 3, 4, 5, 6
vec_ld (19, ia) = 3, 4, 5, 6
vec_ld (20, ia) = 3, 4, 5, 6
vec_ld (21, ia) = 3, 4, 5, 6
vec_ld (22, ia) = 3, 4, 5, 6
vec_ld (23, ia) = 3, 4, 5, 6
vec_ld (24, ia) = 7, 8, 1092616192, 1101004800
vec_ld (25, ia) = 7, 8, 1092616192, 1101004800
vec_ld (26, ia) = 7, 8, 1092616192, 1101004800
vec_ld (27, ia) = 7, 8, 1092616192, 1101004800
vec_ld (28, ia) = 7, 8, 1092616192, 1101004800
vec_ld (29, ia) = 7, 8, 1092616192, 1101004800
vec_ld (30, ia) = 7, 8, 1092616192, 1101004800
vec_ld (31, ia) = 7, 8, 1092616192, 1101004800

FYI, I also wrote a test to check the vec_st built-in.  That seems to work fine
on both LE and BE.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/115466] rs6000 vec_ld built-in works on BE but not LE
  2024-06-12 21:13 [Bug target/115466] New: rs6000 vec_ld built-in works on BE but not LE carll at gcc dot gnu.org
@ 2024-06-12 21:25 ` pinskia at gcc dot gnu.org
  2024-06-13  8:10 ` linkw at gcc dot gnu.org
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-06-12 21:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115466

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2024-06-12
             Status|UNCONFIRMED                 |WAITING

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
>  int ia[8] = {1, 2, 3, 4, 5, 6, 7, 8};
>  float fa[8] = {10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0};

The way vec_ld works is is (a+b)&~0xf is the address which is being loaded. 

Both ia and fa are not specified as being aligned to the 16byte boundary so it
could be loading from before hand.

What happens if you do:
  int ia[8] __attribute__((aligned(16))) = {1, 2, 3, 4, 5, 6, 7, 8};
  float fa[8] __attribute__((aligned(16))) = {10.0, 20.0, 30.0, 40.0, 50.0,
60.0, 70.0, 80.0};

Instead?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/115466] rs6000 vec_ld built-in works on BE but not LE
  2024-06-12 21:13 [Bug target/115466] New: rs6000 vec_ld built-in works on BE but not LE carll at gcc dot gnu.org
  2024-06-12 21:25 ` [Bug target/115466] " pinskia at gcc dot gnu.org
@ 2024-06-13  8:10 ` linkw at gcc dot gnu.org
  2024-06-13 13:56 ` segher at gcc dot gnu.org
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: linkw at gcc dot gnu.org @ 2024-06-13  8:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115466

Kewen Lin <linkw at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |linkw at gcc dot gnu.org

--- Comment #2 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #1)
> >  int ia[8] = {1, 2, 3, 4, 5, 6, 7, 8};
> >  float fa[8] = {10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0};
> 
> The way vec_ld works is is (a+b)&~0xf is the address which is being loaded. 
> 
> Both ia and fa are not specified as being aligned to the 16byte boundary so
> it could be loading from before hand.
> 
> What happens if you do:
>   int ia[8] __attribute__((aligned(16))) = {1, 2, 3, 4, 5, 6, 7, 8};
>   float fa[8] __attribute__((aligned(16))) = {10.0, 20.0, 30.0, 40.0, 50.0,
> 60.0, 70.0, 80.0};
> 
> Instead?

Good point, I can't reproduce the reported issue on two LE machines, so I'd
leave Carl to confirm. But from the output (biasing two elements from the
expected), I believe this is the root cause.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/115466] rs6000 vec_ld built-in works on BE but not LE
  2024-06-12 21:13 [Bug target/115466] New: rs6000 vec_ld built-in works on BE but not LE carll at gcc dot gnu.org
  2024-06-12 21:25 ` [Bug target/115466] " pinskia at gcc dot gnu.org
  2024-06-13  8:10 ` linkw at gcc dot gnu.org
@ 2024-06-13 13:56 ` segher at gcc dot gnu.org
  2024-06-13 15:27 ` carll at gcc dot gnu.org
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: segher at gcc dot gnu.org @ 2024-06-13 13:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115466

Segher Boessenkool <segher at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |segher at gcc dot gnu.org

--- Comment #3 from Segher Boessenkool <segher at gcc dot gnu.org> ---
Since people still make this mistake, perhaps we need to make people more aware
of this fact in the documentation?  "The memory argument to vec_ld and vec_st
has to be 16-byte aligned", something short like that.  If people see it often
enough it will finally click.

Carl, can you add something maybe?  In some place you would have seen it :-)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/115466] rs6000 vec_ld built-in works on BE but not LE
  2024-06-12 21:13 [Bug target/115466] New: rs6000 vec_ld built-in works on BE but not LE carll at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2024-06-13 13:56 ` segher at gcc dot gnu.org
@ 2024-06-13 15:27 ` carll at gcc dot gnu.org
  2024-06-13 18:36 ` segher at gcc dot gnu.org
  2024-06-13 18:44 ` carll at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: carll at gcc dot gnu.org @ 2024-06-13 15:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115466

--- Comment #4 from Carl Love <carll at gcc dot gnu.org> ---
comment 1

Yes, I can confirm if I add the alignment statement to the declarations it
works fine.  

I originally tried to use the built-in as part of a re-write of a function in
the Milvus AI source code.  The function fvec_L2sqr_batch_4_ref() accounts for
about 90% of the workload run time.  The code is:

fvec_L2sqr_batch_4_ref_v1(const float* x, const float* y0, const float* y1, 
   const float* y2, const float* y3, const size_t d, float& dis0, float& dis1,
   float& dis2, float& dis3) {

    float d0 = 0;
    float d1 = 0;
    float d2 = 0;
    float d3 = 0;
    for (size_t i = 0; i < d; ++i) {
        const float q0 = x[i] - y0[i];
        const float q1 = x[i] - y1[i];
        const float q2 = x[i] - y2[i];
        const float q3 = x[i] - y3[i];

        d0 += q0 * q0;
        d1 += q1 * q1;
        d2 += q2 * q2;
        d3 += q3 * q3;
    }
    dis0 = d0;
    dis1 = d1;
    dis2 = d2;
    dis3 = d3;
}

When compiled with -O3, it does generate vsx instructions.  But I noticed that
it was not using the vsx multiply add instructions.  So, I tried rewriting it
to explicitly load a vector with the vec_ld built-in, followed by vec_sub and
vec_madd.  Which by the way gives a 45% reduction in the execution time for my
standalone test.  

At this point, not sure where the arguments get defined so adding the alignment
to the declaration is not so easy. 

That said, Peter mentioned the vec_xl built-in which does seem to work.  The
vec_xl does not require the data to be aligned from what I see in the PVIPR.

comment 3, Segher

I was looking at the PVIPR document when I chose the built-in for my rewrite. 
Looking at the documentation, it does say "Load a 16-byte vector from the
memory address specified by the displacement and the pointer, ignoring the
low-order bits of the calculated address."  In retrospect, I should have picked
up on the ignoring of the low-order bits to imply the addresses needed  to be
aligned.  It would really be good if the documentation explicitly said the data
must be 16-bye aligned.  That said, my bad for not reading/understanding the
documentation well enough.

The vec_xl documentation does not say anything about ignoring the lower bits of
the address.  So, in my case that is a better load built-in to use so I don't
have to try and find all the declarations for the arrays that could be passed
to the function.

It would be great to update the PVIPR to be more explicit about the alignment
needs.

Sorry everyone for the noise.  I think we can close the issue as "User error".

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/115466] rs6000 vec_ld built-in works on BE but not LE
  2024-06-12 21:13 [Bug target/115466] New: rs6000 vec_ld built-in works on BE but not LE carll at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2024-06-13 15:27 ` carll at gcc dot gnu.org
@ 2024-06-13 18:36 ` segher at gcc dot gnu.org
  2024-06-13 18:44 ` carll at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: segher at gcc dot gnu.org @ 2024-06-13 18:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115466

--- Comment #5 from Segher Boessenkool <segher at gcc dot gnu.org> ---
The GCC documentation says

> Note that the 'vec_ld' and 'vec_st' built-in functions always generate
> the AltiVec 'LVX' and 'STVX' instructions even if the VSX instruction
> set is available. 

(which means it explicitly does the align-down thing, the &-16 thing),
which doesn't happen with VSX insns.

This comment is a great place to add an "including address masking" thing, if I
can tempt you :-)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/115466] rs6000 vec_ld built-in works on BE but not LE
  2024-06-12 21:13 [Bug target/115466] New: rs6000 vec_ld built-in works on BE but not LE carll at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2024-06-13 18:36 ` segher at gcc dot gnu.org
@ 2024-06-13 18:44 ` carll at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: carll at gcc dot gnu.org @ 2024-06-13 18:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115466

--- Comment #6 from Carl Love <carll at gcc dot gnu.org> ---
comment 5, Segher

Yes, I have some PVIPR changes that we have talked about previously.  I will
create a patch to also make the changes discussed here.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-06-13 18:44 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-12 21:13 [Bug target/115466] New: rs6000 vec_ld built-in works on BE but not LE carll at gcc dot gnu.org
2024-06-12 21:25 ` [Bug target/115466] " pinskia at gcc dot gnu.org
2024-06-13  8:10 ` linkw at gcc dot gnu.org
2024-06-13 13:56 ` segher at gcc dot gnu.org
2024-06-13 15:27 ` carll at gcc dot gnu.org
2024-06-13 18:36 ` segher at gcc dot gnu.org
2024-06-13 18:44 ` carll at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).