is -O2 breaking sse2 alignment?

public inbox for gcc-help@gcc.gnu.org

* is -O2 breaking sse2 alignment?
@ 2008-03-12 23:31 JP Fournier
  2008-03-13  0:28 ` Brian Dessent
  0 siblings, 1 reply; 7+ messages in thread
From: JP Fournier @ 2008-03-12 23:31 UTC (permalink / raw)
  To: gcc-help


Hi All.

In the example below, compiling with -O2 results in incorrect output 
from the program.  -O seems OK.  Am I missing something alignment wise 
(or otherwise) or is -O2 breaking my alignment?

If I use  _mm_storeu_si128 then both -O2 and -O work as expected.

Any thoughts appreciated.

jp

------

bash-3.1$ gcc -O -msse2 -o sse2 sse2.c
bash-3.1$ ./sse2
c0=2 c1=2
bash-3.1$ gcc -O2 -msse2 -o sse2 sse2.c
bash-3.1$ ./sse2
c0=0 c1=0
bash-3.1$ gcc --version
gcc (GCC) 4.1.2
bash-3.1$ uname -a
Linux puma 2.6.22.8 #1 SMP Tue Sep 25 20:41:25 BST 2007 x86_64 x86_64 
x86_64 GNU/Linux

sse2.c:

#include <stdio.h>
#include <emmintrin.h>

void test_int() {

        // array of 2 8 byte ints
        long int *a  = _mm_malloc(16, 16);
        long int *b  = _mm_malloc(16, 16);
        long int *c  = _mm_malloc(16, 16);

        __m128i ai __attribute__ ((aligned (16)));
        __m128i bi __attribute__ ((aligned (16)));
        __m128i ci __attribute__ ((aligned (16)));

        a[0] = a[1] = 1;
        b[0] = b[1] = 1;
        c[0] = c[1] = 0;

        ai = _mm_load_si128( (__m128i *) (void*)a );
        bi = _mm_load_si128( (__m128i *) (void*)b );

        ci = _mm_add_epi8( ai, bi );
        _mm_store_si128( (__m128i *) (void*)c, ci );
        printf("c0=%ld c1=%ld\n", c[0], c[1] );
}

int main( int count, char ** args ) {
     test_int();
     return 0;
}







* Re: is -O2 breaking sse2 alignment?
  2008-03-12 23:31 is -O2 breaking sse2 alignment? JP Fournier
@ 2008-03-13  0:28 ` Brian Dessent
  2008-03-14  0:13   ` JP Fournier
  0 siblings, 1 reply; 7+ messages in thread
From: Brian Dessent @ 2008-03-13  0:28 UTC (permalink / raw)
  To: JP Fournier; +Cc: gcc-help

JP Fournier wrote:

> In the example below, compiling with -O2 results in incorrect output
> from the program.  -O seems OK.  Am I missing something alignment wise
> (or otherwise) or is -O2 breaking my alignment?

If it was an alignment problem you'd most likely be getting a
segmentation fault.  The __m128i type should already include the proper
alignment so you don't need the __attribute__((aligned (16))) stuff.

>         // array of 2 8 byte ints
>         long int *a  = _mm_malloc(16, 16);
>         long int *b  = _mm_malloc(16, 16);
>         long int *c  = _mm_malloc(16, 16);
> 
>         __m128i ai __attribute__ ((aligned (16)));
>         __m128i bi __attribute__ ((aligned (16)));
>         __m128i ci __attribute__ ((aligned (16)));
> 
>         a[0] = a[1] = 1;
>         b[0] = b[1] = 1;
>         c[0] = c[1] = 0;
> 
>         ai = _mm_load_si128( (__m128i *) (void*)a );
>         bi = _mm_load_si128( (__m128i *) (void*)b );
> 
>         ci = _mm_add_epi8( ai, bi );
>         _mm_store_si128( (__m128i *) (void*)c, ci );
>         printf("c0=%ld c1=%ld\n", c[0], c[1] );
> }

You're violating the C aliasing rules.  You can't store through a cast
pointer like that.  You also don't have to do the load/store; the
compiler knows what you want when you use a union instead:

  union { __m128i v; long l[2]; } a, b, c;

   a.l[0] = a.l[1] = 1;
   b.l[0] = b.l[1] = 1;

   c.v = _mm_add_epi8 (a.v, b.v);
   printf("c0=%ld c1=%ld\n", c.l[0], c.l[1]);

There's an even more natural way to do this though using gcc's built-in
vector extensions without any of the Intel mmintrin.h stuff.  This way
will result in code that will vectorize to altivec, sse2, spu, whatever
the machine supports, it's not hardware specific:

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si a = { 1, 2, 3, 4 }, b = { 5, 6, 7, 8 }, c;

  c = a + b;

You can use all the normal C operators like + and * as if they were
scalars but they will be compiled using the corresponding SIMD
instructions.  See
<http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html> for more.  If
you want access to the individual parts you can again use the union,
e.g.

  union { v4si v; int i[4]; } u;

  u.v = a + b;

  printf ("%d,%d,%d,%d\n", u.i[0], u.i[1], u.i[2], u.i[3]);
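
Put together as a complete function, the fragments above might look
like the sketch below.  The name add_v4si is made up for illustration,
and the code assumes gcc (or a compatible compiler such as clang) and a
4-byte int, which the assert checks:

```c
#include <assert.h>

/* Four ints packed into one 16-byte vector (gcc vector extension). */
typedef int v4si __attribute__ ((vector_size (16)));

/* Hypothetical helper: adds two 4-element int arrays lane by lane. */
void add_v4si (const int a[4], const int b[4], int out[4])
{
  union { v4si v; int i[4]; } x, y, r;
  int k;

  /* Assumption: the vector holds exactly four ints. */
  assert (sizeof (v4si) == 4 * sizeof (int));

  for (k = 0; k < 4; k++)
    {
      x.i[k] = a[k];
      y.i[k] = b[k];
    }

  /* A single vector add where the target has one; otherwise gcc
     falls back to scalar code.  */
  r.v = x.v + y.v;

  for (k = 0; k < 4; k++)
    out[k] = r.i[k];
}
```

Called with { 1, 2, 3, 4 } and { 5, 6, 7, 8 } it fills out with
6, 8, 10, 12.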

Brian


* Re: is -O2 breaking sse2 alignment?
  2008-03-13  0:28 ` Brian Dessent
@ 2008-03-14  0:13   ` JP Fournier
  2008-03-14 19:31     ` Brian Budge
  2008-03-15  0:10     ` Maximillian Murphy
  0 siblings, 2 replies; 7+ messages in thread
From: JP Fournier @ 2008-03-14  0:13 UTC (permalink / raw)
  To: gcc-help

Brian Dessent wrote:
> 
> You're violating the C aliasing rules.  You can't store through a cast
> pointer like that.  You also don't have to do the load/store; the
> compiler knows what you want when you use a union instead:
> 
>   union { __m128i v; long l[2]; } a, b, c;
> 
>    a.l[0] = a.l[1] = 1;
>    b.l[0] = b.l[1] = 1;
> 
>    c.v = _mm_add_epi8 (a.v, b.v);
>    printf("c0=%ld c1=%ld\n", c.l[0], c.l[1]);

Many Thanks Brian.  My little program now behaves better:

bash-3.1$ gcc -O2 -msse2 -o sse2 sse2-1.c
bash-3.1$ ./sse2
c0=2 c1=2

#include <stdio.h>
#include <emmintrin.h>

void test_int() {
        union { __m128i v; long l[2]; } a, b, c;

        a.l[0] = a.l[1] = 1;
        b.l[0] = b.l[1] = 1;
        c.l[0] = c.l[1] = 0;

        c.v = _mm_add_epi8( a.v, b.v );
        printf("c0=%ld c1=%ld\n", c.l[0], c.l[1] );
}
int main( int count, char ** args ) {
     test_int();
     return 0;
}


> 
> There's an even more natural way to do this though using gcc's built-in
> vector extensions without any of the Intel mmintrin.h stuff.  This way
> will result in code that will vectorize to altivec, sse2, spu, whatever
> the machine supports, it's not hardware specific:
> 
>   typedef int v4si __attribute__ ((vector_size (16)));
> 
>   v4si a = { 1, 2, 3, 4 }, b = { 5, 6, 7, 8 }, c;
> 
>   c = a + b;
> 
> You can use all the normal C operators like + and * as if they were
> scalars but they will be compiled using the corresponding SIMD
> instructions.  See
> <http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html> for more.  If
> you want access to the individual parts you can again use the union,

My thinking is that I'd like to try to be compiler independent, so by 
using the intel intrinsics I figure I should be able to get gcc and the 
intel compiler to work as a start.

What I am _really_ trying to do is to implement the addition of 
elements of two arrays.

Is there a more efficient way of doing this than this way?:


#include <stdio.h>
#include <emmintrin.h>

void test_add_long(long * result, long * a, long * b, long size) {
        union { __m128i v; long l[2]; } temp1, temp2, temp3;
        int index=0;

        for( index=0; index < size; index+=2  ) {
            temp1.l[0] = a[index];
            temp1.l[1] = a[index+1];
            temp2.l[0] = b[index];
            temp2.l[1] = b[index + 1];

            temp3.v = _mm_add_epi8( temp1.v, temp2.v );
            result[index] = temp3.l[0];
            result[index+1] = temp3.l[1];

            printf("c0=%ld c1=%ld\n", result[index], result[index+1] );
        }
}

int main( int count, char ** args ) {
     // array of 4 8 byte ints
     long a[]  = { 1, 2, 3, 4};
     long b[]  = { 1, 2, 3, 4};
     long result[]  = {0,0,0,0};

     test_add_long(result, a, b, 4);

     return 0;
}







* Re: is -O2 breaking sse2 alignment?
  2008-03-14  0:13   ` JP Fournier
@ 2008-03-14 19:31     ` Brian Budge
  2008-03-15  9:49       ` Andrew Haley
  2008-03-15  0:10     ` Maximillian Murphy
  1 sibling, 1 reply; 7+ messages in thread
From: Brian Budge @ 2008-03-14 19:31 UTC (permalink / raw)
  To: JP Fournier; +Cc: gcc-help

Hi -

I think this might be a little simpler:

void test_add_long1(long * result, long * a, long * b, long size) {
    __m128i *A = (__m128i*)a;
    __m128i *B = (__m128i*)b;
    __m128i *end = A + size/2;
    __m128i *R = (__m128i*)result;

    for(; A < end; ++A, ++B, ++R) {
        *R = _mm_add_epi64(*A, *B);
    }
}

though I believe if you let the compiler know about the alignment
of result, a, and b, it will properly optimize this:

void test_add_long2(long * result, long * a, long * b, long size) {
    long *end = a + size;
    for(; a < end; ++a, ++b, ++result) {
        *result = *a + *b;
    }
}
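
One possible way to pass that alignment information along is sketched
below.  It assumes a gcc recent enough to provide
__builtin_assume_aligned (newer than the 4.1.2 in this thread; 4.7 or
later as far as I know) and C99 restrict; the function name is
hypothetical.  Whether the loop actually gets vectorized still depends
on the compiler version and on flags such as -O3 or -ftree-vectorize.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical variant of test_add_long2 that promises 16-byte
   alignment and no overlap between the three arrays.  */
void add_long_aligned (long *restrict result, const long *restrict a,
                       const long *restrict b, long size)
{
  long *r;
  const long *x, *y;
  long i;

  /* Check the promise in debug builds; breaking it is undefined. */
  assert ((uintptr_t) result % 16 == 0);
  assert ((uintptr_t) a % 16 == 0);
  assert ((uintptr_t) b % 16 == 0);

  /* Tell the optimizer it may assume 16-byte alignment. */
  r = __builtin_assume_aligned (result, 16);
  x = __builtin_assume_aligned (a, 16);
  y = __builtin_assume_aligned (b, 16);

  for (i = 0; i < size; i++)
    r[i] = x[i] + y[i];
}
```

Arrays from _mm_malloc(n, 16), or locals declared with
__attribute__ ((aligned (16))), satisfy the alignment promise.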

  Brian



* Re: is -O2 breaking sse2 alignment?
  2008-03-14  0:13   ` JP Fournier
  2008-03-14 19:31     ` Brian Budge
@ 2008-03-15  0:10     ` Maximillian Murphy
  2008-03-15 15:10       ` Maximillian Murphy
  1 sibling, 1 reply; 7+ messages in thread
From: Maximillian Murphy @ 2008-03-15  0:10 UTC (permalink / raw)
  To: gcc-help

> 
> What I am _really_ trying to do is to implement the addition of 
> elements of two arrays.
> 
> Is there a more efficient way of doing this than this way?:
> 

Question from someone who has just written his first few lines of SSE2
(oh how exciting, but let's not get too excited until we can actually
beat the SSE-free standard compile!):  How many SSE2 instructions can
be run at the same time?

I would have thought that if there is much optimising to be done it
will be in loading up all the registers and doing lots of SSE
instructions in parallel.  Presumably the challenge will be organising
traffic to and from the registers so that we don't get spikes from
loading registers simultaneously.  Rather we'd have to load one pair of
registers whilst simultaneously adding together another pair whilst
simultaneously writing out the result of a third.  That kind of thing.
Am I on the right track or am I way off the mark?

Regards, Max.


* Re: is -O2 breaking sse2 alignment?
  2008-03-14 19:31     ` Brian Budge
@ 2008-03-15  9:49       ` Andrew Haley
  0 siblings, 0 replies; 7+ messages in thread
From: Andrew Haley @ 2008-03-15  9:49 UTC (permalink / raw)
  To: Brian Budge; +Cc: JP Fournier, gcc-help

Brian Budge wrote:
> Hi -
> 
> I think this might be a little simpler:
> 
> void test_add_long1(long * result, long * a, long * b, long size) {
>     __m128i *A = (__m128i*)a;
>     __m128i *B = (__m128i*)b;
>     __m128i *end = A + size/2;
>     __m128i *R = (__m128i*)result;
> 
>     for(; A < end; ++A, ++B, ++R) {
>         *R = _mm_add_epi64(*A, *B);
>     }
> }

This has the same aliasing problems.
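
One way around it, as a rough sketch: copy the data through memcpy into
vector-typed locals, which is valid for any object type, and rely on
the compiler to turn the small fixed-size copies into ordinary loads
and stores.  The version below uses gcc's vector extension rather than
the Intel intrinsics, so it is gcc/clang specific but not tied to SSE2.
The typedef, the LANES macro and the name test_add_long3 are invented
here for illustration.

```c
#include <assert.h>
#include <string.h>

/* 16 bytes' worth of longs: two lanes where long is 8 bytes. */
typedef long vlong __attribute__ ((vector_size (16)));
#define LANES ((long) (sizeof (vlong) / sizeof (long)))

void test_add_long3 (long *result, const long *a, const long *b, long size)
{
  long i = 0;

  assert (result && a && b && size >= 0);

  /* Whole vectors: memcpy in, add, memcpy out.  No pointer is ever
     cast to a vector type, so no aliasing rule is involved.  */
  for (; i + LANES <= size; i += LANES)
    {
      vlong va, vb, vr;
      memcpy (&va, a + i, sizeof va);
      memcpy (&vb, b + i, sizeof vb);
      vr = va + vb;
      memcpy (result + i, &vr, sizeof vr);
    }

  /* Scalar tail for sizes that are not a multiple of LANES. */
  for (; i < size; i++)
    result[i] = a[i] + b[i];
}
```

Unlike the pointer-cast version, this makes no alignment assumption
about a, b or result.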

Andrew.


* Re: is -O2 breaking sse2 alignment?
  2008-03-15  0:10     ` Maximillian Murphy
@ 2008-03-15 15:10       ` Maximillian Murphy
  0 siblings, 0 replies; 7+ messages in thread
From: Maximillian Murphy @ 2008-03-15 15:10 UTC (permalink / raw)
  To: gcc-help

> > 
> > What I am _really_ trying to do is to implement the addition of 
> > elements of two arrays.
> > 
> > Is there a more efficient way of doing this than this way?:
> > 
> 

Dear All,

Regarding the earlier code: the vector add is 8 bits wide (_mm_add_epi8), even though we are loading the registers with 64-bit longs.  If we only want to add bytes we can do eight times as many at a shot; if we want to add longs the intrinsic needs to change to _mm_add_epi64.
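
To make the lane-width point concrete, here is a small sketch that does the same 16-byte add twice, once as sixteen 8-bit lanes and once as two 64-bit lanes.  It uses gcc's vector extension to stand in for the two intrinsics, so it illustrates the arithmetic only; the typedefs and function names are made up, and it assumes an 8-byte unsigned long long.

```c
#include <assert.h>
#include <string.h>

typedef unsigned char v16qu __attribute__ ((vector_size (16)));
typedef unsigned long long v2du __attribute__ ((vector_size (16)));

/* Sixteen independent 8-bit adds, the lane width of _mm_add_epi8.
   Carries do not cross from one byte into the next.  */
void add_lanes8 (const unsigned long long a[2],
                 const unsigned long long b[2],
                 unsigned long long out[2])
{
  v16qu va, vb, vr;

  assert (sizeof (v16qu) == 2 * sizeof (unsigned long long));
  memcpy (&va, a, sizeof va);
  memcpy (&vb, b, sizeof vb);
  vr = va + vb;
  memcpy (out, &vr, sizeof vr);
}

/* Two independent 64-bit adds, the lane width of _mm_add_epi64. */
void add_lanes64 (const unsigned long long a[2],
                  const unsigned long long b[2],
                  unsigned long long out[2])
{
  v2du va, vb, vr;

  assert (sizeof (v2du) == 2 * sizeof (unsigned long long));
  memcpy (&va, a, sizeof va);
  memcpy (&vb, b, sizeof vb);
  vr = va + vb;
  memcpy (out, &vr, sizeof vr);
}
```

With 1 + 1 both give 2, which is why the test program looked right.  With 255 + 1 the byte-lane version gives 0 because the carry out of the low byte is dropped, while the 64-bit version gives 256.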

I toyed with 64-bit adds using one of your code samples.  I created three arrays of 10000000 longs and added them together, first without SSE, then using just one SSE load, add, unload, then using two load, add, unloads in parallel.  The answers that you ladies and gentlemen have been waiting for are:

Without SSE:
      3  In 1.600000e+05 jiffies
     97  In 1.700000e+05 jiffies
    181  In 1.800000e+05 jiffies
     23  In 1.900000e+05 jiffies
      1  In 2.000000e+05 jiffies
With one SSE load add cycle:
     51  In 2.200000e+05 jiffies
    204  In 2.300000e+05 jiffies
     50  In 2.400000e+05 jiffies
With two SSE load add cycles:
      1  In 2.000000e+05 jiffies
     56  In 2.100000e+05 jiffies
    177  In 2.200000e+05 jiffies
     69  In 2.300000e+05 jiffies
      1  In 2.400000e+05 jiffies

As we have eight registers we could have four add operations going on simultaneously; however, it's clearly not going to beat vanilla code that ignores SSE.  (On my machine anyway, and with one particular piece of code.  If anyone can do better, please speak up so that we can compare notes!)

As you can tell, the clock is fairly coarse.  Repeating the tests makes up for that though.
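
For anyone who wants to repeat this, a minimal sketch of a repeat-and-compare harness for the plain (non-SSE) case follows.  It is not the code behind the numbers above: it uses the standard clock() rather than jiffies, keeps the fastest run instead of a histogram, and the function name is invented.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical harness: adds two arrays of n longs `repeats` times
   and returns the fastest run in clock() ticks, or -1 on failure.  */
long best_ticks_scalar_add (size_t n, int repeats)
{
  long *a, *b, *r;
  long best = -1;
  size_t i;
  int k;

  assert (n > 0 && repeats > 0);

  a = malloc (n * sizeof *a);
  b = malloc (n * sizeof *b);
  r = malloc (n * sizeof *r);
  if (!a || !b || !r)
    {
      free (a); free (b); free (r);
      return -1;
    }

  for (i = 0; i < n; i++)
    {
      a[i] = (long) i;
      b[i] = 1;
    }

  for (k = 0; k < repeats; k++)
    {
      clock_t start = clock ();
      long ticks;

      for (i = 0; i < n; i++)
        r[i] = a[i] + b[i];

      ticks = (long) (clock () - start);

      /* Read a result back so the loop is less likely to be
         optimized away; the last element should equal n.  */
      if (r[n - 1] != (long) n)
        {
          best = -1;
          break;
        }
      if (best < 0 || ticks < best)
        best = ticks;
    }

  free (a); free (b); free (r);
  return best;
}
```

Dividing the result by CLOCKS_PER_SEC gives seconds.  Taking the minimum over many runs is one common way to cope with a coarse or noisy clock.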

Doing masses of 16 bit multiplies I can beat the standard gcc compiled code by a small factor.

I'm curious as to what is limiting the SSE computation.  Is it load time, in which case it's only worth using the SSE if there are several operations to do, or is it the width of the compute engine?  The latter seems unlikely, after all width is what SSE is all about.

Regards, A.N. Ewbie.

