* is -O2 breaking sse2 alignment?
From: JP Fournier @ 2008-03-12 23:31 UTC
To: gcc-help

Hi All.

In the example below, compiling with -O2 results in incorrect output
from the program.  -O seems OK.  Am I missing something alignment wise
(or otherwise) or is -O2 breaking my alignment?

If I use _mm_storeu_si128 then both -O2 and -O work as expected.

Any thoughts appreciated.

jp

------

bash-3.1$ gcc -O -msse2 -o sse2 sse2.c
bash-3.1$ ./sse2
c0=2 c1=2

bash-3.1$ gcc -O2 -msse2 -o sse2 sse2.c
bash-3.1$ ./sse2
c0=0 c1=0

bash-3.1$ gcc --version
gcc (GCC) 4.1.2

bash-3.1$ uname -a
Linux puma 2.6.22.8 #1 SMP Tue Sep 25 20:41:25 BST 2007 x86_64 x86_64 x86_64 GNU/Linux

sse2.c:

#include <stdio.h>
#include <emmintrin.h>

void test_int() {
    // array of 2 8 byte ints
    long int *a = _mm_malloc(16, 16);
    long int *b = _mm_malloc(16, 16);
    long int *c = _mm_malloc(16, 16);

    __m128i ai __attribute__ ((aligned (16)));
    __m128i bi __attribute__ ((aligned (16)));
    __m128i ci __attribute__ ((aligned (16)));

    a[0] = a[1] = 1;
    b[0] = b[1] = 1;
    c[0] = c[1] = 0;

    ai = _mm_load_si128( (__m128i *) (void*)a );
    bi = _mm_load_si128( (__m128i *) (void*)b );

    ci = _mm_add_epi8( ai, bi );
    _mm_store_si128( (__m128i *) (void*)c, ci );
    printf("c0=%ld c1=%ld\n", c[0], c[1] );
}

int main( int count, char ** args ) {
    test_int();
    return 0;
}
* Re: is -O2 breaking sse2 alignment?
From: Brian Dessent @ 2008-03-13 0:28 UTC
To: JP Fournier; +Cc: gcc-help

JP Fournier wrote:

> In the example below, compiling with -O2 results in incorrect output
> from the program.  -O seems OK.  Am I missing something alignment wise
> (or otherwise) or is -O2 breaking my alignment?

If it was an alignment problem you'd most likely be getting a
segmentation fault.  The __m128i type should already include the proper
alignment, so you don't need the __attribute__((aligned (16))) stuff.

>     // array of 2 8 byte ints
>     long int *a = _mm_malloc(16, 16);
>     long int *b = _mm_malloc(16, 16);
>     long int *c = _mm_malloc(16, 16);
>
>     __m128i ai __attribute__ ((aligned (16)));
>     __m128i bi __attribute__ ((aligned (16)));
>     __m128i ci __attribute__ ((aligned (16)));
>
>     a[0] = a[1] = 1;
>     b[0] = b[1] = 1;
>     c[0] = c[1] = 0;
>
>     ai = _mm_load_si128( (__m128i *) (void*)a );
>     bi = _mm_load_si128( (__m128i *) (void*)b );
>
>     ci = _mm_add_epi8( ai, bi );
>     _mm_store_si128( (__m128i *) (void*)c, ci );
>     printf("c0=%ld c1=%ld\n", c[0], c[1] );
> }

You're violating the C aliasing rules.  You can't store through a cast
pointer like that.  You also don't have to do the load/store; the
compiler knows what you want when you use a union instead:

    union { __m128i v; long l[2]; } a, b, c;

    a.l[0] = a.l[1] = 1;
    b.l[0] = b.l[1] = 1;

    c.v = _mm_add_epi8 (a.v, b.v);
    printf("c0=%ld c1=%ld\n", c.l[0], c.l[1]);

There's an even more natural way to do this, though, using gcc's built-in
vector extensions without any of the Intel mmintrin.h stuff.  This way
will result in code that will vectorize to altivec, sse2, spu, whatever
the machine supports; it's not hardware specific:

    typedef int v4si __attribute__ ((vector_size (16)));

    v4si a = { 1, 2, 3, 4 }, b = { 5, 6, 7, 8 }, c;

    c = a + b;

You can use all the normal C operators like + and * as if they were
scalars, but they will be compiled using the corresponding SIMD
instructions.  See
<http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html> for more.  If
you want access to the individual parts you can again use the union,
e.g.

    union { v4si v; int i[4]; } u;

    u.v = a + b;
    printf ("%d,%d,%d,%d\n", u.i[0], u.i[1], u.i[2], u.i[3]);

Brian
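A minimal sketch of the vector-extension approach from the message
above, adapted to the two 64-bit values of the original program.  The
v2di typedef and the add_pairs helper are names assumed here for
illustration and do not appear in the thread; long long is used so that
each lane is 64 bits regardless of platform.  It relies on the GCC
vector extension (Clang accepts the same syntax) and on reading a union
member other than the one last written, which GCC documents as
supported.

```c
/* Two 64-bit lanes in one 16-byte vector (GCC/Clang vector extension). */
typedef long long v2di __attribute__ ((vector_size (16)));

/* Hypothetical helper: adds two pairs of 64-bit integers lane by lane. */
void add_pairs (const long long a[2], const long long b[2], long long out[2])
{
  /* The union lets us fill and read individual lanes without pointer casts. */
  union { v2di v; long long l[2]; } ua, ub, uc;

  ua.l[0] = a[0];
  ua.l[1] = a[1];
  ub.l[0] = b[0];
  ub.l[1] = b[1];

  uc.v = ua.v + ub.v;   /* element-wise add on both 64-bit lanes */

  out[0] = uc.l[0];
  out[1] = uc.l[1];
}
```

With a = {1, 1} and b = {1, 1}, out should come back as {2, 2}, the same
result the original program printed at -O.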
* Re: is -O2 breaking sse2 alignment?
From: JP Fournier @ 2008-03-14 0:13 UTC
To: gcc-help

Brian Dessent wrote:
>
> You're violating the C aliasing rules.  You can't store through a cast
> pointer like that.  You also don't have to do the load/store; the
> compiler knows what you want when you use a union instead:
>
>     union { __m128i v; long l[2]; } a, b, c;
>
>     a.l[0] = a.l[1] = 1;
>     b.l[0] = b.l[1] = 1;
>
>     c.v = _mm_add_epi8 (a.v, b.v);
>     printf("c0=%ld c1=%ld\n", c.l[0], c.l[1]);

Many thanks, Brian.  My little program now behaves better:

bash-3.1$ gcc -O2 -msse2 -o sse2 sse2-1.c
bash-3.1$ ./sse2
c0=2 c1=2

#include <stdio.h>
#include <emmintrin.h>

void test_int() {

    union { __m128i v; long l[2]; } a, b, c;

    a.l[0] = a.l[1] = 1;
    b.l[0] = b.l[1] = 1;
    c.l[0] = c.l[1] = 0;

    c.v = _mm_add_epi8( a.v, b.v );
    printf("c0=%ld c1=%ld\n", c.l[0], c.l[1] );
}

int main( int count, char ** args ) {
    test_int();
    return 0;
}

>
> There's an even more natural way to do this, though, using gcc's built-in
> vector extensions without any of the Intel mmintrin.h stuff.  This way
> will result in code that will vectorize to altivec, sse2, spu, whatever
> the machine supports; it's not hardware specific:
>
>     typedef int v4si __attribute__ ((vector_size (16)));
>
>     v4si a = { 1, 2, 3, 4 }, b = { 5, 6, 7, 8 }, c;
>
>     c = a + b;
>
> You can use all the normal C operators like + and * as if they were
> scalars, but they will be compiled using the corresponding SIMD
> instructions.  See
> <http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html> for more.  If
> you want access to the individual parts you can again use the union,

My thinking is that I'd like to try to be compiler independent, so by
using the Intel intrinsics I figure I should be able to get gcc and the
Intel compiler to work as a start.

What I am _really_ trying to do is implement the addition of the
elements of two arrays.

Is there a more efficient way of doing this than this way?:

#include <stdio.h>
#include <emmintrin.h>

void test_add_long(long * result, long * a, long * b, long size) {
    union { __m128i v; long l[2]; } temp1, temp2, temp3;
    int index=0;

    for( index=0; index < size; index+=2 ) {
        temp1.l[0] = a[index];
        temp1.l[1] = a[index+1];
        temp2.l[0] = b[index];
        temp2.l[1] = b[index + 1];

        temp3.v = _mm_add_epi8( temp1.v, temp2.v );
        result[index] = temp3.l[0];
        result[index+1] = temp3.l[1];

        printf("c0=%ld c1=%ld\n", result[index], result[index+1] );
    }
}

int main( int count, char ** args ) {
    // array of 4 8 byte ints
    long a[] = { 1, 2, 3, 4};
    long b[] = { 1, 2, 3, 4};
    long result[] = {0,0,0,0};

    test_add_long(result, a, b, 4);

    return 0;
}
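One possible shape for the array addition asked about above, sketched
under a few assumptions: add_i64_sse2 is a hypothetical name, the
element type is long long so that each element is 64 bits, and the
arrays are not assumed to be 16-byte aligned, hence the unaligned load
and store intrinsics (the original report notes that _mm_storeu_si128
behaved correctly at both -O and -O2).  The add is _mm_add_epi64 because
the elements are 64 bits wide.  This is an illustration, not a tuned
implementation.

```c
#include <stddef.h>
#include <emmintrin.h>

/* Hypothetical helper: result[i] = a[i] + b[i] for 64-bit elements. */
void add_i64_sse2 (long long *result, const long long *a,
                   const long long *b, size_t n)
{
  size_t i = 0;

  /* Two 64-bit lanes per iteration; the unaligned forms accept any address. */
  for (; i + 2 <= n; i += 2)
    {
      __m128i va = _mm_loadu_si128 ((const __m128i *) (a + i));
      __m128i vb = _mm_loadu_si128 ((const __m128i *) (b + i));
      _mm_storeu_si128 ((__m128i *) (result + i), _mm_add_epi64 (va, vb));
    }

  /* Scalar tail for an odd element count. */
  for (; i < n; i++)
    result[i] = a[i] + b[i];
}
```

The scalar tail means the caller does not have to pad the arrays to an
even length, which the union-based loop above silently requires.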
* Re: is -O2 breaking sse2 alignment?
From: Brian Budge @ 2008-03-14 19:31 UTC
To: JP Fournier; +Cc: gcc-help

Hi -

I think this might be a little simpler:

void test_add_long1(long * result, long * a, long * b, long size) {
    __m128i *A = (__m128i*)a;
    __m128i *B = (__m128i*)b;
    __m128i *end = A + size/2;
    __m128i *R = (__m128i*)result;

    for(; A < end; ++A, ++B, ++R) {
        *R = _mm_add_epi64(*A, *B);
    }
}

though I believe if you let the compiler know about the alignment of
result, a, and b, it will properly optimize this:

void test_add_long2(long * result, long * a, long * b, long size) {
    long *end = a + size;
    for(; a < end; ++a, ++b, ++result) {
        *result = *a + *b;
    }
}

  Brian

On Thu, Mar 13, 2008 at 5:12 PM, JP Fournier <jape41@gmail.com> wrote:
> [...]
>
>  What I am _really_ trying to do is implement the addition of the
>  elements of two arrays.
>
>  Is there a more efficient way of doing this than this way?:
>
> [...]
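The second loop above leaves vectorization to the compiler.  A sketch of
one way to give the compiler more to work with, assuming C99: the
restrict qualifier promises that result, a and b do not overlap, which
is a separate hint from alignment.  The name add_long_restrict is
hypothetical.  Whether the loop is actually vectorized depends on the
compiler version and flags (with GCC, -O3 or -ftree-vectorize), so treat
this as something to try rather than a guarantee.  GCC and Clang also
provide __builtin_assume_aligned for alignment hints; it is left out
here so that the sketch stays valid for arrays of any alignment.

```c
#include <stddef.h>

/* Hypothetical helper: plain scalar loop written so the compiler is free
   to vectorize it.  restrict tells the compiler the arrays do not overlap. */
void add_long_restrict (long *restrict result, const long *restrict a,
                        const long *restrict b, size_t n)
{
  size_t i;

  for (i = 0; i < n; i++)
    result[i] = a[i] + b[i];
}
```

Because there are no pointer casts, this version has no aliasing
question to answer, and it works for any element count.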
* Re: is -O2 breaking sse2 alignment?
From: Andrew Haley @ 2008-03-15 9:49 UTC
To: Brian Budge; +Cc: JP Fournier, gcc-help

Brian Budge wrote:
> Hi -
>
> I think this might be a little simpler:
>
> void test_add_long1(long * result, long * a, long * b, long size) {
>     __m128i *A = (__m128i*)a;
>     __m128i *B = (__m128i*)b;
>     __m128i *end = A + size/2;
>     __m128i *R = (__m128i*)result;
>
>     for(; A < end; ++A, ++B, ++R) {
>         *R = _mm_add_epi64(*A, *B);
>     }
> }

This has the same aliasing problems.

Andrew.
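One way to keep the shape of the loop quoted above while avoiding the
pointer casts is to move the data in and out of the vector variables
with memcpy, which is defined for objects of any type.  This is a sketch
under assumptions, not code from the thread: add_long_memcpy is a
hypothetical name, and long long is used so that each element is
64 bits.  Compilers typically turn a fixed 16-byte memcpy like this into
a single move, but that is an optimization, not something the language
promises.

```c
#include <stddef.h>
#include <string.h>
#include <emmintrin.h>

/* Hypothetical helper: same arithmetic as the cast-based loop, but the
   bytes are copied into and out of __m128i objects with memcpy. */
void add_long_memcpy (long long *result, const long long *a,
                      const long long *b, size_t n)
{
  size_t i = 0;

  for (; i + 2 <= n; i += 2)
    {
      __m128i va, vb, vr;

      memcpy (&va, a + i, sizeof va);       /* 16 bytes = two 64-bit lanes */
      memcpy (&vb, b + i, sizeof vb);
      vr = _mm_add_epi64 (va, vb);
      memcpy (result + i, &vr, sizeof vr);
    }

  /* Scalar tail for an odd element count. */
  for (; i < n; i++)
    result[i] = a[i] + b[i];
}
```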
* Re: is -O2 breaking sse2 alignment?
From: Maximillian Murphy @ 2008-03-15 0:10 UTC
To: gcc-help

>
> What I am _really_ trying to do is implement the addition of the
> elements of two arrays.
>
> Is there a more efficient way of doing this than this way?:
>

Question from someone who has just written his first few lines of SSE2
(oh how exciting, but let's not get too excited until we can actually
beat the SSE-free standard compile!):

How many SSE2 instructions can be run at the same time?  I would have
thought that if there is much optimising to be done, it will be in
loading up all the registers and doing lots of SSE instructions in
parallel.  Presumably the challenge will be organising traffic to and
from the registers so that we don't get spikes from loading registers
simultaneously.  Rather, we'd have to load one pair of registers whilst
simultaneously adding together another pair whilst simultaneously
writing out the result of a third.  That kind of thing.

Am I on the right track or am I way off the mark?

Regards, Max.
* Re: is -O2 breaking sse2 alignment?
From: Maximillian Murphy @ 2008-03-15 15:10 UTC
To: gcc-help

> >
> > What I am _really_ trying to do is implement the addition of the
> > elements of two arrays.
> >
> > Is there a more efficient way of doing this than this way?:
> >

Dear All,

Regarding the earlier code: the vector add is 8 bits wide
(_mm_add_epi8), even though we are loading the registers with 64-bit
longs.  If we only want to add bytes we can do eight times as many at a
shot; if we want to add longs, the command needs to change to
_mm_add_epi64.

I toyed with 64-bit adds and your code.  One of your codes.  I created
three arrays of 10000000 longs and added them together, first without
SSE, then using just one SSE load, add, unload, then using two load,
add, unloads in parallel.  The answers that you ladies and gentlemen
have been waiting for are:

Without SSE:
      3 In 1.600000e+05 jiffies
     97 In 1.700000e+05 jiffies
    181 In 1.800000e+05 jiffies
     23 In 1.900000e+05 jiffies
      1 In 2.000000e+05 jiffies

With one SSE load add cycle:
     51 In 2.200000e+05 jiffies
    204 In 2.300000e+05 jiffies
     50 In 2.400000e+05 jiffies

With two SSE load add cycles:
      1 In 2.000000e+05 jiffies
     56 In 2.100000e+05 jiffies
    177 In 2.200000e+05 jiffies
     69 In 2.300000e+05 jiffies
      1 In 2.400000e+05 jiffies

As we have eight registers we could have four add operations going on
simultaneously; however, it's clearly not going to beat vanilla code
that ignores the SSE.  (On my machine anyway, and with one particular
code.  If anyone can do better, please speak up so that we can compare
notes!)

As you can tell, the clock is fairly coarse.  Repeating the tests makes
up for that though.

Doing masses of 16-bit multiplies I can beat the standard gcc compiled
code by a small factor.

I'm curious as to what is limiting the SSE computation.  Is it load
time, in which case it's only worth using the SSE if there are several
operations to do, or is it the width of the compute engine?  The latter
seems unlikely; after all, width is what SSE is all about.

Regards, A.N. Ewbie.
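The "two load, add, unloads in parallel" variant described above might
look roughly like the sketch below.  The function name add_i64_sse2_x2
is hypothetical and the code is not the poster's; it assumes 64-bit
elements (long long) and uses unaligned loads and stores so that it
works for any array address.  Each iteration runs two independent
load/add/store chains.  Note that every element still costs two 8-byte
loads and one 8-byte store for a single add, which is consistent with
(but does not prove) the loop being limited by memory traffic rather
than by the adder.

```c
#include <stddef.h>
#include <emmintrin.h>

/* Hypothetical helper: array add unrolled by two vectors (four 64-bit
   elements) per iteration, so the two chains do not depend on each other. */
void add_i64_sse2_x2 (long long *result, const long long *a,
                      const long long *b, size_t n)
{
  size_t i = 0;

  for (; i + 4 <= n; i += 4)
    {
      /* First chain: elements i and i+1. */
      __m128i a0 = _mm_loadu_si128 ((const __m128i *) (a + i));
      __m128i b0 = _mm_loadu_si128 ((const __m128i *) (b + i));
      /* Second chain: elements i+2 and i+3. */
      __m128i a1 = _mm_loadu_si128 ((const __m128i *) (a + i + 2));
      __m128i b1 = _mm_loadu_si128 ((const __m128i *) (b + i + 2));

      _mm_storeu_si128 ((__m128i *) (result + i),     _mm_add_epi64 (a0, b0));
      _mm_storeu_si128 ((__m128i *) (result + i + 2), _mm_add_epi64 (a1, b1));
    }

  /* Scalar tail for the remaining zero to three elements. */
  for (; i < n; i++)
    result[i] = a[i] + b[i];
}
```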