* SSE SIMD enhanced code 4x slower than regular code
@ 2012-01-18 11:13 Boris Hollas
2012-01-18 11:23 ` Marc Glisse
0 siblings, 1 reply; 3+ messages in thread
From: Boris Hollas @ 2012-01-18 11:13 UTC (permalink / raw)
To: gcc-help
Hello,
I have a function iter1 that iterates a sequence of complex numbers. I
redesigned this function, using SSE intrinsics such as _mm_mul_pd, to obtain
iter0. Nonetheless, iter0 is 4x slower than iter1:
iter0 (with SSE intrinsics):
$ gcc -O -march=core2 t.c && time ./a.out
257829745
real 0m7.912s
user 0m7.908s
sys 0m0.000s
iter1 (w/o SSE intrinsics):
$ gcc -O -march=core2 t.c && time ./a.out
257829745
real 0m2.075s
user 0m2.076s
sys 0m0.000s
The size of a.out is 7.1K in both cases. I use gcc version 4.4.5 and the
CPU is an Intel Core 2 Duo.
The code is below. iter0 and iter1 give the same numerical results.
#include <pmmintrin.h>
#include <stdio.h>

#define sqr(x) ((x)*(x))

typedef union {
  __m128d m;
  double v[2]; // v[0] low, v[1] up
} v2df;

int iter0(v2df z, v2df c, int n, int bound) {
  v2df z2, z2r, z2r_addsub, z_;
  z2.m = _mm_mul_pd(z.m, z.m); // z_re^2, z_im^2
  z2r.v[1] = z2.v[0];
  z2r.v[0] = z2.v[1];
  z2r_addsub.m = _mm_addsub_pd(z2r.m, z2.m); // z_re^2 + z_im^2, z_re^2 - z_im^2

  if(z2r_addsub.v[1] > 4.0 || n == bound) return n;
  else {
    z_.v[1] = z2r_addsub.v[0];
    z_.v[0] = 2.0 * z.v[1] * z.v[0];
    z_.m = _mm_add_pd(z_.m, c.m); // z_re^2 - z_im^2 + c_re, 2 * z_re * z_im + c_im
    return iter0(z_, c, n+1, bound);
  }
}

int iter1(double z_re, double z_im, double c_re, double c_im, int n, int bound) {
  double zre2 = sqr(z_re);
  double zim2 = sqr(z_im);

  if(zre2 + zim2 > 4.0 || n == bound) return n;
  else return iter1(zre2 - zim2 + c_re, 2.0 * z_re * z_im + c_im, c_re, c_im, n+1, bound);
}

#define sse

int main() {
  v2df z, c;
  long n = 0;
  z.v[1] = 0.0; z.v[0] = 0.0;

  for(c.v[1] = -2.0; c.v[1] < 1.0; c.v[1] += 3.0/1000.0) {
    for(c.v[0] = -1.0; c.v[0] < 1.0; c.v[0] += 2.0/1000.0) {
#ifdef sse
      n += iter0(z, c, 0, 1000);
#else
      n += iter1(0.0, 0.0, c.v[1], c.v[0], 0, 1000);
#endif
    }
  }
  printf("%ld\n", n);
  return 0;
}
--
View this message in context: http://old.nabble.com/SSE-SIMD-enhanced-code-4x-slower-than-regular-code-tp33159404p33159404.html
Sent from the gcc - Help mailing list archive at Nabble.com.
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: SSE SIMD enhanced code 4x slower than regular code
2012-01-18 11:13 SSE SIMD enhanced code 4x slower than regular code Boris Hollas
@ 2012-01-18 11:23 ` Marc Glisse
2012-01-20 17:30 ` Boris Hollas
0 siblings, 1 reply; 3+ messages in thread
From: Marc Glisse @ 2012-01-18 11:23 UTC (permalink / raw)
To: Boris Hollas; +Cc: gcc-help
On Tue, 17 Jan 2012, Boris Hollas wrote:
> I have a function iter1 that iterates a sequence of complex numbers. I
> redesigned this function, using SSE intrinsics such as _mm_mul_pd, to obtain
> iter0. Nonetheless, iter0 is 4x slower than iter1:
That is not surprising at all, and it happens with most code using double
when people try to convert it to SSE.
> $ gcc -O -march=core2 t.c && time ./a.out
Maybe use at least -O2?
> The size of a.out ist 7.1K in both cases. I use gcc version 4.4.5 and the
> CPU is an Intel Core 2 Duo.
You may want to try newer versions of gcc (note the plural, results
between 4.4, 4.5, 4.6 and the future 4.7 can vary a lot, and not always in
the right direction).
> #include <pmmintrin.h>
> #include <stdio.h>
> #define sqr(x) ((x)*(x))
>
> typedef union {
> __m128d m;
> double v[2]; // v[0] low, v[1] up
> } v2df;
>
> int iter0(v2df z, v2df c, int n, int bound) {
> v2df z2, z2r, z2r_addsub, z_;
> z2.m = _mm_mul_pd(z.m, z.m); // z_re^2, z_im^2
> z2r.v[1] = z2.v[0];
> z2r.v[0] = z2.v[1];
You may want to try _mm_shuffle_pd or __builtin_shuffle.
> z2r_addsub.m = _mm_addsub_pd(z2r.m, z2.m); // z_re^2 + z_im^2, z_re^2 - z_im^2
>
> if(z2r_addsub.v[1] > 4.0 || n == bound) return n;
> else {
> z_.v[1] = z2r_addsub.v[0];
> z_.v[0] = 2.0 * z.v[1] * z.v[0];
> z_.m = _mm_add_pd(z_.m, c.m); // z_re^2 - z_im^2 + c_re, 2 * z_re * z_im + c_im
> return iter0(z_, c, n+1, bound);
> }
> }
Did you take a look at the generated code (use flag -S and read the
generated t.s)? Going back and forth between packed and unpacked through a
union often generates plenty of mov instructions. If you manually use
_mm_cvtsd_f64 and _mm_unpackhi_pd you may be able to save a bit. Note that
with the latest gcc, you can use the [] notation directly on your __m128d.
> int iter1(double z_re, double z_im, double c_re, double c_im, int n, int bound) {
> double zre2 = sqr(z_re);
> double zim2 = sqr(z_im);
>
> if(zre2 + zim2 > 4.0 || n == bound) return n;
> else return iter1(zre2 - zim2 + c_re, 2.0 * z_re * z_im + c_im, c_re, c_im, n+1, bound);
> }
>
> #define sse
>
> int main() {
> v2df z, c;
> long n = 0;
> z.v[1] = 0.0; z.v[0] = 0.0;
>
> for(c.v[1] = -2.0; c.v[1] < 1.0; c.v[1] += 3.0/1000.0) {
> for(c.v[0] = -1.0; c.v[0] < 1.0; c.v[0] += 2.0/1000.0) {
> #ifdef sse
> n += iter0(z, c, 0, 1000);
> #else
> n += iter1(0.0, 0.0, c.v[1], c.v[0], 0, 1000);
> #endif
> }
> }
> printf("%ld\n", n);
> return 0;
> }
I'd be surprised if you managed any gain on this thanks to __m128d.
--
Marc Glisse
* Re: SSE SIMD enhanced code 4x slower than regular code
2012-01-18 11:23 ` Marc Glisse
@ 2012-01-20 17:30 ` Boris Hollas
0 siblings, 0 replies; 3+ messages in thread
From: Boris Hollas @ 2012-01-20 17:30 UTC (permalink / raw)
To: gcc-help
>Maybe use at least -O2 ?
No difference.
> You may want to try _mm_shuffle_pd or __builtin_shuffle.
Indeed, this reduces the runtime by 2 s, but it's still 3x slower than iter1.
>Did you take a look at the generated code (use flag -S and read the
>generated t.s)? Going back and forth between packed and unpacked through a
>union often generates plenty of mov instructions. If you manually use
>_mm_cvtsd_f64 and _mm_unpackhi_pd you may be able to save a bit. Note that
>with the latest gcc, you can use the [] notation directly on your __m128d.
Yes, there are lots of mov instructions; it seems they add a lot of
runtime.
> I'd be surprised if you managed any gain on this thanks to __m128d.
Well, maybe SIMD isn't well suited to these calculations, so I'll use
iter1 until I have a better idea.
Thanks for your hints!
-Boris
--
View this message in context: http://old.nabble.com/SSE-SIMD-enhanced-code-4x-slower-than-regular-code-tp33159404p33161396.html
Sent from the gcc - Help mailing list archive at Nabble.com.