From: Maxime Boissonneault <maxime.boissonneault@calculquebec.ca>
To: Rhys Ulerich <rhys.ulerich@gmail.com>
Cc: gsl-discuss@sourceware.org
Subject: Re: Adding OpenMP support for some of the GSL functions
Date: Thu, 13 Dec 2012 21:14:00 -0000 [thread overview]
Message-ID: <50CA452B.6070601@calculquebec.ca> (raw)
In-Reply-To: <50CA4313.30708@calculquebec.ca>
Hi again,
Since I noticed that the OpenMP version did not have a balanced workload
amongst thread (which may have to do with data locality, memory
affinity, etc.), I added the clause "schedule(runtime)" to each of my
pragmas.
At runtime, defining the environment variable OMP_SCHEDULE="guided,1", I
got the slightly better runtime of 57s with the OpenMP version.
Maxime
Le 2012-12-13 16:05, Maxime Boissonneault a écrit :
> Hi Rhys,
> I did the comparison you requested. Here is the load balancing
> information for my test run :
> These were all obtained with gcc 4.7.2. The total times for the runs
> were :
> - Without vectorization or OpenMP : 1m21s
> - With vectorization : 1m14s
> - With OpenMP : 1m01s
>
> Here is load-balance information obtained with OpenSpeedshop.
> Without vectorization or OpenMP :
> Max Posix ThreadId Min Posix ThreadId Average
> Function (defining location)
> Exclusive of Max Exclusive of Min Exclusive
> Time Across Time Across Time Across
> Posix Posix Posix
> ThreadIds(s) ThreadIds(s) ThreadIds(s)
>
> 28.085714 140384673677232 28.085714 140384673677232 28.085714
> rkf45_apply (libgsl.so.0.16.0)
>
> With OpenMP :
> Max Posix ThreadId Min Posix ThreadId Average
> Function (defining location)
> Exclusive of Max Exclusive of Min Exclusive
> Time Across Time Across Time Across
> Posix Posix Posix
> ThreadIds(s) ThreadIds(s) ThreadIds(s)
>
> [...]
> 2.800000 1146448192 1.228571 1088166208 1.778571
> rkf45_apply._omp_fn.4 (libgsl.so.0.16.0)
> 2.171429 1146448192 0.971429 1129662784 1.460714
> rkf45_apply._omp_fn.6 (libgsl.so.0.16.0)
> 2.142857 1146448192 1.400000 140073673240496 1.671429
> rkf45_apply._omp_fn.3 (libgsl.so.0.16.0)
> 2.085714 1146448192 1.257143 1112877376 1.725000
> rkf45_apply._omp_fn.5 (libgsl.so.0.16.0)
> 2.085714 1146448192 1.171429 1088166208 1.578571
> rkf45_apply._omp_fn.2 (libgsl.so.0.16.0)
> 1.600000 1146448192 0.885714 140073673240496 1.139286
> rkf45_apply._omp_fn.1 (libgsl.so.0.16.0: rkf45.c,233)
> 0.714286 1146448192 0.457143 1129662784 0.557143
> rkf45_apply._omp_fn.0 (libgsl.so.0.16.0)
> [...]
> The 7 loop make a total of about ~12s.
>
>
> With vectorization :
> Max Posix ThreadId Min Posix ThreadId Average
> Function (defining location)
> Exclusive of Max Exclusive of Min Exclusive
> Time Across Time Across Time Across
> Posix Posix Posix
> ThreadIds(s) ThreadIds(s) ThreadIds(s)
>
> 24.542857 140440801677232 24.542857 140440801677232 24.542857
> rkf45_apply (libgsl.so.0.16.0)
>
>
> This was obtained with gcc 4.7.2, with 2 quad-core CPUs, for a total
> of 8 threads. To summarize :
> - Vectorization makes a difference, but it is minor (gained 4s out of
> the 28s used by the normal rkf45_apply ).
> - OpenMP is the clear winner (gained 20s out of 28s).
>
>
> Maxime
>
>
>
>
>
> Le 2012-12-13 10:51, Rhys Ulerich a écrit :
>>> I am doing that too, but any gain we can get is an important one,
>>> and it
>>> turns out that by parallelizing rkf45_apply, my simulation runs 30%
>>> faster
>>> on 8 cores.
>> That's a parallel efficiency of 0.18% ( = 1 time unit / (8 cores *
>> 0.70 time units)). This feels like you're getting a small
>> memory/cache bandwidth increase for the rkf45_apply level-1-BLAS-like
>> operations by using multiple cores but the cores are otherwise not
>> being used effectively. I say this because a state vector 1e6 doubles
>> long will not generally fit in cache. Adding more cores increases the
>> amount of cache available.
>>
>>> I will have a deeper look at vectorization of GSL, but in my
>>> understanding,
>>> vectorizing can only be done with simple operations, while
>>> algorithms like
>>> RKF45 involve about 10 operations per loop iterations.
>> The compilers are generally very good. Intel's icc 11.1 has to be
>> told that the last four loops you annotated are vectorizable. GCC
>> nails it out of the box.
>>
>> On GCC 4.4.3 with something like
>> CFLAGS="-g -O2 -march=native -mtune=native
>> -ftree-vectorizer-verbose=2 -ftree-vectorize" ../gsl/configure && make
>> shows every one of those 6 loops vectorizing. You can check this by
>> configuring with those options, running make and waiting for the build
>> to finish, and then cd-ing into ode-initval2 and running
>> rm rkf45*o && make
>> and observing all those beautiful
>> LOOP VECTORIZED
>> messages. Better yet, with those options, 'make check' passes for me
>> on the 'ode-initval2' subdirectory.
>>
>> Try ripping out your OpenMP pragmas in GSL, building with
>> vectorization against stock GSL as I suggested, and then seeing how
>> fast your code runs with GSL vectorized on 1 core versus GSL's
>> unvectorized rkf45_apply parallelized over 8 cores. I suspect it will
>> be comparable.
>>
>> - Rhys
>
--
---------------------------------
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique
prev parent reply other threads:[~2012-12-13 21:14 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-12-11 20:04 Maxime Boissonneault
2012-12-12 16:35 ` Frank Reininghaus
2012-12-12 21:11 ` Maxime Boissonneault
2012-12-12 21:41 ` Rhys Ulerich
2012-12-12 23:32 ` Maxime Boissonneault
2012-12-12 17:05 ` Rhys Ulerich
2012-12-12 21:13 ` Maxime Boissonneault
2012-12-12 21:36 ` Rhys Ulerich
2012-12-12 23:35 ` Maxime Boissonneault
2012-12-13 2:29 ` Rhys Ulerich
2012-12-13 13:22 ` Maxime Boissonneault
2012-12-13 15:53 ` Rhys Ulerich
2012-12-13 16:44 ` Rhys Ulerich
2012-12-13 21:07 ` Maxime Boissonneault
2012-12-13 21:05 ` Maxime Boissonneault
2012-12-13 21:14 ` Maxime Boissonneault [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=50CA452B.6070601@calculquebec.ca \
--to=maxime.boissonneault@calculquebec.ca \
--cc=gsl-discuss@sourceware.org \
--cc=rhys.ulerich@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).