Re: Adding OpenMP support for some of the GSL functions

public inbox for gsl-discuss@sourceware.org
 help / color / mirror / Atom feed

From: Maxime Boissonneault <maxime.boissonneault@calculquebec.ca>
To: Rhys Ulerich <rhys.ulerich@gmail.com>
Cc: gsl-discuss@sourceware.org
Subject: Re: Adding OpenMP support for some of the GSL functions
Date: Thu, 13 Dec 2012 21:14:00 -0000	[thread overview]
Message-ID: <50CA452B.6070601@calculquebec.ca> (raw)
In-Reply-To: <50CA4313.30708@calculquebec.ca>

Hi again,
Since I noticed that the OpenMP version did not have a balanced workload 
amongst thread (which may have to do with data locality, memory 
affinity, etc.), I added the clause "schedule(runtime)" to each of my 
pragmas.

At runtime, defining the environment variable OMP_SCHEDULE="guided,1", I 
got the slightly better runtime of 57s with the OpenMP version.

Maxime


Le 2012-12-13 16:05, Maxime Boissonneault a Ã©crit :
> Hi Rhys,
> I did the comparison you requested. Here is the load balancing 
> information for my test run :
> These were all obtained with gcc 4.7.2. The total times for the runs 
> were :
> - Without vectorization or OpenMP : 1m21s
> - With vectorization : 1m14s
> - With OpenMP : 1m01s
>
> Here is load-balance information obtained with OpenSpeedshop.
> Without vectorization or OpenMP :
>         Max   Posix ThreadId          Min   Posix ThreadId Average  
> Function (defining location)
>   Exclusive           of Max    Exclusive           of Min Exclusive
> Time Across                   Time Across                   Time Across
>       Posix                         Posix Posix
> ThreadIds(s)                   ThreadIds(s) ThreadIds(s)
>
>   28.085714  140384673677232    28.085714  140384673677232 28.085714  
> rkf45_apply (libgsl.so.0.16.0)
>
> With OpenMP :
>         Max   Posix ThreadId          Min   Posix ThreadId Average  
> Function (defining location)
>   Exclusive           of Max    Exclusive           of Min Exclusive
> Time Across                   Time Across                   Time Across
>       Posix                         Posix Posix
> ThreadIds(s)                   ThreadIds(s) ThreadIds(s)
>
> [...]
>    2.800000       1146448192     1.228571       1088166208 1.778571  
> rkf45_apply._omp_fn.4 (libgsl.so.0.16.0)
>    2.171429       1146448192     0.971429       1129662784 1.460714  
> rkf45_apply._omp_fn.6 (libgsl.so.0.16.0)
>    2.142857       1146448192     1.400000  140073673240496 1.671429  
> rkf45_apply._omp_fn.3 (libgsl.so.0.16.0)
>    2.085714       1146448192     1.257143       1112877376 1.725000  
> rkf45_apply._omp_fn.5 (libgsl.so.0.16.0)
>    2.085714       1146448192     1.171429       1088166208 1.578571  
> rkf45_apply._omp_fn.2 (libgsl.so.0.16.0)
>    1.600000       1146448192     0.885714  140073673240496 1.139286  
> rkf45_apply._omp_fn.1 (libgsl.so.0.16.0: rkf45.c,233)
>    0.714286       1146448192     0.457143       1129662784 0.557143  
> rkf45_apply._omp_fn.0 (libgsl.so.0.16.0)
> [...]
> The 7 loop make a total of about ~12s.
>
>
> With vectorization :
>         Max   Posix ThreadId          Min   Posix ThreadId Average  
> Function (defining location)
>   Exclusive           of Max    Exclusive           of Min Exclusive
> Time Across                   Time Across                   Time Across
>       Posix                         Posix Posix
> ThreadIds(s)                   ThreadIds(s) ThreadIds(s)
>
>   24.542857  140440801677232    24.542857  140440801677232 24.542857  
> rkf45_apply (libgsl.so.0.16.0)
>
>
> This was obtained with gcc 4.7.2, with 2 quad-core CPUs, for a total 
> of 8 threads. To summarize :
> - Vectorization makes a difference, but it is minor (gained 4s out of 
> the 28s used by the normal rkf45_apply ).
> - OpenMP is the clear winner (gained 20s out of 28s).
>
>
> Maxime
>
>
>
>
>
> Le 2012-12-13 10:51, Rhys Ulerich a Ã©crit :
>>> I am doing that too, but any gain we can get is an important one, 
>>> and it
>>> turns out that by parallelizing rkf45_apply, my simulation runs 30% 
>>> faster
>>> on 8 cores.
>> That's a parallel efficiency of 0.18% ( = 1 time unit / (8 cores *
>> 0.70 time units)).  This feels like you're getting a small
>> memory/cache bandwidth increase for the rkf45_apply level-1-BLAS-like
>> operations by using multiple cores but the cores are otherwise not
>> being used effectively.  I say this because a state vector 1e6 doubles
>> long will not generally fit in cache.  Adding more cores increases the
>> amount of cache available.
>>
>>> I will have a deeper look at vectorization of GSL, but in my 
>>> understanding,
>>> vectorizing can only be done with simple operations, while 
>>> algorithms like
>>> RKF45 involve about 10 operations per loop iterations.
>> The compilers are generally very good.  Intel's icc 11.1 has to be
>> told that the last four loops you annotated are vectorizable. GCC
>> nails it out of the box.
>>
>> On GCC 4.4.3 with something like
>>      CFLAGS="-g -O2 -march=native -mtune=native
>> -ftree-vectorizer-verbose=2 -ftree-vectorize" ../gsl/configure && make
>> shows every one of those 6 loops vectorizing.  You can check this by
>> configuring with those options, running make and waiting for the build
>> to finish, and then cd-ing into ode-initval2 and running
>>      rm rkf45*o && make
>> and observing all those beautiful
>>      LOOP VECTORIZED
>> messages.  Better yet, with those options, 'make check' passes for me
>> on the 'ode-initval2' subdirectory.
>>
>> Try ripping out your OpenMP pragmas in GSL, building with
>> vectorization against stock GSL as I suggested, and then seeing how
>> fast your code runs with GSL vectorized on 1 core versus GSL's
>> unvectorized rkf45_apply parallelized over 8 cores.  I suspect it will
>> be comparable.
>>
>> - Rhys
>


-- 
---------------------------------
Maxime Boissonneault
Analyste de calcul - Calcul QuÃ©bec, UniversitÃ© Laval
Ph. D. en physique

     prev parent reply	other threads:[~2012-12-13 21:14 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-12-11 20:04 Maxime Boissonneault
2012-12-12 16:35 ` Frank Reininghaus
2012-12-12 21:11   ` Maxime Boissonneault
2012-12-12 21:41     ` Rhys Ulerich
2012-12-12 23:32       ` Maxime Boissonneault
2012-12-12 17:05 ` Rhys Ulerich
2012-12-12 21:13   ` Maxime Boissonneault
2012-12-12 21:36     ` Rhys Ulerich
2012-12-12 23:35       ` Maxime Boissonneault
2012-12-13  2:29         ` Rhys Ulerich
2012-12-13 13:22           ` Maxime Boissonneault
2012-12-13 15:53             ` Rhys Ulerich
2012-12-13 16:44               ` Rhys Ulerich
2012-12-13 21:07                 ` Maxime Boissonneault
2012-12-13 21:05               ` Maxime Boissonneault
2012-12-13 21:14                 ` Maxime Boissonneault [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=50CA452B.6070601@calculquebec.ca \
    --to=maxime.boissonneault@calculquebec.ca \
    --cc=gsl-discuss@sourceware.org \
    --cc=rhys.ulerich@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).