Re: Adding OpenMP support for some of the GSL functions


From: Maxime Boissonneault <maxime.boissonneault@calculquebec.ca>
To: Rhys Ulerich <rhys.ulerich@gmail.com>
Cc: gsl-discuss@sourceware.org
Subject: Re: Adding OpenMP support for some of the GSL functions
Date: Thu, 13 Dec 2012 21:05:00 -0000
Message-ID: <50CA4313.30708@calculquebec.ca>
In-Reply-To: <CAKDqugQRt+q-V+Sv=AxmREwZAq1jFrGzU+xFmZi1YRQDSnXmfA@mail.gmail.com>

Hi Rhys,
I did the comparison you requested. All of the results below were
obtained with gcc 4.7.2. The total times for the runs were:
- Without vectorization or OpenMP: 1m21s
- With vectorization: 1m14s
- With OpenMP: 1m01s
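
As a rough cross-check on these totals, the speedup and parallel efficiency can be computed directly. The sketch below is only illustrative: the function names `speedup` and `parallel_efficiency` are made up for this example, and the thread count passed to it is an input, here assumed to be the 8 threads used for the OpenMP run.

```c
#include <assert.h>

/* Speedup of a run relative to a baseline: t_base / t_new.
   Both times must be positive and in the same unit (e.g. seconds). */
double speedup(double t_base, double t_new)
{
    assert(t_base > 0.0 && t_new > 0.0);
    return t_base / t_new;
}

/* Parallel efficiency: speedup divided by the number of threads used.
   A value of 1.0 would mean perfect scaling. */
double parallel_efficiency(double t_serial, double t_parallel, int nthreads)
{
    assert(nthreads > 0);
    return speedup(t_serial, t_parallel) / nthreads;
}
```

With the totals above (81s, 74s and 61s), `speedup(81, 74)` is about 1.09, `speedup(81, 61)` is about 1.33, and `parallel_efficiency(81, 61, 8)` is about 0.17. These figures are for the whole run, not for rkf45_apply alone.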

Here is load-balance information obtained with OpenSpeedshop.
Without vectorization or OpenMP:
    Max Exclusive     Posix ThreadId      Min Exclusive     Posix ThreadId  Average Exclusive  Function (defining location)
      Time Across             of Max        Time Across             of Min        Time Across
Posix ThreadId(s)                     Posix ThreadId(s)                     Posix ThreadId(s)

        28.085714    140384673677232          28.085714    140384673677232          28.085714  rkf45_apply (libgsl.so.0.16.0)

With OpenMP:
    Max Exclusive     Posix ThreadId      Min Exclusive     Posix ThreadId  Average Exclusive  Function (defining location)
      Time Across             of Max        Time Across             of Min        Time Across
Posix ThreadId(s)                     Posix ThreadId(s)                     Posix ThreadId(s)

[...]
         2.800000         1146448192           1.228571         1088166208           1.778571  rkf45_apply._omp_fn.4 (libgsl.so.0.16.0)
         2.171429         1146448192           0.971429         1129662784           1.460714  rkf45_apply._omp_fn.6 (libgsl.so.0.16.0)
         2.142857         1146448192           1.400000    140073673240496           1.671429  rkf45_apply._omp_fn.3 (libgsl.so.0.16.0)
         2.085714         1146448192           1.257143         1112877376           1.725000  rkf45_apply._omp_fn.5 (libgsl.so.0.16.0)
         2.085714         1146448192           1.171429         1088166208           1.578571  rkf45_apply._omp_fn.2 (libgsl.so.0.16.0)
         1.600000         1146448192           0.885714    140073673240496           1.139286  rkf45_apply._omp_fn.1 (libgsl.so.0.16.0: rkf45.c,233)
         0.714286         1146448192           0.457143         1129662784           0.557143  rkf45_apply._omp_fn.0 (libgsl.so.0.16.0)
[...]
The 7 loops add up to about 12s in total.


With vectorization:
    Max Exclusive     Posix ThreadId      Min Exclusive     Posix ThreadId  Average Exclusive  Function (defining location)
      Time Across             of Max        Time Across             of Min        Time Across
Posix ThreadId(s)                     Posix ThreadId(s)                     Posix ThreadId(s)

        24.542857    140440801677232          24.542857    140440801677232          24.542857  rkf45_apply (libgsl.so.0.16.0)


These results were obtained with gcc 4.7.2 on 2 quad-core CPUs, for a
total of 8 threads. To summarize:
- Vectorization makes a difference, but it is minor (it gained about 4s
out of the 28s used by the unmodified rkf45_apply).
- OpenMP is the clear winner (it gained 20s out of 28s).
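
For readers who have not seen the patch, the kind of loop being timed here looks roughly like the sketch below. This is not the actual rkf45.c source: the function name `stage_update`, its arguments, and the two-term combination are assumptions chosen to illustrate a level-1-BLAS-like update over the state vector, parallelized with an OpenMP `parallel for`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stage update in the style of a Runge-Kutta step:
   ytmp[i] = y[i] + h * (b1 * k1[i] + b2 * k2[i]) for every component.
   Each iteration is independent, so the loop can be split across threads. */
void stage_update(size_t dim, double h, double b1, double b2,
                  const double *y, const double *k1, const double *k2,
                  double *ytmp)
{
    long i;                     /* signed loop index for the OpenMP loop */
    const long n = (long) dim;

    assert(y != NULL && k1 != NULL && k2 != NULL && ytmp != NULL);

    /* Without -fopenmp the pragma is ignored and the loop runs serially. */
#pragma omp parallel for
    for (i = 0; i < n; i++)
        ytmp[i] = y[i] + h * (b1 * k1[i] + b2 * k2[i]);
}
```

Built with `gcc -fopenmp`, the iterations are divided among the available threads. Built without it, the same source is a plain loop of the kind the auto-vectorizer targets, so both approaches can be compared on identical code.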


Maxime





On 2012-12-13 10:51, Rhys Ulerich wrote:
>> I am doing that too, but any gain we can get is an important one, and it
>> turns out that by parallelizing rkf45_apply, my simulation runs 30% faster
>> on 8 cores.
> That's a parallel efficiency of about 18% ( = 1 time unit / (8 cores *
> 0.70 time units) = 0.18).  This feels like you're getting a small
> memory/cache bandwidth increase for the rkf45_apply level-1-BLAS-like
> operations by using multiple cores but the cores are otherwise not
> being used effectively.  I say this because a state vector 1e6 doubles
> long will not generally fit in cache.  Adding more cores increases the
> amount of cache available.
>
>> I will have a deeper look at vectorization of GSL, but in my understanding,
>> vectorizing can only be done with simple operations, while algorithms like
>> RKF45 involve about 10 operations per loop iteration.
> The compilers are generally very good.  Intel's icc 11.1 has to be
> told that the last four loops you annotated are vectorizable.  GCC
> nails it out of the box.
>
> On GCC 4.4.3 with something like
>      CFLAGS="-g -O2 -march=native -mtune=native -ftree-vectorizer-verbose=2 -ftree-vectorize" ../gsl/configure && make
> shows every one of those 6 loops vectorizing.  You can check this by
> configuring with those options, running make and waiting for the build
> to finish, and then cd-ing into ode-initval2 and running
>      rm rkf45*o && make
> and observing all those beautiful
>      LOOP VECTORIZED
> messages.  Better yet, with those options, 'make check' passes for me
> on the 'ode-initval2' subdirectory.
>
> Try ripping out your OpenMP pragmas in GSL, building with
> vectorization against stock GSL as I suggested, and then seeing how
> fast your code runs with GSL vectorized on 1 core versus GSL's
> unvectorized rkf45_apply parallelized over 8 cores.  I suspect it will
> be comparable.
>
> - Rhys
