* Adding OpenMP support for some of the GSL functions
From: Maxime Boissonneault @ 2012-12-11 20:04 UTC
To: gsl-discuss

Hi,

I am using GSL from another library of my own to perform numerical
integration of vectorial differential equations. After optimizing and
parallelizing most of my library, I came to the conclusion that GSL is
a major bottleneck in my computation, simply because it is not
parallelized to exploit multi-core architectures.

I would like to submit patches to add support for OpenMP within GSL,
allowing easy parallelization on shared-memory architectures. How
should I proceed to do so?

For background information on myself, I have a PhD in computational
physics and I am now a High Performance Computing specialist for
Calcul Québec.

--
---------------------------------
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique
* Re: Adding OpenMP support for some of the GSL functions
From: Frank Reininghaus @ 2012-12-12 16:35 UTC
To: gsl-discuss

Hi,

2012/12/11 Maxime Boissonneault:
> Hi,
> I am using GSL from another library of my own to perform numerical
> integration of vectorial differential equations. After optimizing and
> parallelizing most of my library, I came to the conclusion that GSL
> is a major bottleneck in my computation, simply because it is not
> parallelized to exploit multi-core architectures.
>
> I would like to submit patches to add support for OpenMP within GSL,
> allowing easy parallelization on shared-memory architectures. How
> should I proceed to do so?

I think the most straightforward approach would be to use OpenMP
inside the function that calculates the r.h.s. of your differential
equation. This does not require any modifications to GSL at all.

Reorganising the ODE solver code in GSL such that it can make use of
many cores might not be easily possible, because often the parameters
passed to the r.h.s. function depend on the results of earlier calls.

Best regards,
Frank
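To illustrate Frank's suggestion, a minimal sketch of a user-supplied
r.h.s. callback parallelized with OpenMP, with GSL itself untouched.
The toy system dy_i/dt = -y_i and the dim parameter passed through
params are hypothetical placeholders for the real coupled equations:

    #include <gsl/gsl_errno.h>
    #include <gsl/gsl_odeiv.h>

    /* Hypothetical r.h.s. of a large system; the loop over the
       components is independent, so OpenMP can split it across
       threads without any change to the GSL stepper that calls it. */
    static int
    rhs (double t, const double y[], double f[], void *params)
    {
      const size_t dim = *(const size_t *) params;
      size_t i;
      (void) t;  /* unused in this toy system */

    #pragma omp parallel for
      for (i = 0; i < dim; i++)
        f[i] = -y[i];  /* replace with the real coupled r.h.s. */

      return GSL_SUCCESS;
    }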
* Re: Adding OpenMP support for some of the GSL functions
From: Maxime Boissonneault @ 2012-12-12 21:11 UTC
To: Frank Reininghaus; +Cc: gsl-discuss

[-- Attachment #1: Type: text/plain, Size: 1630 bytes --]

Hi Frank,

The most expensive function in my case is rkf45_apply. I simply added
a few pragmas to its loops, and that sped it up quite a lot.

To put things in context, I am solving millions of differential
equations, so the cost of rkf45_apply in itself is quite important. It
is in fact of the same order as the rest of the code outside the
function (which is parallelized).

Attached is the modified rkf45.c file.

Maxime

On 2012-12-12 11:35, Frank Reininghaus wrote:
> Hi,
>
> 2012/12/11 Maxime Boissonneault:
>> Hi,
>> I am using GSL from another library of my own to perform numerical
>> integration of vectorial differential equations. After optimizing and
>> parallelizing most of my library, I came to the conclusion that GSL
>> is a major bottleneck in my computation, simply because it is not
>> parallelized to exploit multi-core architectures.
>>
>> I would like to submit patches to add support for OpenMP within GSL,
>> allowing easy parallelization on shared-memory architectures. How
>> should I proceed to do so?
> I think the most straightforward approach would be to use OpenMP
> inside the function that calculates the r.h.s. of your differential
> equation. This does not require any modifications to GSL at all.
>
> Reorganising the ODE solver code in GSL such that it can make use of
> many cores might not be easily possible, because often the parameters
> passed to the r.h.s. function depend on the results of earlier calls.
>
> Best regards,
> Frank

--
---------------------------------
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

[-- Attachment #2: rkf45.c --]
[-- Type: text/plain, Size: 9024 bytes --]

/* ode-initval/rkf45.c
 *
 * Copyright (C) 2001, 2004, 2007 Brian Gough
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 3 of the License, or (at
 * your option) any later version.
 *
 * This program is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
 * 02110-1301, USA.
 */

/* Runge-Kutta-Fehlberg 4(5) */

/* Reference eg. Hairer, E., Norsett S.P., Wanner, G. Solving ordinary
   differential equations I, Nonstiff Problems, 2nd revised edition,
   Springer, 2000. */

#include <config.h>
#include <stdlib.h>
#include <string.h>
#include <gsl/gsl_errno.h>
#include <gsl/gsl_odeiv.h>

#include "odeiv_util.h"

/* Runge-Kutta-Fehlberg coefficients. Zero elements left out */

static const double ah[] = { 1.0/4.0, 3.0/8.0, 12.0/13.0, 1.0, 1.0/2.0 };
static const double b3[] = { 3.0/32.0, 9.0/32.0 };
static const double b4[] = { 1932.0/2197.0, -7200.0/2197.0, 7296.0/2197.0 };
static const double b5[] = { 8341.0/4104.0, -32832.0/4104.0, 29440.0/4104.0, -845.0/4104.0 };
static const double b6[] = { -6080.0/20520.0, 41040.0/20520.0, -28352.0/20520.0, 9295.0/20520.0, -5643.0/20520.0 };

static const double c1 = 902880.0/7618050.0;
static const double c3 = 3953664.0/7618050.0;
static const double c4 = 3855735.0/7618050.0;
static const double c5 = -1371249.0/7618050.0;
static const double c6 = 277020.0/7618050.0;

/* These are the differences of fifth and fourth order coefficients
   for error estimation */

static const double ec[] = { 0.0,
  1.0 / 360.0,
  0.0,
  -128.0 / 4275.0,
  -2197.0 / 75240.0,
  1.0 / 50.0,
  2.0 / 55.0
};

typedef struct
{
  double *k1;
  double *k2;
  double *k3;
  double *k4;
  double *k5;
  double *k6;
  double *y0;
  double *ytmp;
}
rkf45_state_t;

static void *
rkf45_alloc (size_t dim)
{
  rkf45_state_t *state = (rkf45_state_t *) malloc (sizeof (rkf45_state_t));

  if (state == 0)
    {
      GSL_ERROR_NULL ("failed to allocate space for rkf45_state", GSL_ENOMEM);
    }

  state->k1 = (double *) malloc (dim * sizeof (double));

  if (state->k1 == 0)
    {
      free (state);
      GSL_ERROR_NULL ("failed to allocate space for k1", GSL_ENOMEM);
    }

  state->k2 = (double *) malloc (dim * sizeof (double));

  if (state->k2 == 0)
    {
      free (state->k1);
      free (state);
      GSL_ERROR_NULL ("failed to allocate space for k2", GSL_ENOMEM);
    }

  state->k3 = (double *) malloc (dim * sizeof (double));

  if (state->k3 == 0)
    {
      free (state->k2);
      free (state->k1);
      free (state);
      GSL_ERROR_NULL ("failed to allocate space for k3", GSL_ENOMEM);
    }

  state->k4 = (double *) malloc (dim * sizeof (double));

  if (state->k4 == 0)
    {
      free (state->k3);
      free (state->k2);
      free (state->k1);
      free (state);
      GSL_ERROR_NULL ("failed to allocate space for k4", GSL_ENOMEM);
    }

  state->k5 = (double *) malloc (dim * sizeof (double));

  if (state->k5 == 0)
    {
      free (state->k4);
      free (state->k3);
      free (state->k2);
      free (state->k1);
      free (state);
      GSL_ERROR_NULL ("failed to allocate space for k5", GSL_ENOMEM);
    }

  state->k6 = (double *) malloc (dim * sizeof (double));

  if (state->k6 == 0)
    {
      free (state->k5);
      free (state->k4);
      free (state->k3);
      free (state->k2);
      free (state->k1);
      free (state);
      GSL_ERROR_NULL ("failed to allocate space for k6", GSL_ENOMEM);
    }

  state->y0 = (double *) malloc (dim * sizeof (double));

  if (state->y0 == 0)
    {
      free (state->k6);
      free (state->k5);
      free (state->k4);
      free (state->k3);
      free (state->k2);
      free (state->k1);
      free (state);
      GSL_ERROR_NULL ("failed to allocate space for y0", GSL_ENOMEM);
    }

  state->ytmp = (double *) malloc (dim * sizeof (double));

  if (state->ytmp == 0)
    {
      free (state->y0);
      free (state->k6);
      free (state->k5);
      free (state->k4);
      free (state->k3);
      free (state->k2);
      free (state->k1);
      free (state);
      GSL_ERROR_NULL ("failed to allocate space for ytmp", GSL_ENOMEM);
    }

  return state;
}


static int
rkf45_apply (void *vstate,
             size_t dim,
             double t,
             double h,
             double y[],
             double yerr[],
             const double dydt_in[],
             double dydt_out[], const gsl_odeiv_system * sys)
{
  rkf45_state_t *state = (rkf45_state_t *) vstate;

  size_t i;

  double *const k1 = state->k1;
  double *const k2 = state->k2;
  double *const k3 = state->k3;
  double *const k4 = state->k4;
  double *const k5 = state->k5;
  double *const k6 = state->k6;
  double *const ytmp = state->ytmp;
  double *const y0 = state->y0;

  DBL_MEMCPY (y0, y, dim);

  /* k1 step */
  if (dydt_in != NULL)
    {
      DBL_MEMCPY (k1, dydt_in, dim);
    }
  else
    {
      int s = GSL_ODEIV_FN_EVAL (sys, t, y, k1);

      if (s != GSL_SUCCESS)
        {
          return s;
        }
    }

#pragma omp parallel for shared(y,h,dim) private(i) default(none)
  for (i = 0; i < dim; i++)
    ytmp[i] = y[i] + ah[0] * h * k1[i];

  /* k2 step */
  {
    int s = GSL_ODEIV_FN_EVAL (sys, t + ah[0] * h, ytmp, k2);

    if (s != GSL_SUCCESS)
      {
        return s;
      }
  }

#pragma omp parallel for shared(y,h,dim) private(i) default(none)
  for (i = 0; i < dim; i++)
    ytmp[i] = y[i] + h * (b3[0] * k1[i] + b3[1] * k2[i]);

  /* k3 step */
  {
    int s = GSL_ODEIV_FN_EVAL (sys, t + ah[1] * h, ytmp, k3);

    if (s != GSL_SUCCESS)
      {
        return s;
      }
  }

#pragma omp parallel for shared(y,h,dim) private(i) default(none)
  for (i = 0; i < dim; i++)
    ytmp[i] = y[i] + h * (b4[0] * k1[i] + b4[1] * k2[i] + b4[2] * k3[i]);

  /* k4 step */
  {
    int s = GSL_ODEIV_FN_EVAL (sys, t + ah[2] * h, ytmp, k4);

    if (s != GSL_SUCCESS)
      {
        return s;
      }
  }

#pragma omp parallel for shared(y,h,dim) private(i) default(none)
  for (i = 0; i < dim; i++)
    ytmp[i] =
      y[i] + h * (b5[0] * k1[i] + b5[1] * k2[i] + b5[2] * k3[i] +
                  b5[3] * k4[i]);

  /* k5 step */
  {
    int s = GSL_ODEIV_FN_EVAL (sys, t + ah[3] * h, ytmp, k5);

    if (s != GSL_SUCCESS)
      {
        return s;
      }
  }

#pragma omp parallel for shared(y,h,dim) private(i) default(none)
  for (i = 0; i < dim; i++)
    ytmp[i] =
      y[i] + h * (b6[0] * k1[i] + b6[1] * k2[i] + b6[2] * k3[i] +
                  b6[3] * k4[i] + b6[4] * k5[i]);

  /* k6 step and final sum */
  {
    int s = GSL_ODEIV_FN_EVAL (sys, t + ah[4] * h, ytmp, k6);

    if (s != GSL_SUCCESS)
      {
        return s;
      }
  }

#pragma omp parallel for shared(y,h,dim) private(i) default(none)
  for (i = 0; i < dim; i++)
    {
      const double d_i = c1 * k1[i] + c3 * k3[i] + c4 * k4[i] +
        c5 * k5[i] + c6 * k6[i];
      y[i] += h * d_i;
    }

  /* Derivatives at output */
  if (dydt_out != NULL)
    {
      int s = GSL_ODEIV_FN_EVAL (sys, t + h, y, dydt_out);

      if (s != GSL_SUCCESS)
        {
          /* Restore initial values */
          DBL_MEMCPY (y, y0, dim);

          return s;
        }
    }

  /* difference between 4th and 5th order */
#pragma omp parallel for shared(y,yerr,h,dim) private(i) default(none)
  for (i = 0; i < dim; i++)
    {
      yerr[i] = h * (ec[1] * k1[i] + ec[3] * k3[i] + ec[4] * k4[i] +
                     ec[5] * k5[i] + ec[6] * k6[i]);
    }

  return GSL_SUCCESS;
}

static int
rkf45_reset (void *vstate, size_t dim)
{
  rkf45_state_t *state = (rkf45_state_t *) vstate;

  DBL_ZERO_MEMSET (state->k1, dim);
  DBL_ZERO_MEMSET (state->k2, dim);
  DBL_ZERO_MEMSET (state->k3, dim);
  DBL_ZERO_MEMSET (state->k4, dim);
  DBL_ZERO_MEMSET (state->k5, dim);
  DBL_ZERO_MEMSET (state->k6, dim);
  DBL_ZERO_MEMSET (state->ytmp, dim);
  DBL_ZERO_MEMSET (state->y0, dim);

  return GSL_SUCCESS;
}

static unsigned int
rkf45_order (void *vstate)
{
  rkf45_state_t *state = (rkf45_state_t *) vstate;
  state = 0;                    /* prevent warnings about unused parameters */
  return 5;
}

static void
rkf45_free (void *vstate)
{
  rkf45_state_t *state = (rkf45_state_t *) vstate;

  free (state->ytmp);
  free (state->y0);
  free (state->k6);
  free (state->k5);
  free (state->k4);
  free (state->k3);
  free (state->k2);
  free (state->k1);
  free (state);
}

static const gsl_odeiv_step_type rkf45_type = {
  "rkf45",                      /* name */
  1,                            /* can use dydt_in */
  0,                            /* gives exact dydt_out */
  &rkf45_alloc,
  &rkf45_apply,
  &rkf45_reset,
  &rkf45_order,
  &rkf45_free
};

const gsl_odeiv_step_type *gsl_odeiv_step_rkf45 = &rkf45_type;
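A note for anyone trying the attachment: the only changes relative to
the stock GSL file are the #pragma omp parallel for annotations on the
per-component loops. These pragmas are inert unless OpenMP is enabled
when GSL is compiled; with GCC that means adding -fopenmp, along the
lines of (a sketch, other compilers use a different flag):

    CFLAGS="-g -O2 -fopenmp" ./configure && make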
* Re: Adding OpenMP support for some of the GSL functions
From: Rhys Ulerich @ 2012-12-12 21:41 UTC
To: Maxime Boissonneault; +Cc: gsl-discuss

> The most expensive function in my case is rkf45_apply. I simply added
> a few pragmas to its loops, and that sped it up quite a lot.

Having looked at where you've placed the #pragma omp parallels, have
you tried enabling vectorization to see if the time spent in those
axpy-like operations could be improved? A good SSE-ready optimizer
should nail those.

I may have misunderstood your "millions of differential equations"
statement. Are you rather solving one problem with a million degrees
of freedom?

- Rhys
* Re: Adding OpenMP support for some of the GSL functions
From: Maxime Boissonneault @ 2012-12-12 23:32 UTC
To: Rhys Ulerich; +Cc: gsl-discuss

Hi Rhys,

I will have a deeper look at vectorization of GSL, but in my
understanding, vectorizing can only be done with simple operations,
while algorithms like RKF45 involve about 10 operations per loop
iteration.

You are correct, I meant one problem with a million degrees of freedom.

Maxime

On 2012-12-12 16:41, Rhys Ulerich wrote:
>> The most expensive function in my case is rkf45_apply. I simply added
>> a few pragmas to its loops, and that sped it up quite a lot.
> Having looked at where you've placed the #pragma omp parallels, have
> you tried enabling vectorization to see if the time spent in those
> axpy-like operations could be improved? A good SSE-ready optimizer
> should nail those.
>
> I may have misunderstood your "millions of differential equations"
> statement. Are you rather solving one problem with a million degrees
> of freedom?
>
> - Rhys
* Re: Adding OpenMP support for some of the GSL functions
From: Rhys Ulerich @ 2012-12-12 17:05 UTC
To: Maxime Boissonneault; +Cc: gsl-discuss

> I am using GSL from another library of my own to perform numerical
> integration of vectorial differential equations. After optimizing and
> parallelizing most of my library, I came to the conclusion that GSL
> is a major bottleneck in my computation, simply because it is not
> parallelized to exploit multi-core architectures.

Have you tried solving many such problems in parallel using
threadprivate GSL workspaces? This permits you to put a parallel
section outside of many GSL invocations by ensuring GSL's working
storage is kept local to each thread.

- Rhys
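A minimal sketch of the pattern Rhys describes, assuming many
independent ODE systems (the n_problems count and the sys[] and y[]
arrays are hypothetical): each thread allocates its own stepper, so no
GSL state is shared and stock, unmodified GSL can be used.

    #include <stdlib.h>
    #include <gsl/gsl_errno.h>
    #include <gsl/gsl_odeiv.h>

    /* Integrate n_problems independent systems in parallel.  Each
       thread owns a private stepper and error vector; only the driver
       loop over whole problems is parallel. */
    void
    integrate_all (const gsl_odeiv_system *sys, double **y,
                   size_t n_problems, size_t dim,
                   double t0, double t1, double h)
    {
    #pragma omp parallel
      {
        /* Workspace local to this thread. */
        gsl_odeiv_step *step =
          gsl_odeiv_step_alloc (gsl_odeiv_step_rkf45, dim);
        double *yerr = (double *) malloc (dim * sizeof (double));
        size_t p;

    #pragma omp for
        for (p = 0; p < n_problems; p++)
          {
            double t = t0;
            int status = GSL_SUCCESS;

            while (t < t1 && status == GSL_SUCCESS)
              {
                status = gsl_odeiv_step_apply (step, t, h, y[p], yerr,
                                               NULL, NULL, &sys[p]);
                t += h;
              }
          }

        free (yerr);
        gsl_odeiv_step_free (step);
      }
    }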
* Re: Adding OpenMP support for some of the GSL functions
From: Maxime Boissonneault @ 2012-12-12 21:13 UTC
To: Rhys Ulerich; +Cc: gsl-discuss

Hi Rhys,

I have not, since it is much simpler to add 6 pragmas to the loops of
rkf45_apply to achieve the same goal.

Maxime

On 2012-12-12 12:05, Rhys Ulerich wrote:
>> I am using GSL from another library of my own to perform numerical
>> integration of vectorial differential equations. After optimizing and
>> parallelizing most of my library, I came to the conclusion that GSL
>> is a major bottleneck in my computation, simply because it is not
>> parallelized to exploit multi-core architectures.
> Have you tried solving many such problems in parallel using
> threadprivate GSL workspaces? This permits you to put a parallel
> section outside of many GSL invocations by ensuring GSL's working
> storage is kept local to each thread.
>
> - Rhys
* Re: Adding OpenMP support for some of the GSL functions
From: Rhys Ulerich @ 2012-12-12 21:36 UTC
To: Maxime Boissonneault; +Cc: gsl-discuss

> I have not, since it is much simpler to add 6 pragmas to the loops of
> rkf45_apply to achieve the same goal.

I suspect, from glancing at the first example at
http://www.gnu.org/software/gsl/manual/html_node/ODE-Example-programs.html,
you could achieve parallelization atop stock GSL builds with fewer
than 6 pragmas.

I also suspect that by using OpenMP tasks at the coarser granularity
(one ODE system per task) instead of parallel-fors at the finer
granularity (one loop within each step) you may see better overall
performance.

- Rhys
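A rough sketch of the task layout Rhys has in mind, under the
assumption that the problem decomposes into independent systems (the
n_problems count and the integrate_one_system driver are hypothetical):

    /* One task per ODE system: coarse-grained work units keep every
       thread busy even when systems need different numbers of steps. */
    #pragma omp parallel
    #pragma omp single
    {
      size_t p;
      for (p = 0; p < n_problems; p++)
        {
    #pragma omp task firstprivate(p)
          integrate_one_system (p);  /* hypothetical per-system driver */
        }
    }  /* implicit barrier: all tasks finish before the region ends */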
* Re: Adding OpenMP support for some of the GSL functions
From: Maxime Boissonneault @ 2012-12-12 23:35 UTC
To: Rhys Ulerich; +Cc: gsl-discuss

Hi Rhys,

I am not sure if you mean parallelizing the for loop in this code, but
that would not work, as each iteration is not independent of the next
one. Differential equation solving is intrinsically serial.

Nor can I simply split the problem into smaller chunks (let's say 100k
instead of 1M), because they are coupled, i.e. func can only be
defined for the whole system, not for some of its parts.

Thanks,
Maxime

On 2012-12-12 16:36, Rhys Ulerich wrote:
>> I have not, since it is much simpler to add 6 pragmas to the loops of
>> rkf45_apply to achieve the same goal.
> I suspect, from glancing at the first example at
> http://www.gnu.org/software/gsl/manual/html_node/ODE-Example-programs.html,
> you could achieve parallelization atop stock GSL builds with fewer
> than 6 pragmas.
>
> I also suspect that by using OpenMP tasks at the coarser granularity
> (one ODE system per task) instead of parallel-fors at the finer
> granularity (one loop within each step) you may see better overall
> performance.
>
> - Rhys
* Re: Adding OpenMP support for some of the GSL functions
From: Rhys Ulerich @ 2012-12-13 2:29 UTC
To: Maxime Boissonneault; +Cc: gsl-discuss

> I am not sure if you mean parallelizing the for loop in this code, but
> that would not work, as each iteration is not independent of the next
> one. Differential equation solving is intrinsically serial.

True, solving one differential equation is serial. I'd thought you
were performing a parameter sweep.

- Rhys
* Re: Adding OpenMP support for some of the GSL functions
From: Maxime Boissonneault @ 2012-12-13 13:22 UTC
To: Rhys Ulerich; +Cc: gsl-discuss

I am doing that too, but any gain we can get is an important one, and
it turns out that by parallelizing rkf45_apply, my simulation runs 30%
faster on 8 cores.

Maxime

On 2012-12-12 21:28, Rhys Ulerich wrote:
>> I am not sure if you mean parallelizing the for loop in this code,
>> but that would not work, as each iteration is not independent of the
>> next one. Differential equation solving is intrinsically serial.
> True, solving one differential equation is serial. I'd thought you
> were performing a parameter sweep.
>
> - Rhys
* Re: Adding OpenMP support for some of the GSL functions
From: Rhys Ulerich @ 2012-12-13 15:53 UTC
To: Maxime Boissonneault; +Cc: gsl-discuss

> I am doing that too, but any gain we can get is an important one, and
> it turns out that by parallelizing rkf45_apply, my simulation runs 30%
> faster on 8 cores.

That's a parallel efficiency of about 18% (= 1 time unit / (8 cores *
0.70 time units)). This feels like you're getting a small memory/cache
bandwidth increase for the rkf45_apply level-1-BLAS-like operations by
using multiple cores, but the cores are otherwise not being used
effectively. I say this because a state vector 1e6 doubles long will
not generally fit in cache. Adding more cores increases the amount of
cache available.

> I will have a deeper look at vectorization of GSL, but in my
> understanding, vectorizing can only be done with simple operations,
> while algorithms like RKF45 involve about 10 operations per loop
> iteration.

The compilers are generally very good. Intel's icc 11.1 has to be told
that the last four loops you annotated are vectorizable. GCC nails it
out of the box.

On GCC 4.4.3, configuring with something like

    CFLAGS="-g -O2 -march=native -mtune=native -ftree-vectorizer-verbose=2 -ftree-vectorize" ../gsl/configure && make

shows every one of those 6 loops vectorizing. You can check this by
configuring with those options, running make and waiting for the build
to finish, and then cd-ing into ode-initval2 and running

    rm rkf45*o && make

and observing all those beautiful

    LOOP VECTORIZED

messages. Better yet, with those options, 'make check' passes for me
on the 'ode-initval2' subdirectory.

Try ripping out your OpenMP pragmas in GSL, building against stock GSL
with vectorization as I suggested, and then seeing how fast your code
runs with GSL vectorized on 1 core versus GSL's unvectorized
rkf45_apply parallelized over 8 cores. I suspect it will be comparable.

- Rhys
* Re: Adding OpenMP support for some of the GSL functions
From: Rhys Ulerich @ 2012-12-13 16:44 UTC
To: Maxime Boissonneault; +Cc: gsl-discuss

> This feels like you're getting a small memory/cache bandwidth increase
> for the rkf45_apply level-1-BLAS-like operations by using multiple
> cores, but the cores are otherwise not being used effectively. I say
> this because a state vector 1e6 doubles long will not generally fit in
> cache. Adding more cores increases the amount of cache available.

Hmm... I tentatively take this back on re-thinking how you've added
the #pragma omp lines to the rkf45.c file you attached elsewhere in
this thread. Try using a single

    #pragma omp parallel

and then individual lines like

    #pragma omp for

at each for loop. Using

    #pragma omp parallel for

repeatedly as you've done can, depending on your compiler, incur
unnecessary fork/join overhead at every one of those loops.

- Rhys
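A sketch of the restructuring Rhys proposes, shown for the first two
loops of rkf45_apply only; the early returns in the r.h.s. error
handling are the complication Maxime raises in his reply below:

    /* Sketch: one thread team reused across work-sharing loops.  The
       GSL_ODEIV_FN_EVAL calls between the loops return on error,
       which is not allowed inside a parallel region -- see the
       follow-up below. */
    #pragma omp parallel shared(y, h, dim) private(i)
      {
    #pragma omp for
        for (i = 0; i < dim; i++)
          ytmp[i] = y[i] + ah[0] * h * k1[i];

        /* ... the k2 r.h.s. evaluation would have to run in a
           #pragma omp single block here, with its error status
           propagated through a shared flag instead of a return ... */

    #pragma omp for
        for (i = 0; i < dim; i++)
          ytmp[i] = y[i] + h * (b3[0] * k1[i] + b3[1] * k2[i]);
      }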
* Re: Adding OpenMP support for some of the GSL functions
From: Maxime Boissonneault @ 2012-12-13 21:07 UTC
To: Rhys Ulerich; +Cc: gsl-discuss

Hi Rhys,

While that is true in theory, it is not applicable in practice, since
there can be no "return" within a parallel region. We need one
parallel region for each loop in this case.

Maxime

On 2012-12-13 11:44, Rhys Ulerich wrote:
>> This feels like you're getting a small memory/cache bandwidth
>> increase for the rkf45_apply level-1-BLAS-like operations by using
>> multiple cores, but the cores are otherwise not being used
>> effectively. I say this because a state vector 1e6 doubles long will
>> not generally fit in cache. Adding more cores increases the amount
>> of cache available.
> Hmm... I tentatively take this back on re-thinking how you've added
> the #pragma omp lines to the rkf45.c file you attached elsewhere in
> this thread. Try using a single
>     #pragma omp parallel
> and then individual lines like
>     #pragma omp for
> at each for loop. Using
>     #pragma omp parallel for
> repeatedly as you've done can, depending on your compiler, incur
> unnecessary fork/join overhead at every one of those loops.
>
> - Rhys
* Re: Adding OpenMP support for some of the GSL functions
From: Maxime Boissonneault @ 2012-12-13 21:05 UTC
To: Rhys Ulerich; +Cc: gsl-discuss

Hi Rhys,

I did the comparison you requested. These results were all obtained
with gcc 4.7.2, on 2 quad-core CPUs, for a total of 8 threads. The
total times for the runs were:

- Without vectorization or OpenMP : 1m21s
- With vectorization              : 1m14s
- With OpenMP                     : 1m01s

Here is load-balance information obtained with OpenSpeedshop. Times
are exclusive seconds; Max, Min and Average are taken across POSIX
thread IDs.

Without vectorization or OpenMP:

  Max(s)     ThreadId of Max  Min(s)     ThreadId of Min  Average(s) Function (defining location)
  28.085714  140384673677232  28.085714  140384673677232  28.085714  rkf45_apply (libgsl.so.0.16.0)

With OpenMP:

  Max(s)     ThreadId of Max  Min(s)     ThreadId of Min  Average(s) Function (defining location)
  [...]
  2.800000   1146448192       1.228571   1088166208       1.778571   rkf45_apply._omp_fn.4 (libgsl.so.0.16.0)
  2.171429   1146448192       0.971429   1129662784       1.460714   rkf45_apply._omp_fn.6 (libgsl.so.0.16.0)
  2.142857   1146448192       1.400000   140073673240496  1.671429   rkf45_apply._omp_fn.3 (libgsl.so.0.16.0)
  2.085714   1146448192       1.257143   1112877376       1.725000   rkf45_apply._omp_fn.5 (libgsl.so.0.16.0)
  2.085714   1146448192       1.171429   1088166208       1.578571   rkf45_apply._omp_fn.2 (libgsl.so.0.16.0)
  1.600000   1146448192       0.885714   140073673240496  1.139286   rkf45_apply._omp_fn.1 (libgsl.so.0.16.0: rkf45.c,233)
  0.714286   1146448192       0.457143   1129662784       0.557143   rkf45_apply._omp_fn.0 (libgsl.so.0.16.0)
  [...]

The 7 loops make a total of about 12s.

With vectorization:

  Max(s)     ThreadId of Max  Min(s)     ThreadId of Min  Average(s) Function (defining location)
  24.542857  140440801677232  24.542857  140440801677232  24.542857  rkf45_apply (libgsl.so.0.16.0)

To summarize:
- Vectorization makes a difference, but it is minor (gained 4s out of
  the 28s used by the normal rkf45_apply).
- OpenMP is the clear winner (gained 20s out of 28s).

Maxime

On 2012-12-13 10:51, Rhys Ulerich wrote:
>> I am doing that too, but any gain we can get is an important one, and
>> it turns out that by parallelizing rkf45_apply, my simulation runs
>> 30% faster on 8 cores.
> That's a parallel efficiency of about 18% (= 1 time unit / (8 cores *
> 0.70 time units)). This feels like you're getting a small memory/cache
> bandwidth increase for the rkf45_apply level-1-BLAS-like operations by
> using multiple cores, but the cores are otherwise not being used
> effectively. I say this because a state vector 1e6 doubles long will
> not generally fit in cache. Adding more cores increases the amount of
> cache available.
>
>> I will have a deeper look at vectorization of GSL, but in my
>> understanding, vectorizing can only be done with simple operations,
>> while algorithms like RKF45 involve about 10 operations per loop
>> iteration.
> The compilers are generally very good. Intel's icc 11.1 has to be told
> that the last four loops you annotated are vectorizable. GCC nails it
> out of the box.
>
> On GCC 4.4.3, configuring with something like
>
>     CFLAGS="-g -O2 -march=native -mtune=native -ftree-vectorizer-verbose=2 -ftree-vectorize" ../gsl/configure && make
>
> shows every one of those 6 loops vectorizing. [...] Better yet, with
> those options, 'make check' passes for me on the 'ode-initval2'
> subdirectory.
>
> Try ripping out your OpenMP pragmas in GSL, building against stock GSL
> with vectorization as I suggested, and then seeing how fast your code
> runs with GSL vectorized on 1 core versus GSL's unvectorized
> rkf45_apply parallelized over 8 cores. I suspect it will be comparable.
>
> - Rhys
* Re: Adding OpenMP support for some of the GSL functions
From: Maxime Boissonneault @ 2012-12-13 21:14 UTC
To: Rhys Ulerich; +Cc: gsl-discuss

Hi again,

Since I noticed that the OpenMP version did not have a balanced
workload amongst threads (which may have to do with data locality,
memory affinity, etc.), I added the clause "schedule(runtime)" to each
of my pragmas. At runtime, defining the environment variable
OMP_SCHEDULE="guided,1", I got the slightly better runtime of 57s with
the OpenMP version.

Maxime

On 2012-12-13 16:05, Maxime Boissonneault wrote:
> Hi Rhys,
>
> I did the comparison you requested. These results were all obtained
> with gcc 4.7.2, on 2 quad-core CPUs, for a total of 8 threads. The
> total times for the runs were:
>
> - Without vectorization or OpenMP : 1m21s
> - With vectorization              : 1m14s
> - With OpenMP                     : 1m01s
>
> [...]
>
> To summarize:
> - Vectorization makes a difference, but it is minor (gained 4s out of
>   the 28s used by the normal rkf45_apply).
> - OpenMP is the clear winner (gained 20s out of 28s).
>
> Maxime

--
---------------------------------
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique
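Concretely, the change Maxime describes turns each pragma in the
attached rkf45.c into the following (shown for the first loop):

    /* schedule(runtime) defers the scheduling choice to the
       OMP_SCHEDULE environment variable, e.g. OMP_SCHEDULE="guided,1". */
    #pragma omp parallel for shared(y,h,dim) private(i) default(none) schedule(runtime)
      for (i = 0; i < dim; i++)
        ytmp[i] = y[i] + ah[0] * h * k1[i];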