* Adding OpenMP support for some of the GSL functions
From: Maxime Boissonneault @ 2012-12-11 20:04 UTC
To: gsl-discuss

Hi,

I am using GSL from another library of my own to perform numerical
integration of vectorial differential equations. After optimizing and
parallelizing most of my library, I came to the conclusion that GSL is
a major bottleneck in my computation, simply because it is not
parallelized to exploit multi-core architectures.

I would like to submit patches to add support for OpenMP within GSL,
allowing easy parallelization on shared-memory architectures. How
should I proceed to do so?

For background information on myself, I have a PhD in computational
physics and I am now a High Performance Computing specialist for
Calcul Québec.

--
---------------------------------
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique
* Re: Adding OpenMP support for some of the GSL functions
From: Frank Reininghaus @ 2012-12-12 16:35 UTC
To: gsl-discuss

Hi,

2012/12/11 Maxime Boissonneault:
> Hi,
> I am using GSL from another library of my own to perform numerical
> integration of vectorial differential equations. After optimizing and
> parallelizing most of my library, I came to the conclusion that GSL
> is a major bottleneck in my computation, simply because it is not
> parallelized to exploit multi-core architectures.
>
> I would like to submit patches to add support for OpenMP within GSL,
> allowing easy parallelization on shared-memory architectures. How
> should I proceed to do so?

I think the most straightforward approach would be to use OpenMP
inside the function that calculates the r.h.s. of your differential
equation. This does not require any modifications to GSL at all.

Reorganising the ODE solver code in GSL such that it can make use of
many cores might not be easily possible, because often the parameters
passed to the r.h.s. function depend on the results of earlier calls.

Best regards,
Frank
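To illustrate Frank's suggestion, a minimal sketch of a user-supplied
r.h.s. callback parallelized with OpenMP, with GSL itself untouched.
The toy system dy_i/dt = -y_i and the dim parameter passed through
params are hypothetical placeholders for the real coupled equations:

    #include <gsl/gsl_errno.h>
    #include <gsl/gsl_odeiv.h>

    /* Hypothetical r.h.s. of a large system; the loop over the
       components is independent, so OpenMP can split it across
       threads without any change to the GSL stepper that calls it. */
    static int
    rhs (double t, const double y[], double f[], void *params)
    {
      const size_t dim = *(const size_t *) params;
      size_t i;
      (void) t;  /* unused in this toy system */

    #pragma omp parallel for
      for (i = 0; i < dim; i++)
        f[i] = -y[i];  /* replace with the real coupled r.h.s. */

      return GSL_SUCCESS;
    }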
* Re: Adding OpenMP support for some of the GSL functions
From: Maxime Boissonneault @ 2012-12-12 21:11 UTC
To: Frank Reininghaus; +Cc: gsl-discuss

[-- Attachment #1: Type: text/plain, Size: 1630 bytes --]

Hi Frank,

The most expensive function in my case is rkf45_apply. I simply added
a few pragmas to its loops, and that sped it up quite a lot.

To put things in context, I am solving millions of differential
equations, so the cost of rkf45_apply in itself is quite important. It
is in fact of the same order as the rest of the code outside the
function (which is parallelized).

Attached is the modified rkf45.c file.

Maxime

On 2012-12-12 11:35, Frank Reininghaus wrote:
> Hi,
>
> 2012/12/11 Maxime Boissonneault:
>> Hi,
>> I am using GSL from another library of my own to perform numerical
>> integration of vectorial differential equations. After optimizing and
>> parallelizing most of my library, I came to the conclusion that GSL
>> is a major bottleneck in my computation, simply because it is not
>> parallelized to exploit multi-core architectures.
>>
>> I would like to submit patches to add support for OpenMP within GSL,
>> allowing easy parallelization on shared-memory architectures. How
>> should I proceed to do so?
> I think the most straightforward approach would be to use OpenMP
> inside the function that calculates the r.h.s. of your differential
> equation. This does not require any modifications to GSL at all.
>
> Reorganising the ODE solver code in GSL such that it can make use of
> many cores might not be easily possible, because often the parameters
> passed to the r.h.s. function depend on the results of earlier calls.
>
> Best regards,
> Frank

--
---------------------------------
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

[-- Attachment #2: rkf45.c --]
[-- Type: text/plain, Size: 9024 bytes --]

/* ode-initval/rkf45.c
 *
 * Copyright (C) 2001, 2004, 2007 Brian Gough
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 3 of the License, or (at
 * your option) any later version.
 *
 * This program is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
 * 02110-1301, USA.
 */

/* Runge-Kutta-Fehlberg 4(5) */

/* Reference eg. Hairer, E., Norsett S.P., Wanner, G. Solving ordinary
   differential equations I, Nonstiff Problems, 2nd revised edition,
   Springer, 2000. */

#include <config.h>
#include <stdlib.h>
#include <string.h>
#include <gsl/gsl_errno.h>
#include <gsl/gsl_odeiv.h>

#include "odeiv_util.h"

/* Runge-Kutta-Fehlberg coefficients. Zero elements left out */

static const double ah[] = { 1.0/4.0, 3.0/8.0, 12.0/13.0, 1.0, 1.0/2.0 };
static const double b3[] = { 3.0/32.0, 9.0/32.0 };
static const double b4[] = { 1932.0/2197.0, -7200.0/2197.0, 7296.0/2197.0 };
static const double b5[] = { 8341.0/4104.0, -32832.0/4104.0, 29440.0/4104.0, -845.0/4104.0 };
static const double b6[] = { -6080.0/20520.0, 41040.0/20520.0, -28352.0/20520.0, 9295.0/20520.0, -5643.0/20520.0 };

static const double c1 = 902880.0/7618050.0;
static const double c3 = 3953664.0/7618050.0;
static const double c4 = 3855735.0/7618050.0;
static const double c5 = -1371249.0/7618050.0;
static const double c6 = 277020.0/7618050.0;

/* These are the differences of fifth and fourth order coefficients
   for error estimation */

static const double ec[] = { 0.0,
  1.0 / 360.0,
  0.0,
  -128.0 / 4275.0,
  -2197.0 / 75240.0,
  1.0 / 50.0,
  2.0 / 55.0
};

typedef struct
{
  double *k1;
  double *k2;
  double *k3;
  double *k4;
  double *k5;
  double *k6;
  double *y0;
  double *ytmp;
}
rkf45_state_t;

static void *
rkf45_alloc (size_t dim)
{
  rkf45_state_t *state = (rkf45_state_t *) malloc (sizeof (rkf45_state_t));

  if (state == 0)
    {
      GSL_ERROR_NULL ("failed to allocate space for rkf45_state", GSL_ENOMEM);
    }

  state->k1 = (double *) malloc (dim * sizeof (double));

  if (state->k1 == 0)
    {
      free (state);
      GSL_ERROR_NULL ("failed to allocate space for k1", GSL_ENOMEM);
    }

  state->k2 = (double *) malloc (dim * sizeof (double));

  if (state->k2 == 0)
    {
      free (state->k1);
      free (state);
      GSL_ERROR_NULL ("failed to allocate space for k2", GSL_ENOMEM);
    }

  state->k3 = (double *) malloc (dim * sizeof (double));

  if (state->k3 == 0)
    {
      free (state->k2);
      free (state->k1);
      free (state);
      GSL_ERROR_NULL ("failed to allocate space for k3", GSL_ENOMEM);
    }

  state->k4 = (double *) malloc (dim * sizeof (double));

  if (state->k4 == 0)
    {
      free (state->k3);
      free (state->k2);
      free (state->k1);
      free (state);
      GSL_ERROR_NULL ("failed to allocate space for k4", GSL_ENOMEM);
    }

  state->k5 = (double *) malloc (dim * sizeof (double));

  if (state->k5 == 0)
    {
      free (state->k4);
      free (state->k3);
      free (state->k2);
      free (state->k1);
      free (state);
      GSL_ERROR_NULL ("failed to allocate space for k5", GSL_ENOMEM);
    }

  state->k6 = (double *) malloc (dim * sizeof (double));

  if (state->k6 == 0)
    {
      free (state->k5);
      free (state->k4);
      free (state->k3);
      free (state->k2);
      free (state->k1);
      free (state);
      GSL_ERROR_NULL ("failed to allocate space for k6", GSL_ENOMEM);
    }

  state->y0 = (double *) malloc (dim * sizeof (double));

  if (state->y0 == 0)
    {
      free (state->k6);
      free (state->k5);
      free (state->k4);
      free (state->k3);
      free (state->k2);
      free (state->k1);
      free (state);
      GSL_ERROR_NULL ("failed to allocate space for y0", GSL_ENOMEM);
    }

  state->ytmp = (double *) malloc (dim * sizeof (double));

  if (state->ytmp == 0)
    {
      free (state->y0);
      free (state->k6);
      free (state->k5);
      free (state->k4);
      free (state->k3);
      free (state->k2);
      free (state->k1);
      free (state);
      GSL_ERROR_NULL ("failed to allocate space for ytmp", GSL_ENOMEM);
    }

  return state;
}


static int
rkf45_apply (void *vstate,
             size_t dim,
             double t,
             double h,
             double y[],
             double yerr[],
             const double dydt_in[],
             double dydt_out[], const gsl_odeiv_system * sys)
{
  rkf45_state_t *state = (rkf45_state_t *) vstate;

  size_t i;

  double *const k1 = state->k1;
  double *const k2 = state->k2;
  double *const k3 = state->k3;
  double *const k4 = state->k4;
  double *const k5 = state->k5;
  double *const k6 = state->k6;
  double *const ytmp = state->ytmp;
  double *const y0 = state->y0;

  DBL_MEMCPY (y0, y, dim);

  /* k1 step */
  if (dydt_in != NULL)
    {
      DBL_MEMCPY (k1, dydt_in, dim);
    }
  else
    {
      int s = GSL_ODEIV_FN_EVAL (sys, t, y, k1);

      if (s != GSL_SUCCESS)
        {
          return s;
        }
    }

#pragma omp parallel for shared(y,h,dim) private(i) default(none)
  for (i = 0; i < dim; i++)
    ytmp[i] = y[i] + ah[0] * h * k1[i];

  /* k2 step */
  {
    int s = GSL_ODEIV_FN_EVAL (sys, t + ah[0] * h, ytmp, k2);

    if (s != GSL_SUCCESS)
      {
        return s;
      }
  }

#pragma omp parallel for shared(y,h,dim) private(i) default(none)
  for (i = 0; i < dim; i++)
    ytmp[i] = y[i] + h * (b3[0] * k1[i] + b3[1] * k2[i]);

  /* k3 step */
  {
    int s = GSL_ODEIV_FN_EVAL (sys, t + ah[1] * h, ytmp, k3);

    if (s != GSL_SUCCESS)
      {
        return s;
      }
  }

#pragma omp parallel for shared(y,h,dim) private(i) default(none)
  for (i = 0; i < dim; i++)
    ytmp[i] = y[i] + h * (b4[0] * k1[i] + b4[1] * k2[i] + b4[2] * k3[i]);

  /* k4 step */
  {
    int s = GSL_ODEIV_FN_EVAL (sys, t + ah[2] * h, ytmp, k4);

    if (s != GSL_SUCCESS)
      {
        return s;
      }
  }

#pragma omp parallel for shared(y,h,dim) private(i) default(none)
  for (i = 0; i < dim; i++)
    ytmp[i] =
      y[i] + h * (b5[0] * k1[i] + b5[1] * k2[i] + b5[2] * k3[i] +
                  b5[3] * k4[i]);

  /* k5 step */
  {
    int s = GSL_ODEIV_FN_EVAL (sys, t + ah[3] * h, ytmp, k5);

    if (s != GSL_SUCCESS)
      {
        return s;
      }
  }

#pragma omp parallel for shared(y,h,dim) private(i) default(none)
  for (i = 0; i < dim; i++)
    ytmp[i] =
      y[i] + h * (b6[0] * k1[i] + b6[1] * k2[i] + b6[2] * k3[i] +
                  b6[3] * k4[i] + b6[4] * k5[i]);

  /* k6 step and final sum */
  {
    int s = GSL_ODEIV_FN_EVAL (sys, t + ah[4] * h, ytmp, k6);

    if (s != GSL_SUCCESS)
      {
        return s;
      }
  }

#pragma omp parallel for shared(y,h,dim) private(i) default(none)
  for (i = 0; i < dim; i++)
    {
      const double d_i = c1 * k1[i] + c3 * k3[i] + c4 * k4[i] +
        c5 * k5[i] + c6 * k6[i];
      y[i] += h * d_i;
    }

  /* Derivatives at output */
  if (dydt_out != NULL)
    {
      int s = GSL_ODEIV_FN_EVAL (sys, t + h, y, dydt_out);

      if (s != GSL_SUCCESS)
        {
          /* Restore initial values */
          DBL_MEMCPY (y, y0, dim);

          return s;
        }
    }

  /* difference between 4th and 5th order */
#pragma omp parallel for shared(y,yerr,h,dim) private(i) default(none)
  for (i = 0; i < dim; i++)
    {
      yerr[i] = h * (ec[1] * k1[i] + ec[3] * k3[i] + ec[4] * k4[i] +
                     ec[5] * k5[i] + ec[6] * k6[i]);
    }

  return GSL_SUCCESS;
}

static int
rkf45_reset (void *vstate, size_t dim)
{
  rkf45_state_t *state = (rkf45_state_t *) vstate;

  DBL_ZERO_MEMSET (state->k1, dim);
  DBL_ZERO_MEMSET (state->k2, dim);
  DBL_ZERO_MEMSET (state->k3, dim);
  DBL_ZERO_MEMSET (state->k4, dim);
  DBL_ZERO_MEMSET (state->k5, dim);
  DBL_ZERO_MEMSET (state->k6, dim);
  DBL_ZERO_MEMSET (state->ytmp, dim);
  DBL_ZERO_MEMSET (state->y0, dim);

  return GSL_SUCCESS;
}

static unsigned int
rkf45_order (void *vstate)
{
  rkf45_state_t *state = (rkf45_state_t *) vstate;
  state = 0;                    /* prevent warnings about unused parameters */
  return 5;
}

static void
rkf45_free (void *vstate)
{
  rkf45_state_t *state = (rkf45_state_t *) vstate;

  free (state->ytmp);
  free (state->y0);
  free (state->k6);
  free (state->k5);
  free (state->k4);
  free (state->k3);
  free (state->k2);
  free (state->k1);
  free (state);
}

static const gsl_odeiv_step_type rkf45_type = {
  "rkf45",                      /* name */
  1,                            /* can use dydt_in */
  0,                            /* gives exact dydt_out */
  &rkf45_alloc,
  &rkf45_apply,
  &rkf45_reset,
  &rkf45_order,
  &rkf45_free
};

const gsl_odeiv_step_type *gsl_odeiv_step_rkf45 = &rkf45_type;
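A note for anyone trying the attachment: the only changes relative to
the stock GSL file are the #pragma omp parallel for annotations on the
per-component loops. These pragmas are inert unless OpenMP is enabled
when GSL is compiled; with GCC that means adding -fopenmp, along the
lines of (a sketch, other compilers use a different flag):

    CFLAGS="-g -O2 -fopenmp" ./configure && make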
* Re: Adding OpenMP support for some of the GSL functions
From: Rhys Ulerich @ 2012-12-12 21:41 UTC
To: Maxime Boissonneault; +Cc: gsl-discuss

> The most expensive function in my case is rkf45_apply. I simply added
> a few pragmas to its loops, and that sped it up quite a lot.

Having looked at where you've placed the #pragma omp parallels, have
you tried enabling vectorization to see if the time spent in those
axpy-like operations could be improved? A good SSE-ready optimizer
should nail those.

I may have misunderstood your "millions of differential equations"
statement. Are you rather solving one problem with a million degrees
of freedom?

- Rhys
* Re: Adding OpenMP support for some of the GSL functions
From: Maxime Boissonneault @ 2012-12-12 23:32 UTC
To: Rhys Ulerich; +Cc: gsl-discuss

Hi Rhys,

I will have a deeper look at vectorization of GSL, but in my
understanding, vectorizing can only be done with simple operations,
while algorithms like RKF45 involve about 10 operations per loop
iteration.

You are correct, I meant one problem with a million degrees of freedom.

Maxime

On 2012-12-12 16:41, Rhys Ulerich wrote:
>> The most expensive function in my case is rkf45_apply. I simply added
>> a few pragmas to its loops, and that sped it up quite a lot.
> Having looked at where you've placed the #pragma omp parallels, have
> you tried enabling vectorization to see if the time spent in those
> axpy-like operations could be improved? A good SSE-ready optimizer
> should nail those.
>
> I may have misunderstood your "millions of differential equations"
> statement. Are you rather solving one problem with a million degrees
> of freedom?
>
> - Rhys
* Re: Adding OpenMP support for some of the GSL functions
From: Rhys Ulerich @ 2012-12-12 17:05 UTC
To: Maxime Boissonneault; +Cc: gsl-discuss

> I am using GSL from another library of my own to perform numerical
> integration of vectorial differential equations. After optimizing and
> parallelizing most of my library, I came to the conclusion that GSL
> is a major bottleneck in my computation, simply because it is not
> parallelized to exploit multi-core architectures.

Have you tried solving many such problems in parallel using
threadprivate GSL workspaces? This permits you to put a parallel
section outside of many GSL invocations by ensuring GSL's working
storage is kept local to each thread.

- Rhys
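A minimal sketch of the pattern Rhys describes, assuming many
independent ODE systems (the n_problems count and the sys[] and y[]
arrays are hypothetical): each thread allocates its own stepper, so no
GSL state is shared and stock, unmodified GSL can be used.

    #include <stdlib.h>
    #include <gsl/gsl_errno.h>
    #include <gsl/gsl_odeiv.h>

    /* Integrate n_problems independent systems in parallel.  Each
       thread owns a private stepper and error vector; only the driver
       loop over whole problems is parallel. */
    void
    integrate_all (const gsl_odeiv_system *sys, double **y,
                   size_t n_problems, size_t dim,
                   double t0, double t1, double h)
    {
    #pragma omp parallel
      {
        /* Workspace local to this thread. */
        gsl_odeiv_step *step =
          gsl_odeiv_step_alloc (gsl_odeiv_step_rkf45, dim);
        double *yerr = (double *) malloc (dim * sizeof (double));
        size_t p;

    #pragma omp for
        for (p = 0; p < n_problems; p++)
          {
            double t = t0;
            int status = GSL_SUCCESS;

            while (t < t1 && status == GSL_SUCCESS)
              {
                status = gsl_odeiv_step_apply (step, t, h, y[p], yerr,
                                               NULL, NULL, &sys[p]);
                t += h;
              }
          }

        free (yerr);
        gsl_odeiv_step_free (step);
      }
    }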
* Re: Adding OpenMP support for some of the GSL functions
From: Maxime Boissonneault @ 2012-12-12 21:13 UTC
To: Rhys Ulerich; +Cc: gsl-discuss

Hi Rhys,

I have not, since it is much simpler to add 6 pragmas to the loops of
rkf45_apply to achieve the same goal.

Maxime

On 2012-12-12 12:05, Rhys Ulerich wrote:
>> I am using GSL from another library of my own to perform numerical
>> integration of vectorial differential equations. After optimizing and
>> parallelizing most of my library, I came to the conclusion that GSL
>> is a major bottleneck in my computation, simply because it is not
>> parallelized to exploit multi-core architectures.
> Have you tried solving many such problems in parallel using
> threadprivate GSL workspaces? This permits you to put a parallel
> section outside of many GSL invocations by ensuring GSL's working
> storage is kept local to each thread.
>
> - Rhys
* Re: Adding OpenMP support for some of the GSL functions
From: Rhys Ulerich @ 2012-12-12 21:36 UTC
To: Maxime Boissonneault; +Cc: gsl-discuss

> I have not, since it is much simpler to add 6 pragmas to the loops of
> rkf45_apply to achieve the same goal.

I suspect, from glancing at the first example at
http://www.gnu.org/software/gsl/manual/html_node/ODE-Example-programs.html,
you could achieve parallelization atop stock GSL builds with fewer
than 6 pragmas.

I also suspect that by using OpenMP tasks at the coarser granularity
(one ODE system per task) instead of parallel-fors at the finer
granularity (one loop within each step) you may see better overall
performance.

- Rhys
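A rough sketch of the task layout Rhys has in mind, under the
assumption that the problem decomposes into independent systems (the
n_problems count and the integrate_one_system driver are hypothetical):

    /* One task per ODE system: coarse-grained work units keep every
       thread busy even when systems need different numbers of steps. */
    #pragma omp parallel
    #pragma omp single
    {
      size_t p;
      for (p = 0; p < n_problems; p++)
        {
    #pragma omp task firstprivate(p)
          integrate_one_system (p);  /* hypothetical per-system driver */
        }
    }  /* implicit barrier: all tasks finish before the region ends */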
* Re: Adding OpenMP support for some of the GSL functions
From: Maxime Boissonneault @ 2012-12-12 23:35 UTC
To: Rhys Ulerich; +Cc: gsl-discuss

Hi Rhys,

I am not sure if you mean parallelizing the for loop in this code, but
that would not work, as each iteration is not independent of the next
one. Differential equation solving is intrinsically serial.

Nor can I simply split the problem into smaller chunks (let's say 100k
instead of 1M), because they are coupled, i.e. func can only be
defined for the whole system, not for some of its parts.

Thanks,
Maxime

On 2012-12-12 16:36, Rhys Ulerich wrote:
>> I have not, since it is much simpler to add 6 pragmas to the loops of
>> rkf45_apply to achieve the same goal.
> I suspect, from glancing at the first example at
> http://www.gnu.org/software/gsl/manual/html_node/ODE-Example-programs.html,
> you could achieve parallelization atop stock GSL builds with fewer
> than 6 pragmas.
>
> I also suspect that by using OpenMP tasks at the coarser granularity
> (one ODE system per task) instead of parallel-fors at the finer
> granularity (one loop within each step) you may see better overall
> performance.
>
> - Rhys
* Re: Adding OpenMP support for some of the GSL functions
From: Rhys Ulerich @ 2012-12-13 2:29 UTC
To: Maxime Boissonneault; +Cc: gsl-discuss

> I am not sure if you mean parallelizing the for loop in this code, but
> that would not work, as each iteration is not independent of the next
> one. Differential equation solving is intrinsically serial.

True, solving one differential equation is serial. I'd thought you
were performing a parameter sweep.

- Rhys
* Re: Adding OpenMP support for some of the GSL functions
From: Maxime Boissonneault @ 2012-12-13 13:22 UTC
To: Rhys Ulerich; +Cc: gsl-discuss

I am doing that too, but any gain we can get is an important one, and
it turns out that by parallelizing rkf45_apply, my simulation runs 30%
faster on 8 cores.

Maxime

On 2012-12-12 21:28, Rhys Ulerich wrote:
>> I am not sure if you mean parallelizing the for loop in this code,
>> but that would not work, as each iteration is not independent of the
>> next one. Differential equation solving is intrinsically serial.
> True, solving one differential equation is serial. I'd thought you
> were performing a parameter sweep.
>
> - Rhys
* Re: Adding OpenMP support for some of the GSL functions
From: Rhys Ulerich @ 2012-12-13 15:53 UTC
To: Maxime Boissonneault; +Cc: gsl-discuss

> I am doing that too, but any gain we can get is an important one, and
> it turns out that by parallelizing rkf45_apply, my simulation runs 30%
> faster on 8 cores.

That's a parallel efficiency of about 18% (= 1 time unit / (8 cores *
0.70 time units)). This feels like you're getting a small memory/cache
bandwidth increase for the rkf45_apply level-1-BLAS-like operations by
using multiple cores, but the cores are otherwise not being used
effectively. I say this because a state vector 1e6 doubles long will
not generally fit in cache. Adding more cores increases the amount of
cache available.

> I will have a deeper look at vectorization of GSL, but in my
> understanding, vectorizing can only be done with simple operations,
> while algorithms like RKF45 involve about 10 operations per loop
> iteration.

The compilers are generally very good. Intel's icc 11.1 has to be told
that the last four loops you annotated are vectorizable. GCC nails it
out of the box.

On GCC 4.4.3, configuring with something like

    CFLAGS="-g -O2 -march=native -mtune=native -ftree-vectorizer-verbose=2 -ftree-vectorize" ../gsl/configure && make

shows every one of those 6 loops vectorizing. You can check this by
configuring with those options, running make and waiting for the build
to finish, and then cd-ing into ode-initval2 and running

    rm rkf45*o && make

and observing all those beautiful

    LOOP VECTORIZED

messages. Better yet, with those options, 'make check' passes for me
on the 'ode-initval2' subdirectory.

Try ripping out your OpenMP pragmas in GSL, building against stock GSL
with vectorization as I suggested, and then seeing how fast your code
runs with GSL vectorized on 1 core versus GSL's unvectorized
rkf45_apply parallelized over 8 cores. I suspect it will be comparable.

- Rhys
* Re: Adding OpenMP support for some of the GSL functions
From: Rhys Ulerich @ 2012-12-13 16:44 UTC
To: Maxime Boissonneault; +Cc: gsl-discuss

> This feels like you're getting a small memory/cache bandwidth increase
> for the rkf45_apply level-1-BLAS-like operations by using multiple
> cores, but the cores are otherwise not being used effectively. I say
> this because a state vector 1e6 doubles long will not generally fit in
> cache. Adding more cores increases the amount of cache available.

Hmm... I tentatively take this back on re-thinking how you've added
the #pragma omp lines to the rkf45.c file you attached elsewhere in
this thread. Try using a single

    #pragma omp parallel

and then individual lines like

    #pragma omp for

at each for loop. Using

    #pragma omp parallel for

repeatedly as you've done can, depending on your compiler, incur
unnecessary fork/join overhead at every one of those loops.

- Rhys
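A sketch of the restructuring Rhys proposes, shown for the first two
loops of rkf45_apply only; the early returns in the r.h.s. error
handling are the complication Maxime raises in his reply below:

    /* Sketch: one thread team reused across work-sharing loops.  The
       GSL_ODEIV_FN_EVAL calls between the loops return on error,
       which is not allowed inside a parallel region -- see the
       follow-up below. */
    #pragma omp parallel shared(y, h, dim) private(i)
      {
    #pragma omp for
        for (i = 0; i < dim; i++)
          ytmp[i] = y[i] + ah[0] * h * k1[i];

        /* ... the k2 r.h.s. evaluation would have to run in a
           #pragma omp single block here, with its error status
           propagated through a shared flag instead of a return ... */

    #pragma omp for
        for (i = 0; i < dim; i++)
          ytmp[i] = y[i] + h * (b3[0] * k1[i] + b3[1] * k2[i]);
      }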
* Re: Adding OpenMP support for some of the GSL functions
From: Maxime Boissonneault @ 2012-12-13 21:07 UTC
To: Rhys Ulerich; +Cc: gsl-discuss

Hi Rhys,

While that is true in theory, it is not applicable in practice, since
there can be no "return" within a parallel region. We need one
parallel region for each loop in this case.

Maxime

On 2012-12-13 11:44, Rhys Ulerich wrote:
>> This feels like you're getting a small memory/cache bandwidth
>> increase for the rkf45_apply level-1-BLAS-like operations by using
>> multiple cores, but the cores are otherwise not being used
>> effectively. I say this because a state vector 1e6 doubles long will
>> not generally fit in cache. Adding more cores increases the amount
>> of cache available.
> Hmm... I tentatively take this back on re-thinking how you've added
> the #pragma omp lines to the rkf45.c file you attached elsewhere in
> this thread. Try using a single
>     #pragma omp parallel
> and then individual lines like
>     #pragma omp for
> at each for loop. Using
>     #pragma omp parallel for
> repeatedly as you've done can, depending on your compiler, incur
> unnecessary fork/join overhead at every one of those loops.
>
> - Rhys
* Re: Adding OpenMP support for some of the GSL functions
From: Maxime Boissonneault @ 2012-12-13 21:05 UTC
To: Rhys Ulerich; +Cc: gsl-discuss

Hi Rhys,

I did the comparison you requested. These results were all obtained
with gcc 4.7.2, on 2 quad-core CPUs, for a total of 8 threads. The
total times for the runs were:

- Without vectorization or OpenMP : 1m21s
- With vectorization              : 1m14s
- With OpenMP                     : 1m01s

Here is load-balance information obtained with OpenSpeedshop. Times
are exclusive seconds; Max, Min and Average are taken across POSIX
thread IDs.

Without vectorization or OpenMP:

  Max(s)     ThreadId of Max  Min(s)     ThreadId of Min  Average(s) Function (defining location)
  28.085714  140384673677232  28.085714  140384673677232  28.085714  rkf45_apply (libgsl.so.0.16.0)

With OpenMP:

  Max(s)     ThreadId of Max  Min(s)     ThreadId of Min  Average(s) Function (defining location)
  [...]
  2.800000   1146448192       1.228571   1088166208       1.778571   rkf45_apply._omp_fn.4 (libgsl.so.0.16.0)
  2.171429   1146448192       0.971429   1129662784       1.460714   rkf45_apply._omp_fn.6 (libgsl.so.0.16.0)
  2.142857   1146448192       1.400000   140073673240496  1.671429   rkf45_apply._omp_fn.3 (libgsl.so.0.16.0)
  2.085714   1146448192       1.257143   1112877376       1.725000   rkf45_apply._omp_fn.5 (libgsl.so.0.16.0)
  2.085714   1146448192       1.171429   1088166208       1.578571   rkf45_apply._omp_fn.2 (libgsl.so.0.16.0)
  1.600000   1146448192       0.885714   140073673240496  1.139286   rkf45_apply._omp_fn.1 (libgsl.so.0.16.0: rkf45.c,233)
  0.714286   1146448192       0.457143   1129662784       0.557143   rkf45_apply._omp_fn.0 (libgsl.so.0.16.0)
  [...]

The 7 loops make a total of about 12s.

With vectorization:

  Max(s)     ThreadId of Max  Min(s)     ThreadId of Min  Average(s) Function (defining location)
  24.542857  140440801677232  24.542857  140440801677232  24.542857  rkf45_apply (libgsl.so.0.16.0)

To summarize:
- Vectorization makes a difference, but it is minor (gained 4s out of
  the 28s used by the normal rkf45_apply).
- OpenMP is the clear winner (gained 20s out of 28s).

Maxime

On 2012-12-13 10:51, Rhys Ulerich wrote:
>> I am doing that too, but any gain we can get is an important one, and
>> it turns out that by parallelizing rkf45_apply, my simulation runs
>> 30% faster on 8 cores.
> That's a parallel efficiency of about 18% (= 1 time unit / (8 cores *
> 0.70 time units)). This feels like you're getting a small memory/cache
> bandwidth increase for the rkf45_apply level-1-BLAS-like operations by
> using multiple cores, but the cores are otherwise not being used
> effectively. I say this because a state vector 1e6 doubles long will
> not generally fit in cache. Adding more cores increases the amount of
> cache available.
>
>> I will have a deeper look at vectorization of GSL, but in my
>> understanding, vectorizing can only be done with simple operations,
>> while algorithms like RKF45 involve about 10 operations per loop
>> iteration.
> The compilers are generally very good. Intel's icc 11.1 has to be told
> that the last four loops you annotated are vectorizable. GCC nails it
> out of the box.
>
> On GCC 4.4.3, configuring with something like
>
>     CFLAGS="-g -O2 -march=native -mtune=native -ftree-vectorizer-verbose=2 -ftree-vectorize" ../gsl/configure && make
>
> shows every one of those 6 loops vectorizing. [...] Better yet, with
> those options, 'make check' passes for me on the 'ode-initval2'
> subdirectory.
>
> Try ripping out your OpenMP pragmas in GSL, building against stock GSL
> with vectorization as I suggested, and then seeing how fast your code
> runs with GSL vectorized on 1 core versus GSL's unvectorized
> rkf45_apply parallelized over 8 cores. I suspect it will be comparable.
>
> - Rhys
* Re: Adding OpenMP support for some of the GSL functions
From: Maxime Boissonneault @ 2012-12-13 21:14 UTC
To: Rhys Ulerich; +Cc: gsl-discuss

Hi again,

Since I noticed that the OpenMP version did not have a balanced
workload amongst threads (which may have to do with data locality,
memory affinity, etc.), I added the clause "schedule(runtime)" to each
of my pragmas. At runtime, defining the environment variable
OMP_SCHEDULE="guided,1", I got the slightly better runtime of 57s with
the OpenMP version.

Maxime

On 2012-12-13 16:05, Maxime Boissonneault wrote:
> Hi Rhys,
>
> I did the comparison you requested. These results were all obtained
> with gcc 4.7.2, on 2 quad-core CPUs, for a total of 8 threads. The
> total times for the runs were:
>
> - Without vectorization or OpenMP : 1m21s
> - With vectorization              : 1m14s
> - With OpenMP                     : 1m01s
>
> [...]
>
> To summarize:
> - Vectorization makes a difference, but it is minor (gained 4s out of
>   the 28s used by the normal rkf45_apply).
> - OpenMP is the clear winner (gained 20s out of 28s).
>
> Maxime

--
---------------------------------
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique
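Concretely, the change Maxime describes turns each pragma in the
attached rkf45.c into the following (shown for the first loop):

    /* schedule(runtime) defers the scheduling choice to the
       OMP_SCHEDULE environment variable, e.g. OMP_SCHEDULE="guided,1". */
    #pragma omp parallel for shared(y,h,dim) private(i) default(none) schedule(runtime)
      for (i = 0; i < dim; i++)
        ytmp[i] = y[i] + ah[0] * h * k1[i];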