From: Rhys Ulerich
Date: Thu, 13 Dec 2012 15:53:00 -0000
Subject: Re: Adding OpenMP support for some of the GSL functions
To: Maxime Boissonneault
Cc: gsl-discuss@sourceware.org
In-Reply-To: <50C9D68A.1050903@calculquebec.ca>
References: <50C791BB.4060303@calculquebec.ca> <50C8F364.2010109@calculquebec.ca> <50C914C4.2060102@calculquebec.ca> <50C9D68A.1050903@calculquebec.ca>

> I am doing that too, but any gain we can get is an important one, and it
> turns out that by parallelizing rkf45_apply, my simulation runs 30% faster
> on 8 cores.

That's a parallel efficiency of only about 18% (= 1 time unit / (8 cores * 0.70 time units)).
This feels like you're getting a small memory/cache bandwidth boost for the rkf45_apply level-1-BLAS-like operations by using multiple cores, but the cores are otherwise not being used effectively. I say this because a state vector 1e6 doubles long will not generally fit in cache, and adding more cores increases the total cache available.

> I will have a deeper look at vectorization of GSL, but in my understanding,
> vectorizing can only be done with simple operations, while algorithms like
> RKF45 involve about 10 operations per loop iterations.

The compilers are generally very good at this. Intel's icc 11.1 has to be told that the last four loops you annotated are vectorizable, but GCC nails it out of the box. On GCC 4.4.3, configuring with something like

    CFLAGS="-g -O2 -march=native -mtune=native -ftree-vectorizer-verbose=2 -ftree-vectorize" ../gsl/configure && make

shows every one of those 6 loops vectorizing. You can check this yourself: configure with those options, run make and wait for the build to finish, then cd into ode-initval2 and run

    rm rkf45*o && make

and watch all those beautiful LOOP VECTORIZED messages. Better yet, with those options, 'make check' passes for me in the ode-initval2 subdirectory.

Try ripping your OpenMP pragmas out of GSL, building stock GSL with vectorization as I suggested, and then comparing how fast your code runs with vectorized GSL on 1 core versus GSL's unvectorized rkf45_apply parallelized over 8 cores. I suspect the two will be comparable.

- Rhys