From: Maxime Boissonneault
Date: Thu, 13 Dec 2012 21:14:00 -0000
To: Rhys Ulerich
CC: gsl-discuss@sourceware.org
Subject: Re: Adding OpenMP support for some of the GSL functions
Message-ID: <50CA452B.6070601@calculquebec.ca>

Hi again,

Since I noticed that the OpenMP version did not have a balanced
workload among threads (which may have to do with data locality,
memory affinity, etc.), I added the clause "schedule(runtime)" to each
of my pragmas. At runtime, with the environment variable
OMP_SCHEDULE="guided,1", I got a slightly better runtime of 57s with
the OpenMP version.

Maxime

On 2012-12-13 16:05, Maxime Boissonneault wrote:
> Hi Rhys,
> I did the comparison you requested. Here is the load-balancing
> information for my test run. These were all obtained with gcc 4.7.2.
> The total times for the runs were:
> - Without vectorization or OpenMP: 1m21s
> - With vectorization: 1m14s
> - With OpenMP: 1m01s
>
> Here is the load-balance information obtained with OpenSpeedShop.
>
> Without vectorization or OpenMP:
>
>   Max Excl.    ThreadId of      Min Excl.    ThreadId of      Avg Excl.   Function
>   Time (s)     Max              Time (s)     Min              Time (s)    (defining location)
>
>   28.085714    140384673677232  28.085714    140384673677232  28.085714   rkf45_apply (libgsl.so.0.16.0)
>
> With OpenMP (same columns):
>
>   [...]
>   2.800000     1146448192       1.228571     1088166208       1.778571    rkf45_apply._omp_fn.4 (libgsl.so.0.16.0)
>   2.171429     1146448192       0.971429     1129662784       1.460714    rkf45_apply._omp_fn.6 (libgsl.so.0.16.0)
>   2.142857     1146448192       1.400000     140073673240496  1.671429    rkf45_apply._omp_fn.3 (libgsl.so.0.16.0)
>   2.085714     1146448192       1.257143     1112877376       1.725000    rkf45_apply._omp_fn.5 (libgsl.so.0.16.0)
>   2.085714     1146448192       1.171429     1088166208       1.578571    rkf45_apply._omp_fn.2 (libgsl.so.0.16.0)
>   1.600000     1146448192       0.885714     140073673240496  1.139286    rkf45_apply._omp_fn.1 (libgsl.so.0.16.0: rkf45.c,233)
>   0.714286     1146448192       0.457143     1129662784       0.557143    rkf45_apply._omp_fn.0 (libgsl.so.0.16.0)
>   [...]
>
> The 7 parallelized loops account for a total of about 12s.
>
> With vectorization (same columns):
>
>   24.542857    140440801677232  24.542857    140440801677232  24.542857   rkf45_apply (libgsl.so.0.16.0)
>
> This was obtained with gcc 4.7.2, on 2 quad-core CPUs, for a total
> of 8 threads. To summarize:
> - Vectorization makes a difference, but it is minor (gained 4s out
>   of the 28s used by the normal rkf45_apply).
> - OpenMP is the clear winner (gained 20s out of 28s).
>
>
> Maxime
>
>
> On 2012-12-13 10:51, Rhys Ulerich wrote:
>>> I am doing that too, but any gain we can get is an important one,
>>> and it turns out that by parallelizing rkf45_apply, my simulation
>>> runs 30% faster on 8 cores.
>>
>> That's a parallel efficiency of about 18% ( = 1 time unit / (8 cores
>> * 0.70 time units)). This feels like you're getting a small
>> memory/cache bandwidth increase for the rkf45_apply level-1-BLAS-like
>> operations by using multiple cores, but the cores are otherwise not
>> being used effectively. I say this because a state vector of 1e6
>> doubles will not generally fit in cache, and adding more cores
>> increases the amount of cache available.
>>
>>> I will have a deeper look at vectorization of GSL, but in my
>>> understanding, vectorizing can only be done with simple operations,
>>> while algorithms like RKF45 involve about 10 operations per loop
>>> iteration.
>>
>> The compilers are generally very good. Intel's icc 11.1 has to be
>> told that the last four loops you annotated are vectorizable; GCC
>> nails it out of the box.
>>
>> On GCC 4.4.3, configuring with something like
>>     CFLAGS="-g -O2 -march=native -mtune=native
>>             -ftree-vectorizer-verbose=2 -ftree-vectorize"
>>             ../gsl/configure && make
>> shows every one of those 6 loops vectorizing. You can check this by
>> configuring with those options, running make and waiting for the
>> build to finish, and then cd-ing into ode-initval2 and running
>>     rm rkf45*o && make
>> and observing all those beautiful
>>     LOOP VECTORIZED
>> messages. Better yet, with those options, 'make check' passes for me
>> on the 'ode-initval2' subdirectory.
>>
>> Try ripping out your OpenMP pragmas in GSL, building stock GSL with
>> vectorization as I suggested, and then seeing how fast your code runs
>> with GSL vectorized on 1 core versus GSL's unvectorized rkf45_apply
>> parallelized over 8 cores. I suspect it will be comparable.
>>
>> - Rhys
>

-- 
---------------------------------
Maxime Boissonneault
Computing Analyst - Calcul Québec, Université Laval
Ph.D. in Physics