From: Maxime Boissonneault
Date: Thu, 13 Dec 2012 21:14:00 -0000
To: Rhys Ulerich
CC: gsl-discuss@sourceware.org
Subject: Re: Adding OpenMP support for some of the GSL functions
Message-ID: <50CA452B.6070601@calculquebec.ca>

Hi again,

Since I noticed that the OpenMP version did not have a balanced
workload among threads (which may have to do with data locality,
memory affinity, etc.), I added the clause "schedule(runtime)" to each
of my pragmas. At runtime, with the environment variable
OMP_SCHEDULE="guided,1", I got a slightly better runtime of 57s with
the OpenMP version.

Maxime

On 2012-12-13 16:05, Maxime Boissonneault wrote:
> Hi Rhys,
> I did the comparison you requested. Here is the load-balancing
> information for my test run. These were all obtained with gcc 4.7.2.
> The total times for the runs were:
> - Without vectorization or OpenMP: 1m21s
> - With vectorization: 1m14s
> - With OpenMP: 1m01s
>
> Here is the load-balance information obtained with OpenSpeedShop.
>
> Without vectorization or OpenMP:
>
>   Max Excl.    ThreadId of      Min Excl.    ThreadId of      Avg Excl.   Function
>   Time (s)     Max              Time (s)     Min              Time (s)    (defining location)
>
>   28.085714    140384673677232  28.085714    140384673677232  28.085714   rkf45_apply (libgsl.so.0.16.0)
>
> With OpenMP (same columns):
>
>   [...]
>   2.800000     1146448192       1.228571     1088166208       1.778571    rkf45_apply._omp_fn.4 (libgsl.so.0.16.0)
>   2.171429     1146448192       0.971429     1129662784       1.460714    rkf45_apply._omp_fn.6 (libgsl.so.0.16.0)
>   2.142857     1146448192       1.400000     140073673240496  1.671429    rkf45_apply._omp_fn.3 (libgsl.so.0.16.0)
>   2.085714     1146448192       1.257143     1112877376       1.725000    rkf45_apply._omp_fn.5 (libgsl.so.0.16.0)
>   2.085714     1146448192       1.171429     1088166208       1.578571    rkf45_apply._omp_fn.2 (libgsl.so.0.16.0)
>   1.600000     1146448192       0.885714     140073673240496  1.139286    rkf45_apply._omp_fn.1 (libgsl.so.0.16.0: rkf45.c,233)
>   0.714286     1146448192       0.457143     1129662784       0.557143    rkf45_apply._omp_fn.0 (libgsl.so.0.16.0)
>   [...]
>
> The 7 parallelized loops account for a total of about 12s.
>
> With vectorization (same columns):
>
>   24.542857    140440801677232  24.542857    140440801677232  24.542857   rkf45_apply (libgsl.so.0.16.0)
>
> This was obtained with gcc 4.7.2, on 2 quad-core CPUs, for a total
> of 8 threads. To summarize:
> - Vectorization makes a difference, but it is minor (gained 4s out
>   of the 28s used by the normal rkf45_apply).
> - OpenMP is the clear winner (gained 20s out of 28s).
>
>
> Maxime
>
>
> On 2012-12-13 10:51, Rhys Ulerich wrote:
>>> I am doing that too, but any gain we can get is an important one,
>>> and it turns out that by parallelizing rkf45_apply, my simulation
>>> runs 30% faster on 8 cores.
>>
>> That's a parallel efficiency of about 18% ( = 1 time unit / (8 cores
>> * 0.70 time units)). This feels like you're getting a small
>> memory/cache bandwidth increase for the rkf45_apply level-1-BLAS-like
>> operations by using multiple cores, but the cores are otherwise not
>> being used effectively. I say this because a state vector of 1e6
>> doubles will not generally fit in cache, and adding more cores
>> increases the amount of cache available.
>>
>>> I will have a deeper look at vectorization of GSL, but in my
>>> understanding, vectorizing can only be done with simple operations,
>>> while algorithms like RKF45 involve about 10 operations per loop
>>> iteration.
>>
>> The compilers are generally very good. Intel's icc 11.1 has to be
>> told that the last four loops you annotated are vectorizable; GCC
>> nails it out of the box.
>>
>> On GCC 4.4.3, configuring with something like
>>     CFLAGS="-g -O2 -march=native -mtune=native
>>             -ftree-vectorizer-verbose=2 -ftree-vectorize"
>>             ../gsl/configure && make
>> shows every one of those 6 loops vectorizing. You can check this by
>> configuring with those options, running make and waiting for the
>> build to finish, and then cd-ing into ode-initval2 and running
>>     rm rkf45*o && make
>> and observing all those beautiful
>>     LOOP VECTORIZED
>> messages. Better yet, with those options, 'make check' passes for me
>> on the 'ode-initval2' subdirectory.
>>
>> Try ripping out your OpenMP pragmas in GSL, building stock GSL with
>> vectorization as I suggested, and then seeing how fast your code runs
>> with GSL vectorized on 1 core versus GSL's unvectorized rkf45_apply
>> parallelized over 8 cores. I suspect it will be comparable.
>>
>> - Rhys
>

-- 
---------------------------------
Maxime Boissonneault
Computing Analyst - Calcul Québec, Université Laval
Ph.D. in Physics