From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <50CA4313.30708@calculquebec.ca>
Date: Thu, 13 Dec 2012 21:05:00 -0000
From: Maxime Boissonneault
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:17.0) Gecko/17.0 Thunderbird/17.0
MIME-Version: 1.0
To: Rhys Ulerich
CC: gsl-discuss@sourceware.org
Subject: Re: Adding OpenMP support for some of the GSL functions
References: <50C791BB.4060303@calculquebec.ca> <50C8F364.2010109@calculquebec.ca> <50C914C4.2060102@calculquebec.ca> <50C9D68A.1050903@calculquebec.ca>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Mailing-List: contact gsl-discuss-help@sourceware.org; run by ezmlm
Sender: gsl-discuss-owner@sourceware.org
X-SW-Source: 2012-q4/txt/msg00013.txt.bz2

Hi Rhys,

I did the comparison you requested. Here is the load balancing
information for my test run:

These were all obtained with gcc 4.7.2. The total times for the runs were:
- Without vectorization or OpenMP: 1m21s
- With vectorization: 1m14s
- With OpenMP: 1m01s

Here is load-balance information obtained with OpenSpeedshop.

Without vectorization or OpenMP:

Max           Posix ThreadId   Min           Posix ThreadId   Average       Function (defining location)
Exclusive     of Max           Exclusive     of Min           Exclusive
Time Across                    Time Across                    Time Across
Posix                          Posix                          Posix
ThreadIds(s)                   ThreadIds(s)                   ThreadIds(s)
28.085714     140384673677232  28.085714     140384673677232  28.085714     rkf45_apply (libgsl.so.0.16.0)

With OpenMP:

Max           Posix ThreadId   Min           Posix ThreadId   Average       Function (defining location)
Exclusive     of Max           Exclusive     of Min           Exclusive
Time Across                    Time Across                    Time Across
Posix                          Posix                          Posix
ThreadIds(s)                   ThreadIds(s)                   ThreadIds(s)
[...]
2.800000      1146448192       1.228571      1088166208       1.778571      rkf45_apply._omp_fn.4 (libgsl.so.0.16.0)
2.171429      1146448192       0.971429      1129662784       1.460714      rkf45_apply._omp_fn.6 (libgsl.so.0.16.0)
2.142857      1146448192       1.400000      140073673240496  1.671429      rkf45_apply._omp_fn.3 (libgsl.so.0.16.0)
2.085714      1146448192       1.257143      1112877376       1.725000      rkf45_apply._omp_fn.5 (libgsl.so.0.16.0)
2.085714      1146448192       1.171429      1088166208       1.578571      rkf45_apply._omp_fn.2 (libgsl.so.0.16.0)
1.600000      1146448192       0.885714      140073673240496  1.139286      rkf45_apply._omp_fn.1 (libgsl.so.0.16.0: rkf45.c,233)
0.714286      1146448192       0.457143      1129662784       0.557143      rkf45_apply._omp_fn.0 (libgsl.so.0.16.0)
[...]

The 7 loops make a total of about 12s.

With vectorization:

Max           Posix ThreadId   Min           Posix ThreadId   Average       Function (defining location)
Exclusive     of Max           Exclusive     of Min           Exclusive
Time Across                    Time Across                    Time Across
Posix                          Posix                          Posix
ThreadIds(s)                   ThreadIds(s)                   ThreadIds(s)
24.542857     140440801677232  24.542857     140440801677232  24.542857     rkf45_apply (libgsl.so.0.16.0)

This was obtained with gcc 4.7.2, with 2 quad-core CPUs, for a total of
8 threads.

To summarize:
- Vectorization makes a difference, but it is minor (gained 4s out of
the 28s used by the normal rkf45_apply).
- OpenMP is the clear winner (gained 20s out of 28s).

Maxime

On 2012-12-13 10:51, Rhys Ulerich wrote:
>> I am doing that too, but any gain we can get is an important one, and it
>> turns out that by parallelizing rkf45_apply, my simulation runs 30% faster
>> on 8 cores.
> That's a parallel efficiency of about 18% (= 1 time unit / (8 cores *
> 0.70 time units)). This feels like you're getting a small
> memory/cache bandwidth increase for the rkf45_apply level-1-BLAS-like
> operations by using multiple cores but the cores are otherwise not
> being used effectively. I say this because a state vector 1e6 doubles
> long will not generally fit in cache. Adding more cores increases the
> amount of cache available.
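
For reference, the efficiency figure quoted above can be reproduced with a
few lines of C. This is only a sketch of the usual definition (speedup
divided by core count); the function names speedup and parallel_efficiency
are made up for illustration and are not part of GSL.

```c
#include <assert.h>

/* Speedup: serial run time divided by parallel run time. */
double speedup(double t_serial, double t_parallel)
{
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Parallel efficiency: speedup divided by the number of cores used.
 * A value of 1.0 would mean perfect linear scaling. */
double parallel_efficiency(double t_serial, double t_parallel, int cores)
{
    assert(cores > 0);
    return speedup(t_serial, t_parallel) / (double) cores;
}
```

Calling parallel_efficiency(1.0, 0.70, 8) gives roughly 0.18, which is the
1 / (8 * 0.70) quoted above. Plugging in the wall-clock totals from this
message (81 s without OpenMP, 61 s with OpenMP, 8 threads) gives about 0.17.
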
>
>> I will have a deeper look at vectorization of GSL, but in my understanding,
>> vectorizing can only be done with simple operations, while algorithms like
>> RKF45 involve about 10 operations per loop iteration.
> The compilers are generally very good. Intel's icc 11.1 has to be
> told that the last four loops you annotated are vectorizable. GCC
> nails it out of the box.
>
> On GCC 4.4.3, building with something like
>     CFLAGS="-g -O2 -march=native -mtune=native -ftree-vectorizer-verbose=2 -ftree-vectorize" ../gsl/configure && make
> shows every one of those 6 loops vectorizing. You can check this by
> configuring with those options, running make and waiting for the build
> to finish, and then cd-ing into ode-initval2 and running
>     rm rkf45*o && make
> and observing all those beautiful
>     LOOP VECTORIZED
> messages. Better yet, with those options, 'make check' passes for me
> on the 'ode-initval2' subdirectory.
>
> Try ripping out your OpenMP pragmas in GSL, building with
> vectorization against stock GSL as I suggested, and then seeing how
> fast your code runs with GSL vectorized on 1 core versus GSL's
> unvectorized rkf45_apply parallelized over 8 cores. I suspect it will
> be comparable.
>
> - Rhys
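
To make the comparison concrete, here is a minimal sketch of the kind of
level-1-BLAS-like loop being discussed. It is not the actual code from
rkf45.c: the function name stage_update and the coefficients b1 and b2 are
hypothetical, and the real rkf45_apply loops combine more terms. The same
loop can be built either with the vectorizer alone (the pragma is ignored
when OpenMP is not enabled) or with gcc's -fopenmp to spread iterations
over threads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stage update: y_out[i] = y[i] + h * (b1*k1[i] + b2*k2[i]).
 * Each iteration is independent, so the loop is a candidate both for
 * auto-vectorization and for an OpenMP parallel-for. */
void stage_update(size_t n, double h, double b1, double b2,
                  const double *restrict y,
                  const double *restrict k1,
                  const double *restrict k2,
                  double *restrict y_out)
{
    long i;

    assert(y != NULL && k1 != NULL && k2 != NULL && y_out != NULL);

    /* Ignored unless the compiler is invoked with OpenMP enabled. */
    #pragma omp parallel for
    for (i = 0; i < (long) n; i++) {
        y_out[i] = y[i] + h * (b1 * k1[i] + b2 * k2[i]);
    }
}
```

The restrict qualifiers tell the compiler the arrays do not overlap, which
should make the loop easier to vectorize. With a state vector of around 1e6
doubles, a loop like this is likely limited by memory bandwidth rather than
arithmetic, which is the point Rhys raises above.
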