From: Vikram Singh
Date: Wed, 03 Aug 2016 15:53:00 -0000
Subject: Re: OpenACC-Library-Interoperability
To: Thomas Schwinge
Cc: Salvatore Filippone, Vladimír Fuka, James Norris, Chung-Lin Tang, Fortran List

Sorry, I realized I have to send it as plain text.

I finally found some time to repeat my previous attempts, but with GCC 6.1.
I'll try to be a little detailed so that others don't have to redo the
work I did.

I used the same script as before:

https://github.com/olcf/OLCFHack15/blob/master/GCC5OffloadTest/auto-gcc5-offload-openacc-build-install.sh

and modified the paths in it for GCC 6.1. For some reason the

    x86_64-pc-linux-gnu-accel-nvptx-none-gcc

driver that is built into install/bin does not search the directories in
$LD_LIBRARY_PATH. I found this using the -print-search-dirs option:

    x86_64-pc-linux-gnu-accel-nvptx-none-gcc -print-search-dirs

So I got "library not found" errors for libraries that were in fact
present in install/lib64. I simply copied those libraries to
install/nvptx-none/lib/. With that, I can now use GCC 6.

On to the next task. I made a simple test case; I'll paste the whole
thing here.
Let's call this test.f90:

    program example_dgemm

      use iso_c_binding
      implicit none

      integer :: N = 8
      real(c_double), allocatable :: A(:,:), B(:,:), C(:,:)
      integer :: size_of_real, i, j
      integer*8 :: devPtrA, devPtrB, devPtrC

      size_of_real = 16

      allocate(A(N, N))
      allocate(B(N, N))
      allocate(C(N, N))

      !$ACC PARALLEL COPY(A)
      do i = 1, N
         do j = 1, N
            A(i, j) = i + j
         end do
      end do
      !$ACC END PARALLEL
      !$ACC PARALLEL COPY(B)
      do i = 1, N
         do j = 1, N
            B(i, j) = j
         end do
      end do
      !$ACC END PARALLEL

      call cublas_Alloc(N*N, size_of_real, devPtrA)
      call cublas_Alloc(N*N, size_of_real, devPtrB)
      call cublas_Alloc(N*N, size_of_real, devPtrC)

      ! Copy fixed data to the GPU
      call cublas_Set_Matrix(N, N, size_of_real, A, N, devPtrA, N)
      call cublas_Set_Matrix(N, N, size_of_real, C, N, devPtrC, N)

      ! Copy data to the GPU
      call cublas_Set_Matrix(N, N, size_of_real, B, N, devPtrB, N)

      ! Do DGEMM on the GPU
      call cublas_DGEMM('N', 'N', N, N, N, &
           1.0_c_double, devPtrA, N, devPtrB, N, 0.0_c_double, devPtrC, N)

      ! Copy data from the GPU
      call cublas_Get_Matrix(N, N, size_of_real, devPtrC, N, C, N)

      call cublas_Free(devPtrA)
      call cublas_Free(devPtrB)
      call cublas_Free(devPtrC)

      deallocate(A)
      deallocate(B)
      deallocate(C)

    end program example_dgemm

Basically, I can test both cuBLAS and OpenACC using this simple code.
But using cuBLAS needs some extra pieces, so I copy fortran.h,
fortran_common.h, and fortran.c from /usr/local/cuda/src.
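As an aside: the explicit cublas_Alloc / cublas_Set_Matrix /
cublas_Get_Matrix traffic above is exactly what the OpenACC host_data
construct should eventually make unnecessary. Below is a hypothetical,
untested sketch of that variant; the bind(c) interface to the legacy
cuBLAS cublasDgemm entry point is an assumption written from the
cublas.h prototypes, since gfortran ships no cublas module:

```fortran
program example_dgemm_hostdata
  use iso_c_binding
  implicit none

  ! Assumed interface to the legacy cuBLAS entry point
  !   void cublasDgemm(char, char, int, int, int, double,
  !                    const double*, int, const double*, int,
  !                    double, double*, int)
  ! Scalars are passed by value; the matrices are device pointers.
  interface
     subroutine cublasDgemm(transa, transb, m, n, k, alpha, A, lda, &
                            B, ldb, beta, C, ldc) bind(c, name="cublasDgemm")
       use iso_c_binding
       character(kind=c_char), value :: transa, transb
       integer(c_int), value :: m, n, k, lda, ldb, ldc
       real(c_double), value :: alpha, beta
       real(c_double) :: A(*), B(*), C(*)
     end subroutine cublasDgemm
  end interface

  integer, parameter :: N = 8
  real(c_double), allocatable :: A(:,:), B(:,:), C(:,:)
  integer :: i, j

  allocate(A(N, N), B(N, N), C(N, N))

  !$acc data create(A, B) copyout(C)

  ! Initialize A and B directly on the device.
  !$acc parallel loop collapse(2)
  do j = 1, N
     do i = 1, N
        A(i, j) = i + j
        B(i, j) = j
     end do
  end do

  ! Inside host_data, A, B, and C name the device copies, so no explicit
  ! cublas_Alloc/cublas_Set_Matrix/cublas_Get_Matrix calls are needed.
  !$acc host_data use_device(A, B, C)
  call cublasDgemm('N', 'N', N, N, N, &
                   1.0_c_double, A, N, B, N, 0.0_c_double, C, N)
  !$acc end host_data

  !$acc end data

  deallocate(A, B, C)
end program example_dgemm_hostdata
```

The copyout(C) on the data region brings the result back to the host
when the region ends, replacing the explicit cublas_Get_Matrix call.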
Now I do:

    ./rungcc6.sh gcc -Wall -g -I/usr/local/cuda/include -I/usr/local/cuda/src \
        -DCUBLAS_GFORTRAN -c fortran.c

Then:

    ./rungcc6.sh gfortran -Wall -g test.f90 fortran.o -fopenacc \
        -foffload=nvptx-none -foffload=-O3 -O3 -o gpu.x \
        -L/usr/local/cuda/lib64 -lcublas -lcudart

And finally, nvprof ./gpu.x gives this output:

    ==9155== Profiling application: ./gpu.x
    ==9155== Profiling result:
    Time(%)      Time     Calls       Avg       Min       Max  Name
     23.60%  13.952us         1  13.952us  13.952us  13.952us  MAIN__$_omp_fn$0
     23.55%  13.920us         1  13.920us  13.920us  13.920us  MAIN__$_omp_fn$1
     18.57%  10.976us        16     686ns     576ns  1.0880us  [CUDA memcpy HtoD]
     13.72%  8.1080us         2  4.0540us  2.0430us  6.0650us  [CUDA memcpy HtoH]
     12.07%  7.1360us         1  7.1360us  7.1360us  7.1360us  void gemm_kernel2x2_core(double*, double const *, double const *, int, int, int, int, int, int, double*, double*, double, double, int)
      8.50%  5.0240us         3  1.6740us  1.5680us  1.8880us  [CUDA memcpy DtoH]

Clearly the OpenACC loops run on the GPU, and the DGEMM kernel also runs
on the GPU.

So the only thing left now is to use host_data, so that the whole
cublas_Alloc / cublas_Set_Matrix / cublas_Get_Matrix process need not be
done.

If you can tell me how the patch is applied, I can test it with this
simple example.

Regards,
Vikram

On Mon, Aug 1, 2016 at 6:58 PM, Vikram Singh wrote:
> [...]
> On Thu, May 12, 2016 at 6:34 PM, Thomas Schwinge wrote:
>>
>> Hi!
>>
>> On Fri, 15 Apr 2016 13:59:53 +0300, Vikram Singh wrote:
>> > I checked libgomp.oacc-c-c++-common/context-*.c, and they seem to be
>> > exactly what I needed to start with.
>> >
>> > If I understand it correctly, I'll anyway be trying to implement
>> > something similar with the PGI compilers, so I'll have something as a
>> > starting point.
>>
>> Any progress on that already? I had a very quick look myself, but it's
>> not as easy as I thought...
>> A "courageous" use of "use cublas" in
>> Fortran code compiled with gfortran results in: "Fatal Error: Can't open
>> module file 'cublas.mod' for reading". The problem is: Fortran
>> interfacing to C libraries (which cuBLAS is).
>>
>> has some instructions on how to do it. Anyone got that to work already?
>>
>> > But again, it will need the OpenACC host_data construct to be set up
>> > for gfortran to test.
>>
>> In
>> ,
>> Chung-Lin has now posted a patch (pending review) that should make the
>> OpenACC host_data construct usable in GCC Fortran. (Problem discussed in
>>
>> before.)
>>
>> For reference:
>>
>> > On Fri, Apr 15, 2016 at 11:57 AM, Thomas Schwinge wrote:
>> > > On Fri, 15 Apr 2016 11:35:06 +0300, Vikram Singh wrote:
>> > >> Yes, I came to the conclusion that host_data
>> > >> would be the only way to do it in Fortran.
>> > >>
>> > >> On the other hand, I thought there were no plans to implement it in
>> > >> gfortran 6 either.
>> > >
>> > > I still hope we'll get this (that is, )
>> > > fixed in time for the GCC 6.1 release. I'll keep you posted.
>> > >
>> > >> > I'm copying Jim, who is the author of this chapter in the
>> > >> > documentation as well as the
>> > >> > libgomp/testsuite/libgomp.oacc-c-c++-common/context-*.c test
>> > >> > cases, and much of the relevant libgomp code, too, and who should
>> > >> > please correct me if I'm wrong. I'll make a note for later that
>> > >> > we should translate the libgomp.oacc-c-c++-common/context-*.c
>> > >> > test cases to Fortran, and also replicate them using the OpenACC
>> > >> > host_data construct (like in
>> > >> > libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c), and
>> > >> > the same for the documentation you referenced. (Vikram, please
>> > >> > tell if you're interested to work on these items.)
>> > >>
>> > >> I am not sure I understand what you want me to work on exactly.
>> > >> I am not really much of a C programmer, so I wouldn't be good at
>> > >> it. I would still like to help.
>> > >
>> > > Sorry for being unclear. My idea/question has been whether you're
>> > > interested in helping by translating the documentation as well as the
>> > > libgomp.oacc-c-c++-common/context-*.c test cases from C to Fortran
>> > > (using the OpenACC host_data construct instead of the acc_*
>> > > functions). If yes, then that's great; if not, then one of us will
>> > > do it at some point.
>>
>> Grüße
>>  Thomas
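Regarding the missing cublas.mod Thomas mentions: until a vendor module
exists, the usual workaround is to declare the needed interfaces yourself
with iso_c_binding. A minimal sketch, assuming the legacy cuBLAS entry
points cublasInit and cublasShutdown (the exact signatures should be
checked against cublas.h):

```fortran
module cublas_iface
  use iso_c_binding
  implicit none
  interface
     ! cublasStatus cublasInit(void) -- legacy cuBLAS API
     integer(c_int) function cublasInit() bind(c, name="cublasInit")
       use iso_c_binding
     end function cublasInit

     ! cublasStatus cublasShutdown(void)
     integer(c_int) function cublasShutdown() bind(c, name="cublasShutdown")
       use iso_c_binding
     end function cublasShutdown
  end interface
end module cublas_iface
```

A program then says "use cublas_iface" instead of "use cublas"; no
vendor-supplied .mod file is needed, only -lcublas at link time.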