From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 130380 invoked by alias); 7 Jun 2016 08:26:12 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 130239 invoked by uid 89); 7 Jun 2016 08:26:11 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-2.2 required=5.0 tests=AWL,BAYES_00,RCVD_IN_DNSWL_LOW,SPF_PASS autolearn=ham version=3.3.2 spammy=upto, believed, a15, wider X-HELO: mail-it0-f46.google.com Received: from mail-it0-f46.google.com (HELO mail-it0-f46.google.com) (209.85.214.46) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-GCM-SHA256 encrypted) ESMTPS; Tue, 07 Jun 2016 08:26:00 +0000 Received: by mail-it0-f46.google.com with SMTP id h62so13168549itb.1 for ; Tue, 07 Jun 2016 01:25:59 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=J2M0z1J7kpURQqtCufH6aBWFmS9fspWMwb1hwvnjw0A=; b=EF4hkd7yb9NWL1wHEtRkBSLlhn1DKzIDplpRTK9MJXsR9P/8M76UboB7A3Z13EHd+S WO++j7X+DFia9pmN57RuMjmzQHpptb4uZs0gO+IVUNeC/638MIYxUhxmX27z3XHxlQZ7 Uhx/8RxfiH0UDLRqPuE9vJ6XbVLpQVz78ipQNByLzr7f60t50r2jQo3Rsta1YoqiBHjw qCPj6IR1j7+UGP1MZf9h86HWgwenFeCDvpbsNbmsteMi/7NV0eD0OJl/GDI5SezgRctM t5ttiET7Q100PF1YdXHp8xUu7yES1pxTXp4Qh9fXjJBodpbQRwgdK4a2jG42BFj+RhOt u3pQ== X-Gm-Message-State: ALyK8tKk8SUcz94KsHM9SMddjwHcmIgndBJP8Alxk6w6QXK0yo4SIuCuTxKp2jjnQHC85vfz0rHv+RHHDtJix0Yo X-Received: by 10.36.65.97 with SMTP id x94mr1730183ita.16.1465287957935; Tue, 07 Jun 2016 01:25:57 -0700 (PDT) MIME-Version: 1.0 Received: by 10.36.236.5 with HTTP; Tue, 7 Jun 2016 01:25:57 -0700 (PDT) In-Reply-To: References: <55BB4127.5050202@foss.arm.com> From: Prathamesh Kulkarni Date: Tue, 07 Jun 2016 08:26:00 -0000 Message-ID: Subject: Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations To: Ramana Radhakrishnan Cc: Ramana Radhakrishnan , gcc Patches , Charles Baylis Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-IsSubscribed: yes X-SW-Source: 2016-06/txt/msg00455.txt.bz2 On 30 May 2016 at 13:52, Prathamesh Kulkarni wrote: > On 23 May 2016 at 14:59, Prathamesh Kulkarni > wrote: >> On 5 February 2016 at 18:40, Prathamesh Kulkarni >> wrote: >>> On 4 February 2016 at 16:31, Ramana Radhakrishnan >>> wrote: >>>> On Sun, Jan 17, 2016 at 9:06 AM, Prathamesh Kulkarni >>>> wrote: >>>>> On 31 July 2015 at 15:04, Ramana Radhakrishnan >>>>> wrote: >>>>>> >>>>>> >>>>>> On 29/07/15 11:09, Prathamesh Kulkarni wrote: >>>>>>> Hi, >>>>>>> This patch tries to implement division with multiplication by >>>>>>> reciprocal using vrecpe/vrecps >>>>>>> with -funsafe-math-optimizations and -freciprocal-math enabled. >>>>>>> Tested on arm-none-linux-gnueabihf using qemu. >>>>>>> OK for trunk ? >>>>>>> >>>>>>> Thank you, >>>>>>> Prathamesh >>>>>>> >>>>>> >>>>>> I've tried this in the past and never been convinced that 2 iteratio= ns are enough to get to stability with this given that the results are only= precise for 8 bits / iteration. Thus I've always believed you need 3 itera= tions rather than 2 at which point I've never been sure that it's worth it.= So the testing that you've done with this currently is not enough for this= to go into the tree. >>>>>> >>>>>> I'd like this to be tested on a couple of different AArch32 implemen= tations with a wider range of inputs to verify that the results are accepta= ble as well as running something like SPEC2k(6) with atleast one iteration = to ensure correctness. >>>>> Hi, >>>>> I got results of SPEC2k6 fp benchmarks: >>>>> a15: +0.64% overall, 481.wrf: +6.46% >>>>> a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76% >>>>> a57: +0.35% overall, 481.wrf: +3.84% >>>>> The other benchmarks had (almost) identical results. >>>> >>>> Thanks for the benchmarking results - Please repost the patch with >>>> the changes that I had requested in my previous review - given it is >>>> now stage4 , I would rather queue changes like this for stage1 now. >>> Hi, >>> Please find the updated patch attached. >>> It passes testsuite for arm-none-linux-gnueabi, arm-none-linux-gnueabih= f and >>> arm-none-eabi. >>> However the test-case added in the patch (neon-vect-div-1.c) fails to >>> get vectorized at -O2 >>> for armeb-none-linux-gnueabihf. >>> Charles suggested me to try with -O3, which worked. >>> It appears the test-case fails to get vectorized with >>> -fvect-cost-model=3Dcheap (which is default enabled at -O2) >>> and passes for -fno-vect-cost-model / -fvect-cost-model=3Ddynamic >>> >>> I can't figure out why it fails -fvect-cost-model=3Dcheap. >>> From the vect dump (attached): >>> neon-vect-div-1.c:12:3: note: Setting misalignment to -1. >>> neon-vect-div-1.c:12:3: note: not vectorized: unsupported unaligned loa= d.*_9 >> Hi, >> I think I have some idea why the test-case fails attached with patch >> fail to get vectorized on armeb with -O2. >> >> Issue with big endian vectorizer: >> The patch does not cause regressions on big endian vectorizer but >> fails to vectorize the test-cases attached with the patch, while they >> get vectorized on >> litttle-endian. >> Fails with armeb with the following message in dump: >> note: not vectorized: unsupported unaligned load.*_9 >> >> The behavior of big and little endian vectorizer seems to be different >> in arm_builtin_support_vector_misalignment() which overrides the hook >> targetm.vectorize.support_vector_misalignment(). >> >> targetm.vectorize.support_vector_misalignment is called by >> vect_supportable_dr_alignment () which in turn is called >> by verify_data_refs_alignment (). >> >> Execution upto following condition is common between arm and armeb >> in vect_supportable_dr_alignment(): >> >> if ((TYPE_USER_ALIGN (type) && !is_packed) >> || targetm.vectorize.support_vector_misalignment (mode, type, >> DR_MISALIGNMENT (dr), is_pac= ked)) >> /* Can't software pipeline the loads, but can at least do them. = */ >> return dr_unaligned_supported; >> >> For little endian case: >> arm_builtin_support_vector_misalignment() is called with >> V2SF mode and misalignment =3D=3D -1, and the following condition >> becomes true: >> /* If the misalignment is unknown, we should be able to handle the access >> so long as it is not to a member of a packed data structure. */ >> if (misalignment =3D=3D -1) >> return true; >> >> Since the hook returned true we enter the condition above in >> vect_supportable_dr_alignment() and return dr_unaligned_supported; >> >> For big-endian: >> arm_builtin_support_vector_misalignment() is called with V2SF mode. >> The following condition that gates the entire function body fails: >> if (TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access) >> and the default hook gets called with V2SF mode and the default hook >> returns false because >> movmisalign_optab does not exist for V2SF mode. >> >> So the condition above in vect_supportable_dr_alignment() fails >> and we come here: >> /* Unsupported. */ >> return dr_unaligned_unsupported; >> >> And hence we get the unaligned load not supported message in the dump >> for armeb in verify_data_ref_alignment (): >> >> static bool >> verify_data_ref_alignment (data_reference_p dr) >> { >> enum dr_alignment_support supportable_dr_alignment >> =3D vect_supportable_dr_alignment (dr, false); >> if (!supportable_dr_alignment) >> { >> if (dump_enabled_p ()) >> { >> if (DR_IS_READ (dr)) >> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, >> "not vectorized: unsupported unaligned load= ."); >> >> With -O3, the test-cases vectorize for armeb, because loop peeling for a= lignment >> is turned on. >> The above behavior is also reproducible with test-case which is >> irrelevant to the patch. >> for instance, we get the same unsupported unaligned load for following >> test-case (replaced / with +) >> >> void >> foo (int len, float * __restrict p, float *__restrict x) >> { >> len =3D len & ~31; >> for (int i =3D 0; i < len; i++) >> p[i] =3D p[i] + x[i]; >> } >> Is the patch OK to commit after bootstrap+test ? > ping https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html ping * 2 https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html Thanks, Prathamesh > > Thanks, > Prathamesh >> >> Thanks, >> Prathamesh >>> >>> Thanks, >>> Prathamesh >>>> >>>> Thanks, >>>> Ramana >>>> >>>>> >>>>> Thanks, >>>>> Prathamesh >>>>>> >>>>>> >>>>>> moving on to the patches. >>>>>> >>>>>>> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md >>>>>>> index 654d9d5..28c2e2a 100644 >>>>>>> --- a/gcc/config/arm/neon.md >>>>>>> +++ b/gcc/config/arm/neon.md >>>>>>> @@ -548,6 +548,32 @@ >>>>>>> (const_string "neon_mul_")))] >>>>>>> ) >>>>>>> >>>>>> >>>>>> Please add a comment here. >>>>>> >>>>>>> +(define_expand "div3" >>>>>>> + [(set (match_operand:VCVTF 0 "s_register_operand" "=3Dw") >>>>>>> + (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w") >>>>>>> + (match_operand:VCVTF 2 "s_register_operand" "w")))] >>>>>> >>>>>> I want to double check that this doesn't collide with Alan's patches= for FP16 especially if he reuses the VCVTF iterator for all the vcvt f16 c= ases. >>>>>> >>>>>>> + "TARGET_NEON && flag_unsafe_math_optimizations && flag_reciproca= l_math" >>>>>>> + { >>>>>>> + rtx rec =3D gen_reg_rtx (mode); >>>>>>> + rtx vrecps_temp =3D gen_reg_rtx (mode); >>>>>>> + >>>>>>> + /* Reciprocal estimate */ >>>>>>> + emit_insn (gen_neon_vrecpe (rec, operands[2])); >>>>>>> + >>>>>>> + /* Perform 2 iterations of Newton-Raphson method for better ac= curacy */ >>>>>>> + for (int i =3D 0; i < 2; i++) >>>>>>> + { >>>>>>> + emit_insn (gen_neon_vrecps (vrecps_temp, rec, operands[= 2])); >>>>>>> + emit_insn (gen_mul3 (rec, rec, vrecps_temp)); >>>>>>> + } >>>>>>> + >>>>>>> + /* We now have reciprocal in rec, perform operands[0] =3D oper= ands[1] * rec */ >>>>>>> + emit_insn (gen_mul3 (operands[0], operands[1], rec)); >>>>>>> + DONE; >>>>>>> + } >>>>>>> +) >>>>>>> + >>>>>>> + >>>>>>> (define_insn "mul3add_neon" >>>>>>> [(set (match_operand:VDQW 0 "s_register_operand" "=3Dw") >>>>>>> (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_op= erand" "w") >>>>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c b/gcc/testsu= ite/gcc.target/arm/vect-div-1.c >>>>>>> new file mode 100644 >>>>>>> index 0000000..e562ef3 >>>>>>> --- /dev/null >>>>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c >>>>>>> @@ -0,0 +1,14 @@ >>>>>>> +/* { dg-do compile } */ >>>>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */ >>>>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize = -fdump-tree-vect-all" } */ >>>>>>> +/* { dg-add-options arm_v8_neon } */ >>>>>> >>>>>> No this is wrong. >>>>>> >>>>>> What is armv8 specific about this test ? This is just like another t= est that is for Neon. vrecpe / vrecps are not instructions that were introd= uced in the v8 version of the architecture. They've existed in the base Neo= n instruction set. The code generation above in the patterns will be enable= d when TARGET_NEON is true which can happen when -mfpu=3Dneon -mfloat-abi= =3D{softfp/hard} is true. >>>>>> >>>>>>> + >>>>>>> +void >>>>>>> +foo (int len, float * __restrict p, float *__restrict x) >>>>>>> +{ >>>>>>> + len =3D len & ~31; >>>>>>> + for (int i =3D 0; i < len; i++) >>>>>>> + p[i] =3D p[i] / x[i]; >>>>>>> +} >>>>>>> + >>>>>>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"= } } */ >>>>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-2.c b/gcc/testsu= ite/gcc.target/arm/vect-div-2.c >>>>>>> new file mode 100644 >>>>>>> index 0000000..8e15d0a >>>>>>> --- /dev/null >>>>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-2.c >>>>>>> @@ -0,0 +1,14 @@ >>>>>>> +/* { dg-do compile } */ >>>>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */ >>>>>> >>>>>> And likewise. >>>>>> >>>>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -fno-reciprocal-m= ath -ftree-vectorize -fdump-tree-vect-all" } */ >>>>>>> +/* { dg-add-options arm_v8_neon } */ >>>>>>> + >>>>>>> +void >>>>>>> +foo (int len, float * __restrict p, float *__restrict x) >>>>>>> +{ >>>>>>> + len =3D len & ~31; >>>>>>> + for (int i =3D 0; i < len; i++) >>>>>>> + p[i] =3D p[i] / x[i]; >>>>>>> +} >>>>>>> + >>>>>>> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect"= } } */ >>>>>> >>>>>> >>>>>> regards >>>>>> Ramana