From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 129825 invoked by alias); 7 Jun 2016 08:37:26 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 128155 invoked by uid 89); 7 Jun 2016 08:37:25 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-2.5 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,RCVD_IN_DNSWL_LOW,SPF_PASS autolearn=ham version=3.3.2 spammy= X-HELO: mail-lf0-f66.google.com Received: from mail-lf0-f66.google.com (HELO mail-lf0-f66.google.com) (209.85.215.66) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-GCM-SHA256 encrypted) ESMTPS; Tue, 07 Jun 2016 08:37:15 +0000 Received: by mail-lf0-f66.google.com with SMTP id u74so421731lff.0 for ; Tue, 07 Jun 2016 01:37:14 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=iWor0FdHtpM7CLsxkUQdBkAREd46Y0kXCsI4SVmptec=; b=lD6F5xNLhDL8JPmOTzwZV7pHNQy7Ga0epNTWhNEs4zvDrfGk4xLRNEqffCT9xbeXJn osmonyo7vZmebdIQTu0gN9ZkOqWmfmty79eTPL9PVpsMb9a84YE9em4/gFJwCQQd0Pam Z+XyLdNZHSwl7TXmG/Xhf1UnnvOG7DUfoNCzpkcw3+A54hKSrOSA/qDKHuCSvJM39a4Z qGBEwMPgdsYgqzU3zN7KH9SY7DJn0OJKnY7tgn2HGnsMU4vJlMUhYfvTq2nEQWFdmHzo XPWvTKqmMMpdV3idxbYIsl4eWQ0ufD868la/rqx5FSu7QMbypNBRsf4gzqumIqmjTjrG ggeg== X-Gm-Message-State: ALyK8tI1fYaYaLF/wSTWadrDdSqjq4GJAH876j0rH91fc2RQ2DfSFeI9cd67cGcZqtwQaokrCSJq8ldEKDFsLw== X-Received: by 10.46.33.214 with SMTP id h83mr5582980lji.41.1465288631964; Tue, 07 Jun 2016 01:37:11 -0700 (PDT) MIME-Version: 1.0 Received: by 10.114.25.200 with HTTP; Tue, 7 Jun 2016 01:37:11 -0700 (PDT) In-Reply-To: References: <55BB4127.5050202@foss.arm.com> From: Ramana Radhakrishnan Date: Tue, 07 Jun 2016 08:37:00 -0000 Message-ID: Subject: Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations To: Prathamesh Kulkarni Cc: Ramana Radhakrishnan , gcc Patches , Charles Baylis Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-IsSubscribed: yes X-SW-Source: 2016-06/txt/msg00459.txt.bz2 >> Please find the updated patch attached. >> It passes testsuite for arm-none-linux-gnueabi, arm-none-linux-gnueabihf= and >> arm-none-eabi. >> However the test-case added in the patch (neon-vect-div-1.c) fails to >> get vectorized at -O2 >> for armeb-none-linux-gnueabihf. >> Charles suggested me to try with -O3, which worked. >> It appears the test-case fails to get vectorized with >> -fvect-cost-model=3Dcheap (which is default enabled at -O2) >> and passes for -fno-vect-cost-model / -fvect-cost-model=3Ddynamic >> >> I can't figure out why it fails -fvect-cost-model=3Dcheap. >> From the vect dump (attached): >> neon-vect-div-1.c:12:3: note: Setting misalignment to -1. >> neon-vect-div-1.c:12:3: note: not vectorized: unsupported unaligned load= .*_9 > Hi, > I think I have some idea why the test-case fails attached with patch > fail to get vectorized on armeb with -O2. > > Issue with big endian vectorizer: > The patch does not cause regressions on big endian vectorizer but > fails to vectorize the test-cases attached with the patch, while they > get vectorized on > litttle-endian. > Fails with armeb with the following message in dump: > note: not vectorized: unsupported unaligned load.*_9 > > The behavior of big and little endian vectorizer seems to be different > in arm_builtin_support_vector_misalignment() which overrides the hook > targetm.vectorize.support_vector_misalignment(). > > targetm.vectorize.support_vector_misalignment is called by > vect_supportable_dr_alignment () which in turn is called > by verify_data_refs_alignment (). > > Execution upto following condition is common between arm and armeb > in vect_supportable_dr_alignment(): > > if ((TYPE_USER_ALIGN (type) && !is_packed) > || targetm.vectorize.support_vector_misalignment (mode, type, > DR_MISALIGNMENT (dr), is_pack= ed)) > /* Can't software pipeline the loads, but can at least do them. = */ > return dr_unaligned_supported; > > For little endian case: > arm_builtin_support_vector_misalignment() is called with > V2SF mode and misalignment =3D=3D -1, and the following condition > becomes true: > /* If the misalignment is unknown, we should be able to handle the access > so long as it is not to a member of a packed data structure. */ > if (misalignment =3D=3D -1) > return true; > > Since the hook returned true we enter the condition above in > vect_supportable_dr_alignment() and return dr_unaligned_supported; > > For big-endian: > arm_builtin_support_vector_misalignment() is called with V2SF mode. > The following condition that gates the entire function body fails: > if (TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access) > and the default hook gets called with V2SF mode and the default hook > returns false because > movmisalign_optab does not exist for V2SF mode. > > So the condition above in vect_supportable_dr_alignment() fails > and we come here: > /* Unsupported. */ > return dr_unaligned_unsupported; > > And hence we get the unaligned load not supported message in the dump > for armeb in verify_data_ref_alignment (): > > static bool > verify_data_ref_alignment (data_reference_p dr) > { > enum dr_alignment_support supportable_dr_alignment > =3D vect_supportable_dr_alignment (dr, false); > if (!supportable_dr_alignment) > { > if (dump_enabled_p ()) > { > if (DR_IS_READ (dr)) > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > "not vectorized: unsupported unaligned load.= "); > > With -O3, the test-cases vectorize for armeb, because loop peeling for al= ignment > is turned on. > The above behavior is also reproducible with test-case which is > irrelevant to the patch. > for instance, we get the same unsupported unaligned load for following > test-case (replaced / with +) > > void > foo (int len, float * __restrict p, float *__restrict x) > { > len =3D len & ~31; > for (int i =3D 0; i < len; i++) > p[i] =3D p[i] + x[i]; > } > Is the patch OK to commit after bootstrap+test ? Thanks for the analysis - all the test needs is an additional marker to skip it on armeb (is there a helper for misaligned loads from the vectorizer ? ) - Ah probably vect_hw_misalign is sufficient for your usecase and you want to appropriately fix it for little endian arm with neon support enabled. =46rom the patch. >>+ && flag_unsafe_math_optimizations && flag_reciprocal_math" Why do we need flag_unsafe_math_optimizations && flag_reciprocal_math ? flag_unsafe_math_optimizations should be sufficient since it enables flag_reciprocal_math - the reason for flag_unsafe_math_optimizations is to prevent loss of precision and the fact that on neon denormalized numbers are flushed to zero. Ok with that change and a quick test with vect_hw_misalign added to your testcase. Sorry about the delay in reviewing. Ramana > > Thanks, > Prathamesh >> >> Thanks, >> Prathamesh >>> >>> Thanks, >>> Ramana >>> >>>> >>>> Thanks, >>>> Prathamesh >>>>> >>>>> >>>>> moving on to the patches. >>>>> >>>>>> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md >>>>>> index 654d9d5..28c2e2a 100644 >>>>>> --- a/gcc/config/arm/neon.md >>>>>> +++ b/gcc/config/arm/neon.md >>>>>> @@ -548,6 +548,32 @@ >>>>>> (const_string "neon_mul_")))] >>>>>> ) >>>>>> >>>>> >>>>> Please add a comment here. >>>>> >>>>>> +(define_expand "div3" >>>>>> + [(set (match_operand:VCVTF 0 "s_register_operand" "=3Dw") >>>>>> + (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w") >>>>>> + (match_operand:VCVTF 2 "s_register_operand" "w")))] >>>>> >>>>> I want to double check that this doesn't collide with Alan's patches = for FP16 especially if he reuses the VCVTF iterator for all the vcvt f16 ca= ses. >>>>> >>>>>> + "TARGET_NEON && flag_unsafe_math_optimizations && flag_reciprocal= _math" >>>>>> + { >>>>>> + rtx rec =3D gen_reg_rtx (mode); >>>>>> + rtx vrecps_temp =3D gen_reg_rtx (mode); >>>>>> + >>>>>> + /* Reciprocal estimate */ >>>>>> + emit_insn (gen_neon_vrecpe (rec, operands[2])); >>>>>> + >>>>>> + /* Perform 2 iterations of Newton-Raphson method for better acc= uracy */ >>>>>> + for (int i =3D 0; i < 2; i++) >>>>>> + { >>>>>> + emit_insn (gen_neon_vrecps (vrecps_temp, rec, operands[2= ])); >>>>>> + emit_insn (gen_mul3 (rec, rec, vrecps_temp)); >>>>>> + } >>>>>> + >>>>>> + /* We now have reciprocal in rec, perform operands[0] =3D opera= nds[1] * rec */ >>>>>> + emit_insn (gen_mul3 (operands[0], operands[1], rec)); >>>>>> + DONE; >>>>>> + } >>>>>> +) >>>>>> + >>>>>> + >>>>>> (define_insn "mul3add_neon" >>>>>> [(set (match_operand:VDQW 0 "s_register_operand" "=3Dw") >>>>>> (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_ope= rand" "w") >>>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c b/gcc/testsui= te/gcc.target/arm/vect-div-1.c >>>>>> new file mode 100644 >>>>>> index 0000000..e562ef3 >>>>>> --- /dev/null >>>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c >>>>>> @@ -0,0 +1,14 @@ >>>>>> +/* { dg-do compile } */ >>>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */ >>>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize -= fdump-tree-vect-all" } */ >>>>>> +/* { dg-add-options arm_v8_neon } */ >>>>> >>>>> No this is wrong. >>>>> >>>>> What is armv8 specific about this test ? This is just like another te= st that is for Neon. vrecpe / vrecps are not instructions that were introdu= ced in the v8 version of the architecture. They've existed in the base Neon= instruction set. The code generation above in the patterns will be enabled= when TARGET_NEON is true which can happen when -mfpu=3Dneon -mfloat-abi=3D= {softfp/hard} is true. >>>>> >>>>>> + >>>>>> +void >>>>>> +foo (int len, float * __restrict p, float *__restrict x) >>>>>> +{ >>>>>> + len =3D len & ~31; >>>>>> + for (int i =3D 0; i < len; i++) >>>>>> + p[i] =3D p[i] / x[i]; >>>>>> +} >>>>>> + >>>>>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" = } } */ >>>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-2.c b/gcc/testsui= te/gcc.target/arm/vect-div-2.c >>>>>> new file mode 100644 >>>>>> index 0000000..8e15d0a >>>>>> --- /dev/null >>>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-2.c >>>>>> @@ -0,0 +1,14 @@ >>>>>> +/* { dg-do compile } */ >>>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */ >>>>> >>>>> And likewise. >>>>> >>>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -fno-reciprocal-ma= th -ftree-vectorize -fdump-tree-vect-all" } */ >>>>>> +/* { dg-add-options arm_v8_neon } */ >>>>>> + >>>>>> +void >>>>>> +foo (int len, float * __restrict p, float *__restrict x) >>>>>> +{ >>>>>> + len =3D len & ~31; >>>>>> + for (int i =3D 0; i < len; i++) >>>>>> + p[i] =3D p[i] / x[i]; >>>>>> +} >>>>>> + >>>>>> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" = } } */ >>>>> >>>>> >>>>> regards >>>>> Ramana