From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-429182-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 130380 invoked by alias); 7 Jun 2016 08:26:12 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 130239 invoked by uid 89); 7 Jun 2016 08:26:11 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-2.2 required=5.0 tests=AWL,BAYES_00,RCVD_IN_DNSWL_LOW,SPF_PASS autolearn=ham version=3.3.2 spammy=upto, believed, a15, wider
X-HELO: mail-it0-f46.google.com
Received: from mail-it0-f46.google.com (HELO mail-it0-f46.google.com) (209.85.214.46) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-GCM-SHA256 encrypted) ESMTPS; Tue, 07 Jun 2016 08:26:00 +0000
Received: by mail-it0-f46.google.com with SMTP id h62so13168549itb.1        for <gcc-patches@gcc.gnu.org>; Tue, 07 Jun 2016 01:25:59 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;        d=1e100.net; s=20130820;        h=x-gm-message-state:mime-version:in-reply-to:references:from:date         :message-id:subject:to:cc:content-transfer-encoding;        bh=J2M0z1J7kpURQqtCufH6aBWFmS9fspWMwb1hwvnjw0A=;        b=EF4hkd7yb9NWL1wHEtRkBSLlhn1DKzIDplpRTK9MJXsR9P/8M76UboB7A3Z13EHd+S         WO++j7X+DFia9pmN57RuMjmzQHpptb4uZs0gO+IVUNeC/638MIYxUhxmX27z3XHxlQZ7         Uhx/8RxfiH0UDLRqPuE9vJ6XbVLpQVz78ipQNByLzr7f60t50r2jQo3Rsta1YoqiBHjw         qCPj6IR1j7+UGP1MZf9h86HWgwenFeCDvpbsNbmsteMi/7NV0eD0OJl/GDI5SezgRctM         t5ttiET7Q100PF1YdXHp8xUu7yES1pxTXp4Qh9fXjJBodpbQRwgdK4a2jG42BFj+RhOt         u3pQ==
X-Gm-Message-State: ALyK8tKk8SUcz94KsHM9SMddjwHcmIgndBJP8Alxk6w6QXK0yo4SIuCuTxKp2jjnQHC85vfz0rHv+RHHDtJix0Yo
X-Received: by 10.36.65.97 with SMTP id x94mr1730183ita.16.1465287957935; Tue, 07 Jun 2016 01:25:57 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.36.236.5 with HTTP; Tue, 7 Jun 2016 01:25:57 -0700 (PDT)
In-Reply-To: <CAAgBjMmOQesByOu1wQx0d8NvfBRJx1XsikzC8eN=d-qk4rM-HQ@mail.gmail.com>
References: <CAAgBjMk0Hdask2JU8xs4fj_Ai1e0ggxB+h3ayb=NOGQBYJ8ccQ@mail.gmail.com> <55BB4127.5050202@foss.arm.com> <CAAgBjMnteaOO=JFifLD1CG7s-JE5r=65wcQm+HBwSMtpwg3ReA@mail.gmail.com> <CAJA7tRaf8sZHmzHCt0n8rGJKt4bQycfSU+ABmb8rbcKb1VFQ=A@mail.gmail.com> <CAAgBjM=X8rXnDnNV4-sWC_0fCt9hcvuOx2-hJ=vwsAb1oBq22Q@mail.gmail.com> <CAAgBjMkh=0tyzaMjqpSQg_mV8GYH-Tx9WV5xjHs2k0H0=UeqtQ@mail.gmail.com> <CAAgBjMmOQesByOu1wQx0d8NvfBRJx1XsikzC8eN=d-qk4rM-HQ@mail.gmail.com>
From: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
Date: Tue, 07 Jun 2016 08:26:00 -0000
Message-ID: <CAAgBjMmMF16cZdF5nkUZKm0y8TVkKEqwDqWY1st7SeVppnDPrw@mail.gmail.com>
Subject: Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations
To: Ramana Radhakrishnan <ramana.gcc@googlemail.com>
Cc: Ramana Radhakrishnan <ramana.radhakrishnan@foss.arm.com>, 	gcc Patches <gcc-patches@gcc.gnu.org>, Charles Baylis <charles.baylis@linaro.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-IsSubscribed: yes
X-SW-Source: 2016-06/txt/msg00455.txt.bz2

On 30 May 2016 at 13:52, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
> On 23 May 2016 at 14:59, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
>> On 5 February 2016 at 18:40, Prathamesh Kulkarni
>> <prathamesh.kulkarni@linaro.org> wrote:
>>> On 4 February 2016 at 16:31, Ramana Radhakrishnan
>>> <ramana.gcc@googlemail.com> wrote:
>>>> On Sun, Jan 17, 2016 at 9:06 AM, Prathamesh Kulkarni
>>>> <prathamesh.kulkarni@linaro.org> wrote:
>>>>> On 31 July 2015 at 15:04, Ramana Radhakrishnan
>>>>> <ramana.radhakrishnan@foss.arm.com> wrote:
>>>>>>
>>>>>>
>>>>>> On 29/07/15 11:09, Prathamesh Kulkarni wrote:
>>>>>>> Hi,
>>>>>>> This patch tries to implement division with multiplication by
>>>>>>> reciprocal using vrecpe/vrecps
>>>>>>> with -funsafe-math-optimizations and -freciprocal-math enabled.
>>>>>>> Tested on arm-none-linux-gnueabihf using qemu.
>>>>>>> OK for trunk ?
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Prathamesh
>>>>>>>
>>>>>>
>>>>>> I've tried this in the past and never been convinced that 2 iteratio=
ns are enough to get to stability with this given that the results are only=
 precise for 8 bits / iteration. Thus I've always believed you need 3 itera=
tions rather than 2 at which point I've never been sure that it's worth it.=
 So the testing that you've done with this currently is not enough for this=
 to go into the tree.
>>>>>>
>>>>>> I'd like this to be tested on a couple of different AArch32 implemen=
tations with a wider range of inputs to verify that the results are accepta=
ble as well as running something like SPEC2k(6) with atleast one iteration =
to ensure correctness.
>>>>> Hi,
>>>>> I got results of SPEC2k6 fp benchmarks:
>>>>> a15: +0.64% overall, 481.wrf: +6.46%
>>>>> a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76%
>>>>> a57: +0.35% overall, 481.wrf: +3.84%
>>>>> The other benchmarks had (almost) identical results.
>>>>
>>>> Thanks for the benchmarking results -  Please repost the patch with
>>>> the changes that I had requested in my previous review - given it is
>>>> now stage4 , I would rather queue changes like this for stage1 now.
>>> Hi,
>>> Please find the updated patch attached.
>>> It passes testsuite for arm-none-linux-gnueabi, arm-none-linux-gnueabih=
f and
>>> arm-none-eabi.
>>> However the test-case added in the patch (neon-vect-div-1.c) fails to
>>> get vectorized at -O2
>>> for armeb-none-linux-gnueabihf.
>>> Charles suggested me to try with -O3, which worked.
>>> It appears the test-case fails to get vectorized with
>>> -fvect-cost-model=3Dcheap (which is default enabled at -O2)
>>> and passes for -fno-vect-cost-model / -fvect-cost-model=3Ddynamic
>>>
>>> I can't figure out why it fails -fvect-cost-model=3Dcheap.
>>> From the vect dump (attached):
>>> neon-vect-div-1.c:12:3: note: Setting misalignment to -1.
>>> neon-vect-div-1.c:12:3: note: not vectorized: unsupported unaligned loa=
d.*_9
>> Hi,
>> I think I have some idea why the test-case fails attached with patch
>> fail to get vectorized on armeb with -O2.
>>
>> Issue with big endian vectorizer:
>> The patch does not cause regressions on big endian vectorizer but
>> fails to vectorize the test-cases attached with the patch, while they
>> get vectorized on
>> litttle-endian.
>> Fails with armeb with the following message in dump:
>> note: not vectorized: unsupported unaligned load.*_9
>>
>> The behavior of big and little endian vectorizer seems to be different
>> in arm_builtin_support_vector_misalignment() which overrides the hook
>> targetm.vectorize.support_vector_misalignment().
>>
>> targetm.vectorize.support_vector_misalignment is called by
>> vect_supportable_dr_alignment () which in turn is called
>> by verify_data_refs_alignment ().
>>
>> Execution upto following condition is common between arm and armeb
>> in vect_supportable_dr_alignment():
>>
>> if ((TYPE_USER_ALIGN (type) && !is_packed)
>>       || targetm.vectorize.support_vector_misalignment (mode, type,
>>                                             DR_MISALIGNMENT (dr), is_pac=
ked))
>>         /* Can't software pipeline the loads, but can at least do them. =
 */
>>         return dr_unaligned_supported;
>>
>> For little endian case:
>> arm_builtin_support_vector_misalignment() is called with
>> V2SF mode and misalignment =3D=3D -1, and the following condition
>> becomes true:
>> /* If the misalignment is unknown, we should be able to handle the access
>>          so long as it is not to a member of a packed data structure.  */
>>   if (misalignment =3D=3D -1)
>>     return true;
>>
>> Since the hook returned true we enter the condition above in
>> vect_supportable_dr_alignment() and return dr_unaligned_supported;
>>
>> For big-endian:
>> arm_builtin_support_vector_misalignment() is called with V2SF mode.
>> The following condition that gates the entire function body fails:
>>  if (TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access)
>> and the default hook gets called with V2SF mode and the default hook
>> returns false because
>> movmisalign_optab does not exist for V2SF mode.
>>
>> So the condition above in vect_supportable_dr_alignment() fails
>> and we come here:
>>  /* Unsupported.  */
>> return dr_unaligned_unsupported;
>>
>> And hence we get the unaligned load not supported message in the dump
>> for armeb in verify_data_ref_alignment ():
>>
>> static bool
>> verify_data_ref_alignment (data_reference_p dr)
>> {
>>   enum dr_alignment_support supportable_dr_alignment
>>     =3D vect_supportable_dr_alignment (dr, false);
>>   if (!supportable_dr_alignment)
>>     {
>>       if (dump_enabled_p ())
>>         {
>>           if (DR_IS_READ (dr))
>>             dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>                              "not vectorized: unsupported unaligned load=
.");
>>
>> With -O3, the test-cases vectorize for armeb, because loop peeling for a=
lignment
>> is turned on.
>> The above behavior is also reproducible with test-case which is
>> irrelevant to the patch.
>> for instance, we get the same unsupported unaligned load for following
>> test-case (replaced / with +)
>>
>> void
>> foo (int len, float * __restrict p, float *__restrict x)
>> {
>>   len =3D len & ~31;
>>   for (int i =3D 0; i < len; i++)
>>     p[i] =3D p[i] + x[i];
>> }
>> Is the patch OK to commit after bootstrap+test ?
> ping https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html
ping * 2 https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
>>
>> Thanks,
>> Prathamesh
>>>
>>> Thanks,
>>> Prathamesh
>>>>
>>>> Thanks,
>>>> Ramana
>>>>
>>>>>
>>>>> Thanks,
>>>>> Prathamesh
>>>>>>
>>>>>>
>>>>>> moving on to the patches.
>>>>>>
>>>>>>> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
>>>>>>> index 654d9d5..28c2e2a 100644
>>>>>>> --- a/gcc/config/arm/neon.md
>>>>>>> +++ b/gcc/config/arm/neon.md
>>>>>>> @@ -548,6 +548,32 @@
>>>>>>>                      (const_string "neon_mul_<V_elem_ch><q>")))]
>>>>>>>  )
>>>>>>>
>>>>>>
>>>>>> Please add a comment here.
>>>>>>
>>>>>>> +(define_expand "div<mode>3"
>>>>>>> +  [(set (match_operand:VCVTF 0 "s_register_operand" "=3Dw")
>>>>>>> +        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
>>>>>>> +               (match_operand:VCVTF 2 "s_register_operand" "w")))]
>>>>>>
>>>>>> I want to double check that this doesn't collide with Alan's patches=
 for FP16 especially if he reuses the VCVTF iterator for all the vcvt f16 c=
ases.
>>>>>>
>>>>>>> +  "TARGET_NEON && flag_unsafe_math_optimizations && flag_reciproca=
l_math"
>>>>>>> +  {
>>>>>>> +    rtx rec =3D gen_reg_rtx (<MODE>mode);
>>>>>>> +    rtx vrecps_temp =3D gen_reg_rtx (<MODE>mode);
>>>>>>> +
>>>>>>> +    /* Reciprocal estimate */
>>>>>>> +    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
>>>>>>> +
>>>>>>> +    /* Perform 2 iterations of Newton-Raphson method for better ac=
curacy */
>>>>>>> +    for (int i =3D 0; i < 2; i++)
>>>>>>> +      {
>>>>>>> +     emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[=
2]));
>>>>>>> +     emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
>>>>>>> +      }
>>>>>>> +
>>>>>>> +    /* We now have reciprocal in rec, perform operands[0] =3D oper=
ands[1] * rec */
>>>>>>> +    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
>>>>>>> +    DONE;
>>>>>>> +  }
>>>>>>> +)
>>>>>>> +
>>>>>>> +
>>>>>>>  (define_insn "mul<mode>3add<mode>_neon"
>>>>>>>    [(set (match_operand:VDQW 0 "s_register_operand" "=3Dw")
>>>>>>>          (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_op=
erand" "w")
>>>>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c b/gcc/testsu=
ite/gcc.target/arm/vect-div-1.c
>>>>>>> new file mode 100644
>>>>>>> index 0000000..e562ef3
>>>>>>> --- /dev/null
>>>>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c
>>>>>>> @@ -0,0 +1,14 @@
>>>>>>> +/* { dg-do compile } */
>>>>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */
>>>>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize =
-fdump-tree-vect-all" } */
>>>>>>> +/* { dg-add-options arm_v8_neon } */
>>>>>>
>>>>>> No this is wrong.
>>>>>>
>>>>>> What is armv8 specific about this test ? This is just like another t=
est that is for Neon. vrecpe / vrecps are not instructions that were introd=
uced in the v8 version of the architecture. They've existed in the base Neo=
n instruction set. The code generation above in the patterns will be enable=
d when TARGET_NEON is true which can happen when -mfpu=3Dneon -mfloat-abi=
=3D{softfp/hard} is true.
>>>>>>
>>>>>>> +
>>>>>>> +void
>>>>>>> +foo (int len, float * __restrict p, float *__restrict x)
>>>>>>> +{
>>>>>>> +  len =3D len & ~31;
>>>>>>> +  for (int i =3D 0; i < len; i++)
>>>>>>> +    p[i] =3D p[i] / x[i];
>>>>>>> +}
>>>>>>> +
>>>>>>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"=
 } } */
>>>>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-2.c b/gcc/testsu=
ite/gcc.target/arm/vect-div-2.c
>>>>>>> new file mode 100644
>>>>>>> index 0000000..8e15d0a
>>>>>>> --- /dev/null
>>>>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-2.c
>>>>>>> @@ -0,0 +1,14 @@
>>>>>>> +/* { dg-do compile } */
>>>>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */
>>>>>>
>>>>>> And likewise.
>>>>>>
>>>>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -fno-reciprocal-m=
ath -ftree-vectorize -fdump-tree-vect-all" } */
>>>>>>> +/* { dg-add-options arm_v8_neon } */
>>>>>>> +
>>>>>>> +void
>>>>>>> +foo (int len, float * __restrict p, float *__restrict x)
>>>>>>> +{
>>>>>>> +  len =3D len & ~31;
>>>>>>> +  for (int i =3D 0; i < len; i++)
>>>>>>> +    p[i] =3D p[i] / x[i];
>>>>>>> +}
>>>>>>> +
>>>>>>> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect"=
 } } */
>>>>>>
>>>>>>
>>>>>> regards
>>>>>> Ramana