From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-419279-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 96876 invoked by alias); 17 Jan 2016 09:06:26 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 96849 invoked by uid 89); 17 Jan 2016 09:06:24 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.7 required=5.0 tests=AWL,BAYES_40,RCVD_IN_DNSWL_LOW,SPF_PASS autolearn=ham version=3.3.2 spammy=021, div, 646, f16
X-HELO: mail-ig0-f180.google.com
Received: from mail-ig0-f180.google.com (HELO mail-ig0-f180.google.com) (209.85.213.180) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-GCM-SHA256 encrypted) ESMTPS; Sun, 17 Jan 2016 09:06:22 +0000
Received: by mail-ig0-f180.google.com with SMTP id t15so38015737igr.0        for <gcc-patches@gcc.gnu.org>; Sun, 17 Jan 2016 01:06:22 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;        d=1e100.net; s=20130820;        h=x-gm-message-state:mime-version:in-reply-to:references:date         :message-id:subject:from:to:cc:content-type         :content-transfer-encoding;        bh=vXFc4TRR6znziiqMObn9pw2aHhg1RGZtgqicJ36+x7M=;        b=e09PGIe7iJmX0Ptzj/RM+LTY/xxLl7jupFIFFL7KeM6/dsDAUTkyhKidQrg8AQvP1y         1w2bdTb81HGr5SYm9xMsiUPE1oAB1PMMkIILBBz8X822QbiAR35coUiVBmBjtgWEzoOV         kqd/RwDrQaZ0HW+YeFxcpDs7GnMegcrkIOQ+Dm5QyzxTQyFkO55keev2/5wyqTa10uva         VkBOaPPhxXwZQwvmScn/cLi8kwSCZ4oKYvxOPpp2Wa+2lVm4zhrXWuTkfGQyvRi2IAU1         DHjIQZL6uqq6c3BnWCErLZjg2r8kpL7zkzms8WSCLWDeUiUeUfrqh5JGoqrkd4oV2tkz         igUw==
X-Gm-Message-State: AG10YOQTT8VBBH/KvnI0dmO2o6XJssMfSaIC0e4vOWMj7BPeWvANAPfr289ZO7Ojd7nhVXj6OPsuxBC4rvPsHjz3
MIME-Version: 1.0
X-Received: by 10.50.62.103 with SMTP id x7mr2954219igr.39.1453021580820; Sun, 17 Jan 2016 01:06:20 -0800 (PST)
Received: by 10.36.214.70 with HTTP; Sun, 17 Jan 2016 01:06:20 -0800 (PST)
In-Reply-To: <55BB4127.5050202@foss.arm.com>
References: <CAAgBjMk0Hdask2JU8xs4fj_Ai1e0ggxB+h3ayb=NOGQBYJ8ccQ@mail.gmail.com>	<55BB4127.5050202@foss.arm.com>
Date: Sun, 17 Jan 2016 09:06:00 -0000
Message-ID: <CAAgBjMnteaOO=JFifLD1CG7s-JE5r=65wcQm+HBwSMtpwg3ReA@mail.gmail.com>
Subject: Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations
From: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
To: Ramana Radhakrishnan <ramana.radhakrishnan@foss.arm.com>
Cc: gcc Patches <gcc-patches@gcc.gnu.org>, Charles Baylis <charles.baylis@linaro.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-IsSubscribed: yes
X-SW-Source: 2016-01/txt/msg01209.txt.bz2

On 31 July 2015 at 15:04, Ramana Radhakrishnan
<ramana.radhakrishnan@foss.arm.com> wrote:
>
>
> On 29/07/15 11:09, Prathamesh Kulkarni wrote:
>> Hi,
>> This patch tries to implement division with multiplication by
>> reciprocal using vrecpe/vrecps
>> with -funsafe-math-optimizations and -freciprocal-math enabled.
>> Tested on arm-none-linux-gnueabihf using qemu.
>> OK for trunk ?
>>
>> Thank you,
>> Prathamesh
>>
>
> I've tried this in the past and never been convinced that 2 iterations
> are enough to get to stability with this given that the results are only
> precise for 8 bits / iteration. Thus I've always believed you need 3
> iterations rather than 2 at which point I've never been sure that it's
> worth it. So the testing that you've done with this currently is not
> enough for this to go into the tree.
>
> I'd like this to be tested on a couple of different AArch32
> implementations with a wider range of inputs to verify that the results
> are acceptable, as well as running something like SPEC2k(6) with at least
> one iteration to ensure correctness.
Hi,
I got results of SPEC2k6 fp benchmarks:
a15: +0.64% overall, 481.wrf: +6.46%
a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76%
a57: +0.35% overall, 481.wrf: +3.84%
The other benchmarks had (almost) identical results.
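
For reference, the sequence the expander emits can be modelled in scalar C.
This is only a sketch under stated assumptions: estimate_recip is a
hypothetical stand-in for VRECPE (the exact reciprocal truncated to 8
mantissa bits, not the architectural estimate), and all function names are
made up for illustration. recip_step mirrors what VRECPS computes, 2 - r*d.
In exact arithmetic each Newton-Raphson step squares the relative error of
the current estimate; the model says nothing about how real hardware
behaves across the full input range.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for VRECPE: take the exact reciprocal and keep
   only the sign, the exponent and the top 8 mantissa bits.  */
static float estimate_recip(float d)
{
    float r = 1.0f / d;
    uint32_t bits;
    memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFF8000u;            /* clear the low 15 mantissa bits */
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* Same arithmetic as VRECPS: 2 - r * d.  */
static float recip_step(float r, float d)
{
    return 2.0f - r * d;
}

/* Estimate followed by `iterations` refinement steps, matching the
   vrecps/mul pairs in the loop of the expander.  */
static float approx_recip(float d, int iterations)
{
    assert(iterations >= 0);
    float r = estimate_recip(d);
    for (int i = 0; i < iterations; i++)
        r = r * recip_step(r, d);
    return r;
}

/* n / d computed as n * (1 / d), as the pattern does.  */
static float approx_div(float n, float d, int iterations)
{
    return n * approx_recip(d, iterations);
}

/* Relative error of the refined reciprocal, evaluated in double.  */
static double rel_error(float d, int iterations)
{
    return fabs((double)approx_recip(d, iterations) * (double)d - 1.0);
}
```

Calling rel_error (d, 0), rel_error (d, 1) and rel_error (d, 2) over a range
of d gives a quick view of how much each step buys in this model.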

Thanks,
Prathamesh
>
>
> moving on to the patches.
>
>> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
>> index 654d9d5..28c2e2a 100644
>> --- a/gcc/config/arm/neon.md
>> +++ b/gcc/config/arm/neon.md
>> @@ -548,6 +548,32 @@
>>                      (const_string "neon_mul_<V_elem_ch><q>")))]
>>  )
>>
>
> Please add a comment here.
>
>> +(define_expand "div<mode>3"
>> +  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
>> +        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
>> +               (match_operand:VCVTF 2 "s_register_operand" "w")))]
>
> I want to double check that this doesn't collide with Alan's patches for
> FP16 especially if he reuses the VCVTF iterator for all the vcvt f16 cases.
>
>> +  "TARGET_NEON && flag_unsafe_math_optimizations && flag_reciprocal_math"
>> +  {
>> +    rtx rec = gen_reg_rtx (<MODE>mode);
>> +    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
>> +
>> +    /* Reciprocal estimate */
>> +    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
>> +
>> +    /* Perform 2 iterations of Newton-Raphson method for better accuracy */
>> +    for (int i = 0; i < 2; i++)
>> +      {
>> +     emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
>> +     emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
>> +      }
>> +
>> +    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec */
>> +    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
>> +    DONE;
>> +  }
>> +)
>> +
>> +
>>  (define_insn "mul<mode>3add<mode>_neon"
>>    [(set (match_operand:VDQW 0 "s_register_operand" "=w")
>>          (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")
>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c b/gcc/testsuite/gcc.target/arm/vect-div-1.c
>> new file mode 100644
>> index 0000000..e562ef3
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c
>> @@ -0,0 +1,14 @@
>> +/* { dg-do compile } */
>> +/* { dg-require-effective-target arm_v8_neon_ok } */
>> +/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize -fdump-tree-vect-all" } */
>> +/* { dg-add-options arm_v8_neon } */
>
> No this is wrong.
>
> What is armv8 specific about this test ? This is just like another test
> that is for Neon. vrecpe / vrecps are not instructions that were
> introduced in the v8 version of the architecture. They've existed in the
> base Neon instruction set. The code generation above in the patterns will
> be enabled when TARGET_NEON is true which can happen when -mfpu=neon
> -mfloat-abi={softfp/hard} is true.
>
>> +
>> +void
>> +foo (int len, float * __restrict p, float *__restrict x)
>> +{
>> +  len = len & ~31;
>> +  for (int i = 0; i < len; i++)
>> +    p[i] = p[i] / x[i];
>> +}
>> +
>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-2.c b/gcc/testsuite/gcc.target/arm/vect-div-2.c
>> new file mode 100644
>> index 0000000..8e15d0a
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-2.c
>> @@ -0,0 +1,14 @@
>> +/* { dg-do compile } */
>> +/* { dg-require-effective-target arm_v8_neon_ok } */
>
> And likewise.
>
>> +/* { dg-options "-O2 -funsafe-math-optimizations -fno-reciprocal-math -ftree-vectorize -fdump-tree-vect-all" } */
>> +/* { dg-add-options arm_v8_neon } */
>> +
>> +void
>> +foo (int len, float * __restrict p, float *__restrict x)
>> +{
>> +  len = len & ~31;
>> +  for (int i = 0; i < len; i++)
>> +    p[i] = p[i] / x[i];
>> +}
>> +
>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
>
>
> regards
> Ramana
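
The objection to the arm_v8_neon directives could be addressed along the
following lines. This is a sketch, not the revised patch: it assumes that
the base-Neon effective target arm_neon_ok and the matching
"dg-add-options arm_neon" from the GCC testsuite are the right
replacements, and it leaves the rest of vect-div-1.c as posted.

```c
/* { dg-do compile } */
/* { dg-require-effective-target arm_neon_ok } */
/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize -fdump-tree-vect-all" } */
/* { dg-add-options arm_neon } */

/* Element-wise division; the length is rounded down to a multiple of 32
   so the vectorizer does not need an epilogue loop.  */
void
foo (int len, float * __restrict p, float *__restrict x)
{
  len = len & ~31;
  for (int i = 0; i < len; i++)
    p[i] = p[i] / x[i];
}

/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
```

vect-div-2.c would presumably take the same two directive changes and keep
-fno-reciprocal-math in its dg-options.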