From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <richard.sandiford@arm.com>
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
 by sourceware.org (Postfix) with ESMTP id D10A63857805;
 Wed, 20 Jan 2021 14:14:38 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org D10A63857805
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
 by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 303B0D6E;
 Wed, 20 Jan 2021 06:14:38 -0800 (PST)
Received: from localhost (e121540-lin.manchester.arm.com [10.32.98.126])
 by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 5B8F73F66E;
 Wed, 20 Jan 2021 06:14:37 -0800 (PST)
From: Richard Sandiford <richard.sandiford@arm.com>
To: Hongtao Liu <crazylht@gmail.com>
Mail-Followup-To: Hongtao Liu <crazylht@gmail.com>,
 Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org>,
 ebotcazou@libertysurf.fr, steven@gcc.gnu.org, Jakub Jelinek <jakub@redhat.com>,
 richard.sandiford@arm.com
Cc: Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org>,
 ebotcazou@libertysurf.fr, steven@gcc.gnu.org, Jakub Jelinek <jakub@redhat.com>
Subject: Re: [PATCH] [PR rtl/optimization/98694] Fix incorrect optimization by
 cprop_hardreg.
References: <CAMZc-by1DCvkq-TwSssx6LDUKbTrBNC6VPCU-amzwQ5GPzousw@mail.gmail.com>
 <mpta6t67dkj.fsf@arm.com>
 <CAMZc-byONEr3Kpj7ThTERK6aVKyg8_uPwQ0zigYGpA_nGgVkNg@mail.gmail.com>
 <mpt35yy7b74.fsf@arm.com>
 <CAMZc-bwJwbnWafSQsWXoSR7-0PysyfXNTTGqajpUmgoREgvVUQ@mail.gmail.com>
 <mpteeih3xuw.fsf@arm.com> <20210119144514.GA4020736@tucnak>
 <mpt35yw3o1y.fsf@arm.com>
 <CAMZc-byHHb4ZUur0TQ-yeNUvm_oHstbProRsSCLPED7eRHaOsQ@mail.gmail.com>
Date: Wed, 20 Jan 2021 14:14:35 +0000
In-Reply-To: <CAMZc-byHHb4ZUur0TQ-yeNUvm_oHstbProRsSCLPED7eRHaOsQ@mail.gmail.com>
 (Hongtao Liu's message of "Wed, 20 Jan 2021 12:35:15 +0800")
Message-ID: <mptzh13zodw.fsf@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-12.8 required=5.0 tests=BAYES_00, GIT_PATCH_0,
 KAM_DMARC_STATUS, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Wed, 20 Jan 2021 14:14:40 -0000

Hongtao Liu <crazylht@gmail.com> writes:
> On Wed, Jan 20, 2021 at 12:10 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> > On Tue, Jan 19, 2021 at 12:38:47PM +0000, Richard Sandiford via Gcc-pa=
tches wrote:
>> >> > actually only the lower 16bits are needed, the original insn is like
>> >> >
>> >> > .294.r.ira
>> >> > (insn 69 68 70 13 (set (reg:HI 96 [ _52 ])
>> >> >         (subreg:HI (reg:DI 82 [ var_6.0_1 ]) 0)) "test.c":21:23 76
>> >> > {*movhi_internal}
>> >> >      (nil))
>> >> > (insn 78 75 82 13 (set (reg:V4HI 140 [ _283 ])
>> >> >         (vec_duplicate:V4HI (truncate:HI (subreg:SI (reg:HI 96 [ _52
>> >> > ]) 0)))) 1412 {*vec_dupv4hi}
>> >> >      (nil))
>> >> >
>> >> > .295r.reload
>> >> > (insn 69 68 70 13 (set (reg:HI 5 di [orig:96 _52 ] [96])
>> >> >         (reg:HI 68 k0 [orig:82 var_6.0_1 ] [82])) "test.c":21:23 76
>> >> > {*movhi_internal}
>> >> >      (nil))
>> >> > (insn 489 75 78 13 (set (reg:SI 22 xmm2 [297])
>> >> >         (reg:SI 5 di [orig:96 _52 ] [96])) 75 {*movsi_internal}
>> >> >      (nil))
>> >> > (insn 78 489 490 13 (set (reg:V4HI 20 xmm0 [orig:140 _283 ] [140])
>> >> >         (vec_duplicate:V4HI (truncate:HI (reg:SI 22 xmm2 [297]))))
>> >> > 1412 {*vec_dupv4hi}
>> >> >      (nil))
>> >> >
>> >> > and insn 489 is created by lra/reload which seems ok for the sequen=
ce,
>> >> > but problemistic with considering the logic of hardreg_cprop.
>> >>
>> >> It looks OK even with the regcprop behaviour though:
>> >>
>> >> - insn 69 defines only the low 16 bits of di,
>> >> - insn 489 defines only the low 16 bits of xmm2, but copies bits 16-31
>> >>   too (with unknown contents)
>> >> - insn 78 uses only the low 16 bits of xmm2 (the unknown contents
>> >>   introduced by insn 489 are truncated away)
>> >>
>> >> So where do bits 16-31 become significant?  What goes wrong if they're
>> >> not zero?
>> >
>> > The k0 register is initialized I believe with
>> > (insn 20 2 21 2 (set (reg:DI 68 k0 [orig:82 var_6.0_1 ] [82])
>> >         (mem/c:DI (symbol_ref:DI ("var_6") [flags 0x40]  <var_decl 0x7=
f7babeaaf30 var_6>) [3 var_6+0 S8 A64])) "pr98694.C":21:10 74 {*movdi_inter=
nal}
>> >      (nil))
>> > and so it contains all 64-bits, and then the code sometimes uses all t=
he
>> > bits, sometimes just the low 16-bits and sometimes low 32-bits of that
>> > value.
>> > (insn 69 68 70 12 (set (reg:HI 5 di [orig:96 _52 ] [96])
>> >         (reg:HI 68 k0 [orig:82 var_6.0_1 ] [82])) "pr98694.C":27:23 76=
 {*movhi_internal}
>> >      (nil))
>> > (insn 74 73 75 12 (set (reg:SI 36 r8 [orig:149 _52 ] [149])
>> >         (zero_extend:SI (reg:HI 68 k0 [orig:82 var_6.0_1 ] [82]))) 144=
 {*zero_extendhisi2}
>> >      (nil))
>> > (insn 489 75 78 12 (set (reg:SI 22 xmm2 [297])
>> >         (reg:SI 5 di [orig:96 _52 ] [96])) 75 {*movsi_internal}
>> >      (nil))
>> > (insn 78 489 490 12 (set (reg:V4HI 20 xmm0 [orig:140 _283 ] [140])
>> >         (vec_duplicate:V4HI (truncate:HI (reg:SI 22 xmm2 [297])))) 141=
2 {*vec_dupv4hi}
>> >      (expr_list:REG_DEAD (reg:SI 22 xmm2 [297])
>> >         (nil)))
>> > are examples when it uses only the low 16 bits from that, and
>> > (insn 487 72 73 12 (set (reg:SI 1 dx [148])
>> >         (reg:SI 68 k0 [orig:82 var_6.0_1 ] [82])) 75 {*movsi_internal}
>> >      (nil))
>> >
>> > (insn 85 84 491 13 (set (reg:SI 37 r9 [orig:86 _11 ] [86])
>> >         (reg:SI 68 k0 [orig:82 var_6.0_1 ] [82])) "pr98694.C":28:14 75=
 {*movsi_internal}
>> >      (nil))
>> >
>> > (insn 491 85 88 13 (set (reg:SI 3 bx [299])
>> >         (reg:SI 68 k0 [orig:82 var_6.0_1 ] [82])) 75 {*movsi_internal}
>> >      (nil))
>> > (insn 88 491 89 13 (set (reg:CCNO 17 flags)
>> >         (compare:CCNO (reg:SI 3 bx [299])
>> >             (const_int 0 [0]))) 7 {*cmpsi_ccno_1}
>> >      (expr_list:REG_DEAD (reg:SI 3 bx [299])
>> >         (nil)))
>> >
>> > (insn 457 499 460 33 (set (reg:SI 39 r11 [orig:86 _11 ] [86])
>> >         (reg:SI 37 r9 [orig:86 _11 ] [86])) "pr98694.C":35:36 75 {*mov=
si_internal}
>> >      (expr_list:REG_DEAD (reg:SI 37 r9 [orig:86 _11 ] [86])
>> >         (nil)))
>> > are examples where it uses low 32-bits from k0.
>> > So the
>> >  (insn 457 499 460 33 (set (reg:SI 39 r11 [orig:86 _11 ] [86])
>> > -        (reg:SI 37 r9 [orig:86 _11 ] [86])) "pr98694.C":35:36 75 {*mo=
vsi_internal}
>> > -     (expr_list:REG_DEAD (reg:SI 37 r9 [orig:86 _11 ] [86])
>> > +        (reg:SI 22 xmm2 [orig:86 _11 ] [86])) "pr98694.C":35:36 75 {*=
movsi_internal}
>> > +     (expr_list:REG_DEAD (reg:SI 22 xmm2 [orig:86 _11 ] [86])
>> >          (nil)))
>> > cprop_hardreg change indeed looks bogus, while xmm2 has SImode, it hol=
ds
>> > only the low 16-bits of the value and has the upper bits undefined, wh=
ile r9
>> > it is replacing had all of the low 32-bits well defined.
>>
>> Ah, ok, thanks for the extra context.
>>
>> So AIUI the problem when recording xmm2<-di isn't just:
>>
>>  [A] partial_subreg_p (vd->e[sr].mode, GET_MODE (src))
>>
>> but also that:
>>
>>  [B] partial_subreg_p (vd->e[sr].mode, vd->e[vd->e[sr].oldest_regno].mod=
e)
>>
>> For example, all registers in this sequence can be part of the same chai=
n:
>>
>>     (set (reg:HI R1) (reg:HI R0))
>>     (set (reg:SI R2) (reg:SI R1)) // [A]
>>     (set (reg:DI R3) (reg:DI R2)) // [A]
>>     (set (reg:SI R4) (reg:SI R[0-3]))
>>     (set (reg:HI R5) (reg:HI R[0-4]))
>>
>> But:
>>
>>     (set (reg:SI R1) (reg:SI R0))
>>     (set (reg:HI R2) (reg:HI R1))
>>     (set (reg:SI R3) (reg:SI R2)) // [A] && [B]
>>
>> is problematic because it dips below the precision of the oldest regno
>> and then increases again.
>>
>> When this happens, I guess we have two choices:
>>
>> (1) what the patch does: treat R3 as the start of a new chain.
>> (2) pretend that the copy occured in vd->e[sr].mode instead
>>     (i.e. copy vd->e[sr].mode to vd->e[dr].mode)
>>
>> I guess (2) would need to be subject to REG_CAN_CHANGE_MODE_P.
>> Maybe the optimisation provided by (2) compared to (1) isn't common
>> enough to be worth the complication.
>>
>> I think we should test [B] as well as [A] though.  The pass is set
>> up to do some quite elaborate mode changes and I think rejecting
>> [A] on its own would make some of the other code redundant.
>> It also feels like it should be a seperate =E2=80=9Cif=E2=80=9D or =E2=
=80=9Celse if=E2=80=9D,
>> with its own comment.
>>
> Update patch.
>> Thanks,
>> Richard
>
>
>
> --=20
> BR,
> Hongtao
>
> From a52b3c8a90a0bf6cbda8ce86d99c82c6182863a7 Mon Sep 17 00:00:00 2001
> From: liuhongt <hongtao.liu@intel.com>
> Date: Mon, 18 Jan 2021 16:55:32 +0800
> Subject: [PATCH] [PR rtl/optimization/98694] Fix incorrect optimization by
>  cprop_hardreg.
>
> If SRC had been assigned a mode narrower than the copy, we can't link
> DEST into the chain even they have same
> hard_regno_nregs(i.e. HImode/SImode in i386 backend).

This is a bit out of date now.  Maybe just say =E2=80=9Ccan't always link=
=E2=80=9D
instead of just =E2=80=9Ccan't link=E2=80=9D.

> i.e
>         kmovw   %k0, %edi
>         vmovd   %edi, %xmm2
> 	vpshuflw        $0, %xmm2, %xmm0
>         kmovw   %k0, %r8d
>         kmovd   %k0, %r9d
> ...
> -	 movl %r9d, %r11d
> +	 vmovd %xmm2, %r11d
>
> gcc/ChangeLog:
>
> 	PR rtl-optimization/98694
> 	* regcprop.c (copy_value): If SRC had been assigned a mode
> 	narrower than the copy, we can't link DEST into the chain even
> 	they have same hard_regno_nregs(i.e. HImode/SImode in i386
> 	backend).
>
> gcc/testsuite/ChangeLog:
>
> 	PR rtl-optimization/98694
> 	* gcc.target/i386/pr98694.c: New test.
> ---
>  gcc/regcprop.c                          | 33 +++++++++++++++++++++
>  gcc/testsuite/gcc.target/i386/pr98694.c | 38 +++++++++++++++++++++++++
>  2 files changed, 71 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr98694.c
>
> diff --git a/gcc/regcprop.c b/gcc/regcprop.c
> index dd62cb36013..908298beaea 100644
> --- a/gcc/regcprop.c
> +++ b/gcc/regcprop.c
> @@ -358,6 +358,39 @@ copy_value (rtx dest, rtx src, struct value_data *vd)
>    else if (sn > hard_regno_nregs (sr, vd->e[sr].mode))
>      return;
>=20=20
> +  /* If SRC had been assigned a mode narrower than the copy, Although
> +     they have same hard_regno_nregs, it's not safe to link DEST into the
> +     chain. .i.e.

How about:

  It is not safe to link DEST into the chain if SRC was defined in some
  narrower mode M and if M is also narrower than the mode of the first
  register in the chain.  For example:

> +     (set (reg:DI r1) (reg:DI r0))
> +     (set (reg:HI r2) (reg:HI r1))
> +     (set (reg:SI r3) (reg:SI r2)) //Should be a new chain start at r3
> +     (set (reg:SI r4) (reg:SI r1))
> +     (set (reg:SI r5) (reg:SI r4))
> +     the upper part of r3 is undefined, if adding it to the chain, it ma=
y be
> +     prop to r5 which has defined upper bits, .i.e. pr98694.

And for this:

     (set (reg:DI r1) (reg:DI r0))
     (set (reg:HI r2) (reg:HI r1))
     (set (reg:SI r3) (reg:SI r2)) // Should be a new chain starting at r3
     (set (reg:SI r4) (reg:SI r1))
     (set (reg:SI r5) (reg:SI r4))

  the upper part of r3 is undefined.  If we added it to the chain,
  it may be used to replace r5, which has defined upper bits.
  See PR98694 for details.

> +
> +     [A] partial_subreg_p (vd->e[sr].mode, GET_MODE (src))
> +     [B] partial_subreg_p (vd->e[sr].mode, vd->e[vd->e[sr].oldest_regno]=
.mode)
> +     Condition B is added to to catch optimization opportunities of
> +
> +     (set (reg:HI R1) (reg:HI R0))
> +     (set (reg:SI R2) (reg:SI R1)) // [A]
> +     (set (reg:DI R3) (reg:DI R2)) // [A]
> +     (set (reg:SI R4) (reg:SI R[0-3]))
> +     (set (reg:HI R5) (reg:HI R[0-4]))
> +

Maybe add here:

  in which all registers have only 16 defined bits.

> +     but problematic for
> +
> +     (set (reg:SI R1) (reg:SI R0))
> +     (set (reg:HI R2) (reg:HI R1))
> +     (set (reg:SI R3) (reg:SI R2)) // [A] && [B]
> +
> +     to be fixed????   */

I think we should drop this part.  Your example above covers it in
more detail.

OK with those changes and the one that HJ asked for.

Thanks,
Richard