From: "cvs-commit at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/108270] un-optimal vsetvl for multi-loop if avl is 0 ~ 31 immediate
Date: Fri, 21 Apr 2023 09:49:41 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108270

--- Comment #2 from CVS Commits ---
The master branch has been updated by Kito Cheng:

https://gcc.gnu.org/g:d06e9264b0192c2c77e07d7fb0fe090efcb510c0

commit r14-135-gd06e9264b0192c2c77e07d7fb0fe090efcb510c0
Author: Juzhe-Zhong
Date:   Fri Apr 21 17:19:12 2023 +0800

    RISC-V: Defer vsetvli insertion to later if possible [PR108270]

Fix issue: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108270.

Consider the following testcase:

    void f (void * restrict in, void * restrict out, int l, int n, int m)
    {
      for (int i = 0; i < l; i++) {
        for (int j = 0; j < m; j++) {
          for (int k = 0; k < n; k++) {
            vint8mf8_t v = __riscv_vle8_v_i8mf8 (in + i + j, 17);
            __riscv_vse8_v_i8mf8 (out + i + j, v, 17);
          }
        }
      }
    }

Compile option: -O3

Before this patch:

    mv a7,a2
    mv a6,a0
    mv t1,a1
    mv a2,a3
    vsetivli zero,17,e8,mf8,ta,ma
    ble a7,zero,.L1
    ble a4,zero,.L1
    ble a3,zero,.L1
    ...

After this patch:

    mv a7,a2
    mv a6,a0
    mv t1,a1
    mv a2,a3
    ble a7,zero,.L1
    ble a4,zero,.L1
    ble a3,zero,.L1
    add a1,a0,a4
    li a0,0
    vsetivli zero,17,e8,mf8,ta,ma
    ...

This issue is a missed optimization produced by Phase 3 (global backward
demand fusion) rather than by LCM. The patch fixes the poor placement of
the vsetvl: the insertion point is selected not by LCM but by Phase 3
(VL/VTYPE demand info backward fusion and propagation), which I introduced
into the VSETVL PASS to enhance LCM and improve vsetvl instruction
performance. The patch suppresses Phase 3's overly aggressive backward
fusion and propagation to the top of the function when the AVL has no
defining instruction (here the AVL is a 0 ~ 31 immediate, since the
vsetivli instruction accepts an immediate value instead of a register).
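For contrast with the 0 ~ 31 immediate case, here is a minimal register-AVL
sketch; it is not part of the patch or its testcases, the function name g
and the strip-mining loop shape are only illustrative, and it assumes
<riscv_vector.h> plus GCC's arithmetic-on-void* extension, as the testcases
above do. Because the AVL is produced by __riscv_vsetvl_e8mf8, the emitted
vsetvli reads a register and therefore has a defining instruction that
constrains where it can be placed:

    #include <stddef.h>
    #include <riscv_vector.h>

    void g (void *restrict in, void *restrict out, size_t n)
    {
      for (size_t i = 0; i < n;)
        {
          /* Register AVL: vl has a real definition here, so the vsetvli
             generated for the loop body is anchored by this instruction.  */
          size_t vl = __riscv_vsetvl_e8mf8 (n - i);
          vint8mf8_t v = __riscv_vle8_v_i8mf8 (in + i, vl);
          __riscv_vse8_v_i8mf8 (out + i, v, vl);
          i += vl;
        }
    }

With the literal 17 in the testcase above there is no such AVL definition,
which is exactly what let Phase 3 float the vsetivli above the early-exit
branches before this patch.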
You may want to ask why we need Phase 3 to do the job at all. Well, there
are many situations that pure LCM fails to optimize; here is a simple case
to demonstrate it:

    void f (void * restrict in, void * restrict out, int n, int m, int cond)
    {
      size_t vl = 101;
      for (size_t j = 0; j < m; j++) {
        if (cond) {
          for (size_t i = 0; i < n; i++) {
            vint8mf8_t v = __riscv_vle8_v_i8mf8 (in + i + j, vl);
            __riscv_vse8_v_i8mf8 (out + i, v, vl);
          }
        } else {
          for (size_t i = 0; i < n; i++) {
            vint32mf2_t v = __riscv_vle32_v_i32mf2 (in + i + j, vl);
            v = __riscv_vadd_vv_i32mf2 (v, v, vl);
            __riscv_vse32_v_i32mf2 (out + i, v, vl);
          }
        }
      }
    }

You can see that the first inner loop needs vsetvli e8 mf8 for vle+vse,
while the second inner loop needs vsetvli e32 mf2 for vle+vadd+vse.

If we did not have Phase 3 (everything handled only by LCM, Phase 4), we
would end up with:

    outerloop:
      ...
      vsetvli e8mf8
      inner loop 1:
        ....
      vsetvli e32mf2
      inner loop 2:
        ....

However, with Phase 3, the vsetvli e32 mf2 of inner loop 2 is fused into
the vsetvli e8 mf8, so after Phase 3 we end up with:

    outerloop:
      ...
      inner loop 1:
        vsetvli e32mf2
        ....
      inner loop 2:
        vsetvli e32mf2
        ....

This demand information is then optimized well by Phase 4 (LCM), and the
result after Phase 4 is:

    vsetvli e32mf2
    outerloop:
      ...
      inner loop 1:
        ....
      inner loop 2:
        ....

You can see this is the optimal codegen with the current VSETVL PASS
(Phase 3: demand backward fusion and propagation + Phase 4: LCM). This was
a known issue when I started to implement the VSETVL PASS.

gcc/ChangeLog:

        PR target/108270
        * config/riscv/riscv-vsetvl.cc
        (vector_infos_manager::all_empty_predecessor_p): New function.
        (pass_vsetvl::backward_demand_fusion): Ditto.
        * config/riscv/riscv-vsetvl.h: Ditto.

gcc/testsuite/ChangeLog:

        PR target/108270
        * gcc.target/riscv/rvv/vsetvl/imm_bb_prop-1.c: Adapt testcase.
        * gcc.target/riscv/rvv/vsetvl/imm_conflict-3.c: Ditto.
        * gcc.target/riscv/rvv/vsetvl/pr108270.c: New test.