From: "fxue at os dot amperecomputing.com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/113091] New: Over-estimate SLP
 vector-to-scalar cost for non-live pattern statement
Date: Wed, 20 Dec 2023 09:54:01 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091

            Bug ID: 113091
           Summary: Over-estimate SLP vector-to-scalar cost for non-live
                    pattern statement
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

GCC fails to vectorize the testcase below on aarch64.

  int test(unsigned array[8]);

  int foo(char *a, char *b)
  {
    unsigned array[8];

    array[0] = (a[0] - b[0]);
    array[1] = (a[1] - b[1]);
    array[2] = (a[2] - b[2]);
    array[3] = (a[3] - b[3]);
    array[4] = (a[4] - b[4]);
    array[5] = (a[5] - b[5]);
    array[6] = (a[6] - b[6]);
    array[7] = (a[7] - b[7]);

    return test(array);
  }

The dump shows that the loads from a[i] and b[i] are considered live as
scalar references, which results in an over-estimated vector-to-scalar cost.

*a_50(D) 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)a_50(D) + 1B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)a_50(D) + 2B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)a_50(D) + 3B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)a_50(D) + 4B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)a_50(D) + 5B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)a_50(D) + 6B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)a_50(D) + 7B] 1 times vec_to_scalar costs 2 in epilogue
*b_51(D) 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)b_51(D) + 1B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)b_51(D) + 2B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)b_51(D) + 3B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)b_51(D) + 4B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)b_51(D) + 5B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)b_51(D) + 6B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(char *)b_51(D) + 7B] 1 times vec_to_scalar costs 2 in epilogue

Subtraction on char operands is recognized as a widening subtraction
(widen-sub), which involves two kinds of pattern replacement.

 * Original
 _1 = *a_50(D);
 _2 = (int) _1;
 _3 = *b_51(D);
 _4 = (int) _3;
 _5 = _2 - _4;


 * After pattern replacement
 patt_63 = (unsigned short) _1;  //  _2 = (int) _1;
 patt_64 = (int) patt_63;        //  _2 = (int) _1;

 patt_65 = (unsigned short) _3;  //  _4 = (int) _3;
 patt_66 = (int) patt_65;        //  _4 = (int) _3;

 patt_67 = .VEC_WIDEN_MINUS (_1, _3);  //  _5 = _2 - _4;
 patt_68 = (signed short) patt_67;     //  _5 = _2 - _4;
 patt_69 = (int) patt_68;              //  _5 = _2 - _4;

For the statement "_2 = (int) _1", its vectorization representative
"patt_64 = (int) patt_63" is not marked as PURE_SLP, so it is conservatively
considered to have a scalar use and to be live outside of the SLP bb (in the
function vect_bb_slp_mark_live_stmts). However, the pattern definition is
actually dead and should not contribute to the vector-to-scalar cost.

Defs from pattern statements are not part of the function body, so we cannot
track their def/use chains as we do for ordinary SSA names. We could probably
add a quick fix for one situation: if the original SSA name "_2" has a single
use, it can only be needed by the vectorized operation, regardless of how it
would be handled without pattern replacement.