From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id A9FEB3858C30; Wed, 31 May 2023 16:11:20 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org A9FEB3858C30
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1685549480;
	bh=1VaN57LJWDRu3Pm+OVvYVsgmEIDjC/Ki1BMOSWZbJEU=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=eFF87HZPmlFT0OH0XlaAvoysgTkA87ZBvKU0nTnjrFtdo8eQPraeweKuoaBnVSEZN
	 tsaoYl+gpZqbMmff8B2sLrGWqgdFOm8aomd3eMLg1hJisueyLTI7qc0SL/tUxfGl7p
	 Am2xo+rZ07vOFcNo72cifNCLrKRd5EIlWPx/ikpQ=
From: "hubicka at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/109812] GraphicsMagick resize is a lot slower in GCC
 13.1 vs Clang 16 on Intel Raptor Lake
Date: Wed, 31 May 2023 16:11:19 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 13.1.1
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: hubicka at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: cc see_also
Message-ID: <bug-109812-4-yumD54N8hU@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-109812-4@http.gcc.gnu.org/bugzilla/>
References: <bug-109812-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenther at suse dot de
           See Also|                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=110062
--- Comment #13 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
The only difference in what SLP vectorization sees is:

-  # _68 = PHI <_5(3)>
-  # _67 = PHI <_11(3)>
-  # _66 = PHI <_16(3)>
-  <retval>.r = _68;
-  <retval>.g = _67;
-  <retval>.b = _66;
+  # _70 = PHI <_5(3)>
+  # _69 = PHI <_11(3)>
+  # _68 = PHI <_16(3)>
+  <retval>.r = _70;
+  <retval>.g = _69;
+  <retval>.b = _68;
+  <retval>.o = r$o_33(D);

So SRA invents r$o_33(D) even though that variable is undefined.

The SLP vectorizer then sees it as an interleaving store:

-t.c:19:16: note:       _1 = rgbs[i_35].r;
-t.c:19:16: note:       _7 = rgbs[i_35].g;
-t.c:19:16: note:       _12 = rgbs[i_35].b;
-t.c:19:16: note:   Detected interleaving store of size 3
-t.c:19:16: note:       <retval>.r = _68;
-t.c:19:16: note:       <retval>.g = _67;
-t.c:19:16: note:       <retval>.b = _66;
+t.c:19:16: note:       _1 = rgbs[i_37].r;
+t.c:19:16: note:       _7 = rgbs[i_37].g;
+t.c:19:16: note:       _12 = rgbs[i_37].b;
+t.c:19:16: note:   Detected interleaving store of size 4
+t.c:19:16: note:       <retval>.r = _70;
+t.c:19:16: note:       <retval>.g = _69;
+t.c:19:16: note:       <retval>.b = _68;
+t.c:19:16: note:       <retval>.o = r$o_33(D);

In the first case it first tries to vectorize for a vector of 3 doubles and fails:

-t.c:19:16: note:     <retval>.r = _68;
-t.c:19:16: note:     <retval>.g = _67;
-t.c:19:16: note:     <retval>.b = _66;
-t.c:19:16: note:   starting SLP discovery for node 0x2cb4fe8
-t.c:19:16: note:   Build SLP for <retval>.r = _68;
-t.c:19:16: note:   get vectype for scalar type (group size 3): double
-t.c:19:16: note:   vectype: vector(2) double
-t.c:19:16: note:   nunits = 2
-t.c:19:16: missed:   Build SLP failed: unrolling required in basic block SLP
-t.c:19:16: note:   Build SLP for <retval>.g = _67;
-t.c:19:16: note:   get vectype for scalar type (group size 3): double
-t.c:19:16: note:   vectype: vector(2) double
-t.c:19:16: note:   nunits = 2
-t.c:19:16: missed:   Build SLP failed: unrolling required in basic block SLP
-t.c:19:16: note:   Build SLP for <retval>.b = _66;
-t.c:19:16: note:   get vectype for scalar type (group size 3): double
-t.c:19:16: note:   vectype: vector(2) double
-t.c:19:16: note:   nunits = 2
-t.c:19:16: missed:   Build SLP failed: unrolling required in basic block SLP
-t.c:19:16: note:   SLP discovery for node 0x2cb4fe8 failed

Later it tries to vectorize the first 2 items:

-t.c:19:16: note:   Splitting SLP group at stmt 2
-t.c:19:16: note:   Split group into 2 and 1
-t.c:19:16: note:   Starting SLP discovery for
-t.c:19:16: note:     <retval>.r = _68;
-t.c:19:16: note:     <retval>.g = _67;
-t.c:19:16

... and after a lot of blablabla succeeds.

If the opacity field is present, we start with a vector of size 4:
+t.c:19:16: note:     <retval>.r = _70;
+t.c:19:16: note:     <retval>.g = _69;
+t.c:19:16: note:     <retval>.b = _68;
+t.c:19:16: note:     <retval>.o = r$o_33(D);


+t.c:19:16: note:   vect_is_simple_use: operand _70 = PHI <_5(3)>, type of def: internal
+t.c:19:16: note:   vect_is_simple_use: operand _69 = PHI <_11(3)>, type of def: internal
+t.c:19:16: note:   vect_is_simple_use: operand _68 = PHI <_16(3)>, type of def: internal
+t.c:19:16: note:   vect_is_simple_use: operand r$o_33(D), type of def: external
+t.c:19:16: missed:   treating operand as external
+t.c:19:16: note:   SLP discovery for node 0x2e80058 succeeded
+t.c:19:16: note:   SLP size 1 vs. limit 23.
+t.c:19:16: note:   Final SLP tree for instance 0x2def840:
+t.c:19:16: note:   node 0x2e80058 (max_nunits=4, refcnt=2) vector(4) double
+t.c:19:16: note:   op template: <retval>.r = _70;
+t.c:19:16: note:       stmt 0 <retval>.r = _70;
+t.c:19:16: note:       stmt 1 <retval>.g = _69;
+t.c:19:16: note:       stmt 2 <retval>.b = _68;
+t.c:19:16: note:       stmt 3 <retval>.o = r$o_33(D);
+t.c:19:16: note:       children 0x2e800d8
+t.c:19:16: note:   node (external) 0x2e800d8 (max_nunits=1, refcnt=1)
+t.c:19:16: note:       { _70, _69, _68, r$o_33(D) }

So it seems to succeed in vectorizing with 4 entries, but it does so only for
the single return statement:

  <bb 3> [local count: 1063004409]:
  # i_37 = PHI <i_22(5), 0(2)>
  # r$r_40 = PHI <_5(5), r$r_25(D)(2)>
  # r$g_42 = PHI <_11(5), r$g_26(D)(2)>
  # r$b_44 = PHI <_16(5), r$b_27(D)(2)>
  # ivtmp_67 = PHI <ivtmp_66(5), 10000000(2)>
  _1 = rgbs[i_37].r;
  _2 = (int) _1;
  _3 = (double) _2;
  _4 = _3 * w_21(D);
  _5 = _4 + r$r_40;
  _7 = rgbs[i_37].g;
  _8 = (int) _7;
  _9 = (double) _8;
  _10 = _9 * w_21(D);
  _11 = _10 + r$g_42;
  _12 = rgbs[i_37].b;
  _13 = (int) _12;
  _14 = (double) _13;
  _15 = _14 * w_21(D);
  _16 = _15 + r$b_44;
  i_22 = i_37 + 1;
  ivtmp_66 = ivtmp_67 - 1;
  if (ivtmp_66 != 0)
    goto <bb 5>; [99.00%]
  else
    goto <bb 4>; [1.00%]

  <bb 5> [local count: 1052374367]:
  goto <bb 3>; [100.00%]

  <bb 4> [local count: 10737416]:
  # _70 = PHI <_5(3)>
  # _69 = PHI <_11(3)>
  # _68 = PHI <_16(3)>
  _65 = {_70, _69, _68, r$o_33(D)};
  MEM <vector(4) double> [(double *)&<retval>] = _65;

that seems somewhat pointless.
If one adds code initializing the opacity field, then vectorization works well.
So perhaps the SLP vectorizer needs to be told how to deal with uninitialized
variables, which may be common in code like this after SRA?
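
For illustration, here is a minimal sketch of what "initializing the opacity
field" could look like. This is not the actual t.c: the field names r, g, b, o
and the array name rgbs come from the dumps above, while the type names, the
element types, the array size and the function name accumulate are my
assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layouts: only the field names r, g, b, o and the array name
   rgbs appear in the dumps; types and sizes are assumptions. */
struct pixel { unsigned char r, g, b; };
struct rgb_sum { double r, g, b, o; };

enum { MAX_PIXELS = 1024 };
struct pixel rgbs[MAX_PIXELS];

/* Weighted per-channel sum over the first n pixels.  Every field of the
   result, including the opacity o, gets a defined value before the loop,
   so the returned struct never carries an undefined field. */
struct rgb_sum accumulate(size_t n, double w)
{
    assert(n <= MAX_PIXELS);
    struct rgb_sum r = { 0.0, 0.0, 0.0, 0.0 };
    for (size_t i = 0; i < n; i++) {
        /* Same conversion chain as in the GIMPLE: (int), then (double). */
        r.r += (double)(int)rgbs[i].r * w;
        r.g += (double)(int)rgbs[i].g * w;
        r.b += (double)(int)rgbs[i].b * w;
    }
    return r;
}
```

Whether a given GCC version vectorizes the loop body for this variant would
still have to be checked in the vectorizer dump.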

Richi, it is not clear to me where the SLP vectorizer discards the idea of
vectorizing the loop body in this case. But I think one needs to address:
+t.c:19:16: missed:   treating operand as external

I wonder if the loop would work faster if it used vectors of size 4 with the
last field unused.
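
To make the idea concrete, here is a hand-written sketch of the loop using a
4-wide vector with the last lane kept at zero. It relies on the GCC vector
extension (vector_size attribute, also accepted by Clang). It only shows the
shape of the code, not what the vectorizer would emit, and I have not measured
it. The type names, the function name accumulate_v4 and the element types are
assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* GCC vector extension: four doubles in one 32-byte vector. */
typedef double v4df __attribute__((vector_size(32)));

/* Hypothetical layouts, same assumptions as for the scalar loop. */
struct pixel { unsigned char r, g, b; };
struct rgb_sum { double r, g, b, o; };

/* Weighted per-channel sum using one 4-wide accumulator; the fourth lane
   (opacity) is unused and stays at zero. */
struct rgb_sum accumulate_v4(const struct pixel *px, size_t n, double w)
{
    assert(px != NULL || n == 0);
    v4df acc = { 0.0, 0.0, 0.0, 0.0 };
    v4df wv = { w, w, w, 0.0 };
    for (size_t i = 0; i < n; i++) {
        v4df v = { (double)px[i].r, (double)px[i].g, (double)px[i].b, 0.0 };
        acc += v * wv;   /* element-wise multiply and add */
    }
    struct rgb_sum out = { acc[0], acc[1], acc[2], acc[3] };
    return out;
}
```

Without AVX enabled the compiler will lower the 32-byte vector operations to
narrower ones, so any comparison should be done with suitable -march flags.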