From: "hubicka at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake
Date: Wed, 31 May 2023 16:11:19 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Jan Hubicka changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenther at suse dot de
           See Also|                            |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #13 from Jan Hubicka ---
The only difference between slp vectorization is:

-  # _68 = PHI <_5(3)>
-  # _67 = PHI <_11(3)>
-  # _66 = PHI <_16(3)>
-  .r = _68;
-  .g = _67;
-  .b = _66;
+  # _70 = PHI <_5(3)>
+  # _69 = PHI <_11(3)>
+  # _68 = PHI <_16(3)>
+  .r = _70;
+  .g = _69;
+  .b = _68;
+  .o = r$o_33(D);

so SRA invents r$o_33(D) even if that variable is undefined. SLP vectorizer
then sees it as interleaving stores:

-t.c:19:16: note: _1 = rgbs[i_35].r;
-t.c:19:16: note: _7 = rgbs[i_35].g;
-t.c:19:16: note: _12 = rgbs[i_35].b;
-t.c:19:16: note: Detected interleaving store of size 3
-t.c:19:16: note: .r = _68;
-t.c:19:16: note: .g = _67;
-t.c:19:16: note: .b = _66;
+t.c:19:16: note: _1 = rgbs[i_37].r;
+t.c:19:16: note: _7 = rgbs[i_37].g;
+t.c:19:16: note: _12 = rgbs[i_37].b;
+t.c:19:16: note: Detected interleaving store of size 4
+t.c:19:16: note: .r = _70;
+t.c:19:16: note: .g = _69;
+t.c:19:16: note: .b = _68;
+t.c:19:16: note: .o = r$o_33(D);

In the first case it first tries to vectorize with a vector of 3 doubles and
fails:

-t.c:19:16: note: .r = _68;
-t.c:19:16: note: .g = _67;
-t.c:19:16: note: .b = _66;
-t.c:19:16: note: starting SLP discovery for node 0x2cb4fe8
-t.c:19:16: note: Build SLP for .r = _68;
-t.c:19:16: note: get vectype for scalar type (group size 3): double
-t.c:19:16: note: vectype: vector(2) double
-t.c:19:16: note: nunits = 2
-t.c:19:16: missed: Build SLP failed: unrolling required in basic block SLP
-t.c:19:16: note: Build SLP for .g = _67;
-t.c:19:16: note: get vectype for scalar type (group size 3): double
-t.c:19:16: note: vectype: vector(2) double
-t.c:19:16: note: nunits = 2
-t.c:19:16: missed: Build SLP failed: unrolling required in basic block SLP
-t.c:19:16: note: Build SLP for .b = _66;
-t.c:19:16: note: get vectype for scalar type (group size 3): double
-t.c:19:16: note: vectype: vector(2) double
-t.c:19:16: note: nunits = 2
-t.c:19:16: missed: Build SLP failed: unrolling required in basic block SLP
-t.c:19:16: note: SLP discovery for node 0x2cb4fe8 failed

And later it tries to vectorize the first 2 items:

-t.c:19:16: note: Splitting SLP group at stmt 2
-t.c:19:16: note: Split group into 2 and 1
-t.c:19:16: note: Starting SLP discovery for
-t.c:19:16: note: .r = _68;
-t.c:19:16: note: .g = _67;
-t.c:19:16 ...

and after a lot of blablabla succeeds.

If the opaque field is present we start with a vector of size 4:

+t.c:19:16: note: .r = _70;
+t.c:19:16: note: .g = _69;
+t.c:19:16: note: .b = _68;
+t.c:19:16: note: .o = r$o_33(D);
+t.c:19:16: note: vect_is_simple_use: operand _70 = PHI <_5(3)>, type of def: internal
+t.c:19:16: note: vect_is_simple_use: operand _69 = PHI <_11(3)>, type of def: internal
+t.c:19:16: note: vect_is_simple_use: operand _68 = PHI <_16(3)>, type of def: internal
+t.c:19:16: note: vect_is_simple_use: operand r$o_33(D), type of def: external
+t.c:19:16: missed: treating operand as external
+t.c:19:16: note: SLP discovery for node 0x2e80058 succeeded
+t.c:19:16: note: SLP size 1 vs. limit 23.
+t.c:19:16: note: Final SLP tree for instance 0x2def840:
+t.c:19:16: note: node 0x2e80058 (max_nunits=4, refcnt=2) vector(4) double
+t.c:19:16: note: op template: .r = _70;
+t.c:19:16: note: stmt 0 .r = _70;
+t.c:19:16: note: stmt 1 .g = _69;
+t.c:19:16: note: stmt 2 .b = _68;
+t.c:19:16: note: stmt 3 .o = r$o_33(D);
+t.c:19:16: note: children 0x2e800d8
+t.c:19:16: note: node (external) 0x2e800d8 (max_nunits=1, refcnt=1)
+t.c:19:16: note: { _70, _69, _68, r$o_33(D) }

So it seems to succeed vectorizing with 4 entries, but it does so only for the
single return statement:

   [local count: 1063004409]:
  # i_37 = PHI
  # r$r_40 = PHI <_5(5), r$r_25(D)(2)>
  # r$g_42 = PHI <_11(5), r$g_26(D)(2)>
  # r$b_44 = PHI <_16(5), r$b_27(D)(2)>
  # ivtmp_67 = PHI
  _1 = rgbs[i_37].r;
  _2 = (int) _1;
  _3 = (double) _2;
  _4 = _3 * w_21(D);
  _5 = _4 + r$r_40;
  _7 = rgbs[i_37].g;
  _8 = (int) _7;
  _9 = (double) _8;
  _10 = _9 * w_21(D);
  _11 = _10 + r$g_42;
  _12 = rgbs[i_37].b;
  _13 = (int) _12;
  _14 = (double) _13;
  _15 = _14 * w_21(D);
  _16 = _15 + r$b_44;
  i_22 = i_37 + 1;
  ivtmp_66 = ivtmp_67 - 1;
  if (ivtmp_66 != 0)
    goto ; [99.00%]
  else
    goto ; [1.00%]

   [local count: 1052374367]:
  goto ; [100.00%]

   [local count: 10737416]:
  # _70 = PHI <_5(3)>
  # _69 = PHI <_11(3)>
  # _68 = PHI <_16(3)>
  _65 = {_70, _69, _68, r$o_33(D)};
  MEM [(double *)&] = _65;

that seems somewhat pointless.

If one adds code initializing the opacity field then vectorization works well.
So perhaps the SLP vectorizer needs to be told how to deal with uninitialized
variables that may be common in code like this after SRA?

Richi, it is not clear to me where the SLP vectorizer discards the idea of
vectorizing the loop body in this case. But I think one needs to address:

+t.c:19:16: missed: treating operand as external

I wonder if the loop would work faster if it used vectors of size 4 with the
last field unused.
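
For illustration, here is a minimal C sketch of the kind of loop discussed
above. It is reconstructed only from the shape of the GIMPLE dump and is not
the testcase attached to the bug: the names struct rgb8, struct rgbo,
weighted_sum and example_usage are hypothetical, and the 8-bit source type is
an assumption. Unlike the dump, where the accumulators start from undefined
values, this sketch initializes every field, so it corresponds to the variant
with the opacity field initialized.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types: 8-bit source pixels, double accumulators. */
struct rgb8 { unsigned char r, g, b; };
struct rgbo { double r, g, b, o; };

/* Weighted sum of all pixels, returned by value as in the dump. */
struct rgbo weighted_sum(const struct rgb8 *rgbs, size_t n, double w)
{
    struct rgbo acc;

    acc.r = 0.0;
    acc.g = 0.0;
    acc.b = 0.0;
    /* Assumption: dropping this store would leave acc.o undefined, which is
       the situation where SRA produces a value like r$o_33(D). */
    acc.o = 0.0;

    for (size_t i = 0; i < n; i++) {
        /* Same conversion chain as the dump: narrow int -> int -> double. */
        acc.r += (double)(int)rgbs[i].r * w;
        acc.g += (double)(int)rgbs[i].g * w;
        acc.b += (double)(int)rgbs[i].b * w;
    }
    return acc;
}

/* Small usage example with exactly representable values. */
void example_usage(void)
{
    const struct rgb8 px[2] = { {1, 2, 3}, {4, 5, 6} };
    struct rgbo s = weighted_sum(px, 2, 0.5);

    assert(s.r == 2.5 && s.g == 3.5 && s.b == 4.5 && s.o == 0.0);
}
```

Whether a given GCC version vectorizes the loop body here would have to be
checked against its own vectorizer dump; the sketch only shows the source
pattern, not a measured result.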