From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: by sourceware.org (Postfix, from userid 48)
	id 04E99385842F; Mon, 24 Jan 2022 16:49:36 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 04E99385842F
From: "roger at nextmovesoftware dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/103641] [11/12 regression] Severe compile time regression in SLP vectorize step
Date: Mon, 24 Jan 2022 16:49:36 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: middle-end
X-Bugzilla-Version: 11.2.0
X-Bugzilla-Keywords: compile-time-hog
X-Bugzilla-Severity: normal
X-Bugzilla-Who: roger at nextmovesoftware dot com
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 11.3
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list
X-List-Received-Date: Mon, 24 Jan 2022 16:49:37 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103641

--- Comment #22 from Roger Sayle ---
I completely agree with Richard that the decision to vectorize or not to
vectorize should be made elsewhere, taking the whole function/loop into
account. It's quite reasonable to synthesize a slow vector multiply if
there's an overall benefit from SLP. What I think is required is that the
"baseline" cost should be the cost of moving from the vector to a scalar
mode, performing the multiplication(s) as a scalar and moving the result
back again, i.e.
we're assuming that we're always going to multiply the value in a vector
register, and we're just choosing the cheapest implementation for it.

For the xxhash.i testcase, I'm seeing DImode multiplications with
COSTS_N_INSNS(30) [i.e. a mult_cost of 120]. Even with slow inter-unit
moves, it must be possible to do this faster on AArch64? In fact, we'll
probably vectorize more in SLP if we have the option to shuffle data back
to the scalar multiplier when required. Perhaps even a define_insn_and_split
of mulv2di3 to fool the middle-end into thinking we can do this "natively"
via an optab.

Note that multipliers used in cryptographic hash functions are sometimes
(chosen to be) pathological to synth_mult. Like the design of DES' sboxes,
these are coefficients designed to be slow to implement in software [and
faster in custom hardware]: 64-bit values with around 32 (random) bits set.

I/we can try to speed up the recursion in synth_mult, and/or increase the
size of the hash-table cache [which will help hppa64 and other targets with
slow multipliers], but that's perhaps just working around the deeper issue
with this PR.