From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/108724] Poor codegen when summing two arrays without AVX or SSE
Date: Thu, 09 Feb 2023 13:54:25 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

Richard Biener changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
     Ever confirmed|0                             |1
             Status|UNCONFIRMED                   |ASSIGNED
   Last reconfirmed|                              |2023-02-09

--- Comment #3 from Richard Biener ---
Adding -fopt-info shows

t.c:3:21: optimized: loop vectorized using 8 byte vectors
t.c:1:6: optimized: loop with 7 iterations completely unrolled (header execution count 63136016)

disabling unrolling instead shows

.L2:
        leaq    (%rsi,%rax), %r8
        leaq    (%rdx,%rax), %rdi
        movl    (%r8), %ecx
        addl    (%rdi), %ecx
        movq    %r10, -8(%rsp)
        movl    %ecx, -8(%rsp)
        movq    -8(%rsp), %rcx
        movl    4(%rdi), %edi
        addl    4(%r8), %edi
        movq    %rcx, -16(%rsp)
        movl    %edi, -12(%rsp)
        movq    -16(%rsp), %rcx
        movq    %rcx, (%r9,%rax)
        addq    $8, %rax
        cmpq    $64, %rax
        jne     .L2

and what happens is that vector lowering fails to perform generic vector
addition (vector lowering is supposed to materialize that), but instead
decomposes the vector, doing scalar adds, which eventually results in us
spilling ...

The reason is that vector lowering does

/* Expand a vector operation to scalars; for integer types we can use
   special bit twiddling tricks to do the sums a word at a time, using
   function F_PARALLEL instead of F.  These tricks are done only if
   they can process at least four items, that is, only if the vector
   holds at least four items and if a word can hold four items.  */
static tree
expand_vector_addition (gimple_stmt_iterator *gsi,
                        elem_op_func f, elem_op_func f_parallel,
                        tree type, tree a, tree b, enum tree_code code)
{
  int parts_per_word = BITS_PER_WORD / vector_element_bits (type);

  if (INTEGRAL_TYPE_P (TREE_TYPE (type))
      && parts_per_word >= 4
      && nunits_for_known_piecewise_op (type) >= 4)
    return expand_vector_parallel (gsi, f_parallel,
                                   type, a, b, code);
  else
    return expand_vector_piecewise (gsi, f,
                                    type, TREE_TYPE (type),
                                    a, b, code, false);

so it only treats >= 4 elements as profitable to vectorize this way but the
vectorizer doesn't seem to know that, it instead applies its own cost model
here while vector lowering doesn't have any.
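
For illustration only, here is a minimal C sketch of one well-known form of the "word at a time" trick the quoted comment refers to, applied to two 32-bit lanes packed in a 64-bit word. It is not taken from GCC and is not claimed to match what expand_vector_parallel emits. All function names (pack32x2, piecewise_add32x2, parallel_add32x2, uses_parallel_expansion) are hypothetical, and the last one only restates the two numeric conditions from the quoted code, assuming an integral element type.

```c
#include <stdint.h>

/* Hypothetical helper: pack two 32-bit lanes into one 64-bit word. */
uint64_t pack32x2(uint32_t lo, uint32_t hi)
{
    return (uint64_t)lo | ((uint64_t)hi << 32);
}

/* Reference result: add lane by lane, the way a piecewise
   (scalar) expansion would. */
uint64_t piecewise_add32x2(uint64_t a, uint64_t b)
{
    uint32_t lo = (uint32_t)a + (uint32_t)b;
    uint32_t hi = (uint32_t)(a >> 32) + (uint32_t)(b >> 32);
    return pack32x2(lo, hi);
}

/* Word-at-a-time add: clear the top bit of every lane so the
   64-bit addition cannot carry from one lane into the next, then
   restore each lane's top bit with an XOR of the operand sign bits. */
uint64_t parallel_add32x2(uint64_t a, uint64_t b)
{
    const uint64_t high_bits = 0x8000000080000000ULL;
    uint64_t low_sum = (a & ~high_bits) + (b & ~high_bits);
    return low_sum ^ ((a ^ b) & high_bits);
}

/* Restates the numeric part of the quoted profitability test
   (assumption: the element type is integral). Returns nonzero when
   the parallel path would be chosen. */
int uses_parallel_expansion(int bits_per_word, int element_bits, int nunits)
{
    int parts_per_word = bits_per_word / element_bits;
    return parts_per_word >= 4 && nunits >= 4;
}
```

With 64-bit words and 32-bit elements in an 8 byte vector, parts_per_word is 2 and there are 2 units, so uses_parallel_expansion (64, 32, 2) is 0. That is consistent with the piecewise scalar adds in the assembly above, even though a two-lane word-at-a-time add like parallel_add32x2 is expressible.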