From: "hliu at amperecomputing dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/110474] New: Vect: the epilog vect loop should have small VF if the loop is unrolled during vectorization
Date: Thu, 29 Jun 2023 04:35:23 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110474

            Bug ID: 110474
           Summary: Vect: the epilog vect loop should have small VF if the
                    loop is unrolled during vectorization
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

Hi, I'm trying to
tune loop unrolling during vectorization (see suggested_unroll_factor in
tree-vect-loop.cc). I find that unrolling may hurt performance, because it also
increases the VF (vectorization factor) of the epilog vect loop. For example:

  int foo(short *A, char *B, int N) {
    int sum = 0;
    for (int i = 0; i < N; ++i) {
      sum += A[i] * B[i];
    }
    return sum;
  }

Compile it with "-O3 -mtune=neoverse-n2 -mcpu=neoverse-n1 --param
aarch64-vect-unroll-limit=2" (I'm using -mcpu n1 because I want to try a target
without SVE). The GCC vectorization pass unrolls the loop by 2 and generates
code as follows:

  if N >= 32:
    main vect loop ...
  if N >= 16:    # This may hurt performance if N is small (e.g. 8)
    epilog vect loop ...
  epilog scalar code ...

If the loop is not unrolled (i.e. with "--param aarch64-vect-unroll-limit=1"),
GCC generates code as follows:

  if N >= 16:
    main vect loop ...
  if N >= 8:
    epilog vect loop ...
  epilog scalar code ...

The runtime check is based on the VF of the epilog vectorization. There is code
in tree-vect-loop.cc (line 2990) that constrains the epilog vect VF:

  /* If we're vectorizing an epilogue loop, the vectorized loop either needs
     to be able to handle fewer than VF scalars, or needs to have a lower VF
     than the main loop.  */
  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)
      && !LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
      && maybe_ge (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
                   LOOP_VINFO_VECT_FACTOR (orig_loop_vinfo)))
    return opt_result::failure_at (vect_location,
                                   "Vectorization factor too high for"
                                   " epilogue loop.\n");

But this check doesn't take suggested_unroll_factor into account: with
unrolling by 2 the main loop's VF is 32, so an epilog VF of 16 still passes.
So I'm thinking about adding the following code to unscale orig_loop_vinfo's VF
by the unroll factor, and comparing the epilog VF against that instead:

  unscaled_orig_vf = exact_div (LOOP_VINFO_VECT_FACTOR (orig_loop_vinfo),
                                orig_loop_vinfo->suggested_unroll_factor);

Is this reasonable?
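
To make the thresholds concrete, here is a small standalone model in C. It is only a sketch and not GCC code: it uses plain ints where GCC uses poly_int values, every name in it (iter_split, unscaled_vf, epilog_vf_ok_current, epilog_vf_ok_proposed, split_iterations) is hypothetical, and it assumes the epilog vect loop may iterate more than once over the remainder.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the checks discussed above, with plain ints
   instead of GCC's poly_int types.  None of these names exist in GCC.  */

struct iter_split {
  int main_vect;    /* iterations of the main vect loop */
  int epilog_vect;  /* iterations of the epilog vect loop */
  int scalar;       /* scalar iterations left for the epilog scalar code */
};

/* Undo the effect of unrolling on the main loop's VF.  Like exact_div,
   this requires the division to be exact.  */
int unscaled_vf (int main_vf, int unroll_factor)
{
  assert (unroll_factor > 0 && main_vf % unroll_factor == 0);
  return main_vf / unroll_factor;
}

/* Model of the existing check: the epilog VF only has to be lower than
   the (possibly unrolled) main loop VF.  */
bool epilog_vf_ok_current (int epilog_vf, int main_vf)
{
  return epilog_vf < main_vf;
}

/* Model of the proposed check: compare against the unscaled VF.  */
bool epilog_vf_ok_proposed (int epilog_vf, int main_vf, int unroll_factor)
{
  return epilog_vf < unscaled_vf (main_vf, unroll_factor);
}

/* Split a trip count n across the three loops, following the runtime
   checks "if N >= main_vf" and "if N >= epilog_vf" shown above.  */
struct iter_split split_iterations (int n, int main_vf, int epilog_vf)
{
  struct iter_split s = { 0, 0, 0 };
  if (n < 0)
    n = 0;
  if (n >= main_vf)
    {
      s.main_vect = n / main_vf;
      n %= main_vf;
    }
  if (n >= epilog_vf)
    {
      s.epilog_vect = n / epilog_vf;
      n %= epilog_vf;
    }
  s.scalar = n;
  return s;
}
```

In this model, with the loop unrolled by 2 (main VF 32), an epilog VF of 16 passes the current check but is rejected by the proposed one, while an epilog VF of 8 is accepted. For N = 8, split_iterations (8, 32, 16) leaves all 8 iterations to the scalar code, whereas split_iterations (8, 32, 8) handles them in a single epilog vect iteration.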