From: "hliu at amperecomputing dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/110449] New: Vect: use a small step to calculate the loop induction if the loop is unrolled during loop vectorization
Date: Wed, 28 Jun 2023 09:22:40 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110449

            Bug ID: 110449
           Summary: Vect: use a small step to calculate the loop induction if
                    the loop is unrolled during loop vectorization
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

This is inspired by clang. Compile the following case with
"-mcpu=neoverse-n2 -O3":

void foo(int *arr, int val, int step) {
  for (int i = 0; i < 1024; i++) {
    arr[i] = val;
    val += step;
  }
}

The loop is unrolled by 2 during vectorization. GCC generates this code:

        fmov    s29, w2                 # step
        shl     v27.2s, v29.2s, 3       # 8*step
        shl     v28.2s, v29.2s, 2       # 4*step
        ...
.L2:
        mov     v30.16b, v31.16b
        add     v31.4s, v31.4s, v27.4s  # += 8*step
        add     v29.4s, v30.4s, v28.4s  # += 4*step
        stp     q30, q29, [x0]
        add     x0, x0, 32
        cmp     x1, x0
        bne     .L2

The v27 register (i.e. "8*step") is not actually necessary. We can compute
v29 + v28 (i.e. "+ 4*step") instead and generate simpler code:

        fmov    s29, w2                 # step
        shl     v28.2s, v29.2s, 2       # 4*step
        ...
.L2:
        add     v29.4s, v30.4s, v28.4s  # += 4*step
        stp     q30, q29, [x0]
        add     x0, x0, 32
        add     v30.4s, v29.4s, v28.4s  # += 4*step
        cmp     x1, x0
        bne     .L2

This has two benefits:

(1) It saves one vector register and one "mov" instruction.

(2) For floating point, the result computed with the small step should be
closer to the original scalar result than the one computed with the large
step. I.e. "A + 4*step + ... + 4*step" should be closer to
"A + step + ... + step" than "A + 8*step + ... + 8*step".

Do you think this is reasonable?

I have a simple patch that enhances "vectorizable_induction()" in
tree-vect-loop.cc to achieve this. I will send the patch out for code
review later.
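
For illustration, the two induction schemes can be modelled in plain C. This is only a sketch, not the vectorizer's actual output and not the patch mentioned in the report: the function names (foo_scalar, foo_two_steps, foo_one_step) and the LANES/UNROLL constants are made up for the example, each "vector" is emulated as an array of 4 lanes, and unsigned arithmetic is used so that wrap-around is well defined. It covers the integer case only, where both schemes must give exactly the scalar result; the floating-point rounding behaviour from point (2) is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 4, UNROLL = 2 };  /* hypothetical: 4 x 32-bit lanes, unrolled by 2 */

/* Scalar reference: the loop from the report, with unsigned arithmetic. */
void foo_scalar(uint32_t *arr, uint32_t val, uint32_t step, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        arr[i] = val;
        val += step;
    }
}

/* Models the code GCC currently emits: the first vector advances by 8*step
   per iteration, and the second vector is derived from it with +4*step.
   Two step vectors (4*step and 8*step) are live in the loop. */
void foo_two_steps(uint32_t *arr, uint32_t val, uint32_t step, size_t n)
{
    assert(n % (LANES * UNROLL) == 0);
    const uint32_t step4 = 4 * step, step8 = 8 * step;
    uint32_t iv[LANES];
    for (int l = 0; l < LANES; l++)
        iv[l] = val + (uint32_t)l * step;   /* { val, val+step, ... } */

    for (size_t i = 0; i < n; i += LANES * UNROLL) {
        for (int l = 0; l < LANES; l++) {
            arr[i + l] = iv[l];
            arr[i + LANES + l] = iv[l] + step4;
            iv[l] += step8;
        }
    }
}

/* Models the proposed code: only the 4*step vector is kept, and it is
   applied twice per iteration in a chain. */
void foo_one_step(uint32_t *arr, uint32_t val, uint32_t step, size_t n)
{
    assert(n % (LANES * UNROLL) == 0);
    const uint32_t step4 = 4 * step;
    uint32_t iv[LANES];
    for (int l = 0; l < LANES; l++)
        iv[l] = val + (uint32_t)l * step;

    for (size_t i = 0; i < n; i += LANES * UNROLL) {
        for (int l = 0; l < LANES; l++) {
            uint32_t second = iv[l] + step4;  /* first  += 4*step */
            arr[i + l] = iv[l];
            arr[i + LANES + l] = second;
            iv[l] = second + step4;           /* second += 4*step */
        }
    }
}
```

Calling all three functions with the same val, step and n = 1024 should fill their arrays identically. The trade-off visible in the model is that foo_one_step needs one fewer loop-invariant value, while its two additions per iteration form a dependency chain.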