From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: by sourceware.org (Postfix, from userid 48) id 92BEE3858C52; Mon, 29 May 2023 14:20:36 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 92BEE3858C52
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1685370036; bh=KrrFFn9TdCO1y9EAKC6LGUh6o6eOkm3Y5E8/yoh68sk=; h=From:To:Subject:Date:From; b=wCmXfy4jAhH3KdwokfNj8xp/c9ItjYdG1oPxVP1XfICjLs2VJzImTgSIGwt4++P2B Q6/5jsWyTsZTIWmX+j4PLqGzoJRAXEAB1M+XJf0f71VVp84b9F2Qdpj9sw823EzoON gzN2NyABw8Us+2X4qd1hSSo9vvB6DKn+eyv+gH+A=
From: "d_vampile at 163 dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/110023] New: [10.3 Regression] 10% performance drop on important benchmark after r247544.
Date: Mon, 29 May 2023 14:20:36 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 10.3.0
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: d_vampile at 163 dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status bug_severity priority component assigned_to reporter target_milestone attachments.created
Message-ID:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110023

Bug ID: 110023
Summary: [10.3 Regression] 10% performance drop on important benchmark after r247544.
Product: gcc
Version: 10.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: d_vampile at 163 dot com
Target Milestone: ---

Created attachment 55183
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55183&action=edit
Open-source stream benchmark

Performance of the STREAM benchmark has regressed. The change to the loop
peeling policy in the vect_enhance_data_refs_alignment function degrades the
benchmark's performance, which can be reproduced with the attached example.
Alternatively, the benchmark can be obtained from
https://github.com/jeffhammond/stream/archive/master.zip.

Compiling & Running:

    gcc -fopenmp -O -DSTREAM_ARRAY_SIZE=100000000 stream.c -o stream
    ./stream

The patch that modified the loop peeling policy of the
vect_enhance_data_refs_alignment function is available from:
https://gcc.gnu.org/git/?p=gcc.git&a=commit;h=49ab46214e9288ee1268f87ddcd64dacfd21c31d

With OpenMP enabled, the code generated for the Add kernel is as follows.

Before the modification:

    ldr d0, [x5, x1, lsl #3]
    fadd d0, d0, d1
    str d0, [x4, x1, lsl #3]
    mov w4, w2
    sub w7, w7, w2
    add x4, x4, x1
    ldr x1, [x10, #888]
    lsl x4, x4, #3
    lsr w8, w7, #1
    add x6, x4, x6
    add x5, x4, x5
    mov w2, #0x0
    add x4, x4, x1
    mov x1, #0x0
    ldr q0, [x5, x1]
    add w2, w2, #0x1
    ldr q1, [x6, x1]
    cmp w2, w8
    fadd v0.2d, v0.2d, v1.2d
    str q0, [x4, x1]
    add x1, x1, #0x10
    b.cc 4012d8
    and w1, w7, #0xfffffffe
    add w0, w0, w1
    cmp w7, w1
    b.eq 401348
    ldr x5, [x9, #880]
    sxtw x1, w0
    ldr x4, [x11, #896]
    add w0, w0, #0x1
    ldr d1, [x5, x1, lsl #3]
    cmp w3, w0
    ldr x2, [x10, #888]
    ldr d0, [x4, x1, lsl #3]
    fadd d0, d0, d1
    str d0, [x2, x1, lsl #3]
    b.le 401348
    sxtw x0, w0
    ldr d0, [x5, x0, lsl #3]
    ldr d1, [x4, x0, lsl #3]
    fadd d0, d0, d1
    str d0, [x2, x0, lsl #3]
    ldr x19, [sp, #16]
    ldp x29, x30, [sp], #32

After the modification:

    mov x29, sp
    str x19, [sp, #16]
    bl 4006e0
    mov w19, w0
    bl 4006b0
    mov w2, #0x8000
    movk w2, #0x61a, lsl #16
    sdiv w1, w2, w19
    msub w2, w1, w19, w2
    cmp w0, w2
    b.ge 401238
    add w1, w1, #0x1
    mov w2, #0x0
    madd w0, w1, w0, w2
    add w1, w1, w0
    cmp w0, w1
    b.ge 4012d8
    sub w2, w1, w0
    adrp x8, 401000
    adrp x9, 401000
    adrp x7, 401000
    cmp w2, #0x1
    b.eq 4012b8
    ldr x1, [x7, #760]
    sbfiz x4, x0, #3, #32
    ldr x6, [x8, #744]
    lsr w10, w2, #1
    ldr x5, [x9, #752]
    mov w3, #0x0
    add x6, x4, x6
    add x5, x4, x5
    add x4, x4, x1
    mov x1, #0x0
    ldr q0, [x6, x1]
    add w3, w3, #0x1
    ldr q1, [x5, x1]
    cmp w3, w10
    fadd v0.2d, v0.2d, v1.2d
    str q0, [x4, x1]
    add x1, x1, #0x10
    b.cc 401288
    and w1, w2, #0xfffffffe
    add w0, w0, w1
    cmp w2, w1
    b.eq 4012d8
    ldr x3, [x9, #752]
    sxtw x0, w0
    ldr x2, [x8, #744]
    ldr x1, [x7, #760]
    ldr d0, [x3, x0, lsl #3]
    ldr d1, [x2, x0, lsl #3]
    fadd d0, d0, d1
    str d0, [x1, x0, lsl #3]
    ldr x19, [sp, #16]
    ldp x29, x30, [sp], #32
    ret

After the peeling policy was modified, vectorization of the for loop in the
Add kernel no longer attempts to peel the loop, yet performance ends up
degraded.
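
For readers who do not want to decode the listings, here is a rough C sketch
of what "peeling for alignment" means for the Add kernel (c[j] = a[j] + b[j]
in stream.c). It is a hand-written illustration, not GCC's output and not the
vectorizer's actual decision logic. The function names add_plain and
add_peeled are hypothetical, and the 16-byte alignment target is an
assumption chosen to match the 128-bit q-register accesses shown in the
assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference form of the Add kernel: one element per iteration. */
void add_plain(double *c, const double *a, const double *b, size_t n)
{
    for (size_t j = 0; j < n; j++)
        c[j] = a[j] + b[j];
}

/* Hypothetical hand-peeled form of the same loop. */
void add_peeled(double *c, const double *a, const double *b, size_t n)
{
    size_t j = 0;

    /* Prologue (the peeled part): run scalar iterations until the store
     * address c + j is 16-byte aligned (assumed vector alignment). For
     * naturally aligned doubles this is at most one iteration. */
    while (j < n && ((uintptr_t)(c + j) & 15u) != 0) {
        c[j] = a[j] + b[j];
        j++;
    }

    /* Main body: two doubles per iteration, standing in for one
     * 128-bit vector load/add/store. */
    for (; j + 2 <= n; j += 2) {
        c[j]     = a[j]     + b[j];
        c[j + 1] = a[j + 1] + b[j + 1];
    }

    /* Epilogue: leftover element when the remaining count is odd. */
    for (; j < n; j++)
        c[j] = a[j] + b[j];
}
```

In the "before" listing, the scalar ldr/fadd/str sequence ahead of the q-register
loop appears to play the role of the prologue above, while the "after" listing
enters the vector loop directly and keeps only the scalar epilogue.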