From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: by sourceware.org (Postfix, from userid 48) id 142023858D1E; Wed, 17 May 2023 14:53:39 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 142023858D1E DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1684335219; bh=lfVXcY86ak4Wyps10BCaHuW+bflFoy7HdoVbEDXXk6w=; h=From:To:Subject:Date:In-Reply-To:References:From; b=T5CViQZueAs3zhEjeUNmFJ+9MN3TfNEpkTSJVnUWHgU82t238xwHjx4FSHW/tdWEF PRnBY2ROf7quVxpjW36kXwbG/hPUNs1y+E8eGkNW6TJcopFWzVKturiXnCHV3FRNhp FIVqyEXNBugIk0CSS8RU0VHBVtan+49TUgpq6uE8= From: "hubicka at gcc dot gnu.org" To: gcc-bugs@gcc.gnu.org Subject: [Bug middle-end/109849] suboptimal code for vector walking loop Date: Wed, 17 May 2023 14:53:38 +0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: gcc X-Bugzilla-Component: middle-end X-Bugzilla-Version: 13.0 X-Bugzilla-Keywords: missed-optimization X-Bugzilla-Severity: normal X-Bugzilla-Who: hubicka at gcc dot gnu.org X-Bugzilla-Status: NEW X-Bugzilla-Resolution: X-Bugzilla-Priority: P3 X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org X-Bugzilla-Target-Milestone: --- X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: blocked cc Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 List-Id: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D109849 Jan Hubicka changed: What |Removed |Added ---------------------------------------------------------------------------- Blocks| |109811 CC| |mjambor at suse dot cz --- Comment #6 from Jan Hubicka --- Here is slightly improved testcase which actually pushes into stack and measures something. It test loops 1000 times and returns. It also makes st= ack to be local variable so race conditions are not a problem. #include typedef unsigned int uint32_t; std::pair pair; void test() { std::vector> stack; stack.push_back (pair); while (!stack.empty()) { std::pair cur =3D stack.back(); stack.pop_back(); if (!cur.first) { cur.second++; stack.push_back (cur); } if (cur.second > 10000) break; } } int main() { for (int i =3D 0; i < 10000; i++) test(); } Clang code is about twice as fast jan@localhost:/tmp> clang++ -O2 tt.C -fno-exceptions jan@localhost:/tmp> g++ -O2 tt.C -fno-exceptions -o a.out-gcc jan@localhost:/tmp> perf stat ./a.out Performance counter stats for './a.out': 434.24 msec task-clock:u # 0.997 CPUs utilized=20=20=20=20=20=20=20=20=20=20=20=20=20 0 context-switches:u # 0.000 /sec=20= =20=20=20=20=20=20=20 0 cpu-migrations:u # 0.000 /sec=20= =20=20=20=20=20=20=20 129 page-faults:u # 297.073 /sec=20= =20=20=20=20=20=20=20 1,003,191,657 cycles:u # 2.310 GHz=20= =20=20=20=20=20=20=20=20 68,927 stalled-cycles-frontend:u # 0.01% frontend cycles idle=20=20=20=20=20=20 800,792,619 stalled-cycles-backend:u # 79.82% backend cycles idle=20=20=20=20=20=20=20 1,904,682,933 instructions:u # 1.90 insn per cycle=20=20=20=20=20=20=20=20=20=20=20=20 # 0.42 stalled cycles= per insn=20=20=20 500,912,196 branches:u # 1.154 G/sec= =20=20=20=20=20=20=20 23,144 branch-misses:u # 0.00% of all branches=20=20=20=20=20=20=20=20=20=20=20 0.435340389 seconds time elapsed 0.431409000 seconds user 0.003994000 seconds sys jan@localhost:/tmp> perf stat ./a.out-gcc Performance counter stats for './a.out-gcc': 1,197.28 msec task-clock:u # 0.999 CPUs utilized=20=20=20=20=20=20=20=20=20=20=20=20=20 0 context-switches:u # 0.000 /sec=20= =20=20=20=20=20=20=20 0 cpu-migrations:u # 0.000 /sec=20= =20=20=20=20=20=20=20 131 page-faults:u # 109.415 /sec=20= =20=20=20=20=20=20=20 2,903,995,656 cycles:u # 2.425 GHz=20= =20=20=20=20=20=20=20=20 86,204 stalled-cycles-frontend:u # 0.00% frontend cycles idle=20=20=20=20=20=20 2,690,907,052 stalled-cycles-backend:u # 92.66% backend cycles idle=20=20=20=20=20=20=20 2,005,212,311 instructions:u # 0.69 insn per cycle=20=20=20=20=20=20=20=20=20=20=20=20 # 1.34 stalled cycles= per insn=20=20=20 401,007,320 branches:u # 334.932 M/sec= =20=20=20=20=20=20=20 23,290 branch-misses:u # 0.01% of all branches=20=20=20=20=20=20=20=20=20=20=20 1.198388186 seconds time elapsed 1.198450000 seconds user 0.000000000 seconds sys The problem seems to be, like in first example, that we keep updating in-me= mory stack in the main loop. .L39: movl 12(%rsp), %ebx .L30: movq 16(%rsp), %rax cmpl $10000, %ebx ja .L33 .L40: movq 24(%rsp), %rdi cmpq %rdi, %rax je .L28 .L34: movq -8(%rdi), %rax leaq -8(%rdi), %rsi movq %rsi, 24(%rsp) movq %rax, 8(%rsp) testl %eax, %eax jne .L39 While clang does: .LBB0_1: # in Loop: Header=3DBB0_4 Depth= =3D1 movq %rax, %r14 .LBB0_2: # in Loop: Header=3DBB0_4 Depth= =3D1 movq %rbx, %r12 movq %r12, %rbx cmpl $10001, %r13d # imm =3D 0x2711 jae .LBB0_27 .LBB0_4: # =3D>This Loop Header: Depth=3D1 # Child Loop BB0_16 Depth 2 # Child Loop BB0_21 Depth 2 cmpq %r14, %rbx je .LBB0_26 # %bb.5: # in Loop: Header=3DBB0_4 Depth= =3D1 leaq -8(%r14), %rax movq -8(%r14), %rcx movq %rcx, %r13 shrq $32, %r13 testl %ecx, %ecx jne .LBB0_1 Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D109811 [Bug 109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16=