From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 142023858D1E; Wed, 17 May 2023 14:53:39 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 142023858D1E
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1684335219;
	bh=lfVXcY86ak4Wyps10BCaHuW+bflFoy7HdoVbEDXXk6w=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=T5CViQZueAs3zhEjeUNmFJ+9MN3TfNEpkTSJVnUWHgU82t238xwHjx4FSHW/tdWEF
	 PRnBY2ROf7quVxpjW36kXwbG/hPUNs1y+E8eGkNW6TJcopFWzVKturiXnCHV3FRNhp
	 FIVqyEXNBugIk0CSS8RU0VHBVtan+49TUgpq6uE8=
From: "hubicka at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/109849] suboptimal code for vector walking loop
Date: Wed, 17 May 2023 14:53:38 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: middle-end
X-Bugzilla-Version: 13.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: hubicka at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: blocked cc
Message-ID: <bug-109849-4-GYWI9fIZre@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-109849-4@http.gcc.gnu.org/bugzilla/>
References: <bug-109849-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D109849

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |109811
                 CC|                            |mjambor at suse dot cz
--- Comment #6 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Here is slightly improved testcase which actually pushes into stack and
measures something. It test loops 1000 times and returns.  It also makes st=
ack
to be local variable so race conditions are not a problem.

#include <vector>
typedef unsigned int uint32_t;
std::pair<uint32_t, uint32_t> pair;
void
test()
{
        std::vector<std::pair<uint32_t, uint32_t>> stack;
        stack.push_back (pair);
        while (!stack.empty()) {
                std::pair<uint32_t, uint32_t> cur =3D stack.back();
                stack.pop_back();
                if (!cur.first)
                {
                        cur.second++;
                        stack.push_back (cur);
                }
                if (cur.second > 10000)
                        break;
        }
}
int
main()
{
        for (int i =3D 0; i < 10000; i++)
          test();
}

Clang code is about twice as fast

jan@localhost:/tmp> clang++ -O2 tt.C  -fno-exceptions
jan@localhost:/tmp> g++ -O2 tt.C  -fno-exceptions -o a.out-gcc
jan@localhost:/tmp> perf stat ./a.out

 Performance counter stats for './a.out':

            434.24 msec task-clock:u                     #    0.997 CPUs
utilized=20=20=20=20=20=20=20=20=20=20=20=20=20
                 0      context-switches:u               #    0.000 /sec=20=
=20=20=20=20=20=20=20
                 0      cpu-migrations:u                 #    0.000 /sec=20=
=20=20=20=20=20=20=20
               129      page-faults:u                    #  297.073 /sec=20=
=20=20=20=20=20=20=20
     1,003,191,657      cycles:u                         #    2.310 GHz=20=
=20=20=20=20=20=20=20=20
            68,927      stalled-cycles-frontend:u        #    0.01% frontend
cycles idle=20=20=20=20=20=20
       800,792,619      stalled-cycles-backend:u         #   79.82% backend
cycles idle=20=20=20=20=20=20=20
     1,904,682,933      instructions:u                   #    1.90  insn per
cycle=20=20=20=20=20=20=20=20=20=20=20=20
                                                  #    0.42  stalled cycles=
 per
insn=20=20=20
       500,912,196      branches:u                       #    1.154 G/sec=
=20=20=20=20=20=20=20
            23,144      branch-misses:u                  #    0.00% of all
branches=20=20=20=20=20=20=20=20=20=20=20

       0.435340389 seconds time elapsed

       0.431409000 seconds user
       0.003994000 seconds sys


jan@localhost:/tmp> perf stat ./a.out-gcc

 Performance counter stats for './a.out-gcc':

          1,197.28 msec task-clock:u                     #    0.999 CPUs
utilized=20=20=20=20=20=20=20=20=20=20=20=20=20
                 0      context-switches:u               #    0.000 /sec=20=
=20=20=20=20=20=20=20
                 0      cpu-migrations:u                 #    0.000 /sec=20=
=20=20=20=20=20=20=20
               131      page-faults:u                    #  109.415 /sec=20=
=20=20=20=20=20=20=20
     2,903,995,656      cycles:u                         #    2.425 GHz=20=
=20=20=20=20=20=20=20=20
            86,204      stalled-cycles-frontend:u        #    0.00% frontend
cycles idle=20=20=20=20=20=20
     2,690,907,052      stalled-cycles-backend:u         #   92.66% backend
cycles idle=20=20=20=20=20=20=20
     2,005,212,311      instructions:u                   #    0.69  insn per
cycle=20=20=20=20=20=20=20=20=20=20=20=20
                                                  #    1.34  stalled cycles=
 per
insn=20=20=20
       401,007,320      branches:u                       #  334.932 M/sec=
=20=20=20=20=20=20=20
            23,290      branch-misses:u                  #    0.01% of all
branches=20=20=20=20=20=20=20=20=20=20=20

       1.198388186 seconds time elapsed

       1.198450000 seconds user
       0.000000000 seconds sys


The problem seems to be, like in first example, that we keep updating in-me=
mory
stack in the main loop.

.L39:
        movl    12(%rsp), %ebx
.L30:
        movq    16(%rsp), %rax
        cmpl    $10000, %ebx
        ja      .L33
.L40:
        movq    24(%rsp), %rdi
        cmpq    %rdi, %rax
        je      .L28
.L34:
        movq    -8(%rdi), %rax
        leaq    -8(%rdi), %rsi
        movq    %rsi, 24(%rsp)
        movq    %rax, 8(%rsp)
        testl   %eax, %eax
        jne     .L39

While clang does:

.LBB0_1:                                #   in Loop: Header=3DBB0_4 Depth=
=3D1
        movq    %rax, %r14
.LBB0_2:                                #   in Loop: Header=3DBB0_4 Depth=
=3D1
        movq    %rbx, %r12
        movq    %r12, %rbx
        cmpl    $10001, %r13d                   # imm =3D 0x2711
        jae     .LBB0_27
.LBB0_4:                                # =3D>This Loop Header: Depth=3D1
                                        #     Child Loop BB0_16 Depth 2
                                        #     Child Loop BB0_21 Depth 2
        cmpq    %r14, %rbx
        je      .LBB0_26
# %bb.5:                                #   in Loop: Header=3DBB0_4 Depth=
=3D1
        leaq    -8(%r14), %rax
        movq    -8(%r14), %rcx
        movq    %rcx, %r13
        shrq    $32, %r13
        testl   %ecx, %ecx
        jne     .LBB0_1


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D109811
[Bug 109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16=