From: "siarhei.siamashka at gmail dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug d/102765] New: [11 Regression] GDC11 stopped inlining library functions and lambdas used by a binary search one-liner code
Date: Fri, 15 Oct 2021 06:54:06 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102765

            Bug ID: 102765
           Summary: [11 Regression] GDC11 stopped inlining library functions
                    and lambdas used by a binary search one-liner code
           Product: gcc
           Version: 11.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: d
          Assignee: ibuclaw at gdcproject dot org
          Reporter: siarhei.siamashka at gmail dot com
  Target Milestone: ---

The performance of the following simple binary search code regressed a lot
starting from GDC11:

/*******************************************************/
import std.algorithm, std.range, std.stdio, std.stdint;

// calculate integer square root using binary search
int64_t isqrt(int64_t x)
{
    return iota(0, min(x, 3037000499) + 1)
        .map!(v => (v * v > x))
        .assumeSorted.lowerBound(true)
        .length - 1;
}

// print the sum of 20M square roots
void main()
{
    20000000.iota.map!isqrt.sum.writeln;
}
/*******************************************************/

$ gdc-6.3.0 -g -O3 -frelease -fno-bounds-check test.d && time ./a.out
59618479180

real    0m1.924s
user    0m1.924s
sys     0m0.000s

$ gdc-9.3.0 -g -O3 -frelease -fno-bounds-check test.d && time ./a.out
59618479180

real    0m2.100s
user    0m2.099s
sys     0m0.000s

$ gdc-10.3.0 -g -O3 -frelease -fno-bounds-check test.d && time ./a.out
59618479180

real    0m1.776s
user    0m1.776s
sys     0m0.000s

$ gdc-11.2.0 -g -O3 -frelease -fno-bounds-check test.d && time ./a.out
59618479180

real    0m6.889s
user    0m6.887s
sys     0m0.000s

My expectation is that the compiler should inline everything here and generate
code for a small and efficient binary search loop. But GDC11 stopped doing
this, as can be confirmed by running "perf record ./a.out && perf report":

  27.86%  a.out  a.out  [.] _D3std5range__T11SortedRangeTSQBc9algorithm9iteration__T9MapResultS4test5isqrtFlZ9__lambda2TSQDnQDm__T4iotaTiTlZQkFilZ6ResultZQCsVAyaa5_61203c2062ZQFc__T18getTransitionIndexVEQGrQGq12SearchPolicyi3SQHoQHn__TQHkTQHaVQDha5_61203c2062ZQIj3geqTbZQDlMFNaNbNiNfbZm
  15.02%  a.out  a.out  [.] _D3std5range__T11SortedRangeTSQBc9algorithm9iteration__T9MapResultS4test5isqrtFlZ9__lambda2TSQDnQDm__T4iotaTiTlZQkFilZ6ResultZQCsVAyaa5_61203c2062ZQFc__T3geqTbTbZQjMFNaNbNiNfbbZb
  10.34%  a.out  a.out  [.] _D3std9algorithm9iteration__T9MapResultS4test5isqrtFlZ9__lambda2TSQCm5range__T4iotaTiTlZQkFilZ6ResultZQCv7opIndexMFNaNbNiNfmZb
  10.31%  a.out  a.out  [.] _D3std10functional__T9binaryFunVAyaa5_61203c2062VQra1_61VQza1_62Z__TQBvTbTbZQCdFNaNbNiNfKbKbZb
   3.03%  a.out  a.out  [.] _D3std5range__T4iotaTiTlZQkFilZ6Result7opIndexMNgFNaNbNiNfmZNgl
   2.34%  a.out  a.out  [.] 0x0000000000031a09
   2.28%  a.out  a.out  [.] _D4core6atomic__T7casImplTmTxmTmZQqFNaNbNiNePOmxmmZb
   2.11%  a.out  a.out  [.] _D3std5range__T11SortedRangeTSQBc9algorithm9iteration__T9MapResultS4test5isqrtFlZ9__lambda2TSQDnQDm__T4iotaTiTlZQkFilZ6ResultZQCsVAyaa5_61203c2062ZQFc7opSliceMFNaNbNiNfmmZSQGoQGn__TQGkTQGaVQCha5_61203c2062ZQHj
   2.02%  a.out  a.out  [.] _D3std5range__T12assumeSortedVAyaa5_61203c2062TSQBu9algorithm9iteration__T9MapResultS4test5isqrtFlZ9__lambda2TSQEfQEe__T4iotaTiTlZQkFilZ6ResultZQCsZQFdFNaNbNiNfQEjZSQGhQGg__T11SortedRangeTQFlVQGga5_61203c2062ZQBj

Using either the -fwhole-program or the -flto command-line option resolves the
performance problem and allows all of these functions to be inlined again:

$ gdc-11.2.0 -g -O3 -frelease -fno-bounds-check -flto test.d && time ./a.out
59618479180

real    0m2.085s
user    0m2.085s
sys     0m0.000s

But is this expected? Does GDC now require the -flto option to get reasonable
performance starting from version 11? Or is this a real performance regression,
and can something be done to improve the inlining behaviour?
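
For comparison, the "small and efficient binary search loop" mentioned in the
report could look roughly like the hand-written version below. This is only an
illustrative sketch and not compiler output: the name isqrtLoop is made up for
this example, and the loop is assumed to be equivalent to the
iota/map/assumeSorted/lowerBound chain for the inputs used in the test program.

```d
import std.stdio : writeln;

// Hypothetical hand-written counterpart of isqrt() from the test case.
// It searches [0, min(x, 3037000499) + 1) for the first v with v * v > x,
// which is what lowerBound(true) computes on the mapped range.
long isqrtLoop(long x)
{
    long lo = 0;
    long hi = (x < 3037000499L ? x : 3037000499L) + 1; // exclusive upper bound

    while (lo < hi)
    {
        immutable mid = lo + (hi - lo) / 2;
        if (mid * mid > x)
            hi = mid;       // predicate is true at mid: answer is at or before mid
        else
            lo = mid + 1;   // predicate is false at mid: answer is after mid
    }

    // lo is now the number of candidates v with v * v <= x
    return lo - 1;
}

// Same workload as the test case: sum of 20M integer square roots.
void main()
{
    long total = 0;
    foreach (n; 0 .. 20_000_000)
        total += isqrtLoop(n);
    writeln(total); // prints 59618479180, the value shown in the runs above
}
```

Timing this variant with the same flags under both gdc-10.3.0 and gdc-11.2.0
would be one way to check whether the slowdown comes only from the missing
inlining of the Phobos templates rather than from the loop code itself.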