From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: by sourceware.org (Postfix, from userid 48)
	id EE16A384641E; Thu, 25 Apr 2024 12:28:42 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org EE16A384641E
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1714048122;
	bh=w2QjRIuxaMyF/X/sRVnZN+dPu1Ue4Y9TZW8e7X18zNk=;
	h=From:To:Subject:Date:From;
	b=yrayweQdCPAF98a6AbLrT6TbAR2fhu9Xo9U53HKqQlfRVUh2W3ln2vHA22/iltAXo
	 7l4MglWYQkupcSxQ5ZkEgAaCsw/V8JFOud19b+6STYwPAEe3hcCWo40OUYWslaN9FY
	 8C+Tbl+ncEmsweTqbwQ7D+lMVO8mBPpEYEt99Arc=
From: "hubicka at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/114852] New: jpegxl 10.0.1 is faster with clang18 then with gcc14
Date: Thu, 25 Apr 2024 12:28:42 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: middle-end
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: hubicka at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status bug_severity priority component assigned_to reporter target_milestone
Message-ID:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114852

            Bug ID: 114852
           Summary: jpegxl 10.0.1 is faster with clang18 then with gcc14
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

https://www.phoronix.com/review/gcc14-clang18-amd-zen4/3 reports about 8%
difference. I can measure 13% on zen3.
The code has changed: it is no longer bound by push_back but runs the AVX2
versions of the inner loops. The hottest loops look comparable. GCC:

  0.00 │266:┌─→vmovaps      (%r14,%rax,4),%ymm0
  0.11 │    │  vmulps       (%rcx,%rax,4),%ymm7,%ymm2
  1.18 │    │  vfnmadd213ps (%rsi,%rax,4),%ymm11,%ymm0
  0.25 │    │  vmulps       %ymm2,%ymm0,%ymm0
  5.94 │    │  vroundps     $0x8,%ymm0,%ymm2
  0.35 │    │  vsubps       %ymm2,%ymm0,%ymm0
  1.05 │    │  vmulps       (%rdx,%rax,4),%ymm0,%ymm0
  3.19 │    │  vmovaps      %ymm0,0x0(%r13,%rax,4)
  0.15 │    │  vandps       %ymm10,%ymm2,%ymm0
  0.03 │    │  add          $0x8,%rax
  0.03 │    │  vcmpeqps     %ymm8,%ymm0,%ymm2
  0.09 │    │  vsqrtps      %ymm0,%ymm0
 27.25 │    │  vaddps       %ymm0,%ymm6,%ymm6
  0.35 │    │  vandnps      %ymm9,%ymm2,%ymm0
  0.12 │    │  vaddps       %ymm0,%ymm5,%ymm5
  0.05 │    ├──cmp          %r12,%rax
  0.02 │    └──jb           266

and clang:

  0.00 │ c90:┌─→vmulps       (%r9,%rdx,4),%ymm0,%ymm2
  0.97 │     │  vmovaps      (%r15,%rdx,4),%ymm1
  0.36 │     │  vsubps       %ymm2,%ymm1,%ymm1
  4.24 │     │  vmulps       (%rcx,%rdx,4),%ymm4,%ymm2
  1.92 │     │  vmulps       %ymm2,%ymm1,%ymm1
  0.65 │     │  vroundps     $0x8,%ymm1,%ymm2
  0.06 │     │  vsubps       %ymm2,%ymm1,%ymm1
  1.11 │     │  vmulps       (%rax,%rdx,4),%ymm1,%ymm1
  3.53 │     │  vmovaps      %ymm1,(%rsi,%rdx,4)
  0.68 │     │  vandps       %ymm6,%ymm2,%ymm1
  0.23 │     │  vcmpneqps    %ymm5,%ymm2,%ymm2
  3.64 │     │  add          $0x8,%rdx
  0.24 │     │  vsqrtps      %ymm1,%ymm1
 22.16 │     │  vaddps       %ymm1,%ymm8,%ymm8
  0.25 │     │  vbroadcastss 0x31eba5(%rip),%ymm1        # 34f840
  0.05 │     │  vandps       %ymm1,%ymm2,%ymm1
  0.04 │     │  vaddps       %ymm1,%ymm7,%ymm7
  0.11 │     ├──cmp          %rdi,%rdx
  0.07 │     └──jb           c90

GCC profile:

  10.78%  cjxl  libjxl.so.0.10.1  [.] jxl::N_AVX2::EstimateEntropy(jxl::AcStrategy const&, float, unsigned long, unsigned long, jxl::ACSConfig const&, float con
   7.02%  cjxl  libjxl.so.0.10.1  [.] jxl::N_AVX2::FindBestMultiplier(float const*, float const*, unsigned long, float, float, bool) [clone .part.0]
   4.50%  cjxl  libjxl.so.0.10.1  [.] void jxl::N_AVX2::Symmetric5Row(jxl::Plane const&, jxl::RectT const&, long, jxl:
   4.47%  cjxl  libjxl.so.0.10.1  [.] jxl::N_AVX2::(anonymous namespace)::TransformFromPixels(jxl::AcStrategy::Type, float const*, unsigned long, float*, float*
   4.31%  cjxl  libjxl.so.0.10.1  [.] jxl::N_AVX2::(anonymous namespace)::TransformToPixels(jxl::AcStrategy::Type, float*, float*, unsigned long, float*)
   4.00%  cjxl  libjxl.so.0.10.1  [.] jxl::ThreadPool::RunCallState const&, int const* restrict*, jxl::AcStra
   3.56%  cjxl  libm.so.6         [.] __ieee754_pow_fma
   3.49%  cjxl  libjxl.so.0.10.1  [.] jxl::N_AVX2::(anonymous namespace)::IDCT1DImpl<8ul, 8ul>::operator()(float const*, unsigned long, float*, unsigned long, f
   3.43%  cjxl  libjxl.so.0.10.1  [.] jxl::N_AVX2::(anonymous namespace)::AdaptiveQuantizationImpl::ComputeTile(float, float, jxl::Image3 const&, jxl::Re
   3.27%  cjxl  libjxl.so.0.10.1  [.] void jxl::N_AVX2::(anonymous namespace)::DCT1DWrapper<32ul, 0ul, jxl::N_AVX2::(anonymous namespace)::DCTFrom, jxl::N_AVX2:
   3.16%  cjxl  libjxl.so.0.10.1  [.] jxl::N_AVX2::(anonymous namespace)::DCT1DImpl<8ul, 8ul>::operator()(float*, float*) [clone .isra.0]
   2.87%  cjxl  libjxl.so.0.10.1  [.] void jxl::N_AVX2::(anonymous namespace)::ComputeScaledIDCT<4ul, 8ul>::operator()::operator()::operator() const&, jxl::RectT const&, jxl::DequantMatrices const&, jxl::AcStrategyImage const*, jxl::Plane const*, jxl::Quantizer const*, jxl::Rect
   5.03%  cjxl  libjxl.so.0.10.1  [.] jxl::ThreadPool::RunCallState const&, jxl::RectT const&, jxl::WeightsSymmetric5 const&, jxl::ThreadPool*, jxl::Pla
   4.66%  cjxl  libjxl.so.0.10.1  [.] jxl::N_AVX2::(anonymous namespace)::DCT1DImpl<16ul, 8ul>::operator()(float*, float*)
   4.56%  cjxl  libjxl.so.0.10.1  [.] jxl::ThreadPool::RunCallState::operator()(float const*, unsigned long, float*, unsigned long, float*)
   4.22%  cjxl  libm.so.6         [.] __ieee754_pow_fma
          cjxl  libjxl.so.0.10.1  [.] jxl::ThreadPool::RunCallState const&, jxl::RectT const&, float, jxl::ThreadPool*, jxl::Plane
   3.04%  cjxl  libjxl.so.0.10.1  [.] void jxl::N_AVX2::(anonymous namespace)::GenericTransposeBlock<0ul, 0ul, jxl::N_AVX2::(anonymous namespace)::DCTFrom, jxl::N_AVX2::(anonymous namespace)::DCTTo>(jxl::N_AVX2::(anonymous namespace)::Tra
   2.84%  cjxl  libjxl.so.0.10.1  [.] jxl::N_AVX2::AdjustQuantBlockAC(jxl::Quantizer const&, unsigned long, float, unsigned long, unsigned long, unsigned long, float*, float const*, int*)
   2.67%  cjxl  libjxl.so.0.10.1  [.] jxl::ThreadPool::RunCallState*)::$_0>::CallDataFunc(void*, unsigned int, unsigned long)
   2.51%  cjxl  libm.so.6         [.] __log1pf
          cjxl  libjxl.so.0.10.1  [.] jxl::N_AVX2::TokenizeCoefficients(unsigned int const*, jxl::RectT const&, int const* restrict*, jxl::AcStrategyImage const&, jxl::YCbCrChromaSubsampling, jxl::Image3*, std::vector<