public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/109811] New: libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16
@ 2023-05-11 14:24 aros at gmx dot com
  2023-05-11 15:30 ` [Bug target/109811] " xry111 at gcc dot gnu.org
                   ` (19 more replies)
  0 siblings, 20 replies; 21+ messages in thread
From: aros at gmx dot com @ 2023-05-11 14:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

            Bug ID: 109811
           Summary: libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16
           Product: gcc
           Version: 13.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: aros at gmx dot com
  Target Milestone: ---

Created attachment 55051
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55051&action=edit
Graphs

Check this:

https://www.phoronix.com/review/gcc13-clang16-raptorlake/3

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/109811] libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:24 [Bug tree-optimization/109811] New: libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
@ 2023-05-11 15:30 ` xry111 at gcc dot gnu.org
  2023-05-11 15:49 ` aros at gmx dot com
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: xry111 at gcc dot gnu.org @ 2023-05-11 15:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

Xi Ruoyao <xry111 at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |xry111 at gcc dot gnu.org

--- Comment #1 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
Neither a URL to an external site nor a plot is a valid bug report.  We need
to know precisely how to reproduce the issue.  And you cannot assume the GCC
developers always have a Raptor Lake machine or want to run the full Phoronix
benchmark suite.

See https://gcc.gnu.org/bugs/.

That being said, maybe we can try `-fno-semantic-interposition` first?  It's
the default in Clang, but not in GCC.  If a library is compiled with -fPIC,
semantic interposition can sometimes cause a large performance loss.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/109811] libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:24 [Bug tree-optimization/109811] New: libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
  2023-05-11 15:30 ` [Bug target/109811] " xry111 at gcc dot gnu.org
@ 2023-05-11 15:49 ` aros at gmx dot com
  2023-05-12  6:59 ` rguenth at gcc dot gnu.org
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: aros at gmx dot com @ 2023-05-11 15:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

Artem S. Tashkinov <aros at gmx dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |INVALID

--- Comment #2 from Artem S. Tashkinov <aros at gmx dot com> ---
According to the latest Phoronix test, which can be easily downloaded, run and
reproduced, GCC 13.1 loses to Clang by a wide margin; in certain workloads it's
~30% (!) slower, and I just wanted to alert its developers to a widening
performance gap vs Clang. I'm not a developer either, I'm simply no one.

My previous bug reports for performance regressions and deficiencies weren't
met with such ... words, so, I'm sorry, I'm not in the mood to prove anything,
and I'll just go ahead and close it as useless, annoying and maybe even
outright invalid.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/109811] libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:24 [Bug tree-optimization/109811] New: libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
  2023-05-11 15:30 ` [Bug target/109811] " xry111 at gcc dot gnu.org
  2023-05-11 15:49 ` aros at gmx dot com
@ 2023-05-12  6:59 ` rguenth at gcc dot gnu.org
  2023-05-17  8:42 ` hubicka at gcc dot gnu.org
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-05-12  6:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org
           Keywords|                            |missed-optimization
         Resolution|INVALID                     |---
             Status|RESOLVED                    |UNCONFIRMED

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
It's certainly worth investigating, so re-opening as a tracking bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/109811] libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:24 [Bug tree-optimization/109811] New: libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (2 preceding siblings ...)
  2023-05-12  6:59 ` rguenth at gcc dot gnu.org
@ 2023-05-17  8:42 ` hubicka at gcc dot gnu.org
  2023-05-17  9:08 ` hubicka at gcc dot gnu.org
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-05-17  8:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu.org
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2023-05-17

--- Comment #4 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Confirmed. LTO is not necessary to reproduce the difference.

I got libjxl and the test JPEG file from the Phoronix test suite and configured
the Clang build with:

CC=clang CXX=clang++ CFLAGS="-O3 -g    -march=native -fno-exceptions"
CXXFLAGS="-O3 -g   -march=native -fno-exceptions" cmake
-DCMAKE_C_FLAGS_RELEASE="$CFLAGS -DNDEBUG" -DCMAKE_CXX_FLAGS_RELEASE="$CXXFLAGS
-DNDEBUG" -DBUILD_TESTING=OFF ..

and

CFLAGS="-O3 -g    -march=native -fno-exceptions" CXXFLAGS="-O3 -g  
-march=native -fno-exceptions" cmake -DCMAKE_C_FLAGS_RELEASE="$CFLAGS -DNDEBUG"
-DCMAKE_CXX_FLAGS_RELEASE="$CXXFLAGS -DNDEBUG" -DBUILD_TESTING=OFF ..

jh@ryzen3:~/.phoronix-test-suite/installed-tests/pts/jpegxl-1.5.0>
./libjxl-0.7.0/build-gcc/tools/cjxl sample-photo-6000x4000.JPG --quality=90
--lossless_jpeg=0
JPEG XL encoder v0.7.0 [AVX2]
No output file specified.
Encoding will be performed, but the result will be discarded.
Read 6000x4000 image, 7837694 bytes, 926.0 MP/s
Encoding [Container | VarDCT, d1.000, effort: 7 | 29424-byte Exif], 
Compressed to 2288431 bytes including container (0.763 bpp).
6000 x 4000, 11.12 MP/s [11.12, 11.12], 1 reps, 16 threads.
jh@ryzen3:~/.phoronix-test-suite/installed-tests/pts/jpegxl-1.5.0>
./libjxl-0.7.0/build-gcc/tools/cjxl sample-photo-6000x4000.JPG --quality=90
--lossless_jpeg=0 test
JPEG XL encoder v0.7.0 [AVX2]
Read 6000x4000 image, 7837694 bytes, 926.5 MP/s
Encoding [Container | VarDCT, d1.000, effort: 7 | 29424-byte Exif], 
Compressed to 2288431 bytes including container (0.763 bpp).
6000 x 4000, 11.09 MP/s [11.09, 11.09], 1 reps, 16 threads.
jh@ryzen3:~/.phoronix-test-suite/installed-tests/pts/jpegxl-1.5.0>
./libjxl-0.7.0/build-gcc/tools/cjxl sample-photo-6000x4000.JPG --quality=90
--lossless_jpeg=0 test
JPEG XL encoder v0.7.0 [AVX2]
Read 6000x4000 image, 7837694 bytes, 925.6 MP/s
Encoding [Container | VarDCT, d1.000, effort: 7 | 29424-byte Exif], 
Compressed to 2288431 bytes including container (0.763 bpp).
6000 x 4000, 11.12 MP/s [11.12, 11.12], 1 reps, 16 threads.
jh@ryzen3:~/.phoronix-test-suite/installed-tests/pts/jpegxl-1.5.0>
./libjxl-0.7.0/build-clang/tools/cjxl sample-photo-6000x4000.JPG --quality=90
--lossless_jpeg=0 test
JPEG XL encoder v0.7.0 [AVX2]
Read 6000x4000 image, 7837694 bytes, 924.6 MP/s
Encoding [Container | VarDCT, d1.000, effort: 7 | 29424-byte Exif], 
Compressed to 2288430 bytes including container (0.763 bpp).
6000 x 4000, 15.17 MP/s [15.17, 15.17], 1 reps, 16 threads.
jh@ryzen3:~/.phoronix-test-suite/installed-tests/pts/jpegxl-1.5.0>
./libjxl-0.7.0/build-clang/tools/cjxl sample-photo-6000x4000.JPG --quality=90
--lossless_jpeg=0 test
JPEG XL encoder v0.7.0 [AVX2]
Read 6000x4000 image, 7837694 bytes, 922.4 MP/s
Encoding [Container | VarDCT, d1.000, effort: 7 | 29424-byte Exif], 
Compressed to 2288430 bytes including container (0.763 bpp).
6000 x 4000, 15.18 MP/s [15.18, 15.18], 1 reps, 16 threads.


So GCC does ~11 MP/s while Clang does ~15 MP/s.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/109811] libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:24 [Bug tree-optimization/109811] New: libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (3 preceding siblings ...)
  2023-05-17  8:42 ` hubicka at gcc dot gnu.org
@ 2023-05-17  9:08 ` hubicka at gcc dot gnu.org
  2023-05-17 12:35 ` [Bug target/109811] libjxl " hubicka at gcc dot gnu.org
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-05-17  9:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #5 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Also, I forgot to mention that I used a Zen 3 machine, so Raptor Lake is not
necessary.  Note that the build system appends -O2 after any CFLAGS specified,
so it really is an -O2 build:

# Force build with optimizations in release mode.
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -O2")

For Clang other options are appended:

      -fnew-alignment=8
      -fno-cxx-exceptions
      -fno-slp-vectorize
      -fno-vectorize

      -disable-free
      -disable-llvm-verifier


A perf profile mixing both the GCC and Clang builds is:

   8.36%  cjxl     libjxl.so.0.7.0          [.] jxl::(anonymous namespace)::FindTextLikePatches
   5.74%  cjxl     libjxl.so.0.7.0          [.] jxl::FindBestPatchDictionary
   4.51%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::EstimateEntropy
   4.50%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::(anonymous namespace)::TransformFromPixels
   4.25%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::QuantizeBlockAC
   4.10%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::EstimateEntropy
   3.77%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::(anonymous namespace)::TransformFromPixels
   3.46%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::QuantizeBlockAC
   3.08%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::FindBestMultiplier
   3.04%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::FindBestMultiplier
   2.98%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::(anonymous namespace)::DCT1DImpl<8ul, 8ul>::operator()
   2.80%  cjxl     libjxl.so.0.7.0          [.] jxl::ThreadPool::RunCallState<jxl::Status (unsigned long), jxl::N_AVX2::Symmetric5(jxl::Plane<float> const&, jxl::RectT<unsigned long> const&, jxl::WeightsSymmetric5 const&, jxl::ThreadPool*, jxl::Plane<float>*)::{l▒
   2.75%  cjxl     libjxl.so.0.7.0          [.] jxl::ThreadPool::RunCallState<jxl::Status (unsigned long), jxl::N_AVX2::Symmetric5(jxl::Plane<float> const&, jxl::RectT<unsigned long> const&, jxl::WeightsSymmetric5 const&, jxl::ThreadPool*, jxl::Plane<float>*)::$_▒
   2.26%  cjxl     libjxl.so.0.7.0          [.] jxl::ThreadPool::RunCallState<jxl::Status (unsigned long), jxl::N_AVX2::SRGBToXYB(jxl::Image3<float> const&, float const*, jxl::ThreadPool*, jxl::Image3<float>*)::$_0>::CallDataFunc
   2.00%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::(anonymous namespace)::DCT1DWrapper<4ul, 4ul, jxl::N_AVX2::(anonymous namespace)::DCTFrom, jxl::N_AVX2::(anonymous namespace)::DCTTo>
   1.95%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::(anonymous namespace)::DCT1DImpl<16ul, 8ul>::operator()
   1.68%  cjxl     libjxl.so.0.7.0          [.] jxl::ThreadPool::RunCallState<jxl::Status (unsigned long), jxl::ConvertFromExternal(jxl::Span<unsigned char const>, unsigned long, unsigned long, jxl::ColorEncoding const&, unsigned long, bool, unsigned long, JxlEnd▒
   1.68%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::(anonymous namespace)::DCT1DImpl<32ul, 8ul>::operator()
   1.66%  cjxl     libjxl.so.0.7.0          [.] jxl::ThreadPool::RunCallState<jxl::Status (unsigned long), jxl::ConvertFromExternal(jxl::Span<unsigned char const>, unsigned long, unsigned long, jxl::ColorEncoding const&, unsigned long, bool, unsigned long, JxlEnd▒
   1.56%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::(anonymous namespace)::DCT1DWrapper<8ul, 4ul, jxl::N_AVX2::(anonymous namespace)::DCTFrom, jxl::N_AVX2::(anonymous namespace)::DCTTo>
   1.52%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::(anonymous namespace)::DCT1DImpl<32ul, 8ul>::operator()
   1.33%  cjxl     libjxl.so.0.7.0          [.] jxl::ThreadPool::RunCallState<jxl::N_AVX2::(anonymous namespace)::AdaptiveQuantizationMap(float, jxl::Image3<float> const&, jxl::FrameDimensions const&, float, jxl::ThreadPool*, jxl::Plane<float>*)::$_0, jxl::N_AVX2▒
   1.27%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::(anonymous namespace)::DCT1DWrapper<64ul, 0ul, jxl::N_AVX2::(anonymous namespace)::DCTFrom, jxl::N_AVX2::(anonymous namespace)::DCTTo>
   1.11%  cjxl     libjxl.so.0.7.0          [.] jxl::ThreadPool::RunCallState<jxl::Status (unsigned long), jxl::(anonymous namespace)::FindTextLikePatches(jxl::Image3<float> const&, jxl::PassesEncoderState const*, jxl::ThreadPool*, jxl::AuxOut*, bool)::{lambda(un▒
So it is some hand-written AVX code.  In GCC the top function is
FindTextLikePatches, while in Clang it is FindBestPatchDictionary.  GCC does
not inline it because of the large function growth limit.  Adding --param
large-function-insns=1000000 makes the inlining decisions match, but has no
effect on the performance.

With these changes I get:
   8.42%  cjxl     libjxl.so.0.7.0          [.] jxl::FindBestPatchDictionary
   5.72%  cjxl     libjxl.so.0.7.0          [.] jxl::FindBestPatchDictionary
   4.50%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::(anonymous namespace)::TransformFromPixels
   4.46%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::EstimateEntropy
   4.25%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::QuantizeBlockAC
   4.14%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::EstimateEntropy
   3.76%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::(anonymous namespace)::TransformFromPixels
   3.56%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::QuantizeBlockAC
   3.10%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::FindBestMultiplier
   3.00%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::FindBestMultiplier
   2.98%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::(anonymous namespace)::DCT1DImpl<8ul, 8ul>::operator()
   2.82%  cjxl     libjxl.so.0.7.0          [.] jxl::ThreadPool::RunCallState<jxl::Status (unsigned long), jxl::N_AVX2::Symmetric5(jxl::Plane<float> const&, jxl::RectT<unsigned long> const&, jxl::WeightsSymmetric5 const&, jxl::ThreadPool*, jxl::Plane<float>*)::{l▒
   2.75%  cjxl     libjxl.so.0.7.0          [.] jxl::ThreadPool::RunCallState<jxl::Status (unsigned long), jxl::N_AVX2::Symmetric5(jxl::Plane<float> const&, jxl::RectT<unsigned long> const&, jxl::WeightsSymmetric5 const&, jxl::ThreadPool*, jxl::Plane<float>*)::$_▒
   2.26%  cjxl     libjxl.so.0.7.0          [.] jxl::ThreadPool::RunCallState<jxl::Status (unsigned long), jxl::N_AVX2::SRGBToXYB(jxl::Image3<float> const&, float const*, jxl::ThreadPool*, jxl::Image3<float>*)::$_0>::CallDataFunc
   1.99%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::(anonymous namespace)::DCT1DWrapper<4ul, 4ul, jxl::N_AVX2::(anonymous namespace)::DCTFrom, jxl::N_AVX2::(anonymous namespace)::DCTTo>
   1.95%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::(anonymous namespace)::DCT1DImpl<16ul, 8ul>::operator()
   1.69%  cjxl     libjxl.so.0.7.0          [.] jxl::ThreadPool::RunCallState<jxl::Status (unsigned long), jxl::ConvertFromExternal(jxl::Span<unsigned char const>, unsigned long, unsigned long, jxl::ColorEncoding const&, unsigned long, bool, unsigned long, JxlEnd▒
   1.67%  cjxl     libjxl.so.0.7.0          [.] jxl::ThreadPool::RunCallState<jxl::Status (unsigned long), jxl::ConvertFromExternal(jxl::Span<unsigned char const>, unsigned long, unsigned long, jxl::ColorEncoding const&, unsigned long, bool, unsigned long, JxlEnd▒
   1.66%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::(anonymous namespace)::DCT1DImpl<32ul, 8ul>::operator()
   1.54%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::(anonymous namespace)::DCT1DWrapper<8ul, 4ul, jxl::N_AVX2::(anonymous namespace)::DCTFrom, jxl::N_AVX2::(anonymous namespace)::DCTTo>
   1.49%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::(anonymous namespace)::DCT1DImpl<32ul, 8ul>::operator()
   1.34%  cjxl     libjxl.so.0.7.0          [.] jxl::ThreadPool::RunCallState<jxl::N_AVX2::(anonymous namespace)::AdaptiveQuantizationMap(float, jxl::Image3<float> const&, jxl::FrameDimensions const&, float, jxl::ThreadPool*, jxl::Plane<float>*)::$_0, jxl::N_AVX2▒
   1.27%  cjxl     libjxl.so.0.7.0          [.] jxl::N_AVX2::(anonymous namespace)::DCT1DWrapper<64ul, 0ul, jxl::N_AVX2::(anonymous namespace)::DCTFrom, jxl::N_AVX2::(anonymous namespace)::DCTTo>
   1.16%  cjxl     libjxl.so.0.7.0          [.] jxl::ThreadPool::RunCallState<jxl::Status (unsigned long), jxl::(anonymous namespace)::FindTextLikePatches(jxl::Image3<float> const&, jxl::PassesEncoderState const*, jxl::ThreadPool*, jxl::AuxOut*, bool)::{lambda(un▒
   1.07%  cjxl     libjxl.so.0.7.0          [.] jxl::ThreadPool::RunCallState<jxl::Status (unsigned long), jxl::(anonymous namespace)::FindTextLikePatches(jxl::Image3<float> const&, jxl::PassesEncoderState const*, jxl::ThreadPool*, jxl::AuxOut*, bool)::$_0>::Call▒

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:24 [Bug tree-optimization/109811] New: libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (4 preceding siblings ...)
  2023-05-17  9:08 ` hubicka at gcc dot gnu.org
@ 2023-05-17 12:35 ` hubicka at gcc dot gnu.org
  2023-05-17 13:59 ` juzhe.zhong at rivai dot ai
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-05-17 12:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #6 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
The hottest loop in Clang's profile is:
  for (size_t y = 0; y < opsin.ysize(); y++) {
    for (size_t x = 0; x < opsin.xsize(); x++) {
      if (is_background_row[y * is_background_stride + x]) continue;
      cc.clear();
      stack.clear();
      stack.emplace_back(x, y);
      size_t min_x = x;
      size_t max_x = x;
      size_t min_y = y;
      size_t max_y = y;
      std::pair<uint32_t, uint32_t> reference;
      bool found_border = false;
      bool all_similar = true;
      while (!stack.empty()) {
        std::pair<uint32_t, uint32_t> cur = stack.back();
        stack.pop_back();
        if (visited_row[cur.second * visited_stride + cur.first]) continue;
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
closed by this continue.
        visited_row[cur.second * visited_stride + cur.first] = 1;
        if (cur.first < min_x) min_x = cur.first;
        if (cur.first > max_x) max_x = cur.first;
        if (cur.second < min_y) min_y = cur.second;
        if (cur.second > max_y) max_y = cur.second;
        if (paint_ccs) {
          cc.push_back(cur);
        }
        for (int dx = -kSearchRadius; dx <= kSearchRadius; dx++) {
          for (int dy = -kSearchRadius; dy <= kSearchRadius; dy++) {
            if (dx == 0 && dy == 0) continue;
            int next_first = static_cast<int32_t>(cur.first) + dx;
            int next_second = static_cast<int32_t>(cur.second) + dy;
            if (next_first < 0 || next_second < 0 ||
                static_cast<uint32_t>(next_first) >= opsin.xsize() ||
                static_cast<uint32_t>(next_second) >= opsin.ysize()) {
              continue;
            }
            std::pair<uint32_t, uint32_t> next{next_first, next_second};
            if (!is_background_row[next.second * is_background_stride +
                                   next.first]) {
              stack.push_back(next);
            } else {
              if (!found_border) {
                reference = next;
                found_border = true;
              } else {
                if (!is_similar_b(next, reference)) all_similar = false;
              }
            }
          }
        }
      }
      if (!found_border || !all_similar || max_x - min_x >= kMaxPatchSize ||
          max_y - min_y >= kMaxPatchSize) {
        continue;
      }
      size_t bpos = background_stride * reference.second + reference.first;
      float ref[3] = {background_rows[0][bpos], background_rows[1][bpos],
                      background_rows[2][bpos]};
      bool has_similar = false;
      for (size_t iy = std::max<int>(
               static_cast<int32_t>(min_y) - kHasSimilarRadius, 0);
           iy < std::min(max_y + kHasSimilarRadius + 1, opsin.ysize()); iy++) {
        for (size_t ix = std::max<int>(
                 static_cast<int32_t>(min_x) - kHasSimilarRadius, 0);
             ix < std::min(max_x + kHasSimilarRadius + 1, opsin.xsize());
             ix++) {
          size_t opos = opsin_stride * iy + ix;
          float px[3] = {opsin_rows[0][opos], opsin_rows[1][opos],
                         opsin_rows[2][opos]};
          if (pci.is_similar_v(ref, px, kHasSimilarThreshold)) {
            has_similar = true;
          }
        }
      }
      if (!has_similar) continue;
      info.emplace_back();
      info.back().second.emplace_back(min_x, min_y);
      QuantizedPatch& patch = info.back().first;
      patch.xsize = max_x - min_x + 1;
      patch.ysize = max_y - min_y + 1;
      int max_value = 0;
      for (size_t c : {1, 0, 2}) {
        for (size_t iy = min_y; iy <= max_y; iy++) {
          for (size_t ix = min_x; ix <= max_x; ix++) {
            size_t offset = (iy - min_y) * patch.xsize + ix - min_x;
            patch.fpixels[c][offset] =
                opsin_rows[c][iy * opsin_stride + ix] - ref[c];
            int val = pci.Quantize(patch.fpixels[c][offset], c);
            patch.pixels[c][offset] = val;
            if (std::abs(val) > max_value) max_value = std::abs(val);
          }
        }
      }
      if (max_value < kMinPeak) {
        info.pop_back();
        continue;
      }
      if (paint_ccs) {
        float cc_color = rng.UniformF(0.5, 1.0);
        for (std::pair<uint32_t, uint32_t> p : cc) {
          ccs.Row(p.second)[p.first] = cc_color;
        }
      }
    }
  }

I guess such a large loop nest, with the hottest loop not being the innermost,
is bad for register pressure.
Clang's code is:
  0.02 │1196:┌─→cmp          %r10,-0xb8(%rbp)
       │     │jxl::FindBestPatchDictionary(jxl::Image3<float> const&, jxl::PassesEncoderState*, JxlCmsInterface co▒
       │     │while (!stack.empty()) {
  1.39 │     │↓ je           1690
       │     │std::pair<uint32_t, uint32_t> cur = stack.back();
       │11a3:│  mov          -0x8(%r10),%rbx
  9.80 │     │  mov          -0x230(%rbp),%r14
       │     │__gnu_cxx::__normal_iterator<std::pair<unsigned int, unsigned int>*, std::vector<std::pair<unsigned ▒
       │     │{ return __normal_iterator(_M_current - __n); }
  0.86 │     │  add          $0xfffffffffffffff8,%r10
       │     │jxl::FindBestPatchDictionary(jxl::Image3<float> const&, jxl::PassesEncoderState*, JxlCmsInterface co▒
  0.36 │     │  mov          %rbx,%r13
  0.01 │     │  shr          $0x20,%r13
       │     │if (visited_row[cur.second * visited_stride + cur.first]) continue;
  2.69 │     │  mov          %ebx,%eax
  0.00 │     │  mov          %r13,%rcx
  1.12 │     │  imul         -0x180(%rbp),%rcx
  0.78 │     │  add          %rax,%rcx
  2.73 │     ├──cmpb         $0x0,(%r14,%rcx,1)
 19.91 │     └──jne          1196

While GCC's code is longer:
       │      while (!stack.empty()) {
       │1241:┌─→cmp         %rax,-0x370(%rbp)
  0.70 │     │↓ je          1481
       │     │std::pair<uint32_t, uint32_t> cur = stack.back();
  0.02 │124e:│  mov         -0x8(%rax),%rdx
       │     │std::vector<std::pair<unsigned int, unsigned int>, std::allocator<std::pair<unsigned int, unsigned int> > >::pop_back▒
       │     │--this->_M_impl._M_finish;
  0.63 │     │  sub         $0x8,%rax
  0.74 │     │  mov         %rax,-0x368(%rbp)
       │     │FindTextLikePatches():
  0.01 │     │  mov         %edx,%esi
  0.35 │     │  mov         %rdx,%rdi
  0.82 │     │  mov         %rdx,-0x490(%rbp)
       │     │if (visited_row[cur.second * visited_stride + cur.first]) continue;
  0.00 │     │  mov         -0x5f8(%rbp),%rdx
       │     │std::pair<uint32_t, uint32_t> cur = stack.back();
  1.32 │     │  shr         $0x20,%rdi
  0.11 │     │  mov         %rsi,%r15
  0.09 │     │  mov         %edi,%ecx
       │     │if (visited_row[cur.second * visited_stride + cur.first])
continue;                                                  ▒
  0.72 │     │  mov         -0x5f0(%rbp),%rdi                                  
                                                   ▒
       │     │std::pair<uint32_t, uint32_t> cur = stack.back();                
                                                   ▒
  0.03 │     │  mov         %rcx,%r12                                          
                                                   ▒
       │     │if (visited_row[cur.second * visited_stride + cur.first])
continue;                                                  ▒
  0.39 │     │  imul        %rcx,%rdx                                          
                                                   ▒
  1.00 │     │  add         %rsi,%rdx                                          
                                                   ▒
  1.35 │     │  add         %rdi,%rdx                                          
                                                   ▒
  1.37 │     ├──cmpb        $0x0,(%rdx)                                        
                                                   ▒
  9.56 │     └──jne         1241                                               


Moreover, we have heuristics saying that continue usually closes loops with a
low trip count.
-fno-ivopts improves performance with --num_threads=0; however, it makes the
number of instructions worse.  Curiously enough, with --num_threads=1 and
more, ivopts is beneficial.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:24 [Bug tree-optimization/109811] New: libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (5 preceding siblings ...)
  2023-05-17 12:35 ` [Bug target/109811] libjxl " hubicka at gcc dot gnu.org
@ 2023-05-17 13:59 ` juzhe.zhong at rivai dot ai
  2023-05-17 14:04 ` hubicka at gcc dot gnu.org
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2023-05-17 13:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

JuzheZhong <juzhe.zhong at rivai dot ai> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |juzhe.zhong at rivai dot ai

--- Comment #7 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Jan Hubicka from comment #6)
> hottest loop in clang's profile is:
>   for (size_t y = 0; y < opsin.ysize(); y++) {
>     for (size_t x = 0; x < opsin.xsize(); x++) {
>       if (is_background_row[y * is_background_stride + x]) continue;
>       cc.clear();
>       stack.clear();
>       stack.emplace_back(x, y);
>       size_t min_x = x;
>       size_t max_x = x;
>       size_t min_y = y;
>       size_t max_y = y;
>       std::pair<uint32_t, uint32_t> reference;
>       bool found_border = false;
>       bool all_similar = true;
>       while (!stack.empty()) {
>         std::pair<uint32_t, uint32_t> cur = stack.back();
>         stack.pop_back();
>         if (visited_row[cur.second * visited_stride + cur.first]) continue;
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> closed by this continue.
>         visited_row[cur.second * visited_stride + cur.first] = 1;
>         if (cur.first < min_x) min_x = cur.first;
>         if (cur.first > max_x) max_x = cur.first;
>         if (cur.second < min_y) min_y = cur.second;
>         if (cur.second > max_y) max_y = cur.second;
>         if (paint_ccs) {
>           cc.push_back(cur);
>         }
>         for (int dx = -kSearchRadius; dx <= kSearchRadius; dx++) {
>           for (int dy = -kSearchRadius; dy <= kSearchRadius; dy++) {
>             if (dx == 0 && dy == 0) continue;
>             int next_first = static_cast<int32_t>(cur.first) + dx;
>             int next_second = static_cast<int32_t>(cur.second) + dy;
>             if (next_first < 0 || next_second < 0 ||
>                 static_cast<uint32_t>(next_first) >= opsin.xsize() ||
>                 static_cast<uint32_t>(next_second) >= opsin.ysize()) {
>               continue;
>             }
>             std::pair<uint32_t, uint32_t> next{next_first, next_second};
>             if (!is_background_row[next.second * is_background_stride +
>                                    next.first]) {
>               stack.push_back(next);
>             } else {
>               if (!found_border) {
>                 reference = next;
>                 found_border = true;
>               } else {
>                 if (!is_similar_b(next, reference)) all_similar = false;
>               }
>             }
>           }
>         }
>       }
>       if (!found_border || !all_similar || max_x - min_x >= kMaxPatchSize ||
>           max_y - min_y >= kMaxPatchSize) {
>         continue;
>       }
>       size_t bpos = background_stride * reference.second + reference.first;
>       float ref[3] = {background_rows[0][bpos], background_rows[1][bpos],
>                       background_rows[2][bpos]};
>       bool has_similar = false;
>       for (size_t iy = std::max<int>(
>                static_cast<int32_t>(min_y) - kHasSimilarRadius, 0);
>            iy < std::min(max_y + kHasSimilarRadius + 1, opsin.ysize());
> iy++) {
>         for (size_t ix = std::max<int>(
>                  static_cast<int32_t>(min_x) - kHasSimilarRadius, 0);
>              ix < std::min(max_x + kHasSimilarRadius + 1, opsin.xsize());
>              ix++) {
>           size_t opos = opsin_stride * iy + ix;
>           float px[3] = {opsin_rows[0][opos], opsin_rows[1][opos],
>                          opsin_rows[2][opos]};
>           if (pci.is_similar_v(ref, px, kHasSimilarThreshold)) {
>             has_similar = true;
>           }
>         }
>       }
>       if (!has_similar) continue;
>       info.emplace_back();
>       info.back().second.emplace_back(min_x, min_y);
>       QuantizedPatch& patch = info.back().first;
>       patch.xsize = max_x - min_x + 1;
>       patch.ysize = max_y - min_y + 1;
>       int max_value = 0;
>       for (size_t c : {1, 0, 2}) {
>         for (size_t iy = min_y; iy <= max_y; iy++) {
>           for (size_t ix = min_x; ix <= max_x; ix++) {
>             size_t offset = (iy - min_y) * patch.xsize + ix - min_x;
>             patch.fpixels[c][offset] =
>                 opsin_rows[c][iy * opsin_stride + ix] - ref[c];
>             int val = pci.Quantize(patch.fpixels[c][offset], c);
>             patch.pixels[c][offset] = val;
>             if (std::abs(val) > max_value) max_value = std::abs(val);
>           }
>         }
>       }
>       if (max_value < kMinPeak) {
>         info.pop_back();
>         continue;
>       }
>       if (paint_ccs) {
>         float cc_color = rng.UniformF(0.5, 1.0);
>         for (std::pair<uint32_t, uint32_t> p : cc) {
>           ccs.Row(p.second)[p.first] = cc_color;
>         }
>       }
>     }
>   }
> 
> I guess such a large loop nest, with the hottest loop not being the
> innermost, is bad for register pressure.
> Clang's code is:
>   0.02 │1196:┌─→cmp          %r10,-0xb8(%rbp)
>        │     │jxl::FindBestPatchDictionary(jxl::Image3<float> const&, jxl::PassesEncoderState*, JxlCmsInterface co
>        │     │while (!stack.empty()) {
>   1.39 │     │↓ je           1690
>        │     │std::pair<uint32_t, uint32_t> cur = stack.back();
>        │11a3:│  mov          -0x8(%r10),%rbx
>   9.80 │     │  mov          -0x230(%rbp),%r14
>        │     │__gnu_cxx::__normal_iterator<std::pair<unsigned int, unsigned int>*, std::vector<std::pair<unsigned
>        │     │{ return __normal_iterator(_M_current - __n); }
>   0.86 │     │  add          $0xfffffffffffffff8,%r10
>        │     │jxl::FindBestPatchDictionary(jxl::Image3<float> const&, jxl::PassesEncoderState*, JxlCmsInterface co
>   0.36 │     │  mov          %rbx,%r13
>   0.01 │     │  shr          $0x20,%r13
>        │     │if (visited_row[cur.second * visited_stride + cur.first]) continue;
>   2.69 │     │  mov          %ebx,%eax
>   0.00 │     │  mov          %r13,%rcx
>   1.12 │     │  imul         -0x180(%rbp),%rcx
>   0.78 │     │  add          %rax,%rcx
>   2.73 │     ├──cmpb         $0x0,(%r14,%rcx,1)
>  19.91 │     └──jne          1196
> 
> While GCC is longer:
> Percent│      while (!stack.empty()) {
>        │1241:┌─→cmp         %rax,-0x370(%rbp)
>   0.70 │     │↓ je          1481
>        │     │std::pair<uint32_t, uint32_t> cur = stack.back();
>   0.02 │124e:│  mov         -0x8(%rax),%rdx
>        │     │std::vector<std::pair<unsigned int, unsigned int>, std::allocator<std::pair<unsigned int, unsigned int> > >::pop_back
>        │     │--this->_M_impl._M_finish;
>   0.63 │     │  sub         $0x8,%rax
>   0.74 │     │  mov         %rax,-0x368(%rbp)
>        │     │FindTextLikePatches():
>   0.01 │     │  mov         %edx,%esi
>   0.35 │     │  mov         %rdx,%rdi
>   0.82 │     │  mov         %rdx,-0x490(%rbp)
>        │     │if (visited_row[cur.second * visited_stride + cur.first]) continue;
>   0.00 │     │  mov         -0x5f8(%rbp),%rdx
>        │     │std::pair<uint32_t, uint32_t> cur = stack.back();
>   1.32 │     │  shr         $0x20,%rdi
>   0.11 │     │  mov         %rsi,%r15
>   0.09 │     │  mov         %edi,%ecx
>        │     │if (visited_row[cur.second * visited_stride + cur.first]) continue;
>   0.72 │     │  mov         -0x5f0(%rbp),%rdi
>        │     │std::pair<uint32_t, uint32_t> cur = stack.back();
>   0.03 │     │  mov         %rcx,%r12
>        │     │if (visited_row[cur.second * visited_stride + cur.first]) continue;
>   0.39 │     │  imul        %rcx,%rdx
>   1.00 │     │  add         %rsi,%rdx
>   1.35 │     │  add         %rdi,%rdx
>   1.37 │     ├──cmpb        $0x0,(%rdx)
>   9.56 │     └──jne         1241
> 
> 
> Moreover, we have heuristics saying that continue usually closes loops with
> a low trip count.
> -fno-ivopts improves performance with --num_threads=0; however, it makes
> the number of instructions worse.  Curiously enough, with --num_threads=1
> and more, ivopts is beneficial.


Hi, thanks for the analysis.
It seems that Clang has better performance than GCC when the vectorizer is
disabled?  What about with the vectorizer enabled?

We have run a lot of benchmarks comparing our downstream RISC-V Clang against
GCC.  From my observation, Clang does much better than GCC on scalar code.

For example, in mlperf we found Clang to be 40% better than GCC when the
vectorizer is disabled.  However, with vectorization enabled, GCC does better
than Clang overall, from my observation.

Thanks

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:24 [Bug tree-optimization/109811] New: libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (6 preceding siblings ...)
  2023-05-17 13:59 ` juzhe.zhong at rivai dot ai
@ 2023-05-17 14:04 ` hubicka at gcc dot gnu.org
  2023-05-17 14:10 ` jakub at gcc dot gnu.org
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-05-17 14:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #8 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Created attachment 55101
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55101&action=edit
hottest loop

The jpegxl build machinery adds -fno-vectorize and -fno-slp-vectorize to the
Clang flags.  Adding -fno-tree-vectorize -fno-tree-slp-vectorize makes the
GCC-generated code more similar.  With this, most of the difference is caused
by FindBestPatchDictionary, or by FindTextLikePatches if that function is not
inlined.

  15.22%  cjxl     libjxl.so.0.7.0       [.] jxl::(anonymous namespace)::FindTextLikePatches
  10.19%  cjxl     libjxl.so.0.7.0       [.] jxl::FindBestPatchDictionary
   5.27%  cjxl     libjxl.so.0.7.0       [.] jxl::N_AVX2::QuantizeBlockAC
   5.06%  cjxl     libjxl.so.0.7.0       [.] jxl::N_AVX2::EstimateEntropy
   4.82%  cjxl     libjxl.so.0.7.0       [.] jxl::N_AVX2::EstimateEntropy
   4.35%  cjxl     libjxl.so.0.7.0       [.] jxl::N_AVX2::QuantizeBlockAC
   4.21%  cjxl     libjxl.so.0.7.0       [.] jxl::N_AVX2::(anonymous namespace)::TransformFromPixels
   3.87%  cjxl     libjxl.so.0.7.0       [.] jxl::N_AVX2::(anonymous namespace)::TransformFromPixels
   3.78%  cjxl     libjxl.so.0.7.0       [.] jxl::N_AVX2::FindBestMultiplier
   3.27%  cjxl     libjxl.so.0.7.0       [.] jxl::N_AVX2::FindBestMultiplier

I think it is mostly register allocation not handling the inner loop quoted
above well.  I am attaching the preprocessed sources.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:24 [Bug tree-optimization/109811] New: libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (7 preceding siblings ...)
  2023-05-17 14:04 ` hubicka at gcc dot gnu.org
@ 2023-05-17 14:10 ` jakub at gcc dot gnu.org
  2023-05-17 14:58 ` hubicka at gcc dot gnu.org
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-05-17 14:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #9 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to JuzheZhong from comment #7)
> It seems that Clang has better performance than GCC when the vectorizer is
> disabled?

That is a very general statement.  On some particular code, some particular
arch, with some particular flags, Clang performs better than GCC; on other
code it is the other way around, and on some it is a wash.  How the compilers
perform on larger amounts of code can be seen from standard benchmarks like
SPEC.  The Phoronix benchmark suite is known not to be a very good benchmark
for various reasons, but that doesn't mean it isn't worth looking at.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:24 [Bug tree-optimization/109811] New: libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (8 preceding siblings ...)
  2023-05-17 14:10 ` jakub at gcc dot gnu.org
@ 2023-05-17 14:58 ` hubicka at gcc dot gnu.org
  2023-05-18 13:40 ` hubicka at gcc dot gnu.org
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-05-17 14:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #10 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Actually, vectorization hurts with both compilers, and a bit more with Clang.
It seems that all the important loops are hand-vectorized, and since register
pressure is a problem, vectorizing other loops causes enough collateral
damage to register allocation to regress performance.

I believe the core of the problem (or at least one of them) is simply the way
we compile loops popping data from a std::vector-based stack; see PR109849.
We keep updating the stack data structure in the innermost loop because in
the not-too-common case a reallocation needs to be done, and that is handled
by offlined code.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:24 [Bug tree-optimization/109811] New: libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (9 preceding siblings ...)
  2023-05-17 14:58 ` hubicka at gcc dot gnu.org
@ 2023-05-18 13:40 ` hubicka at gcc dot gnu.org
  2023-06-19 16:28 ` cvs-commit at gcc dot gnu.org
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-05-18 13:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #11 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
I got -fprofile-use builds working, and with a profile we peel the innermost
loop 8 times, which actually gets it off the hottest spot.
We also get more selective about what to inline (we do not inline cold
calls), which may make the dependency on SRA even worse.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:24 [Bug tree-optimization/109811] New: libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (10 preceding siblings ...)
  2023-05-18 13:40 ` hubicka at gcc dot gnu.org
@ 2023-06-19 16:28 ` cvs-commit at gcc dot gnu.org
  2023-11-02 14:12 ` hubicka at gcc dot gnu.org
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-06-19 16:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #12 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jan Hubicka <hubicka@gcc.gnu.org>:

https://gcc.gnu.org/g:7b34cacc5735385e7e2855d7c0a6fad60ef4a99b

commit r14-1951-g7b34cacc5735385e7e2855d7c0a6fad60ef4a99b
Author: Jan Hubicka <jh@suse.cz>
Date:   Mon Jun 19 18:28:17 2023 +0200

    optimize std::max early

    we currently produce very bad code on loops using std::vector as a stack,
    since we fail to inline push_back, which in turn prevents SRA, and we
    fail to optimize out some store-to-load pairs.

    I looked into why this function is not inlined while it is inlined by
    clang.  We currently estimate it to 66 instructions and the inline limits
    are 15 at -O2 and 30 at -O3.  Clang has a similar estimate, but still
    decides to inline at -O2.

    I looked into the reason why the body is so large, and one problem I
    spotted is the way std::max is implemented, taking and returning
    references to the values:

      const T& max( const T& a, const T& b );

    This makes it necessary to store the values to memory and load them
    later, and max is used by the code computing the new size of a vector on
    resize.

    We optimize this to MAX_EXPR, but only during late optimizations.  I
    think this is a common enough coding pattern and we ought to make it
    transparent to early opts and IPA.  The following is the easiest fix: it
    simply adds the phiprop pass, which turns PHIs of address values into
    PHIs of values, to the early optimization passes, so that later FRE can
    propagate values across memory, phiopt can discover the MAX_EXPR pattern,
    and DSE can remove the memory stores.

    gcc/ChangeLog:

            PR tree-optimization/109811
            PR tree-optimization/109849
            * passes.def: Add phiprop to early optimization passes.
            * tree-ssa-phiprop.cc: Allow cloning.

    gcc/testsuite/ChangeLog:

            PR tree-optimization/109811
            PR tree-optimization/109849
            * gcc.dg/tree-ssa/phiprop-1.c: New test.
            * gcc.dg/tree-ssa/pr21463.c: Adjust template.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:24 [Bug tree-optimization/109811] New: libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (11 preceding siblings ...)
  2023-06-19 16:28 ` cvs-commit at gcc dot gnu.org
@ 2023-11-02 14:12 ` hubicka at gcc dot gnu.org
  2023-11-21 14:17 ` cvs-commit at gcc dot gnu.org
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-11-02 14:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #13 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
So I re-tested it with current mainline and clang 16/17

For mainline I get (megapixels per second, bigger is better):
        13.39
        13.38
        13.42
clang 16:
        20.06
        20.06
        19.87
clang 17:
        19.7
        19.68
        19.69


With mainline plus Martin's patch to enable SRA across calls where the
parameter doesn't escape (an improvement for PR109849) I get:
        19.37
        19.35
        19.31
This is without inlining _M_realloc_insert, which we do at -O3 but not at
-O2, since it is large (Clang inlines it at both -O2 and -O3).

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:24 [Bug tree-optimization/109811] New: libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (12 preceding siblings ...)
  2023-11-02 14:12 ` hubicka at gcc dot gnu.org
@ 2023-11-21 14:17 ` cvs-commit at gcc dot gnu.org
  2023-11-24 23:13 ` hubicka at gcc dot gnu.org
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-11-21 14:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #14 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jan Hubicka <hubicka@gcc.gnu.org>:

https://gcc.gnu.org/g:1d82fc2e6824bf83159389729c31a942f7b91b04

commit r14-5679-g1d82fc2e6824bf83159389729c31a942f7b91b04
Author: Jan Hubicka <jh@suse.cz>
Date:   Tue Nov 21 15:17:16 2023 +0100

    optimize std::vector::push_back

    This patch speeds up push_back at -O3 significantly by making the
    reallocation path inlined by default.  _M_realloc_insert is a general
    insertion routine that takes an iterator pointing to the location where
    the value should be inserted.  As such it contains code to move other
    entries around that is quite large.

    Since appending to the end of the array is a common operation, I think
    we should have specialized code for that.  Sadly it is really hard to
    work this out from the IPA passes, since we basically care whether the
    iterator points to the same place as the end pointer, both of which are
    passed by reference.  This is inter-procedural value numbering that is
    quite out of reach.

    I also added an extra check making it clear that the new length of the
    vector is non-zero.  This saves extra conditionals.  Again it is quite a
    hard case, since _M_check_len seems to be able to return 0 if its
    parameter is 0.  This never happens here, but we are not able to
    propagate this early nor at the IPA stage.

    libstdc++-v3/ChangeLog:

            PR libstdc++/110287
            PR middle-end/109811
            PR middle-end/109849
            * include/bits/stl_vector.h (_M_realloc_append): New member
            function.
            (push_back): Use it.
            * include/bits/vector.tcc (emplace_back): Use it.
            (_M_realloc_insert): Let compiler know that new vector size is
            non-zero.
            (_M_realloc_append): New member function.
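
The idea behind the commit can be sketched in a few lines. This is an
illustrative toy, not the actual libstdc++ code: the point is that an
append-at-end slow path never has to shift existing elements, so it stays
small, and the growth step makes the new capacity provably non-zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical minimal vector<int> illustrating _M_realloc_append:
// appending at the end only grows, copies, and places the new value
// last -- no element shifting as in general _M_realloc_insert.
struct MiniVec {
    int*        data = nullptr;
    std::size_t len  = 0;
    std::size_t cap  = 0;

    // Cold path, used only when appending past capacity.
    void realloc_append(int v) {
        // Growing always yields a non-zero capacity, mirroring the
        // "new length is non-zero" hint described in the commit.
        std::size_t new_cap = cap ? 2 * cap : 1;
        int* p = new int[new_cap];
        if (data) {
            std::memcpy(p, data, len * sizeof(int));
            delete[] data;
        }
        p[len++] = v;
        data = p;
        cap  = new_cap;
    }

    // Hot path: a bounds check and a store, cheap to inline.
    void push_back(int v) {
        if (len < cap) { data[len++] = v; return; }
        realloc_append(v);
    }
};
```

The fast path compiles to a compare and a store, so keeping the cold path
out of it is what makes inlining at -O2 attractive.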


* [Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:24 [Bug tree-optimization/109811] New: libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (13 preceding siblings ...)
  2023-11-21 14:17 ` cvs-commit at gcc dot gnu.org
@ 2023-11-24 23:13 ` hubicka at gcc dot gnu.org
  2023-11-24 23:23 ` sjames at gcc dot gnu.org
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-11-24 23:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #15 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
With the SRA improvements in r:aae723d360ca26cd9fd0b039fb0a616bd0eae363 we
finally get good performance at -O2. Improvements to the push_back
implementation also help a bit.

Mainline with default flags (-O2):
    Input: JPEG - Quality: 90:
        19.76
        19.75
        19.68
Mainline with -O2 -march=native:
    Input: JPEG - Quality: 90:
        20.01
        20
        19.98
Mainline with -O2 -march=native -flto
    Input: JPEG - Quality: 90:
        19.95
        19.98
        19.81
Mainline with -O2 -march=native -flto --param max-inline-insns-auto=80 (this
makes push_back inlined)
    Input: JPEG - Quality: 90:
        19.98
        20.05
        20.03
Mainline with -O2 -flto  -march=native -I/usr/include/c++/v1 -nostdinc++ -lc++
(so clang's libc++)
        21.38
        21.37
        21.32
Mainline with -O2 -flto  -march=native run manually since build machinery patch
is needed
        23.03
        22.85
        23.04
Clang 17 with -O2 -march=native -flto and also -fno-tree-vectorize
-fno-tree-slp-vectorize added by cmake. This is with system libstdc++ from
GCC13 so before push_back improvements.
        21.16
        20.95
        21.06
Clang 17 with -O2 -march=native -flto and also -fno-tree-vectorize
-fno-tree-slp-vectorize added by cmake. This is with trunk libstdc++ with
push_back improvements.
        21.2
        20.93
        20.98
Clang 17 with -O2 -march=native -flto -stdlib=libc++ and also
-fno-tree-vectorize -fno-tree-slp-vectorize added by cmake. This is with
clang's libc++
    Input: JPEG - Quality: 90:
        22.08
        21.88
        21.78
Clang 17 with -O3 -march=native -flto
        23.08
        22.90
        22.84


libc++ declares push_back always_inline and splits out the slow copying path. I
think the inlined part is still a bit too large for inlining at -O2.

We could still try to get the remaining ~10% without increasing code size at
-O2. However, the major part of the problem is solved.
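
The libc++ pattern described here can be sketched as follows. The names and
attributes are made up for illustration (this is not libc++'s code): the tiny
capacity check is forced inline at every call site, while the bulky copying
path is outlined so its size does not count against the caller's inlining
budget.

```cpp
#include <cassert>

// Illustrative hot/cold split, in the spirit of libc++'s push_back.
struct SmallBuf {
    static const int N = 4;
    int  inline_storage[N];
    int* data = inline_storage;
    int  len = 0, cap = N;

    // Cold, outlined: reallocation and element copying live here.
    __attribute__((noinline)) void append_slow(int v) {
        int* p = new int[2 * cap];
        for (int i = 0; i < len; ++i) p[i] = data[i];
        if (data != inline_storage) delete[] data;
        data = p;
        cap *= 2;
        data[len++] = v;
    }

    // Hot, forced inline: just a bounds check and a store.
    __attribute__((always_inline)) inline void append(int v) {
        if (len < cap) { data[len++] = v; return; }
        append_slow(v);
    }
};
```

The attributes are GCC/clang extensions; the same split also helps the
compiler's own heuristics even without always_inline, because the hot
function body is genuinely small.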


* [Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:24 [Bug tree-optimization/109811] New: libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (14 preceding siblings ...)
  2023-11-24 23:13 ` hubicka at gcc dot gnu.org
@ 2023-11-24 23:23 ` sjames at gcc dot gnu.org
  2023-11-25  6:22 ` aros at gmx dot com
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: sjames at gcc dot gnu.org @ 2023-11-24 23:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #16 from Sam James <sjames at gcc dot gnu.org> ---
Thank you all for the effort.


* [Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:24 [Bug tree-optimization/109811] New: libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (15 preceding siblings ...)
  2023-11-24 23:23 ` sjames at gcc dot gnu.org
@ 2023-11-25  6:22 ` aros at gmx dot com
  2023-11-25 13:31 ` hubicka at gcc dot gnu.org
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: aros at gmx dot com @ 2023-11-25  6:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #17 from Artem S. Tashkinov <aros at gmx dot com> ---
Terrific results, thanks a ton!

Maybe this bug report could be closed now?


* [Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:24 [Bug tree-optimization/109811] New: libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (16 preceding siblings ...)
  2023-11-25  6:22 ` aros at gmx dot com
@ 2023-11-25 13:31 ` hubicka at gcc dot gnu.org
  2024-01-05 21:31 ` hubicka at gcc dot gnu.org
  2024-01-20 17:23 ` pinskia at gcc dot gnu.org
  19 siblings, 0 replies; 21+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-11-25 13:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #18 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
I made a typo:

Mainline with -O2 -flto  -march=native run manually since build machinery patch
is needed
        23.03
        22.85
        23.04

It should be:
Mainline with -O3 -flto  -march=native run manually since build machinery patch
is needed
        23.03
        22.85
        23.04

So with -O2 we still get a slightly lower score than clang; with -O3 we are
slightly better. push_back inlining does not seem to be the problem (as tested
by increasing limits), so perhaps it is the more aggressive
unrolling/vectorization settings clang has at -O2.

I think upstream jpegxl should use -O3 or -Ofast instead of -O2.  It is quite a
typical kind of task that benefits from higher optimization levels.
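
For reference, overriding the optimization level in a CMake build would look
roughly like this. The source directory name is an assumption for
illustration, not taken from this thread; the cache variables are standard
CMake.

```shell
# Hedged sketch: build at -O3 instead of a project-default -O2.
cmake -S libjxl -B build \
      -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_C_FLAGS_RELEASE="-O3 -march=native" \
      -DCMAKE_CXX_FLAGS_RELEASE="-O3 -march=native"
cmake --build build -j"$(nproc)"
```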

I filed https://github.com/libjxl/libjxl/issues/2970


* [Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:24 [Bug tree-optimization/109811] New: libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (17 preceding siblings ...)
  2023-11-25 13:31 ` hubicka at gcc dot gnu.org
@ 2024-01-05 21:31 ` hubicka at gcc dot gnu.org
  2024-01-20 17:23 ` pinskia at gcc dot gnu.org
  19 siblings, 0 replies; 21+ messages in thread
From: hubicka at gcc dot gnu.org @ 2024-01-05 21:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #19 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
I think we can declare this one fixed.


* [Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
  2023-05-11 14:24 [Bug tree-optimization/109811] New: libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
                   ` (18 preceding siblings ...)
  2024-01-05 21:31 ` hubicka at gcc dot gnu.org
@ 2024-01-20 17:23 ` pinskia at gcc dot gnu.org
  19 siblings, 0 replies; 21+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-01-20 17:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |14.0


end of thread, other threads:[~2024-01-20 17:23 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-11 14:24 [Bug tree-optimization/109811] New: libxjl 0.7 is a lot slower in GCC 13.1 vs Clang 16 aros at gmx dot com
2023-05-11 15:30 ` [Bug target/109811] " xry111 at gcc dot gnu.org
2023-05-11 15:49 ` aros at gmx dot com
2023-05-12  6:59 ` rguenth at gcc dot gnu.org
2023-05-17  8:42 ` hubicka at gcc dot gnu.org
2023-05-17  9:08 ` hubicka at gcc dot gnu.org
2023-05-17 12:35 ` [Bug target/109811] libjxl " hubicka at gcc dot gnu.org
2023-05-17 13:59 ` juzhe.zhong at rivai dot ai
2023-05-17 14:04 ` hubicka at gcc dot gnu.org
2023-05-17 14:10 ` jakub at gcc dot gnu.org
2023-05-17 14:58 ` hubicka at gcc dot gnu.org
2023-05-18 13:40 ` hubicka at gcc dot gnu.org
2023-06-19 16:28 ` cvs-commit at gcc dot gnu.org
2023-11-02 14:12 ` hubicka at gcc dot gnu.org
2023-11-21 14:17 ` cvs-commit at gcc dot gnu.org
2023-11-24 23:13 ` hubicka at gcc dot gnu.org
2023-11-24 23:23 ` sjames at gcc dot gnu.org
2023-11-25  6:22 ` aros at gmx dot com
2023-11-25 13:31 ` hubicka at gcc dot gnu.org
2024-01-05 21:31 ` hubicka at gcc dot gnu.org
2024-01-20 17:23 ` pinskia at gcc dot gnu.org
