public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
From: "juzhe.zhong at rivai dot ai" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
Date: Wed, 17 May 2023 13:59:56 +0000	[thread overview]
Message-ID: <bug-109811-4-eGucVsuJOh@http.gcc.gnu.org/bugzilla/> (raw)
In-Reply-To: <bug-109811-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

JuzheZhong <juzhe.zhong at rivai dot ai> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |juzhe.zhong at rivai dot ai

--- Comment #7 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Jan Hubicka from comment #6)
> hottest loop in clang's profile is:
>   for (size_t y = 0; y < opsin.ysize(); y++) {
>     for (size_t x = 0; x < opsin.xsize(); x++) {
>       if (is_background_row[y * is_background_stride + x]) continue;
>       cc.clear();
>       stack.clear();
>       stack.emplace_back(x, y);
>       size_t min_x = x;
>       size_t max_x = x;
>       size_t min_y = y;
>       size_t max_y = y;
>       std::pair<uint32_t, uint32_t> reference;
>       bool found_border = false;
>       bool all_similar = true;
>       while (!stack.empty()) {
>         std::pair<uint32_t, uint32_t> cur = stack.back();
>         stack.pop_back();
>         if (visited_row[cur.second * visited_stride + cur.first]) continue;
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> closed by this continue.
>         visited_row[cur.second * visited_stride + cur.first] = 1;
>         if (cur.first < min_x) min_x = cur.first;
>         if (cur.first > max_x) max_x = cur.first;
>         if (cur.second < min_y) min_y = cur.second;
>         if (cur.second > max_y) max_y = cur.second;
>         if (paint_ccs) {
>           cc.push_back(cur);
>         }
>         for (int dx = -kSearchRadius; dx <= kSearchRadius; dx++) {
>           for (int dy = -kSearchRadius; dy <= kSearchRadius; dy++) {
>             if (dx == 0 && dy == 0) continue;
>             int next_first = static_cast<int32_t>(cur.first) + dx;
>             int next_second = static_cast<int32_t>(cur.second) + dy;
>             if (next_first < 0 || next_second < 0 ||
>                 static_cast<uint32_t>(next_first) >= opsin.xsize() ||
>                 static_cast<uint32_t>(next_second) >= opsin.ysize()) {
>               continue;
>             }
>             std::pair<uint32_t, uint32_t> next{next_first, next_second};
>             if (!is_background_row[next.second * is_background_stride +
>                                    next.first]) {
>               stack.push_back(next);
>             } else {
>               if (!found_border) {
>                 reference = next;
>                 found_border = true;
>               } else {
>                 if (!is_similar_b(next, reference)) all_similar = false;
>               }
>             }
>           }
>         }
>       }
>       if (!found_border || !all_similar || max_x - min_x >= kMaxPatchSize ||
>           max_y - min_y >= kMaxPatchSize) {
>         continue;
>       }
>       size_t bpos = background_stride * reference.second + reference.first;
>       float ref[3] = {background_rows[0][bpos], background_rows[1][bpos],
>                       background_rows[2][bpos]};
>       bool has_similar = false;
>       for (size_t iy = std::max<int>(
>                static_cast<int32_t>(min_y) - kHasSimilarRadius, 0);
>            iy < std::min(max_y + kHasSimilarRadius + 1, opsin.ysize());
> iy++) {
>         for (size_t ix = std::max<int>(
>                  static_cast<int32_t>(min_x) - kHasSimilarRadius, 0);
>              ix < std::min(max_x + kHasSimilarRadius + 1, opsin.xsize());
>              ix++) {
>           size_t opos = opsin_stride * iy + ix;
>           float px[3] = {opsin_rows[0][opos], opsin_rows[1][opos],
>                          opsin_rows[2][opos]};
>           if (pci.is_similar_v(ref, px, kHasSimilarThreshold)) {
>             has_similar = true;
>           }
>         }
>       }
>       if (!has_similar) continue;
>       info.emplace_back();
>       info.back().second.emplace_back(min_x, min_y);
>       QuantizedPatch& patch = info.back().first;
>       patch.xsize = max_x - min_x + 1;
>       patch.ysize = max_y - min_y + 1;
>       int max_value = 0;
>       for (size_t c : {1, 0, 2}) {
>         for (size_t iy = min_y; iy <= max_y; iy++) {
>           for (size_t ix = min_x; ix <= max_x; ix++) {
>             size_t offset = (iy - min_y) * patch.xsize + ix - min_x;
>             patch.fpixels[c][offset] =
>                 opsin_rows[c][iy * opsin_stride + ix] - ref[c];
>             int val = pci.Quantize(patch.fpixels[c][offset], c);
>             patch.pixels[c][offset] = val;
>             if (std::abs(val) > max_value) max_value = std::abs(val);
>           }
>         }
>       }
>       if (max_value < kMinPeak) {
>         info.pop_back();
>         continue;
>       }
>       if (paint_ccs) {
>         float cc_color = rng.UniformF(0.5, 1.0);
>         for (std::pair<uint32_t, uint32_t> p : cc) {
>           ccs.Row(p.second)[p.first] = cc_color;
>         }
>       }
>     }
>   }
> 
> I guess such a large loop nest with hottest loop not being the innermost is
> bad for register pressure. 
> Clangs code is :
>   0.02 │1196:┌─→cmp          %r10,-0xb8(%rbp)                               
> ▒
>        │     │jxl::FindBestPatchDictionary(jxl::Image3<float> const&,
> jxl::PassesEncoderState*, JxlCmsInterface co▒
>        │     │while (!stack.empty()) {                                      
> ◆
>   1.39 │     │↓ je           1690                                           
> ▒
>        │     │std::pair<uint32_t, uint32_t> cur = stack.back();             
> ▒
>        │11a3:│  mov          -0x8(%r10),%rbx                                
> ▒
>   9.80 │     │  mov          -0x230(%rbp),%r14                              
> ▒
>        │     │__gnu_cxx::__normal_iterator<std::pair<unsigned int, unsigned
> int>*, std::vector<std::pair<unsigned ▒
>        │     │{ return __normal_iterator(_M_current - __n); }               
> ▒
>   0.86 │     │  add          $0xfffffffffffffff8,%r10                       
> ▒
>        │     │jxl::FindBestPatchDictionary(jxl::Image3<float> const&,
> jxl::PassesEncoderState*, JxlCmsInterface co▒
>   0.36 │     │  mov          %rbx,%r13                                      
> ▒
>   0.01 │     │  shr          $0x20,%r13                                     
> ▒
>        │     │if (visited_row[cur.second * visited_stride + cur.first])
> continue;                                 ▒
>   2.69 │     │  mov          %ebx,%eax                                      
> ▒
>   0.00 │     │  mov          %r13,%rcx                                      
> ▒
>   1.12 │     │  imul         -0x180(%rbp),%rcx                              
> ▒
>   0.78 │     │  add          %rax,%rcx                                      
> ▒
>   2.73 │     ├──cmpb         $0x0,(%r14,%rcx,1)                             
> ▒
>  19.91 │     └──jne          1196                                           
> ▒
> 
> While GCC is longer:
> Percent│     Percent│      while (!stack.empty()) {                         
> ▒
>        │1241:┌─→cmp         %rax,-0x370(%rbp)                               
> ▒
>   0.70 │     │↓ je          1481                                            
> ▒
>        │     │std::pair<uint32_t, uint32_t> cur = stack.back();             
> ▒
>   0.02 │124e:│  mov         -0x8(%rax),%rdx                                 
> ▒
>        │     │std::vector<std::pair<unsigned int, unsigned int>,
> std::allocator<std::pair<unsigned int, unsigned int> > >::pop_back▒
>        │     │--this->_M_impl._M_finish;                                    
> ▒
>   0.63 │     │  sub         $0x8,%rax                                       
> ▒
>   0.74 │     │  mov         %rax,-0x368(%rbp)                               
> ▒
>        │     │FindTextLikePatches():                                        
> ▒
>   0.01 │     │  mov         %edx,%esi                                       
> ◆
>   0.35 │     │  mov         %rdx,%rdi                                       
> ▒
>   0.82 │     │  mov         %rdx,-0x490(%rbp)                               
> ▒
>        │     │if (visited_row[cur.second * visited_stride + cur.first])
> continue;                                                  ▒
>   0.00 │     │  mov         -0x5f8(%rbp),%rdx                               
> ▒
>        │     │std::pair<uint32_t, uint32_t> cur = stack.back();             
> ▒
>   1.32 │     │  shr         $0x20,%rdi                                      
> ▒
>   0.11 │     │  mov         %rsi,%r15                                       
> ▒
>   0.09 │     │  mov         %edi,%ecx                                       
> ▒
>        │     │if (visited_row[cur.second * visited_stride + cur.first])
> continue;                                                  ▒
>   0.72 │     │  mov         -0x5f0(%rbp),%rdi                               
> ▒
>        │     │std::pair<uint32_t, uint32_t> cur = stack.back();             
> ▒
>   0.03 │     │  mov         %rcx,%r12                                       
> ▒
>        │     │if (visited_row[cur.second * visited_stride + cur.first])
> continue;                                                  ▒
>   0.39 │     │  imul        %rcx,%rdx                                       
> ▒
>   1.00 │     │  add         %rsi,%rdx                                       
> ▒
>   1.35 │     │  add         %rdi,%rdx                                       
> ▒
>   1.37 │     ├──cmpb        $0x0,(%rdx)                                     
> ▒
>   9.56 │     └──jne         1241                                            
> ▒ while (!stack.empty()) {                                                  
> ▒
>        │1241:┌─→cmp         %rax,-0x370(%rbp)                               
> ▒
>   0.70 │     │↓ je          1481                                            
> ▒
>        │     │std::pair<uint32_t, uint32_t> cur = stack.back();             
> ▒
>   0.02 │124e:│  mov         -0x8(%rax),%rdx                                 
> ▒
>        │     │std::vector<std::pair<unsigned int, unsigned int>,
> std::allocator<std::pair<unsigned int, unsigned int> > >::pop_back▒
>        │     │--this->_M_impl._M_finish;                                    
> ▒
>   0.63 │     │  sub         $0x8,%rax                                       
> ▒
>   0.74 │     │  mov         %rax,-0x368(%rbp)                               
> ▒
>        │     │FindTextLikePatches():                                        
> ▒
>   0.01 │     │  mov         %edx,%esi                                       
> ◆
>   0.35 │     │  mov         %rdx,%rdi                                       
> ▒
>   0.82 │     │  mov         %rdx,-0x490(%rbp)                               
> ▒
>        │     │if (visited_row[cur.second * visited_stride + cur.first])
> continue;                                                  ▒
>   0.00 │     │  mov         -0x5f8(%rbp),%rdx                               
> ▒
>        │     │std::pair<uint32_t, uint32_t> cur = stack.back();             
> ▒
>   1.32 │     │  shr         $0x20,%rdi                                      
> ▒
>   0.11 │     │  mov         %rsi,%r15                                       
> ▒
>   0.09 │     │  mov         %edi,%ecx                                       
> ▒
>        │     │if (visited_row[cur.second * visited_stride + cur.first])
> continue;                                                  ▒
>   0.72 │     │  mov         -0x5f0(%rbp),%rdi                               
> ▒
>        │     │std::pair<uint32_t, uint32_t> cur = stack.back();             
> ▒
>   0.03 │     │  mov         %rcx,%r12                                       
> ▒
>        │     │if (visited_row[cur.second * visited_stride + cur.first])
> continue;                                                  ▒
>   0.39 │     │  imul        %rcx,%rdx                                       
> ▒
>   1.00 │     │  add         %rsi,%rdx                                       
> ▒
>   1.35 │     │  add         %rdi,%rdx                                       
> ▒
>   1.37 │     ├──cmpb        $0x0,(%rdx)                                     
> ▒
>   9.56 │     └──jne         1241                                            
> ▒
> 
> 
> Moreover we have heuristics saying that continue usually closes loops with
> low trip count.
> -fno-ivopts improves performance with --num_threads=0 however makes number
> of instructions worse. Curiously enought with --num_threads=1 and more
> ivopts is benefical.


Hi, thanks for the analysis.
It seems that Clang has better performance than GCC in case of no vectorizer?
What about with vectorizer ?

Well, we have a lot testing benchmark between our downstream RISCV Clang vs
GCC.
From my observation, Clang do much better than GCC in case of scalar code.

For example, we found in mlperf, Clang 40% better than GCC when specifying
no-vectorizer. However, when enabling vectorization, gcc does better than clang
overall from my observation.

Thanks

  parent reply	other threads:[~2023-05-17 13:59 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-11 14:24 [Bug tree-optimization/109811] New: libxjl " aros at gmx dot com
2023-05-11 15:30 ` [Bug target/109811] " xry111 at gcc dot gnu.org
2023-05-11 15:49 ` aros at gmx dot com
2023-05-12  6:59 ` rguenth at gcc dot gnu.org
2023-05-17  8:42 ` hubicka at gcc dot gnu.org
2023-05-17  9:08 ` hubicka at gcc dot gnu.org
2023-05-17 12:35 ` [Bug target/109811] libjxl " hubicka at gcc dot gnu.org
2023-05-17 13:59 ` juzhe.zhong at rivai dot ai [this message]
2023-05-17 14:04 ` hubicka at gcc dot gnu.org
2023-05-17 14:10 ` jakub at gcc dot gnu.org
2023-05-17 14:58 ` hubicka at gcc dot gnu.org
2023-05-18 13:40 ` hubicka at gcc dot gnu.org
2023-06-19 16:28 ` cvs-commit at gcc dot gnu.org
2023-11-02 14:12 ` hubicka at gcc dot gnu.org
2023-11-21 14:17 ` cvs-commit at gcc dot gnu.org
2023-11-24 23:13 ` hubicka at gcc dot gnu.org
2023-11-24 23:23 ` sjames at gcc dot gnu.org
2023-11-25  6:22 ` aros at gmx dot com
2023-11-25 13:31 ` hubicka at gcc dot gnu.org
2024-01-05 21:31 ` hubicka at gcc dot gnu.org
2024-01-20 17:23 ` pinskia at gcc dot gnu.org

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bug-109811-4-eGucVsuJOh@http.gcc.gnu.org/bugzilla/ \
    --to=gcc-bugzilla@gcc.gnu.org \
    --cc=gcc-bugs@gcc.gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).