[Bug c/107905] New: 2x slowdown versus CLANG and ICL

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug c/107905] New: 2x slowdown versus CLANG and ICL
@ 2022-11-29  2:36 sanmayce at sanmayce dot com
  2022-11-29  3:22 ` [Bug middle-end/107905] " pinskia at gcc dot gnu.org
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: sanmayce at sanmayce dot com @ 2022-11-29  2:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107905

            Bug ID: 107905
           Summary: 2x slowdown versus CLANG and ICL
           Product: gcc
           Version: 11.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: sanmayce at sanmayce dot com
  Target Milestone: ---

Hi,
I have encountered performance problem with MinGW GCC 11.3.0, nearly 2x
slowdown with a simple C function (being the 3rd fastest wildcard matching
function).

```
int Tcheburaschka_Wildcard_Iterative_Kaze_CaseSensitive (const char* mask,
const char* name) {
        const char* maskSTACK;
        const char* nameSTACK;
#pragma nounroll
        for (name, mask; *name; ++name, ++mask) {
                if (*mask == '*') {
                        goto Backtrack;
                //} else if ((*mask != '?') && (*name != *mask)) {
/*
                } else if ((*mask - '?') * (*name - *mask)) {
                        return 0;
                } 
*/
                }
                if ((*mask - '?') * (*name - *mask)) {
                        return 0;
                } 
        }
Backtrack:
#pragma nounroll
        for (nameSTACK = name, maskSTACK = mask; *nameSTACK; ++nameSTACK,
++maskSTACK) {
                if (*maskSTACK == '*') {
                        mask = maskSTACK+1;
                        if (!*mask) return 1;
                        name = nameSTACK;
                        goto Backtrack;
                //} else if ((*maskSTACK != '?') && (*nameSTACK != *maskSTACK))
{
/*
                } else if ((*maskSTACK - '?') * (*nameSTACK - *maskSTACK)) {
                        name++;
                        goto Backtrack;
                } 
*/
                }
                if ((*maskSTACK - '?') * (*nameSTACK - *maskSTACK)) {
                        name++;
                        goto Backtrack;
                } 
        }
        while (*maskSTACK == '*') ++maskSTACK;
        return (!*maskSTACK);
}
```

```
[CPU: AMD Zen2 Ryzen7 4800H, @2.9GHz, Max. Boost Clock Up to 4.2GHz]
+-------------------------------------------------------------------------------+-------------------+-------------------------+-----------------------+
| Function \ Compiler                                                          
| CLANG 14.0.1, -O3 | Intel's ICL 19.0.0, /O3 | MinGW gcc 11.3.0, -O3 |
+-------------------------------------------------------------------------------+-------------------+-------------------------+-----------------------+
| Dogan Kurt's 'Antimalware', 2016, Iterative (wild_iterative)                 
|       70.605000 s |            102.610000 s |           83.398000 s |
| Dogan Kurt's 'Antimalware', 2016, Iterative Optimised (wild_iterative_opt)   
|       61.322000 s |             74.243000 s |           66.538000 s |
| Tcheburaschka_r3, 2022, (Tcheburaschka_Wildcard_Iterative_Kaze_CaseSensitive)
|       72.990000 s |             76.161000 s |          127.717000 s |
| JackHandy_Iterative, 2005, (IterativeWildcards)                              
|       80.053000 s |             90.872000 s |           70.156000 s |
| Kirk J. Krauss, 2014, DrDobbs (FastWildCompare)                              
|       44.113000 s |             48.109000 s |           51.018000 s |
| Alessandro Cantatore, 2003, (szWildMatch7)                                   
|       98.729000 s |             85.986000 s |          121.965000 s |
| Nondeterministic Finite Automaton (wild_nfa)                                 
|      162.561000 s |            200.022000 s |          176.440000 s |
+-------------------------------------------------------------------------------+-------------------+-------------------------+-----------------------+
[Note1: All functions returned 1,075,000,000 Matches - that is TRUEs, kinda
means they passed the quality test, no, really, I printed all the 1's and 0's
after each run - the sequences matched.]
[Note2: It is well-known that Maximum Turbo Modes are maintained for some 15-30
seconds, so it is good that each function takes 30+ seconds, to emulate some 8
billion real-world searches.]
```

For more info:
https://github.com/kirkjkrauss/MatchingWildcards/issues/1#issue-1467311771

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug middle-end/107905] 2x slowdown versus CLANG and ICL
  2022-11-29  2:36 [Bug c/107905] New: 2x slowdown versus CLANG and ICL sanmayce at sanmayce dot com
@ 2022-11-29  3:22 ` pinskia at gcc dot gnu.org
  2022-11-29  5:14 ` sanmayce at sanmayce dot com
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-11-29  3:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107905

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization, ra
          Component|target                      |middle-end

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I looked at the code generation and there is not much difference between GCC
and LLVM really.

The only difference I see some register allocation difference when it comes to
mask and name. GCC uses two extra registers to store those.

I don't see that causing a 2x slow down though.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug middle-end/107905] 2x slowdown versus CLANG and ICL
  2022-11-29  2:36 [Bug c/107905] New: 2x slowdown versus CLANG and ICL sanmayce at sanmayce dot com
  2022-11-29  3:22 ` [Bug middle-end/107905] " pinskia at gcc dot gnu.org
@ 2022-11-29  5:14 ` sanmayce at sanmayce dot com
  2022-11-29 19:19 ` amonakov at gcc dot gnu.org
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: sanmayce at sanmayce dot com @ 2022-11-29  5:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107905

--- Comment #2 from Georgi <sanmayce at sanmayce dot com> ---
> I don't see that causing a 2x slow down though.

Same here, yet, the results prove that not only on Zen2, on top of that on
i5-7200U @2.5GHz and Windows 10, this function takes 203s, cannot figure it
out, could you run the wildtest_GCC executable (or compile to it) and see it is
so?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug middle-end/107905] 2x slowdown versus CLANG and ICL
  2022-11-29  2:36 [Bug c/107905] New: 2x slowdown versus CLANG and ICL sanmayce at sanmayce dot com
  2022-11-29  3:22 ` [Bug middle-end/107905] " pinskia at gcc dot gnu.org
  2022-11-29  5:14 ` sanmayce at sanmayce dot com
@ 2022-11-29 19:19 ` amonakov at gcc dot gnu.org
  2022-11-29 19:31 ` pinskia at gcc dot gnu.org
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: amonakov at gcc dot gnu.org @ 2022-11-29 19:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107905

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|ra                          |
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
LLVM does a better job at code layout, and massively wins on the amount of
executed branches (in particular unconditional jumps). With -fdisable-rtl-bbro
gcc achieves a similar performance.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug middle-end/107905] 2x slowdown versus CLANG and ICL
  2022-11-29  2:36 [Bug c/107905] New: 2x slowdown versus CLANG and ICL sanmayce at sanmayce dot com
                   ` (2 preceding siblings ...)
  2022-11-29 19:19 ` amonakov at gcc dot gnu.org
@ 2022-11-29 19:31 ` pinskia at gcc dot gnu.org
  2022-11-30 13:57 ` amonakov at gcc dot gnu.org
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-11-29 19:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107905

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #3)
> LLVM does a better job at code layout, and massively wins on the amount of
> executed branches (in particular unconditional jumps). With
> -fdisable-rtl-bbro gcc achieves a similar performance.

-freorder-blocks-algorithm=simple seems to improve it there too.
This does sound like all accidential really.
Throwing some [[unlikely]] around seems to get "better" layout.
I am I suspecting it all depends on the inputs and might not actually be a
performance difference if the inputs are better "weighted" ....

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug middle-end/107905] 2x slowdown versus CLANG and ICL
  2022-11-29  2:36 [Bug c/107905] New: 2x slowdown versus CLANG and ICL sanmayce at sanmayce dot com
                   ` (3 preceding siblings ...)
  2022-11-29 19:31 ` pinskia at gcc dot gnu.org
@ 2022-11-30 13:57 ` amonakov at gcc dot gnu.org
  2022-11-30 14:36 ` amonakov at gcc dot gnu.org
  2022-11-30 15:01 ` sanmayce at sanmayce dot com
  6 siblings, 0 replies; 8+ messages in thread
From: amonakov at gcc dot gnu.org @ 2022-11-30 13:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107905

--- Comment #5 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Not sure what you don't like about the inputs, they appear quite reasonable.
Perhaps GCC's estimation of bb frequencies is off (with profile feedback we
achieve good performance).

Georgi: you'll likely see better results with profile-guided optimization. You
can first compile the benchmark with -O2 -fprofile-generate, run the output (it
will generate *.gcda files), then compile again with -O2 -fprofile-use. For
Clang the options are spelled -fprofile-instr-generate and -fprofile-instr-use,
respectively.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug middle-end/107905] 2x slowdown versus CLANG and ICL
  2022-11-29  2:36 [Bug c/107905] New: 2x slowdown versus CLANG and ICL sanmayce at sanmayce dot com
                   ` (4 preceding siblings ...)
  2022-11-30 13:57 ` amonakov at gcc dot gnu.org
@ 2022-11-30 14:36 ` amonakov at gcc dot gnu.org
  2022-11-30 15:01 ` sanmayce at sanmayce dot com
  6 siblings, 0 replies; 8+ messages in thread
From: amonakov at gcc dot gnu.org @ 2022-11-30 14:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107905

--- Comment #6 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Let me add that Clang supports GCC's -fprofile-{generate,use} flags for
compatibility as well.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug middle-end/107905] 2x slowdown versus CLANG and ICL
  2022-11-29  2:36 [Bug c/107905] New: 2x slowdown versus CLANG and ICL sanmayce at sanmayce dot com
                   ` (5 preceding siblings ...)
  2022-11-30 14:36 ` amonakov at gcc dot gnu.org
@ 2022-11-30 15:01 ` sanmayce at sanmayce dot com
  6 siblings, 0 replies; 8+ messages in thread
From: sanmayce at sanmayce dot com @ 2022-11-30 15:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107905

--- Comment #7 from Georgi <sanmayce at sanmayce dot com> ---
(In reply to Alexander Monakov from comment #5)
> Not sure what you don't like about the inputs, they appear quite reasonable.
> Perhaps GCC's estimation of bb frequencies is off (with profile feedback we
> achieve good performance).
> 
> Georgi: you'll likely see better results with profile-guided optimization.
> You can first compile the benchmark with -O2 -fprofile-generate, run the
> output (it will generate *.gcda files), then compile again with -O2
> -fprofile-use. For Clang the options are spelled -fprofile-instr-generate
> and -fprofile-instr-use, respectively.

Thank you Alexander, your help is much appreciated. 
For the first time I see such problem, no idea what is the core-problem.
Regarding inputs, I agree, they are not only okay, they form one DEFINITIVE
benchmark, anyway, glad seeing the problem resolved when using your
-fdisable-rtl-bbro:

Rerunning, on Zen2 4800H:

D:\Wildtest_2022-Nov-29>3

D:\Wildtest_2022-Nov-29>wildtest_CLANG_14.0.1.exe
WILDTEST, wildcard benchmark written by Dogan Kurt (dogan.kurt@dodobyte.com),
modified by Kaze (twitter.com/Sanmayce), revision 2022-Nov-29.

Dogan Kurt's 'Antimalware', 2016, Iterative (wild_iterative):
70.441000 s
Dogan Kurt's 'Antimalware', 2016, Iterative Optimised (wild_iterative_opt):
61.099000 s
Tcheburaschka_r3, 2022, (Tcheburaschka_Wildcard_Iterative_Kaze_CaseSensitive):
72.853000 s
JackHandy_Iterative, 2005, (IterativeWildcards):
80.415000 s
Kirk J. Krauss, 2014, DrDobbs (FastWildCompare):
42.279000 s
Alessandro Cantatore, 2003, (szWildMatch7):
97.681000 s
Nondeterministic Finite Automaton (wild_nfa):
162.023000 s

D:\Wildtest_2022-Nov-29>wildtest_Intel_19.0.exe
WILDTEST, wildcard benchmark written by Dogan Kurt (dogan.kurt@dodobyte.com),
modified by Kaze (twitter.com/Sanmayce), revision 2022-Nov-29.

Dogan Kurt's 'Antimalware', 2016, Iterative (wild_iterative):
98.859000 s
Dogan Kurt's 'Antimalware', 2016, Iterative Optimised (wild_iterative_opt):
72.175000 s
Tcheburaschka_r3, 2022, (Tcheburaschka_Wildcard_Iterative_Kaze_CaseSensitive):
70.278000 s
JackHandy_Iterative, 2005, (IterativeWildcards):
107.693000 s
Kirk J. Krauss, 2014, DrDobbs (FastWildCompare):
45.988000 s
Alessandro Cantatore, 2003, (szWildMatch7):
79.309000 s
Nondeterministic Finite Automaton (wild_nfa):
170.198000 s

D:\Wildtest_2022-Nov-29>wildtest_GCC_11.3.0.exe
WILDTEST, wildcard benchmark written by Dogan Kurt (dogan.kurt@dodobyte.com),
modified by Kaze (twitter.com/Sanmayce), revision 2022-Nov-29.

Dogan Kurt's 'Antimalware', 2016, Iterative (wild_iterative):
113.664000 s
Dogan Kurt's 'Antimalware', 2016, Iterative Optimised (wild_iterative_opt):
72.522000 s
Tcheburaschka_r3, 2022, (Tcheburaschka_Wildcard_Iterative_Kaze_CaseSensitive):
78.138000 s
JackHandy_Iterative, 2005, (IterativeWildcards):
83.266000 s
Kirk J. Krauss, 2014, DrDobbs (FastWildCompare):
56.878000 s
Alessandro Cantatore, 2003, (szWildMatch7):
119.861000 s
Nondeterministic Finite Automaton (wild_nfa):
176.290000 s

Hope future GCCs to fix this, the Tcheburaschka function is good in my view.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-11-30 15:01 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-29  2:36 [Bug c/107905] New: 2x slowdown versus CLANG and ICL sanmayce at sanmayce dot com
2022-11-29  3:22 ` [Bug middle-end/107905] " pinskia at gcc dot gnu.org
2022-11-29  5:14 ` sanmayce at sanmayce dot com
2022-11-29 19:19 ` amonakov at gcc dot gnu.org
2022-11-29 19:31 ` pinskia at gcc dot gnu.org
2022-11-30 13:57 ` amonakov at gcc dot gnu.org
2022-11-30 14:36 ` amonakov at gcc dot gnu.org
2022-11-30 15:01 ` sanmayce at sanmayce dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).