public inbox for gcc-bugs@sourceware.org
From: "hubicka at ucw dot cz" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/107715] TSVC s161 for double runs at zen4 30 times slower when vectorization is enabled
Date: Wed, 16 Nov 2022 15:28:30 +0000 [thread overview]
Message-ID: <bug-107715-4-6KPNl9WpfU@http.gcc.gnu.org/bugzilla/> (raw)
In-Reply-To: <bug-107715-4@http.gcc.gnu.org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107715
--- Comment #2 from Jan Hubicka <hubicka at ucw dot cz> ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107715
>
> --- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
> Because store data races are allowed with -Ofast masked stores are not used so
> we instead get
>
> vect__ifc__80.24_114 = VEC_COND_EXPR <mask__58.15_104, vect__45.20_109,
> vect__ifc__78.23_113>;
> _ifc__80 = _58 ? _45 : _ifc__78;
> MEM <vector(8) double> [(double *)vectp_c.25_116] = vect__ifc__80.24_114;
>
> which somehow is later turned into masked stores? In fact we expand from
>
> vect__43.18_107 = MEM <vector(8) double> [(double *)&a + ivtmp.75_134 * 1];
> vect__ifc__78.23_113 = MEM <vector(8) double> [(double *)&c + 8B +
> ivtmp.75_134 * 1];
> _97 = .COND_FMA (mask__58.15_104, vect_pretmp_36.14_102,
> vect_pretmp_36.14_102, vect__43.18_107, vect__ifc__78.23_113);
> MEM <vector(8) double> [(double *)&c + 8B + ivtmp.75_134 * 1] = _97;
> vect__38.29_121 = MEM <vector(8) double> [(double *)&c + ivtmp.75_134 * 1];
> vect__39.32_124 = MEM <vector(8) double> [(double *)&e + ivtmp.75_134 * 1];
> _98 = vect__35.11_99 >= { 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 };
> _100 = .COND_FMA (_98, vect_pretmp_36.14_102, vect__39.32_124,
> vect__38.29_121, vect__43.18_107);
> MEM <vector(8) double> [(double *)&a + ivtmp.75_134 * 1] = _100;
>
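(In scalar terms, the expanded sequence above behaves like an unconditional read-modify-write; a sketch with made-up names, not the actual generated code:)

```c
#include <stddef.h>

/* Scalar sketch of the VEC_COND_EXPR + plain store above: because
   store data races are allowed, "if (m) c[i] = v[i]" becomes an
   unconditional load + blend + store, so c[i] is read and written on
   every iteration even when the mask is false.  */
void
blend_store (double *c, const double *v, const int *mask, size_t n)
{
  for (size_t i = 0; i < n; ++i)
    {
      double old = c[i];            /* unconditional load */
      c[i] = mask[i] ? v[i] : old;  /* unconditional store */
    }
}
```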
> the vectorizer has optimize_mask_stores () which is supposed to replace
> .MASK_STORE with
>
> if (mask != { 0, 0, 0 ... })
> <code depending on the mask store>
>
> and thus optimize the mask == 0 case. But that only triggers for .MASK_STORE.
>
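(The effect of that transform, sketched in scalar C with made-up helper names; the real pass rewrites GIMPLE .MASK_STORE calls:)

```c
#include <stdbool.h>
#include <stddef.h>

/* Hedged sketch of what optimize_mask_stores buys: branch around the
   whole store sequence when no lane is active, so an all-false mask
   costs one compare instead of any store-side memory traffic.  */
static bool
any_lane_set (const bool *mask, size_t lanes)
{
  for (size_t i = 0; i < lanes; ++i)
    if (mask[i])
      return true;
  return false;
}

void
mask_store (double *dst, const double *src, const bool *mask, size_t lanes)
{
  if (!any_lane_set (mask, lanes))
    return;                  /* mask == { 0, 0, 0 ... }: skip the stores */
  for (size_t i = 0; i < lanes; ++i)
    if (mask[i])
      dst[i] = src[i];       /* only active lanes touch memory */
}
```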
> You can see this when you force .MASK_STORE via -O3 -ffast-math (without
> -fallow-store-data-races), you get this effect:
Yep, -fno-allow-store-data-races fixes the problem
jh@alberti:~/tsvc/bin> /home/jh/trunk-install/bin/gcc test.c -Ofast -march=native -lm
jh@alberti:~/tsvc/bin> perf stat ./a.out

 Performance counter stats for './a.out':

          37,289.50 msec task-clock:u              #    1.000 CPUs utilized
                  0      context-switches:u        #    0.000 /sec
                  0      cpu-migrations:u          #    0.000 /sec
                431      page-faults:u             #   11.558 /sec
    137,411,365,539      cycles:u                  #    3.685 GHz                      (83.33%)
        991,673,172      stalled-cycles-frontend:u #    0.72% frontend cycles idle     (83.34%)
            506,793      stalled-cycles-backend:u  #    0.00% backend cycles idle      (83.34%)
      3,400,375,204      instructions:u            #    0.02  insn per cycle
                                                   #    0.29  stalled cycles per insn  (83.34%)
        200,235,802      branches:u                #    5.370 M/sec                    (83.34%)
             73,962      branch-misses:u           #    0.04% of all branches          (83.33%)

       37.305121352 seconds time elapsed
       37.285467000 seconds user
        0.000000000 seconds sys
jh@alberti:~/tsvc/bin> /home/jh/trunk-install/bin/gcc test.c -Ofast -march=native -lm -fno-allow-store-data-races
jh@alberti:~/tsvc/bin> perf stat ./a.out

 Performance counter stats for './a.out':

             667.95 msec task-clock:u              #    0.999 CPUs utilized
                  0      context-switches:u        #    0.000 /sec
                  0      cpu-migrations:u          #    0.000 /sec
                367      page-faults:u             #  549.439 /sec
      2,434,906,671      cycles:u                  #    3.645 GHz                      (83.24%)
             19,681      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.24%)
             12,495      stalled-cycles-backend:u  #    0.00% backend cycles idle      (83.24%)
      2,793,482,139      instructions:u            #    1.15  insn per cycle
                                                   #    0.00  stalled cycles per insn  (83.24%)
        598,879,536      branches:u                #  896.588 M/sec                    (83.78%)
             50,649      branch-misses:u           #    0.01% of all branches          (83.26%)

        0.668807640 seconds time elapsed
        0.668660000 seconds user
        0.000000000 seconds sys
So I suppose it is L1 thrashing: l1-dcache-loads goes up from
2,000,413,936 to 11,044,576,207.

I suppose it would be too fancy for the vectorizer to work out the overall
memory consumption here :) It sort of should have all the info...
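(For reference, the kernel is roughly of this shape; a hedged sketch with illustrative names, not the exact TSVC s161 source. Each scalar iteration stores through only one arm, while the if-converted vector form loads and rewrites both a[] and c[] every iteration, which is where the extra l1-dcache-loads come from:)

```c
#include <stddef.h>

/* Rough sketch of an s161-style kernel (illustrative, not the exact
   TSVC source).  Each scalar iteration stores through exactly one of
   the two arms; after if-conversion the vector loop instead loads and
   rewrites both a[i] and c[i+1] regardless of the condition.  */
void
s161_like (double *a, const double *b, double *c,
           const double *d, const double *e, size_t n)
{
  for (size_t i = 0; i + 1 < n; ++i)
    {
      if (b[i] >= 0.0)
        c[i + 1] = a[i] + d[i] * d[i];
      else
        a[i] = c[i] + d[i] * e[i];
    }
}
```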
Honza