public inbox for gcc-bugs@sourceware.org
* [Bug middle-end/108346] New: gather/scatter loops optimized too often for znver4 (and other zens)
@ 2023-01-09 19:42 hubicka at gcc dot gnu.org
2023-01-09 19:45 ` [Bug target/108346] " pinskia at gcc dot gnu.org
2023-01-16 15:00 ` hubicka at gcc dot gnu.org
0 siblings, 2 replies; 3+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-01-09 19:42 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108346
Bug ID: 108346
Summary: gather/scatter loops optimized too often for znver4
(and other zens)
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: hubicka at gcc dot gnu.org
Target Milestone: ---
The following two benchmarks test gather/scatter codegen:
s4113.c:
#include <math.h>
#include <malloc.h>
//typedef float real_t;
#define iterations 1000000
#define LEN_1D 32000
#define LEN_2D 256
real_t a[LEN_1D],b[LEN_1D],c[LEN_1D],d[LEN_1D],e[LEN_1D];
real_t aa[LEN_2D][LEN_2D];
real_t bb[LEN_2D][LEN_2D];
real_t cc[LEN_2D][LEN_2D];
real_t qq;
int
main(void)
{
    // reductions
    // if to max reduction
    real_t x;
    int * __restrict__ ip = (int *) malloc(LEN_1D*sizeof(real_t));
    for (int i = 0; i < LEN_1D; i = i+5){
        (ip)[i] = (i+4);
        (ip)[i+1] = (i+2);
        (ip)[i+2] = (i);
        (ip)[i+3] = (i+3);
        (ip)[i+4] = (i+1);
    }
    for (int nl = 0; nl < 2*iterations; nl++) {
        for (int i = 1; i < LEN_1D; i += 2) {
            a[ip[i]] = b[ip[i]] + c[i];
        }
        asm("":::"memory");
    }
    return x;
}
s4115.c:
#include <math.h>
#include <malloc.h>
#define iterations 1000000
#define LEN_1D 32000
#define LEN_2D 256
real_t a[LEN_1D],b[LEN_1D],c[LEN_1D],d[LEN_1D],e[LEN_1D];
real_t aa[LEN_2D][LEN_2D];
real_t bb[LEN_2D][LEN_2D];
real_t cc[LEN_2D][LEN_2D];
real_t qq;
int
main(void)
{
    // reductions
    // if to max reduction
    real_t x;
    int * __restrict__ ip = (int *) malloc(LEN_1D*sizeof(real_t));
    for (int i = 0; i < LEN_1D; i = i+5){
        (ip)[i] = (i+4);
        (ip)[i+1] = (i+2);
        (ip)[i+2] = (i);
        (ip)[i+3] = (i+3);
        (ip)[i+4] = (i+1);
    }
    for (int nl = 0; nl < 2*iterations; nl++) {
        for (int i = 1; i < LEN_1D; i += 2) {
            x += a[i] * b[ip[i]];
        }
        asm("":::"memory");
    }
    return x;
}
On znver4 I get the following times when disabling/enabling vectorization and
disabling/enabling gather & scatter use:
                                             runtime
type    optimization     operation    scalar  nogather  gather parts  instruction
char    avx256_optimal   load+store   14.23   N/A       N/A
char    avx256_optimal   load         14.25   N/A       N/A
char    ^avx256_optimal  load+store   14.02   N/A       N/A
char    ^avx256_optimal  load         14.25   N/A       N/A
short   avx256_optimal   load+store  *14.23   N/A       N/A
short   avx256_optimal   load        *14.23   N/A       N/A
short   ^avx256_optimal  load+store   15.22   N/A       N/A
short   ^avx256_optimal  load         14.23   N/A       N/A
int     avx256_optimal   load+store  *16.51   27.66     25.96  8      vpgatherdd ymm, vpscatterdd ymm
int     avx256_optimal   load         14.13   13.17    *12.71  8      vpgatherdd ymm
int     ^avx256_optimal  load+store  *16.57   33.25     26.06  16     vpgatherdd zmm, vpscatterdd zmm
int     ^avx256_optimal  load         14.14   16.81    *13.63  16     vpgatherdd zmm
long    avx256_optimal   load+store  *20.59   20.66     32.03  4      vpgatherdq zmm, vpscatterdq zmm
long    avx256_optimal   load         15.36  *15.36     15.82  4      vpgatherdq zmm
long    ^avx256_optimal  load+store   22.42  *20.96     30.54  8      vpgatherdq zmm, vpscatterdq zmm
long    ^avx256_optimal  load        *15.87   16.40     18.68  8      vpgatherdq zmm
float   avx256_optimal   load+store   16.88   27.78     26.08  8      vgatherdps ymm, vscatterdps ymm
float   avx256_optimal   load         26.01  *13.19     13.30  8      vgatherdps ymm
float   ^avx256_optimal  load+store  *16.89   33.22     26.19  16     vgatherdps zmm, vscatterdps zmm
float   ^avx256_optimal  load         26.01   16.61    *13.85  16     vgatherdps zmm
double  avx256_optimal   load+store   21.94  *20.81     31.43  4      vgatherdpd ymm, vscatterdpd ymm
double  avx256_optimal   load         26.01   26.01    *15.20  4      vgatherdpd ymm
double  ^avx256_optimal  load+store   21.44  *21.65     30.73  8      vgatherdpd zmm, vscatterdpd zmm
double  ^avx256_optimal  load         26.01   26.01    *18.24  8      vgatherdpd zmm
We incorrectly vectorize the int load+store loop, causing a roughly 60%
regression. Vectorizing the avx512 long load loop also seems to be a slight
loss, but that is not as important. I will post a patch to disable scatter
instructions, since they do not seem to be a win.
* [Bug target/108346] gather/scatter loops optimized too often for znver4 (and other zens)
From: pinskia at gcc dot gnu.org @ 2023-01-09 19:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108346
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Component|middle-end |target
Blocks| |53947
Target| |x86_64-linux-gnu
Keywords| |missed-optimization
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This is a cost issue: either there is no decent way of expressing the cost, or
the backend cost model needs to be improved (or both).
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
* [Bug target/108346] gather/scatter loops optimized too often for znver4 (and other zens)
From: hubicka at gcc dot gnu.org @ 2023-01-16 15:00 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108346
--- Comment #2 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Sadly the win/loss cases do not seem to suggest a simple cost scheme.
We currently compute gather/scatter costs as static startup cost + cost per
lane and they are set to approximately match actual latencies. I am not sure
how much better we can do.