public inbox for gcc-bugs@sourceware.org
* [Bug target/97891] New: [x86] Consider using registers on large initializations
From: andysem at mail dot ru @ 2020-11-18 12:08 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97891
Bug ID: 97891
Summary: [x86] Consider using registers on large initializations
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: andysem at mail dot ru
Target Milestone: ---
Consider the following example code:
struct A
{
    long a;
    short b;
    int c;
    char d;
    long x;
    bool y;
    int z;
    char* p;

    A() :
        a(0), b(0), c(0), d(0), x(0), y(false), z(0), p(0)
    {}
};
void test(A* p, unsigned int count)
{
    for (unsigned int i = 0; i < count; ++i)
    {
        p[i] = A();
    }
}
When compiled with "-O3 -march=nehalem" the generated code is:
test(A*, unsigned int):
        testl   %esi, %esi
        je      .L1
        leal    -1(%rsi), %eax
        leaq    (%rax,%rax,2), %rax
        salq    $4, %rax
        leaq    48(%rdi,%rax), %rax
.L3:
        xorl    %edx, %edx
        movq    $0, (%rdi)
        addq    $48, %rdi
        movw    %dx, -40(%rdi)
        movl    $0, -36(%rdi)
        movb    $0, -32(%rdi)
        movq    $0, -24(%rdi)
        movb    $0, -16(%rdi)
        movl    $0, -12(%rdi)
        movq    $0, -8(%rdi)
        cmpq    %rax, %rdi
        jne     .L3
.L1:
        ret
https://gcc.godbolt.org/z/TrfWYr
Here, the main loop body between .L3 and .L1 is 60 bytes, with a significant
amount of space wasted on the $0 constants encoded in the mov instructions. It
would be more efficient to use a single zeroed register in all the member
initializations, especially given that %edx is already used that way.
A loop rewritten like this:
    for (unsigned int i = 0; i < count; ++i)
    {
        __asm__
        (
            "movq %q1, (%0)\n\t"
            "movw %w1, 8(%0)\n\t"
            "movl %1, 12(%0)\n\t"
            "movb %b1, 16(%0)\n\t"
            "movq %q1, 24(%0)\n\t"
            "movb %b1, 32(%0)\n\t"
            "movl %1, 36(%0)\n\t"
            "movq %q1, 40(%0)\n\t"
            : : "r" (p + i), "q" (0)
        );
    }
compiles to:
test(A*, unsigned int):
        testl   %esi, %esi
        je      .L1
        leal    -1(%rsi), %eax
        leaq    (%rax,%rax,2), %rax
        salq    $4, %rax
        leaq    48(%rdi,%rax), %rdx
        xorl    %eax, %eax
.L3:
        movq    %rax, (%rdi)
        movw    %ax, 8(%rdi)
        movl    %eax, 12(%rdi)
        movb    %al, 16(%rdi)
        movq    %rax, 24(%rdi)
        movb    %al, 32(%rdi)
        movl    %eax, 36(%rdi)
        movq    %rax, 40(%rdi)
        addq    $48, %rdi
        cmpq    %rdx, %rdi
        jne     .L3
.L1:
        ret
Here, the loop between .L3 and .L1 takes only 34 bytes, nearly half the
original size.
Constant (for example, zero) initialization is a frequently used pattern for
initializing structures, so sequences like the above are quite widespread.
Converting such cases to use registers could save some code size and reduce
cache pressure.
* [Bug target/97891] [x86] Consider using registers on large initializations
From: andysem at mail dot ru @ 2020-11-18 12:11 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97891
--- Comment #1 from andysem at mail dot ru ---
As a side note, the "xorl %edx, %edx" in the original code should have been
hoisted out of the loop, as it is in the version with the __asm__ block.
* [Bug target/97891] [x86] Consider using registers on large initializations
From: rguenth at gcc dot gnu.org @ 2020-11-18 14:37 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97891
Richard Biener <rguenth at gcc dot gnu.org> changed:
What                |Removed      |Added
----------------------------------------------------------------------------
Last reconfirmed    |             |2020-11-18
Status              |UNCONFIRMED  |NEW
Target              |x86-64       |x86_64-*-* i?86-*-*
Ever confirmed      |0            |1
Keywords            |             |missed-optimization
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed. I think we have (plenty?) duplicates.
* [Bug target/97891] [x86] Consider using registers on large initializations
From: crazylht at gmail dot com @ 2020-11-19 2:59 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97891
--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
This problem is very similar to the one pass_rpad deals with.
* [Bug target/97891] [x86] Consider using registers on large initializations
From: crazylht at gmail dot com @ 2020-11-19 8:25 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97891
--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #3)
> This problem is very similar to the one pass_rpad deals with.
We already have mov<mode>_xor for mov $0 to reg, so we only need to handle mov
$0 to mem.
The encoding sizes are:
xorl               --- 2 bytes
$0 in movb         --- 1 byte
$0 in movw         --- 2 bytes
$0 in mov{l,q}     --- 4 bytes
0: 31 d2 xor %edx,%edx
2: 48 c7 07 00 00 00 00 movq $0x0,(%rdi)
9: 48 89 37 mov %rsi,(%rdi)
c: 88 17 mov %dl,(%rdi)
e: c6 07 00 movb $0x0,(%rdi)
11: 66 89 17 mov %dx,(%rdi)
14: 66 c7 07 00 00 movw $0x0,(%rdi)
19: 89 3f mov %edi,(%rdi)
1b: c7 07 00 00 00 00 movl $0x0,(%rdi)
As long as the immediate $0 occupies more than 2 bytes, it is beneficial for
code size to replace mov $0, mem with xor reg0, reg0 + mov reg0, mem.
* [Bug target/97891] [x86] Consider using registers on large initializations
From: andysem at mail dot ru @ 2020-11-19 9:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97891
--- Comment #5 from andysem at mail dot ru ---
Using a register is beneficial even for bytes and words if there are multiple
mov instructions, but there has to be a single reg0 shared by all the movs.
I'm not very knowledgeable about gcc internals, but would it be beneficial to
implement this at a higher level than instruction transformation? I.e. so that
instead of this:
a = 0;
b = 0;
c = 0;
we have:
any reg0 = 0; // "any" represents a type compatible with any fundamental or enum type
a = reg0;
b = reg0;
c = reg0;
This way, reg0 would live in a single register, and the xorl instruction could
be subject to other tree-level optimizations.
With tree-level optimization, another thing to note is the vectorizer. I know
gcc can sometimes merge adjacent initializations with no padding between them
into a single wider store. For example:
struct A
{
    long a1;
    long a2;

    A() :
        a1(0), a2(0)
    {
    }
};
void test(A* p, unsigned int count)
{
    for (unsigned int i = 0; i < count; ++i)
    {
        p[i] = A();
    }
}
test(A*, unsigned int):
        testl   %esi, %esi
        je      .L1
        leal    -1(%rsi), %eax
        pxor    %xmm0, %xmm0
        salq    $4, %rax
        leaq    16(%rdi,%rax), %rax
.L3:
        movups  %xmm0, (%rdi)
        addq    $16, %rdi
        cmpq    %rax, %rdi
        jne     .L3
.L1:
        ret
I would like this to still work.
* [Bug target/97891] [x86] Consider using registers on large initializations
From: crazylht at gmail dot com @ 2020-12-24 2:34 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97891
--- Comment #6 from Hongtao.liu <crazylht at gmail dot com> ---
cat test.c

typedef struct {
    long a;
    long b;
} TI;

extern TI r;

void
foo ()
{
    r.a = 0;
    r.b = 0;
}
gcc -Ofast -march=cascadelake -S gives:
foo:
.LFB0:
        .cfi_startproc
        movq    $0, r(%rip)
        movq    $0, r+8(%rip)
        ret
        .cfi_endproc
.LFE0:
SLP failed to vectorize due to cost:
test1.c:10:7: note: === vect_slp_analyze_instance_alignment ===
test1.c:10:7: note: vect_compute_data_ref_alignment:
test1.c:10:7: note: can't force alignment of ref: r.a
test1.c:10:7: note: === vect_slp_analyze_instance_dependence ===
test1.c:10:7: note: === vect_slp_analyze_operations ===
test1.c:10:7: note: ==> examining statement: r.a = 0;
test1.c:10:7: note: vect_is_simple_use: operand 0, type of def: constant
test1.c:10:7: note: Vectorizing an unaligned access.
test1.c:10:7: note: vect_model_store_cost: unaligned supported by hardware.
test1.c:10:7: note: vect_model_store_cost: inside_cost = 16, prologue_cost = 0.
test1.c:10:7: note: === vect_bb_partition_graph ===
test1.c:10:7: note: ***** Analysis succeeded with vector mode V16QI
test1.c:10:7: note: SLPing BB part
0x48bc3d0 0 1 times unaligned_store (misalign -1) costs 16 in body
0x48bc3d0 <unknown> 1 times vector_load costs 12 in prologue
0x48bbf50 0 1 times scalar_store costs 12 in body
0x48bbf50 0 1 times scalar_store costs 12 in body
test1.c:10:7: note: Cost model analysis:
Vector inside of basic block cost: 16
Vector prologue cost: 12
Vector epilogue cost: 0
Scalar cost of basic block: 24
Shouldn't the cost of a zero vector CTOR be different from that of a normal
one, since x86 has pxor for it (and aarch64 presumably eor)?