public inbox for gcc-bugs@sourceware.org
* [Bug rtl-optimization/40772] New: generating rendundant moves from second byte of 32b/64b register
From: zsojka at seznam dot cz @ 2009-07-16 15:33 UTC (permalink / raw)
To: gcc-bugs
For the following code:
------------------------------------------------
#include <stdint.h>
uint8_t data[16];
static __attribute__((noinline)) void test(unsigned i)
{
    unsigned j;
    for (j = 0; j < 16; j++)
        data[j] = ((i + j) & 0xFF00) >> 8;
}
------------------------------------------------
the generated asm looks like this (using -fno-tree-vectorize because of PR40771):
# ./gcc tst2b.c -o tst2.o -O3 -march=k8 -fno-tree-vectorize
------------------------------------------------
test:
.LFB11:
        .cfi_startproc
        movq    %rdi, %rdx
        movzbl  %dh, %eax
        movb    %al, data(%rip)
        leal    1(%rdi), %eax
        movzbl  %ah, %eax
        movb    %al, data+1(%rip)
        leal    2(%rdi), %eax
        movzbl  %ah, %eax
        movb    %al, data+2(%rip)
        leal    3(%rdi), %eax
        movzbl  %ah, %eax
        movb    %al, data+3(%rip)
        .....
------------------------------------------------
When "movzbl %ah, %eax ; movb %al, data+1(%rip)" is replaced by
"movb %ah, data+1(%rip)", the code is faster. (Another issue may be the use of
lea even for -march=pentium4, which would probably prefer add eax,1, but I
can't verify that.)
# ./gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --enable-languages=c,c++
--prefix=/mnt/svn/gcc-trunk/build/
Thread model: posix
gcc version 4.5.0 20090714 (experimental) (GCC)
The CPU is an AMD Phenom (4 cores, Barcelona) running at a fixed 1400 MHz.
gcc's generated code runs in 19 ticks on average; the code with
"movzbl ; mov al" replaced by "mov ah" runs in 16 ticks.
The whole test code is attached.
--
Summary: generating rendundant moves from second byte of 32b/64b
register
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: rtl-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: zsojka at seznam dot cz
GCC host triplet: x86_64-pc-linux-gnu
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40772
* [Bug rtl-optimization/40772] generating rendundant moves from second byte of 32b/64b register
From: zsojka at seznam dot cz @ 2009-07-16 15:34 UTC (permalink / raw)
To: gcc-bugs
------- Comment #1 from zsojka at seznam dot cz 2009-07-16 15:34 -------
Created an attachment (id=18206)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18206&action=view)
preprocessed source of test code
Runs 1 << 24 iterations, prints average time in ticks.
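The attachment itself is not reproduced in this thread. As a rough illustration of the kind of harness described (call the function many times, report the average cost), here is a minimal sketch. It is not the attached code: the helper name `average_cost` is hypothetical, and it measures in `clock()` units rather than CPU ticks, so its numbers are not comparable with the 19 and 16 ticks quoted in the report.

```c
#include <stdint.h>
#include <time.h>

uint8_t data[16];

/* Function under test, as given in the report. */
static __attribute__((noinline)) void test(unsigned i)
{
    unsigned j;
    for (j = 0; j < 16; j++)
        data[j] = ((i + j) & 0xFF00) >> 8;
}

/* Hypothetical helper (not from the attachment): calls fn(i) for
   i = 0 .. iterations-1 and returns the average cost per call,
   measured in clock() units. */
static double average_cost(void (*fn)(unsigned), unsigned iterations)
{
    clock_t start = clock();
    for (unsigned i = 0; i < iterations; i++)
        fn(i);
    clock_t end = clock();
    return (double)(end - start) / (double)iterations;
}
```

Calling `average_cost(test, 1u << 24)` would mirror the iteration count mentioned above; a cycle-accurate comparison would need a cycle counter instead of `clock()`.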
* [Bug rtl-optimization/40772] generating rendundant moves from second byte of 32b/64b register
From: zsojka at seznam dot cz @ 2009-07-16 15:42 UTC (permalink / raw)
To: gcc-bugs
------- Comment #2 from zsojka at seznam dot cz 2009-07-16 15:42 -------
When
    data[j] = ((i + j) & 0xFF00) >> 8;
is replaced by
    data[j] = (i + j) >> 8;
the generated asm uses "shr eax, 8" instead of "movzx eax, ah", and runs in 19
ticks on average.
* [Bug rtl-optimization/40772] generating rendundant moves from second byte of 32b/64b register
From: rguenth at gcc dot gnu dot org @ 2009-07-17 9:54 UTC (permalink / raw)
To: gcc-bugs
------- Comment #3 from rguenth at gcc dot gnu dot org 2009-07-17 09:54 -------
The zero extension is done to avoid partial register stalls.
* [Bug rtl-optimization/40772] generating rendundant moves from second byte of 32b/64b register
From: zsojka at seznam dot cz @ 2009-07-17 11:03 UTC (permalink / raw)
To: gcc-bugs
------- Comment #4 from zsojka at seznam dot cz 2009-07-17 11:03 -------
> The zero extension is done to avoid partial register stalls.
I am sorry, but do you mean that the generated code is supposedly the fastest,
and that the benchmark shows a different result only because of some
"undocumented/unlucky" conditions? (And so this task could be closed because
there is no way to deterministically improve the generated code.)
Or do you mean "the code responsible for eliminating partial register stalls
does a bad job here, because when using only 'ah' and 'eax' there is no _false_
register dependency"?
I wasn't sure whether this has anything to do with "partial register stall
elimination", because the following, very similar (and functionally identical)
code:
------------------------------------------------
#include <stdint.h>
uint8_t data[16];
static __attribute__((noinline)) void bar(unsigned i)
{
    unsigned j;
    for (j = 0; j < 16; j++)
        data[j] = (i + j) >> 8;
}
------------------------------------------------
is compiled as:
------------------------------------------------
bar:
.LFB12:
        .cfi_startproc
        movl    %edi, %eax
        shrl    $8, %eax
        movb    %al, data(%rip)
        leal    1(%rdi), %eax
        shrl    $8, %eax
        movb    %al, data+1(%rip)
        leal    2(%rdi), %eax
        shrl    $8, %eax
        movb    %al, data+2(%rip)
        leal    3(%rdi), %eax
        shrl    $8, %eax
        movb    %al, data+3(%rip)
        leal    4(%rdi), %eax
        ...
------------------------------------------------
There is no "partial register stall elimination" here; the only difference is
"shr" instead of "movzx".
So I thought that:
- the version with "mask & 0xFF00" is deciphered as 'only the second byte is
masked out and then shifted right by 8 bits, so "ah" can be moved to "al" (or
to the whole of eax)'
- the version without the mask is not deciphered as reading only the second
byte, so the working register is simply shifted right
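The claim that the two variants are functionally identical rests on the narrowing store to uint8_t, which discards everything above the low 8 bits of the shifted value. A minimal sketch that checks this over a range of inputs follows; the helper names (`second_byte_masked`, `second_byte_shifted`, `variants_agree`) are made up for illustration and do not come from the test case.

```c
#include <stdint.h>

/* Variant from the original report: mask out the second byte, then shift. */
static uint8_t second_byte_masked(unsigned i, unsigned j)
{
    return (uint8_t)(((i + j) & 0xFF00) >> 8);
}

/* Variant without the mask: shift only; the conversion to uint8_t
   drops the bits above the second byte. */
static uint8_t second_byte_shifted(unsigned i, unsigned j)
{
    return (uint8_t)((i + j) >> 8);
}

/* Hypothetical helper: returns 1 if both variants agree for every
   j in 0..15 and every i in [lo, hi], 0 otherwise. */
static int variants_agree(unsigned lo, unsigned hi)
{
    for (unsigned i = lo; ; i++) {
        for (unsigned j = 0; j < 16; j++)
            if (second_byte_masked(i, j) != second_byte_shifted(i, j))
                return 0;
        if (i == hi)        /* checked here so that hi == UINT_MAX terminates */
            break;
    }
    return 1;
}
```

Since both functions compute the same value, any difference in the generated code comes from how the compiler recognises the pattern, not from the semantics.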
* [Bug rtl-optimization/40772] generating rendundant moves from second byte of 32b/64b register
From: roger at nextmovesoftware dot com @ 2021-06-06 10:58 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40772
Roger Sayle <roger at nextmovesoftware dot com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
                 CC|                            |roger at nextmovesoftware dot com
   Target Milestone|---                         |7.0
             Status|UNCONFIRMED                 |RESOLVED
--- Comment #5 from Roger Sayle <roger at nextmovesoftware dot com> ---
This issue has been fixed since GCC 7; the compiler now stores the high-byte
registers (ah, bh, dh, etc.) directly to memory. The original tst2b.c testcase,
when compiled with -O3 -march=k8 -fno-tree-vectorize, now looks like:
test:
.LFB0:
        .cfi_startproc
        leal    1(%rdi), %edx
        movl    %edi, %eax
        movb    %ah, data(%rip)
        addl    $15, %eax
        movb    %dh, data+1(%rip)
        leal    2(%rdi), %edx
        movb    %ah, data+15(%rip)
        movb    %dh, data+2(%rip)
        leal    3(%rdi), %edx
        movb    %dh, data+3(%rip)
        leal    4(%rdi), %edx
        movb    %dh, data+4(%rip)
        leal    5(%rdi), %edx
        movb    %dh, data+5(%rip)
        leal    6(%rdi), %edx
        movb    %dh, data+6(%rip)
        leal    7(%rdi), %edx
        movb    %dh, data+7(%rip)
        leal    8(%rdi), %edx
        movb    %dh, data+8(%rip)
        leal    9(%rdi), %edx
        movb    %dh, data+9(%rip)
        leal    10(%rdi), %edx
        movb    %dh, data+10(%rip)
        leal    11(%rdi), %edx
        movb    %dh, data+11(%rip)
        leal    12(%rdi), %edx
        movb    %dh, data+12(%rip)
        leal    13(%rdi), %edx
        movb    %dh, data+13(%rip)
        leal    14(%rdi), %edx
        movb    %dh, data+14(%rip)
        ret