[Bug rtl-optimization/40772] New: generating rendundant moves from second byte of 32b/64b register

public inbox for gcc-bugs@sourceware.org

* [Bug rtl-optimization/40772]  New: generating rendundant moves from second byte of 32b/64b register
@ 2009-07-16 15:33 zsojka at seznam dot cz
  2009-07-16 15:34 ` [Bug rtl-optimization/40772] " zsojka at seznam dot cz
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: zsojka at seznam dot cz @ 2009-07-16 15:33 UTC (permalink / raw)
  To: gcc-bugs

For the following code:
------------------------------------------------
#include <stdint.h>

uint8_t data[16];

static __attribute__((noinline)) void test(unsigned i)
{
        unsigned j;
        for (j = 0; j < 16; j++)
                data[j] = ((i + j) & 0xFF00) >> 8;
}
------------------------------------------------

the generated asm looks like this (using -fno-tree-vectorize because of PR 40771):
# ./gcc tst2b.c -o tst2.o -O3 -march=k8 -fno-tree-vectorize
------------------------------------------------
test:
.LFB11:
        .cfi_startproc
        movq    %rdi, %rdx
        movzbl  %dh, %eax
        movb    %al, data(%rip)
        leal    1(%rdi), %eax
        movzbl  %ah, %eax
        movb    %al, data+1(%rip)
        leal    2(%rdi), %eax
        movzbl  %ah, %eax
        movb    %al, data+2(%rip)
        leal    3(%rdi), %eax
        movzbl  %ah, %eax
        movb    %al, data+3(%rip)
.....
------------------------------------------------
When "movzbl %ah, %eax ; movb %al, data+1(%rip)" is replaced by
"movb %ah, data+1(%rip)", the code is faster. (Another issue may be the use of
lea even for -march=pentium4, which would probably prefer add eax,1, but I
can't verify that.)

# ./gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --enable-languages=c,c++
--prefix=/mnt/svn/gcc-trunk/build/
Thread model: posix
gcc version 4.5.0 20090714 (experimental) (GCC)

CPU is AMD Phenom (4 cores, Barcelona) running at fixed 1400MHz.

gcc's generated code runs in 19 ticks on average; the code with
"movzbl ; mov al" replaced by "mov ah" runs in 16 ticks.

The whole test code is attached.


-- 
           Summary: generating rendundant moves from second byte of 32b/64b
                    register
           Product: gcc
           Version: 4.5.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: zsojka at seznam dot cz
  GCC host triplet: x86_64-pc-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40772



* [Bug rtl-optimization/40772] generating rendundant moves from second byte of 32b/64b register
  2009-07-16 15:33 [Bug rtl-optimization/40772] New: generating rendundant moves from second byte of 32b/64b register zsojka at seznam dot cz
@ 2009-07-16 15:34 ` zsojka at seznam dot cz
  2009-07-16 15:42 ` zsojka at seznam dot cz
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: zsojka at seznam dot cz @ 2009-07-16 15:34 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #1 from zsojka at seznam dot cz  2009-07-16 15:34 -------
Created an attachment (id=18206)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18206&action=view)
preprocessed source of test code

Runs 1 << 24 iterations, prints average time in ticks.
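
For readers without access to the attachment, the measurement described here can be approximated roughly as follows. This is only a sketch and not the attached code: `ticks_now` and `average_ticks` are hypothetical names, and reading the time-stamp counter with `__rdtsc()` on x86 (falling back to `clock()` elsewhere) is an assumption about how the "ticks" were obtained.

```c
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#define HAVE_TSC 1
#else
#define HAVE_TSC 0
#endif

uint8_t data[16];

/* The function under test, as given in the original report. */
static __attribute__((noinline)) void test(unsigned i)
{
        unsigned j;
        for (j = 0; j < 16; j++)
                data[j] = ((i + j) & 0xFF00) >> 8;
}

/* Hypothetical tick source: the TSC on x86, clock() on other targets. */
static uint64_t ticks_now(void)
{
#if HAVE_TSC
        return __rdtsc();
#else
        return (uint64_t)clock();
#endif
}

/* Calls test() `iterations` times and returns the average ticks per call. */
static double average_ticks(unsigned iterations)
{
        uint64_t start, end;
        unsigned i;

        start = ticks_now();
        for (i = 0; i < iterations; i++)
                test(i);
        end = ticks_now();

        return (double)(end - start) / iterations;
}
```

Calling `average_ticks(1u << 24)` from `main` and printing the result matches the iteration count mentioned above. Absolute numbers depend on the CPU and on how the tick source relates to core clock cycles.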



* [Bug rtl-optimization/40772] generating rendundant moves from second byte of 32b/64b register
  2009-07-16 15:33 [Bug rtl-optimization/40772] New: generating rendundant moves from second byte of 32b/64b register zsojka at seznam dot cz
  2009-07-16 15:34 ` [Bug rtl-optimization/40772] " zsojka at seznam dot cz
@ 2009-07-16 15:42 ` zsojka at seznam dot cz
  2009-07-17  9:54 ` rguenth at gcc dot gnu dot org
  2009-07-17 11:03 ` zsojka at seznam dot cz
  3 siblings, 0 replies; 6+ messages in thread
From: zsojka at seznam dot cz @ 2009-07-16 15:42 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #2 from zsojka at seznam dot cz  2009-07-16 15:42 -------
When
                data[j] = ((i + j) & 0xFF00) >> 8;
is replaced by
                data[j] = (i + j) >> 8;

the generated asm uses "shr eax, 8" instead of "movzx eax, ah", and runs in 19
ticks on average.



* [Bug rtl-optimization/40772] generating rendundant moves from second byte of 32b/64b register
  2009-07-16 15:33 [Bug rtl-optimization/40772] New: generating rendundant moves from second byte of 32b/64b register zsojka at seznam dot cz
  2009-07-16 15:34 ` [Bug rtl-optimization/40772] " zsojka at seznam dot cz
  2009-07-16 15:42 ` zsojka at seznam dot cz
@ 2009-07-17  9:54 ` rguenth at gcc dot gnu dot org
  2009-07-17 11:03 ` zsojka at seznam dot cz
  3 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-07-17  9:54 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #3 from rguenth at gcc dot gnu dot org  2009-07-17 09:54 -------
The zero extension is done to avoid partial register stalls.
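
As background for this remark: architecturally, a write to an 8-bit register such as %al leaves the rest of %eax unchanged, so its result depends on the register's previous contents, whereas movzbl overwrites the whole of %eax. The following C model illustrates only these semantics, not any timing behaviour. The helper names are hypothetical and EAX is modelled as a plain `uint32_t`.

```c
#include <stdint.h>

/* Models "movb v, %al": the low byte is replaced and the upper 24 bits
 * of the old register value are preserved (merged into the result). */
static uint32_t write_al(uint32_t eax, uint8_t v)
{
        return (eax & 0xFFFFFF00u) | v;
}

/* Models a read of %ah: bits 8..15 of EAX. Reading modifies nothing. */
static uint8_t read_ah(uint32_t eax)
{
        return (uint8_t)(eax >> 8);
}

/* Models "movzbl %ah, %eax": the full 32-bit register is overwritten
 * with the zero-extended byte, so no old upper bits survive. */
static uint32_t movzbl_ah_to_eax(uint32_t eax)
{
        return read_ah(eax);
}
```

In this model only `write_al` has to carry bits of the old register value into its result; the other two operations do not.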



* [Bug rtl-optimization/40772] generating rendundant moves from second byte of 32b/64b register
  2009-07-16 15:33 [Bug rtl-optimization/40772] New: generating rendundant moves from second byte of 32b/64b register zsojka at seznam dot cz
                   ` (2 preceding siblings ...)
  2009-07-17  9:54 ` rguenth at gcc dot gnu dot org
@ 2009-07-17 11:03 ` zsojka at seznam dot cz
  3 siblings, 0 replies; 6+ messages in thread
From: zsojka at seznam dot cz @ 2009-07-17 11:03 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #4 from zsojka at seznam dot cz  2009-07-17 11:03 -------
> The zero extension is done to avoid partial register stalls.

I am sorry, but do you mean that the generated code is supposedly the fastest
possible, and the benchmark shows a different result only because of some
"undocumented/unlucky" conditions? (In that case this report could be closed,
because there is no way to deterministically improve the generated code.)
Or are you saying "the code responsible for eliminating partial register
stalls does a bad job here, because when only 'ah' and 'eax' are used there is
no _false_ register dependency"?

I wasn't sure whether this has anything to do with "partial register stall
elimination", because the following, very similar (and functionally
identical) code:
------------------------------------------------
#include <stdint.h>

uint8_t data[16];

static __attribute__((noinline)) void bar(unsigned i)
{
        unsigned j;
        for (j = 0; j < 16; j++)
                data[j] = (i + j) >> 8;
}
------------------------------------------------

is compiled as:
------------------------------------------------
bar:
.LFB12:
        .cfi_startproc
        movl    %edi, %eax
        shrl    $8, %eax
        movb    %al, data(%rip)
        leal    1(%rdi), %eax
        shrl    $8, %eax
        movb    %al, data+1(%rip)
        leal    2(%rdi), %eax
        shrl    $8, %eax
        movb    %al, data+2(%rip)
        leal    3(%rdi), %eax
        shrl    $8, %eax
        movb    %al, data+3(%rip)
        leal    4(%rdi), %eax
...
------------------------------------------------
There is no "partial register stall elimination" here; the only difference is
"shr" instead of "movzx".

So I thought that:
- the version with the "& 0xFF00" mask is deciphered as 'only the second byte
is masked out and then shifted right by 8 bits, so "ah" can be moved to "al"
(or to the whole of eax)'
- the version without the mask is not deciphered as reading only the second
byte, so the working register is just shifted right
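
The claim that the two loop bodies are functionally identical can be checked with a small C sketch. The helper names below are hypothetical. The check relies on the store to a `uint8_t` discarding everything above the low 8 bits of the shifted value, which makes the `& 0xFF00` mask redundant.

```c
#include <stdint.h>

/* Byte stored by the masked variant from the original report. */
static uint8_t high_byte_masked(unsigned i, unsigned j)
{
        return ((i + j) & 0xFF00) >> 8;
}

/* Byte stored by the shift-only variant; the conversion to uint8_t
 * keeps only the low 8 bits of the shifted value. */
static uint8_t high_byte_shifted(unsigned i, unsigned j)
{
        return (i + j) >> 8;
}

/* Returns 1 if both variants agree for every j in 0..15 for this i. */
static int variants_agree(unsigned i)
{
        unsigned j;
        for (j = 0; j < 16; j++)
                if (high_byte_masked(i, j) != high_byte_shifted(i, j))
                        return 0;
        return 1;
}

/* Checks a sample of inputs: small values and values near unsigned
 * wraparound (0u - i is well defined for unsigned arithmetic). */
static int variants_agree_on_samples(void)
{
        unsigned i;
        for (i = 0; i < 0x20000u; i++)
                if (!variants_agree(i) || !variants_agree(0u - i))
                        return 0;
        return 1;
}
```

The stored byte can also change within a single call: for i = 0x12F8 the sum crosses a 256 boundary at j = 8, so data[7] receives 0x12 and data[8] receives 0x13. This is why each element needs its own addition in the generated code.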



* [Bug rtl-optimization/40772] generating rendundant moves from second byte of 32b/64b register
       [not found] <bug-40772-4@http.gcc.gnu.org/bugzilla/>
@ 2021-06-06 10:58 ` roger at nextmovesoftware dot com
  0 siblings, 0 replies; 6+ messages in thread
From: roger at nextmovesoftware dot com @ 2021-06-06 10:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40772

Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
                 CC|                            |roger at nextmovesoftware dot com
   Target Milestone|---                         |7.0
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #5 from Roger Sayle <roger at nextmovesoftware dot com> ---
This issue has been fixed since gcc 7; the compiler now stores the high-byte
registers (ah, bh, dh, etc.) directly to memory.  The original tst2b.c
testcase, when compiled with -O3 -march=k8 -fno-tree-vectorize, now looks like:
test:
.LFB0:
        .cfi_startproc
        leal    1(%rdi), %edx
        movl    %edi, %eax
        movb    %ah, data(%rip)
        addl    $15, %eax
        movb    %dh, data+1(%rip)
        leal    2(%rdi), %edx
        movb    %ah, data+15(%rip)
        movb    %dh, data+2(%rip)
        leal    3(%rdi), %edx
        movb    %dh, data+3(%rip)
        leal    4(%rdi), %edx
        movb    %dh, data+4(%rip)
        leal    5(%rdi), %edx
        movb    %dh, data+5(%rip)
        leal    6(%rdi), %edx
        movb    %dh, data+6(%rip)
        leal    7(%rdi), %edx
        movb    %dh, data+7(%rip)
        leal    8(%rdi), %edx
        movb    %dh, data+8(%rip)
        leal    9(%rdi), %edx
        movb    %dh, data+9(%rip)
        leal    10(%rdi), %edx
        movb    %dh, data+10(%rip)
        leal    11(%rdi), %edx
        movb    %dh, data+11(%rip)
        leal    12(%rdi), %edx
        movb    %dh, data+12(%rip)
        leal    13(%rdi), %edx
        movb    %dh, data+13(%rip)
        leal    14(%rdi), %edx
        movb    %dh, data+14(%rip)
        ret

