public inbox for gcc-bugs@sourceware.org
* [Bug target/97891] New: [x86] Consider using registers on large initializations
From: andysem at mail dot ru @ 2020-11-18 12:08 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97891
Bug ID: 97891
Summary: [x86] Consider using registers on large initializations
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: andysem at mail dot ru
Target Milestone: ---
Consider the following example code:
struct A
{
    long a;
    short b;
    int c;
    char d;
    long x;
    bool y;
    int z;
    char* p;

    A() :
        a(0), b(0), c(0), d(0), x(0), y(false), z(0), p(0)
    {}
};
void test(A* p, unsigned int count)
{
    for (unsigned int i = 0; i < count; ++i)
    {
        p[i] = A();
    }
}
When compiled with "-O3 -march=nehalem" the generated code is:
test(A*, unsigned int):
        testl   %esi, %esi
        je      .L1
        leal    -1(%rsi), %eax
        leaq    (%rax,%rax,2), %rax
        salq    $4, %rax
        leaq    48(%rdi,%rax), %rax
.L3:
        xorl    %edx, %edx
        movq    $0, (%rdi)
        addq    $48, %rdi
        movw    %dx, -40(%rdi)
        movl    $0, -36(%rdi)
        movb    $0, -32(%rdi)
        movq    $0, -24(%rdi)
        movb    $0, -16(%rdi)
        movl    $0, -12(%rdi)
        movq    $0, -8(%rdi)
        cmpq    %rax, %rdi
        jne     .L3
.L1:
        ret
https://gcc.godbolt.org/z/TrfWYr
Here, the main loop body between .L3 and .L1 is 60 bytes, with a significant
amount of space wasted on the $0 constants encoded in the mov instructions. It
would be more efficient to use a single zeroed register in all the member
initializations, especially given that %edx is already used that way.
A loop rewritten like this:
    for (unsigned int i = 0; i < count; ++i)
    {
        __asm__
        (
            "movq %q1, (%0)\n\t"
            "movw %w1, 8(%0)\n\t"
            "movl %1, 12(%0)\n\t"
            "movb %b1, 16(%0)\n\t"
            "movq %q1, 24(%0)\n\t"
            "movb %b1, 32(%0)\n\t"
            "movl %1, 36(%0)\n\t"
            "movq %q1, 40(%0)\n\t"
            : : "r" (p + i), "q" (0)
        );
    }
compiles to:
test(A*, unsigned int):
        testl   %esi, %esi
        je      .L1
        leal    -1(%rsi), %eax
        leaq    (%rax,%rax,2), %rax
        salq    $4, %rax
        leaq    48(%rdi,%rax), %rdx
        xorl    %eax, %eax
.L3:
        movq    %rax, (%rdi)
        movw    %ax, 8(%rdi)
        movl    %eax, 12(%rdi)
        movb    %al, 16(%rdi)
        movq    %rax, 24(%rdi)
        movb    %al, 32(%rdi)
        movl    %eax, 36(%rdi)
        movq    %rax, 40(%rdi)
        addq    $48, %rdi
        cmpq    %rdx, %rdi
        jne     .L3
.L1:
        ret
Here, the loop between .L3 and .L1 takes only 34 bytes, nearly half the
original size.
Constant (for example, zero) initialization is a frequently used pattern for
initializing structures, so sequences like the above are quite widespread.
Converting such cases to use registers could save some code size and reduce
cache pressure.
* [Bug target/97891] [x86] Consider using registers on large initializations
From: andysem at mail dot ru @ 2020-11-18 12:11 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97891
--- Comment #1 from andysem at mail dot ru ---
As a side note, the "xorl %edx, %edx" in the original code should have been
hoisted out of the loop, as it is in the version with the __asm__ block.
* [Bug target/97891] [x86] Consider using registers on large initializations
From: rguenth at gcc dot gnu.org @ 2020-11-18 14:37 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97891
Richard Biener <rguenth at gcc dot gnu.org> changed:
What                |Removed      |Added
----------------------------------------------------------------------------
Last reconfirmed    |             |2020-11-18
Status              |UNCONFIRMED  |NEW
Target              |x86-64       |x86_64-*-* i?86-*-*
Ever confirmed      |0            |1
Keywords            |             |missed-optimization
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed. I think we have (plenty?) duplicates.
* [Bug target/97891] [x86] Consider using registers on large initializations
From: crazylht at gmail dot com @ 2020-11-19 2:59 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97891
--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
This problem is very similar to the one pass_rpad deals with.
* [Bug target/97891] [x86] Consider using registers on large initializations
From: crazylht at gmail dot com @ 2020-11-19 8:25 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97891
--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #3)
> This problem is very similar to the one pass_rpad deals with.
We already have mov<mode>_xor for mov $0 to reg, so we only need to handle mov
$0 to mem.
The encoding sizes are:
xorl               --- 2 bytes
$0 in movb         --- 1 byte
$0 in movw         --- 2 bytes
$0 in mov{l,q}     --- 4 bytes
0: 31 d2 xor %edx,%edx
2: 48 c7 07 00 00 00 00 movq $0x0,(%rdi)
9: 48 89 37 mov %rsi,(%rdi)
c: 88 17 mov %dl,(%rdi)
e: c6 07 00 movb $0x0,(%rdi)
11: 66 89 17 mov %dx,(%rdi)
14: 66 c7 07 00 00 movw $0x0,(%rdi)
19: 89 3f mov %edi,(%rdi)
1b: c7 07 00 00 00 00 movl $0x0,(%rdi)
As long as the immediate $0 occupies more than 2 bytes, it is beneficial for
code size to replace mov $0, mem with xor reg0, reg0 + mov reg0, mem.
* [Bug target/97891] [x86] Consider using registers on large initializations
From: andysem at mail dot ru @ 2020-11-19 9:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97891
--- Comment #5 from andysem at mail dot ru ---
Using a register is beneficial even for bytes and words if there are multiple
mov instructions, but there has to be a single reg0 shared by all the movs.
I'm not very knowledgeable about gcc internals, but would it be beneficial to
implement this at a higher level than instruction transformation? I.e. so that
instead of this:
a = 0;
b = 0;
c = 0;
we have:
any reg0 = 0; // "any" represents a type compatible with any fundamental or enum type
a = reg0;
b = reg0;
c = reg0;
This way, reg0 would live in a single register, and the xorl instruction could
be subject to other tree-level optimizations.
With tree-level optimization, another thing to note is the vectorizer. I know
gcc can sometimes merge adjacent initializations with no padding between them
into a single wider store. For example:
struct A
{
    long a1;
    long a2;

    A() :
        a1(0), a2(0)
    {
    }
};
void test(A* p, unsigned int count)
{
    for (unsigned int i = 0; i < count; ++i)
    {
        p[i] = A();
    }
}
test(A*, unsigned int):
        testl   %esi, %esi
        je      .L1
        leal    -1(%rsi), %eax
        pxor    %xmm0, %xmm0
        salq    $4, %rax
        leaq    16(%rdi,%rax), %rax
.L3:
        movups  %xmm0, (%rdi)
        addq    $16, %rdi
        cmpq    %rax, %rdi
        jne     .L3
.L1:
        ret
I would like this to still work.
* [Bug target/97891] [x86] Consider using registers on large initializations
From: crazylht at gmail dot com @ 2020-12-24 2:34 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97891
--- Comment #6 from Hongtao.liu <crazylht at gmail dot com> ---
cat test.c

typedef struct {
    long a;
    long b;
} TI;

extern TI r;

void
foo ()
{
    r.a = 0;
    r.b = 0;
}
gcc -Ofast -march=cascadelake -S gives:
foo:
.LFB0:
        .cfi_startproc
        movq    $0, r(%rip)
        movq    $0, r+8(%rip)
        ret
        .cfi_endproc
.LFE0:
SLP failed to vectorize due to cost:
test1.c:10:7: note: === vect_slp_analyze_instance_alignment ===
test1.c:10:7: note: vect_compute_data_ref_alignment:
test1.c:10:7: note: can't force alignment of ref: r.a
test1.c:10:7: note: === vect_slp_analyze_instance_dependence ===
test1.c:10:7: note: === vect_slp_analyze_operations ===
test1.c:10:7: note: ==> examining statement: r.a = 0;
test1.c:10:7: note: vect_is_simple_use: operand 0, type of def: constant
test1.c:10:7: note: Vectorizing an unaligned access.
test1.c:10:7: note: vect_model_store_cost: unaligned supported by hardware.
test1.c:10:7: note: vect_model_store_cost: inside_cost = 16, prologue_cost = 0.
test1.c:10:7: note: === vect_bb_partition_graph ===
test1.c:10:7: note: ***** Analysis succeeded with vector mode V16QI
test1.c:10:7: note: SLPing BB part
0x48bc3d0 0 1 times unaligned_store (misalign -1) costs 16 in body
0x48bc3d0 <unknown> 1 times vector_load costs 12 in prologue
0x48bbf50 0 1 times scalar_store costs 12 in body
0x48bbf50 0 1 times scalar_store costs 12 in body
test1.c:10:7: note: Cost model analysis:
Vector inside of basic block cost: 16
Vector prologue cost: 12
Vector epilogue cost: 0
Scalar cost of basic block: 24
Shouldn't the cost of a zero vector CTOR be different from that of a normal
one, since x86 has pxor for it (and aarch64 presumably eor)?