public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/88492] SLP optimization generates ugly code
       [not found] <bug-88492-4@http.gcc.gnu.org/bugzilla/>
@ 2021-04-14 16:36 ` ptomsich at gcc dot gnu.org
  2021-04-14 17:12 ` tnfchris at gcc dot gnu.org
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 4+ messages in thread
From: ptomsich at gcc dot gnu.org @ 2021-04-14 16:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

ptomsich at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ptomsich at gcc dot gnu.org

--- Comment #6 from ptomsich at gcc dot gnu.org ---
With the current master, the test case generates (with -mcpu=neoverse-n1):

        .arch armv8.2-a+crc+fp16+rcpc+dotprod+profile
        .file   "pr88492.c"
        .text
        .align  2
        .p2align 5,,15
        .global test_slp
        .type   test_slp, %function
test_slp:
.LFB0:
        .cfi_startproc
        ldr     q2, [x0]
        adrp    x1, .LC0
        ldr     q16, [x1, #:lo12:.LC0]
        uxtl    v4.8h, v2.8b
        uxtl2   v2.8h, v2.16b
        uxtl    v0.4s, v4.4h
        uxtl    v6.4s, v2.4h
        uxtl2   v4.4s, v4.8h
        uxtl2   v2.4s, v2.8h
        mov     v1.16b, v0.16b
        mov     v7.16b, v6.16b
        mov     v5.16b, v4.16b
        mov     v3.16b, v2.16b
        tbl     v0.16b, {v0.16b - v1.16b}, v16.16b
        tbl     v6.16b, {v6.16b - v7.16b}, v16.16b
        tbl     v4.16b, {v4.16b - v5.16b}, v16.16b
        tbl     v2.16b, {v2.16b - v3.16b}, v16.16b
        add     v0.4s, v0.4s, v4.4s
        add     v6.4s, v6.4s, v2.4s
        add     v0.4s, v0.4s, v6.4s
        addv    s0, v0.4s
        fmov    w0, s0
        ret
        .cfi_endproc
.LFE0:
        .size   test_slp, .-test_slp

which contrasts with LLVM13 (with -mcpu=neoverse-n1):

test_slp:                               // @test_slp
        .cfi_startproc
// %bb.0:                               // %entry
        ldr     q0, [x0]
        movi    v1.16b, #1
        movi    v2.2d, #0000000000000000
        udot    v2.4s, v0.16b, v1.16b
        addv    s0, v2.4s
        fmov    w0, s0
        ret
.Lfunc_end0:
        .size   test_slp, .Lfunc_end0-test_slp

or (LLVM13 w/o the mcpu-option):

        .type   test_slp,@function
test_slp:                               // @test_slp
        .cfi_startproc
// %bb.0:                               // %entry
        ldr     q0, [x0]
        ushll2  v1.8h, v0.16b, #0
        ushll   v0.8h, v0.8b, #0
        uaddl2  v2.4s, v0.8h, v1.8h
        uaddl   v0.4s, v0.4h, v1.4h
        add     v0.4s, v0.4s, v2.4s
        addv    s0, v0.4s
        fmov    w0, s0
        ret
.Lfunc_end0:
        .size   test_slp, .Lfunc_end0-test_slp
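As a reading aid for the udot listing above, here is a minimal scalar sketch
of what that sequence appears to compute: a dot product of the 16 loaded
bytes with an all-ones vector, which is just the sum of the bytes. The names
udot_model, sum16_via_udot and sum16_plain are hypothetical helpers made up
for this illustration; they are not part of the test case or of either
compiler.

```c
#include <stdint.h>

/* Scalar model of one UDOT step (hypothetical helper): each 32-bit lane
   accumulates the products of four adjacent pairs of unsigned bytes. */
static void udot_model(uint32_t acc[4], const uint8_t a[16], const uint8_t b[16])
{
    for (int lane = 0; lane < 4; lane++)
        for (int k = 0; k < 4; k++)
            acc[lane] += (uint32_t)a[4 * lane + k] * b[4 * lane + k];
}

/* Sum 16 bytes following the shape of the LLVM listing. */
static uint32_t sum16_via_udot(const uint8_t *p)
{
    uint8_t ones[16];
    uint32_t acc[4] = {0, 0, 0, 0};            /* movi v2.2d, #0 */

    for (int i = 0; i < 16; i++)
        ones[i] = 1;                           /* movi v1.16b, #1 */

    udot_model(acc, p, ones);                  /* udot v2.4s, v0.16b, v1.16b */
    return acc[0] + acc[1] + acc[2] + acc[3];  /* addv s0, v2.4s */
}

/* Plain reference: add the 16 bytes one by one. */
static uint32_t sum16_plain(const uint8_t *p)
{
    uint32_t sum = 0;
    for (int i = 0; i < 16; i++)
        sum += p[i];
    return sum;
}
```

Calling sum16_via_udot and sum16_plain on the same 16-byte buffer should give
the same value, since multiplying by one and accumulating per lane only
regroups the additions.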


* [Bug tree-optimization/88492] SLP optimization generates ugly code
From: tnfchris at gcc dot gnu.org @ 2021-04-14 17:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

--- Comment #7 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to ptomsich from comment #6)
> With the current master, the test case generates (with -mcpu=neoverse-n1):

> which contrasts with LLVM13 (with -mcpu=neoverse-n1):
> 
> test_slp:                               // @test_slp
> 	.cfi_startproc
> // %bb.0:                               // %entry
> 	ldr	q0, [x0]
> 	movi	v1.16b, #1
> 	movi	v2.2d, #0000000000000000
> 	udot	v2.4s, v0.16b, v1.16b
> 	addv	s0, v2.4s
> 	fmov	w0, s0
> 	ret
> .Lfunc_end0:
> 	.size	test_slp, .Lfunc_end0-test_slp
> 
> or (LLVM13 w/o the mcpu-option):
> 
> 	.type	test_slp,@function
> test_slp:                               // @test_slp
> 	.cfi_startproc
> // %bb.0:                               // %entry
> 	ldr	q0, [x0]
> 	ushll2	v1.8h, v0.16b, #0
> 	ushll	v0.8h, v0.8b, #0
> 	uaddl2	v2.4s, v0.8h, v1.8h
> 	uaddl	v0.4s, v0.4h, v1.4h
> 	add	v0.4s, v0.4s, v2.4s
> 	addv	s0, v0.4s
> 	fmov	w0, s0
> 	ret
> .Lfunc_end0:
> 	.size	test_slp, .Lfunc_end0-test_slp

It's definitely a neat trick, but correct me if I'm wrong: it's only possible
because addition is commutative.

Clang has simply reordered the loads, because the loop is simple enough to
reduce to just

    for( int i = 0; i < 4; i++, b += 4 )
    {
            tmp[i][0] = b[0];
            tmp[i][1] = b[1];
            tmp[i][2] = b[2];
            tmp[i][3] = b[3];
    }

Which GCC also handles fine. 
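A minimal sketch of why the reordering is safe here. The permuted version
below is an assumption about the shape of the test case (columns 1 and 2
stored swapped, following the store pattern of the other examples in this
comment); the in-order version is the loop shown above. The function names
are made up for the illustration.

```c
/* Assumed shape of the test case: b[1] and b[2] land in swapped columns. */
int test_slp_permuted(unsigned char *b)
{
    unsigned int tmp[4][4];
    int sum = 0;
    for (int i = 0; i < 4; i++, b += 4) {
        tmp[i][0] = b[0];
        tmp[i][2] = b[1];
        tmp[i][1] = b[2];
        tmp[i][3] = b[3];
    }
    /* Every element of tmp is added exactly once, so the column each
       byte was stored in has no effect on the total. */
    for (int i = 0; i < 4; i++)
        sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
    return sum;
}

/* Same reduction with the loads stored in order. */
int test_slp_inorder(unsigned char *b)
{
    unsigned int tmp[4][4];
    int sum = 0;
    for (int i = 0; i < 4; i++, b += 4) {
        tmp[i][0] = b[0];
        tmp[i][1] = b[1];
        tmp[i][2] = b[2];
        tmp[i][3] = b[3];
    }
    for (int i = 0; i < 4; i++)
        sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
    return sum;
}
```

Both functions return the same value for any 16-byte input, which is all the
reordering relies on. As soon as the stored values feed an operation that
depends on the column, the permutation can no longer be dropped.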

As Richi mentioned before

>I know the "real" code this testcase is from has actual operations
> in place of the b[N] reads, for the above vectorization looks somewhat
> pointless given we end up decomposing the result again.

Optimizing for this particular example seems too narrow a focus, as the real
code does "other" things.

For example, both GCC and Clang fall apart with

int test_slp( unsigned char *b )
{
    unsigned int tmp[4][4];
    int sum = 0;
    for( int i = 0; i < 4; i++, b += 4 )
    {
            tmp[i][0] = b[0] - b[4];
            tmp[i][2] = b[1] + b[5];
            tmp[i][1] = b[2] - b[6];
            tmp[i][3] = b[3] + b[7];
    }
    for( int i = 0; i < 4; i++ )
    {
            sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
    }
    return sum;
}

which has about the same access pattern as the real code.

If you change the operations, you'll notice that for other examples like

int test_slp( unsigned char *b )
{
    unsigned int tmp[4][4];
    int sum = 0;
    for( int i = 0; i < 4; i++, b += 4 )
    {
            tmp[i][0] = b[0] - b[4];
            tmp[i][2] = b[1] - b[5];
            tmp[i][1] = b[2] - b[6];
            tmp[i][3] = b[3] - b[7];
    }
    for( int i = 0; i < 4; i++ )
    {
            sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
    }
    return sum;
}

GCC handles this better (but we are let down by register allocation).

To me it seems quite unlikely that actual code would be written like that, but
I guess a case could be made for trying to reassociate loads as well.
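To illustrate how much room reassociation leaves in examples this small,
here is a sketch of the all-subtraction variant next to a hand-derived
closed form. Across the four iterations every byte from b[4] to b[15] is
added once and subtracted once, so only the first and last four bytes
survive. The closed form and the function names are my own illustration
under that arithmetic; nothing here claims that GCC or Clang performs this
transformation.

```c
/* The all-subtraction variant from above, renamed for the comparison. */
int test_slp_sub(unsigned char *b)
{
    unsigned int tmp[4][4];
    int sum = 0;
    for (int i = 0; i < 4; i++, b += 4) {
        tmp[i][0] = b[0] - b[4];
        tmp[i][2] = b[1] - b[5];
        tmp[i][1] = b[2] - b[6];
        tmp[i][3] = b[3] - b[7];
    }
    for (int i = 0; i < 4; i++)
        sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
    return sum;
}

/* Hypothetical closed form: the sum telescopes to the first four bytes
   minus the last four of the 20 bytes read. The arithmetic is done in
   unsigned int, matching the wrap-around of the tmp elements. */
int test_slp_sub_telescoped(const unsigned char *b)
{
    unsigned int head = b[0] + b[1] + b[2] + b[3];
    unsigned int tail = b[16] + b[17] + b[18] + b[19];
    return (int)(head - tail);
}
```

Both functions read the same 20-byte buffer and should agree on every input.
The real code has actual operations between the loads and the reduction, so
this kind of collapse is not expected to carry over to it.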


* [Bug tree-optimization/88492] SLP optimization generates ugly code
From: pinskia at gcc dot gnu.org @ 2022-01-04 19:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=99412

--- Comment #8 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Similar to PR 99412.


* [Bug tree-optimization/88492] SLP optimization generates ugly code
From: pinskia at gcc dot gnu.org @ 2024-02-27  7:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

--- Comment #9 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I noticed that once I add V4QI and V2HI support to the aarch64 backend, this
code gets even worse.
