[Bug middle-end/114109] New: x264 satd vectorization vs LLVM

public inbox for gcc-bugs@sourceware.org

* [Bug middle-end/114109] New: x264 satd vectorization vs LLVM
@ 2024-02-26 10:28 rdapp at gcc dot gnu.org
  2024-02-26 10:44 ` [Bug middle-end/114109] " juzhe.zhong at rivai dot ai
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: rdapp at gcc dot gnu.org @ 2024-02-26 10:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109

            Bug ID: 114109
           Summary: x264 satd vectorization vs LLVM
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rdapp at gcc dot gnu.org
                CC: juzhe.zhong at rivai dot ai, law at gcc dot gnu.org
  Target Milestone: ---
            Target: x86_64-*-* riscv*-*-*

Looking at the following code of x264 (SPEC 2017):

typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned int uint32_t;

static inline uint32_t abs2 (uint32_t a)
{
    uint32_t s = ((a >> 15) & 0x10001) * 0xffff;
    return (a + s) ^ s;
}

int x264_pixel_satd_8x4 (uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2)
{
    uint32_t tmp[4][4];
    uint32_t a0, a1, a2, a3;
    int sum = 0;

    for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
    {
        a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
        a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
        a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
        a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
        {
          int t0 = a0 + a1;
          int t1 = a0 - a1;
          int t2 = a2 + a3;
          int t3 = a2 - a3;
          tmp[i][0] = t0 + t2;
          tmp[i][1] = t1 + t3;
          tmp[i][2] = t0 - t2;
          tmp[i][3] = t1 - t3;
        };
    }
    for( int i = 0; i < 4; i++ )
    {
        { int t0 = tmp[0][i] + tmp[1][i];
          int t1 = tmp[0][i] - tmp[1][i];
          int t2 = tmp[2][i] + tmp[3][i];
          int t3 = tmp[2][i] - tmp[3][i];
          a0 = t0 + t2;
          a2 = t0 - t2;
          a1 = t1 + t3;
          a3 = t1 - t3;
        };
        sum += abs2 (a0) + abs2 (a1) + abs2 (a2) + abs2 (a3);
    }
    return (((uint16_t) sum) + ((uint32_t) sum >> 16)) >> 1;
}
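
The packing trick used in the code above can be checked in isolation.  The
following is only a minimal sketch: abs2 is copied from the report, while
pack2 and fold2 are hypothetical helper names introduced here to mirror the
"lo + (hi << 16)" expressions of the first loop and the lane addition in the
return statement (without the final >> 1).

```c
#include <stdint.h>

/* Hypothetical helper: pack two small signed values into one 32-bit word
   the way the first loop does, i.e. lo + (hi << 16).  Unsigned arithmetic
   is used so that negative inputs wrap instead of being shifted as ints. */
static uint32_t pack2 (int lo, int hi)
{
    return (uint32_t) lo + ((uint32_t) hi << 16);
}

/* abs2 as in the report: absolute value of both 16-bit lanes at once.
   s is 0xffff in every lane whose sign bit (bit 15 or bit 31) is set. */
static uint32_t abs2 (uint32_t a)
{
    uint32_t s = ((a >> 15) & 0x10001) * 0xffff;
    return (a + s) ^ s;
}

/* Hypothetical helper: add the low and the high 16-bit lane. */
static uint32_t fold2 (uint32_t sum)
{
    return (uint16_t) sum + (sum >> 16);
}
```

For example, pack2 (-3, 5) yields 0x0004fffd because the negative low lane
borrows from the high lane; abs2 undoes that borrow and returns 0x00050003.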

I first checked on riscv, but x86 and aarch64 behave similarly (see
https://godbolt.org/z/vzf5ha44r, which compares GCC and clang at -O3 -mavx512f).

Whether the first loop gets vectorized appears to be a costing issue.  By
default we don't vectorize it, and the code becomes much larger when the vector
cost model is disabled, so the costing decision in itself seems correct.
Clang's version is significantly shorter; it looks like it just directly
vec_sets/vec_inits the individual elements.  On riscv this could be handled
rather elegantly with strided loads, which we don't emit right now.
As there are only 4 active vector elements and the loop is likely load-bound,
it is debatable whether LLVM's version is actually better.
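
To make the strided access pattern concrete, here is a rough scalar sketch,
not what either compiler emits.  The names load_column and
satd_first_stage_columns are hypothetical.  load_column fetches the same
column from four consecutive rows, which is the set of bytes a single strided
vector load with byte stride i_pix1 (or i_pix2) would read.

```c
#include <stdint.h>

/* Hypothetical helper: read column `col` of four rows that are `stride`
   bytes apart, i.e. what one strided load of four elements would fetch. */
static void load_column (const uint8_t *pix, int stride, int col,
                         int32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = pix[col + i * stride];
}

/* Compute a0..a3 of the first loop for all four rows at once.
   a[k][i] holds "ak" of row i.  Unsigned arithmetic is used for the
   shift so that negative differences wrap modulo 2^32. */
void satd_first_stage_columns (const uint8_t *pix1, int i_pix1,
                               const uint8_t *pix2, int i_pix2,
                               uint32_t a[4][4])
{
    for (int k = 0; k < 4; k++)
    {
        int32_t lo1[4], lo2[4], hi1[4], hi2[4];
        load_column (pix1, i_pix1, k, lo1);
        load_column (pix2, i_pix2, k, lo2);
        load_column (pix1, i_pix1, k + 4, hi1);
        load_column (pix2, i_pix2, k + 4, hi2);
        for (int i = 0; i < 4; i++)
            a[k][i] = (uint32_t) (lo1[i] - lo2[i])
                      + ((uint32_t) (hi1[i] - hi2[i]) << 16);
    }
}
```

Each inner loop over i then operates on four independent rows, which matches
the "4 active vector elements" mentioned above.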

The second loop we do vectorize (4 elements at a time), but we end up with e.g.
four XORs for the four inlined abs2 calls, whereas clang chooses a larger
vectorization factor and does all the XORs in a single operation.
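
As an illustration of the wider shape only (this is a hand-written sketch, not
clang's actual transformation), the second loop can be restructured so that
all 16 butterfly outputs are stored first and abs2 is applied in one flat
loop.  The function name satd_second_stage_flat and the scratch array v are
hypothetical; the function returns the raw sum before the final lane fold.

```c
#include <stdint.h>

/* abs2 as in the report: absolute value of two packed 16-bit lanes. */
static inline uint32_t abs2 (uint32_t a)
{
    uint32_t s = ((a >> 15) & 0x10001) * 0xffff;
    return (a + s) ^ s;
}

/* Hypothetical restructuring of the second loop.  The butterflies write
   16 values into v; a single loop of 16 iterations then applies abs2,
   so a vectorizer could use one 16-element XOR instead of four
   4-element ones.  Arithmetic is unsigned, so results match the
   original modulo 2^32. */
uint32_t satd_second_stage_flat (uint32_t tmp[4][4])
{
    uint32_t v[16];
    for (int i = 0; i < 4; i++)
    {
        uint32_t t0 = tmp[0][i] + tmp[1][i];
        uint32_t t1 = tmp[0][i] - tmp[1][i];
        uint32_t t2 = tmp[2][i] + tmp[3][i];
        uint32_t t3 = tmp[2][i] - tmp[3][i];
        v[i]      = t0 + t2;   /* a0 */
        v[4 + i]  = t1 + t3;   /* a1 */
        v[8 + i]  = t0 - t2;   /* a2 */
        v[12 + i] = t1 - t3;   /* a3 */
    }

    uint32_t sum = 0;
    for (int j = 0; j < 16; j++)
        sum += abs2 (v[j]);
    return sum;
}
```

Whether the extra stores pay off depends on the target, so this is meant only
to show where the larger vectorization factor comes from.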

On my laptop (no avx512) I don't see a huge difference (113s GCC vs 108s LLVM)
but I guess the general case is still interesting?


* [Bug middle-end/114109] x264 satd vectorization vs LLVM
  2024-02-26 10:28 [Bug middle-end/114109] New: x264 satd vectorization vs LLVM rdapp at gcc dot gnu.org
@ 2024-02-26 10:44 ` juzhe.zhong at rivai dot ai
  2024-02-26 11:20 ` rdapp at gcc dot gnu.org
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-02-26 10:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109

--- Comment #1 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
It seems RISC-V Clang didn't vectorize it?

https://godbolt.org/z/G4han6vM3


* [Bug middle-end/114109] x264 satd vectorization vs LLVM
  2024-02-26 10:28 [Bug middle-end/114109] New: x264 satd vectorization vs LLVM rdapp at gcc dot gnu.org
  2024-02-26 10:44 ` [Bug middle-end/114109] " juzhe.zhong at rivai dot ai
@ 2024-02-26 11:20 ` rdapp at gcc dot gnu.org
  2024-02-26 11:24 ` juzhe.zhong at rivai dot ai
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: rdapp at gcc dot gnu.org @ 2024-02-26 11:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109

--- Comment #2 from Robin Dapp <rdapp at gcc dot gnu.org> ---
It is vectorized with a higher zvl, e.g. zvl512b, refer
https://godbolt.org/z/vbfjYn5Kd.


* [Bug middle-end/114109] x264 satd vectorization vs LLVM
  2024-02-26 10:28 [Bug middle-end/114109] New: x264 satd vectorization vs LLVM rdapp at gcc dot gnu.org
  2024-02-26 10:44 ` [Bug middle-end/114109] " juzhe.zhong at rivai dot ai
  2024-02-26 11:20 ` rdapp at gcc dot gnu.org
@ 2024-02-26 11:24 ` juzhe.zhong at rivai dot ai
  2024-02-26 11:26 ` rdapp at gcc dot gnu.org
  2024-02-26 15:08 ` rguenth at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-02-26 11:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109

--- Comment #3 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Robin Dapp from comment #2)
> It is vectorized with a higher zvl, e.g. zvl512b, refer
> https://godbolt.org/z/vbfjYn5Kd.

OK, I see.  But Clang generates many slide instructions, which are expensive on
real hardware.

vluxei64 is also expensive.

I am not sure which is better.  Performance should be evaluated on real RISC-V
hardware rather than by simply comparing dynamic instruction counts on
SPIKE/QEMU.


* [Bug middle-end/114109] x264 satd vectorization vs LLVM
  2024-02-26 10:28 [Bug middle-end/114109] New: x264 satd vectorization vs LLVM rdapp at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2024-02-26 11:24 ` juzhe.zhong at rivai dot ai
@ 2024-02-26 11:26 ` rdapp at gcc dot gnu.org
  2024-02-26 15:08 ` rguenth at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: rdapp at gcc dot gnu.org @ 2024-02-26 11:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109

--- Comment #4 from Robin Dapp <rdapp at gcc dot gnu.org> ---
Yes, as mentioned, vectorization of the first loop is debatable.


* [Bug middle-end/114109] x264 satd vectorization vs LLVM
  2024-02-26 10:28 [Bug middle-end/114109] New: x264 satd vectorization vs LLVM rdapp at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2024-02-26 11:26 ` rdapp at gcc dot gnu.org
@ 2024-02-26 15:08 ` rguenth at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-02-26 15:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |53947
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
There's at least one other bug about this (or a similar) pattern.  Note that
using -fno-vect-cost-model isn't really recommended.

Might want to relate the various x264 missed-opt bugs.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

