[Bug middle-end/114109] New: x264 satd vectorization vs LLVM

public inbox for gcc-bugs@sourceware.org

From: "rdapp at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/114109] New: x264 satd vectorization vs LLVM
Date: Mon, 26 Feb 2024 10:28:50 +0000
Message-ID: <bug-114109-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109

            Bug ID: 114109
           Summary: x264 satd vectorization vs LLVM
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rdapp at gcc dot gnu.org
                CC: juzhe.zhong at rivai dot ai, law at gcc dot gnu.org
  Target Milestone: ---
            Target: x86_64-*-* riscv*-*-*

Looking at the following code of x264 (SPEC 2017):

typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned int uint32_t;

static inline uint32_t abs2 (uint32_t a)
{
    uint32_t s = ((a >> 15) & 0x10001) * 0xffff;
    return (a + s) ^ s;
}

int x264_pixel_satd_8x4 (uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2)
{
    uint32_t tmp[4][4];
    uint32_t a0, a1, a2, a3;
    int sum = 0;

    for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
    {
        a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
        a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
        a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
        a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
        {
          int t0 = a0 + a1;
          int t1 = a0 - a1;
          int t2 = a2 + a3;
          int t3 = a2 - a3;
          tmp[i][0] = t0 + t2;
          tmp[i][1] = t1 + t3;
          tmp[i][2] = t0 - t2;
          tmp[i][3] = t1 - t3;
        };
    }
    for( int i = 0; i < 4; i++ )
    {
        { int t0 = tmp[0][i] + tmp[1][i];
          int t1 = tmp[0][i] - tmp[1][i];
          int t2 = tmp[2][i] + tmp[3][i];
          int t3 = tmp[2][i] - tmp[3][i];
          a0 = t0 + t2;
          a2 = t0 - t2;
          a1 = t1 + t3;
          a3 = t1 - t3;
        };
        sum += abs2 (a0) + abs2 (a1) + abs2 (a2) + abs2 (a3);
    }
    return (((uint16_t) sum) + ((uint32_t) sum >> 16)) >> 1;
}

I first checked on riscv, but x86 and aarch64 look pretty similar (see
https://godbolt.org/z/vzf5ha44r, which compares GCC and clang at
-O3 -mavx512f).

Vectorizing the first loop seems to be a costing issue.  By default we don't
vectorize and the code becomes much larger when disabling vector costing, so
the costing decision in itself seems correct.
Clang's version is significantly shorter and it looks like it just directly
vec_sets/vec_inits the individual elements.  On riscv it can be handled rather
elegantly with strided loads that we don't emit right now.
As there are only 4 active vector elements and the loop is likely load bound,
it is debatable whether LLVM's version is actually better.
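
To illustrate the strided-load idea, here is a minimal scalar sketch, assuming
one vector lane per row of the 8x4 block.  With that layout, element k of all
four rows sits i_pix1 (resp. i_pix2) bytes apart, which is the access pattern a
strided load would cover.  The names rows_scalar and rows_lanewise are
hypothetical and not taken from x264, GCC or LLVM, and the differences are cast
to uint32_t before shifting so the sketch avoids shifting a negative int (the
result is unchanged modulo 2^32).

```c
#include <stdint.h>

/* Reference: the first loop from the report, one row per iteration. */
static void rows_scalar(const uint8_t *pix1, int i_pix1,
                        const uint8_t *pix2, int i_pix2,
                        uint32_t tmp[4][4])
{
    for (int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2) {
        uint32_t a[4];
        for (int k = 0; k < 4; k++)
            a[k] = (uint32_t)(pix1[k] - pix2[k])
                 + ((uint32_t)(pix1[k + 4] - pix2[k + 4]) << 16);
        uint32_t t0 = a[0] + a[1], t1 = a[0] - a[1];
        uint32_t t2 = a[2] + a[3], t3 = a[2] - a[3];
        tmp[i][0] = t0 + t2;
        tmp[i][1] = t1 + t3;
        tmp[i][2] = t0 - t2;
        tmp[i][3] = t1 - t3;
    }
}

/* Same computation with the row index as the innermost ("lane") dimension.
   Each inner loop over lane reads bytes that are i_pix1 / i_pix2 apart,
   i.e. what a 4-element strided byte load would fetch. */
static void rows_lanewise(const uint8_t *pix1, int i_pix1,
                          const uint8_t *pix2, int i_pix2,
                          uint32_t tmp[4][4])
{
    uint32_t a[4][4];                       /* a[k][lane] */

    for (int k = 0; k < 4; k++)
        for (int lane = 0; lane < 4; lane++) {
            const uint8_t *p1 = pix1 + lane * i_pix1;
            const uint8_t *p2 = pix2 + lane * i_pix2;
            a[k][lane] = (uint32_t)(p1[k] - p2[k])
                       + ((uint32_t)(p1[k + 4] - p2[k + 4]) << 16);
        }

    /* The butterfly is now a plain element-wise operation across lanes. */
    for (int lane = 0; lane < 4; lane++) {
        uint32_t t0 = a[0][lane] + a[1][lane], t1 = a[0][lane] - a[1][lane];
        uint32_t t2 = a[2][lane] + a[3][lane], t3 = a[2][lane] - a[3][lane];
        tmp[lane][0] = t0 + t2;
        tmp[lane][1] = t1 + t3;
        tmp[lane][2] = t0 - t2;
        tmp[lane][3] = t1 - t3;
    }
}
```

Both functions fill tmp identically; the second only reorders the work so that
the loads are strided and the arithmetic is lane-parallel.  Whether that is
profitable with just 4 active elements is exactly the open question above.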

The second loop we do vectorize (4 elements at a time) but end up with e.g.
four XORs for the four inlined abs2 calls while clang chooses a larger
vectorization factor and does all the xors in one.
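
As a rough scalar picture of the difference, here is a sketch, assuming the
larger vectorization factor amounts to collecting all 16 butterfly outputs
first and then running abs2 over them in a single pass.  The function names
tail_per_column and tail_batched are made up for illustration, this is not
what either compiler literally emits, and uint32_t is used throughout so the
arithmetic is well defined modulo 2^32.

```c
#include <stdint.h>

/* abs2 from the report: absolute value of two packed 16-bit halves. */
static inline uint32_t abs2(uint32_t a)
{
    uint32_t s = ((a >> 15) & 0x10001) * 0xffff;
    return (a + s) ^ s;
}

/* Shape of the second loop as written: four abs2 calls (and thus four
   XORs) per column. */
static int tail_per_column(uint32_t tmp[4][4])
{
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t t0 = tmp[0][i] + tmp[1][i], t1 = tmp[0][i] - tmp[1][i];
        uint32_t t2 = tmp[2][i] + tmp[3][i], t3 = tmp[2][i] - tmp[3][i];
        sum += abs2(t0 + t2) + abs2(t1 + t3) + abs2(t0 - t2) + abs2(t1 - t3);
    }
    return (int)(((uint16_t)sum + (sum >> 16)) >> 1);
}

/* Batched variant: store all 16 values, then apply abs2 in one loop over
   16 elements, which a wide enough vector unit could do with one XOR. */
static int tail_batched(uint32_t tmp[4][4])
{
    uint32_t v[16];
    for (int i = 0; i < 4; i++) {
        uint32_t t0 = tmp[0][i] + tmp[1][i], t1 = tmp[0][i] - tmp[1][i];
        uint32_t t2 = tmp[2][i] + tmp[3][i], t3 = tmp[2][i] - tmp[3][i];
        v[i]      = t0 + t2;
        v[4 + i]  = t1 + t3;
        v[8 + i]  = t0 - t2;
        v[12 + i] = t1 - t3;
    }

    uint32_t sum = 0;
    for (int j = 0; j < 16; j++)
        sum += abs2(v[j]);
    return (int)(((uint16_t)sum + (sum >> 16)) >> 1);
}
```

Since unsigned addition is commutative and wraps, both variants return the
same value for any tmp; only the grouping of the abs2 work differs.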

On my laptop (no avx512) I don't see a huge difference (113s GCC vs 108s LLVM)
but I guess the general case is still interesting?
