[Bug target/106340] New: flag set from SVE svwhilelt intrinsic not reused in loop

* [Bug target/106340] New: flag set from SVE svwhilelt intrinsic not reused in loop
@ 2022-07-18 13:01 yyc1992 at gmail dot com
  2022-07-18 13:13 ` [Bug target/106340] " yyc1992 at gmail dot com
  2022-07-20 14:35 ` yyc1992 at gmail dot com
  0 siblings, 2 replies; 3+ messages in thread
From: yyc1992 at gmail dot com @ 2022-07-18 13:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106340

            Bug ID: 106340
           Summary: flag set from SVE svwhilelt intrinsic not reused in
                    loop
           Product: gcc
           Version: 12.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

I'm experimenting with manually writing VLA loops and trying to match the
assembly code I expect, or that I get from the autovectorizer. One of the main
areas I can't get to work is setting the loop predicate using the svwhilelt
intrinsics. The instruction they correspond to sets the flags, which can be
used directly to terminate the loop. Indeed, when using the autovectorizer,
this is exactly what happens.

```
void set1(uint32_t *__restrict__ out, size_t m)
{
    for (size_t i = 0; i < m; i++) {
        out[i] = 1;
    }
}
```

compiles to

```
        cbz     x1, .L1
        mov     x2, 0
        cntw    x3
        whilelo p0.s, xzr, x1
        mov     z0.s, #1
        .p2align 3,,7
.L3:
        st1w    z0.s, p0, [x0, x2, lsl 2]
        add     x2, x2, x3
        whilelo p0.s, x2, x1
        b.any   .L3
.L1:
        ret
```

(Here I believe the flags set by the loop-header whilelo could also be used
for the jump, but that doesn't save much in this case.)

However, no matter how I try to replicate this with manually written code
using the SVE intrinsics, there is always an additional cmp instruction
generated. The closest I can get is by replicating the structure of the
auto-vectorized loop as closely as possible with

```
void set2(uint32_t *__restrict__ out, size_t m)
{
    auto svelen = svcntw();
    auto v = svdup_u32(1);
    if (m != 0) {
        auto pg = svwhilelt_b32(0ul, m);
        for (size_t i = 0; i < m; i += svelen, pg = svwhilelt_b32(i, m)) {
            svst1(pg, &out[i], v);
        }
    }
}
```

which is compiled to

```
        cbz     x1, .L9
        mov     x2, 0
        cntw    x3
        whilelo p0.s, xzr, x1
        mov     z0.s, #1
        .p2align 3,,7
.L11:
        st1w    z0.s, p0, [x0, x2, lsl 2]
        add     x2, x2, x3
        whilelo p0.s, x2, x1
        cmp     x1, x2
        bhi     .L11
.L9:
        ret
```

which is literally the same code down to register allocation except that the
branch following the `whilelo` instruction is replaced with another comparison
and branch.
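
To illustrate why the extra `cmp`/`bhi` pair carries no new information, here is a
minimal scalar sketch of the predicate that `whilelo` produces. It assumes the
usual description of the instruction (lane k is active while base + k is
unsigned-lower than the limit); the names `whilelo_model` and `any_active` and the
explicit `lanes` parameter are hypothetical and only stand in for the hardware
vector length.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model of WHILELO: lane k is active iff base + k < limit,
// evaluated without unsigned wraparound.
std::vector<bool> whilelo_model(uint64_t base, uint64_t limit, size_t lanes)
{
    std::vector<bool> pg(lanes);
    for (size_t k = 0; k < lanes; k++) {
        pg[k] = (base < limit) && (k < limit - base);
    }
    return pg;
}

// Models the "any lane active" condition that b.any branches on.
bool any_active(const std::vector<bool> &pg)
{
    for (bool lane : pg) {
        if (lane) {
            return true;
        }
    }
    return false;
}
```

In this model, with at least one lane, `any_active(whilelo_model(i, m, lanes))`
is true exactly when `i < m`, which is the same condition the separate
`cmp x1, x2` / `bhi` pair recomputes.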

* [Bug target/106340] flag set from SVE svwhilelt intrinsic not reused in loop
  2022-07-18 13:01 [Bug target/106340] New: flag set from SVE svwhilelt intrinsic not reused in loop yyc1992 at gmail dot com
@ 2022-07-18 13:13 ` yyc1992 at gmail dot com
  2022-07-20 14:35 ` yyc1992 at gmail dot com
  1 sibling, 0 replies; 3+ messages in thread
From: yyc1992 at gmail dot com @ 2022-07-18 13:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106340

--- Comment #1 from Yichao Yu <yyc1992 at gmail dot com> ---
Also note that this is for code I've tweaked to match the final code as
much as possible. For a complete implementation of this, I expect the loop
transformation done for normal loops to move the whilelt as well, so that
source code like the following would generate pretty much the same code.

```
void set3(uint32_t *__restrict__ out, size_t m)
{
    auto svelen = svcntw();
    auto v = svdup_u32(1);
    for (size_t i = 0; i < m; i += svelen) {
        auto pg = svwhilelt_b32(i, m);
        svst1(pg, &out[i], v);
    }
}
```

Currently, while the cmp was moved to the end of the loop body and the loop
header, the whilelt that is meant to be paired with it was not, so the flags
from the whilelt instruction aren't directly usable in the code as is.

* [Bug target/106340] flag set from SVE svwhilelt intrinsic not reused in loop
  2022-07-18 13:01 [Bug target/106340] New: flag set from SVE svwhilelt intrinsic not reused in loop yyc1992 at gmail dot com
  2022-07-18 13:13 ` [Bug target/106340] " yyc1992 at gmail dot com
@ 2022-07-20 14:35 ` yyc1992 at gmail dot com
  1 sibling, 0 replies; 3+ messages in thread
From: yyc1992 at gmail dot com @ 2022-07-20 14:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106340

Yichao Yu <yyc1992 at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |INVALID
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #2 from Yichao Yu <yyc1992 at gmail dot com> ---
Over at the LLVM bug report, it was pointed out to me that the standard pattern
is to branch based on the ptest intrinsics. That matches the flag setting of
the whilelt family of instructions better, and GCC is already able to omit the
ptest instruction in such cases.
