public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/114814] New: Reduction sum of comparison should be better
@ 2024-04-22 22:15 pinskia at gcc dot gnu.org
  2024-06-20  9:52 ` [Bug tree-optimization/114814] " ktkachov at gcc dot gnu.org
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-04-22 22:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114814

            Bug ID: 114814
           Summary: Reduction sum of comparison should be better
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pinskia at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

Take the example from PR 114809:
```
#include <cstdint>
#include <cstdlib>

size_t count_chars(const char *src, size_t len, char c) {
    size_t count = 0;
    for (size_t i=0; i < len; i++) {
        count += src[i] == c;
    }

    return count;
}
```

For aarch64 we produce currently for the inner loop:
```
.L4:
        ldr     q31, [x3], 16
        cmeq    v31.16b, v31.16b, v22.16b
        and     v31.16b, v23.16b, v31.16b
        zip1    v27.16b, v31.16b, v29.16b
        zip2    v31.16b, v31.16b, v29.16b
        zip1    v25.8h, v27.8h, v29.8h
        zip2    v27.8h, v27.8h, v29.8h
        zip1    v26.8h, v31.8h, v29.8h
        zip2    v31.8h, v31.8h, v29.8h
        zip2    v30.4s, v25.4s, v29.4s
        zip2    v28.4s, v27.4s, v29.4s
        uaddw   v30.2d, v30.2d, v25.2s
        uaddw   v28.2d, v28.2d, v27.2s
        uaddw   v30.2d, v30.2d, v26.2s
        uaddw2  v28.2d, v28.2d, v26.4s
        uaddw   v30.2d, v30.2d, v31.2s
        uaddw2  v31.2d, v28.2d, v31.4s
        add     v31.2d, v30.2d, v31.2d
        add     v24.2d, v24.2d, v31.2d
        cmp     x5, x3
        bne     .L4
```

But instead we should be able to just do:
```
.L4:
        ldr     q31, [x3], 16
        cmeq    v31.16b, v31.16b, v22.16b
        and     v31.16b, v23.16b, v31.16b
        addv    b31, v31.16b
        fmov    x0, d31
        add     x1, x1, x0
        cmp     x5, x3
        bne     .L4
```

Instead. That is do the reduction of the sum of the compare inside the loop
rather than outside.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [Bug tree-optimization/114814] Reduction sum of comparison should be better
  2024-04-22 22:15 [Bug tree-optimization/114814] New: Reduction sum of comparison should be better pinskia at gcc dot gnu.org
@ 2024-06-20  9:52 ` ktkachov at gcc dot gnu.org
  2024-06-20 11:36 ` rguenth at gcc dot gnu.org
  2024-06-21  4:41 ` fxue at os dot amperecomputing.com
  2 siblings, 0 replies; 4+ messages in thread
From: ktkachov at gcc dot gnu.org @ 2024-06-20  9:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114814

ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ktkachov at gcc dot gnu.org
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2024-06-20

--- Comment #1 from ktkachov at gcc dot gnu.org ---
Confirmed.
The SVE2 codegen isn't much better.
-O2 -mcpu=neoverse-v2 --param aarch64-autovec-preference=2

.L3:
        ld1b    z3.b, p5/z, [x0, x3]
        mov     p5.b, p13.b
        cmpeq   p14.b, p5/z, z30.b, z3.b
        mov     z26.b, p14/z, #1
        uunpklo z1.h, z26.b
        whilelo p5.b, x3, x4
        uunpklo z2.s, z1.h
        uunpkhi z26.h, z26.b
        uunpkhi z1.s, z1.h
        uunpklo z0.s, z26.h
        uunpklo z27.d, z2.s
        uunpklo z24.d, z1.s
        add     z28.d, p7/m, z28.d, z27.d
        uunpkhi z26.s, z26.h
        mov     p7.b, p15.b
        uunpklo z25.d, z0.s
        uunpklo z29.d, z26.s
        whilelo p15.d, x3, x11
        uunpkhi z2.d, z2.s
        uunpkhi z1.d, z1.s
        add     z28.d, p6/m, z28.d, z2.d
        uunpkhi z0.d, z0.s
        add     z28.d, p4/m, z28.d, z24.d
        whilelo p6.d, x3, x5
        add     z28.d, p3/m, z28.d, z1.d
        whilelo p4.d, x3, x6
        add     z28.d, p2/m, z28.d, z25.d
        whilelo p3.d, x3, x7
        add     z28.d, p1/m, z28.d, z0.d
        whilelo p2.d, x3, x8
        add     z28.d, p0/m, z28.d, z29.d
        whilelo p1.d, x3, x9
        whilelo p0.d, x3, x10
        mov     x1, x3
        uunpkhi z26.d, z26.s
        add     x3, x3, x12
        add     z28.d, p7/m, z28.d, z26.d
        whilelo p7.d, x1, x4
        b.any   .L3
        mov     p7.b, p13.b
        uaddv   d31, p7, z28.d
        fmov    x0, d31
        ret

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [Bug tree-optimization/114814] Reduction sum of comparison should be better
  2024-04-22 22:15 [Bug tree-optimization/114814] New: Reduction sum of comparison should be better pinskia at gcc dot gnu.org
  2024-06-20  9:52 ` [Bug tree-optimization/114814] " ktkachov at gcc dot gnu.org
@ 2024-06-20 11:36 ` rguenth at gcc dot gnu.org
  2024-06-21  4:41 ` fxue at os dot amperecomputing.com
  2 siblings, 0 replies; 4+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-06-20 11:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114814

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fxue at os dot amperecomputing.com

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
The issue is the high VF imposed on us and the required bool -> size_t
conversion.  What you get is of course massive parallelism.  What hurts
as well is the "linear" accumulation done of the vector IVs instead of
having multiple accumulators or accumulating them in a tree.

I think there's work to improve that part in progress.

Using a widen-sum for part of the accumulation might be another improvement,
currrently we fail here because QI -> DI widen sum isn't available but both
SI -> DI widen sum with earlier QI -> SI widening or QI -> HI widen sum
with later HI -> DI widening would be possible.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [Bug tree-optimization/114814] Reduction sum of comparison should be better
  2024-04-22 22:15 [Bug tree-optimization/114814] New: Reduction sum of comparison should be better pinskia at gcc dot gnu.org
  2024-06-20  9:52 ` [Bug tree-optimization/114814] " ktkachov at gcc dot gnu.org
  2024-06-20 11:36 ` rguenth at gcc dot gnu.org
@ 2024-06-21  4:41 ` fxue at os dot amperecomputing.com
  2 siblings, 0 replies; 4+ messages in thread
From: fxue at os dot amperecomputing.com @ 2024-06-21  4:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114814

--- Comment #3 from Feng Xue <fxue at os dot amperecomputing.com> ---
The pattern to match the code belongs to a generic dot-product category, we
could consider mapping it to native dot-product instruction with a constant "1"
operand.

        movi    v29.16b, 0x1
.L4:
        ldr     q31, [x1], 16
        cmeq    v31.16b, v28.16b, v31.16b
        and     v31.16b, v29.16b, v31.16b
        udot    v30.4s, v31.16b, v29.16b
        cmp     x5, x1
        bne     .L4
        addv    s31, v30.4s
        fmov    w1, s31

And if value accumulation does not require widening, as in this case, then
REDUC_PLUS finds its usage, which could be seen as a special instance of
dot-product instruction. But here is one point to note: we should think this
kind of REDUC_PLUS touches whole vector register, modifying the 1st element and
clearing the rest part. Anyway, it would become an addv.

For SVE, since element count is variant, element type may not hold accumulation
result, only dot-product could be used.

Moreover, it is possible to extend the means to handle conditional accumulation
as:

   for (i) {
     if (cond)
       sum += a;   // => sum += cond * a;
   }

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2024-06-21  4:41 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-22 22:15 [Bug tree-optimization/114814] New: Reduction sum of comparison should be better pinskia at gcc dot gnu.org
2024-06-20  9:52 ` [Bug tree-optimization/114814] " ktkachov at gcc dot gnu.org
2024-06-20 11:36 ` rguenth at gcc dot gnu.org
2024-06-21  4:41 ` fxue at os dot amperecomputing.com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).