public inbox for gcc-bugs@sourceware.org help / color / mirror / Atom feed
* [Bug tree-optimization/114814] New: Reduction sum of comparison should be better @ 2024-04-22 22:15 pinskia at gcc dot gnu.org 2024-06-20 9:52 ` [Bug tree-optimization/114814] " ktkachov at gcc dot gnu.org ` (2 more replies) 0 siblings, 3 replies; 4+ messages in thread From: pinskia at gcc dot gnu.org @ 2024-04-22 22:15 UTC (permalink / raw) To: gcc-bugs https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114814 Bug ID: 114814 Summary: Reduction sum of comparison should be better Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: enhancement Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: pinskia at gcc dot gnu.org Target Milestone: --- Target: aarch64 Take the example from PR 114809: ``` #include <cstdint> #include <cstdlib> size_t count_chars(const char *src, size_t len, char c) { size_t count = 0; for (size_t i=0; i < len; i++) { count += src[i] == c; } return count; } ``` For aarch64 we produce currently for the inner loop: ``` .L4: ldr q31, [x3], 16 cmeq v31.16b, v31.16b, v22.16b and v31.16b, v23.16b, v31.16b zip1 v27.16b, v31.16b, v29.16b zip2 v31.16b, v31.16b, v29.16b zip1 v25.8h, v27.8h, v29.8h zip2 v27.8h, v27.8h, v29.8h zip1 v26.8h, v31.8h, v29.8h zip2 v31.8h, v31.8h, v29.8h zip2 v30.4s, v25.4s, v29.4s zip2 v28.4s, v27.4s, v29.4s uaddw v30.2d, v30.2d, v25.2s uaddw v28.2d, v28.2d, v27.2s uaddw v30.2d, v30.2d, v26.2s uaddw2 v28.2d, v28.2d, v26.4s uaddw v30.2d, v30.2d, v31.2s uaddw2 v31.2d, v28.2d, v31.4s add v31.2d, v30.2d, v31.2d add v24.2d, v24.2d, v31.2d cmp x5, x3 bne .L4 ``` But instead we should be able to just do: ``` .L4: ldr q31, [x3], 16 cmeq v31.16b, v31.16b, v22.16b and v31.16b, v23.16b, v31.16b addv b31, v31.16b fmov x0, d31 add x1, x1, x0 cmp x5, x3 bne .L4 ``` Instead. That is do the reduction of the sum of the compare inside the loop rather than outside. ^ permalink raw reply [flat|nested] 4+ messages in thread
* [Bug tree-optimization/114814] Reduction sum of comparison should be better 2024-04-22 22:15 [Bug tree-optimization/114814] New: Reduction sum of comparison should be better pinskia at gcc dot gnu.org @ 2024-06-20 9:52 ` ktkachov at gcc dot gnu.org 2024-06-20 11:36 ` rguenth at gcc dot gnu.org 2024-06-21 4:41 ` fxue at os dot amperecomputing.com 2 siblings, 0 replies; 4+ messages in thread From: ktkachov at gcc dot gnu.org @ 2024-06-20 9:52 UTC (permalink / raw) To: gcc-bugs https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114814 ktkachov at gcc dot gnu.org changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |ktkachov at gcc dot gnu.org Ever confirmed|0 |1 Status|UNCONFIRMED |NEW Last reconfirmed| |2024-06-20 --- Comment #1 from ktkachov at gcc dot gnu.org --- Confirmed. The SVE2 codegen isn't much better. -O2 -mcpu=neoverse-v2 --param aarch64-autovec-preference=2 .L3: ld1b z3.b, p5/z, [x0, x3] mov p5.b, p13.b cmpeq p14.b, p5/z, z30.b, z3.b mov z26.b, p14/z, #1 uunpklo z1.h, z26.b whilelo p5.b, x3, x4 uunpklo z2.s, z1.h uunpkhi z26.h, z26.b uunpkhi z1.s, z1.h uunpklo z0.s, z26.h uunpklo z27.d, z2.s uunpklo z24.d, z1.s add z28.d, p7/m, z28.d, z27.d uunpkhi z26.s, z26.h mov p7.b, p15.b uunpklo z25.d, z0.s uunpklo z29.d, z26.s whilelo p15.d, x3, x11 uunpkhi z2.d, z2.s uunpkhi z1.d, z1.s add z28.d, p6/m, z28.d, z2.d uunpkhi z0.d, z0.s add z28.d, p4/m, z28.d, z24.d whilelo p6.d, x3, x5 add z28.d, p3/m, z28.d, z1.d whilelo p4.d, x3, x6 add z28.d, p2/m, z28.d, z25.d whilelo p3.d, x3, x7 add z28.d, p1/m, z28.d, z0.d whilelo p2.d, x3, x8 add z28.d, p0/m, z28.d, z29.d whilelo p1.d, x3, x9 whilelo p0.d, x3, x10 mov x1, x3 uunpkhi z26.d, z26.s add x3, x3, x12 add z28.d, p7/m, z28.d, z26.d whilelo p7.d, x1, x4 b.any .L3 mov p7.b, p13.b uaddv d31, p7, z28.d fmov x0, d31 ret ^ permalink raw reply [flat|nested] 4+ messages in thread
* [Bug tree-optimization/114814] Reduction sum of comparison should be better 2024-04-22 22:15 [Bug tree-optimization/114814] New: Reduction sum of comparison should be better pinskia at gcc dot gnu.org 2024-06-20 9:52 ` [Bug tree-optimization/114814] " ktkachov at gcc dot gnu.org @ 2024-06-20 11:36 ` rguenth at gcc dot gnu.org 2024-06-21 4:41 ` fxue at os dot amperecomputing.com 2 siblings, 0 replies; 4+ messages in thread From: rguenth at gcc dot gnu.org @ 2024-06-20 11:36 UTC (permalink / raw) To: gcc-bugs https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114814 Richard Biener <rguenth at gcc dot gnu.org> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |fxue at os dot amperecomputing.com --- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> --- The issue is the high VF imposed on us and the required bool -> size_t conversion. What you get is of course massive parallelism. What hurts as well is the "linear" accumulation done of the vector IVs instead of having multiple accumulators or accumulating them in a tree. I think there's work to improve that part in progress. Using a widen-sum for part of the accumulation might be another improvement, currrently we fail here because QI -> DI widen sum isn't available but both SI -> DI widen sum with earlier QI -> SI widening or QI -> HI widen sum with later HI -> DI widening would be possible. ^ permalink raw reply [flat|nested] 4+ messages in thread
* [Bug tree-optimization/114814] Reduction sum of comparison should be better 2024-04-22 22:15 [Bug tree-optimization/114814] New: Reduction sum of comparison should be better pinskia at gcc dot gnu.org 2024-06-20 9:52 ` [Bug tree-optimization/114814] " ktkachov at gcc dot gnu.org 2024-06-20 11:36 ` rguenth at gcc dot gnu.org @ 2024-06-21 4:41 ` fxue at os dot amperecomputing.com 2 siblings, 0 replies; 4+ messages in thread From: fxue at os dot amperecomputing.com @ 2024-06-21 4:41 UTC (permalink / raw) To: gcc-bugs https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114814 --- Comment #3 from Feng Xue <fxue at os dot amperecomputing.com> --- The pattern to match the code belongs to a generic dot-product category, we could consider mapping it to native dot-product instruction with a constant "1" operand. movi v29.16b, 0x1 .L4: ldr q31, [x1], 16 cmeq v31.16b, v28.16b, v31.16b and v31.16b, v29.16b, v31.16b udot v30.4s, v31.16b, v29.16b cmp x5, x1 bne .L4 addv s31, v30.4s fmov w1, s31 And if value accumulation does not require widening, as in this case, then REDUC_PLUS finds its usage, which could be seen as a special instance of dot-product instruction. But here is one point to note: we should think this kind of REDUC_PLUS touches whole vector register, modifying the 1st element and clearing the rest part. Anyway, it would become an addv. For SVE, since element count is variant, element type may not hold accumulation result, only dot-product could be used. Moreover, it is possible to extend the means to handle conditional accumulation as: for (i) { if (cond) sum += a; // => sum += cond * a; } ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2024-06-21 4:41 UTC | newest] Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2024-04-22 22:15 [Bug tree-optimization/114814] New: Reduction sum of comparison should be better pinskia at gcc dot gnu.org 2024-06-20 9:52 ` [Bug tree-optimization/114814] " ktkachov at gcc dot gnu.org 2024-06-20 11:36 ` rguenth at gcc dot gnu.org 2024-06-21 4:41 ` fxue at os dot amperecomputing.com
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).