[Bug rtl-optimization/113903] New: sched1 should schedule across EBBS

public inbox for gcc-bugs@sourceware.org

* [Bug rtl-optimization/113903] New: sched1 should schedule across EBBS
@ 2024-02-13 10:40 tnfchris at gcc dot gnu.org
  2024-02-13 11:26 ` [Bug rtl-optimization/113903] " amonakov at gcc dot gnu.org
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-02-13 10:40 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113903

            Bug ID: 113903
           Summary: sched1 should schedule across EBBS
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

The following testcase:

#define N 306
#define NEEDLE 136
int table[N];

int foo (int i, unsigned short parse_tables_n)
{
  parse_tables_n >>= 9;
  parse_tables_n += 11;
  while (i < N && parse_tables_n--)
    table[i++] = 0;
  return table[NEEDLE];
}

compiled at -O3 shows an issue we've started getting with the support for early
break vectorization.
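
For readers who want to see the two exit conditions in action, here is a minimal sketch that is not part of the original report. It repeats the testcase unchanged and adds two hypothetical helpers, `expected_stores` and `observed_stores`, that count how many elements the loop zeroes. It assumes `0 <= i`, since a negative `i` would index outside `table`.

```c
#define N 306
#define NEEDLE 136
int table[N];

/* The testcase from the report, unchanged.  */
int foo (int i, unsigned short parse_tables_n)
{
  parse_tables_n >>= 9;
  parse_tables_n += 11;
  while (i < N && parse_tables_n--)
    table[i++] = 0;
  return table[NEEDLE];
}

/* Hypothetical helper: the number of stores foo should perform for
   0 <= i.  The counter exit allows (n >> 9) + 11 iterations, the
   bounds exit allows N - i, and the loop stops at whichever is hit
   first.  */
int expected_stores (int i, unsigned short parse_tables_n)
{
  int count = (parse_tables_n >> 9) + 11;
  int room = i < N ? N - i : 0;
  return count < room ? count : room;
}

/* Hypothetical helper: fill the table with a marker value, run foo,
   and count how many elements were zeroed.  */
int observed_stores (int i, unsigned short parse_tables_n)
{
  int zeroed = 0;
  for (int k = 0; k < N; k++)
    table[k] = 1;
  foo (i, parse_tables_n);
  for (int k = 0; k < N; k++)
    zeroed += table[k] == 0;
  return zeroed;
}
```

With `i = 0` the counter exit fires first. With `i = 300` and the largest counter value the bounds exit fires first. Neither exit can be dropped, which is why the vectorized loop ends up with a separate early-break test.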

sched1 doesn't seem to be able to schedule across EBBs, which is logical since
we never really needed to before.

However, the above code generates:

.L10:
        st1w    z28.s, p7, [x1, #1, mul vl]
        st1w    z28.s, p7, [x1]
        add     x1, x1, x5
        cmp     w0, w2
        bcc     .L17
.L8:
        cmpne   p15.h, p7/z, z31.h, #0
        mov     z29.d, z31.d
        not     p15.b, p14/z, p15.b
        mov     z27.d, z30.d
        add     w2, w2, w4
        dech    z31.h
        ptest   p14, p15.b
        incw    z30.s, all, mul #2
        b.none  .L10
        umov    w1, v29.h[0]
        umov    w20, v27.s[0]
        and     w3, w1, 65535
        b       .L6

AArch64 codegen inefficiencies aside (which I will tackle myself), this shows
that we're copying the old values of the induction variables in every loop
iteration to keep them for the reductions if we exit.

However, the new values are not live in L8, so the operations can be moved to
L10:

.L10:
        incw    z30.s, all, mul #2
        dech    z31.h
        st1w    z28.s, p7, [x1, #1, mul vl]
        st1w    z28.s, p7, [x1]
        add     x1, x1, x5
        cmp     w0, w2
        bcc     .L17
.L8:
        cmpne   p15.h, p7/z, z31.h, #0
        not     p15.b, p14/z, p15.b
        add     w2, w2, w4
        ptest   p14, p15.b
        b.none  .L10
        umov    w1, v31.h[0]
        umov    w20, v30.s[0]
        and     w3, w1, 65535
        b       .L6

thus decreasing the live ranges. The optimal codegen for this sequence is:

.L10:
        dech    z31.h
        incw    z30.s, all, mul #2
        st1w    z28.s, p7, [x1, #1, mul vl]
        st1w    z28.s, p7, [x1]
        add     x1, x1, x5
        cmp     w0, w2
        bcc     .L17
.L8:
        cmpeq   p15.h, p7/z, z31.h, #0
        add     w2, w2, w4
        b.none  .L10
        umov    w1, v31.h[0]
        umov    w20, v30.s[0]
        b       .L6


* [Bug rtl-optimization/113903] sched1 should schedule across EBBS
  2024-02-13 10:40 [Bug rtl-optimization/113903] New: sched1 should schedule across EBBS tnfchris at gcc dot gnu.org
@ 2024-02-13 11:26 ` amonakov at gcc dot gnu.org
  2024-02-13 13:45 ` tnfchris at gcc dot gnu.org
  2024-02-13 19:09 ` pinskia at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: amonakov at gcc dot gnu.org @ 2024-02-13 11:26 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113903

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Lifting those insns from the L8 BB to the L10 BB requires duplicating them on
all incoming edges targeting L8, doesn't it?

Why is decreasing live ranges important here?


* [Bug rtl-optimization/113903] sched1 should schedule across EBBS
  2024-02-13 10:40 [Bug rtl-optimization/113903] New: sched1 should schedule across EBBS tnfchris at gcc dot gnu.org
  2024-02-13 11:26 ` [Bug rtl-optimization/113903] " amonakov at gcc dot gnu.org
@ 2024-02-13 13:45 ` tnfchris at gcc dot gnu.org
  2024-02-13 19:09 ` pinskia at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-02-13 13:45 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113903

--- Comment #2 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #1)
> Lifting those insns from the L8 BB to the L10 BB requires duplicating them
> on all incoming edges targeting L8, doesn't it?
> 

No, because they're unused before L10. If they were used, they couldn't be
moved. (Note that L10 is only reachable from L8, as it's a branch in the loop.)

> Why is decreasing live ranges important here?

Two reasons. First, we have to avoid prematurely creating the copies.

The loop has multiple exits, and the values are not relevant for all exits.

        mov     z29.d, z31.d
        mov     z27.d, z30.d

is being done because we increment the inductions in the same basic block. But
the incremented value is not needed in L8.

For loop induction variables, I suppose we can change the materialization point
in the vectorizer to deal with them that way, but that only takes care of
inductions. Ideally we shouldn't perform an operation before an exit if it's
not needed for that exit.

At the moment the vectorizer only deals with moving statements that are needed
for correctness.
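
The copy problem described above can be illustrated with a scalar analogy. This is only a sketch under stated assumptions, not what GCC emits: `find_first_copy` and `find_first_sunk` are hypothetical names, the `int` counter stands in for the vector induction variables, and both functions assume `n >= 1`.

```c
/* Increment placed in the header block, before the early exit: the
   pre-increment value must be copied on every iteration so that the
   exit path can still use it (the analogue of the two `mov`
   instructions).  */
int find_first_copy (const int *a, int n)
{
  int iv = 0;
  for (;;)
    {
      int old = iv;          /* copy made on every iteration */
      int hit = a[iv] != 0;  /* early-exit test uses the old value */
      iv += 1;               /* increment not needed by the exit */
      if (hit)
        return old;          /* early exit consumes the copy */
      if (iv >= n)           /* latch: normal loop exit */
        return -1;
    }
}

/* Increment sunk into the latch block: the early exit reads the
   induction variable directly, so no copy is needed and the live
   range of the old value ends at the exit test.  */
int find_first_sunk (const int *a, int n)
{
  int iv = 0;
  for (;;)
    {
      if (a[iv] != 0)
        return iv;           /* early exit, no copy required */
      iv += 1;               /* increment now lives in the latch */
      if (iv >= n)
        return -1;
    }
}
```

Both functions return the same result for every input. The second simply avoids doing work before the early exit that the exit does not need, which is the transformation being asked of sched1 here.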


* [Bug rtl-optimization/113903] sched1 should schedule across EBBS
  2024-02-13 10:40 [Bug rtl-optimization/113903] New: sched1 should schedule across EBBS tnfchris at gcc dot gnu.org
  2024-02-13 11:26 ` [Bug rtl-optimization/113903] " amonakov at gcc dot gnu.org
  2024-02-13 13:45 ` tnfchris at gcc dot gnu.org
@ 2024-02-13 19:09 ` pinskia at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-02-13 19:09 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113903

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
                 CC|                            |pinskia at gcc dot gnu.org

