[Bug rtl-optimization/99462] New: Enhance scheduling to split instructions

* [Bug rtl-optimization/99462] New: Enhance scheduling to split instructions
@ 2021-03-08 10:49 rguenth at gcc dot gnu.org
  2021-03-08 10:50 ` [Bug rtl-optimization/99462] " rguenth at gcc dot gnu.org
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-08 10:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99462

            Bug ID: 99462
           Summary: Enhance scheduling to split instructions
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

Maybe the scheduler(s) can already do this (I have zero knowledge here).  For
example, the x86 vec_concatv2di insn has alternatives that cause the instruction
to be split into multiple uops (vpinsrq, movhpd) when the 'insert' operand
is not an XMM register (but a GPR or MEM).  We now have a peephole2 to split
such cases:

+;; Further split pinsrq variants of vec_concatv2di to hide the latency of
+;; the GPR->XMM transition(s).
+(define_peephole2
+  [(match_scratch:DI 3 "Yv")
+   (set (match_operand:V2DI 0 "sse_reg_operand")
+       (vec_concat:V2DI (match_operand:DI 1 "sse_reg_operand")
+                        (match_operand:DI 2 "nonimmediate_gr_operand")))]
+  "TARGET_64BIT && TARGET_SSE4_1
+   && !optimize_insn_for_size_p ()"
+  [(set (match_dup 3)
+        (match_dup 2))
+   (set (match_dup 0)
+       (vec_concat:V2DI (match_dup 1)
+                        (match_dup 3)))])

but in reality this is only profitable when we can either execute
two "bad" move uops in parallel (that is, when originally composing
two GPRs or two MEMs) or schedule one "bad" move much earlier.
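
As a rough illustration of the first case (composing two GPRs), here is a
toy latency model in Python. The latencies MOVE_LAT and INSERT_LAT and the
function names are made up for the example; they are not measured values or
GCC internals. The model also treats each instruction as one unit, which
real out-of-order hardware does not necessarily do.

```python
# Toy model, assuming hypothetical latencies (not measured on any CPU).
MOVE_LAT = 3    # assumed GPR->XMM transfer latency, in cycles
INSERT_LAT = 1  # assumed latency of the final concat/insert, in cycles


def combined_latency():
    """movq gpr0 -> xmm, then a pinsrq that inserts gpr1 into that register.

    Modelled as a unit, the pinsrq carries its own GPR->XMM transfer and
    waits for the movq result, so the two "bad" moves are serialized.
    """
    movq_done = MOVE_LAT
    return movq_done + MOVE_LAT + INSERT_LAT


def split_latency():
    """Two independent movq into separate XMM registers, then a reg-reg concat.

    The moves do not depend on each other, so they can overlap.
    """
    moves_done = max(MOVE_LAT, MOVE_LAT)
    return moves_done + INSERT_LAT


if __name__ == "__main__":
    print("combined:", combined_latency())
    print("split:   ", split_latency())
```

With these assumed numbers the split form finishes earlier only because the
two transfers overlap; with a single "bad" move there is nothing to overlap
unless that move can be hoisted.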

Thus, can the scheduler already "split" an instruction - say split
away a load uop and issue it early when a scratch register is available?

(the reverse alternative is to not expose multi-uop insns before scheduling
and only merge them later - during scheduling?)

How does GCC deal with situations like this?

* [Bug rtl-optimization/99462] Enhance scheduling to split instructions
  2021-03-08 10:49 [Bug rtl-optimization/99462] New: Enhance scheduling to split instructions rguenth at gcc dot gnu.org
@ 2021-03-08 10:50 ` rguenth at gcc dot gnu.org
  2021-03-08 14:15 ` jakub at gcc dot gnu.org
  2021-03-08 16:13 ` amonakov at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-08 10:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99462

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
                 CC|                            |amonakov at gcc dot gnu.org,
                   |                            |law at gcc dot gnu.org

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
CCing scheduler maintainers

* [Bug rtl-optimization/99462] Enhance scheduling to split instructions
  2021-03-08 10:49 [Bug rtl-optimization/99462] New: Enhance scheduling to split instructions rguenth at gcc dot gnu.org
  2021-03-08 10:50 ` [Bug rtl-optimization/99462] " rguenth at gcc dot gnu.org
@ 2021-03-08 14:15 ` jakub at gcc dot gnu.org
  2021-03-08 16:13 ` amonakov at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-03-08 14:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99462

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
I think only sel-sched splits any instructions (and it recently caused some
bugs).

* [Bug rtl-optimization/99462] Enhance scheduling to split instructions
  2021-03-08 10:49 [Bug rtl-optimization/99462] New: Enhance scheduling to split instructions rguenth at gcc dot gnu.org
  2021-03-08 10:50 ` [Bug rtl-optimization/99462] " rguenth at gcc dot gnu.org
  2021-03-08 14:15 ` jakub at gcc dot gnu.org
@ 2021-03-08 16:13 ` amonakov at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: amonakov at gcc dot gnu.org @ 2021-03-08 16:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99462

--- Comment #3 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(for context, the above patch was for PR 98856, but it's based on an incorrect
latency analysis; see bug 98856 comment #38)

Right now schedulers cannot easily split instructions for that purpose; it
would require computing the dependency graph more accurately. Currently,
dependencies and priorities are computed with respect to instructions as a
whole, whereas intelligent splitting would require tracking latencies with
respect to individual inputs.
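
To make the difference concrete, here is a small Python sketch comparing the
two ways of estimating when a result becomes available. The operand names,
ready times and latencies are illustrative assumptions, and the functions
are hypothetical; nothing here mirrors the scheduler's actual data
structures.

```python
# Sketch: when is a result ready under two dependency models?
# Operand names and latencies below are illustrative assumptions.

def ready_whole_insn(input_ready, latency):
    """Instruction-level model: issue once all inputs are ready."""
    return max(input_ready.values()) + latency


def ready_per_input(input_ready, input_latency):
    """Per-input model: each operand has its own latency to the result."""
    return max(input_ready[name] + input_latency[name] for name in input_ready)


if __name__ == "__main__":
    # The xmm operand arrives late (cycle 5); the gpr operand is ready at 0.
    input_ready = {"xmm": 5, "gpr": 0}
    # Assumed: the gpr operand needs a 3-cycle transfer plus a 1-cycle
    # insert, while the xmm operand only needs the 1-cycle insert.
    input_latency = {"xmm": 1, "gpr": 4}
    print(ready_whole_insn(input_ready, latency=4))
    print(ready_per_input(input_ready, input_latency))
```

The whole-instruction estimate charges the full latency after the last
operand arrives, while the per-input estimate lets the early operand's
transfer overlap with the wait for the late one.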

sel-sched does not split, but it can perform "renaming", which basically
overcomes anti-dependencies by scheduling the desired instruction before the
conflicting write (by choosing a different output register) and emitting a
reg-reg move later.
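
One possible reading of that transformation, sketched in Python on a toy
instruction list. The tuple representation (dest, srcs, text), the function
name and the register names are assumptions for illustration only and are
unrelated to RTL or to the sel-sched implementation.

```python
# Sketch of renaming on a toy instruction list. An instruction is a tuple
# (dest, srcs, text); this representation is an assumption for illustration.

def rename_and_hoist(insns, index, scratch):
    """Hoist insns[index] above the previous insn, which reads its dest.

    The hoisted copy writes `scratch` instead, and a reg-reg move placed at
    the original position copies `scratch` into the original destination.
    """
    if index < 1:
        raise ValueError("nothing to hoist above")
    dest, srcs, text = insns[index]
    prev_dest, prev_srcs, _ = insns[index - 1]
    # Only handle the anti-dependence case: the previous insn reads our dest.
    if dest not in prev_srcs:
        raise ValueError("no anti-dependence to overcome")
    # The hoisted insn must not consume the previous insn's result.
    if prev_dest in srcs:
        raise ValueError("true dependence prevents hoisting")
    # The scratch register must not be touched by the previous insn.
    if scratch == prev_dest or scratch in prev_srcs:
        raise ValueError("scratch register is not free")
    hoisted = (scratch, srcs, text)
    move = (dest, (scratch,), "mov")
    return insns[:index - 1] + [hoisted, insns[index - 1], move] + insns[index + 1:]


if __name__ == "__main__":
    before = [("r2", ("r1",), "add"), ("r1", ("mem",), "load")]
    for insn in rename_and_hoist(before, 1, "r9"):
        print(insn)
```

The load now issues ahead of the add that still needs the old r1, at the
cost of one extra reg-reg move.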

I think on modern x86 the profitability of such splitting is quite dubious,
because it would increase the number of instructions and uops flowing through
the CPU front-end and entering the renamer (which is one of the narrowest
points in the pipeline). This holds especially on AMD, where not only load-op
but also load-op-store instructions are renamed as a single uop (which is then
sent to two or three execution units).
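
A back-of-the-envelope sketch of the front-end cost, in Python. RENAME_WIDTH
and the per-instruction slot counts are hypothetical values picked for the
arithmetic, not figures for any specific CPU.

```python
import math

# Assumed front-end parameter, chosen only for illustration.
RENAME_WIDTH = 4  # hypothetical uops accepted by the renamer per cycle


def rename_cycles(rename_slots_per_insn):
    """Minimum cycles for the renamer to accept the given instructions."""
    return math.ceil(sum(rename_slots_per_insn) / RENAME_WIDTH)


if __name__ == "__main__":
    n = 8  # eight load-op instructions
    combined = [1] * n   # assumed: each load-op takes one rename slot
    split = [1, 1] * n   # a separate load plus a reg-reg op for each
    print(rename_cycles(combined), rename_cycles(split))
```

Under these assumptions splitting doubles the renamer occupancy for the same
work, so it only pays off if the shortened critical path outweighs that.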

I think in common cases where the overall critical path is unchanged (as in the
given examples of pinsrq and various load-op instructions) GCC should simply
continue emitting the combined form.
