I see, that clears up my understanding. I have no concerns now.

Thx,
Haochen

From: H.J. Lu <hjl.tools@gmail.com>
Sent: Wednesday, June 19, 2024 4:38 AM
To: Jiang, Haochen <haochen.jiang@intel.com>
Cc: Beulich, Jan <JBeulich@suse.com>; Cui, Lili <lili.cui@intel.com>; Binutils <binutils@sourceware.org>
Subject: Re: [PATCH 6/6] x86: optimize {,V}PEXTR{D,Q} with immediate of 0

-Os should optimize for code size. Other optimizations should
take performance into account.

On Tue, Jun 18, 2024, 2:23 PM Jiang, Haochen <haochen.jiang@intel.com> wrote:
> >> Wait. While the compiler may use PSRLDQ here, based on knowing assumptions
> >> made elsewhere, the assembler can't: The replacement insn must generate the
> >> exact same result in the destination register. PSRLDQ with an immediate of
> >> 0 (which effectively you're suggesting to use here) doesn't alter the
> >> destination register at all, though. When really we want the upper bits of
> >> the register cleared.
> >
> > pextrd/q don't clear them either. vpextrd/q and vpsrldq both clear the
> > higher bits, so they will be the same.
>
> Wait - your suggestion is even more confusing: The destination of PSRLDQ is
> an XMM register, whereas the destination of PEXTR* is a GPR or memory. This
> is properly expressed in the constraints in the compiler, but clearly we
> can't replace insns like this in the assembler.

Yes, I realize I was wrong here; there are no constraints in the assembler.
vmovd/q would definitely be better and doable here if we want to do something.
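
To make the point concrete, here is a minimal Python sketch of the semantics discussed above. It is only a model for illustration, not binutils code: the function names are made up, an XMM register is represented as a 128-bit integer, and the GPR result is the returned value. Under those assumptions, an extract with an immediate of 0 yields the same value as a plain move of the low element, while PSRLDQ with an immediate of 0 just leaves the XMM register unchanged and never produces a GPR result at all.

```python
# Illustrative model only: an XMM register is a 128-bit Python int.
# All names below are hypothetical helpers, not assembler internals.
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1


def pextrd(xmm: int, imm8: int) -> int:
    """Model of PEXTRD to a GPR: select dword imm8[1:0], zero-extended."""
    return (xmm >> (32 * (imm8 & 3))) & MASK32


def pextrq(xmm: int, imm8: int) -> int:
    """Model of PEXTRQ to a GPR: select qword imm8[0]."""
    return (xmm >> (64 * (imm8 & 1))) & MASK64


def movd(xmm: int) -> int:
    """Model of MOVD xmm -> r32: low dword, zero-extended."""
    return xmm & MASK32


def movq(xmm: int) -> int:
    """Model of MOVQ xmm -> r64: low qword."""
    return xmm & MASK64


def psrldq(xmm: int, imm8: int) -> int:
    """Model of PSRLDQ: byte-wise right shift; the result stays in an XMM."""
    nbytes = min(imm8, 16)  # shift counts above 15 zero the register
    return (xmm >> (8 * nbytes)) & MASK128


x = 0x0123456789ABCDEF_FEDCBA9876543210

# Extract with immediate 0 matches a move of the low element.
same_dword = pextrd(x, 0) == movd(x)
same_qword = pextrq(x, 0) == movq(x)

# PSRLDQ with immediate 0 is a no-op on the XMM register.
psrldq_is_noop = psrldq(x, 0) == x
```

The model ignores the memory-destination forms of PEXTR{D,Q} and the VEX upper-bit zeroing of the destination XMM, since neither affects the GPR result being compared here.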

> >>> Also, I suppose optimizations related to latency should not be done in
> >>> the assembler.
> >>
> >> Why? We have -O, -O1, and -O2 alongside -Os for a reason.
> >
> > I am quite conservative about optimization in the assembler. If we are also
> > going to optimize hand-written code, the optimization could work.
> >
> > However, when people hand-write some code, are we supposed to change it?
>
> Well, if we aren't to, people simply don't pass -O.
>
> > For -Os, we could give them all the optimizations we have, but for -O, I am not
> > that sure.
> >
> > And I suppose we might put too much burden on the assembler if we are going
> > to add too many optimizations related to latency. It will become another
> > compiler. Are we supposed to copy all the optimizations from the compiler?
>
> Probably not all (and many aren't at the insn level anyway, nor do we - so
> far at least - optimize for latency/throughput at the expense of code size).
> But yes - this specific aspect is why I keep raising questions on what
> optimizations are worth it vs where we'd better leave code alone.

H.J., what is your opinion on that?

Thx,
Haochen

>
> Jan
>
> > IMO, optimization for
> > code size is OK, but for latency, I am a little concerned.
> >
> > Thx,
> > Haochen
> >
> >>
> >> Jan