* x86: further optimization opportunities
@ 2022-08-26 12:12 Jan Beulich
From: Jan Beulich @ 2022-08-26 12:12 UTC (permalink / raw)
To: H.J. Lu; +Cc: Binutils
H.J.,
over time I've accumulated a list of possible transformations we could
do in addition to what we do already. Some are a little exotic, so may
not be worth it. Hence I'd like to ask for your view on things, if you
don't mind.
1) {,X}OR r<N>,0 and AND/TEST r<N>,~0 --> TEST r<N>,r<N>
Except for 32-bit forms in 64-bit mode, where OR/XOR/AND zero-extend
the destination but TEST leaves it untouched. Note that ADD/CMP/SUB
can't be replaced this way, because TEST leaves AF undefined. But perhaps
IMUL r<N>,1 can be, unless we fear people depend on a particular
implementation's setting of PF, SF, and ZF.
2) AND r<N>,0 and perhaps IMUL r<N>,r<M>,0 --> XOR r<N>,r<N>
3) {,V}PCMPEQQ --> e.g. {,V}PCMPEQD
{,V}PCMPGTQ --> {,V}PXOR.
when both source operands match, as the replacement encoding is 1 byte shorter.
Some of the respective AVX512 forms can be transformed into KX{,N}OR*.
4) P{AND{,N},{,X}OR} and {AND{,N},{,X}OR}PD --> {AND{,N},{,X}OR}PS
MOVDQ{A,U} and MOV{A,U}PD --> MOV{A,U}PS
to save the prefix byte. Perhaps only with -Os.
5) PSHUFD --> SHUFPS
with identical register operands, and again perhaps only with -Os.
6) VPCMP{,U}{B,W,D,Q} and VPCOM{,U}{B,W,D,Q} --> VPCMP{EQ,GT}{B,W,D,Q}
where suitable, saving the immediate byte and in the latter case
also possibly allowing for the shorter VEX2 encoding.
7) VPSUB{,U}S{B,W,D,Q} --> VPXOR
VPCMPGT{B,W,D,Q} (pre-AVX512) --> VPXOR
when both source operands are identical.
8) VFMADD{P,S}{S,D} et al --> VFMADD{132,231,213}{P,S}{S,D}
when one operand is suitably repeated. (This requires CpuFMA to be
explicitly enabled, as that's not a prereq to CpuFMA4.)
9) MOVZX
with 64-bit destination to drop the REX64 prefix.
10) RET/RETF/LRET
with immediate of zero to immediate-less form.
11) 32-bit TEST
with {8..15}-bit immediate in 16-bit mode.
12) MOVABS
displacement optimization with -Os, using 32-bit addressing mode as
applicable.
13) BT{,C,R,S}
with in-range immediate to operand-size-prefix-less forms. For memory
operands only by reducing nominal operand size (for register operands
going from 16- to 32-bit operand size is okay) and with an adjustment
to the displacement as necessary (perhaps leaving alone ones with LOCK
prefix).
14) BT{,C,R,S}
with memory operand and out-of-range immediate, transforming the upper
immediate bits into an adjustment to the displacement. Accompanied by
a warning, as the upper bits would no longer end up being ignored. The
SDM in fact suggests this as a model assemblers might follow.
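
To illustrate the arithmetic behind 14), here is a minimal Python sketch. It is not gas code: the helper name fold_bt_immediate and its parameters are made up for this example. It assumes a non-negative immediate and an operand size of 16, 32, or 64 bits, and splits the immediate into a bit index within one operand-sized unit plus a byte adjustment to the displacement.

```python
def fold_bt_immediate(disp, imm, opsize_bits):
    """Split an out-of-range BT immediate (hypothetical helper).

    Returns (new_disp, new_imm): new_imm is the bit index within one
    operand-sized unit, new_disp is the displacement advanced by the
    number of whole units the original immediate skipped over.
    """
    if opsize_bits not in (16, 32, 64):
        raise ValueError("operand size must be 16, 32, or 64 bits")
    if imm < 0:
        raise ValueError("immediate must be non-negative")

    unit_bytes = opsize_bits // 8
    # Upper immediate bits select the unit, lower bits select the bit in it.
    units, bit = divmod(imm, opsize_bits)
    return disp + units * unit_bytes, bit


if __name__ == "__main__":
    # 32-bit operand, immediate 35, displacement 0
    print(fold_bt_immediate(0, 35, 32))
    # In-range immediates are left alone
    print(fold_bt_immediate(0, 5, 32))
```

With a 32-bit operand, immediate 35 and displacement 0 become immediate 3 and displacement 4. The same split, applied with the reduced operand size, would also seem to give the displacement adjustment mentioned in 13).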
Note that examples of 4) and 5) can actually be found in Linux's crypto
code.