Hi, On 2021/6/9 07:25, Segher Boessenkool wrote: > On Mon, May 24, 2021 at 04:02:13AM -0500, Xionghu Luo wrote: >> vmrghb only accepts permute index {0, 16, 1, 17, 2, 18, 3, 19, 4, 20, >> 5, 21, 6, 22, 7, 23} no matter for BE or LE in ISA, similarly for vmrghlb. > > (vmrglb) > >> + if (BYTES_BIG_ENDIAN) >> + emit_insn ( >> + gen_altivec_vmrghb_direct (operands[0], operands[1], operands[2])); >> + else >> + emit_insn ( >> + gen_altivec_vmrglb_direct (operands[0], operands[2], operands[1])); > > Please don't indent like that, it doesn't match what we do elsewhere. > For better or for worse (for worse imo), we use deep hanging indents. > If you have to, you can do something like > > rtx insn; > if (BYTES_BIG_ENDIAN) > insn = gen_altivec_vmrghb_direct (operands[0], operands[1], operands[2]); > else > insn = gen_altivec_vmrglb_direct (operands[0], operands[2], operands[1]); > emit_insn (insn); > > (this is better even, in that it has only one emit_insn), or even > > rtx (*fun) () = BYTES_BIG_ENDIAN ? gen_altivec_vmrghb_direct > : gen_altivec_vmrglb_direct; > if (!BYTES_BIG_ENDIAN) > std::swap (operands[1], operands[2]); > emit_insn (fun (operands[0], operands[1], operands[2])); > > Well, C++ does not allow that last example like that, sigh, so > rtx (*fun) (rtx, rtx, rtx) = BYTES_BIG_ENDIAN ? gen_altivec_vmrghb_direct > : gen_altivec_vmrglb_direct; > > This is shorter than the other two options ;-) Changed. > >> +(define_insn "altivec_vmrghb_direct" >> [(set (match_operand:V16QI 0 "register_operand" "=v") >> + (vec_select:V16QI > > This should be indented one space more. > >> "TARGET_ALTIVEC" >> "@ >> - xxmrghw %x0,%x1,%x2 >> - vmrghw %0,%1,%2" >> + xxmrghw %x0,%x1,%x2 >> + vmrghw %0,%1,%2" > > The original indent was correct, please restore. > >> - emit_insn (gen_altivec_vmrghw_direct (operands[0], ve, vo)); >> + emit_insn (gen_altivec_vmrghw_direct_v4si (operands[0], ve, vo)); > > When you see a mode as part of a pattern name, chances are that it will > be a good candidate for using parameterized names with. (But don't do > that now, just keep it in mind as a nice cleanup to do). OK. > >> @@ -23022,8 +23022,8 @@ altivec_expand_vec_perm_const (rtx target, rtx op0, rtx op1, >> : CODE_FOR_altivec_vmrglh_direct), >> { 0, 1, 16, 17, 2, 3, 18, 19, 4, 5, 20, 21, 6, 7, 22, 23 } }, >> { OPTION_MASK_ALTIVEC, >> - (BYTES_BIG_ENDIAN ? CODE_FOR_altivec_vmrghw_direct >> - : CODE_FOR_altivec_vmrglw_direct), >> + (BYTES_BIG_ENDIAN ? CODE_FOR_altivec_vmrghw_direct_v4si >> + : CODE_FOR_altivec_vmrglw_direct_v4si), > > The correct way is to align the ? and the : (or put everything on one > line of course, if that fits) > > The parens around this are not needed btw, and are a distraction. Changed. > >> --- a/gcc/testsuite/gcc.target/powerpc/builtins-1.c >> +++ b/gcc/testsuite/gcc.target/powerpc/builtins-1.c >> @@ -317,10 +317,10 @@ int main () >> /* { dg-final { scan-assembler-times "vctuxs" 2 } } */ >> >> /* { dg-final { scan-assembler-times "vmrghb" 4 { target be } } } */ >> -/* { dg-final { scan-assembler-times "vmrghb" 5 { target le } } } */ >> +/* { dg-final { scan-assembler-times "vmrghb" 6 { target le } } } */ >> /* { dg-final { scan-assembler-times "vmrghh" 8 } } */ >> -/* { dg-final { scan-assembler-times "xxmrghw" 8 } } */ >> -/* { dg-final { scan-assembler-times "xxmrglw" 8 } } */ >> +/* { dg-final { scan-assembler-times "xxmrghw" 4 } } */ >> +/* { dg-final { scan-assembler-times "xxmrglw" 4 } } */ >> /* { dg-final { scan-assembler-times "vmrglh" 8 } } */ >> /* { dg-final { scan-assembler-times "xxlnor" 6 } } */ >> /* { dg-final { scan-assembler-times {\mvpkudus\M} 1 } } */ >> @@ -347,7 +347,7 @@ int main () >> /* { dg-final { scan-assembler-times "vspltb" 6 } } */ >> /* { dg-final { scan-assembler-times "vspltw" 0 } } */ >> /* { dg-final { scan-assembler-times "vmrgow" 8 } } */ >> -/* { dg-final { scan-assembler-times "vmrglb" 5 { target le } } } */ >> +/* { dg-final { scan-assembler-times "vmrglb" 4 { target le } } } */ >> /* { dg-final { scan-assembler-times "vmrglb" 6 { target be } } } */ >> /* { dg-final { scan-assembler-times "vmrgew" 8 } } */ >> /* { dg-final { scan-assembler-times "vsplth" 8 } } */ > > Are those changes correct? It looks like a vmrglb became a vmrghb, and > that 4 each of xxmrghw and xxmrglw disappeared? Both seem wrong? This case is built with "-mdejagnu-cpu=power8 -O0 -mno-fold-gimple -dp" and it also counted the generated instruction patterns. 1) "vsx_xxmrghw_v4si" is replaced by "altivec_vmrglw_direct_v4si/0", so it decreases from 8 to 4. (Likewise for vsx_xxmrglw_v4si.) li 9,48 # 1282 [c=4 l=4] *movdi_internal64/3 - lxvd2x 0,31,9 # 31 [c=8 l=4] *vsx_lxvd2x4_le_v4si - xxpermdi 0,0,0,2 # 32 [c=4 l=4] xxswapd_v4si - xxmrglw 0,0,12 # 33 [c=4 l=4] vsx_xxmrghw_v4si + lxvd2x 12,31,9 # 31 [c=8 l=4] *vsx_lxvd2x4_le_v4si + xxpermdi 12,12,12,2 # 32 [c=4 l=4] xxswapd_v4si + xxmrglw 0,12,0 # 33 [c=4 l=4] altivec_vmrglw_direct_v4si/0 xxpermdi 0,0,0,2 # 35 [c=4 l=4] xxswapd_v4sf Note that v0 and v12 is swapped in lxvd2x, these new 3 instructions produces same result than before. 2) "*altivec_vmrglb_internal" is replaced by "altivec_vmrghb_direct" with this patch, then vmrglb count decreases from 5 to 4 and vmrghb increases from 5 to 6. (BYTES_BIG_ENDIAN is checked early in RTL generation instead of final to remove the UNSPECs for potential optimization through backend.) li 9,928 # 1424 [c=4 l=4] *movdi_internal64/3 lxvd2x 0,31,9 # 416 [c=8 l=4] *vsx_lxvd2x16_le_V16QI - xxpermdi 33,0,0,2 # 417 [c=4 l=4] xxswapd_v16qi + xxpermdi 32,0,0,2 # 417 [c=4 l=4] xxswapd_v16qi li 9,944 # 1425 [c=4 l=4] *movdi_internal64/3 lxvd2x 0,31,9 # 418 [c=8 l=4] *vsx_lxvd2x16_le_V16QI - xxpermdi 32,0,0,2 # 419 [c=4 l=4] xxswapd_v16qi - vmrghb 0,0,1 # 420 [c=4 l=4] *altivec_vmrglb_internal + xxpermdi 33,0,0,2 # 419 [c=4 l=4] xxswapd_v16qi + vmrghb 0,1,0 # 420 [c=4 l=4] altivec_vmrghb_direct xxpermdi 0,32,32,2 # 421 [c=4 l=4] xxswapd_v16qi Seems not necessary to also use \m and \M here to count only ASM here? Update the patch as attached. -- Thanks, Xionghu