Hi,

On 2021/6/9 07:25, Segher Boessenkool wrote:
> On Mon, May 24, 2021 at 04:02:13AM -0500, Xionghu Luo wrote:
>> vmrghb only accepts permute index {0, 16, 1, 17, 2, 18, 3, 19, 4, 20,
>> 5, 21, 6, 22, 7, 23} no matter for BE or LE in ISA, similarly for vmrghlb.
> 
> (vmrglb)
> 
>> +  if (BYTES_BIG_ENDIAN)
>> +    emit_insn (
>> +      gen_altivec_vmrghb_direct (operands[0], operands[1], operands[2]));
>> +  else
>> +    emit_insn (
>> +      gen_altivec_vmrglb_direct (operands[0], operands[2], operands[1]));
> 
> Please don't indent like that, it doesn't match what we do elsewhere.
> For better or for worse (for worse imo), we use deep hanging indents.
> If you have to, you can do something like
> 
>    rtx insn;
>    if (BYTES_BIG_ENDIAN)
>      insn = gen_altivec_vmrghb_direct (operands[0], operands[1], operands[2]);
>    else
>      insn = gen_altivec_vmrglb_direct (operands[0], operands[2], operands[1]);
>    emit_insn (insn);
> 
> (this is better even, in that it has only one emit_insn), or even
> 
>    rtx (*fun) () = BYTES_BIG_ENDIAN ? gen_altivec_vmrghb_direct
> 				   : gen_altivec_vmrglb_direct;
>    if (!BYTES_BIG_ENDIAN)
>      std::swap (operands[1], operands[2]);
>    emit_insn (fun (operands[0], operands[1], operands[2]));
> 
> Well, C++ does not allow that last example like that, sigh, so
>    rtx (*fun) (rtx, rtx, rtx) = BYTES_BIG_ENDIAN ? gen_altivec_vmrghb_direct
> 						: gen_altivec_vmrglb_direct;
> 
> This is shorter than the other two options ;-)

Changed.

> 
>> +(define_insn "altivec_vmrghb_direct"
>>     [(set (match_operand:V16QI 0 "register_operand" "=v")
>> +    (vec_select:V16QI
> 
> This should be indented one space more.
> 
>>     "TARGET_ALTIVEC"
>>     "@
>> -   xxmrghw %x0,%x1,%x2
>> -   vmrghw %0,%1,%2"
>> +  xxmrghw %x0,%x1,%x2
>> +  vmrghw %0,%1,%2"
> 
> The original indent was correct, please restore.
> 
>> -      emit_insn (gen_altivec_vmrghw_direct (operands[0], ve, vo));
>> +      emit_insn (gen_altivec_vmrghw_direct_v4si (operands[0], ve, vo));
> 
> When you see a mode as part of a pattern name, chances are that it will
> be a good candidate for using parameterized names with.  (But don't do
> that now, just keep it in mind as a nice cleanup to do).

OK.

> 
>> @@ -23022,8 +23022,8 @@ altivec_expand_vec_perm_const (rtx target, rtx op0, rtx op1,
>>          : CODE_FOR_altivec_vmrglh_direct),
>>         {  0,  1, 16, 17,  2,  3, 18, 19,  4,  5, 20, 21,  6,  7, 22, 23 } },
>>       { OPTION_MASK_ALTIVEC,
>> -      (BYTES_BIG_ENDIAN ? CODE_FOR_altivec_vmrghw_direct
>> -       : CODE_FOR_altivec_vmrglw_direct),
>> +      (BYTES_BIG_ENDIAN ? CODE_FOR_altivec_vmrghw_direct_v4si
>> +       : CODE_FOR_altivec_vmrglw_direct_v4si),
> 
> The correct way is to align the ? and the : (or put everything on one
> line of course, if that fits)
> 
> The parens around this are not needed btw, and are a distraction.

Changed.

> 
>> --- a/gcc/testsuite/gcc.target/powerpc/builtins-1.c
>> +++ b/gcc/testsuite/gcc.target/powerpc/builtins-1.c
>> @@ -317,10 +317,10 @@ int main ()
>>   /* { dg-final { scan-assembler-times "vctuxs" 2 } } */
>>   
>>   /* { dg-final { scan-assembler-times "vmrghb" 4 { target be } } } */
>> -/* { dg-final { scan-assembler-times "vmrghb" 5 { target le } } } */
>> +/* { dg-final { scan-assembler-times "vmrghb" 6 { target le } } } */
>>   /* { dg-final { scan-assembler-times "vmrghh" 8 } } */
>> -/* { dg-final { scan-assembler-times "xxmrghw" 8 } } */
>> -/* { dg-final { scan-assembler-times "xxmrglw" 8 } } */
>> +/* { dg-final { scan-assembler-times "xxmrghw" 4 } } */
>> +/* { dg-final { scan-assembler-times "xxmrglw" 4 } } */
>>   /* { dg-final { scan-assembler-times "vmrglh" 8 } } */
>>   /* { dg-final { scan-assembler-times "xxlnor" 6 } } */
>>   /* { dg-final { scan-assembler-times {\mvpkudus\M} 1 } } */
>> @@ -347,7 +347,7 @@ int main ()
>>   /* { dg-final { scan-assembler-times "vspltb" 6 } } */
>>   /* { dg-final { scan-assembler-times "vspltw" 0 } } */
>>   /* { dg-final { scan-assembler-times "vmrgow" 8 } } */
>> -/* { dg-final { scan-assembler-times "vmrglb" 5 { target le } } } */
>> +/* { dg-final { scan-assembler-times "vmrglb" 4 { target le } } } */
>>   /* { dg-final { scan-assembler-times "vmrglb" 6 { target be } } } */
>>   /* { dg-final { scan-assembler-times "vmrgew" 8 } } */
>>   /* { dg-final { scan-assembler-times "vsplth" 8 } } */
> 
> Are those changes correct?  It looks like a vmrglb became a vmrghb, and
> that 4 each of xxmrghw and xxmrglw disappeared?  Both seem wrong?


This case is built with "-mdejagnu-cpu=power8 -O0 -mno-fold-gimple -dp"
and it also counted the generated instruction patterns.

1) "vsx_xxmrghw_v4si" is replaced by "altivec_vmrglw_direct_v4si/0", so 
it decreases from 8 to 4. (Likewise for vsx_xxmrglw_v4si.)

         li 9,48          # 1282 [c=4 l=4]  *movdi_internal64/3
-       lxvd2x 0,31,9    # 31   [c=8 l=4]  *vsx_lxvd2x4_le_v4si
-       xxpermdi 0,0,0,2         # 32   [c=4 l=4]  xxswapd_v4si
-       xxmrglw 0,0,12   # 33   [c=4 l=4]  vsx_xxmrghw_v4si
+       lxvd2x 12,31,9   # 31   [c=8 l=4]  *vsx_lxvd2x4_le_v4si
+       xxpermdi 12,12,12,2      # 32   [c=4 l=4]  xxswapd_v4si
+       xxmrglw 0,12,0   # 33   [c=4 l=4]  altivec_vmrglw_direct_v4si/0
         xxpermdi 0,0,0,2         # 35   [c=4 l=4]  xxswapd_v4sf

Note that v0 and v12 is swapped in lxvd2x, these new 3 instructions
produces same result than before.

2) "*altivec_vmrglb_internal" is replaced by "altivec_vmrghb_direct" 
with this patch, then vmrglb count decreases from 5 to 4 and vmrghb
increases from 5 to 6. (BYTES_BIG_ENDIAN is checked early in RTL 
generation instead of final to remove the UNSPECs for potential 
optimization through backend.)

         li 9,928                 # 1424 [c=4 l=4]  *movdi_internal64/3
         lxvd2x 0,31,9    # 416  [c=8 l=4]  *vsx_lxvd2x16_le_V16QI
-       xxpermdi 33,0,0,2        # 417  [c=4 l=4]  xxswapd_v16qi
+       xxpermdi 32,0,0,2        # 417  [c=4 l=4]  xxswapd_v16qi
         li 9,944                 # 1425 [c=4 l=4]  *movdi_internal64/3
         lxvd2x 0,31,9    # 418  [c=8 l=4]  *vsx_lxvd2x16_le_V16QI
-       xxpermdi 32,0,0,2        # 419  [c=4 l=4]  xxswapd_v16qi
-       vmrghb 0,0,1     # 420  [c=4 l=4]  *altivec_vmrglb_internal
+       xxpermdi 33,0,0,2        # 419  [c=4 l=4]  xxswapd_v16qi
+       vmrghb 0,1,0     # 420  [c=4 l=4]  altivec_vmrghb_direct
         xxpermdi 0,32,32,2       # 421  [c=4 l=4]  xxswapd_v16qi

Seems not necessary to also use \m and \M here to count only ASM here?
Update the patch as attached.

-- 
Thanks,
Xionghu